Methodology
Every score, classification, and ranking on What The Vote is computed from public data using open, reproducible algorithms. No cloud AI, no black boxes. This page explains how each one works.
- Bill Search
- Topic Classification
- Balance Score
- Fiscal Entity Analysis
- Independence Score
- Voting Similarity
- Participation Rate
- Topic Passage Rates
- Data Sources
- Known Limitations
- WTF Score
- Update Cadence
Bill Search
When you search for something like "health insurance" on the Discover page, the system finds bills that discuss related concepts — even if they don't contain those exact words. It converts your query and every bill's text into numerical representations, then ranks bills by how closely they match.
The same technique powers the "Similar Bills" section on individual bill pages: given one bill, it finds the five most similar bills in the same congress.
Technical details
Model: scikit-learn TfidfVectorizer fitted on the full bill corpus (~100K bills across multiple congresses). The trained vectorizer and sparse TF-IDF matrix are stored as .joblib and .npz files.
Search algorithm: The query is vectorized using the pre-fitted vectorizer. Cosine similarity is computed via sparse matrix multiplication:
similarities = (tfidf_matrix @ query_vector.T) / ||query_vector||
The document rows of the TF-IDF matrix are L2-normalized at fit time (scikit-learn's default), so dividing by the query norm alone is enough to produce true cosine similarities.
Results are extracted using np.argpartition() for efficient top-K selection, then sorted by score. Scores range from 0 (no match) to 1 (identical text).
Similar bills: Uses the same approach but filters to the active congress and excludes the source bill. A minimum similarity threshold of 0.05 prevents noise.
Lazy loading: The corpus (84 MB), vectorizer, and TF-IDF matrix (122 MB) are loaded once per server process on first request, then cached in memory.
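The search step can be sketched with scikit-learn on a toy corpus. The corpus, query, and helper name below are illustrative stand-ins, not the production code:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the real ~100K-bill corpus.
corpus = [
    "A bill to expand health insurance coverage for veterans",
    "A bill to fund highway infrastructure repair",
    "A bill to lower prescription drug and medical care costs",
    "A bill on agricultural subsidies for family farms",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # sparse, rows L2-normalized

def search(query: str, k: int = 2):
    query_vec = vectorizer.transform([query])  # also L2-normalized
    # Rows are unit-length, so the sparse dot product is cosine similarity.
    similarities = (tfidf_matrix @ query_vec.T).toarray().ravel()
    # argpartition finds the top-k indices without a full sort...
    top_k = np.argpartition(similarities, -k)[-k:]
    # ...then only those k are sorted by score, descending.
    top_k = top_k[np.argsort(similarities[top_k])[::-1]]
    return [(int(i), float(similarities[i])) for i in top_k]

results = search("health insurance")
```

The same routine, with a congress filter and a 0.05 score floor, is the shape of the "Similar Bills" lookup.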
Topic Classification
Each bill is automatically assigned a policy topic (e.g., "Healthcare," "Defense & Veterans," "Economy & Taxes"). The classifier learns from Congress.gov's own policy_area labels, then predicts topics for bills that lack official labels.
Technical details
Model: scikit-learn LinearSVC (Support Vector Classifier) wrapped in a pipeline with its own TF-IDF vectorization stage. Trained on bills that have official Congress.gov policy_area labels.
Features: TF-IDF vectors of bill title + summary text.
Output: A single predicted topic label per bill (e.g., "Public Lands and Natural Resources").
Model size: 128 MB (topic_clf.joblib). Lazy-loaded only on demand.
Integrity: SHA-256 checksums verify that model files haven't been tampered with between builds.
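A minimal sketch of the same pipeline shape, trained on a handful of invented titles and labels (the real model trains on the full labeled corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy training set standing in for bills with official policy_area labels.
titles = [
    "Expand Medicare coverage for rural hospitals",
    "Prescription drug price negotiation act",
    "Increase funding for Navy shipbuilding",
    "Veterans benefits and armed forces readiness act",
    "Small business tax credit extension",
    "Corporate income tax reform act",
]
labels = [
    "Health", "Health",
    "Armed Forces and National Security", "Armed Forces and National Security",
    "Taxation", "Taxation",
]

# Same shape as the production model: TF-IDF features feeding a linear SVM.
topic_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC()),
])
topic_clf.fit(titles, labels)

# Predict a topic for a title without an official label.
predicted = topic_clf.predict(["Medicare prescription drug coverage"])[0]
```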
Balance Score (People vs. Power)
The balance score answers a simple question: does this bill primarily benefit ordinary people or governmental/corporate entities? It scans bill text for mentions of specific groups — veterans, families, small businesses on one side; federal agencies, corporations, defense contractors on the other — and produces a score from −1 (all power) to +1 (all people).
The score is displayed on bill detail pages and aggregated on the Discover page to show trends across congresses and parties.
Technical details
Model: Rule-based regex extraction with weighted entity counting (not machine learning).
Two-layer extraction:
- Layer A (Direct): Regex patterns detect mentions of 28 entity types in the first 8,000 characters of bill text
- Layer B (Contextual): Action-entity patterns capture intent (e.g., "grant credit for [veterans]" receives 1.5x weight)
Entity categories:
- Body entities (22 types): veterans, small businesses, workers, students, seniors, families, consumers, patients, farmers, low-income, minorities, disabled, tribal, rural, immigrants, first responders, homeowners, renters, women, youth, communities, uninsured
- Ruling entities (6 types): members of Congress, federal agencies/departments, executive branch, intelligence agencies, corporations/lobbyists, defense contractors
Scoring formula:
balance = (body_weight - ruling_weight) / (body_weight + ruling_weight)
range: [-1.0, +1.0]
Example: A bill mentioning veterans (2.0) and families (1.5) with defense contractors (1.2) scores (3.5 − 1.2) / (3.5 + 1.2) = +0.49 (moderately pro-people).
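A minimal sketch of Layer A, using an assumed subset of the patterns and weights (the real system has 28 entity types plus a contextual layer omitted here; each type counts once if mentioned, for simplicity):

```python
import re

# Illustrative subset of entity patterns and weights (assumptions, not the
# production pattern set).
BODY_PATTERNS = {
    "veterans": (r"\bveterans?\b", 2.0),
    "families": (r"\bfamil(?:y|ies)\b", 1.5),
    "small_businesses": (r"\bsmall business(?:es)?\b", 1.5),
}
RULING_PATTERNS = {
    "defense_contractors": (r"\bdefense contractors?\b", 1.2),
    "federal_agencies": (r"\bfederal agenc(?:y|ies)\b", 1.0),
}

def balance_score(text: str) -> float:
    window = text[:8000].lower()  # Layer A scans the first 8,000 characters

    def total_weight(patterns):
        # Each entity type contributes its weight once if mentioned at all.
        return sum(w for pat, w in patterns.values() if re.search(pat, window))

    body = total_weight(BODY_PATTERNS)
    ruling = total_weight(RULING_PATTERNS)
    if body + ruling == 0:
        return 0.0  # no entities found: neutral
    return (body - ruling) / (body + ruling)

score = balance_score(
    "A bill to grant housing credits to veterans and families, "
    "with procurement set-asides for defense contractors."
)
```

This reproduces the worked example above: (2.0 + 1.5 − 1.2) / (2.0 + 1.5 + 1.2) ≈ +0.49.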
Fiscal Entity Analysis
This analysis finds dollar amounts mentioned in bills (e.g., "$500 million") and identifies who is closest to each dollar figure in the text. It answers questions like "How much funding goes to veterans?" by pairing every monetary reference with the nearest named group within a fixed text window.
Technical details
Model: Rule-based proximity matching (not machine learning).
Dollar detection: Regex captures amounts like $500, $1.2 million, $500M, with scale normalization (billion, million, thousand). Minimum threshold: $1,000.
Entity association: For each dollar amount, the algorithm searches ±200 characters for the nearest entity mention (using the same 28 entity types as balance scoring). The closest entity wins. Deduplication prevents double-counting of (entity, rounded_amount) pairs.
Aggregation: Per-bill (top 5 body + top 5 ruling entities with totals) and per-congress (aggregate dollar totals per entity type).
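A sketch of the proximity matching under stated assumptions (entity list trimmed to three types; the deduplication and aggregation steps are omitted):

```python
import re

SCALES = {"thousand": 1e3, "million": 1e6, "billion": 1e9,
          "k": 1e3, "m": 1e6, "b": 1e9}
DOLLAR_RE = re.compile(
    r"\$(\d[\d,]*(?:\.\d+)?)\s*(billion|million|thousand|[bmk])?\b",
    re.IGNORECASE,
)
# Illustrative entity pattern (the real system reuses all 28 entity types).
ENTITY_RE = re.compile(r"\b(veterans|small businesses|federal agencies)\b",
                       re.IGNORECASE)

def fiscal_pairs(text, window=200, min_amount=1_000):
    pairs = []
    for m in DOLLAR_RE.finditer(text):
        amount = float(m.group(1).replace(",", ""))
        scale = (m.group(2) or "").lower()
        amount *= SCALES.get(scale, 1)  # normalize "billion", "M", etc.
        if amount < min_amount:
            continue  # ignore trivial figures below the $1,000 floor
        # Search +/- window characters around the dollar mention; the
        # closest entity wins.
        lo, hi = max(0, m.start() - window), m.end() + window
        best, best_dist = None, None
        for e in ENTITY_RE.finditer(text, lo, hi):
            dist = abs(e.start() - m.start())
            if best_dist is None or dist < best_dist:
                best, best_dist = e.group(1).lower(), dist
        if best:
            pairs.append((best, amount))
    return pairs

pairs = fiscal_pairs(
    "Appropriates $1.2 billion for veterans and directs federal agencies "
    "to award $500 million in grants."
)
```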
Independence Score
The independence score measures how often a member of Congress votes against their own party's majority position. A higher score means the member breaks from their party more often. It's shown on the Scorecard page and individual member profiles.
Only members with at least 10 party-line votes are scored, to avoid misleading percentages from small samples.
Technical details
Input: Roll-call vote CSVs (one per bill). Each records every member's Yea/Nay/Not Voting.
Algorithm:
- For each roll call, determine the majority position per party (Yea if ≥50% of party members voted Yea, Nay otherwise)
- For each member, tally loyal votes (matched party majority) and defections (opposed party majority)
- Independence rate =
defections / (loyal + defections) × 100
Threshold: Minimum 10 party-line votes required.
Output: Top 25 most independent and top 25 most loyal members, ranked by defection percentage.
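The tallying above can be sketched on toy roll-call data (the 10-vote minimum and Not Voting handling are simplified away):

```python
from collections import defaultdict

# Toy roll calls: lists of (member, party, vote) tuples, standing in for
# the per-bill CSVs.
rollcalls = [
    [("A", "D", "Yea"), ("B", "D", "Yea"), ("C", "D", "Nay"),
     ("X", "R", "Nay"), ("Y", "R", "Nay")],
    [("A", "D", "Yea"), ("B", "D", "Nay"), ("C", "D", "Yea"),
     ("X", "R", "Yea"), ("Y", "R", "Nay")],
]

loyal = defaultdict(int)
defect = defaultdict(int)

for rc in rollcalls:
    # Majority position per party: Yea if >= 50% of that party voted Yea.
    votes_by_party = defaultdict(list)
    for member, party, vote in rc:
        if vote in ("Yea", "Nay"):  # "Not Voting" is excluded
            votes_by_party[party].append((member, vote))
    for party, votes in votes_by_party.items():
        yeas = sum(1 for _, v in votes if v == "Yea")
        majority = "Yea" if yeas >= len(votes) / 2 else "Nay"
        for member, vote in votes:
            if vote == majority:
                loyal[member] += 1
            else:
                defect[member] += 1

def independence_rate(member):
    total = loyal[member] + defect[member]
    return 100 * defect[member] / total if total else 0.0
```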
Voting Similarity
Voting similarity shows which pairs of members vote the same way most often. The most interesting pairs are cross-party — a Democrat and Republican who agree on 80% of votes tells you something about the political center. Results appear on the Scorecard page.
Technical details
Algorithm (fully vectorized):
- Build a vote matrix: members × bills, values +1 (Yea), −1 (Nay), 0 (absent)
- Compute shared vote count: shared = voted @ voted.T (matrix multiplication counts bills both members voted on)
- Compute agreement count: agree = yea @ yea.T + nay @ nay.T (counts votes in the same direction)
- Agreement percentage: 100 × agree[i,j] / shared[i,j]
Filters: Only pairs with ≥10 shared votes. Results split into cross-party pairs (top 15) and same-party pairs (top 10).
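The vectorized computation might look like this in NumPy (toy 3×4 matrix; the ≥10 shared-vote filter is omitted):

```python
import numpy as np

# Vote matrix: members x bills, +1 Yea, -1 Nay, 0 absent.
votes = np.array([
    [ 1,  1, -1,  1],   # member 0
    [ 1, -1, -1,  0],   # member 1
    [-1,  1,  1,  1],   # member 2
])

voted = (votes != 0).astype(int)   # 1 where the member cast a vote
yea = (votes == 1).astype(int)
nay = (votes == -1).astype(int)

shared = voted @ voted.T           # bills both members voted on
agree = yea @ yea.T + nay @ nay.T  # votes cast in the same direction

# Agreement percentage wherever a pair shares at least one vote.
with np.errstate(invalid="ignore", divide="ignore"):
    agreement_pct = np.where(shared > 0, 100 * agree / shared, 0.0)
```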
Participation Rate
A straightforward measure: what percentage of available roll-call votes did a member actually vote on? Missing votes can mean anything from illness to strategic abstention — we present the number without editorializing.
Technical details
Formula:
participation = (yea_count + nay_count) / total_rollcalls_in_chamber × 100
Computed per chamber (House and Senate have different roll-call totals). Shown on the Scorecard as top 25 most active and bottom 25 least active.
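A minimal sketch of the formula, with hypothetical vote counts:

```python
def participation_rate(yea_count, nay_count, total_rollcalls):
    # Present votes (Yea + Nay) as a share of all roll calls in the chamber.
    return (yea_count + nay_count) / total_rollcalls * 100

# Hypothetical member: 412 Yea, 63 Nay, out of 500 chamber roll calls.
rate = participation_rate(yea_count=412, nay_count=63, total_rollcalls=500)
```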
Topic Passage Rates
How often do bills on each policy topic actually become law? This analysis tracks passage rates by topic across congresses — for example, do defense bills pass more often than environmental bills? Shown on the Discover page.
Technical details
Model: Deterministic aggregation (not machine learning).
Law detection: A bill is counted as "became law" if its latest_status field contains any of: Became, Signed, Public Law, or Enacted (case-insensitive).
Grouping: Bills grouped by (congress, policy_area). Passage rate = laws / total bills per group.
Output: Per-congress and all-time passage rates per topic, used for trend visualization.
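A sketch of the aggregation on toy records (statuses invented to exercise the law-detection markers):

```python
from collections import defaultdict

# Toy bill records standing in for the real dataset.
bills = [
    {"congress": 118, "policy_area": "Health",
     "latest_status": "Became Public Law 118-42"},
    {"congress": 118, "policy_area": "Health",
     "latest_status": "Referred to Committee"},
    {"congress": 118, "policy_area": "Armed Forces",
     "latest_status": "Signed by President"},
    {"congress": 118, "policy_area": "Armed Forces",
     "latest_status": "Passed House"},
]

LAW_MARKERS = ("became", "signed", "public law", "enacted")

def became_law(status):
    # Case-insensitive substring check, mirroring the rule above.
    s = (status or "").lower()
    return any(marker in s for marker in LAW_MARKERS)

totals = defaultdict(lambda: [0, 0])  # (congress, topic) -> [laws, total]
for b in bills:
    key = (b["congress"], b["policy_area"])
    totals[key][1] += 1
    if became_law(b["latest_status"]):
        totals[key][0] += 1

passage_rates = {key: laws / total for key, (laws, total) in totals.items()}
```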
Data Sources
All data on What The Vote comes from official, publicly available government sources:
- Congress.gov — bill text, summaries, sponsorship, official policy_area labels, and CRS bill subjects
- Congress.gov roll-call records — member-level Yea/Nay/Not Voting data for each chamber
- U.S. Census Bureau — American Community Survey (ACS) 5-year estimates at the congressional district level
Known Limitations
We believe in showing our work — including where it falls short:
- Balance scores are approximations. The algorithm uses keyword matching, not deep reading comprehension. A bill "helping corporations" might actually help small businesses. Context is lost.
- TF-IDF search misses synonyms. Searching "healthcare" may not surface bills that only use "medical care" or "wellness." The model matches words, not concepts.
- Voting similarity measures alignment, not ideology. Two members who agree on 90% of votes may have very different reasons for those votes.
- Text windows are limited. Balance and fiscal analysis use the first 8,000 characters of bill text. Long amendments and riders later in the bill may be missed.
- Bill data may lag. Updates run multiple times daily but can trail official sources by hours.
- Topic labels depend on Congress.gov. Only ~24% of bills have official policy_area labels. The classifier fills the gap, but imperfectly.
- No LLM involved. All models are classical machine learning (scikit-learn). This means they're fast and reproducible, but lack the nuance of large language models.
WTF Score (What The Facts)
The WTF Score measures the gap between a representative's voting record and their district's measurable needs. It is not an editorial judgment — it is a mathematical output of public data. The identical formula is applied to every member of Congress regardless of party.
The score ranges from 0 to 100. A higher score indicates a larger gap between voting behavior and district alignment. The score is composed of five weighted pillars:
| Pillar | Weight | What It Measures |
|---|---|---|
| District Disconnect | 30% | Votes on bills whose subject areas don't match the district's economic profile (industry employment from Census ACS) |
| Institutional Tilt | 25% | Voting YES on bills where spending favors institutional recipients over individual beneficiaries (from Balance Score analysis) |
| Sponsor Contradiction | 20% | Sponsoring bills in one policy area while voting against similar legislation in the same area |
| Ghost Voting | 15% | Absent ("Not Voting") on bills whose subject areas are relevant to the district's economy |
| Anomaly Complicity | 10% | Voting YES on bills flagged for procedural anomalies (gut-and-replace rewrites, stealth fast-tracking, etc.) by The Honest Copy's detectors |
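Assuming each pillar is itself scaled 0–100, the composite is a weighted sum with the weights from the table above; the pillar values below are invented for illustration:

```python
# Weights from the pillar table; pillar scores are hypothetical examples.
PILLAR_WEIGHTS = {
    "district_disconnect": 0.30,
    "institutional_tilt": 0.25,
    "sponsor_contradiction": 0.20,
    "ghost_voting": 0.15,
    "anomaly_complicity": 0.10,
}

def wtf_score(pillars: dict) -> float:
    # Weighted sum; each pillar is assumed to already be on a 0-100 scale.
    return sum(PILLAR_WEIGHTS[name] * pillars[name] for name in PILLAR_WEIGHTS)

score = wtf_score({
    "district_disconnect": 70,
    "institutional_tilt": 40,
    "sponsor_contradiction": 55,
    "ghost_voting": 20,
    "anomaly_complicity": 10,
})
```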
Data sources: Roll-call votes from congress.gov, CRS bill subjects, Census Bureau ACS 5-year estimates at the congressional district level, and fiscal entity analysis from the WTVote ML pipeline.
Non-partisan by design: The formula uses no party labels, ideological scores, or editorial input. A member's score depends only on the alignment between their votes and their district's demographics. Both parties are scored identically.
Score interpretation:
- 0–30: Voting record broadly aligns with district profile
- 31–60: Notable gaps between voting behavior and district needs
- 61–100: Significant disconnect between votes and district
What it is not: The WTF Score does not measure whether a legislator is "good" or "bad." Members of Congress serve both their district and the nation. A high score means only that the gap between voting patterns and local demographics is measurable — voters can decide what to do with that information.
Update Cadence
- Bill data: Updated three times daily (incremental) with a full rebuild weekly
- Member rosters: Updated daily via automated pipeline
- ML models: Retrained weekly or when the corpus grows significantly
- Balance & fiscal scores: Recomputed incrementally with each bill data update
- Census data: Updated when new ACS vintages are released (annually)
Questions about our methodology? Get in touch. All code is available for review upon request.