Methodology
Every score, classification, and ranking on What The Vote is computed from public data using open, reproducible algorithms. No cloud AI, no black boxes. This page explains how each one works.
- Bill Search
- Topic Classification
- Balance Score
- Fiscal Entity Analysis
- Independence Score
- Voting Similarity
- Participation Rate
- Topic Passage Rates
- Data Sources
- Known Limitations
- Update Cadence
Bill Search
When you search for something like "health insurance" on the Discover page, the system finds bills that discuss related concepts — even if they don't contain those exact words. It converts your query and every bill's text into numerical representations, then ranks bills by how closely they match.
The same technique powers the "Similar Bills" section on individual bill pages: given one bill, it finds the five most similar bills in the same congress.
Technical details
Model: scikit-learn TfidfVectorizer fitted on the full bill corpus (~100K bills across multiple congresses). The trained vectorizer and sparse TF-IDF matrix are stored as .joblib and .npz files.
Search algorithm: The query is vectorized using the pre-fitted vectorizer. Cosine similarity is computed via sparse matrix multiplication:
similarities = (tfidf_matrix @ query_vector.T) / ||query_vector||
Results are extracted using np.argpartition() for efficient top-K selection, then sorted by score. Scores range from 0 (no match) to 1 (identical text).
Similar bills: Uses the same approach but filters to the active congress and excludes the source bill. A minimum similarity threshold of 0.05 prevents noise.
Lazy loading: The corpus (84 MB), vectorizer, and TF-IDF matrix (122 MB) are loaded once per server process on first request, then cached in memory.
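The search step above can be sketched in a few lines. This is a minimal illustration, not the production code: the three-bill corpus and the `search` helper are invented for the example, and the real system loads a pre-fitted vectorizer and matrix from disk rather than fitting in-process. Because TfidfVectorizer L2-normalizes rows by default, the sparse matrix product directly yields cosine similarity.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the real ~100K-bill corpus (titles are hypothetical).
corpus = [
    "A bill to expand health insurance coverage for veterans",
    "A bill to fund highway construction and bridge repair",
    "A bill to lower prescription drug and medical care costs",
]

vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)  # sparse, rows L2-normalized

def search(query: str, top_k: int = 2):
    query_vector = vectorizer.transform([query])  # also L2-normalized
    # Cosine similarity via sparse matrix multiplication
    similarities = (tfidf_matrix @ query_vector.T).toarray().ravel()
    # Efficient top-K selection, then sort only the K winners by score
    top = np.argpartition(similarities, -top_k)[-top_k:]
    top = top[np.argsort(similarities[top])[::-1]]
    return [(int(i), float(similarities[i])) for i in top]

results = search("health insurance")  # bill 0 ranks first
```

The `argpartition` call is the key efficiency trick: it finds the top K indices in linear time instead of sorting the full similarity array.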
Topic Classification
Each bill is automatically assigned a policy topic (e.g., "Healthcare," "Defense & Veterans," "Economy & Taxes"). The classifier learns from Congress.gov's own policy_area labels, then predicts topics for bills that lack official labels.
Technical details
Model: scikit-learn LinearSVC (Support Vector Classifier) wrapped in a pipeline with its own TF-IDF vectorization stage. Trained on bills that have official Congress.gov policy_area labels.
Features: TF-IDF vectors of bill title + summary text.
Output: A single predicted topic label per bill (e.g., "Public Lands and Natural Resources").
Model size: 128 MB (topic_clf.joblib). Lazy-loaded only on demand.
Integrity: SHA-256 checksums verify that model files haven't been tampered with between builds.
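The classifier structure described above can be sketched as follows. The four training titles and their labels are invented samples standing in for bills with official policy_area labels; the real model is trained on the full labeled corpus and serialized to topic_clf.joblib.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny labeled sample standing in for bills with official policy_area labels.
titles = [
    "Expand veterans health benefits and hospital access",
    "Medicare prescription drug cost reduction act",
    "Tax credit for small business investment",
    "Capital gains tax reform and deductions",
]
labels = ["Health", "Health", "Taxation", "Taxation"]

# TF-IDF vectorization stage wrapped in a pipeline with the linear SVM
clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(titles, labels)

# Predict a topic for an unlabeled bill title
predicted = clf.predict(["A bill to lower hospital drug prices"])[0]
```

Wrapping the vectorizer in the pipeline means one artifact carries both the vocabulary and the classifier, so a single `predict()` call on raw text is all the serving code needs.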
Balance Score (People vs. Power)
The balance score answers a simple question: does this bill primarily benefit ordinary people or governmental/corporate entities? It scans bill text for mentions of specific groups — veterans, families, small businesses on one side; federal agencies, corporations, defense contractors on the other — and produces a score from −1 (all power) to +1 (all people).
The score is displayed on bill detail pages and aggregated on the Discover page to show trends across congresses and parties.
Technical details
Model: Rule-based regex extraction with weighted entity counting (not machine learning).
Two-layer extraction:
- Layer A (Direct): Regex patterns detect mentions of 28 entity types in the first 8,000 characters of bill text
- Layer B (Contextual): Action-entity patterns capture intent (e.g., "grant credit for [veterans]" receives 1.5x weight)
Entity categories:
- Body entities (22 types): veterans, small businesses, workers, students, seniors, families, consumers, patients, farmers, low-income, minorities, disabled, tribal, rural, immigrants, first responders, homeowners, renters, women, youth, communities, uninsured
- Ruling entities (6 types): members of Congress, federal agencies/departments, executive branch, intelligence agencies, corporations/lobbyists, defense contractors
Scoring formula:
balance = (body_weight - ruling_weight) / (body_weight + ruling_weight)
range: [-1.0, +1.0]
Example: A bill mentioning veterans (2.0) and families (1.5) with defense contractors (1.2) scores (3.5 − 1.2) / (3.5 + 1.2) = +0.49 (moderately pro-people).
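A minimal sketch of the Layer A scoring logic follows. The two pattern tables are an illustrative subset of the 28 entity types, with the weights taken from the worked example above; the real extractor has many more patterns plus the contextual Layer B.

```python
import re

# Illustrative subset of the 28 entity types; weights match the example above.
BODY_PATTERNS = {
    "veterans": (r"\bveterans?\b", 2.0),
    "families": (r"\bfamil(?:y|ies)\b", 1.5),
}
RULING_PATTERNS = {
    "defense contractors": (r"\bdefense contractors?\b", 1.2),
}

def balance_score(text: str) -> float:
    text = text[:8000].lower()  # Layer A scans the first 8,000 characters
    def weight(patterns):
        # Sum the weight of every entity type that appears in the text
        return sum(w for pat, w in patterns.values() if re.search(pat, text))
    body = weight(BODY_PATTERNS)
    ruling = weight(RULING_PATTERNS)
    if body + ruling == 0:
        return 0.0  # no entities mentioned: neutral
    return (body - ruling) / (body + ruling)

score = balance_score(
    "A bill to provide benefits to veterans and military families, "
    "with oversight of defense contractors."
)
# (3.5 - 1.2) / (3.5 + 1.2) ≈ +0.49, matching the worked example
```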
Fiscal Entity Analysis
This analysis finds dollar amounts mentioned in bills (e.g., "$500 million") and identifies who is closest to each dollar figure in the text. It answers questions like "How much funding goes to veterans?" by pairing every monetary reference with the nearest named group within a fixed text window.
Technical details
Model: Rule-based proximity matching (not machine learning).
Dollar detection: Regex captures amounts like $500, $1.2 million, $500M, with scale normalization (billion, million, thousand). Minimum threshold: $1,000.
Entity association: For each dollar amount, the algorithm searches ±200 characters for the nearest entity mention (using the same 28 entity types as balance scoring). The closest entity wins. Deduplication prevents double-counting of (entity, rounded_amount) pairs.
Aggregation: Per-bill (top 5 body + top 5 ruling entities with totals) and per-congress (aggregate dollar totals per entity type).
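The dollar-detection and proximity-matching steps can be sketched as below. The entity list is an illustrative subset of the 28 types, and shorthand forms like $500M are omitted for brevity; this is a simplified model of the approach, not the production extractor.

```python
import re

DOLLAR_RE = re.compile(r"\$(\d+(?:\.\d+)?)\s*(billion|million|thousand)?", re.IGNORECASE)
SCALE = {"billion": 1e9, "million": 1e6, "thousand": 1e3, None: 1}
# Illustrative subset of the 28 entity types
ENTITY_RE = re.compile(r"\b(veterans|small businesses|federal agencies)\b", re.IGNORECASE)

def fiscal_pairs(text, window=200, minimum=1_000):
    pairs = []
    for m in DOLLAR_RE.finditer(text):
        scale = SCALE[m.group(2).lower() if m.group(2) else None]
        amount = float(m.group(1)) * scale
        if amount < minimum:
            continue  # below the $1,000 threshold
        # Search ±window characters around the dollar figure
        lo, hi = max(0, m.start() - window), m.end() + window
        best = None
        for e in ENTITY_RE.finditer(text, lo, hi):
            dist = abs(e.start() - m.start())
            if best is None or dist < best[1]:
                best = (e.group(1).lower(), dist)  # closest entity wins
        if best:
            pairs.append((best[0], amount))
    return pairs

pairs = fiscal_pairs("Authorizes $500 million for veterans housing programs.")
```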
Independence Score
The independence score measures how often a member of Congress votes against their own party's majority position. A higher score means the member breaks from their party more often. It's shown on the Scorecard page and individual member profiles.
Only members with at least 10 party-line votes are scored, to avoid misleading percentages from small samples.
Technical details
Input: Roll-call vote CSVs (one per bill). Each records every member's Yea/Nay/Not Voting.
Algorithm:
- For each roll call, determine the majority position per party (Yea if ≥50% of party members voted Yea, Nay otherwise)
- For each member, tally loyal votes (matched party majority) and defections (opposed party majority)
- Independence rate =
defections / (loyal + defections) × 100
Threshold: Minimum 10 party-line votes required.
Output: Top 25 most independent and top 25 most loyal members, ranked by defection percentage.
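The algorithm above can be sketched as follows. The input shape (one dict per roll call mapping member to party and vote, absences already dropped) is an assumption made for the example, not the actual CSV format.

```python
from collections import defaultdict

def independence_rates(rollcalls, min_votes=10):
    """rollcalls: list of {member: (party, "Yea"/"Nay")} dicts, absences dropped."""
    loyal, defect = defaultdict(int), defaultdict(int)
    for rc in rollcalls:
        # Majority position per party: Yea if >=50% of that party voted Yea
        party_votes = defaultdict(list)
        for member, (party, vote) in rc.items():
            party_votes[party].append(vote)
        majority = {p: "Yea" if 2 * sum(v == "Yea" for v in vs) >= len(vs) else "Nay"
                    for p, vs in party_votes.items()}
        # Tally loyal votes and defections per member
        for member, (party, vote) in rc.items():
            if vote == majority[party]:
                loyal[member] += 1
            else:
                defect[member] += 1
    # Independence rate, filtered by the minimum-votes threshold
    return {m: 100 * defect[m] / (loyal[m] + defect[m])
            for m in loyal.keys() | defect.keys()
            if loyal[m] + defect[m] >= min_votes}

# Synthetic example: member A defects on 2 of 10 party-line votes
rollcalls = [{"A": ("D", "Nay" if i < 2 else "Yea"),
              "B": ("D", "Yea"),
              "C": ("D", "Yea")} for i in range(10)]
rates = independence_rates(rollcalls)  # A: 20.0, B and C: 0.0
```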
Voting Similarity
Voting similarity shows which pairs of members vote the same way most often. The most interesting pairs are cross-party — a Democrat and Republican who agree on 80% of votes tells you something about the political center. Results appear on the Scorecard page.
Technical details
Algorithm (fully vectorized):
- Build a vote matrix: members × bills, values +1 (Yea), −1 (Nay), 0 (absent)
- Compute shared vote count: shared = voted @ voted.T (matrix multiplication counts bills both members voted on)
- Compute agreement count: agree = yea @ yea.T + nay @ nay.T (counts votes in the same direction)
- Compute agreement percentage: 100 × agree[i,j] / shared[i,j]
Filters: Only pairs with ≥10 shared votes. Results split into cross-party pairs (top 15) and same-party pairs (top 10).
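The vectorized computation above can be sketched with NumPy on a toy 3-member × 4-bill matrix (the matrix values are invented for illustration):

```python
import numpy as np

# Vote matrix: members × bills; +1 Yea, -1 Nay, 0 absent
votes = np.array([
    [ 1, -1,  1,  0],   # member 0
    [ 1, -1, -1,  1],   # member 1
    [-1,  1,  1,  0],   # member 2
])

voted = (votes != 0).astype(int)
yea = (votes == 1).astype(int)
nay = (votes == -1).astype(int)

shared = voted @ voted.T           # bills both members voted on
agree = yea @ yea.T + nay @ nay.T  # bills where both voted the same way

# Agreement percentage, guarding against pairs with zero shared votes
with np.errstate(invalid="ignore", divide="ignore"):
    agreement_pct = np.where(shared > 0, 100 * agree / shared, 0.0)
```

Members 0 and 1 share three votes and agree on two of them, so `agreement_pct[0, 1]` is about 66.7. All pairwise counts come out of two matrix multiplications, with no Python-level loop over member pairs.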
Participation Rate
A straightforward measure: what percentage of available roll-call votes did a member actually vote on? Missing votes can mean anything from illness to strategic abstention — we present the number without editorializing.
Technical details
Formula:
participation = (yea_count + nay_count) / total_rollcalls_in_chamber × 100
Computed per chamber (House and Senate have different roll-call totals). Shown on the Scorecard as top 25 most active and bottom 25 least active.
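As a one-function sketch of the formula (the counts are invented sample numbers):

```python
def participation_rate(yea_count: int, nay_count: int, total_rollcalls: int) -> float:
    # Yea and Nay both count as participating; "Not Voting" does not
    return 100 * (yea_count + nay_count) / total_rollcalls

# Hypothetical member: 512 votes cast out of 640 chamber roll calls
rate = participation_rate(yea_count=410, nay_count=102, total_rollcalls=640)
```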
Topic Passage Rates
How often do bills on each policy topic actually become law? This analysis tracks passage rates by topic across congresses — for example, do defense bills pass more often than environmental bills? Shown on the Discover page.
Technical details
Model: Deterministic aggregation (not machine learning).
Law detection: A bill is counted as "became law" if its latest_status field contains any of: Became, Signed, Public Law, or Enacted (case-insensitive).
Grouping: Bills grouped by (congress, policy_area). Passage rate = laws / total bills per group.
Output: Per-congress and all-time passage rates per topic, used for trend visualization.
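The law-detection and grouping steps can be sketched as follows; the three sample bills are invented, but the status markers and the (congress, policy_area) grouping mirror the description above.

```python
from collections import Counter

LAW_MARKERS = ("became", "signed", "public law", "enacted")

def became_law(latest_status: str) -> bool:
    # Case-insensitive substring match against the known law markers
    s = latest_status.lower()
    return any(marker in s for marker in LAW_MARKERS)

def passage_rates(bills):
    """bills: iterable of (congress, policy_area, latest_status) tuples."""
    totals, laws = Counter(), Counter()
    for congress, topic, status in bills:
        key = (congress, topic)
        totals[key] += 1
        if became_law(status):
            laws[key] += 1
    return {key: laws[key] / totals[key] for key in totals}

rates = passage_rates([
    (118, "Health", "Became Public Law 118-42"),
    (118, "Health", "Referred to Committee"),
    (118, "Defense", "Signed by President"),
])
# Health: 1 of 2 became law; Defense: 1 of 1
```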
Data Sources
All data on What The Vote comes from official, publicly available government sources, including Congress.gov (bill text, status, and policy_area labels; roll-call votes; member rosters) and the U.S. Census Bureau's American Community Survey (ACS).
Known Limitations
We believe in showing our work — including where it falls short:
- Balance scores are approximations. The algorithm uses keyword matching, not deep reading comprehension. A bill "helping corporations" might actually help small businesses. Context is lost.
- TF-IDF search misses synonyms. Searching "healthcare" may not surface bills that only use "medical care" or "wellness." The model matches words, not concepts.
- Voting similarity measures alignment, not ideology. Two members who agree on 90% of votes may have very different reasons for those votes.
- Text windows are limited. Balance and fiscal analysis use the first 8,000 characters of bill text. Long amendments and riders later in the bill may be missed.
- Bill data may lag. Updates run multiple times daily but can trail official sources by hours.
- Topic labels depend on Congress.gov. Only ~24% of bills have official policy_area labels. The classifier fills the gap, but imperfectly.
- No LLM involved. All models are classical machine learning (scikit-learn). This means they're fast and reproducible, but lack the nuance of large language models.
Update Cadence
- Bill data: Updated three times daily (incremental) with a full rebuild weekly
- Member rosters: Updated daily via automated pipeline
- ML models: Retrained weekly or when the corpus grows significantly
- Balance & fiscal scores: Recomputed incrementally with each bill data update
- Census data: Updated when new ACS vintages are released (annually)
Questions about our methodology? Get in touch. All code is available for review upon request.