Methodology

Every score, classification, and ranking on What The Vote is computed from public data using open, reproducible algorithms. No cloud AI, no black boxes. This page explains how each one works.

Topic Classification

Each bill is automatically assigned a policy topic (e.g., "Healthcare," "Defense & Veterans," "Economy & Taxes"). The classifier learns from Congress.gov's own policy_area labels, then predicts topics for bills that lack official labels.

Technical details

Model: scikit-learn LinearSVC (linear Support Vector Classifier) wrapped in a pipeline with a TF-IDF vectorization stage. Trained on bills that have official Congress.gov policy_area labels.

Features: TF-IDF vectors of bill title + summary text.

Output: A single predicted topic label per bill (e.g., "Public Lands and Natural Resources").

Model size: 128 MB (topic_clf.joblib). Lazy-loaded on first use.

Integrity: SHA-256 checksums verify that model files haven't been tampered with between builds.
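
The pipeline described above can be sketched as follows. The bill texts, labels, and parameters here are illustrative stand-ins, not the site's actual training data or configuration:

```python
# Sketch of a TF-IDF + LinearSVC topic classifier, mirroring the
# pipeline described above. Training examples are toy stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus: bill title + summary, with official policy_area labels.
train_texts = [
    "Veterans Health Care Improvement Act expands VA medical benefits",
    "Small Business Tax Relief Act reduces rates for small employers",
    "National Forest Protection Act designates wilderness areas",
    "Armed Forces Readiness Act authorizes defense appropriations",
]
train_labels = [
    "Health",
    "Taxation",
    "Public Lands and Natural Resources",
    "Armed Forces and National Security",
]

topic_clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 2))),
    ("svm", LinearSVC()),
])
topic_clf.fit(train_texts, train_labels)

# Predict a topic for a bill that lacks an official label.
pred = topic_clf.predict(["A bill to expand hospital care for rural patients"])[0]
```

In production such a fitted pipeline would be persisted with joblib (as topic_clf.joblib) and loaded lazily.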

Balance Score (People vs. Power)

The balance score answers a simple question: does this bill primarily benefit ordinary people or governmental/corporate entities? It scans bill text for mentions of specific groups — veterans, families, small businesses on one side; federal agencies, corporations, defense contractors on the other — and produces a score from −1 (all power) to +1 (all people).

The score is displayed on bill detail pages and aggregated on the Discover page to show trends across congresses and parties.

Technical details

Model: Rule-based regex extraction with weighted entity counting (not machine learning).

Two-layer extraction:

  • Layer A (Direct): Regex patterns detect mentions of 28 entity types in the first 8,000 characters of bill text
  • Layer B (Contextual): Action-entity patterns capture intent (e.g., "grant credit for [veterans]" receives 1.5x weight)

Entity categories:

  • Body entities (22 types): veterans, small businesses, workers, students, seniors, families, consumers, patients, farmers, low-income, minorities, disabled, tribal, rural, immigrants, first responders, homeowners, renters, women, youth, communities, uninsured
  • Ruling entities (6 types): members of Congress, federal agencies/departments, executive branch, intelligence agencies, corporations/lobbyists, defense contractors

Scoring formula:

balance = (body_weight - ruling_weight) / (body_weight + ruling_weight)
range: [-1.0, +1.0]

Example: A bill mentioning veterans (2.0) and families (1.5) with defense contractors (1.2) scores (3.5 − 1.2) / (3.5 + 1.2) = +0.49 (moderately pro-people).
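
The worked example above can be reproduced with a minimal sketch. Entity detection is simplified here to a weights dict; the real extractor derives weights from the two-layer regex matching:

```python
# Minimal sketch of the balance-score formula. Entity extraction is
# reduced to a precomputed weights dict for illustration.
def balance_score(entity_weights, body_entities, ruling_entities):
    body = sum(w for e, w in entity_weights.items() if e in body_entities)
    ruling = sum(w for e, w in entity_weights.items() if e in ruling_entities)
    if body + ruling == 0:
        return 0.0  # no entity mentions: treat as neutral
    return (body - ruling) / (body + ruling)

# Weights from the example: veterans 2.0, families 1.5, contractors 1.2.
weights = {"veterans": 2.0, "families": 1.5, "defense contractors": 1.2}
score = balance_score(
    weights,
    body_entities={"veterans", "families"},
    ruling_entities={"defense contractors"},
)
# round(score, 2) == 0.49
```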

Fiscal Entity Analysis

This analysis finds dollar amounts mentioned in bills (e.g., "$500 million") and identifies who is closest to each dollar figure in the text. It answers questions like "How much funding goes to veterans?" by pairing every monetary reference with the nearest named group within a fixed text window.

Technical details

Model: Rule-based proximity matching (not machine learning).

Dollar detection: Regex captures amounts like $500, $1.2 million, $500M, with scale normalization (billion, million, thousand). Minimum threshold: $1,000.

Entity association: For each dollar amount, the algorithm searches ±200 characters for the nearest entity mention (using the same 28 entity types as balance scoring). The closest entity wins. Deduplication prevents double-counting of (entity, rounded_amount) pairs.

Aggregation: Per-bill (top 5 body + top 5 ruling entities with totals) and per-congress (aggregate dollar totals per entity type).
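
The detection and association steps can be sketched like this. The entity list is abbreviated to three types (the real extractor covers all 28), and the deduplication step is omitted for brevity:

```python
import re

# Sketch of dollar detection with scale normalization, plus
# nearest-entity association within a +/-200 character window.
SCALES = {"billion": 1e9, "million": 1e6, "thousand": 1e3}
DOLLAR_RE = re.compile(r"\$([\d,]+(?:\.\d+)?)\s*(billion|million|thousand)?", re.I)
ENTITY_RE = re.compile(r"\b(veterans|small businesses|federal agencies)\b", re.I)

def fiscal_pairs(text, window=200, minimum=1_000):
    pairs = []
    for m in DOLLAR_RE.finditer(text):
        scale = SCALES.get((m.group(2) or "").lower(), 1)
        amount = float(m.group(1).replace(",", "")) * scale
        if amount < minimum:
            continue  # below the $1,000 threshold
        # Search the window around the dollar figure; closest entity wins.
        lo, hi = max(0, m.start() - window), m.end() + window
        best = min(
            ENTITY_RE.finditer(text, lo, hi),
            key=lambda e: abs(e.start() - m.start()),
            default=None,
        )
        if best:
            pairs.append((best.group(1).lower(), amount))
    return pairs

text = ("Provides veterans with $500 million and allocates "
        "$2.5 billion to federal agencies.")
# fiscal_pairs(text) -> [("veterans", 500000000.0),
#                        ("federal agencies", 2500000000.0)]
```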

Independence Score

The independence score measures how often a member of Congress votes against their own party's majority position. A higher score means the member breaks from their party more often. It's shown on the Scorecard page and individual member profiles.

Only members with at least 10 party-line votes are scored, to avoid misleading percentages from small samples.

Technical details

Input: Roll-call vote CSVs (one per bill). Each records every member's Yea/Nay/Not Voting.

Algorithm:

  1. For each roll call, determine the majority position per party (Yea if ≥50% of party members voted Yea, Nay otherwise)
  2. For each member, tally loyal votes (matched party majority) and defections (opposed party majority)
  3. Independence rate = defections / (loyal + defections) × 100

Threshold: Minimum 10 party-line votes required.

Output: Top 25 most independent and top 25 most loyal members, ranked by defection percentage.
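
The three steps above can be sketched as follows, assuming each member's roll calls have been reduced to (party_majority, member_vote) pairs; the data shapes are illustrative, not the site's actual schema:

```python
# Step 1: majority position per party for one roll call
# (Yea if >= 50% of the party's members voted Yea).
def party_majority(party_votes):
    yeas = sum(1 for v in party_votes if v == "Yea")
    return "Yea" if yeas >= len(party_votes) / 2 else "Nay"

# Steps 2-3: tally loyal votes and defections, then the rate.
def independence_rate(votes, min_votes=10):
    loyal = sum(1 for majority, vote in votes if vote == majority)
    defections = sum(1 for majority, vote in votes if vote != majority)
    if loyal + defections < min_votes:
        return None  # too few party-line votes to score fairly
    return defections / (loyal + defections) * 100

votes = [("Yea", "Yea")] * 9 + [("Yea", "Nay")] * 3  # 9 loyal, 3 defections
# independence_rate(votes) == 25.0
```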

Voting Similarity

Voting similarity shows which pairs of members vote the same way most often. The most interesting pairs are cross-party — a Democrat and a Republican who agree on 80% of votes tell you something about the political center. Results appear on the Scorecard page.

Technical details

Algorithm (fully vectorized):

  1. Build a vote matrix: members × bills, values +1 (Yea), −1 (Nay), 0 (absent)
  2. Compute shared vote count: shared = voted @ voted.T (matrix multiplication counts bills both members voted on)
  3. Compute agreement count: agree = yea @ yea.T + nay @ nay.T (counts votes in the same direction)
  4. Agreement percentage: 100 × agree[i,j] / shared[i,j]

Filters: Only pairs with ≥10 shared votes. Results split into cross-party pairs (top 15) and same-party pairs (top 10).
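
The vectorized steps above can be sketched with a toy three-member matrix (real matrices span full chambers and hundreds of roll calls):

```python
import numpy as np

# Rows = members, columns = roll calls; +1 Yea, -1 Nay, 0 absent.
votes = np.array([
    [ 1, -1,  1,  0],   # member A
    [ 1, -1, -1,  1],   # member B
    [-1,  1,  1,  1],   # member C
])

voted = (votes != 0).astype(int)   # 1 wherever the member cast a vote
yea = (votes == 1).astype(int)
nay = (votes == -1).astype(int)

shared = voted @ voted.T           # bills both members voted on
agree = yea @ yea.T + nay @ nay.T  # votes cast in the same direction

with np.errstate(invalid="ignore", divide="ignore"):
    pct = np.where(shared > 0, 100 * agree / shared, 0.0)
# A and B share 3 votes and agree on 2, so pct[0, 1] == 100 * 2 / 3
```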

Participation Rate

A straightforward measure: what percentage of available roll-call votes did a member actually vote on? Missing votes can mean anything from illness to strategic abstention — we present the number without editorializing.

Technical details

Formula:

participation = (yea_count + nay_count) / total_rollcalls_in_chamber × 100

Computed per chamber (House and Senate have different roll-call totals). Shown on the Scorecard as top 25 most active and bottom 25 least active.
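
As a worked instance of the formula, assuming illustrative per-chamber counts:

```python
# Participation rate per the formula above; counts are per chamber.
def participation_rate(yea_count, nay_count, total_rollcalls):
    return (yea_count + nay_count) / total_rollcalls * 100

# A member casting 830 of 900 chamber roll-call votes:
rate = participation_rate(450, 380, 900)
# round(rate, 2) == 92.22
```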

Topic Passage Rates

How often do bills on each policy topic actually become law? This analysis tracks passage rates by topic across congresses — for example, do defense bills pass more often than environmental bills? Shown on the Discover page.

Technical details

Model: Deterministic aggregation (not machine learning).

Law detection: A bill is counted as "became law" if its latest_status field contains any of: Became, Signed, Public Law, or Enacted (case-insensitive).

Grouping: Bills grouped by (congress, policy_area). Passage rate = laws / total bills per group.

Output: Per-congress and all-time passage rates per topic, used for trend visualization.
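
The detection and grouping steps can be sketched as follows; the bill records are illustrative dicts, not the site's actual schema:

```python
from collections import defaultdict

# Case-insensitive law detection on the latest_status field.
LAW_MARKERS = ("became", "signed", "public law", "enacted")

def became_law(latest_status):
    status = (latest_status or "").lower()
    return any(marker in status for marker in LAW_MARKERS)

# Group by (congress, policy_area); rate = laws / total bills per group.
def passage_rates(bills):
    totals, laws = defaultdict(int), defaultdict(int)
    for bill in bills:
        key = (bill["congress"], bill["policy_area"])
        totals[key] += 1
        if became_law(bill["latest_status"]):
            laws[key] += 1
    return {key: laws[key] / totals[key] for key in totals}

bills = [
    {"congress": 118, "policy_area": "Health",
     "latest_status": "Became Public Law 118-42"},
    {"congress": 118, "policy_area": "Health",
     "latest_status": "Referred to Committee"},
]
# passage_rates(bills) == {(118, "Health"): 0.5}
```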

Data Sources

All data on What The Vote comes from official, publicly available government sources:

  • Congress.gov / GovInfo: Bills, roll-call votes, bill status, sponsor data, committee assignments
  • House Clerk / Senate.gov: Member rosters, party, state, district, official contact information
  • U.S. Census Bureau: American Community Survey 5-year estimates (demographics, economics, housing, education)
  • Federal Election Commission: Campaign finance filings (self-reported by campaigns)
  • congress-legislators (unitedstates project): Biographical data (birthdays, gender, historical terms)
  • House Statement of Disbursements: Members' Representational Allowance (MRA) office spending

Known Limitations

We believe in showing our work — including where it falls short:

  • Balance scores are approximations. The algorithm uses keyword matching, not deep reading comprehension. A bill "helping corporations" might actually help small businesses. Context is lost.
  • TF-IDF search misses synonyms. Searching "healthcare" may not surface bills that only use "medical care" or "wellness." The model matches words, not concepts.
  • Voting similarity measures alignment, not ideology. Two members who agree on 90% of votes may have very different reasons for those votes.
  • Text windows are limited. Balance and fiscal analysis use the first 8,000 characters of bill text. Long amendments and riders later in the bill may be missed.
  • Bill data may lag. Updates run multiple times daily but can trail official sources by hours.
  • Topic labels depend on Congress.gov. Only ~24% of bills have official policy_area labels. The classifier fills the gap, but imperfectly.
  • No LLM involved. All models are classical machine learning (scikit-learn). This means they're fast and reproducible, but lack the nuance of large language models.

Update Cadence

  • Bill data: Updated three times daily (incremental) with a full rebuild weekly
  • Member rosters: Updated daily via automated pipeline
  • ML models: Retrained weekly or when the corpus grows significantly
  • Balance & fiscal scores: Recomputed incrementally with each bill data update
  • Census data: Updated when new ACS vintages are released (annually)

Questions about our methodology? Get in touch. All code is available for review upon request.