unitedstates / congress

Public domain data collectors for the work of Congress, including legislation, amendments, and votes.

Home Page: https://github.com/unitedstates/congress/wiki

License: Creative Commons Zero v1.0 Universal

Python 98.75% Shell 0.71% Dockerfile 0.54%

congress's Issues

Touch up download resiliency

Make it so the script will continue after a timeout or other network error, and return an appropriate status dict back to the bills task when it does. Also, the actual timeout logging code raises an exception because it's broken!
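A minimal sketch of the behavior asked for here, assuming a hypothetical helper named `fetch_with_status` and a made-up status dict shape (`ok`, `reason`, `body`); the project's real download function and status format may differ. The fetch callable is passed in so the error path can be exercised without a network.

```python
import socket


def fetch_with_status(fetch, url):
    """Call fetch(url) and report network trouble instead of raising.

    `fetch` is any callable that returns the page body; the status dict
    keys used here are illustrative, not the project's actual format.
    """
    try:
        body = fetch(url)
    except (socket.timeout, IOError) as exc:
        # Timeouts and other I/O errors become a status the caller can log,
        # so one bad download does not stop the rest of the run.
        return {"ok": False, "reason": "network error: %s" % exc}
    return {"ok": True, "body": body}
```

The bills task could then collect the `ok: False` entries and report them at the end of the run.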

Bug in summary 'as' parsing

On PPACA, HR 3590 (111th):

"summary": {
    "as": "Public Law.\u00a0\u00a0\u00a0\u00a0(There are 3 <a href=\"/cgi-bin/bdquery/z?d111:HR03590:@@@D&summ1&\">other summaries</a>)", 
   ...
}
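One possible cleanup, sketched under the assumption that the intended value is only the text before the "(There are N other summaries)" note. The function name `clean_summary_as` is hypothetical.

```python
import re


def clean_summary_as(raw):
    """Strip the trailing "other summaries" note and stray whitespace."""
    # Remove the parenthetical note, including the embedded link markup.
    text = re.sub(r"\(There (?:is|are) \d+\s*<a\b.*?</a>\)", "", raw, flags=re.S)
    # In Python 3, \s also matches the non-breaking spaces (\u00a0) seen above.
    return re.sub(r"\s+", " ", text).strip()


raw = ("Public Law.\u00a0\u00a0\u00a0\u00a0(There are 3 "
       "<a href=\"/cgi-bin/bdquery/z?d111:HR03590:@@@D&summ1&\">other summaries</a>)")
cleaned = clean_summary_as(raw)
```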

Cloture vote on PPACA not getting parsed

Not sure why:

{
  "acted_at": "2009-12-23", 
  "references": [], 
  "text": "Cloture invoked in Senate by Yea-Nay Vote. 60 - 39. Record Vote Number: 395.", 
  "type": "action"
}
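A sketch of a pattern that would match this particular action text. The pattern and the returned keys (`yeas`, `nays`, `roll`) are assumptions for illustration and cover only the wording quoted above, not every cloture phrasing THOMAS uses.

```python
import re

# Matches only the wording shown in the action above.
CLOTURE_RE = re.compile(
    r"Cloture invoked in Senate by Yea-Nay Vote\. "
    r"(\d+) - (\d+)\. Record Vote Number: (\d+)\."
)


def parse_cloture(text):
    """Return vote details for a cloture action, or None if no match."""
    match = CLOTURE_RE.search(text)
    if match is None:
        return None
    yeas, nays, roll = (int(group) for group in match.groups())
    return {"yeas": yeas, "nays": nays, "roll": roll}


parsed = parse_cloture(
    "Cloture invoked in Senate by Yea-Nay Vote. 60 - 39. Record Vote Number: 395."
)
```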

Catch THOMAS' occasional DB bugs

I get occasional warnings about the introduction date not being found, even though the HTML is not truncated (the end html tag appears). It turns out, there's a DB error that is ephemeral (the next night, the same bill works fine).

Catch these and swallow (don't email) the warnings on this event, unless there are more than some number (20?) during a sweep that would indicate a systemic issue.

Hinge on "DBNAME not found in file".
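The rule described above could be sketched roughly as follows. The function name `triage_db_errors`, the `pages` mapping (bill ID to fetched HTML), and the returned keys are hypothetical; the limit of 20 is the tentative number floated in this issue.

```python
DB_ERROR_MARKER = "DBNAME not found in file"
MAX_TRANSIENT_ERRORS = 20  # the tentative "(20?)" threshold from the issue


def triage_db_errors(pages, limit=MAX_TRANSIENT_ERRORS):
    """Find pages hit by the transient DB error and decide whether to alert.

    `pages` maps a bill ID to the HTML that was fetched for it.
    """
    affected = sorted(
        bill_id for bill_id, html in pages.items() if DB_ERROR_MARKER in html
    )
    # Stay quiet for a handful of errors; alert only when the count
    # suggests a systemic problem.
    return {"skipped": affected, "alert": len(affected) > limit}
```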

XML generator uses the wrong Congress number for law enactment

https://github.com/unitedstates/congress/blob/master/tasks/bill_info.py#L286

This should use action["congress"], not bill["congress"].

See, for example, 85-hjres336, which was enacted as Public Law 86-11.
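To illustrate the difference, here is a small sketch that builds the law label from the action rather than the bill. Only `action["congress"]` and `bill["congress"]` come from the issue; the `number` key and the `law_label` helper are assumptions.

```python
def law_label(bill, action):
    """Format a public law citation using the Congress of enactment."""
    # Prefer the Congress recorded on the enactment action; fall back to
    # the bill's Congress only if the action does not carry one.
    congress = action.get("congress", bill["congress"])
    return "Public Law %d-%d" % (congress, action["number"])


# 85-hjres336 was introduced in the 85th Congress but enacted in the 86th.
label = law_label({"congress": 85}, {"congress": 86, "number": 11})
```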

Error reports should always include the bill ID in question

If the introduction date is missing or something, and it spawns a report, then the bill_id should be included. This may mean passing the bill_id in to each helper function.

Rsync + Govtrack

Josh, It looks like your new /congress folder (http://www.govtrack.us/data/congress/) is where you are storing the files generated by the scrapers here in this repo. It also seems that those files are not available via rsync, as I keep getting a "no such file or directory" error. Can you unlock them so they can be downloaded?

Set up very basic email notifications for when the script has an exception or fails on some bills

Make sure the script can email someone if it's being run on an automatic schedule. Create a config file whose values are not stored in source control (store an example file in source control). It should be okay if the config file's not there (someone checking out the repo should be able to run the main commands immediately), it just won't email anyone.

Congressional Record Data

GovTrack.us includes XML markup of the Congressional Record data delivered by the THOMAS system. I wonder if this project maintains or is going to maintain that feature and publishes the source code of the parser.

New Data: Nominations

Not that it's of any particular urgency, but filing this ticket to reflect a conversation over at unitedstates/wish-list#7 by @wilson428 and @dwillis about getting nominations from THOMAS.

Text of amendments

I saw Gordon and Josh discussing this on the govtrack side of things, but am I right that this does not currently fetch amendment text from Congressional Record? If so, I can take a stab. Did you ever convert your Perl to Python, @JoshData?

Some All Info pages are too big, detect this and get individual pages

An example: S 1867 has so much summary and activity that the TITLES section is truncated and the sections below it are missing. Super lame - hopefully the same algorithms work on the individual pages as on the All Info page.

sres5-113 should be marked as failed

It's a simple Senate resolution that failed. It's left as "INTRODUCED", but should be "FAIL:ORIGINATING:SENATE".

I'll grab this.

Skip bills that are Reserved for the Speaker

They shouldn't appear in the output at all, even if THOMAS has pages for them.

Feature Request: Scraping Local and State Legislation

We have a bit of a problem, make that a somewhat larger problem: finding information on our local and state websites in Delaware is very difficult for the average user. I would be very much interested in scraping data from State Legislation:
http://legis.delaware.gov/BillTracking

As well as local legislation and meetings:
https://sussexcountyde.gov/docs/agendasMinutes/index.cfm?resource=council

Even if this is outside of the scope of the current govtrack project, I came to you guys first because I would be interested in chipping in a bit to get a project like this kickstarted for our state. The lack of transparency and public awareness on anything is astounding. We need a source of information where people can go in, in layman's terms, and track current issues.

If there's anything I can do to help or you guys can provide some assistance, I would be very grateful, as well as I believe most citizens in my state would be. Thanks.

Rename 'state' to 'status'?

I introduced 'state' nodes to supersede the older 'status' nodes a few years back. Since the 'status' nodes have been deprecated in my XML, what do you think of renaming 'state' to 'status' throughout? It removes the confusion with state/district.

Pagination is incomplete on 5-digit bill lists

In the 94th Congress, the House bills got up past 10,000, and the paginator isn't getting them all.

CSV output for bills

It seems like CSV should be easy enough to produce as an output format for bills, for a common subset of the information. This would make the bulk data this scraper provides useful to a much wider variety of people.

We certainly couldn't capture all the data for a bill in CSV, but basic stuff like current title, bill code, sponsor, summary if available, introduction date, etc., we can export that.

The only complication I see is that our output currently does one file per bill, and this would really need to be one file per Congress to be useful.
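A rough sketch of the one-file-per-Congress idea, using only the standard library. The column names in `FIELDS` and the sample rows are placeholders, not the project's actual bill schema.

```python
import csv
import io

# Hypothetical common subset of bill fields to export.
FIELDS = ["bill_id", "title", "sponsor", "introduced_at"]


def bills_to_csv(bills):
    """Render a list of per-bill dicts as one CSV document."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for bill in bills:
        # Missing fields become empty cells; extra fields are left out.
        writer.writerow({field: bill.get(field, "") for field in FIELDS})
    return out.getvalue()


# Made-up sample rows, for illustration only.
sample = [
    {"bill_id": "hr1-113", "title": "A bill, for example",
     "sponsor": "Example Sponsor", "introduced_at": "2013-01-03"},
    {"bill_id": "s1-113", "title": "Another example"},
]
csv_text = bills_to_csv(sample)
```

In practice the per-bill files for a Congress would be read from disk and passed in as `bills`.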

Use roll call votes to reverse engineer voting records for bills from Statutes At Large

I just realized (and perhaps this has been discussed before) that roll call vote results are available for most, if not all, of the Congresses for which we have obtained bill information from the Statutes at Large (#34). We should, then, be able to reverse-engineer the voting records for these bills.

Scrape Failures on Senate Votes

I found I was missing a bunch of Senate votes, and it looks like I'm having an error in the scrape process. It might be a problem w/ my environment missing some dependencies or something, since I'm not a Python guy. Maybe you guys can see the problem.

Here is an example...

Command:

./run votes --vote_id=s195-110.2008 --force

Result:

Going to fetch 1 votes from congress #110 session 2008

Errors for 1 items:
[s195-110.2008] Exception:

Traceback (most recent call last):

  File "tasks/utils.py", line 100, in process_set
    results = fetch_func(id, options, *extra_args)

  File "tasks/vote_info.py", line 57, in fetch_vote
    parse_senate_vote(dom, vote)

  File "tasks/vote_info.py", line 174, in parse_senate_vote
    "congress": int(dom.xpath("number(document/document_congress)")),

ValueError: cannot convert float NaN to integer

When I scrape an entire Congress worth of votes, the errors look different:

ValueError: time data '' does not match format '%B %d, %Y, %I:%M %p'
[s2-110.2008] Exception:

Traceback (most recent call last):

  File "tasks/utils.py", line 100, in process_set
    results = fetch_func(id, options, *extra_args)

  File "tasks/vote_info.py", line 57, in fetch_vote
    parse_senate_vote(dom, vote)

  File "tasks/vote_info.py", line 148, in parse_senate_vote
    vote["record_modified"] = parse_date(dom.xpath("string(modify_date)"))

  File "tasks/vote_info.py", line 145, in parse_date
    return datetime.datetime.strptime(d, "%B %d, %Y, %I:%M %p")

  File "/usr/local/lib/python2.7/_strptime.py", line 325, in _strptime
    (data_string, format))
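Both tracebacks are consistent with the expected elements being absent from the fetched document: an XPath `number()` on a missing node yields NaN, and `string()` yields an empty string. A defensive sketch follows, assuming hypothetical helper names; it does not explain why the source document was empty.

```python
import datetime
import math


def to_int_or_none(value):
    """Convert an XPath number() result to int, treating NaN as missing."""
    if isinstance(value, float) and math.isnan(value):
        return None
    return int(value)


def parse_date_or_none(text):
    """Parse the Senate timestamp format, treating an empty string as missing."""
    if not text.strip():
        return None
    return datetime.datetime.strptime(text, "%B %d, %Y, %I:%M %p")
```

A caller could then report "vote document missing or empty" for the vote ID instead of failing with a conversion error.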

Parse: related committees

Any related committees for a bill.

New Data: American Memory

American Memory has dates, titles, keywords, and committee information for bills from the 6th through 42nd Congresses:

http://memory.loc.gov/ammem/amlaw/lwhblink.html
http://memory.loc.gov/ammem/amlaw/lwsblink.html
http://memory.loc.gov/ammem/amlaw/lwsrlink.html

It also has images of the bills, including multiple versions of the same bill in some cases (i.e. due to changes made to the bill).

New Data: Treaties

Take a similar approach to nominations, and get these. Super valuable.

Pull out public law numbers on enacted action

Missing from here:

{
  "acted_at": "2010-03-23", 
  "references": [], 
  "state": "ENACTED:SIGNED", 
  "text": "Became Public Law No: 111-148.", 
  "type": "enacted"
}
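A sketch of how the law number could be pulled from this action text. The output keys (`law`, `congress`, `number`) are assumptions, and the "Private" alternative in the pattern is a guess by analogy rather than something shown in this issue.

```python
import re

LAW_RE = re.compile(r"Became (Public|Private) Law No: (\d+)-(\d+)\.")


def parse_law(text):
    """Extract law type, Congress, and number from an enacted action."""
    match = LAW_RE.search(text)
    if match is None:
        return None
    return {
        "law": match.group(1).lower(),
        "congress": int(match.group(2)),
        "number": int(match.group(3)),
    }


law = parse_law("Became Public Law No: 111-148.")
```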

Parse: related amendments

The IDs of all related amendments.

Adding history.active_at and history.active flags

One of the fundamental ways of setting bills apart is whether they've received any action at all beyond formal introduction and referral to a committee. If a committee so much as considers them, or they're going right to the floor or something, that's a big deal.

These two fields would fit in with the other history flags, and would be calculated by looking for an action beyond the type that all introduced bills are guaranteed to have. The active_at timestamp would remain the date of the first non-SOP action even as other actions were taken.

These fields would facilitate easy filtering and sorting of "active bills", a helpful and user-friendly view of what Congress is up to.
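The calculation might look roughly like this. It assumes routine actions can be recognized by a `type` value; the set `ROUTINE_TYPES` and the helper name are placeholders for whatever test the scraper would really use.

```python
# Assumption: action types every introduced bill is guaranteed to have.
ROUTINE_TYPES = {"introduced", "referral"}


def activity_flags(actions):
    """Return the proposed active / active_at history flags."""
    # ISO dates sort correctly as strings; sorted() keeps ties in order.
    for action in sorted(actions, key=lambda a: a["acted_at"]):
        if action.get("type") not in ROUTINE_TYPES:
            # The first non-routine action fixes active_at for good.
            return {"active": True, "active_at": action["acted_at"]}
    return {"active": False, "active_at": None}
```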

Better split up sponsor and cosponsor information

Names and districts - even separately from normalization efforts.

Parse: date of introduction

Extract the introduction date.

New data: Bill versions, with text links

Sync version information with GPO using their sitemaps for all the years they have it available.

Write a new bill_versions.py task, which deposits a versions.json file for every bill that is available.

This file should contain an array of information on each version, including:

Version code (e.g. "ih", "enr")
Version name ("Introduced in House", "Enrolled")
Date issued
PDF text link
XML text link
Plain text link
GPO landing page link
MODS XML link
PREMIS XML link

Some bills can appear in GPO first, or in THOMAS first, so neither bill_info.py nor bill_versions.py should depend on each other's output in any way.

Not handling two-digit congresses yet

Some errors on paginating through 99th Congress results.

Small Hiccup in utils.py

Hello all:

I'm quite new to Python, but I keep getting this error when trying to grab bills:

[agave]$ ./run bills --congress=113
Traceback (most recent call last):
  File "./run", line 47, in <module>
    import utils
  File "tasks/utils.py", line 312
    thomas_types_2 = { v[0]: k for (k, v) in thomas_types.items() }

SyntaxError: invalid syntax

It goes away if I take out line 312, but I'm sure this is probably screwing something up, and I don't know enough about Python yet to understand exactly what it doesn't like. There's a caret (^) pointing at the space after the "for".
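A likely cause is the interpreter version: dict comprehensions were added in Python 2.7, so Python 2.6 and earlier reject that line with a SyntaxError at the "for". The sketch below shows an equivalent form that older interpreters accept. The contents of `thomas_types` here are a made-up stand-in, not the real mapping from utils.py.

```python
# Stand-in data; the real thomas_types in tasks/utils.py is not shown here.
thomas_types = {
    "hr": ("HR", "H.R."),
    "s": ("S", "S."),
}

# Same result as { v[0]: k for (k, v) in thomas_types.items() },
# written with dict() and a generator expression, which also parses
# on interpreters that predate dict comprehensions.
thomas_types_2 = dict((v[0], k) for (k, v) in thomas_types.items())
```

Removing the line outright would leave `thomas_types_2` undefined for any code that uses it later.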

Great work! Hope you can help!

Guidance on Parsing to a DB?

Hello all:

Sorry for the noob post. I'm pretty confident with the script and getting it to run, but I'm having two issues:

How do I get this to run as a cronjob and update regularly? Able to get it to run in a terminal but lost in cron.
Not even sure where to begin with making this data useful in a MySQL database. Can anyone recommend any good reading for where I should start to translate JSON/XML files into a usable database?
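As a starting point for the database question, here is a small sketch that loads bill JSON into a relational table. It uses the standard library's sqlite3 as a stand-in for MySQL, and the table layout and JSON keys (`bill_id`, `congress`, `title`) are assumptions rather than the scraper's exact output.

```python
import json
import sqlite3


def load_bills(conn, json_docs):
    """Insert one row per bill JSON document, replacing existing rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bills ("
        "bill_id TEXT PRIMARY KEY, congress INTEGER, title TEXT)"
    )
    for doc in json_docs:
        bill = json.loads(doc)
        # INSERT OR REPLACE keeps reruns idempotent when a bill is re-scraped.
        conn.execute(
            "INSERT OR REPLACE INTO bills VALUES (?, ?, ?)",
            (bill["bill_id"], int(bill["congress"]), bill.get("title")),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
load_bills(conn, ['{"bill_id": "hr1-113", "congress": "113", "title": "Example"}'])
```

The same pattern carries over to MySQL with a different driver and placeholder style; a scheduled job would run the scraper first and then a loader like this over the output files.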

Thanks! Sorry to pester!

Done with branches?

We have fdsys and votesparser branches. I'll delete them?

New data: Amendments

Also from THOMAS.gov. Can be switched to Congress.gov later, after it achieves parity with THOMAS.gov. Most of the code for bills should be reusable for amendments.

Extract information from action description

The action type (similar to how GovTrack does it, including vote and vote2), roll call numbers, associated bill IDs, anything.

Parse: related committees from a standalone page fallback

For truncated bills.

Instead of submodules, use the cache dir/flags to manage download of people data

Ideally, no one should have to remember to also manually keep a git submodule in sync when working with the THOMAS scraper, or to work it into their automated sync process. Instead of using utils.download, we can actually call out to the system to do a "git clone", deleting the folder if it already exists (or not doing this, if it's there and the cache flag is on).

Unit tests for bill results for bills in various conditions

In test_bill_history.py, cover (at least) the following cases:

health care reconciliation act
(lots of back and forth, possibly different action language)
Veto override failure
Veto override success
Awaiting signature (fictional)
Vetoed bill, no override attempt
Bill that passed senate and failed House
Bill that passed House and failed Senate
Concurrent resolution that passed both chambers
Simple resolution that passed House
Simple resolution that passed Senate

Check the history flags, the sponsors, etc. This should exercise most of the code path of the scraper. Don't need to be very comprehensive in testing individual action parsing - that's handled thoroughly in a separate test file already.

Issue with data downloaded from 99th and below

The data seems to download without errors, but on inspection it clearly has not downloaded correctly.

General exception handler in runner

Catch all exceptions in runner.py and handle them gracefully.

Test THOMAS info back to 1973

Right now it's been tested back to the mid-1990s, but there are almost certainly more corner cases to find.

Allow for related bills that are amendments

S. 1766 in the 107th Congress has a related bill of S. Amdt. 2917.

http://thomas.loc.gov/cgi-bin/bdquery/z?d107:S.1766:
http://thomas.loc.gov/cgi-bin/bdquery/z?d107:SP02917:

./run bills --bill_id=s1766-107 --debug

When done, "samdt2917-107" should appear in the related_bills array, and the docs should be updated accordingly.
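A sketch of turning the displayed amendment label into the ID format requested above. The "samdt2917-107" form comes from this issue; the "hamdt" prefix for House amendments and the helper name are assumptions by analogy.

```python
import re

AMDT_RE = re.compile(r"^(S|H)\.\s*Amdt\.\s*(\d+)$", re.I)


def amendment_id(label, congress):
    """Build an ID such as "samdt2917-107" from a label like "S. Amdt. 2917"."""
    match = AMDT_RE.match(label.strip())
    if match is None:
        return None
    chamber = match.group(1).lower()
    return "%samdt%d-%d" % (chamber, int(match.group(2)), congress)


related = amendment_id("S. Amdt. 2917", 107)
```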

Move withdrawn cosponsors into their own array(s)

Separate them from the "cosponsors" array.
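A minimal sketch of the split, assuming each cosponsor dict carries a `withdrawn_at` value when the cosponsor has withdrawn. That key name and the sample names are hypothetical.

```python
def split_cosponsors(cosponsors):
    """Separate current cosponsors from those who have withdrawn."""
    current = [c for c in cosponsors if not c.get("withdrawn_at")]
    withdrawn = [c for c in cosponsors if c.get("withdrawn_at")]
    return current, withdrawn


# Made-up entries for illustration.
current, withdrawn = split_cosponsors([
    {"name": "Example, A.", "withdrawn_at": None},
    {"name": "Example, B.", "withdrawn_at": "2013-02-01"},
])
```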

Separate common utilities from source-specific scripts

(A spin-off from #34.)

As the project is growing, it is starting to feel growing pains from the utilities that have been added. Common utilities that do not rely on outside sources should be split into their own separate file(s) so that new scripts can import them without importing methods that aren't needed.

For example, bill_info.py contains a lot of methods useful for outputting bill data, but also contains a lot of methods for getting bill data from THOMAS. Also, utils.py might be better split off into multiple files grouped by function.

Switch to a proper command line options parser

I'm using a janky split rule on equals signs - we should use the traditional *nix-style command line syntax.
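One way this could look with the standard library's argparse, covering only the flags that appear in commands elsewhere in these issues (`--congress`, `--bill_id`, `--vote_id`, `--force`, `--debug`). The parser layout is a sketch, not the project's eventual design.

```python
import argparse


def build_parser():
    """Build a parser for commands such as: ./run bills --congress=113"""
    parser = argparse.ArgumentParser(prog="run")
    parser.add_argument("task", help="task to run, e.g. bills or votes")
    parser.add_argument("--congress", type=int)
    parser.add_argument("--bill_id")
    parser.add_argument("--vote_id")
    parser.add_argument("--force", action="store_true")
    parser.add_argument("--debug", action="store_true")
    return parser


args = build_parser().parse_args(["votes", "--vote_id=s195-110.2008", "--force"])
```

argparse accepts both `--congress=113` and `--congress 113`, so existing invocations would keep working.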

Bill versions?

Do we want to include info about different versions of a bill? Some bills don't have more than one, but for others, Thomas shows something like this:

http://thomas.loc.gov/cgi-bin/query/z?c112:S.978:

Historical: Legislator IDs are 0000000 in some historical House votes

In the vote data for the 110th Congress there are many legislator ids that say "0000000", which is invalid. I guess it's a problem w/ the source data and nothing we can do about it. A workaround might be to compare the date, lastname, state, and chamber to the Legislators table to try and fill in the missing ID value.
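The workaround could be sketched as a lookup like the one below. The record layout (`id`, `last_name`, `state`, `chamber`, `start`, `end`) and the sample entries are invented for illustration; a real implementation would read the legislators data instead.

```python
from datetime import date


def find_legislator_id(voter, vote_date, chamber, legislators):
    """Match a voter with an invalid ID against known legislator terms."""
    matches = [
        person["id"]
        for person in legislators
        if person["last_name"] == voter["last_name"]
        and person["state"] == voter["state"]
        and person["chamber"] == chamber
        and person["start"] <= vote_date <= person["end"]
    ]
    # Only accept an unambiguous match; otherwise leave the ID unresolved.
    return matches[0] if len(matches) == 1 else None


# Made-up legislator records.
legislators = [
    {"id": "X000001", "last_name": "Doe", "state": "XX", "chamber": "h",
     "start": date(2007, 1, 4), "end": date(2009, 1, 3)},
    {"id": "X000002", "last_name": "Doe", "state": "YY", "chamber": "h",
     "start": date(2007, 1, 4), "end": date(2009, 1, 3)},
]
found = find_legislator_id(
    {"last_name": "Doe", "state": "XX"}, date(2008, 6, 1), "h", legislators
)
```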