
congress's People

Contributors

acxz, boblannon, camisatx, connorjoleary, crdunwel, dcloud, divergentdave, dwillis, elianull, gphemsley, hugovk, jamesa, jamesturk, jonathanstrong, joshdata, konklone, lorien, michaelblyons, paultag, plantfansam, richardbx, ryparker, s4njee, stevesdawg, trentmercer, treymo, tribble, willvanwazer, wilson428

congress's Issues

Touch up download resiliency

Make it so the script will continue after a timeout or other network error, and return an appropriate status dict back to the bills task when it does. Also, the actual timeout logging code raises an exception because it's broken!
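A minimal sketch of the behavior asked for here, not the repository's code: the `fetch` callable, the `retries` parameter, and the keys of the returned status dict are all hypothetical names chosen for illustration.

```python
import logging
import socket
import urllib.error


def download_with_status(url, fetch, retries=2):
    """Call fetch(url); on a network error, log it and keep going.

    Returns a status dict instead of raising, so the calling task can
    record the failure and move on to the next item.
    `fetch` and the dict keys are hypothetical, not the project's API.
    """
    last_error = None
    for attempt in range(1, retries + 2):  # first try plus `retries` retries
        try:
            body = fetch(url)
            return {"ok": True, "saved": True, "body": body}
        except (socket.timeout, urllib.error.URLError, OSError) as exc:
            last_error = exc
            logging.warning("download failed (attempt %d) for %s: %s",
                            attempt, url, exc)
    return {"ok": False, "saved": False, "reason": str(last_error)}
```

Passing the log arguments separately, rather than formatting the message by hand, is one way to keep the logging call itself from raising.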

Bug in summary 'as' parsing

On PPACA, HR 3590 (111th):

"summary": {
    "as": "Public Law.\u00a0\u00a0\u00a0\u00a0(There are 3 <a href=\"/cgi-bin/bdquery/z?d111:HR03590:@@@D&summ1&\">other summaries</a>)", 
   ...
}
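One possible cleanup, sketched under the assumption that the trailing note always has the shape seen above ("(There are N <a ...>other summaries</a>)"); the function name is hypothetical and other variants of the note may exist.

```python
import re

# Trailing note as it appears in the example above. The wording variants
# allowed by \w+ are a guess; only the "are ... summaries" form was observed.
OTHER_SUMMARIES = re.compile(
    r"\s*\(There \w+ \d+ <a [^>]*>other summar\w+</a>\)\s*$"
)


def clean_summary_as(raw):
    """Drop the 'other summaries' link and stray non-breaking spaces."""
    text = OTHER_SUMMARIES.sub("", raw)
    return text.replace("\u00a0", " ").strip()
```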

Cloture vote on PPACA not getting parsed

Not sure why:

{
  "acted_at": "2009-12-23", 
  "references": [], 
  "text": "Cloture invoked in Senate by Yea-Nay Vote. 60 - 39. Record Vote Number: 395.", 
  "type": "action"
}
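For reference, a pattern that does match the action text above. This is a sketch only: the output keys are hypothetical, and the optional "not " branch for a failed cloture vote is an assumption about wording that does not appear in this issue.

```python
import re

CLOTURE = re.compile(
    r"Cloture (?P<neg>not )?invoked in Senate by Yea-Nay Vote\. "
    r"(?P<yeas>\d+) - (?P<nays>\d+)\. Record Vote Number: (?P<roll>\d+)"
)


def parse_cloture(text):
    """Return vote details for a cloture action, or None if no match."""
    m = CLOTURE.search(text)
    if m is None:
        return None
    return {
        "vote_type": "cloture",                          # hypothetical key
        "result": "fail" if m.group("neg") else "pass",  # hypothetical values
        "yeas": int(m.group("yeas")),
        "nays": int(m.group("nays")),
        "roll": m.group("roll"),
    }
```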

Catch THOMAS' occasional DB bugs

I get occasional warnings about the introduction date not being found, even though the HTML is not truncated (the end html tag appears). It turns out, there's a DB error that is ephemeral (the next night, the same bill works fine).

Catch these and swallow (don't email) the warnings on this event, unless there are more than some number (20?) during a sweep that would indicate a systemic issue.

Hinge on "DBNAME not found in file".
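The rule above could be sketched roughly as follows. The function name, the input shape, and the returned keys are hypothetical; the threshold is the "(20?)" floated above, not a tuned value.

```python
DB_ERROR_MARKER = "DBNAME not found in file"
MAX_SILENT_DB_ERRORS = 20  # the "(20?)" suggested above; a guess


def triage_warnings(pages):
    """Split missing-introduction-date warnings into email vs. swallow.

    pages: iterable of (bill_id, html) for bills whose introduction
    date could not be found during one sweep.
    """
    db_errors, real_warnings = [], []
    for bill_id, html in pages:
        if DB_ERROR_MARKER in html:
            db_errors.append(bill_id)
        else:
            real_warnings.append(bill_id)

    # Too many DB errors in one sweep suggests a systemic problem,
    # so stop swallowing them and report everything.
    systemic = len(db_errors) > MAX_SILENT_DB_ERRORS
    return {
        "email": real_warnings + (db_errors if systemic else []),
        "swallowed": [] if systemic else db_errors,
        "systemic": systemic,
    }
```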

Rsync + Govtrack

Josh, it looks like your new /congress folder (http://www.govtrack.us/data/congress/) is where you are storing the files generated by the scrapers in this repo. Those files do not seem to be available via rsync, as I keep getting a "no such file or directory" error. Can you unlock them so they can be downloaded?

Congressional Record Data

GovTrack.us includes XML markup of the Congressional Record data delivered by the THOMAS system. Does this project maintain that feature, or plan to, and will the source code of the parser be published?

Text of amendments

I saw Gordon and Josh discussing this on the govtrack side of things, but am I right that this does not currently fetch amendment text from Congressional Record? If so, I can take a stab. Did you ever convert your Perl to Python, @JoshData?

Feature Request: Scraping Local and State Legislation

We have a bit of a problem, make that a somewhat larger problem: finding information on our local and state (DE) websites is very difficult for the average user. I would be very much interested in scraping data from State Legislation:
http://legis.delaware.gov/BillTracking

As well as local legislation and meetings:
https://sussexcountyde.gov/docs/agendasMinutes/index.cfm?resource=council

Even if this is outside of the scope of the current govtrack project, I came to you guys first because I would be interested in chipping in a bit to get a project like this kickstarted for our state. The lack of transparency and public awareness on anything is astounding. We need a source of information where people can go in, in layman's terms, and track current issues.

If there's anything I can do to help or you guys can provide some assistance, I would be very grateful, as I believe most citizens in my state would be. Thanks.

Rename 'state' to 'status'?

I introduced 'state' nodes to supersede the older 'status' nodes a few years back. Since the 'status' nodes have been deprecated in my XML, what do you think of renaming 'state' to 'status' throughout? It removes the confusion with state/district.

CSV output for bills

It seems like CSV should be easy enough to produce as an output format for bills, for a common subset of the information. This would make the bulk data this scraper provides useful to a much wider variety of people.

We certainly couldn't capture all the data for a bill in CSV, but basic stuff like current title, bill code, sponsor, summary if available, introduction date, etc., we can export that.

The only complication I see is that our output currently does one file per bill, and this would really need to be one file per Congress to be useful.
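A rough sketch of the one-file-per-Congress idea using the standard `csv` module. The column list and the keys read from each bill dict (`bill_id`, `title`, `sponsor.name`, `introduced_at`, `summary.text`) are assumptions for illustration, not the scraper's confirmed schema.

```python
import csv
import io

# Hypothetical common subset of bill fields.
FIELDS = ["bill_id", "title", "sponsor", "introduced_at", "summary"]


def bills_to_csv(bills, out):
    """Write one row per bill to the open text file `out`."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    for bill in bills:
        sponsor = bill.get("sponsor") or {}
        summary = bill.get("summary") or {}  # summary may be absent
        writer.writerow({
            "bill_id": bill.get("bill_id", ""),
            "title": bill.get("title", ""),
            "sponsor": sponsor.get("name", ""),
            "introduced_at": bill.get("introduced_at", ""),
            "summary": summary.get("text", ""),
        })
```

Feeding it every per-bill file for a Congress in turn would produce the single per-Congress CSV; `io.StringIO()` works as `out` for a quick check.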

Scrape Failures on Senate Votes

I found I was missing a bunch of Senate votes, and it looks like I'm having an error in the scrape process. It might be a problem w/ my environment missing some dependencies or something, since I'm not a Python guy. Maybe you guys can see the problem.

Here is an example...

Command:

./run votes --vote_id=s195-110.2008 --force

Result:

Going to fetch 1 votes from congress #110 session 2008

Errors for 1 items:
[s195-110.2008] Exception:

Traceback (most recent call last):

  File "tasks/utils.py", line 100, in process_set
    results = fetch_func(id, options, *extra_args)

  File "tasks/vote_info.py", line 57, in fetch_vote
    parse_senate_vote(dom, vote)

  File "tasks/vote_info.py", line 174, in parse_senate_vote
    "congress": int(dom.xpath("number(document/document_congress)")),

ValueError: cannot convert float NaN to integer

When I scrape an entire Congress worth of votes, the errors look different:

ValueError: time data '' does not match format '%B %d, %Y, %I:%M %p'
[s2-110.2008] Exception:

Traceback (most recent call last):

  File "tasks/utils.py", line 100, in process_set
    results = fetch_func(id, options, *extra_args)

  File "tasks/vote_info.py", line 57, in fetch_vote
    parse_senate_vote(dom, vote)

  File "tasks/vote_info.py", line 148, in parse_senate_vote
    vote["record_modified"] = parse_date(dom.xpath("string(modify_date)"))

  File "tasks/vote_info.py", line 145, in parse_date
    return datetime.datetime.strptime(d, "%B %d, %Y, %I:%M %p")

  File "/usr/local/lib/python2.7/_strptime.py", line 325, in _strptime
    (data_string, format))
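Both errors are what you would see if the expected elements are missing from the fetched document: the XPath `number()` function returns NaN for a missing node, and `string()` returns an empty string. Whether that is the actual cause here is not confirmed. A defensive sketch, with hypothetical helper names:

```python
import datetime
import math


def to_int_or_none(value):
    """int() that tolerates the NaN XPath number() gives for a missing node."""
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return int(value)


def parse_date_or_none(text, fmt="%B %d, %Y, %I:%M %p"):
    """strptime that returns None for the empty string instead of raising."""
    text = text.strip()
    if not text:
        return None
    return datetime.datetime.strptime(text, fmt)
```

A caller could then treat a `None` congress number or date as "the download was bad" and report it as such, rather than crashing deep inside the parser.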

New Data: American Memory

American Memory has dates, titles, keywords, and committee information for bills from the 6th through 42nd Congresses:

http://memory.loc.gov/ammem/amlaw/lwhblink.html
http://memory.loc.gov/ammem/amlaw/lwsblink.html
http://memory.loc.gov/ammem/amlaw/lwsrlink.html

It also has images of the bills, including multiple versions of the same bill in some cases (i.e. due to changes made to the bill).

See also:
http://memory.loc.gov/ammem/amlaw/lawhome.html
http://memory.loc.gov/ammem/amlaw/lwsp.html
http://memory.loc.gov/ammem/amlaw/lwss.html

Adding history.active_at and history.active flags

One of the fundamental ways of setting bills apart is whether they've received any action at all beyond formal introduction and referral to a committee. If a committee so much as considers them, or they're going right to the floor or something, that's a big deal.

These two fields would fit in with the other history flags, and would be calculated by looking for an action beyond the type that all introduced bills are guaranteed to have. The active_at timestamp would remain the date of the first non-SOP action even as other actions were taken.

These fields would facilitate easy filtering and sorting of "active bills", a helpful and user-friendly view of what Congress is up to.
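The calculation could look something like this sketch. The set of "routine" action types is a placeholder; deciding which actions every introduced bill is guaranteed to have is the real work and is not settled here.

```python
# Hypothetical: action types treated as standard for every introduced bill.
ROUTINE_TYPES = {"introduced", "referral"}


def activity_flags(actions):
    """Return the proposed history flags for a bill's list of actions.

    active_at is the date of the earliest non-routine action and does
    not change as later actions are added.
    """
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    for action in sorted(actions, key=lambda a: a["acted_at"]):
        if action.get("type") not in ROUTINE_TYPES:
            return {"active": True, "active_at": action["acted_at"]}
    return {"active": False, "active_at": None}
```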

New data: Bill versions, with text links

Sync version information with GPO using their sitemaps for all the years they have it available.

Write a new bill_versions.py task, which deposits a versions.json file for every bill that is available.

This file should contain an array of information on each version, including:

  • Version code (e.g. "ih", "enr")
  • Version name ("Introduced in House", "Enrolled")
  • Date issued
  • PDF text link
  • XML text link
  • Plain text link
  • GPO landing page link
  • MODS XML link
  • PREMIS XML link

Some bills can appear in GPO first, or in THOMAS first, so neither bill_info.py nor bill_versions.py should depend on each other's output in any way.
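To make the proposed file concrete, here is one possible shape for versions.json. All key names are hypothetical, and dates and links are left as `None` rather than invented.

```python
import json

# Hypothetical keys for the six links listed above.
LINK_KEYS = ["pdf", "xml", "text", "landing_page", "mods", "premis"]


def make_version(code, name, issued_on, links=None):
    """Build one entry of the versions array; missing links become None."""
    links = links or {}
    return {
        "version_code": code,
        "version_name": name,
        "issued_on": issued_on,
        "urls": {key: links.get(key) for key in LINK_KEYS},
    }


versions = [
    make_version("ih", "Introduced in House", None),
    make_version("enr", "Enrolled", None),
]
print(json.dumps(versions, indent=2))
```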

Small Hiccup in utils.py

Hello all:

I'm quite new to Python, but I keep getting this error when trying to grab bills:

[agave]$ ./run bills --congress=113
Traceback (most recent call last):
File "./run", line 47, in
import utils
File "tasks/utils.py", line 312
thomas_types_2 = { v[0]: k for (k, v) in thomas_types.items() }

SyntaxError: invalid syntax

It goes away if I take out line 312, but I'm sure this is probably screwing something up, and I don't know enough about Python yet to understand exactly what it doesn't like. There's a caret (^) pointing at the space after the "for".

Great work! Hope you can help!
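Dict comprehensions were added in Python 2.7, so a SyntaxError at the "for" is what an older interpreter (2.6 or earlier) would report; that the machine above is running one is an assumption. The sketch below shows an equivalent spelling that older versions accept. The sample `thomas_types` values are illustrative only, since the real mapping is not shown in this issue.

```python
# Illustrative stand-in for the mapping defined in tasks/utils.py.
thomas_types = {"hr": ("HR", "H.R."), "s": ("SN", "S.")}

# Dict comprehension, as on line 312 (needs Python 2.7 or later):
inverted_new = {v[0]: k for (k, v) in thomas_types.items()}

# Equivalent form that also parses on Python 2.6 and earlier:
inverted_old = dict((v[0], k) for (k, v) in thomas_types.items())

assert inverted_new == inverted_old
```

Removing the line outright would leave `thomas_types_2` undefined for any code that uses it later, so rewriting it (or upgrading Python) is the safer route.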

Guidance on Parsing to a DB?

Hello all:

Sorry for the noob post. I'm pretty confident with the script and getting it to run, but I'm having two issues:

  1. How do I get this to run as a cronjob and update regularly? Able to get it to run in a terminal but lost in cron.
  2. Not even sure where to begin with making this data useful in a MySQL database. Can anyone recommend any good reading for where I should start to translate JSON/XML files into a usable database?

Thanks! Sorry to pester!
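As a starting point for the second question, here is a small sketch that loads bill JSON into a relational table. It uses the standard-library `sqlite3` module as a stand-in for MySQL, and the JSON keys (`bill_id`, `official_title`, `introduced_at`) are assumed for illustration.

```python
import json
import sqlite3


def load_bills(conn, bill_json_strings):
    """Insert or update one row per bill from JSON documents."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bills ("
        "bill_id TEXT PRIMARY KEY, title TEXT, introduced_at TEXT)"
    )
    for raw in bill_json_strings:
        bill = json.loads(raw)
        # INSERT OR REPLACE is SQLite syntax; MySQL spells upserts differently.
        conn.execute(
            "INSERT OR REPLACE INTO bills (bill_id, title, introduced_at) "
            "VALUES (?, ?, ?)",
            (bill["bill_id"], bill.get("official_title"),
             bill.get("introduced_at")),
        )
    conn.commit()
```

Because rows are keyed on the bill ID, re-running the loader after each scheduled scrape updates existing bills instead of duplicating them.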

New data: Amendments

Also from THOMAS.gov. Can be switched to Congress.gov later, after it achieves parity with THOMAS.gov. Most of the code for bills should be reusable for amendments.

Instead of submodules, use the cache dir/flags to manage download of people data

Ideally, no one should have to remember to also manually keep a git submodule in sync when working with the THOMAS scraper, or to work it into their automated sync process. Instead of using utils.download, we can actually call out to the system to do a "git clone", deleting the folder if it already exists (or not doing this, if it's there and the cache flag is on).
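The approach described could be sketched like this. The function name and parameters are hypothetical, and the `run` parameter exists only so the command can be inspected without touching the network.

```python
import os
import shutil
import subprocess


def sync_people_repo(repo_url, dest, use_cache=False, run=subprocess.check_call):
    """Clone repo_url into dest, honoring a cache flag.

    If dest exists and use_cache is on, do nothing; otherwise delete
    the folder and clone a fresh copy.
    """
    if os.path.isdir(dest):
        if use_cache:
            return "cached"
        shutil.rmtree(dest)
    # A shallow clone is enough when only the current data is needed.
    run(["git", "clone", "--depth", "1", repo_url, dest])
    return "cloned"
```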

Unit tests for bill results for bills in various conditions

In test_bill_history.py, cover (at least) the following cases:

  • health care reconciliation act
    (lots of back and forth, possibly different action language)
  • Veto override failure
  • Veto override success
  • Awaiting signature (fictional)
  • Vetoed bill, no override attempt
  • Bill that passed Senate and failed House
  • Bill that passed House and failed Senate
  • Concurrent resolution that passed both chambers
  • Simple resolution that passed House
  • Simple resolution that passed Senate

Check the history flags, the sponsors, etc. This should exercise most of the code path of the scraper. Don't need to be very comprehensive in testing individual action parsing - that's handled thoroughly in a separate test file already.

Separate common utilities from source-specific scripts

(A spin-off from #34.)

As the project is growing, it is starting to feel growing pains from the utilities that have been added. Common utilities that do not rely on outside sources should be split into their own separate file(s) so that new scripts can import them without importing methods that aren't needed.

For example, bill_info.py contains a lot of methods useful for outputting bill data, but also contains a lot of methods for getting bill data from THOMAS. Also, utils.py might be better split off into multiple files grouped by function.

Historical: Legislator IDs are 0000000 in some historical House votes

In the vote data for the 110th Congress there are many legislator ids that say "0000000", which is invalid. I guess it's a problem w/ the source data and nothing we can do about it. A workaround might be to compare the date, lastname, state, and chamber to the Legislators table to try and fill in the missing ID value.
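The workaround might be sketched as below. The field names on the voter and legislator records (`last_name`, `state`, `chamber`, `start`, `end`) are hypothetical, and the sketch declines to guess when more than one legislator matches.

```python
MISSING_ID = "0000000"


def fill_missing_id(voter, vote_date, chamber, legislators):
    """Return the voter's ID, looking it up if the source gave 0000000.

    Matches on last name, state, chamber, and a term covering vote_date.
    Returns None when there is no match or the match is ambiguous.
    """
    if voter.get("id") != MISSING_ID:
        return voter.get("id")
    matches = [
        leg["id"] for leg in legislators
        if leg["last_name"].lower() == voter["last_name"].lower()
        and leg["state"] == voter["state"]
        and leg["chamber"] == chamber
        and leg["start"] <= vote_date <= leg["end"]  # ISO date strings
    ]
    return matches[0] if len(matches) == 1 else None
```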

New Data: Statutes at Large

Just a heads-up that Gordon and I are working on pulling info out of the new Statutes at Large MODS files.

Parse: assigned subjects

All LOC-assigned subjects. This has to come from the subjects page, not the All Information page.
