Comments (20)
Is someone doing this? I'd be willing to do it, but don't want to duplicate someone else's effort.
from congress.
I was planning on tackling it in January, but not this year. If you'd like to jump on it, feel free. I've written a bulk downloader for bill text through GPO's sitemaps for Sunlight's Congress API already, in Ruby, and was going to port it to Python for this project.
I've also changed my thinking a bit since originally filing the ticket - instead of downloading the actual bill text (which is a large amount of data, and not particularly useful, since GPO has them in bulk at reliable URLs), the task should just fetch basic information for each version of each bill, along with links to bill text for each version.
So what I'd like to see for this project is a bill_versions.py task that iterates through GPO's sitemaps and produces a versions.json file for each bill (it'd sit alongside the data.json file for that bill, if it existed). That versions JSON would basically be an array of hashes, where each hash contains that version's version code and code name (I have a mapping for them), the date the version was issued, and then links to the bill's PDF, XML, and plaintext versions of its text - and also links to that version's landing pages on GPO, and its MODS and PREMIS metadata files.
Since I've done this all once before, have a specific idea of how it should be done, and would basically be porting a script over, I am very happy to do it. :) But you're welcome to start it too if you want, and I'll offer help however I can.
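The versions.json layout described above could look roughly like the sketch below. The field names (`version_code`, `version_name`, `issued_on`, `urls`) and every URL pattern other than the content-detail landing page are assumptions made for illustration, and the issue date is a placeholder value.

```python
import json

# Assumed FDSys package root. Only the content-detail.html pattern is taken
# from this thread; the other URL patterns below are guesses at the layout.
FDSYS_PKG = "http://www.gpo.gov/fdsys/pkg"


def version_record(package, version_code, version_name, issued_on):
    """Build one entry of the versions array for a bill text package."""
    base = f"{FDSYS_PKG}/{package}"
    return {
        "version_code": version_code,
        "version_name": version_name,
        "issued_on": issued_on,
        "urls": {
            "landing": f"{base}/content-detail.html",
            "pdf": f"{base}/pdf/{package}.pdf",
            "xml": f"{base}/xml/{package}.xml",
            "text": f"{base}/html/{package}.htm",
            "mods": f"{base}/mods.xml",
            "premis": f"{base}/premis.xml",
        },
    }


# The issued_on date here is a placeholder, not a looked-up value.
versions = [
    version_record("BILLS-113hr30ih", "ih", "Introduced in House", "2013-01-03")
]
print(json.dumps(versions, indent=2))
```

Keeping only links and dates keeps each file small, which matches the idea of leaving the bulk text at GPO and fetching it separately.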
from congress.
I need the actual PDF, MODS, etc., so a command-line flag to mirror those files locally would help me. (It should be smart and only download files if a hash has changed, or something.) Or I can always write my own mirroring script later, of course.
There's also HTML from THOMAS, which is harder to scrape. I can post my old Perl code for that if anyone wants to tackle that.
from congress.
Yeah, I'll need them too. I actually think that might be best implemented as another .py file, separate from bill_versions.py, whose sole goal is to use the data in any present versions.json files to download all the requested material to a local cache. That script would take in command line parameters dictating which kinds of URLs would be downloaded to disk, for instance, and maybe other useful parameters like rate limiting.
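A separate mirroring script along those lines might be sketched as follows. Everything here is hypothetical: the `mirror` and `http_get` names, the `--kinds` and `--delay` flags, and the assumption that each versions.json is an array of records with a `version_code` and a `urls` dict. The "skip if the file already exists" check is a stand-in for a real change check based on a hash or lastmod.

```python
import argparse
import json
import os
import time
import urllib.request


def http_get(url):
    """Fetch a URL and return the response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def mirror(versions_path, kinds, delay=1.0, fetch=http_get, sleep=time.sleep):
    """Download the requested kinds of URL for every version in one file."""
    with open(versions_path) as f:
        versions = json.load(f)
    cache_dir = os.path.dirname(os.path.abspath(versions_path))
    saved = []
    for version in versions:
        for kind in kinds:
            url = version["urls"].get(kind)
            if url is None:
                continue  # this version has no link of that kind
            dest = os.path.join(
                cache_dir, version["version_code"], os.path.basename(url)
            )
            if os.path.exists(dest):
                continue  # naive check: already mirrored, do not refetch
            os.makedirs(os.path.dirname(dest), exist_ok=True)
            with open(dest, "wb") as out:
                out.write(fetch(url))
            saved.append(dest)
            sleep(delay)  # crude rate limiting between requests
    return saved


def main(argv=None):
    parser = argparse.ArgumentParser(description="Mirror bill text files.")
    parser.add_argument("versions_json", nargs="+")
    parser.add_argument("--kinds", default="pdf,xml",
                        help="comma-separated URL kinds to download")
    parser.add_argument("--delay", type=float, default=1.0,
                        help="seconds to wait between downloads")
    args = parser.parse_args(argv)
    for path in args.versions_json:
        for dest in mirror(path, args.kinds.split(","), args.delay):
            print("saved", dest)
```

Calling `main(["path/to/versions.json", "--kinds", "pdf,mods", "--delay", "2"])` would fetch only the PDF and MODS links, pausing two seconds after each download.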
Is the HTML version on THOMAS useful in any way that's distinct from the value you get from what GPO has? GPO's XML versions of bills come with display stylesheets that render them just as usefully as (and much more official-looking than) THOMAS' HTML.
from congress.
I got errors when I tried to run the XSLT stylesheet against the bill text XML. The HTML also has a simpler structure, which is good for doing comparisons and other analysis. (If only they gave us the raw GPO locator code files...)
from congress.
Ok cool. I was just trying to look for a way to contribute - though I'm not sure I can get to it before January either (currently trying to get StateRepMe ready to launch the first week of next year).
I might leave this to you then, but if there's another task that I could help with or contribute to, let me know. I haven't contributed to many open source projects (trying to change that now), but through graduate school I've gained a lot of experience writing web scrapers, both for THOMAS and more generally.
from congress.
@tauberer - Just checked out Congress.gov, apparently they have a plaintext view, their own PDF copy, and an XML version with CSS. They may just be mirroring GPO's data exactly, I don't know. But either way, since THOMAS is closing next year sometime, it may not be a good idea to build display stuff around its HTML structure.
from congress.
@notthatbreezy Hope my information blast wasn't discouraging, you just got me thinking about things. :) Besides this task, there are also a couple of bugs that need addressing in summary parsing and in handling THOMAS' instability, that you may have already handled in your own scraper.
from congress.
@konklone Ha, not at all - but you're right, might be easier to work on some of the bugs to start out with.
I don't think I've had these issues with THOMAS yet, but I'll see what I can do. My web scrapers for THOMAS are actually legacy code I wrote a couple of years ago, before Capitol Words was around, to grab the Congressional Record and parse it to identify speakers. That code is written in Perl, though I've been using Python for the last couple of years to do NLP work, so most of what I write now is in Python.
from congress.
@notthatbreezy Agreed. In the meanwhile I'll keep using my Perl scripts, and hopefully some day I'll figure out how to use the bill text XML in a useful way.
from congress.
I did a first pass in cfaafd3. This adds a new task called fdsys, which has two parts. The first updates a local cache of the entire FDSys sitemap, which has value beyond bill text. The second part creates a text-versions.json file next to each bill's data.json, which looks like this:
{
    "ih": {
        "lastmod": "2013-01-09T05:54:00.347Z",
        "url": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr30ih/content-detail.html"
    }
}
I'll give this a second pass and extend this dict soonish. (Eric, feel free to jump in if you have particular logic you want to add, but I'll get to it too.)
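One way a sitemap could be turned into that dict is sketched below. This is not the code from the commit: the function name, the regular expression, and the assumption that package names follow a `BILLS-<congress><type><number><version>` shape are all illustrative, and the sample sitemap is a hand-written fragment built from the entry shown above.

```python
import re
import xml.etree.ElementTree as ET

# Standard sitemap protocol namespace.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Assumed package naming: congress, bill type, bill number, then a version
# code that may carry a trailing integer.
PACKAGE_RE = re.compile(
    r"/pkg/BILLS-(\d+)([a-z]+)(\d+)([a-z]+\d*)/content-detail\.html$"
)


def text_versions(sitemap_xml):
    """Group sitemap entries by bill, keyed by version code."""
    bills = {}
    root = ET.fromstring(sitemap_xml)
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        lastmod = url.findtext("sm:lastmod", namespaces=NS)
        match = PACKAGE_RE.search(loc or "")
        if not match:
            continue  # not a bill text package
        congress, bill_type, number, code = match.groups()
        bill_id = f"{bill_type}{number}-{congress}"
        bills.setdefault(bill_id, {})[code] = {"lastmod": lastmod, "url": loc}
    return bills


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.gpo.gov/fdsys/pkg/BILLS-113hr30ih/content-detail.html</loc>
    <lastmod>2013-01-09T05:54:00.347Z</lastmod>
  </url>
</urlset>"""

print(text_versions(SAMPLE))
```

Each inner dict would then be written out next to the matching bill's data.json.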
from congress.
Oh, awesome. Yeah, I think I'll jump in after I grab some lunch, just to split this out a bit. I think there should be a bill_versions.py task that makes use of a fdsys.py file full of generic FDSys goodness, that other tasks can use. It will probably operate as a peer to bills.py, since it uses its own method of iteration, and if one were running regular syncs of this data, you might want to do different sync intervals to THOMAS and to GPO.
Unless you have strong feelings, I'll also rename text-versions.json to versions.json, just because "version" implies more than a text change. There isn't always a text change between versions; depending on the version code, a version has different legal significance independent of the text, the bill's XML structure and style will differ, and so on. The best is when GPO includes CSS in their XML to make engrossed bills' background look like 1700s-style parchment.
I'll also import this mapping of version codes to version names, which used to exist at GPO Access and I don't know if they ever reproduced it in FDSys anywhere after GPO Access closed down:
https://github.com/sunlightlabs/congress/blob/master/tasks/utils.rb#L258
from congress.
I don't see the point in either, really. I don't mind the versions code being split off, but I think it fits naturally with the other routines that use the same sitemap files and that are updated from the sitemap files. Unless there's a functional difference or something besides aesthetics, let's just leave it for now?
On the naming: "versions" seems ambiguous (what's being versioned? it could be change tracking of data.json), and "version" isn't a term used in Congress. I don't think it's understandable unless "text" is in there. (I'm also not entirely sure it makes sense. There aren't really multiple versions of a bill. It's the same bill throughout.)
from congress.
Oh and on status. There have been weird ad hoc codes like eas2 for a 2nd print at EAS status. Most recently hr2608-112. I think any status code can be followed by an integer... assuming there are any rules at all.
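If any status code really can carry a trailing integer, splitting the code before looking up its name would handle cases like eas2. The sketch below assumes exactly that rule. The function names are made up, the name table is a small illustrative subset rather than the full mapping, and the "(print N)" suffix is an invented display format.

```python
import re

# Partial, illustrative subset of version codes and names.
VERSION_NAMES = {
    "ih": "Introduced in House",
    "is": "Introduced in Senate",
    "eh": "Engrossed in House",
    "es": "Engrossed in Senate",
    "eas": "Engrossed Amendment Senate",
    "enr": "Enrolled Bill",
}

# Assumed rule: lowercase letters, optionally followed by a print number.
VERSION_CODE_RE = re.compile(r"^([a-z]+)(\d*)$")


def split_version_code(code):
    """Split 'eas2' into ('eas', 2); codes without a number give None."""
    match = VERSION_CODE_RE.match(code)
    if not match:
        raise ValueError(f"unrecognized version code: {code!r}")
    base, printing = match.groups()
    return base, (int(printing) if printing else None)


def version_name(code):
    """Look up a display name, falling back to the bare code if unknown."""
    base, printing = split_version_code(code)
    name = VERSION_NAMES.get(base, base)
    return name if printing is None else f"{name} (print {printing})"


print(split_version_code("eas2"))
print(version_name("ih"))
```

Falling back to the bare code for unknown entries means a new ad hoc code would not break a run, it would just show up unnamed.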
from congress.
Let me see if I can show you what I mean about a bill_versions.py - it preserves fdsys.py and has it do all the FDSys-specific stuff, while letting a bill_versions task do things specific to bills. It's more than aesthetics, because it will let us more easily make other tasks that source their information from FDSys - or even make fdsys.py its own standalone lib that other non-Congressional projects can use.
I don't feel super strongly about text-versions vs versions, since it really is a pretty murky relationship between the text and the version (and this is just aesthetics), but I do view what is being versioned as including, but not being limited to, the raw text.
That's real good to know about the ad hoc codes, I'll try to work that in.
On Sun, Jan 20, 2013 at 4:03 PM, Joshua Tauberer wrote:
> Oh and on status. There have been weird ad hoc codes like eas2 for a 2nd print at EAS status. Most recently hr2608-112. I think any status code can be followed by an integer... assuming there are any rules at all.
Developer | sunlightfoundation.com
from congress.
Just FYI - I am mid-refactor, working on an fdsys branch. It doesn't actually disturb much of the code you wrote, it just layers a bit on top - for example, I just merged your recent commit without any real trouble.
from congress.
Also, one of the things this refactor will let us do is make use of the process_set utils function you carved out that expects response status codes from a task for standard logging, which is helpful. It makes FDSys an implementation detail, rather than a path of its own - while preserving fdsys.py as a general purpose task and lib for whatever you or anyone else wants to do with it.
from congress.
OK, it's done in 5082e62, 75bd99e, and b249015. All the fdsys task commands that you added still work, and I make use of some of the code in it to do bill_versions.py. bill_versions.py works a lot like bills.py, and can be filtered by a congress, a bill, or a specific version of a bill.
Because of how our set processing works, it actually was a lot easier for me to deposit a file per version, instead of a file per bill. So the version info is in the bill's data dir, at text-versions/[version_code].json. Someone who was reading those versions in would be able to sort them in order by their issued_on date (which is the only reasonable way to order them anyhow - that's how I'd have sorted them if I'd saved them as an array).
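Reading the per-version files back in date order might look like this sketch. It assumes each file under text-versions/ is a JSON object with an `issued_on` field holding an ISO-style date string, so plain string sorting gives chronological order; the `load_versions` name is hypothetical.

```python
import glob
import json
import os


def load_versions(bill_dir):
    """Load every text-versions/*.json file for a bill, oldest first."""
    pattern = os.path.join(bill_dir, "text-versions", "*.json")
    versions = []
    for path in glob.glob(pattern):
        with open(path) as f:
            versions.append(json.load(f))
    # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
    return sorted(versions, key=lambda version: version["issued_on"])
```

A reader that wanted the single-array form could simply dump the returned list as JSON.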
It doesn't have a --fast mode or equivalent yet -- I'd like to add a --since flag to limit it just to bills with their lastmod in the last X days, 7 days by default. I'll get to this as I work to integrate this data into my own system.
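The filtering behind such a --since flag could be sketched as below. The flag does not exist yet, so the function names and the input shape (a dict of version code to an entry with a `lastmod` string) are assumptions, as is the expectation that every lastmod includes fractional seconds and a trailing Z.

```python
from datetime import datetime, timedelta, timezone


def parse_lastmod(value):
    """Parse a sitemap lastmod such as 2013-01-09T05:54:00.347Z as UTC."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")
    return parsed.replace(tzinfo=timezone.utc)


def modified_since(entries, days=7, now=None):
    """Keep only entries whose lastmod falls within the last `days` days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=days)
    return {
        code: entry
        for code, entry in entries.items()
        if parse_lastmod(entry["lastmod"]) >= cutoff
    }
```

Passing `now` explicitly makes the cutoff reproducible, which is handy when checking a sync run after the fact.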
I also didn't add a --store flag, though I could - I was anticipating downloading the actual bill text files separately, but maybe it makes more sense to include that here. I guess in that case I'd probably change it from text-versions/[version_code].json to text-versions/[version_code]/data.json and put the documents in there too - very similar to how your generic FDSys mirror-er does it.
from congress.
Cool.
from congress.
I can already tell I'm going to want to re-use some of your fdsys.py work in other contexts, like downloading Congressional reports and possibly even court opinions. Probably not worth jumping the gun and breaking it out into its own library yet, but I could see a unitedstates/fdsys someday.
from congress.