Comments (20)

notthatbreezy avatar notthatbreezy commented on June 18, 2024

Is someone doing this? I'd be willing to do it, but don't want to duplicate someone else's effort.

from congress.

konklone avatar konklone commented on June 18, 2024

I was planning on tackling it in January, but not this year. If you'd like to jump on it, feel free. I've written a bulk downloader for bill text through GPO's sitemaps for Sunlight's Congress API already, in Ruby, and was going to port it to Python for this project.

I've also changed my thinking a bit since originally filing the ticket - instead of downloading the actual bill text (which is a large amount of data, and not particularly useful, since GPO has them in bulk at reliable URLs), the task should just fetch basic information for each version of each bill, along with links to bill text for each version.

So what I'd like to see for this project is a bill_versions.py task that iterates through GPO's sitemaps and produces a versions.json file for each bill (it'd sit alongside the data.json file for that bill, if it existed). That versions JSON would basically be an array of hashes, where each hash contains that version's version code and code name (I have a mapping for them), the date the version was issued, and then links to the bill's PDF, XML, and plaintext versions of its text - and also links to that version's landing pages on GPO, and its MODS and PREMIS metadata files.
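A minimal sketch of one record in the versions array described above. The key names, the URL layout under /fdsys/pkg/ (beyond the content-detail.html landing page), the three-entry name mapping, and the issue date are all assumptions for illustration, not the project's actual schema.

```python
import json

# Illustrative subset only; the full version-code mapping is much longer.
VERSION_NAMES = {
    "ih": "Introduced in House",
    "is": "Introduced in Senate",
    "enr": "Enrolled Bill",
}


def version_entry(package_id, version_code, issued_on):
    """Build one record of the proposed versions.json array.

    The key names and the per-format URL layout are assumptions.
    """
    base = "http://www.gpo.gov/fdsys/pkg/" + package_id
    return {
        "version_code": version_code,
        "version_name": VERSION_NAMES.get(version_code),
        "issued_on": issued_on,
        "urls": {
            "pdf": "%s/pdf/%s.pdf" % (base, package_id),
            "xml": "%s/xml/%s.xml" % (base, package_id),
            "text": "%s/html/%s.htm" % (base, package_id),
            "landing": base + "/content-detail.html",
            "mods": base + "/mods.xml",
            "premis": base + "/premis.xml",
        },
    }


if __name__ == "__main__":
    # The issue date here is a placeholder, not real data.
    versions = [version_entry("BILLS-113hr30ih", "ih", "2013-01-03")]
    print(json.dumps(versions, indent=2, sort_keys=True))
```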

Since I've done this all once before, have a specific idea of how it should be done, and would basically be porting a script over, I am very happy to do it. :) But you're welcome to start it too if you want, and I'll offer help however I can.

JoshData avatar JoshData commented on June 18, 2024

I need the actual PDF, MODS, etc., so a command-line flag to mirror those files locally would help me. (It should be smart and only download files if a hash has changed, or something.) Or I can always write my own mirroring script later, of course.
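One way the "only download if something changed" idea could work, as a rough sketch: keep a small manifest that maps each URL to the last fingerprint seen for it (a content hash or a lastmod timestamp), and fetch only when the fingerprint differs or the local file is missing. The function names, the manifest file, and the injected `fetch` callable are hypothetical.

```python
import json
import os


def load_manifest(path):
    """Return the saved URL -> fingerprint mapping, or an empty one."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}


def mirror(files, dest_dir, manifest_path, fetch):
    """Download only files whose fingerprint changed since the last run.

    files: iterable of (url, fingerprint, filename) tuples.
    fetch: callable taking a URL and returning the response body as bytes.
    Returns the list of URLs that were actually fetched.
    """
    manifest = load_manifest(manifest_path)
    downloaded = []
    for url, fingerprint, filename in files:
        target = os.path.join(dest_dir, filename)
        if manifest.get(url) == fingerprint and os.path.exists(target):
            continue  # unchanged and already on disk
        data = fetch(url)
        os.makedirs(os.path.dirname(target) or ".", exist_ok=True)
        with open(target, "wb") as f:
            f.write(data)
        manifest[url] = fingerprint
        downloaded.append(url)
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2, sort_keys=True)
    return downloaded
```

Passing `fetch` in as a parameter keeps the change-detection logic independent of whichever HTTP client ends up being used.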

There's also HTML from THOMAS, which is harder to scrape. I can post my old Perl code for that if anyone wants to tackle that.

konklone avatar konklone commented on June 18, 2024

Yeah, I'll need them too. I actually think that might be best implemented as another .py file, separate from bill_versions.py, whose sole goal is to use the data in any present versions.json files to download all the requested material to a local cache. That script would take in command line parameters dictating which kinds of URLs would be downloaded to disk, for instance, and maybe other useful parameters like rate limiting.
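A sketch of what that separate downloader's command-line handling might look like. The `--formats` and `--delay` flags are hypothetical, and each version record is assumed to carry a `version_code` plus a `urls` dict keyed by format; none of this is the project's actual interface.

```python
import argparse
import time


def parse_args(argv=None):
    """Parse the (hypothetical) flags for the download script."""
    parser = argparse.ArgumentParser(
        description="Download bill text files listed in versions JSON.")
    parser.add_argument("--formats", default="pdf,xml,text",
                        help="comma-separated kinds of URLs to download")
    parser.add_argument("--delay", type=float, default=1.0,
                        help="seconds to wait between requests")
    args = parser.parse_args(argv)
    args.formats = [f for f in args.formats.split(",") if f]
    return args


def select_urls(versions, formats):
    """Yield (version_code, format, url) for each requested format."""
    for version in versions:
        for fmt in formats:
            url = version.get("urls", {}).get(fmt)
            if url:
                yield version["version_code"], fmt, url


def download_all(versions, formats, delay, fetch, sleep=time.sleep):
    """Fetch every selected URL, pausing `delay` seconds between requests."""
    results = {}
    for i, (code, fmt, url) in enumerate(select_urls(versions, formats)):
        if i > 0:
            sleep(delay)  # crude rate limiting
        results[(code, fmt)] = fetch(url)
    return results
```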

Is the HTML version on THOMAS useful in any way that's distinct from the value you get from what GPO has? GPO's XML versions of bills come with display stylesheets that render them as usefully as THOMAS' HTML (and much more official-looking).

JoshData avatar JoshData commented on June 18, 2024

I got errors when I tried to run the XSLT stylesheet against the bill text XML. The HTML also has a simpler structure, which is good for doing comparisons and other analysis. (If only they gave us the raw GPO locator code files...)

notthatbreezy avatar notthatbreezy commented on June 18, 2024

Ok cool. I was just trying to look for a way to contribute - though I'm not sure I can get to it before January either (currently trying to get StateRepMe ready to launch the first week of next year).

I might leave this to you then, but if there's another task that I could help with or contribute to let me know. I haven't contributed to many open source projects (trying to change that now), but in the process of graduate school I have a lot of experience writing web scrapers for THOMAS and more generally.

konklone avatar konklone commented on June 18, 2024

@tauberer - Just checked out Congress.gov, apparently they have a plaintext view, their own PDF copy, and an XML version with CSS. They may just be mirroring GPO's data exactly, I don't know. But either way, since THOMAS is closing next year sometime, it may not be a good idea to build display stuff around its HTML structure.

konklone avatar konklone commented on June 18, 2024

@notthatbreezy Hope my information blast wasn't discouraging, you just got me thinking about things. :) Besides this task, there are also a couple of bugs that need addressing in summary parsing and in handling THOMAS' instability, that you may have already handled in your own scraper.

notthatbreezy avatar notthatbreezy commented on June 18, 2024

@konklone Ha, not at all - but you're right, might be easier to work on some of the bugs to start out with.

I don't think I've had these issues with THOMAS yet, but I'll see what I can do. My web scrapers for THOMAS are actually legacy code I wrote a couple of years ago, before Capitol Words was around, to grab the Congressional Record and parse it to identify speakers. That stuff is written in Perl, though I've been using Python for the last couple of years to do NLP stuff, so most of what I write now is in Python.

JoshData avatar JoshData commented on June 18, 2024

@notthatbreezy Agreed. In the meantime I'll keep using my Perl scripts, and hopefully some day I'll figure out how to use the bill text XML in a useful way.

JoshData avatar JoshData commented on June 18, 2024

I did a first pass in cfaafd3. This adds a new task called fdsys, which has two parts. The first updates a local cache of the entire FDSys sitemap, which has value beyond bill text. The second part creates text-versions.json files next to each bill's data.json, which look like this:

    {
      "ih": {
        "lastmod": "2013-01-09T05:54:00.347Z",
        "url": "http://www.gpo.gov/fdsys/pkg/BILLS-113hr30ih/content-detail.html"
      }
    }
I'll give this a second pass and extend this dict soonish. (Eric, feel free to jump in if you have particular logic you want to add, but I'll get to it too.)
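A rough sketch of how sitemap entries could be turned into that structure, assuming the sitemap files follow the standard sitemaps.org schema and that every bill URL contains a package name shaped like BILLS-113hr30ih. The function name and the regular expression are illustrative, not the code in cfaafd3.

```python
import re
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# congress, bill type, bill number, version code (optionally numbered)
PACKAGE_RE = re.compile(r"/BILLS-(\d+)([a-z]+)(\d+)([a-z]+\d*)/")


def text_versions_from_sitemap(xml_text):
    """Map bill id -> version code -> {lastmod, url} from sitemap XML."""
    bills = {}
    root = ET.fromstring(xml_text)
    for node in root.findall("sm:url", SITEMAP_NS):
        url = node.findtext("sm:loc", namespaces=SITEMAP_NS)
        lastmod = node.findtext("sm:lastmod", namespaces=SITEMAP_NS)
        match = PACKAGE_RE.search(url or "")
        if not match:
            continue  # not a bill text package
        congress, bill_type, number, version_code = match.groups()
        bill_id = "%s%s-%s" % (bill_type, number, congress)
        bills.setdefault(bill_id, {})[version_code] = {
            "lastmod": lastmod,
            "url": url,
        }
    return bills
```

Each inner dict would then be written out as the text-versions.json file for that bill.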

konklone avatar konklone commented on June 18, 2024

Oh, awesome. Yeah, I think I'll jump in after I grab some lunch, just to split this out a bit. I think there should be a bill_versions.py task that makes use of a fdsys.py file full of generic FDSys goodness, that other tasks can use. It will probably operate as a peer to bills.py, since it uses its own method of iteration, and if one were running regular syncs of this data, you might want to do different sync intervals to THOMAS and to GPO.

Unless you have strong feelings, I'll also rename text-versions.json to versions.json, just because "version" implies more than just a text change. There isn't always a text change between versions, and depending on the version code a version has different legal significance independent of the text, the bill XML structure/style will be different, etc. The best is when GPO includes CSS in their XML to make engrossed bills' background look like 1700s-style parchment.

I'll also import this mapping of version codes to version names, which used to exist at GPO Access and I don't know if they ever reproduced it in FDSys anywhere after GPO Access closed down:
https://github.com/sunlightlabs/congress/blob/master/tasks/utils.rb#L258

JoshData avatar JoshData commented on June 18, 2024

I don't see the point in either, really. I don't mind the versions code being split off, but I think it fits naturally with the other routines that use the same sitemap files and that are updated from the sitemap files. Unless there's a functional difference or something besides aesthetics, let's just leave it for now?

On the naming: "versions" seems ambiguous (what's being versioned? It could be change tracking of data.json), and "version" isn't a term used in Congress. I don't think it's understandable unless "text" is in there. (I'm also not entirely sure it makes sense. There aren't really multiple versions of a bill. It's the same bill throughout.)

JoshData avatar JoshData commented on June 18, 2024

Oh and on status. There have been weird ad hoc codes like eas2 for a 2nd print at EAS status. Most recently hr2608-112. I think any status code can be followed by an integer... assuming there are any rules at all.
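If any status code really can be followed by an integer, a parser could split the two parts like this. This is a sketch under that assumption; treating a missing number as print 1 is a guess, not a documented GPO rule.

```python
import re

VERSION_CODE_RE = re.compile(r"^([a-z]+)(\d*)$")


def split_version_code(code):
    """Split a code like 'eas2' into its base code and print number.

    A code with no trailing integer is treated as print 1 (an assumption).
    """
    match = VERSION_CODE_RE.match(code.lower())
    if not match:
        raise ValueError("unrecognized version code: %r" % code)
    base, number = match.groups()
    return base, int(number) if number else 1
```

With this, `split_version_code("eas2")` gives `("eas", 2)`, so the base code can still be looked up in a code-to-name mapping.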

konklone avatar konklone commented on June 18, 2024

Let me see if I can show you what I mean, about a bill_versions.py - it preserves fdsys.py and has it do all the FDSys-specific stuff, while letting a bill_versions task do things specific to bills. It's more than aesthetics, because it will let us more easily make other tasks that source their information from FDSys - or even to make fdsys.py its own standalone lib that other non-Congressional projects can use.

I don't feel super strongly about text-versions vs versions, since it really is a pretty murky relationship between the text and the version (and this is just aesthetics), but I do view what is being versioned as including, but not being limited to, the raw text.

That's real good to know about the ad hoc codes, I'll try to work that in.

On Sun, Jan 20, 2013 at 4:03 PM, Joshua Tauberer wrote:

> Oh and on status. There have been weird ad hoc codes like eas2 for a 2nd
> print at EAS status. Most recently hr2608-112. I think any status code can
> be followed by an integer... assuming there are any rules at all.


konklone avatar konklone commented on June 18, 2024

Just FYI - I am mid-refactor, working on an fdsys branch. It doesn't actually disturb much of the code you wrote, it just layers a bit on top - for example, I just merged your recent commit without any real trouble.

konklone avatar konklone commented on June 18, 2024

Also, one of the things this refactor will let us do is make use of the process_set utils function you carved out that expects response status codes from a task for standard logging, which is helpful. It makes FDSys an implementation detail, rather than a path of its own - while preserving fdsys.py as a general purpose task and lib for whatever you or anyone else wants to do with it.

konklone avatar konklone commented on June 18, 2024

OK, it's done in 5082e62, 75bd99e, and b249015. All the fdsys task commands that you added still work, and I make use of some of that code in bill_versions.py. bill_versions.py works a lot like bills.py, and can be filtered by a congress, a bill, or a specific version of a bill.

Because of how our set processing works, it actually was a lot easier for me to deposit a file per version, instead of a file per bill. So the version info is in the bill's data dir, at text-versions/[version_code].json. Someone who was reading those versions in would be able to sort them in order by their issued_on date (which is the only reasonable way to order them anyhow - that's how I'd have sorted them if I'd saved them as an array).
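For a consumer of these files, reading the per-version JSON back in and ordering it could look roughly like this. It assumes each file holds a JSON object with an `issued_on` field stored as an ISO date string (so plain string sorting matches date order); the function name is hypothetical.

```python
import glob
import json
import os


def load_versions(bill_dir):
    """Load text-versions/*.json for one bill, oldest version first."""
    versions = []
    pattern = os.path.join(bill_dir, "text-versions", "*.json")
    for path in glob.glob(pattern):
        with open(path) as f:
            versions.append(json.load(f))
    # ISO dates (YYYY-MM-DD) sort chronologically as plain strings.
    return sorted(versions, key=lambda version: version["issued_on"])
```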

It doesn't have a --fast mode or equivalent yet -- I'd like to add a --since flag to limit it just to bills with their lastmod in the last X days, 7 days by default. I'll get to this as I work to integrate this data into my own system.
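The filter behind such a --since flag might be as simple as the sketch below. It assumes lastmod values look like the sitemap timestamps seen earlier in the thread (for example 2013-01-09T05:54:00.347Z, in UTC); the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_lastmod(value):
    """Parse a timestamp like 2013-01-09T05:54:00.347Z as a naive UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


def modified_since(lastmod, days=7, now=None):
    """Return True if lastmod falls within the last `days` days."""
    if now is None:
        now = datetime.now(timezone.utc).replace(tzinfo=None)
    return parse_lastmod(lastmod) >= now - timedelta(days=days)
```

The `now` parameter exists only so the cutoff can be pinned to a fixed time when testing.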

I also didn't add a --store flag, though I could - I was anticipating downloading the actual bill text files separately, but maybe it makes more sense to include that here. I guess in that case I'd probably change it from text-versions/[version_code].json to text-versions/[version_code]/data.json and put the documents in there too - very similar to how your generic FDsys mirror-er does it.

JoshData avatar JoshData commented on June 18, 2024

Cool.

konklone avatar konklone commented on June 18, 2024

I can already tell I'm going to want to re-use some of your fdsys.py work in other contexts, like downloading Congressional reports and possibly even court opinions. Probably not worth jumping the gun and breaking it out into its own library yet, but I could see a unitedstates/fdsys someday.
