Comments (6)
Hi. My (GovTrack) Congressional Record (CR) scraper hasn't worked properly in a long time.
@konklone can probably point you to Sunlight's Capitol Words parser.
from congress.
Last I knew, @drinks was working on detethering the Capitol Words parser from Solr and making it work as a standalone thing so we could offer bulk data, but I'm not sure how far he's gotten. You might try opening a ticket there, or perhaps my mentioning his name will summon him like a ghost.
But of course. That work is long finished, at least to the degree that I'd intended for Capitol Words' needs, though the data hasn't been re-parsed and uploaded yet. The endgame artifacts will be raw text, some flavor of XML (lacking bioguide resolution), and finally JSON that's ready to be ingested by Elasticsearch. It may make sense, though, to provide a more space-thrifty version of the final output documents in JSON or YAML; ours will have a bunch of NLP cruft. What data points would you be interested in?
Thank you for your responses.
What I like about GovTrack data is that each speaking section has the 'speaker' attribute (e.g., here), and there is a separate people.xml file with a lot of attributes, such as party, role, state, district, etc., which help uniquely identify legislators. That is, one can construct a mapping between speakers and their speeches. In this respect GovTrack is great. However, not all documents from the THOMAS system have been parsed.
Does this kind of parsing require a lot of manual coding?
GovTrack's people.xml has been transformed and expanded into this project:
https://github.com/unitedstates/congress-legislators
Also, take a look at Capitol Words' API and see if it can't do what you need. It provides speaker IDs (bioguide IDs) which you can link to the congress-legislators project I linked to above.
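To make the linking step concrete, here is a minimal sketch of joining speeches to legislators by bioguide ID. Everything in it is an assumption for illustration: the sample records, the `bioguide_id` and `text` keys on speeches, and the helper names are hypothetical, and the legislator records only loosely follow the nested `id` / `name` / `terms` shape used in the congress-legislators files. Check the real field names in both sources before relying on it.

```python
from collections import defaultdict

# Hypothetical legislator records, loosely shaped like entries in the
# congress-legislators data (an "id" block holding a "bioguide" key,
# a "name" block, and a list of "terms"). The IDs are made up.
legislators = [
    {"id": {"bioguide": "X000001"},
     "name": {"first": "Jane", "last": "Doe"},
     "terms": [{"type": "rep", "state": "OH", "district": 3, "party": "Democrat"}]},
    {"id": {"bioguide": "Y000002"},
     "name": {"first": "John", "last": "Roe"},
     "terms": [{"type": "sen", "state": "TX", "party": "Republican"}]},
]

# Hypothetical speech records; "bioguide_id" and "text" are assumed key names.
speeches = [
    {"bioguide_id": "X000001", "text": "Madam Speaker, I rise today..."},
    {"bioguide_id": "X000001", "text": "I yield back."},
    {"bioguide_id": "Y000002", "text": "Mr. President, I ask unanimous consent..."},
    {"bioguide_id": "Z999999", "text": "Speaker not present in the legislator data."},
]


def index_by_bioguide(people):
    """Build a lookup table from bioguide ID to legislator record."""
    return {person["id"]["bioguide"]: person for person in people}


def group_speeches(speech_list, index):
    """Group speeches by speaker; collect speeches whose ID is not in the index."""
    grouped = defaultdict(list)
    unmatched = []
    for speech in speech_list:
        bioguide_id = speech.get("bioguide_id")
        if bioguide_id in index:
            grouped[bioguide_id].append(speech)
        else:
            unmatched.append(speech)
    return dict(grouped), unmatched


def describe(person):
    """Format a legislator using the party and state of their last listed term."""
    term = person["terms"][-1]
    name = person["name"]
    return f'{name["first"]} {name["last"]} ({term["party"]}, {term["state"]})'


if __name__ == "__main__":
    index = index_by_bioguide(legislators)
    grouped, unmatched = group_speeches(speeches, index)
    for bioguide_id, items in grouped.items():
        print(describe(index[bioguide_id]), "-", len(items), "speech(es)")
    print("Unmatched speeches:", len(unmatched))
```

Keeping unmatched speeches in a separate list, rather than dropping them, makes it easy to see how many speakers could not be resolved to a legislator record.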
OK, thanks for the discussion.