Comments (6)
I don't think there is any good, dependable OCR software that can do this task, and a dedicated team just for this is not feasible.
I think crowdsourcing is the best way forward if we are expecting more PDF documents like these. In the long run, one could also decouple data collection and curation (tasks that can be crowdsourced to non-mathematicians and non-coders) from the modeling and analysis work. To help the crowd, we could migrate to a less geeky, Google Docs-style alternative.
Just giving it a try: I collated only the table pages from the Guinea dataset PDF and created a Google spreadsheet with the pivot column/row information (from 26th Aug onwards, the format is the same). It is accessible at bit.ly/ebola_guinea. I have also added the 16th Sept and 1st Oct .csv information. A few moderators could proofread and 'freeze' cells that are confirmed (revision histories help too).
We've seen it on Wiki. We've seen it on reddit. Can we expect the Internet to do its magic here again?
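One way the proofread-and-freeze step could work is by agreement between independent transcriptions. The snippet below is only a minimal sketch of that idea, not how the spreadsheet actually does it: `freeze_cells`, the `min_votes` threshold, and the cell ids and values in the usage example are all hypothetical.

```python
from collections import Counter


def freeze_cells(entries, min_votes=2):
    """Return {cell_id: value} for cells that volunteers agree on.

    entries: iterable of (cell_id, value) pairs, one per volunteer entry.
    A cell is frozen when its most common value has at least min_votes
    entries and is not tied with another value for first place.
    (Hypothetical rule, chosen for illustration.)
    """
    by_cell = {}
    for cell_id, value in entries:
        by_cell.setdefault(cell_id, Counter())[value.strip()] += 1

    frozen = {}
    for cell_id, counts in by_cell.items():
        ranked = counts.most_common(2)
        top_value, top_votes = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_votes
        if top_votes >= min_votes and not tied:
            frozen[cell_id] = top_value
    return frozen


# Made-up entries: three volunteers typed cell B2, one typed cell C2.
entries = [("B2", "12"), ("B2", "12 "), ("B2", "13"), ("C2", "7")]
print(freeze_cells(entries))
```

Cells that do not reach the threshold stay editable, so moderators only need to look at the disagreements.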
from ebola.
@cmrivers Have you looked into Tabula (http://tabula.nerdpower.org/)? If your PDFs aren't scanned images, it may be able to help you parse the data into tables faster. I'm familiarizing myself with your process -- I am a journalist in NY and am hoping to build some visualizations and an open API for this data. Do you have a list of your data sources? I may build a scraper to fetch new data every 15 minutes to power the API for this.
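A scraper like that could be as simple as a polling loop that hashes each source and reports the ones whose content changed. This is a rough sketch using only the Python standard library; the function names, the `on_change` callback, and the idea of detecting changes by SHA-256 digest are assumptions for illustration, and the real source URLs would come from the repo README.

```python
import hashlib
import time
import urllib.request

POLL_SECONDS = 15 * 60  # the 15-minute interval mentioned above


def fetch(url):
    """Download the raw bytes at url."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def check_once(url, last_digest, fetch_fn=fetch):
    """Fetch url and return (digest, changed) compared with last_digest."""
    digest = hashlib.sha256(fetch_fn(url)).hexdigest()
    return digest, digest != last_digest


def poll(urls, on_change, fetch_fn=fetch, sleep_fn=time.sleep, max_rounds=None):
    """Check every url each round and call on_change(url) for new content.

    The first successful fetch of a url counts as a change.
    max_rounds=None polls forever.
    """
    digests = {}
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        for url in urls:
            try:
                digest, changed = check_once(url, digests.get(url), fetch_fn)
            except OSError:
                continue  # network error: try this url again next round
            digests[url] = digest
            if changed:
                on_change(url)
        rounds += 1
        if max_rounds is None or rounds < max_rounds:
            sleep_fn(POLL_SECONDS)
```

Hashing the whole response only tells you that something changed, not what; image-only PDFs would still need manual digitization after a change is detected.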
Yes, I use Tabula for most of my digitization efforts. Some of the Guinea data
are images, though, not data-embedded PDFs, and the tables are irregular from
day to day. Data sources are linked in the top-level README.
On Tue, Oct 14, 2014 at 5:45 PM, TC McCarthy wrote:
Have you looked into Tabula (http://tabula.nerdpower.org/)? I'm
familiarizing myself with your process -- I am a journalist in NY and am
hoping to build some visualizations and an open API for this data. Do you
have a list of your data sources -- I may build a scraper to fetch new data
every 15 minutes to power the API for this.
Cool, thanks. I was clicking through those -- just wanted to make sure that list was exhaustive. Ugh government data makes me nuts lol -- no consistency. Thanks again!
Agree completely. I should add that efforts to build an API are underway.
You can email me if you need more details.
On Tue, Oct 14, 2014 at 5:49 PM, TC McCarthy wrote:
Cool, thanks. I was clicking through those -- just wanted to make sure
that list was exhaustive. Ugh government data makes me nuts lol -- no
consistency. Thanks again!
Make sure you first look at this file, which has tons of detailed Guinea sub-national records: https://data.hdx.rwlabs.org/dataset/rowca-ebola-cases#