
Comments (6)

srinivvenkat commented on August 28, 2024

I don't think there is any good, dependable OCR software that can do this task, and a dedicated team just for this is not feasible.

I think crowdsourcing is the best way to handle it if we are expecting more such PDF documents. In the long run, one could also decouple data collection and curation (tasks that can be crowdsourced to people who are not mathematicians or coders) from the modeling and analysis work. To help the crowd, we could migrate to a less geeky alternative along the lines of Google Docs.

I'm giving it a try. I collated just the table pages from the Guinea dataset PDF and created a Google spreadsheet with the pivot column/row information (from 26th Aug onwards, the format is the same). It is accessible at bit.ly/ebola_guinea. I have also added the 16th Sept and 1st Oct .csv information. A few moderators could proofread and 'freeze' cells that are confirmed (revision histories help too).

We've seen it on Wiki. We've seen it on reddit. Can we expect the Internet to do its magic here again?
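
The proofread-and-freeze idea can be sketched in a few lines of Python. This is only an illustration, assuming two volunteers independently transcribe the same table page to CSV with row labels in the first column; the function name `confirmed_cells` and the sample values are made up for the example and are not taken from the spreadsheet.

```python
import csv
import io


def confirmed_cells(transcription_a, transcription_b):
    """Compare two CSV transcriptions of the same table.

    Returns (frozen, disputed):
      frozen   maps (row_label, column_label) -> value where both agree
      disputed maps (row_label, column_label) -> (value_a, value_b)
    """
    def to_cells(text):
        rows = [r for r in csv.reader(io.StringIO(text)) if r]  # skip blank lines
        header = rows[0][1:]
        cells = {}
        for row in rows[1:]:
            for column, value in zip(header, row[1:]):
                cells[(row[0].strip(), column.strip())] = value.strip()
        return cells

    a, b = to_cells(transcription_a), to_cells(transcription_b)
    frozen, disputed = {}, {}
    for key in sorted(a.keys() | b.keys()):
        value_a, value_b = a.get(key), b.get(key)
        if value_a == value_b:
            frozen[key] = value_a  # both volunteers agree: safe to freeze
        else:
            disputed[key] = (value_a, value_b)  # needs a moderator's eye
    return frozen, disputed


# Placeholder numbers, not real case counts.
volunteer_1 = "prefecture,confirmed,probable\nA,10,2\nB,5,1\n"
volunteer_2 = "prefecture,confirmed,probable\nA,10,2\nB,6,1\n"
frozen, disputed = confirmed_cells(volunteer_1, volunteer_2)
print(disputed)
```

Only the cells in `disputed` would then need a moderator to check against the PDF; everything in `frozen` could be locked.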

from ebola.

tc-mccarthy commented on August 28, 2024

@cmrivers Have you looked into Tabula (http://tabula.nerdpower.org/)? If your PDFs aren't scanned images, it may be able to help you parse the data into tables faster. I'm familiarizing myself with your process -- I am a journalist in NY and am hoping to build some visualizations and an open API for this data. Do you have a list of your data sources? I may build a scraper to fetch new data every 15 minutes to power the API for this.
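
A scraper along those lines might look roughly like the sketch below. It is a minimal illustration using only the Python standard library, not anyone's actual implementation: the function names, the change detection by content hash, and the example URL are all assumptions made for the example.

```python
import hashlib
import time
import urllib.request

POLL_SECONDS = 15 * 60  # the 15-minute interval mentioned above


def fetch(url):
    """Download the raw bytes at url."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def poll_once(url, last_digest, fetcher=fetch):
    """Fetch url once; return (digest, body), with body None if unchanged."""
    body = fetcher(url)
    digest = hashlib.sha256(body).hexdigest()
    if digest == last_digest:
        return digest, None
    return digest, body


def run(url, on_change, fetcher=fetch, sleep=time.sleep, max_polls=None):
    """Poll url forever (or max_polls times), calling on_change on new content."""
    digest = None
    polls = 0
    while max_polls is None or polls < max_polls:
        digest, body = poll_once(url, digest, fetcher)
        if body is not None:
            on_change(body)  # e.g. parse the file and refresh the API's store
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(POLL_SECONDS)


# Hypothetical usage; the URL is a placeholder, not a real data source:
# run("https://example.org/ebola/latest.csv", on_change=print)
```

Hashing the response means the downstream parsing only runs when a source file actually changes, which matters when most 15-minute polls return the same document.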


cmrivers commented on August 28, 2024

Yes, I use Tabula for most of my digitization efforts. Some of the Guinea
data are images, though, not data-embedded PDFs, and the tables are irregular
from day to day. Data sources are linked in the top-level README.
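
One way to cope with tables whose columns change from day to day is to map each day's headers onto a fixed set of variable names and store the result in long format. The sketch below only illustrates that idea; the header aliases are hypothetical (the actual wording in the reports is not given in this thread), and the values are placeholders.

```python
import csv
import io

# Hypothetical header aliases; extend as new spellings show up in the reports.
ALIASES = {
    "cas confirmes": "confirmed",
    "confirmed cases": "confirmed",
    "confirmed": "confirmed",
    "deces": "deaths",
    "deaths": "deaths",
}


def normalize(text, date):
    """Turn one day's wide table (CSV text) into long-format records."""
    rows = [r for r in csv.reader(io.StringIO(text)) if r]  # skip blank lines
    header = [h.strip().lower() for h in rows[0]]
    records = []
    for row in rows[1:]:
        place = row[0].strip()
        for name, value in zip(header[1:], row[1:]):
            variable = ALIASES.get(name)
            if variable is None or not value.strip():
                continue  # unknown column or empty cell: skip rather than guess
            records.append({"date": date, "place": place,
                            "variable": variable, "value": int(value)})
    return records


# Two days with different column order and wording (placeholder values).
day_1 = "Prefecture,Confirmed cases,Deaths\nA,3,1\n"
day_2 = "Prefecture,Deces,Notes,Cas confirmes\nA,2,x,4\n"
for record in normalize(day_1, "day 1") + normalize(day_2, "day 2"):
    print(record)
```

Columns that are not in the alias table are dropped silently here; a real pipeline would probably want to log them so new variables are not missed.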

On Tue, Oct 14, 2014 at 5:45 PM, TC McCarthy wrote:

Have you looked into Tabula (http://tabula.nerdpower.org/)? I'm
familiarizing myself with your process -- I am a journalist in NY and am
hoping to build some visualizations and an open API for this data. Do you
have a list of your data sources? I may build a scraper to fetch new data
every 15 minutes to power the API for this.



tc-mccarthy commented on August 28, 2024

Cool, thanks. I was clicking through those -- just wanted to make sure that list was exhaustive. Ugh government data makes me nuts lol -- no consistency. Thanks again!


cmrivers commented on August 28, 2024

Agree completely. I should add that efforts to build an API are underway.
You can email me if you need more details.

On Tue, Oct 14, 2014 at 5:49 PM, TC McCarthy wrote:

Cool, thanks. I was clicking through those -- just wanted to make sure
that list was exhaustive. Ugh government data makes me nuts lol -- no
consistency. Thanks again!



olarosling commented on August 28, 2024

Make sure you first look at this file, which has tons of detailed Guinea sub-national records: https://data.hdx.rwlabs.org/dataset/rowca-ebola-cases#
