
Comments (8)

maelle commented on August 23, 2024

@leeper new terrible example, http://photos.state.gov/libraries/india/231771/PDFs/jan-dec_2015.pdf (the csv here being incomplete). It's US data, 187 pages, I'll report tomorrow once I've scraped it. Have I already said your pkg is awesome? 😁

from tabulapdf.

psychemedia commented on August 23, 2024

The area argument is available. For example:

extract_tables("Lap Analysis.pdf", pages = 8, guess = FALSE, area = list(c(178, 10, 550, 50)))

The area parameter appears to take co-ordinates in the form: top, left, width, height.

You can find the necessary co-ordinates using the tabula app: if you select an area and preview the data, the selected co-ordinates are viewable in the browser developer tools console area.

However, the tabula app console output gives co-ordinates in the form top, left, bottom, right, so you need to do some sums to convert these numbers into the values that the tabulizer area parameter expects.
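
The "sums" can be sketched in a few lines of R. This is only an illustration that takes the description above at face value (area as top, left, width, height); `tabula_to_tlwh` is a hypothetical helper, not part of tabulizer, and the coordinates are made up. The maintainer's reply below calls this ordering a bug, so the conversion would only apply to versions from before that fix.

```r
# Hypothetical helper: convert a Tabula-style selection
# (top, left, bottom, right) into (top, left, width, height).
tabula_to_tlwh <- function(top, left, bottom, right) {
  # Guard against swapped edges, which would give negative sizes.
  stopifnot(bottom >= top, right >= left)
  c(top = top, left = left, width = right - left, height = bottom - top)
}

# Made-up coordinates, for illustration only.
area <- tabula_to_tlwh(top = 100, left = 20, bottom = 300, right = 500)
print(area)
```

The resulting vector could then be wrapped in `list()` and passed as the area argument, as in the call shown earlier.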

/via


leeper commented on August 23, 2024

@psychemedia The area specification is a bug in my code. I'm pushing a fix for it right now. It should be top, left, bottom, right, just like in Tabula.


maelle commented on August 23, 2024

I have used the tabulizer package here https://github.com/masalmon/usaqmindia/blob/master/inst/pm25_consulate.R but it's a pretty boring example.


leeper commented on August 23, 2024

This IRS document might work well as an example: https://www.irs.gov/pub/irs-soi/14databk.pdf

> extract_areas(tmp, pages = c(14, 15, 17, 18), method = "data.frame")
> str(.Last.value)
List of 6
 $ :'data.frame':       54 obs. of  8 variables:
  ..$ X   : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
  ..$ X.1.: chr [1:54] "239,874,741 " "3,074,293 " "584,480 " "4,485,975 " ...
  ..$ X.2.: chr [1:54] "2,220,921 " "17,613 " "3,362 " "33,844 " ...
  ..$ X.3.: chr [1:54] "4,642,817 " "50,438 " "9,160 " "83,945 " ...
  ..$ X.4.: chr [1:54] "3,799,428 " "45,905 " "7,383 " "84,956 " ...
  ..$ X.5.: chr [1:54] "147,444,789 " "2,048,463 " "357,733 " "2,805,861 " ...
  ..$ X.6.: chr [1:54] "23,608,340 " "252,431 " "47,482 " "430,138 " ...
  ..$ X.7.: chr [1:54] "3,205,595" "29,602" "4,178" "49,609" ...
 $ :'data.frame':       54 obs. of  8 variables:
  ..$ X    : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
  ..$ X.8. : chr [1:54] "617,649 " "5,365 " "1,067 " "7,563 " ...
  ..$ X.9. : chr [1:54] "30,065,749 " "353,564 " "79,939 " "508,257 " ...
  ..$ X.10.: chr [1:54] "34,132 " "255 " "38 " "410 " ...
  ..$ X.11.: chr [1:54] "334,641 " "3,163 " "567 " "4,626 " ...
  ..$ X.12.: chr [1:54] "987,238 " "15,016 " "3,433 " "9,225 " ...
  ..$ X.13.: chr [1:54] "1,467,402 " "16,792 " "4,682 " "19,344 " ...
  ..$ X.14.: chr [1:54] "21,446,040" "235,686" "65,456" "448,197" ...
 $ :'data.frame':       54 obs. of  7 variables:
  ..$ X   : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
  ..$ X.1.: chr [1:54] "157,187,971 " "2,122,412 " "371,057 " "2,939,657 " ...
  ..$ X.2.: chr [1:54] "1,173,505 " "10,456 " "1,524 " "12,059 " ...
  ..$ X.3.: chr [1:54] "3,439,645 " "40,500 " "6,851 " "50,573 " ...
  ..$ X.4.: chr [1:54] "2,813,102 " "36,809 " "5,205 " "49,203 " ...
  ..$ X.5.: chr [1:54] "124,585,594 " "1,785,868 " "301,830 " "2,339,074 " ...
  ..$ X.6.: chr [1:54] "47,309,667" "612,321" "151,349" "977,840" ...
 $ :'data.frame':       54 obs. of  8 variables:
  ..$ X    : chr [1:54] "United States, total " "Alabama " "Alaska " "Arizona " ...
  ..$ X.7. : chr [1:54] "3,261,248 " "39,515 " "6,909 " "64,940 " ...
  ..$ X.8. : chr [1:54] "77,275,927 " "1,173,547 " "150,481 " "1,361,234 " ...
  ..$ X.9. : chr [1:54] "2,334,249 " "21,674 " "2,840 " "35,140 " ...
  ..$ X.10.: chr [1:54] "9,615,578 " "66,424 " "11,088 " "186,577 " ...
  ..$ X.11.: chr [1:54] "253,158 " "4,431 " "258 " "2,748 " ...
  ..$ X.12.: chr [1:54] "837,997 " "11,547 " "2,966 " "11,454 " ...
  ..$ X.13.: chr [1:54] "12,135,143" "144,703" "38,495" "252,829" ...
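
The columns come back as character vectors with thousands separators and trailing spaces. Below is a minimal sketch of one way to turn them into numbers, assuming every non-label column holds counts formatted like the ones above. `parse_count` is a hypothetical helper, not part of tabulizer, and the small data frame only reuses a few values from the output above.

```r
# A few rows copied from the str() output above.
tab <- data.frame(
  X    = c("United States, total ", "Alabama ", "Alaska ", "Arizona "),
  X.1. = c("239,874,741 ", "3,074,293 ", "584,480 ", "4,485,975 "),
  X.2. = c("2,220,921 ", "17,613 ", "3,362 ", "33,844 "),
  stringsAsFactors = FALSE
)

# Hypothetical helper: trim whitespace, drop commas, convert to numeric.
parse_count <- function(x) as.numeric(gsub(",", "", trimws(x), fixed = TRUE))

# Clean the label column and convert every other column.
tab$X <- trimws(tab$X)
tab[-1] <- lapply(tab[-1], parse_count)
str(tab)
```

Cells that are not plain counts (footnote markers, dashes and so on) would become NA with a warning, so it is worth checking for those after converting.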


maelle commented on August 23, 2024

I like that it's called data book, hehe.

BTW do you think there would be a way to automatically recognize all tables in a pdf?


leeper commented on August 23, 2024

@masalmon The default behavior of extract_tables() should do this, as long as guess = TRUE.


maelle commented on August 23, 2024

Ah cool -- sorry I had missed that.

