table-information-extraction's Introduction

Table Information Extraction Based on PDFMiner

Our goal is to extract tabular data from PDF files in the field of biochar. Many papers in this field report their data in tables, and extracting that data by hand is time-consuming. In addition, unlike in some medical fields, most tables in the bioenergy field are presented as three-line tables, so a general approach to parsing and extracting them is feasible.

Our preparatory work is based on PDFMiner, which offers various capabilities for working with PDF files. We use only one of them: the ability to convert a PDF into HTML.
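
The conversion step can be sketched as follows. This is a minimal sketch, assuming the pdfminer.six distribution of PDFMiner is installed; the function names and the folder layout are hypothetical and not taken from this project's code.

```python
from pathlib import Path


def html_path_for(pdf_path, out_dir):
    """Return the output HTML path for a PDF (same stem, .html suffix)."""
    return Path(out_dir) / (Path(pdf_path).stem + ".html")


def convert_pdf_to_html(pdf_path, out_dir):
    """Convert one PDF to HTML with pdfminer.six and return the output path."""
    # Imported lazily so html_path_for works even without pdfminer.six installed.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    out_path = html_path_for(pdf_path, out_dir)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(pdf_path, "rb") as src, open(out_path, "wb") as dst:
        # output_type="html" keeps the position of each text box in its style attribute.
        extract_text_to_fp(src, dst, output_type="html",
                           laparams=LAParams(), codec="utf-8")
    return out_path


def convert_folder(pdf_dir, out_dir):
    """Convert every *.pdf in pdf_dir (assumed folder layout) in bulk."""
    return [convert_pdf_to_html(p, out_dir)
            for p in sorted(Path(pdf_dir).glob("*.pdf"))]
```

Calling `convert_folder("papers", "html")` would write one HTML file per PDF into the `html` folder; both folder names are placeholders.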

Workflow

  • Obtain the list of related papers from the DOCX file (2326587T_LeiyuTian_ENG5059P_FinalReport_2018.docx).
  • Use a web crawler or manual retrieval to collect the corresponding papers in batches.
  • Use PDFMiner to convert them to HTML files in bulk.
  • Recognize the table regions in the HTML and extract them separately (as shown in the figure below).
  • Apply our rules and method to obtain the table information.
  • Export the extracted information to an Excel file.

Our work mainly focuses on the last two items.
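
The last two items can be illustrated roughly as below. This is only a sketch under stated assumptions, not the project's actual rules: it assumes each table cell arrives as an absolutely positioned `<div>` whose `style` carries `left:` and `top:` in pixels, it groups cells into rows by their `top` value, and it writes CSV (which Excel can open) so that only the standard library is needed. The class and function names and the sample HTML are made up for illustration.

```python
import csv
import io
import re
from html.parser import HTMLParser

# Matches "left:50px" / "top:100px" but not e.g. "border-left:1px".
_POS = re.compile(r"(?<![-\w])(left|top):\s*(\d+)px")


class PositionedTextParser(HTMLParser):
    """Collect (top, left, text) for every positioned <div>."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._pos = None
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            pos = dict(_POS.findall(dict(attrs).get("style") or ""))
            if "left" in pos and "top" in pos:
                self._pos = (int(pos["top"]), int(pos["left"]))
                self._buf = []

    def handle_data(self, data):
        if self._pos is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "div" and self._pos is not None:
            text = " ".join("".join(self._buf).split())
            if text:
                self.cells.append((*self._pos, text))
            self._pos = None


def group_rows(cells, tolerance=3):
    """Group cells whose top differs by at most `tolerance` px into one row."""
    rows = []
    for top, left, text in sorted(cells):
        if rows and abs(top - rows[-1]["top"]) <= tolerance:
            rows[-1]["items"].append((left, text))
        else:
            rows.append({"top": top, "items": [(left, text)]})
    # Within each row, order the cells from left to right.
    return [[text for _, text in sorted(row["items"])] for row in rows]


def rows_to_csv(rows):
    """Serialize rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


# Made-up sample input, for illustration only.
sample = (
    '<div style="position:absolute; left:50px; top:100px;"><span>Feedstock</span></div>'
    '<div style="position:absolute; left:200px; top:101px;"><span>pH</span></div>'
    '<div style="position:absolute; left:50px; top:120px;"><span>Wood</span></div>'
    '<div style="position:absolute; left:200px; top:120px;"><span>8.2</span></div>'
)
parser = PositionedTextParser()
parser.feed(sample)
table = group_rows(parser.cells)
print(rows_to_csv(table))
```

Real PDFMiner output is messier (a text box may hold several lines, and a table must first be located on the page), so the row tolerance and the cell model here would need tuning. Writing a native `.xlsx` file instead of CSV would require a third-party library.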

Comparison with Other Software

PDFTables

When converting a PDF to a spreadsheet, PDFTables uses an algorithm that examines the structures in the PDF. It uses the spacing between items to identify rows and columns, much as the human eye does when scanning a page. It is designed to work reliably with large quantities of data and to produce the best output from a given PDF.
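
The general idea of spacing-based column detection can be sketched as follows. This is not PDFTables' actual algorithm, which is not described here; it is a simplified illustration in which words on one text line are merged into the same column unless the horizontal gap between them exceeds an assumed threshold. The function name, the threshold, and the sample coordinates are hypothetical.

```python
def split_columns(words, min_gap=15):
    """Split one text line into columns.

    words: list of (x_start, x_end, text) tuples for a single line.
    A gap of at least `min_gap` units between two words starts a new column.
    """
    columns = []
    prev_end = None
    for x_start, x_end, text in sorted(words):
        if prev_end is not None and x_start - prev_end < min_gap:
            columns[-1] += " " + text   # small gap: same cell
        else:
            columns.append(text)        # large gap: new column
        prev_end = x_end
    return columns


# Made-up coordinates, for illustration only.
row = [(50, 80, "Pyrolysis"), (84, 130, "temperature"),
       (200, 215, "pH"), (260, 290, "Yield")]
print(split_columns(row))
```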

PDFtoXLS

PDFtoXLS performs several steps: quality conversion, table recovery, headers and footers, form recognition, rotated text recovery, hyperlink detection, merging of logical tables, list detection, and OCR. It is able to extract comparatively complete information from PDF files.

SimplyPDF

SimplyPDF is a free, easy-to-use online PDF-to-XLS converter that extracts tables trapped in PDF files without requiring any software installation. It is capable of extracting relatively complicated tables.

  • All the test results can be found in the paper file.

  • None of the software listed here is open source, and all except SimplyPDF require payment.

COPYRIGHT

Copyright 2019 Table-mining group
Contact: [email protected]

you can redistribute our method and/or modify
it under the terms of the MIT License as published 

our method is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
MIT License for more details.

You should have received a copy of the MIT license
along with it.  If not, see <https://github.com/text-mining-project/Table-information-extraction/blob/master/LICENSE>.

table-information-extraction's People

Contributors

changgang-zheng, lomogmy, luchiacc
