
Comments (8)

modesty commented on August 28, 2024

The largest PDF files in the unit tests are under 8 pages; I have never tested it with a 'large' file. If performance is an issue, I'd recommend splitting the file into smaller ones before parsing, since smaller PDFs are well tested and perform well.

from pdf2json.

barneydunning commented on August 28, 2024

Hi there... thanks for the reply.

With so many downloads I am surprised no one else has hit this issue. The PDF files that we need to import are outwith our control, so we cannot lessen their size. They can be anything from one page to 1500 pages.

Are there any input options that cut down the amount of work this plugin does when preparing the data? The only information we require is the textual data along with its x and y coordinates.

Looking forward to your response.

Many thanks, Barney
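
For readers with the same need (text plus x and y coordinates), here is a minimal sketch of pulling just those fields out of already-parsed output. It assumes the parsed data has the shape `{ Pages: [{ Texts: [{ x, y, R: [{ T }] }] }] }` with `T` possibly URI-encoded; that shape is an assumption to check against the pdf2json version in use, and the `sample` object is hand-written, not real parser output. Note that this only trims the result after parsing; it does not reduce the work the parser itself does.

```javascript
// Decode a text run; fall back to the raw string if it is not valid URI encoding.
function decodeRun(text) {
  try {
    return decodeURIComponent(text);
  } catch (err) {
    return text;
  }
}

// Flatten parsed output into { page, x, y, text } records.
// The Pages/Texts/R/T field names are assumed, not confirmed by this thread.
function extractTextPositions(pdfData) {
  const records = [];
  (pdfData.Pages || []).forEach((page, pageIndex) => {
    (page.Texts || []).forEach((item) => {
      const text = (item.R || []).map((run) => decodeRun(run.T)).join("");
      records.push({ page: pageIndex + 1, x: item.x, y: item.y, text });
    });
  });
  return records;
}

// Hand-written sample in the assumed shape (hypothetical data).
const sample = {
  Pages: [
    { Texts: [{ x: 1.5, y: 2.25, R: [{ T: "Hello%20world" }] }] },
    { Texts: [{ x: 3, y: 4, R: [{ T: "100%" }] }] },
  ],
};

const positions = extractTextPositions(sample);
console.log(positions);
```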


modesty commented on August 28, 2024

One option is to change the stream implementation from per-file to per-page, so that data starts to flow as soon as a single page is ready. That would improve responsiveness, but it won't reduce the total processing time for large PDFs.
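
The per-page idea can be illustrated with a generator that hands over each page as soon as it has been parsed. This is only a sketch of the general technique, not pdf2json's stream implementation: `parsePage`, `parsedPages`, and the `rawPages` data are hypothetical stand-ins.

```javascript
// Hypothetical stand-in for the expensive per-page parsing step.
function parsePage(rawPage) {
  parsePage.calls += 1;
  return { number: rawPage.number, texts: rawPage.items.map((s) => s.trim()) };
}
parsePage.calls = 0;

// Lazily yields one parsed page at a time instead of returning the whole file.
function* parsedPages(rawPages) {
  for (const rawPage of rawPages) {
    yield parsePage(rawPage);
  }
}

// Hypothetical input: three unparsed pages.
const rawPages = [
  { number: 1, items: [" a ", "b"] },
  { number: 2, items: ["c"] },
  { number: 3, items: ["d"] },
];

const pages = parsedPages(rawPages);

// The consumer receives page 1 before pages 2 and 3 have been parsed.
const first = pages.next().value;
const callsAfterFirstPage = parsePage.calls;

// Draining the generator parses the remaining pages.
const rest = [...pages];
```

As the comment above says, the total amount of parsing is unchanged; only the time to the first result improves.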


barneydunning commented on August 28, 2024

Yep that's a shame. I take it there is no way of speeding up the process by limiting what it ends up outputting? So for example, asking it to only do specific types of work when loading the PDF document.

What would be the cause of the slowness... is it string manipulation or something similar in the inner workings of the module?


kishorsharma commented on August 28, 2024

We can use child processes to process pages in parallel. This would not only improve responsiveness but also reduce the total time for such large files. I would love to contribute and create a PR for it if you agree.
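
A rough sketch of how that could look in Node.js: split the page numbers into contiguous ranges and give each range to its own child process via `child_process.fork`. Everything beyond the Node.js API is an assumption here: the worker script is hypothetical, and this thread does not establish that pdf2json can parse only a given page range, so the worker would have to supply that ability itself.

```javascript
const { fork } = require("child_process");
const os = require("os");

// Split pages 1..totalPages into at most `workers` contiguous, near-equal ranges.
function splitPageRanges(totalPages, workers) {
  if (totalPages <= 0) return [];
  const count = Math.max(1, Math.min(workers, totalPages));
  const base = Math.floor(totalPages / count);
  const extra = totalPages % count;
  const ranges = [];
  let start = 1;
  for (let i = 0; i < count; i++) {
    const size = base + (i < extra ? 1 : 0);
    ranges.push({ start, end: start + size - 1 });
    start += size;
  }
  return ranges;
}

// Hypothetical: `workerScript` parses only pages [start, end] of `pdfPath`
// and reports its result with process.send(). It is not part of pdf2json.
function parseInParallel(pdfPath, totalPages, workerScript, workers = os.cpus().length) {
  const jobs = splitPageRanges(totalPages, workers).map(
    (range) =>
      new Promise((resolve, reject) => {
        const args = [pdfPath, String(range.start), String(range.end)];
        const child = fork(workerScript, args);
        let result = null;
        child.on("message", (message) => {
          result = message;
        });
        child.on("error", reject);
        child.on("exit", (code) => {
          if (code === 0) resolve({ range, result });
          else reject(new Error(`worker exited with code ${code}`));
        });
      })
  );
  return Promise.all(jobs);
}

console.log(splitPageRanges(10, 3));
```

Each child has its own startup cost and must load the file itself, so the actual speed-up would need to be measured rather than assumed.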


AshishGogna commented on August 28, 2024

I don't seem to have this issue.
I have tried parsing an 11 MB PDF, and the dataReady callback fires in under a minute.

I am running the Node application on my MacBook Pro (i5, 8 GB).

Here's the PDF that I tested: https://drive.google.com/file/d/0BzR-ZOIycHumX3hsbTVWbFMyQlU/view?usp=sharing


barneydunning commented on August 28, 2024

Sorry for the delay... damn holidays huh?! Well I am back now, so here goes...

Although the PDFs I am using are only ~4 MB, each page (~1,300 pages) has a grid of tabular data (about 8x8), and some "cells" can contain up to 6 text items, placed vertically. So it might not be about the size of the PDF, but rather its contents and their structure.

kishorsharma - if you could look into speeding this up using child processes, then I would be happy to test your code. Any advance on 10 minutes would be a big bonus!

Please let me know your thoughts.
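
The figures quoted in this comment give a rough upper bound on the number of text items the parser has to emit. This is a back-of-the-envelope estimate only, assuming every cell on every page holds the maximum of 6 items; the real count is presumably lower.

```javascript
// Figures taken from the comment above; all approximate.
const pageCount = 1300;
const cellsPerPage = 8 * 8;
const maxItemsPerCell = 6;

// Upper bound: every cell on every page is full.
const maxTextItems = pageCount * cellsPerPage * maxItemsPerCell;
console.log(maxTextItems); // 499200
```

So a ~4 MB file could still yield up to roughly half a million positioned text items, which fits the suspicion that structure matters more than file size.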


wanghaisheng commented on August 28, 2024

Any update?
