adrienjoly / npm-pdfreader

🚜 Parse text and tables from PDF files.

Home Page: https://www.npmjs.com/package/pdfreader

License: MIT License

JavaScript 5.45% HTML 93.82% Rich Text Format 0.73%

data-extraction pdf-converter parsing javascript tabular-data pdf-reader parse-tables rule-based-parsing

npm-pdfreader's Introduction

pdfreader

Read text and parse tables from PDF files.

Supports tabular data with automatic column detection, and rule-based parsing.

Dependencies: it is based on pdf2json, which itself relies on Mozilla's pdf.js.

🆕 Now includes TypeScript type definitions!

ℹ️ Important notes:

This module is meant to be run using Node.js only. It does not work from a web browser.
This module extracts text entries from PDF files. It does not support photographed text. If you cannot select text from the PDF file, you may need to use OCR software first.

Summary:

Installation, tests and CLI usage
Raw PDF reading (incl. examples)
Rule-based data extraction
Troubleshooting & FAQ

Installation, tests and CLI usage

After installing Node.js:

git clone https://github.com/adrienjoly/npm-pdfreader.git
cd npm-pdfreader
npm install
npm test
node parse.js test/sample.pdf

Installation into an existing project

To install pdfreader as a dependency of your Node.js project:

npm install pdfreader

Then, see below for examples of use.

Raw PDF reading

This module exposes the PdfReader class, to be instantiated. You can pass { debug: true } to the constructor to log debugging information, which is useful for troubleshooting.

Your instance has two methods for parsing a PDF. They return the same output and differ only in input: PdfReader.parseFileItems (as below) takes a filename, and PdfReader.parseBuffer (see "Raw PDF reading from a PDF buffer") takes data that you don't want to reference from the filesystem.

Whichever method you choose, it asks for a callback, which gets called each time the instance finds what it denotes as a PDF item.

An item object can match one of the following objects:

null, when the parsing is over, or an error occurred.
File metadata, {file:{path:string}}, when a PDF file is being opened, and is always the first item.
Page metadata, {page:integer, width:float, height:float}, when a new page is being parsed, provides the page number, starting at 1. This basically acts as a carriage return for the coordinates of text items to be processed.
Text items, {text:string, x:float, y:float, w:float, ...}, which you can think of as simple objects with a text property, and floating 2D AABB coordinates on the page.

It's up to your callback to process these items into a data structure of your choice, and also to handle any errors passed to it.

For example:

import { PdfReader } from "pdfreader";

new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error("error:", err);
  else if (!item) console.warn("end of file");
  else if (item.text) console.log(item.text);
});
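
The item types listed above can be folded into lines of text. The following is a minimal sketch rather than part of pdfreader's API: `groupIntoLines` is a hypothetical helper that works on an array of items already collected by the callback, and it assumes that text items belonging to the same line share exactly the same `y` value, which may not hold for every PDF. The sample items are made up for illustration.

```javascript
// Hypothetical helper (not part of pdfreader): groups collected items into
// lines of text, per page. Assumes items of one line share the same `y`.
function groupIntoLines(items) {
  const pages = []; // one entry per page: { page, lines }
  let rows = null; // text entries of the current page, indexed by y position

  const flush = () => {
    if (!rows) return;
    pages[pages.length - 1].lines = Object.keys(rows)
      .sort((y1, y2) => parseFloat(y1) - parseFloat(y2)) // top to bottom
      .map((y) => rows[y].join("")); // entries stay in arrival order
  };

  for (const item of items) {
    if (item.page) {
      flush(); // a page item closes the previous page
      pages.push({ page: item.page, lines: [] });
      rows = {};
    } else if (item.text && rows) {
      (rows[item.y] = rows[item.y] || []).push(item.text);
    }
  }
  flush(); // close the last page
  return pages;
}

// Sample items shaped like the ones documented above (values are made up).
const sampleItems = [
  { file: { path: "example.pdf" } },
  { page: 1, width: 40, height: 50 },
  { text: "world", x: 5, y: 2, w: 2 },
  { text: "Hello ", x: 1, y: 1, w: 2 },
  { page: 2, width: 40, height: 50 },
  { text: "Bye", x: 1, y: 1, w: 1 },
];
console.log(groupIntoLines(sampleItems));
```

In a real program, the array would be filled by the parseFileItems or parseBuffer callback, and the helper would be called once the null item is received.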

Parsing a password-protected PDF file

new PdfReader({ password: "YOUR_PASSWORD" }).parseFileItems(
  "test/sample-with-password.pdf",
  function (err, item) {
    if (err) console.error(err);
    else if (!item) console.warn("end of file");
    else if (item.text) console.log(item.text);
  }
);

Raw PDF reading from a PDF buffer

As above, but reading from a buffer in memory rather than from a file referenced by path. For example:

import fs from "fs";
import { PdfReader } from "pdfreader";

fs.readFile("test/sample.pdf", (err, pdfBuffer) => {
  if (err) return console.error("error:", err);
  // pdfBuffer contains the file content
  new PdfReader().parseBuffer(pdfBuffer, (err, item) => {
    if (err) console.error("error:", err);
    else if (!item) console.warn("end of buffer");
    else if (item.text) console.log(item.text);
  });
});

Other examples of use

Source code of the examples above: parsing a CV/résumé.

For more, see Examples of use.

Rule-based data extraction

The Rule class can be used to define and process data extraction rules, while parsing a PDF document.

Rule instances expose "accumulators": methods that define the data extraction strategy to be used for each rule.

Example:

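import { PdfReader, Rule } from "pdfreader";

// displayValue and displayTable are callbacks that you define yourself.
const processItem = Rule.makeItemProcessor([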
  Rule.on(/^Hello \"(.*)\"$/)
    .extractRegexpValues()
    .then(displayValue),
  Rule.on(/^Value\:/)
    .parseNextItemValue()
    .then(displayValue),
  Rule.on(/^c1$/).parseTable(3).then(displayTable),
  Rule.on(/^Values\:/)
    .accumulateAfterHeading()
    .then(displayValue),
]);
new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error(err);
  else processItem(item);
});

Troubleshooting & FAQ

Is it possible to parse a PDF document from a web application?

Solutions exist, but this module cannot be run directly by a web browser. If you really want to use this module, you will have to integrate it into your back-end so that PDF files can be read from your server.

`Cannot read property 'userAgent' of undefined` error from an express-based node.js app

Dmitry found out that you may need to run these instructions before including the pdfreader module:

global.navigator = {
  userAgent: "node",
};

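// `window` does not exist in plain Node.js, so only patch it when it is defined
if (typeof window !== "undefined") {
  window.navigator = {
    userAgent: "node",
  };
}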

Source: express - TypeError: Cannot read property 'userAgent' of undefined error on node.js app run - Stack Overflow

npm-pdfreader's Issues

loadMetaData error: TypeError: Cannot read property 'metadata' of null

Hello!

I have been using pdfreader for some time, both locally on MacOS and in Docker, Node 15.14.0, and everything worked flawlessly. After updating project dependencies, I am getting the following output for any PDF files:

Warning: Setting up fake worker.
loadMetaData error: TypeError: Cannot read property 'metadata' of null
{
  parserError: "loadMetaData error: TypeError: Cannot read property 'metadata' of null"
}

I have created a clean project with the only dependency "pdfreader": "^1.2.12" and an example code from the documentation:

const { PdfReader } = require("pdfreader");
const fs = require("fs");

fs.readFile("./sample.pdf", (err, pdfBuffer) => {
  new PdfReader().parseBuffer(pdfBuffer, function (err, item) {
    if (err) console.log(err);
    else if (!item) console.log("no item");
    else if (item.text) console.log(item.text);
  });
});

...and I am still getting the same message. I played with versions of pdfkit and pdf2json, but that did not solve the problem.

Update: this error appears on 1.2.12. With 1.2.11, it works properly.

I can not read this file

Fazenda Santo Antõnio – Gleba B1-Memorial.pdf

Can u help?
I am trying to read this file.

Check that error handling still works after upgrade to pdf2json v1.1.7 (PR #25)

In PR #25 (merged today without publishing a new version on npm), @noshadil upgraded pdf2json to v1.1.7.

This upgrade changes the parameters of the two top-level events triggered by pdf2json: pdfParser_dataError and pdfParser_dataReady, both handled by pdfreader.

In his PR, @noshadil did update the handler for pdfParser_dataReady accordingly, but not the one for pdfParser_dataError. => It may mean that top-level error handling is broken in the current build, but I don't have enough time to check this at this point.

Next steps:

create an automated test to check that top-level errors can be caught. It should pass with pdf2json v1.1.2 (used in the previous build of pdfreader)
if that test does not pass with pdf2json v1.1.7, fix error handling and propose a PR

Remote code execution

Lodash is vulnerable to remote code execution (RCE) due to the potential to modify the properties of objects in memory. A remote attacker could run arbitrary commands on a vulnerable server, or cause the server to crash, by maliciously crafting an object via the zip functionality of Lodash.

No Response for the file of 150 MB which has 10000 pages. Can you please help me.

Hi, I am using the npm pdfreader. The code is not responding for a file of 150 MB which has 10000 pages.

Below is my code.

var pdfreader = require('pdfreader');

new pdfreader.PdfReader().parseFileItems('demo.pdf', function(err, item){
console.log(item.text)
});

Thank you for the help in advance.

get content by page number?

How to get content by page number? Is there a function for this?
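
One possible approach, based on the item types described in the README above: page items ({page: integer, ...}) arrive before the text items of that page, so a callback can keep track of the current page number. The sketch below is only an illustration under that assumption. `makePageCollector` is a hypothetical helper, not a pdfreader function, and the item stream is simulated so that the example runs without a PDF file.

```javascript
// Hypothetical helper (not part of pdfreader): returns a callback with the
// (err, item) signature that collects text entries per page number.
function makePageCollector(onDone) {
  const textByPage = {}; // page number -> array of text entries
  let currentPage = 0;

  return (err, item) => {
    if (err) return onDone(err);
    if (!item) return onDone(null, textByPage); // null item: end of document
    if (item.page) {
      currentPage = item.page; // page numbers start at 1
      textByPage[currentPage] = [];
    } else if (item.text && currentPage) {
      textByPage[currentPage].push(item.text);
    }
  };
}

// Simulated item stream. With pdfreader, the callback would instead be
// passed to parseFileItems() or parseBuffer().
let result;
const collect = makePageCollector((err, pages) => {
  result = pages;
});
[{ page: 1 }, { text: "first" }, { page: 2 }, { text: "second" }, null].forEach(
  (item) => collect(null, item)
);
console.log(result[2]); // [ 'second' ]
```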

How to know when pdfreader finished reading a file?

I am not able to find out when pdfreader has finished reading a file. I need to call a function once the file has been read completely.

Data on each row get incorrect

Nhan_Thien_CV.pdf
When I parse this CV, the data on each row is parsed incorrectly.

Actual :
Emailthien.nhan2310@[email protected]

Expected :
Email [email protected]

DEV: Package it for Browserify

parseTable is not working

When I run "node parse.js test/sample.pdf" I see that parseTable is not working, and the table from the sample file is not parsed.

modifying properties of Obj

Minimist is a parse argument options module. Affected versions of this package are vulnerable to Prototype Pollution. The library could be tricked into adding or modifying properties of Object

Xmldom is used as a parser and XML serializer. The library could be tricked into adding or modifying the XML

[QUESTION] How to get raw text from PDF

I am trying to parse a PDF and categorize information based on text formatting/decoration. How do you suggest I do that?
For example, I have a pdf in which the structure is repeated:
S.No. BOLD+UNDERLINED TITLE para

How do I categorize this data into an array of objects:
[ { sno: "", title: "", desc: "" }, ... ]

pdfreader.parseFileItems throws an error that cannot be caught?

pdf url : http://www.hkexnews.hk/listedco/listconews/sehk/2018/0228/LTN20180228058_C.pdf

Error: Required "glyf" or "loca" tables are not found
at error (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :193:7)
at Font_checkAndRepair [as checkAndRepair] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :12213:11)
at new Font (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :10756:21)
at PartialEvaluator_translateFont [as translateFont] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :8161:14)
at PartialEvaluator_loadFont [as loadFont] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :7311:29)
at PartialEvaluator_handleSetFont [as handleSetFont] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :7154:23)
at PartialEvaluator_getOperatorList [as getOperatorList] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :7470:37)
at Object.eval [as onResolve] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :4345:26)
at Object.runHandlers (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :864:35)

pdf2json has vulnerabilities, patched in >=0.5.0. Please update this library to resolve the issue.

NPM is showing me 4 vulnerabilities, (3 low, 1 high) for pdf2json. Patched after 0.5.0. I recommend updating dependencies list.

Pdfreader from Electron 15 not working

const fs = require('fs');
const pdfreader = require("pdfreader");
fs.readFile('./test.pdf', function (err, buffer) {
if (err) return console.log(err);
new pdfreader.PdfReader().parseBuffer(buffer, function (err, item) {
if (err) callback(err);
else if (!item) callback();
else if (item.text) console.log(item.text);
});
});
VM1448:195 Uncaught Error: No PDFJS.workerSrc specified
at error (eval at (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:63), :195:9)
at new WorkerTransport (eval at (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:63), :42961:9)
at Object.getDocument (eval at (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:63), :42559:15)
at PDFJSClass.parsePDFData (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:224)
at PDFParser.#startParsingPDF (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\pdfparser.js:85)
at PDFParser.parseBuffer (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\pdfparser.js:142)
at PdfReader.parseBuffer (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\PdfReader.js:72)
at C:\Users\dell\agspdftoexcel\app.js:11
at FSReqCallback.readFileAfterClose [as oncomplete] (node:internal/fs/read_file_context:68)

Is it possible to extract form data?

Can this be used to extract form data so I can have the field name and value?

Doesn't work with latest updates

I don't know why both 'pdf-to-text' and 'pdfreader' don't work even though, as far as I know, the requirements are met.

Briefly, the "require" function is missing, even after I installed "require, requirejs, and require.js" with npm and added a require.js script tag to my HTML; it produces the error below. Here is the codePen or more explanatory Stackoverflow

PS: I tried to include /, ', and combination of them both at the beginning and end of the require functions inside but nothing worked yet.

hi

code:
new pdfreader.PdfReader().parseFileItems(
  fileAllname,
  function (err, item) {
    if (item && item.page) {
      item5.allpage = item.page
    } else if (item.text) {
      initlist.push(item.text)
    }
  }
);
console.log(initlist)

I assign a value to my custom data in the function of parsing pdf, and then output it. It is found that the data is still the initial value. How should I deal with the processed data?

trouble parsing files created by wkhtmltopdf

hi adrien

created a pdf file from your sample.html by wkhtmltopdf.
unfortunately i can not parse it properly with your test.js.
just log item.text.

problem is, i have to parse generated files.
generated file is attached.

do you have any clue?

thanks in advance.

greets
zorla

sample3.pdf

replace Rule accumulators by an accumulate() promise function that takes the accumulator fct as param

What to do If I have a password on my pdf??

Doesn't work on PDFs converted online

Hi,

This npm module is awesome, I really love it :)

However I do have one thing of note. See here, this module works on pdfs exported via Word, Excel, PowerPoint but not on PDFs that were generated from online sources (e.g. online2pdf). Is there a reason for this?

Thanks.

The automated release is failing 🚨

🚨 The automated release from the `master` branch failed. 🚨

I recommend you give this issue a high priority, so other packages depending on you could benefit from your bug fixes and new features.

You can find below the list of errors reported by semantic-release. Each one of them has to be resolved in order to automatically publish your package. I’m sure you can resolve this 💪.

Errors are usually caused by a misconfiguration or an authentication problem. With each error reported below you will find explanation and guidance to help you to resolve it.

Once all the errors are resolved, semantic-release will release your package the next time you push a commit to the master branch. You can also manually restart the failed CI job that runs semantic-release.

If you are not sure how to resolve this, here is some links that can help you:

If those don’t help, or if this issue is reporting something you think isn’t right, you can always ask the humans behind semantic-release.

No npm token specified.

An npm token must be created and set in the NPM_TOKEN environment variable on your CI environment.

Please make sure to create an npm token and to set it in the NPM_TOKEN environment variable on your CI environment. The token must allow to publish to the registry https://registry.npmjs.org/.

Good luck with your project ✨

Your semantic-release bot 📦🚀

How can i convert the output back to PDF

Hello, I would like to make some edits to the JSON output provided by the library, then convert it back to PDF. Any help would be greatly appreciated.

Async reading

I am using a Node.js Express server. On a GET request, I expect to send back the content read from a PDF file, but reading is an asynchronous process.

new pdfreader.PdfReader().parseFileItems('CHK.pdf', function (err, item) {
  if (!item || item.page) {
    res.send(printRows());
  }
});

Is there any way to wait, until all the pages will be read and only then send response back to client?
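
One way to wait, based on the README's statement that the callback receives a null item when parsing is over, is to wrap the parsing call in a Promise. This is a sketch under that assumption, not an official pdfreader feature: `collectTextItems` and `fakeParse` are hypothetical names, and `fakeParse` only stands in for a real parser so that the example runs without a PDF file.

```javascript
// Hypothetical helper (not part of pdfreader): wraps any function that
// accepts an (err, item) callback and resolves once the null item arrives.
function collectTextItems(parse) {
  return new Promise((resolve, reject) => {
    const texts = [];
    parse((err, item) => {
      if (err) reject(err);
      else if (!item) resolve(texts); // null item: parsing is over
      else if (item.text) texts.push(item.text);
    });
  });
}

// Stand-in for a real parser. With pdfreader, one would pass something like:
//   (cb) => new pdfreader.PdfReader().parseFileItems("CHK.pdf", cb)
const fakeParse = (cb) => {
  setTimeout(() => {
    [{ page: 1 }, { text: "a" }, { text: "b" }, null].forEach((item) =>
      cb(null, item)
    );
  }, 0);
};

const done = collectTextItems(fakeParse).then((texts) => {
  console.log(texts.join(" "));
  return texts;
});
```

In an Express handler, the response would then be sent from the `.then()` callback (or after an `await`), so that it only goes out once every page has been read.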

Update pdf2json to version 1.1.9

Pdf2json package is using old version of lodash 4.15 and it has some vulnerabilities. Please update the version of pdf2json to 1.1.9 and it will fix the issue.

I have to try catch this place, if not, sometimes, will throw error "Cannot read property '0' of undefined"

Does not work in node v16.10.0

Describe the bug
When running with node v16.10.0, the methods parseBuffer and parseFileItems do not work as expected.

To Reproduce
Use node version 16.10.0 and try to read the text of a pdf using parseBuffer or parseFileItems

Expected behavior
The callback passed to parseBuffer or parseFileItems should be called with each pdf item found.

Current behavior
The callback passed to parseBuffer or parseFileItems only gets called once with the file: { path } or file: { buffer } data, and never gets called with the pdf items or pages.

Screenshots, outputs or logs

Desktop (please complete the following information):

OS: macOS Big Sur 11.4
Browser: node.js
Version: 16.10.0

Additional Info

The same code works correctly in at least node 16.3.0

can you release a new version on npm?

can you please release a new version to npm?

How can I extract images from pdf

Hi,

How can I extract images from pdf ?

Pdfreader cannot be used with Electron (because of `async` variable in pdf2json)

We wanted to test the library in Electron but it doesn't import properly the async library, pdfparser.js line 6:

async = require("async");

As you may know async is a reserved word in Chrome and you guys are trying to import async library by overwriting the async reserved word without var or let in front of the async var.

try real SAX parsing, using pdf.js

End of file parsing event PdfReader or parseFileItems

Hi,
I am unable to find a "done" or "complete" event for
new pdfreader.PdfReader().parseFileItems

Any pointers on this, or on how to get the total number of pages in a PDF?

pdf2json dependency is broken

pdf2json update from 1.2.5 to 1.3.0 removed formImage, which pdfreader requires, without incrementing the major version number as they should with incompatible changes. NPM thinks that 1.2.5 and 1.3.* are compatible. As a temporary patch, you could set the dependency in package.json to something like <=1.2.5

Outdated version at NPM

Would it be possible to push latest version to NPM? The latest available there seems to be 1.0.7 while current one on github is 1.1.3.

How to get the metadata(author, name) from the file

How can I get metadata from a file, for example author, name, etc.?
I'm trying something like:

const { PdfReader } = require("pdfreader");
const fs = require("fs");

fs.readFile("example.pdf", (err, pdfBuffer) => {
  // pdfBuffer contains the file content
  new PdfReader().parseBuffer(pdfBuffer, function (err, item) {
    if (err) {
      console.log("Error", err);
      return false;
    }

    if (item && item.file) { // <-- item.file not exists
    }

    if (item) {
      console.log(Object.keys(item)); // <-- has nothing to do with metadata
    }

  });
});

makeItemProcessor() to call accumulate (or not), based on usage of Rule.on or Rule.after

locate absolute position of item

I'm trying to find a string in a PDF and want to get its x and y location on the page.
It seems item.x and item.y are relative to the item "above", and it seems impossible to find out which x to add to get the absolute position of an item.
is there any way?

Wrong file content reading files with same path

I'm writing an application that reads the content of some files in a directory. Files are meant to be replaced (same filename but different content).

If I use parseFileItems two times with the same path but different files the result is always the content of the old file.

I solved reading the file content with fs.readFile and passing the buffer to parseBuffer.
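
The workaround described above can be sketched as follows. This is an illustration, not the reporter's actual code: `parseFreshFile`, `stubReader` and `readTexts` are hypothetical names, and the stub only stands in for a PdfReader instance so that the example runs without a PDF file. The only pdfreader method it relies on is parseBuffer(buffer, callback), as documented in the README.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical helper (not part of pdfreader): reads the file from disk on
// every call, then hands the buffer to `reader.parseBuffer`.
function parseFreshFile(reader, filePath, callback) {
  fs.readFile(filePath, (err, buffer) => {
    if (err) return callback(err);
    reader.parseBuffer(buffer, callback);
  });
}

// Stub standing in for a PdfReader instance, so the sketch runs without a
// PDF: it emits the buffer length as one text item, then the final null item.
const stubReader = {
  parseBuffer(buffer, cb) {
    cb(null, { text: String(buffer.length) });
    cb(null, null);
  },
};

// Promise-returning wrapper used only by this demo.
const readTexts = (filePath) =>
  new Promise((resolve, reject) => {
    const texts = [];
    parseFreshFile(stubReader, filePath, (err, item) => {
      if (err) reject(err);
      else if (!item) resolve(texts);
      else if (item.text) texts.push(item.text);
    });
  });

const tmpFile = path.join(os.tmpdir(), "parse-fresh-file-demo.bin");
fs.writeFileSync(tmpFile, "old");
const demo = readTexts(tmpFile).then((first) => {
  fs.writeFileSync(tmpFile, "newer content"); // same path, new content
  return readTexts(tmpFile).then((second) => [first, second]);
});
demo.then((results) => console.log(results));
```

Because the file is read again on each call, the second pass sees the replaced content even though the path is unchanged.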

Your source code looks fine to me; maybe it is a problem with pdf2json/pdfparser, but I'm not sure, so I'm reporting it to you.

Cannot parse text from PDF document with Chinese characters

Hi, I have a PDF file, which can be opened and copied.
But it cannot be read, please help me, thank you!

4443.pdf

Can not read pdf file

I have a file that cannot be read. Can you take a look?
cv_81_vietnamworks_11121.pdf

Missing text height

Hello, I am using the code snippet from the documentation to parse lines of text from pdf page.
The lines of text are getting parsed, however for some reason the height / h property is missing from item.
I need the height in order to detect the text getting out of bound of a certain box on the pdf page.

Here is the code snippet, that I used:
let rows = {};
let addressRows = [];

const printRows = () => {
  Object.keys(rows)
    .sort((y1, y2) => parseFloat(y1) - parseFloat(y2))
    .forEach((y) => {
      addressRows.push((rows[y] || []).join(''));
    });
}

new pdfreader.PdfReader().parseFileItems(tempPath, function (err, item) {
  if (!item || item.page) {
    printRows();
    if (!item) {
      console.log('addressRows: ', addressRows)
    }
  } else if (item.text) {
    console.log(item);

    // accumulate text items into rows object, per line
    (rows[item.y] = rows[item.y] || []).push(item.text);
  }

});

Here is the log I received on terminal.

Any idea why the height/h property is missing?
Thank you.

Having some trouble with parseTable

Hi there,

Think you got a good idea here, but I'm trying to figure out how to correctly parse a table. I don't think the displayTable() you have in your test file is logging.. I'm just having trouble figuring out the pattern. Anyways, do you have any advice for me?

Thanks in advance and I hope you have good day :)

Troy

var _ = require('lodash');
var PdfReader = require('pdfreader').PdfReader;
var Rule = require('pdfreader').Rule;

function displayTable(table){
    console.log('Object.keys(table)',Object.keys(table));
    _.map(table.rows, function(row){
        console.log('row',row);
    });
}
var sampleRules = [
    Rule.on(/^c1$/).parseTable(3).then(displayTable)
  ];
var processItemSample = Rule.makeItemProcessor(sampleRules);

var samplePathToPdf = __dirname + '/sample.pdf';
new PdfReader().parseFileItems(samplePathToPdf, function(err, item){
    if (err){
        console.log(err);
    }
    else {
        processItemSample(item);
    }
});

Here is my output

Object.keys(table) [ 'items', 'rows', 'matrix' ]
row [ { x: 20.408,
    y: 10.501,
    w: 0.9436,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'c2' },
  { x: 28.299,
    y: 10.501,
    w: 0.9436,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'c3' },
  { x: 14.979,
    y: 11.447,
    w: 0.5,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '1' },
  { x: 29.249,
    y: 11.447,
    w: 1.25,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '2.3' } ]
row [ { x: 19.513,
    y: 12.363,
    w: 2,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'hello' },
  { x: 27.068,
    y: 12.363,
    w: 2.333,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'world' },
  { x: 12.964,
    y: 13.248,
    w: 3.055,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'Values:' } ]
row [ { x: 12.964,
    y: 14.835,
    w: 0.5,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '1' },
  { x: 12.964,
    y: 16.423,
    w: 0.5,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '2' } ]
row [ { x: 12.964,
    y: 18.01,
    w: 0.5,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '3' } ]

Fails when uploading file that contains comments within PDF

When uploading and processing a PDF that contains comments, pdfreader is unable to handle the request, and my backend node service fails. I'm able to use PdfReader().parseBuffer(file, function(err, item) to process the buffered file, and it's able to read the file and first item, but it fails going forward.

Is this a known bug, and if so, is there any way I can handle this accordingly, or a way to detect that the file has comments and return an error? I've tried some workarounds, but the service just fails every time.

Invalid XRef stream header

Describe the bug
Unable to process PDF

To Reproduce

Feed in the PDF as a buffer
Attempt to extract text from the PDF

Expected behavior

Extract text from PDF

Screenshots, outputs or logs

    (while reading XRef): Error: Invalid XRef stream header

      at XRef_readXRef [as readXRef] (eval at Object.<anonymous> (node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:5682:9)

  console.log
    XRefParseException: 
        at XRefParseExceptionClosure (eval at Object.<anonymous> (/Users/tsopic/telegram_bot/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:379:34)
        at eval (eval at Object.<anonymous> (/Users/tsopic/repo/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:384:3)

Desktop (please complete the following information):

NODE - 14

tested on both mac and linux


embarrassingly slow

Describe the bug
Parsing a pdf file containing 235 pages takes up to 8 seconds (just doing nothing with the received tokens - apparently the lexer alone takes up that much time) :-p

To Reproduce

const parseStart = process.hrtime();
new PdfReader().parseBuffer(result.data, (err, item) => {
                        // the pdf reader signals the end of the parsing process
                        // by calling this function with the item set as undefined
                        if (!item) {
                            const parseEnd = process.hrtime(parseStart);
                            this.logger.log(`parse pdf completed in ${parseEnd[0]}.${Math.floor(parseEnd[1] / 10e6)}s`);

                            observer.next(table);
                            observer.complete();
                            return
                        }
});

Expected behavior
I would expect a PDF file of this size to take no longer than 2 seconds to parse.

Screenshots, outputs or logs
parse pdf completed in 7.42s SkybriefingDaylightAdapter

Desktop (please complete the following information):

OS: Windows 10, but it doesn't matter, it's the same on a Linux virtual machine.

Additional context
I have attached a sample file.
pdf.pdf

Unable to catch Parse error

Hi @adrienjoly ,

I am using pdfreader to parse ### pdf documents. However in my application if I bump into runtime error while parsing pdf I want to use a particular logic. Below is code and trace of exception while reading pdf document. Issue is that the error is not getting caught in if(err) condition. Am I missing anything in catching the exception shown below?

Thanks,
Ji

Code snippet and exception trace:

function readPDFPages(buffer, reader = (new PdfReader())) {

  console.log('reading pdf pages: ');
  console.log(buffer);

  return new Promise((resolve, reject) => {
    let pages = [];
    reader.parseBuffer(buffer, (err, item) => {

      if (err) {
        console.log("err in parsed buffer");
        console.log(err);
        reject(err)
      }
      else if (!item)
        resolve(pages);

      else if (item.page)
        resolve(pages);
    });
  });

}

Exception trace:
Error: Illegal character: 41
    at error (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:195:9)
    at Lexer_getObj [as getObj] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24616:11)
    at Parser_shift [as shift] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24038:32)
    at Parser_makeStream [as makeStream] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24195:12)
    at Parser_getObj [as getObj] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24079:18)
    at XRef_fetch [as fetch] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:5753:22)
    at XRef_fetchIfRef [as fetchIfRef] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:5699:19)
    at Dict_get [as get] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:4759:28)
    at Page_getPageProp [as getPageProp] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:4213:28)
    at Page.get content [as content] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:4227:19)

having trouble parsing data that extends between pages

so far, usage of this library has been really good, but I've run into an issue. basically I have a table I'm parsing data from (not using pdfreader.TableParser right now) that is split between two pages.

when I parse through each page, I use logic that finds the heading of the title to determine where it begins, and the heading of the next table to determine where it ends.

if I cannot parse across both pages, I cannot get all the data from the table.

from my understanding, I am looping through each page in my below code. I would love any suggestions as I've sort of hit a roadblock here.

please note that I'm using the Serverless framework and invoking it that way; sorry if it's very unrepeatable for anybody.

code:

function getNextIndexItem(rows, num, currentItem) {
  /*
  get the item that appears next in the array passed into the rows param.
  param rows: array of rows parsed on the page
  param num: number of indexes past the current index (currentItem)
  param currentItem: the index of the current item in the array
  */ 
  let keys = Object.keys(rows);
  let nextIndex = keys.indexOf(currentItem) + num;
  let nextItem = keys[nextIndex];
  let nextField = (rows[nextItem] || []).join(' ');
  let finalStr = nextField.split(':')[nextField.split(':').length-1];

  return finalStr;
}

function pdfReader(pdfFilePath, parsedData, callback) {
  const pdfreader = require("pdfreader");

  let rows = {}; // indexed by y-position
  let tableIndexes = 0;

  new pdfreader.PdfReader().parseFileItems(pdfFilePath, (err, item) => {
    if (err) callback(err);

    if (item) {
      if (item.page) {
        // end of file, or page
        Object.keys(rows) // => array of y-positions (type: float)
          .sort((y1, y2) => parseFloat(y1) - parseFloat(y2)) // sort float positions
          .forEach(yValue => {
            // rows[y] is an array of text for a line.
            let line = (rows[yValue] || []).join('');  // construct line of text

            if (line.includes('Table Name')) {
              tableIndexes = 0;
              for (let i = 0; i < 500; i++) {
                if (getNextIndexItem(rows, i, yValue).includes('Next Table Name')) {
                  tableIndexes = i;  // get index of last table row
                  break;
                }
              }
              for (let i = 2; i < tableIndexes; i++) {  // start at 2 to avoid the heading row of the table
                console.log(`List row #${i}: ${getNextIndexItem(rows, i, yValue)}`);
              }
            }
          });
        rows = {}; // clear rows for next page
      } else if (item.text) {
        if (!rows[item.y]) {
          rows[item.y] = [];
        }
        rows[item.y].push(item.text);
      }
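    } else {
      // we're done here: no error, so pass null first (error-first callback)
      callback(null, parsedData);
    }
  });
};

module.exports.test = function() {
  // pass an (initially empty) parsedData object, then the error-first callback
  pdfReader('doc.pdf', {}, (err, data) => {
    if (err) console.log(err);
  });
}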

here's the table in the PDF:
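
One possible way around the page boundary, as a minimal sketch rather than a confirmed fix: instead of searching for the table inside each page's rows (which are cleared at every page), first turn all items into one list of lines in reading order, then look for the two headings in that list. The helper names `itemsToLines` and `linesBetween` are hypothetical and not part of pdfreader; the sketch only assumes the item shapes used in the code above (`{ page }` when a page starts, `{ text, y }` for text), and the sample items are made up.

```javascript
// Hypothetical helper: turns a flat list of pdfreader-style items into
// text lines, page after page, so a table that continues on the next
// page stays contiguous in the result.
function itemsToLines(items) {
  const lines = []; // all lines of the document, in reading order
  let rows = {};    // rows of the current page, keyed by y-position

  const flushPage = () => {
    Object.keys(rows)
      .sort((y1, y2) => parseFloat(y1) - parseFloat(y2)) // top to bottom
      .forEach((y) => lines.push(rows[y].join(' ')));
    rows = {};
  };

  for (const item of items) {
    if (item.page) {
      flushPage(); // a new page starts: emit the lines of the previous one
    } else if (item.text) {
      (rows[item.y] = rows[item.y] || []).push(item.text);
    }
  }
  flushPage(); // emit the lines of the last page
  return lines;
}

// Hypothetical helper: returns the lines strictly between the first line
// containing startHeading and the next line containing endHeading.
function linesBetween(lines, startHeading, endHeading) {
  const start = lines.findIndex((l) => l.includes(startHeading));
  if (start === -1) return [];
  const end = lines.findIndex((l, i) => i > start && l.includes(endHeading));
  return lines.slice(start + 1, end === -1 ? undefined : end);
}

// Made-up items: a table that starts on page 1 and ends on page 2.
const sampleItems = [
  { page: 1 },
  { text: 'Table Name', y: 1.5 },
  { text: 'Col A', y: 2.5 }, { text: 'Col B', y: 2.5 },
  { text: 'a1', y: 3.5 }, { text: 'b1', y: 3.5 },
  { page: 2 },
  { text: 'a2', y: 1.5 }, { text: 'b2', y: 1.5 },
  { text: 'Next Table Name', y: 2.5 },
];
console.log(linesBetween(itemsToLines(sampleItems), 'Table Name', 'Next Table Name'));
// prints [ 'Col A Col B', 'a1 b1', 'a2 b2' ]
```

With the callback-based code above, this would mean pushing every item into an array and calling `itemsToLines` once `item` is falsy (end of file), rather than processing rows page by page.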

DEV: add proper test case, with error handling and sample pdf file (ex: anonymized icade file)

Some characters are missing / corrupt (e.g. ligatures)

First, I just want to thank you for creating this package. It's really helped us.

Describe the bug
While most of the text is there, a few characters are missing from my PDF.

Here's the PDF. It was produced by using a headless Chrome 67.0.3396.87 on Ubuntu to print the screen to PDF.
Scenario-4.1-RiskTables-FQA.pdf

To Reproduce
Here's a minimalist test:

const PdfReader = require("pdfreader").PdfReader;
const fs = require("fs");
const path = require("path");

const filename = path.join("c:", "temp", "Scenario-4.1-RiskTables-FQA.pdf");
console.log("Reading " + filename + "...");

new Promise((resolve, reject) => {
    let pdfText = "";
    fs.readFile(filename, (err, pdfBuffer) => {
        if (err) return reject(err); // e.g. file not found: pdfBuffer would be undefined
        console.log("Found buffer with " + pdfBuffer.length + " bytes.");
        new PdfReader().parseBuffer(pdfBuffer, function(err, item){
            if (err) {
                reject(err);
            } else if (!item) {
                resolve(pdfText);
            } else if (item.text) {
                //console.log("Found item: " + JSON.stringify(item));
                pdfText += item.text;
            }
        });
    });
}).then((pdfText) => {
    console.log("Found PDF Text: " + pdfText);
}).catch(e => {
    console.log("ERROR", e);
});

Expected behavior
I would expect to see all of the characters. Open the PDF and you'll notice the sentence "Effective RMP:" on the first page just above "Default 5x5 RMP V1.0". In the text that gets exported from the file, it says "E ective RMP".

Screenshots, outputs or logs
Here's the log of what this program produces for me:

Reading c:\temp\Scenario-4.1-RiskTables-FQA.pdf...
Found buffer with 69972 bytes.
Found PDF Text: 8/29/2018QbDVisionRiskTablesabout:blank1/3QbDVisionExportedBy:RyanRocketExportDate:Aug29,2018at1:44pm G MT C ompany:RocketsRUSProject:PRJ-6-PrintTestProjectReportDate:Aug29,2018RiskTablesReportFQARiskTableAsofAug29,2018at11:59pm G MTRiskTable:FQARiskTableDate:Aug29,2018E ectiveRMP:Default5x5RMPV1.08/29/2018QbDVisionRiskTablesabout:blank2/3FQA-32-Appearance[NOTAPPROVED]1(1%) C olor,shapeandappearancearenotdirectlylinkedtosafetyande cacy.Therefore,theyarenotcritical.10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQA-40-Assay[NOTAPPROVED]100(100%)Processvariablesmaya ecttheassayofthedrugproduct.1000(100%)10000(100%)IPTandRelease C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]TPP-95-Overdosage[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]FQA-52- C ontainer C losureSystem[NOTAPPROVED]100(100%)Packagingoptionshavenotbeenidenti ed1000(100%)10000(100%)SuitablepackagingoptionswillbeinvestigatedduringdevelopmentprocessNone C M-78-NA[NOTAPPROVED]TPP-101-HowSupplied/StorageandHandling[NOTAPPROVED]FQA-45- C ontentUniformity[NOTAPPROVED]100(100%)Variabilityincontentuniformitywilla ectsafetyande cacy.1000(100%)10000(100%)Bothformulationandprocessvariablesimpactcontentuniformity,sothis C QAwillbeevaluatedthroughoutproductandprocessdevelopment.ReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]TPP-95-Overdosage[NOTAPPROVED]FQA-42-DegradationProducts[NOTAPPROVED]100(100%)Formulationandprocessvariablescanimpactdegradationproducts.1000(100%)10000(100%)Degradationproductswillbeassessedduringproductandprocessdevelopment.IPTandRelease C M-79-Unknown[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]TPP-101-HowSupplied/StorageandHandling[NOTAPPROVED]FQA-47-Dissolution[NOTAPPROVED]100(100%)Bothformulationandprocessvariablesa ectthedissolutionpro le.1000(100%)10000(100%)This C 
QAwillbeinvestigatedthroughoutformulationandprocessdevelopment.ReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-97- C linicalPharmacology[NOTAPPROVED]FQA-37-Friability[NOTAPPROVED]25(25%)AtargetofNMT1.0%w/wofmeanweightlossassuresalowimpactonpatientsafetyande cacyandminimizescustomercomplaints.250(25%)2500(25%)ReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-97- C linicalPharmacology[NOTAPPROVED]FQA-38-Identi cation[NOTAPPROVED]100(100%)Identi cationiscriticalforsafetyande cacy.1000(100%)10000(100%)IPTandRelease C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]FQA-50-MicrobialLimits[NOTAPPROVED]10(10%)Non-compliancewithmicrobiallimitswillimpactpatientsafety.However,inthiscase,theriskofmicrobialgrowthisverylowbecauserollercompaction(drygranulation)isutilizedforthisproduct.Therefore,this C QAwillnotbediscussedindetailduringformulationandprocessdevelopment.100(10%)1000(10%)NoneReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]FQA-33-Odor[NOTAPPROVED]1(1%)Ingeneral,anoticeableodorisnotdirectlylinkedtosafetyande cacy,butodorcana ectpatientacceptability.10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQA-49-ResidualSolvents[NOTAPPROVED]5(5%)Residualsolventscanimpactsafety.However,nosolventisusedinthedrugproductmanufacturingprocessandthedrugproductcomplieswithUSP<467>Option1.Therefore,formulationandprocessvariablesareunlikelytoimpactthis C QA.50(5%)500(5%)NoneReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]FQA-35-Score C on guration[NOTAPPROVED]1(1%)Scorecon gurationisnotcriticalfortheacetriptantablet.10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQA-34-Size[NOTAPPROVED]1(1%)SeeTargetJusti cation10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQA C riticality C riticalityJusti cationProcessRiskRPNRecommendedActions C ontrolStrategy C ontrolMethodsTPPLinks8/29/2018QbDVisionRiskTablesabout:blank3/3©2018 C herry C 
ircleSoftware,Inc.FQA-43-Water C ontent[NOTAPPROVED]25(25%)However,inthiscase,acetriptanisnotsensitivetohydrolysisandmoisturewillnotimpactstability.250(25%)2500(25%)NoneReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]FQA C riticality C riticalityJusti cationProcessRiskRPNRecommendedActions C ontrolStrategy C ontrolMethodsTPPLinks

Desktop (please complete the following information):

OS: Ubuntu 14, running in a Docker container
Browser: Chrome 67.0.3396.87
Version: pdfreader v0.2.5

Additional context
Thank you again for creating this package.
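
For reference, the gaps in the output above fall where "ff", "fi" and "ffi" would appear ("E ective", "identi ed", "e cacy"), which is consistent with ligature glyphs being dropped. As a rough sketch only: if a parser emitted the Unicode ligature code points (U+FB00 to U+FB06) instead of dropping them, which the log above does not show, they could be expanded with the standard `String.prototype.normalize`. `expandLigatures` is a hypothetical helper, not part of pdfreader, and it cannot recover characters that are already missing from the extracted text.

```javascript
// Hypothetical helper: expands Unicode ligature code points
// (U+FB00..U+FB06, e.g. "ff", "fi", "ffi" as single characters) into
// plain letters, using NFKC compatibility normalization per character.
function expandLigatures(text) {
  return text.replace(/[\uFB00-\uFB06]/g, (ch) => ch.normalize('NFKC'));
}

console.log(expandLigatures('E\uFB00ective RMP')); // prints "Effective RMP"
```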


code:
new pdfreader.PdfReader().parseFileItems(fileAllname, function (err, item) {
  if (item && item.page) {
    item5.allpage = item.page;
  } else if (item && item.text) { // item is undefined at the end of the file
    initlist.push(item.text);
  }
});
console.log(initlist);

🚨 The automated release from the master branch failed. 🚨

No npm token specified.
