npm-pdfreader's Introduction

pdfreader

Read text and parse tables from PDF files.

Supports tabular data with automatic column detection, and rule-based parsing.

Dependencies: it is based on pdf2json, which itself relies on Mozilla's pdf.js.

🆕 Now includes TypeScript type definitions!

ℹ️ Important notes:

  • This module is meant to be run using Node.js only. It does not work from a web browser.
  • This module extracts text entries from PDF files. It does not support photographed text. If you cannot select text from the PDF file, you may need to use OCR software first.

Installation, tests and CLI usage

After installing Node.js:

git clone https://github.com/adrienjoly/npm-pdfreader.git
cd npm-pdfreader
npm install
npm test
node parse.js test/sample.pdf

Installation into an existing project

To install pdfreader as a dependency of your Node.js project:

npm install pdfreader

Then, see below for examples of use.

Raw PDF reading

This module exposes the PdfReader class, which you instantiate. You can pass { debug: true } to the constructor to log debugging information, which is useful for troubleshooting.

Your instance has two methods for parsing a PDF. They produce the same output and differ only in their input: parseFileItems (as below) takes a filename, and parseBuffer (see "Raw PDF reading from a PDF buffer") takes data that you don't want to reference from the filesystem.

Whichever method you choose, it asks for a callback, which gets called each time the instance finds what it denotes as a PDF item.

An item object can match one of the following objects:

  • null, when the parsing is over or an error occurred.
  • File metadata, {file:{path:string}}, emitted when a PDF file is being opened. It is always the first item.
  • Page metadata, {page:integer, width:float, height:float}, emitted when a new page is being parsed. It provides the page number, starting at 1, and basically acts as a carriage return for the coordinates of the text items that follow.
  • Text items, {text:string, x:float, y:float, w:float, ...}, which you can think of as simple objects with a text property and floating-point 2D AABB coordinates on the page.

It's up to your callback to process these items into a data structure of your choice, and also to handle any errors passed to it.

For example:

import { PdfReader } from "pdfreader";

new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error("error:", err);
  else if (!item) console.warn("end of file");
  else if (item.text) console.log(item.text);
});
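
The item types listed above can be accumulated into lines of text. The following is a minimal sketch, not part of pdfreader: groupTextByLine is a hypothetical helper that takes an array of already collected items (in the shapes described above) and assumes that text items sharing the same y value belong to the same line.

```javascript
// Hypothetical helper (not part of pdfreader): groups text items into
// lines, per page, using each item's y coordinate as the line key.
function groupTextByLine(items) {
  const pages = []; // one entry per page: { page, lines: string[] }
  let rows = null; // Map of y -> array of { x, text } for the current page

  // Converts the rows of the current page into sorted lines of text.
  const flush = () => {
    if (!rows) return;
    pages[pages.length - 1].lines = [...rows.keys()]
      .sort((a, b) => a - b) // ascending y
      .map((y) =>
        rows
          .get(y)
          .sort((a, b) => a.x - b.x) // ascending x within a line
          .map((entry) => entry.text)
          .join(" ")
      );
  };

  for (const item of items) {
    if (!item) break; // a null item marks the end of parsing
    if (item.page) {
      flush(); // finish the previous page, if any
      pages.push({ page: item.page, lines: [] });
      rows = new Map();
    } else if (item.text && rows) {
      if (!rows.has(item.y)) rows.set(item.y, []);
      rows.get(item.y).push({ x: item.x, text: item.text });
    }
  }
  flush(); // finish the last page
  return pages;
}
```

To use it with a real file, one option is to push every item received by the parseFileItems callback into an array, and to call groupTextByLine on that array once the null item arrives.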

Parsing a password-protected PDF file

new PdfReader({ password: "YOUR_PASSWORD" }).parseFileItems(
  "test/sample-with-password.pdf",
  function (err, item) {
    if (err) console.error(err);
    else if (!item) console.warn("end of file");
    else if (item.text) console.log(item.text);
  }
);

Raw PDF reading from a PDF buffer

As above, but reading from a buffer in memory rather than from a file referenced by path. For example:

import fs from "fs";
import { PdfReader } from "pdfreader";

fs.readFile("test/sample.pdf", (err, pdfBuffer) => {
  // pdfBuffer contains the file content
  new PdfReader().parseBuffer(pdfBuffer, (err, item) => {
    if (err) console.error("error:", err);
    else if (!item) console.warn("end of buffer");
    else if (item.text) console.log(item.text);
  });
});

Other examples of use

[Screenshots: parsing a CV/résumé, converting PDF text and a PDF table to text]

Source code of the examples above: parsing a CV/résumé.

For more, see Examples of use.

Rule-based data extraction

The Rule class can be used to define and process data extraction rules, while parsing a PDF document.

Rule instances expose "accumulators": methods that define the data extraction strategy to be used for each rule.

Example:

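import { PdfReader, Rule } from "pdfreader";

// Placeholder handlers for this example: print whatever each rule extracts.
const displayValue = (value) => console.log("extracted value:", value);
const displayTable = (table) => console.log("extracted table:", table);

const processItem = Rule.makeItemProcessor([
  Rule.on(/^Hello \"(.*)\"$/)
    .extractRegexpValues()
    .then(displayValue),
  Rule.on(/^Value\:/)
    .parseNextItemValue()
    .then(displayValue),
  Rule.on(/^c1$/).parseTable(3).then(displayTable),
  Rule.on(/^Values\:/)
    .accumulateAfterHeading()
    .then(displayValue),
]);
new PdfReader().parseFileItems("test/sample.pdf", (err, item) => {
  if (err) console.error(err);
  else processItem(item);
});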

Troubleshooting & FAQ

Is it possible to parse a PDF document from a web application?

Solutions exist, but this module cannot be run directly by a web browser. If you really want to use this module, you will have to integrate it into your back-end so that PDF files can be read from your server.

Cannot read property 'userAgent' of undefined error from an express-based node.js app

Dmitry found out that you may need to run these instructions before including the pdfreader module:

global.navigator = {
  userAgent: "node",
};

window.navigator = {
  userAgent: "node",
};

Source: express - TypeError: Cannot read property 'userAgent' of undefined error on node.js app run - Stack Overflow

npm-pdfreader's People

Contributors

adrienjoly, apieceofbart, aykarsi, copmerbenjamin, craig-walker-orennia, dariusstephen, dependabot[bot], geniegeist, hack-tramp, johakr, k2s, mattduboismatt, mounika0536, niemal, noshadil, semantic-release-bot, simdrouin

npm-pdfreader's Issues

loadMetaData error: TypeError: Cannot read property 'metadata' of null

Hello!

I have been using pdfreader for some time, both locally on MacOS and in Docker, Node 15.14.0, and everything worked flawlessly. After updating project dependencies, I am getting the following output for any PDF files:

Warning: Setting up fake worker.
loadMetaData error: TypeError: Cannot read property 'metadata' of null
{
  parserError: "loadMetaData error: TypeError: Cannot read property 'metadata' of null"
}

I have created a clean project with the only dependency "pdfreader": "^1.2.12" and an example code from the documentation:

const { PdfReader } = require("pdfreader");
const fs = require("fs");

fs.readFile("./sample.pdf", (err, pdfBuffer) => {
  new PdfReader().parseBuffer(pdfBuffer, function (err, item) {
    if (err) console.log(err);
    else if (!item) console.log("no item");
    else if (item.text) console.log(item.text);
  });
});

...and I am still getting the same message. I played with the versions of pdfkit and pdf2json, but that did not solve the problem.

Update: this error appears on 1.2.12. With 1.2.11, it works properly.

Check that error handling still works after upgrade to pdf2json v1.1.7 (PR #25)

In PR #25 (merged today without publishing a new version on npm), @noshadil upgraded pdf2json to v1.1.7.

This upgrade changes the parameters of the two top-level events triggered by pdf2json: pdfParser_dataError and pdfParser_dataReady, both handled by pdfreader.

In his PR, @noshadil did update the handler for pdfParser_dataReady accordingly, but not the one for pdfParser_dataError. This may mean that top-level error handling is broken in the current build, but I don't have enough time to check this at this point.

Next steps:

  • create an automated test to check that top-level errors can be caught. It should pass with pdf2json v1.1.2 (used in the previous build of pdfreader)
  • if that test does not pass with pdf2json v1.1.7, fix error handling and propose a PR

Remote code execution

Lodash is vulnerable to remote code execution (RCE) due to the potential to modify the properties of objects in memory. A remote attacker could run arbitrary commands on a vulnerable server, or cause the server to crash, by maliciously crafting an object via the zip functionality of Lodash.

parseTable is not working

When I run "node parse.js test/sample.pdf", I see that parseTable is not working: the table from the sample file is not parsed.

modifying properties of Obj

Minimist is a parse argument options module. Affected versions of this package are vulnerable to Prototype Pollution. The library could be tricked into adding or modifying properties of Object

Xmldom is used as a parser and XML serializer. The library could be tricked into adding or modifying the XML.

[QUESTION] How to get raw text from PDF

I am trying to parse a PDF and categorize information based on text formatting/decoration. How do you suggest I do that?
For example, I have a PDF in which this structure is repeated:
S.No. BOLD+UNDERLINED TITLE para

How do I categorize this data into an array of objects:
[ { sno: "", title: "", desc: "" }, ... ]

pdfreader.parseFileItems throws an error that cannot be caught?

pdf url : http://www.hkexnews.hk/listedco/listconews/sehk/2018/0228/LTN20180228058_C.pdf

Error: Required "glyf" or "loca" tables are not found
at error (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :193:7)
at Font_checkAndRepair [as checkAndRepair] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :12213:11)
at new Font (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :10756:21)
at PartialEvaluator_translateFont [as translateFont] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :8161:14)
at PartialEvaluator_loadFont [as loadFont] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :7311:29)
at PartialEvaluator_handleSetFont [as handleSetFont] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :7154:23)
at PartialEvaluator_getOperatorList [as getOperatorList] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :7470:37)
at Object.eval [as onResolve] (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :4345:26)
at Object.runHandlers (eval at (E:\git\9w\node_modules\pdf2json\lib\pdf.js:60:1), :864:35)

Pdfreader from Electron 15 not working

const fs = require('fs');
const pdfreader = require("pdfreader");
fs.readFile('./test.pdf', function (err, buffer) {
  if (err) return console.log(err);
  new pdfreader.PdfReader().parseBuffer(buffer, function (err, item) {
    if (err) callback(err);
    else if (!item) callback();
    else if (item.text) console.log(item.text);
  });
});

VM1448:195 Uncaught Error: No PDFJS.workerSrc specified
at error (eval at (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:63), :195:9)
at new WorkerTransport (eval at (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:63), :42961:9)
at Object.getDocument (eval at (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:63), :42559:15)
at PDFJSClass.parsePDFData (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\lib\pdf.js:224)
at PDFParser.#startParsingPDF (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\pdfparser.js:85)
at PDFParser.parseBuffer (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\node_modules\pdf2json\pdfparser.js:142)
at PdfReader.parseBuffer (C:\Users\dell\agspdftoexcel\node_modules\pdfreader\PdfReader.js:72)
at C:\Users\dell\agspdftoexcel\app.js:11
at FSReqCallback.readFileAfterClose [as oncomplete] (node:internal/fs/read_file_context:68)

Doesn't work with latest updates

I don't know why neither 'pdf-to-text' nor 'pdfreader' works, even though the requirements are met as far as I know.

Briefly, the "require" function is missing: even after installing "require, requirejs, and require.js" with npm, and even after adding a require.js script line to my HTML, it produces the error below. Here is the CodePen, or the more explanatory Stack Overflow question.

[Screenshot, 2017-03-14 14:32:09]

PS: I tried to include /, ', and combination of them both at the beginning and end of the require functions inside but nothing worked yet.

hi

code:
new pdfreader.PdfReader().parseFileItems(fileAllname, function (err, item) {
  if (item && item.page) {
    item5.allpage = item.page;
  } else if (item && item.text) {
    initlist.push(item.text);
  }
});
console.log(initlist);

I assign a value to my custom data in the function of parsing pdf, and then output it. It is found that the data is still the initial value. How should I deal with the processed data?
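
One likely explanation, given that the README describes a callback invoked once per item and a final null item, is that console.log runs before the callback has received all the items. The sketch below shows one way to defer the use of the data until that null item arrives. collectText is a hypothetical helper, not part of pdfreader; the only API it relies on is parseFileItems(path, callback) as documented above.

```javascript
// Hypothetical helper: collects text items and the last page number, then
// calls `done` once, when the reader signals the end with a null item.
function collectText(reader, filePath, done) {
  const texts = [];
  let pageCount = 0;
  let finished = false;

  reader.parseFileItems(filePath, (err, item) => {
    if (finished) return; // ignore anything after an error or the end
    if (err) {
      finished = true;
      done(err);
    } else if (!item) {
      finished = true;
      done(null, { pageCount, texts }); // the data is complete only here
    } else if (item.page) {
      pageCount = item.page; // page numbers start at 1
    } else if (item.text) {
      texts.push(item.text);
    }
  });
}
```

With the real library, this could be called as collectText(new pdfreader.PdfReader(), fileAllname, (err, data) => console.log(data.texts)), so that the output is printed from inside the completion callback rather than right after the call.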

trouble parsing files created by wkhtmltopdf

hi adrien

created a pdf file from your sample.html by wkhtmltopdf.
unfortunately i can not parse it properly with your test.js.
just log item.text.

problem is, i have to parse generated files.
generated file is attached.

do you have any clue?

thanks in advance.

greets
zorla

sample3.pdf

Doesn't work on PDFs converted online

Hi,

This npm module is awesome, I really love it :)

However I do have one thing of note. See here, this module works on pdfs exported via Word, Excel, PowerPoint but not on PDFs that were generated from online sources (e.g. online2pdf). Is there a reason for this?

Thanks.

The automated release is failing 🚨

🚨 The automated release from the master branch failed. 🚨

I recommend you give this issue a high priority, so other packages depending on you could benefit from your bug fixes and new features.

You can find below the list of errors reported by semantic-release. Each one of them has to be resolved in order to automatically publish your package. I’m sure you can resolve this 💪.

Errors are usually caused by a misconfiguration or an authentication problem. With each error reported below you will find explanation and guidance to help you to resolve it.

Once all the errors are resolved, semantic-release will release your package the next time you push a commit to the master branch. You can also manually restart the failed CI job that runs semantic-release.

If you are not sure how to resolve this, here are some links that can help you:

If those don’t help, or if this issue is reporting something you think isn’t right, you can always ask the humans behind semantic-release.


No npm token specified.

An npm token must be created and set in the NPM_TOKEN environment variable on your CI environment.

Please make sure to create an npm token and to set it in the NPM_TOKEN environment variable on your CI environment. The token must allow to publish to the registry https://registry.npmjs.org/.


Good luck with your project ✨

Your semantic-release bot 📦🚀

Async reading

I am using a Node.js Express server. On a GET request, I expect to receive the content read from a PDF file, but reading is an asynchronous process.

new pdfreader.PdfReader().parseFileItems('CHK.pdf', function (err, item) {
  if (!item || item.page) {
    res.send(printRows());
  }
});

Is there any way to wait until all the pages have been read, and only then send the response back to the client?
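
One possible approach is to wrap the parsing in a Promise that settles only when the null item (end of file) or an error arrives. This is a sketch under the assumption that the callback behaves as described in the README; readAllItems is a hypothetical helper name, not part of pdfreader.

```javascript
// Hypothetical helper: resolves with every item once parsing is over,
// or rejects with the first error passed to the callback.
function readAllItems(reader, filePath) {
  return new Promise((resolve, reject) => {
    const items = [];
    reader.parseFileItems(filePath, (err, item) => {
      if (err) reject(err);
      else if (!item) resolve(items); // null item: end of file
      else items.push(item);
    });
  });
}
```

In an Express handler, the response would then be sent from the then callback, for example readAllItems(new pdfreader.PdfReader(), 'CHK.pdf').then((items) => res.send(...)), with a catch that reports the error to the client.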

Update pdf2json to version 1.1.9

The pdf2json package uses an old version of lodash (4.15), which has some vulnerabilities. Please update pdf2json to version 1.1.9, which will fix the issue.

Does not work in node v16.10.0

Describe the bug
When running with node v16.10.0, the methods parseBuffer and parseFileItems do not work as expected.

To Reproduce
Use node version 16.10.0 and try to read the text of a pdf using parseBuffer or parseFileItems

Expected behavior
The callback passed to parseBuffer or parseFileItems should be called with each pdf item found.

Current behavior
The callback passed to parseBuffer or parseFileItems only gets called once with the file: { path } or file: { buffer } data, and never gets called with the pdf items or pages.

Screenshots, outputs or logs

Desktop (please complete the following information):

  • OS: macOS Big Sur 11.4
  • Browser: node.js
  • Version: 16.10.0

Additional Info

The same code works correctly in at least node 16.3.0

Pdfreader cannot be used with Electron (because of `async` variable in pdf2json)

We wanted to test the library in Electron, but it does not properly import the async library, at pdfparser.js line 6:

async = require("async");

As you may know, async is a reserved word in Chrome, and this line imports the async library by assigning to async without var or let in front of it.

pdf2json dependency is broken

pdf2json update from 1.2.5 to 1.3.0 removed formImage, which pdfreader requires, without incrementing the major version number as they should with incompatible changes. NPM thinks that 1.2.5 and 1.3.* are compatible. As a temporary patch, you could set the dependency in package.json to something like <=1.2.5

Outdated version at NPM

Would it be possible to push latest version to NPM? The latest available there seems to be 1.0.7 while current one on github is 1.1.3.

How to get the metadata (author, name) from the file

How can I get metadata from the file, for example the author, name, etc.?
I'm trying something like:

const { PdfReader } = require("pdfreader");
const fs = require("fs");

fs.readFile("example.pdf", (err, pdfBuffer) => {
  // pdfBuffer contains the file content
  new PdfReader().parseBuffer(pdfBuffer, function (err, item) {
    if (err) {
      console.log("Error", err);
      return false;
    }

    if (item && item.file) { // <-- item.file not exists
    }

    if (item) {
      console.log(Object.keys(item)); // <-- has nothing to do with metadata
    }

  });
});

locate absolute position of item

I'm trying to find a string in a PDF and want to get its x and y location on the page.
It seems that item.x and item.y are relative to the item "above", and I cannot work out which offset to add to get the absolute position of an item.
Is there any way to do this?
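
For reference, the README describes text items as carrying 2D coordinates on the page, with a page item preceding the text items of each page. Under that assumption (which is worth verifying against your own files), a search could be sketched as follows. findTextPositions is a hypothetical helper, not part of pdfreader, and it only finds a string that sits entirely within one text item.

```javascript
// Hypothetical helper: returns the page and coordinates of every text item
// whose text contains `needle`. Coordinates are reported exactly as found
// on the item; no conversion is attempted.
function findTextPositions(items, needle) {
  const matches = [];
  let currentPage = 0; // updated each time a page item is seen
  for (const item of items) {
    if (!item) break; // null item: end of parsing
    if (item.page) {
      currentPage = item.page;
    } else if (item.text && item.text.includes(needle)) {
      matches.push({ page: currentPage, x: item.x, y: item.y, text: item.text });
    }
  }
  return matches;
}
```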

Wrong file content reading files with same path

I'm writing an application that reads the content of some files in a directory. Files are meant to be replaced (same filename but different content).

If I use parseFileItems two times with the same path but different files the result is always the content of the old file.

I worked around this by reading the file content with fs.readFile and passing the buffer to parseBuffer.

Your source code looks fine to me. Maybe it is a problem with pdf2json/pdfparser, but I'm not sure, so I'm reporting it to you.
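
The workaround described in this report can be sketched as below. parseFileFresh is a hypothetical helper name; it only combines fs.readFile with the parseBuffer method documented in the README, and makes no claim about what causes the stale content.

```javascript
const fs = require("fs");

// Hypothetical helper: reads the file from disk on every call and hands the
// bytes to reader.parseBuffer, instead of passing the path to the reader.
function parseFileFresh(reader, filePath, onItem) {
  fs.readFile(filePath, (err, pdfBuffer) => {
    if (err) return onItem(err); // report read errors through the same callback
    reader.parseBuffer(pdfBuffer, onItem);
  });
}
```

With the real library, the call would look like parseFileFresh(new PdfReader(), "some.pdf", (err, item) => { ... }), using the same callback shape as parseFileItems.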

Missing text height

Hello, I am using the code snippet from the documentation to parse lines of text from pdf page.
The lines of text are getting parsed, however for some reason the height / h property is missing from item.
I need the height in order to detect the text getting out of bound of a certain box on the pdf page.

Here is the code snippet, that I used:
let rows = {};
let addressRows = [];

const printRows = () => {
  Object.keys(rows)
    .sort((y1, y2) => parseFloat(y1) - parseFloat(y2))
    .forEach((y) => {
      addressRows.push((rows[y] || []).join(''));
    });
}

new pdfreader.PdfReader().parseFileItems(tempPath, function (err, item) {
  if (!item || item.page) {
    printRows();
    if (!item) {
      console.log('addressRows: ', addressRows)
    }
  } else if (item.text) {
    console.log(item);

    // accumulate text items into rows object, per line
    (rows[item.y] = rows[item.y] || []).push(item.text);
  }

});

Here is the log I received on terminal.

[Screenshot: Screen Shot 2020-12-12 at 5 34 23 PM]

Any idea why the height/h property is missing?
Thank you.

Having some trouble with parseTable

Hi there,

Think you got a good idea here, but I'm trying to figure out how to correctly parse a table. I don't think the displayTable() you have in your test file is logging.. I'm just having trouble figuring out the pattern. Anyways, do you have any advice for me?

Thanks in advance and I hope you have good day :)

Troy

var _ = require('lodash');
var PdfReader = require('pdfreader').PdfReader;
var Rule = require('pdfreader').Rule;

function displayTable(table){
    console.log('Object.keys(table)',Object.keys(table));
    _.map(table.rows, function(row){
        console.log('row',row);
    });
}
var sampleRules = [
    Rule.on(/^c1$/).parseTable(3).then(displayTable)
  ];
var processItemSample = Rule.makeItemProcessor(sampleRules);

var samplePathToPdf = __dirname + '/sample.pdf';
new PdfReader().parseFileItems(samplePathToPdf, function(err, item){
    if (err){
        console.log(err);
    }
    else {
        processItemSample(item);
    }
});

Here is my output

Object.keys(table) [ 'items', 'rows', 'matrix' ]
row [ { x: 20.408,
    y: 10.501,
    w: 0.9436,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'c2' },
  { x: 28.299,
    y: 10.501,
    w: 0.9436,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'c3' },
  { x: 14.979,
    y: 11.447,
    w: 0.5,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '1' },
  { x: 29.249,
    y: 11.447,
    w: 1.25,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '2.3' } ]
row [ { x: 19.513,
    y: 12.363,
    w: 2,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'hello' },
  { x: 27.068,
    y: 12.363,
    w: 2.333,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'world' },
  { x: 12.964,
    y: 13.248,
    w: 3.055,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: 'Values:' } ]
row [ { x: 12.964,
    y: 14.835,
    w: 0.5,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '1' },
  { x: 12.964,
    y: 16.423,
    w: 0.5,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '2' } ]
row [ { x: 12.964,
    y: 18.01,
    w: 0.5,
    clr: 0,
    A: 'left',
    R: [ [Object] ],
    text: '3' } ]

Fails when uploading file that contains comments within PDF

When uploading and processing a PDF that contains comments, pdfreader is unable to handle the request, and my backend Node service fails. I'm able to use PdfReader().parseBuffer(file, function(err, item) to process the buffered file, and it's able to read the file and the first item, but it fails going forward.

Is this a known bug? If so, is there any way I can handle this, or a way to detect that the file has comments and return an error? I've tried some workarounds, but the service just fails every time.

Invalid XRef stream header

Describe the bug
Unable to process PDF

To Reproduce

  1. Feed in the PDF as a buffer
  2. Attempt to extract text from the PDF

Expected behavior
Extract text from PDF

Screenshots, outputs or logs

    (while reading XRef): Error: Invalid XRef stream header

      at XRef_readXRef [as readXRef] (eval at Object.<anonymous> (node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:5682:9)

  console.log
    XRefParseException: 
        at XRefParseExceptionClosure (eval at Object.<anonymous> (/Users/tsopic/telegram_bot/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:379:34)
        at eval (eval at Object.<anonymous> (/Users/tsopic/repo/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:384:3)

Desktop (please complete the following information):

NODE - 14

tested on both mac and linux


embarrassingly slow

Describe the bug
Parsing a pdf file containing 235 pages takes up to 8 seconds (just doing nothing with the received tokens - apparently the lexer alone takes up that much time) :-p

To Reproduce

const parseStart = process.hrtime();
new PdfReader().parseBuffer(result.data, (err, item) => {
  // the pdf reader signals the end of the parsing process
  // by calling this function with the item set as undefined
  if (!item) {
    const parseEnd = process.hrtime(parseStart);
    this.logger.log(`parse pdf completed in ${parseEnd[0]}.${Math.floor(parseEnd[1] / 10e6)}s`);

    observer.next(table);
    observer.complete();
    return;
  }
});
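
A side note on the timing itself: the expression above concatenates the seconds with the hundredths of a second without zero-padding, so an elapsed time of 7.05 s would be printed as "7.5s". A small sketch of an alternative based on process.hrtime.bigint() follows; formatElapsed is a hypothetical helper name, and the sketch only measures time, it does not make parsing faster.

```javascript
// Hypothetical helper: formats the time between two process.hrtime.bigint()
// readings (in nanoseconds) as seconds with two decimals.
function formatElapsed(startNs, endNs) {
  const elapsedMs = Number((endNs - startNs) / 1000000n); // whole milliseconds
  return (elapsedMs / 1000).toFixed(2) + "s";
}

const start = process.hrtime.bigint();
// ... call parseBuffer here and take the second reading on the end-of-file item
const end = process.hrtime.bigint();
console.log("parse pdf completed in " + formatElapsed(start, end));
```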

Expected behavior
I would expect a PDF file of this size to take no longer than 2 seconds to parse.

Screenshots, outputs or logs
parse pdf completed in 7.42s SkybriefingDaylightAdapter

Desktop (please complete the following information):

  • OS: Windows 10, but it doesn't matter, it's the same on a Linux virtual machine.

Additional context
I have attached a sample file.
pdf.pdf

Unable to catch Parse error

Hi @adrienjoly ,

I am using pdfreader to parse ### pdf documents. However, in my application, if I run into a runtime error while parsing a PDF, I want to apply particular logic. Below are the code and the trace of the exception raised while reading a PDF document. The issue is that the error is not caught by the if (err) condition. Am I missing anything in catching the exception shown below?

Thanks,
Ji

Code snippet and exception trace:

function readPDFPages(buffer, reader = (new PdfReader())) {

  console.log('reading pdf pages: ');
  console.log(buffer);

  return new Promise((resolve, reject) => {
    let pages = [];
    reader.parseBuffer(buffer, (err, item) => {

      if (err) {
        console.log("err in parsed buffer");
        console.log(err);
        reject(err)
      }
      else if (!item)
        resolve(pages);

      else if (item.page)
        resolve(pages);
    });
  });

}

Exception trace:
Error: Illegal character: 41
    at error (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:195:9)
    at Lexer_getObj [as getObj] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24616:11)
    at Parser_shift [as shift] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24038:32)
    at Parser_makeStream [as makeStream] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24195:12)
    at Parser_getObj [as getObj] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:24079:18)
    at XRef_fetch [as fetch] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:5753:22)
    at XRef_fetchIfRef [as fetchIfRef] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:5699:19)
    at Dict_get [as get] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:4759:28)
    at Page_getPageProp [as getPageProp] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:4213:28)
    at Page.get content [as content] (eval at <anonymous> (/var/task/node_modules/pdf2json/lib/pdf.js:64:1), <anonymous>:4227:19)

having trouble parsing data that extends between pages

so far, usage of this library has been really good, but I've run into an issue. basically I have a table I'm parsing data from (not using pdfreader.TableParser right now) that is split between two pages.

when I parse through each page, I use logic that finds the heading of the title to determine where it begins, and the heading of the next table to determine where it ends.

if I cannot parse across both pages, I cannot get all the data from the table.

from my understanding, I am looping through each page in my below code. I would love any suggestions as I've sort of hit a roadblock here.

please note that I'm using the Serverless framework and invoking it that way; sorry if it's very unrepeatable for anybody.

code:

function getNextIndexItem(rows, num, currentItem) {
  /*
  get the item that appears next in the array passed into the rows param.
  param rows: array of rows parsed on the page
  param num: number of indexes past the current index (currentItem)
  param currentItem: the index of the current item in the array
  */ 
  let keys = Object.keys(rows);
  let nextIndex = keys.indexOf(currentItem) + num;
  let nextItem = keys[nextIndex];
  let nextField = (rows[nextItem] || []).join(' ');
  let finalStr = nextField.split(':')[nextField.split(':').length-1];

  return finalStr;
}

function pdfReader(pdfFilePath, parsedData, callback) {
  const pdfreader = require("pdfreader");

  let rows = {}; // indexed by y-position
  let tableIndexes = 0;

  new pdfreader.PdfReader().parseFileItems(pdfFilePath, (err, item) => {
    if (err) callback(err);

    if (item) {
      if (item.page) {
        // end of file, or page
        Object.keys(rows) // => array of y-positions (type: float)
          .sort((y1, y2) => parseFloat(y1) - parseFloat(y2)) // sort float positions
          .forEach(yValue => {
            // rows[y] is an array of text for a line.
            let line = (rows[yValue] || []).join('');  // construct line of text

            if (line.includes('Table Name')) {
              tableIndexes = 0;
              for (let i = 0; i < 500; i++) {
                if (getNextIndexItem(rows, i, yValue).includes('Next Table Name')) {
                  tableIndexes = i;  // get index of last table row
                  break;
                }
              }
              for (let i = 2; i < tableIndexes; i++) {  // start at 2 to avoid the heading row of the table
                console.log(`List row #${i}: ${getNextIndexItem(rows, i, yValue)}`);
              }
            }
          });
        rows = {}; // clear rows for next page
      } else if (item.text) {
        if (!rows[item.y]) {
          rows[item.y] = [];
        }
        rows[item.y].push(item.text);
      }
    } else {
      // we're done here
      callback(null, parsedData);
    }
  });
};

module.exports.test = function() {
  // pass a parsedData object as the second argument, matching pdfReader's signature
  pdfReader('doc.pdf', {}, (err, data) => {
    if (err) console.log(err);
  });
}
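
One way to approach the problem described above is to stop resetting the rows at each page, and instead build a single list of lines ordered by page and then by y, from which the table rows can be sliced. The sketch below works on an array of collected items; linesAcrossPages and linesBetween are hypothetical helpers, not part of pdfreader, and the sketch assumes that no repeated page header or footer falls between the two headings.

```javascript
// Hypothetical helper: turns a flat list of items into lines of text that
// keep their page order, so a table split across pages stays contiguous.
function linesAcrossPages(items) {
  const rows = new Map(); // "page|y" -> { page, y, texts }
  let page = 0;
  for (const item of items) {
    if (!item) break; // null item: end of parsing
    if (item.page) {
      page = item.page;
    } else if (item.text) {
      const key = page + "|" + item.y;
      if (!rows.has(key)) rows.set(key, { page, y: item.y, texts: [] });
      rows.get(key).texts.push(item.text);
    }
  }
  return [...rows.values()]
    .sort((a, b) => a.page - b.page || a.y - b.y) // by page, then by y
    .map((row) => row.texts.join(""));
}

// Returns the lines strictly between the first line containing `startHeading`
// and the next line containing `endHeading` (or the end of the document).
function linesBetween(lines, startHeading, endHeading) {
  const start = lines.findIndex((line) => line.includes(startHeading));
  if (start === -1) return [];
  const end = lines.findIndex((line, i) => i > start && line.includes(endHeading));
  return lines.slice(start + 1, end === -1 ? lines.length : end);
}
```

For example, linesBetween(linesAcrossPages(items), 'Table Name', 'Next Table Name') returns the table's lines, including its heading row, regardless of where the page break falls.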

here's the table in the PDF:
[Screenshot: Capture]

Some characters are missing / corrupt (e.g. ligatures)

First, I just want to thank you for creating this package. It's really helped us.

Describe the bug
While most of the text is there, a few characters are missing from my PDF.

Here's the PDF. It was produced by using a headless Chrome 67.0.3396.87 on Ubuntu to print the screen to PDF.
Scenario-4.1-RiskTables-FQA.pdf

To Reproduce
Here's a minimalist test:

const PdfReader = require("pdfreader").PdfReader;
const fs = require("fs");
const path = require("path");

const filename = path.join("c:", "temp", "Scenario-4.1-RiskTables-FQA.pdf");
console.log("Reading " + filename + "...");

new Promise((resolve, reject) => {
    let pdfText = "";
    fs.readFile(filename, (err, pdfBuffer) => {
        console.log("Found buffer with " + pdfBuffer.length + " bytes.");
        new PdfReader().parseBuffer(pdfBuffer, function(err, item){
            if (err) {
                reject(err);
            } else if (!item) {
                resolve(pdfText);
            } else if (item.text) {
                //console.log("Found item: " + JSON.stringify(item));
                pdfText += item.text;
            }
        });
    });
}).then((pdfText) => {
    console.log("Found PDF Text: " + pdfText);
}).catch(e => {
    console.log("ERROR", e);
});

Expected behavior
I would expect to see all of the characters. Open the PDF and you'll notice the sentence "Effective RMP:" on the first page just above "Default 5x5 RMP V1.0". In the text that gets exported from the file, it says "E ective RMP".

Screenshots, outputs or logs
Here's the log of what this program produces for me:

Reading c:\temp\Scenario-4.1-RiskTables-FQA.pdf...
Found buffer with 69972 bytes.
Found PDF Text: 8/29/2018QbDVisionRiskTablesabout:blank1/3QbDVisionExportedBy:RyanRocketExportDate:Aug29,2018at1:44pm G MT C ompany:RocketsRUSProject:PRJ-6-PrintTestProjectReportDate:Aug29,2018RiskTablesReportī‚†FQARiskTableAsofAug29,2018at11:59pm G MTRiskTable:FQARiskTableDate:Aug29,2018E ectiveRMP:Default5x5RMPV1.08/29/2018QbDVisionRiskTablesabout:blank2/3FQA-32-Appearance[NOTAPPROVED]1(1%) C olor,shapeandappearancearenotdirectlylinkedtosafetyande cacy.Therefore,theyarenotcritical.10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQA-40-Assay[NOTAPPROVED]100(100%)Processvariablesmaya ecttheassayofthedrugproduct.1000(100%)10000(100%)IPTandRelease C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]TPP-95-Overdosage[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]FQA-52- C ontainer C losureSystem[NOTAPPROVED]100(100%)Packagingoptionshavenotbeenidenti ed1000(100%)10000(100%)SuitablepackagingoptionswillbeinvestigatedduringdevelopmentprocessNone C M-78-NA[NOTAPPROVED]TPP-101-HowSupplied/StorageandHandling[NOTAPPROVED]FQA-45- C ontentUniformity[NOTAPPROVED]100(100%)Variabilityincontentuniformitywilla ectsafetyande cacy.1000(100%)10000(100%)Bothformulationandprocessvariablesimpactcontentuniformity,sothis C QAwillbeevaluatedthroughoutproductandprocessdevelopment.ReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]TPP-95-Overdosage[NOTAPPROVED]FQA-42-DegradationProducts[NOTAPPROVED]100(100%)Formulationandprocessvariablescanimpactdegradationproducts.1000(100%)10000(100%)Degradationproductswillbeassessedduringproductandprocessdevelopment.IPTandRelease C M-79-Unknown[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]TPP-101-HowSupplied/StorageandHandling[NOTAPPROVED]FQA-47-Dissolution[NOTAPPROVED]100(100%)Bothformulationandprocessvariablesa ectthedissolutionpro le.1000(100%)10000(100%)This C 
QAwillbeinvestigatedthroughoutformulationandprocessdevelopment.ReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-97- C linicalPharmacology[NOTAPPROVED]FQA-37-Friability[NOTAPPROVED]25(25%)AtargetofNMT1.0%w/wofmeanweightlossassuresalowimpactonpatientsafetyande cacyandminimizescustomercomplaints.250(25%)2500(25%)ReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-97- C linicalPharmacology[NOTAPPROVED]FQA-38-Identi cation[NOTAPPROVED]100(100%)Identi cationiscriticalforsafetyande cacy.1000(100%)10000(100%)IPTandRelease C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]FQA-50-MicrobialLimits[NOTAPPROVED]10(10%)Non-compliancewithmicrobiallimitswillimpactpatientsafety.However,inthiscase,theriskofmicrobialgrowthisverylowbecauserollercompaction(drygranulation)isutilizedforthisproduct.Therefore,this C QAwillnotbediscussedindetailduringformulationandprocessdevelopment.100(10%)1000(10%)NoneReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]FQA-33-Odor[NOTAPPROVED]1(1%)Ingeneral,anoticeableodorisnotdirectlylinkedtosafetyande cacy,butodorcana ectpatientacceptability.10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQA-49-ResidualSolvents[NOTAPPROVED]5(5%)Residualsolventscanimpactsafety.However,nosolventisusedinthedrugproductmanufacturingprocessandthedrugproductcomplieswithUSP<467>Option1.Therefore,formulationandprocessvariablesareunlikelytoimpactthis C QA.50(5%)500(5%)NoneReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-91-AdverseReactions[NOTAPPROVED]TPP-98-NonclinicalToxicology[NOTAPPROVED]FQA-35-Score C on guration[NOTAPPROVED]1(1%)Scorecon gurationisnotcriticalfortheacetriptantablet.10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQA-34-Size[NOTAPPROVED]1(1%)SeeTargetJusti cation10(1%)100(1%)None C M-78-NA[NOTAPPROVED]NoneFQAī…• C riticalityī… C riticalityJusti cationī…ProcessRiskī…RPNī…RecommendedActionsī… C ontrolStrategyī… C 
ontrolMethodsī…TPPLinksī…8/29/2018QbDVisionRiskTablesabout:blank3/3©2018 C herry C ircleSoftware,Inc.FQA-43-Water C ontent[NOTAPPROVED]25(25%)However,inthiscase,acetriptanisnotsensitivetohydrolysisandmoisturewillnotimpactstability.250(25%)2500(25%)NoneReleaseTestOnly C M-79-Unknown[NOTAPPROVED]TPP-88-DosageFormsandStrengths[NOTAPPROVED]FQAī…• C riticalityī… C riticalityJusti cationī…ProcessRiskī…RPNī…RecommendedActionsī… C ontrolStrategyī… C ontrolMethodsī…TPPLinksī…

Desktop (please complete the following information):

  • OS: Ubuntu 14, running in a Docker container
  • Browser: Chrome 67.0.3396.87
  • Version: pdfreader v0.2.5

Additional context
Thank you again for creating this package.
