ffalt / pdf.js-extract

nodejs lib for extracting data from PDF files

License: Other

JavaScript 99.84% Shell 0.11% TypeScript 0.05%

pdf node-module extracting-data

pdf.js-extract's Issues

Read binary?

eg.

pdfExtract.extract(Buffer.from(pdf.Body), options, (err, data) => {
  if (err) return console.log(err);
  console.log(data);
});

Cyrillic script - supported?

I found that pdf.js-extract has a problem with Cyrillic script. Is there any workaround for that?

Text in red is ignored

It is impossible to find any of the red text from the PDF file in the extracted output.

Can't we retrieve text coordinates with pdf.js-extract?

I was working with pdfjs-dist and then discovered another library called pdf.js-extract that I was curious to try out. However, the issue I encountered is that I can't seem to retrieve coordinates in pdf.js-extract like I can in pdfjs-dist.

With pdfjs-dist, we get information like this:

{
    "str": "MY TEXT",
    "dir": "ltr",
    "width": 35.11999999999999,
    "height": 8,
    "transform": [
        8,
        0,
        0,
        8,
        279.94,
        473.06
    ],
    "fontName": "g_d0_f11",
    "hasEOL": false
}

But with pdf.js-extract, I'm getting this:

{
    "x": 279.94,
    "y": 375.45,
    "str": "MY TEXT",
    "dir": "ltr",
    "width": 35.11999999999999,
    "height": 8,
    "fontName": "Helvetica"
}

Both outputs are different. In my use case, I need to obtain the Y-coordinate of words in a PDF file. In pdfjs-dist, transform[5] provides the Y-coordinate (473.06), which I use for PDF modification. However, in pdf.js-extract, the Y-value is (375.45) which is different.

Can you help me figure out what I can do in this situation?
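One possible explanation, offered as an assumption rather than confirmed behavior: pdfjs-dist reports transform[5] in PDF user space, whose origin is the bottom-left corner of the page, while pdf.js-extract appears to report y measured down from the top edge. If that holds, the two values are related through the page height, which pdf.js-extract is believed to expose as pageInfo.height on each page. The sketch below uses a hypothetical helper name (toBottomLeftY) and a page height of 848.51 picked only so that the two numbers quoted above line up; depending on the library version, the item height may also have to be taken into account.

```javascript
// Hypothetical helper: convert a top-origin y (as pdf.js-extract seems to
// report it) into a bottom-origin y (as pdfjs-dist's transform[5] reports it).
// Assumes an unrotated page at scale 1.
function toBottomLeftY(item, pageHeight) {
  return pageHeight - item.y;
}

const item = { x: 279.94, y: 375.45, str: "MY TEXT", width: 35.12, height: 8 };
const pageHeight = 848.51; // assumed value; in practice read it from page.pageInfo.height
console.log(toBottomLeftY(item, pageHeight).toFixed(2));
```

If the result does not match transform[5] for a known item, the assumption about the origin (or about rotation and scale) does not hold for that document.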

Error

I tried to call pdfExtract.extract() and got this error message:
'The Array.prototype contains unexpected enumerable properties: clone, last, first, max, min, insert; thus breaking e.g. for...in iteration of Arrays.',
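The message suggests that some other code in the project has added methods (clone, last, first, ...) to Array.prototype by plain assignment, which makes them enumerable, and that the bundled pdf.js refuses to load in that state. A possible workaround, sketched here under the assumption that it can run before pdf.js-extract is first required, is to redefine those properties as non-enumerable. The helper name hideEnumerableArrayExtensions is hypothetical.

```javascript
// Hypothetical helper: mark every enumerable own property of Array.prototype
// as non-enumerable, leaving the functions themselves in place.
function hideEnumerableArrayExtensions(proto = Array.prototype) {
  const hidden = [];
  // Object.keys lists only own, enumerable, string-keyed properties.
  for (const key of Object.keys(proto)) {
    Object.defineProperty(proto, key, { enumerable: false });
    hidden.push(key);
  }
  return hidden;
}

// Intended order of use (not executed here):
// hideEnumerableArrayExtensions();
// const { PDFExtract } = require("pdf.js-extract");
```

This only works for properties that are still configurable, which is the case for methods added by simple assignment.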

Mismatch between no. of divs in PDF.js and 0.2.0 version

Hi,

Thanks a lot for creating this library. I have been using it for more than a year now. I was working with version 0.1.5, and everything was working fine: I could match the number of divs created by PDF.js in the web browser.

But after upgrading to version 0.2.0, I noticed that the number of divs generated is significantly higher than with 0.1.5. Hence, they no longer match the divs generated by PDF.js in the web browser. Please check whether there is a bug or whether I am missing something.

Extract Arabic Words from PDF

Hi, this library is cool. But how can I extract Arabic text as a whole, just like ordinary characters?

I'd appreciate any leads on this.

Thanks.

Convert back from json to pdf

Thank you for creating this project.
It works perfectly.

Is there a way to convert the extracted json to pdf?

Error after build for util.js

(node:32392) UnhandledPromiseRejectionWarning: Error: ENOENT: no such file or directory, open '//../base/shared/util.js'
at Object.openSync (fs.js:462:3)
at Object.readFileSync (fs.js:364:35)
at D:\Users\DSingh69\MyWork\ProjectsCode\NELP\CART_T\API\dist\index.js:341:847420
at Array.forEach (<anonymous>)

It does not work for remote pdf files? ENOENT error

  import { PDFExtract, PDFExtractOptions } from 'pdf.js-extract';
  const pdfExtract = new PDFExtract();
  const options: PDFExtractOptions = { };
  async function main() {
    const res = await pdfExtract.extract('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf', options);
  }
  main()

[Error: ENOENT: no such file or directory, open 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf'] {
  errno: -2,
  code: 'ENOENT',
  syscall: 'open',
  path: 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf'
}

Node v20
"pdf.js-extract": "^0.2.1"

Am I missing something?

I also tried, but same error:

import { readFile } from 'fs/promises';
import { PDFExtract, PDFExtractOptions } from 'pdf.js-extract';
const buffer = await readFile('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf');
pdfExtract.extractBuffer(buffer, options, (err, data) => {
  if (err) {
    return console.error(err);
  }
  console.log(data);
});

Does it work only with local files?
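The ENOENT error shows the URL being handed to a file-system open call, so both extract() and readFile() appear to treat their argument as a local path. A minimal sketch of one way around this, assuming Node 18+ where fetch is a global: download the bytes first and pass them to extractBuffer(), the method already used in the snippet above. The helper name fetchPdfBuffer and its fetchImpl parameter are hypothetical.

```javascript
// Hypothetical helper: download a PDF and return its bytes as a Buffer.
// fetchImpl defaults to Node's global fetch (assumed available, Node 18+).
async function fetchPdfBuffer(url, fetchImpl = globalThis.fetch) {
  const response = await fetchImpl(url);
  if (!response.ok) {
    throw new Error(`Download failed with HTTP ${response.status}: ${url}`);
  }
  return Buffer.from(await response.arrayBuffer());
}

// Intended use (not executed here):
// const buffer = await fetchPdfBuffer(
//   "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf");
// pdfExtract.extractBuffer(buffer, options, (err, data) => { /* ... */ });
```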

Incomplete Extraction?

I am trying to extract individual words and their coordinates from a PDF file. I have tried it on a few files. On one of the files, it does the extraction perfectly. However, on the other file (which only has one line in it), it returns the entire sentence as a "word" without parsing out the individual words or their coordinates. I made this PDF myself by creating a file in Microsoft Word and exporting it as a PDF. I've added a screenshot of what I get when I run:

pdfExtract.extract(filePath, options, (err, data) => {
  if (err) {
    return console.log(err);
  }
  data.pages.forEach(function (page) {
    page.content.forEach(function (content) {
      console.log(content);
    });
  });
});

5678.pdf

My Error

A month ago I used the version 0.2.0 library and it worked really well. But a week ago, when I used it again, it returned an error like this:
Error
at BaseExceptionClosure (\node_modules\pdf.js-extract\lib\pdfjs\pdf.js:565:29)
at Array.<anonymous> (\node_modules\pdf.js-extract\lib\pdfjs\pdf.js:568:2)
at __w_pdfjs_require__ (\node_modules\pdf.js-extract\lib\pdfjs\pdf.js:24196:41)
at \node_modules\pdf.js-extract\lib\pdfjs\pdf.js:24425:13
at \node_modules\pdf.js-extract\lib\pdfjs\pdf.js:24469:3
at \node_modules\pdf.js-extract\lib\pdfjs\pdf.js:24472:12
at webpackUniversalModuleDefinition (\node_modules\pdf.js-extract\lib\pdfjs\pdf.js:25:20)
at Object.<anonymous> (\node_modules\pdf.js-extract\lib\pdfjs\pdf.js:32:3)
at Module._compile (internal/modules/cjs/loader.js:1063:30)
at Module._extensions..js (internal/modules/cjs/loader.js:1092:10)
at Object.require.extensions.<computed> [as .js] (\node_modules\babel-watch\runner.js:64:7)
at Module.load (internal/modules/cjs/loader.js:928:32)
at Function.Module._load (internal/modules/cjs/loader.js:769:14)
at Module.require (internal/modules/cjs/loader.js:952:19)
at require (internal/modules/cjs/helpers.js:88:18)
at Object.<anonymous> (\node_modules\pdf.js-extract\lib\cmap-reader.js:2:18) {
message: 'The Array.prototype contains unexpected enumerable properties: clone, last, first, max, min, insert; thus breaking e.g. for...in iteration of Arrays.',
name: 'UnknownErrorException',
details: 'Error: The Array.prototype contains unexpected enumerable properties: clone, last, first, max, min, insert; thus breaking e.g. for...in iteration of Arrays.

My code:
const pdfExtract = new PDFExtract();
const options = {};
pdfExtract.extract(file, options, (err, extractResult) => {
  if (err) {
    console.error(err);
  }
  // ...do something
});

My project is using Node version 14.15.3

Cannot read property 'src' of null

TypeError: Cannot read property 'src' of null
at D:\Projects\PPT-PDF\ppt-pdf\node_modules\pdfjs-dist-for-node\build\pdf.combined.js:44786:53
at Object.<anonymous> (D:\Projects\PPT-PDF\ppt-pdf\node_modules\pdfjs-dist-for-node\build\pdf.combined.js:
at Module._compile (module.js:624:30)
at Object.Module._extensions..js (module.js:635:10)
at Module.load (module.js:545:32)
at tryModuleLoad (module.js:508:12)
at Function.Module._load (module.js:500:3)
at Module.require (module.js:568:17)
at require (internal/module.js:11:18)
at Object.<anonymous> (D:\Projects\PPT-PDF\ppt-pdf\node_modules\pdf.js-extract\lib\pdf-extract.js:15:16)

Is it Angular 6 ready?

When using the library in Angular 6, I get these errors during compilation. Any idea how to resolve them?

ERROR in ./node_modules/pdf.js-extract/lib/pdfjs/pdf.js
Module not found: Error: Can't resolve 'http' in 'F:.........\node_modules\pdf.js-extract\lib\pdfjs'
ERROR in ./node_modules/pdf.js-extract/lib/pdfjs/pdf.js
Module not found: Error: Can't resolve 'https' in 'F:.........\node_modules\pdf.js-extract\lib\pdfjs'

Thanks,
-Deepak

Setting up fake worker failed: "Cannot read property 'WorkerMessageHandler' of undefined"

Not really working. I just tried to copy the simple example into an electron application and I got the error: "Setting up fake worker failed: "Cannot read property 'WorkerMessageHandler' of undefined""

Can't resolve 'fs'

Getting the error below when trying to compile this.

error - ./node_modules/pdf.js-extract/lib/cmap-reader.js:1:0
Module not found: Can't resolve 'fs'

Installation as such is working fine, no missing dependencies are reported.

Windows Print pdf extract

Hello!
First of all great work!
I encountered an issue extracting data from a PDF saved from the print dialog. In the metadata the producer is "Microsoft: Print To PDF". I did not have any issues with other producers such as "iText 4.2.0 by 1T3XT".
Is there any solution?
I am using the PDFExtract class and its extract method to get every page's content.
Version: "pdf.js-extract": "^0.2.1"

Thanks,
Vasilis

Some characters not being extracted

I use this package to get the text out of pdfs to make them searchable. Occasionally some characters are extracted as unknown symbols (displayed as boxes).

The following characters display as 003 on the pdf but for some reason are extracted as \x00 which is a hex code for null I think.

{
  x: 139.8924901141838,
  y: 192.57599427999992,
  str: '\x00',
  dir: 'ltr',
  width: 5.62011142035425,
  height: 10.504880634296622,
  fontName: 'g_d0_f2'
},
{
  x: 145.51259390011606,
  y: 192.57599427999992,
  str: '\x00',
  dir: 'ltr',
  width: 5.62011142035425,
  height: 10.504880634296622,
  fontName: 'g_d0_f2'
},
{
  x: 151.1326976860483,
  y: 192.57599427999992,
  str: '\x00',
  dir: 'ltr',
  width: 5.62011142035425,
  height: 10.504880634296622,
  fontName: 'g_d0_f2'
}

Setting up fake worker failed: "Cannot find module './vendors~pdfjsWorker.js'

I get this while trying to use the pdf.js-extract npm library from Node.js. I am using it to extract text from PDFs on the backend.

Fix for Y coordinate with new version of pdfjs

Hi @ffalt thank you a lot for this project. I have successfully been using your extractBuffer function in a browser environment.

Working with pdfjs-dist V4.0.269 I noticed that the y coordinate is slightly wrong. If you consider upgrading pdfjs I had success calculating the y coordinate in the following way:

page.getTextContent().then((content) => {
	// Content contains lots of information about the text layout and styles, but we need only strings at the moment
	pag.content = content.items.map((item) => {
		const tx = Util.transform(viewport.transform, item.transform);
		return {
			x: tx[4],
			y: tx[5] - item.height,
			str: item.str,
			dir: item.dir,
			width: item.width,
			height: item.height,
			fontName: item.fontName
		};
	});
})

This would replace the block that you currently have here

I hope this will be of help

can't resolve "../build/Release/canvas.node" in NX mono repo project

Upon deployment of my lambda function to AWS I receive the following error:

[ERROR] Could not resolve "../build/Release/canvas.node"
    node_modules/.pnpm/[email protected]/node_modules/canvas/lib/bindings.js:3:25:
      3 │ const bindings = require('../build/Release/canvas.node')
        ╵                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I am currently working on a MacBook running macOS Ventura 13.4 in an NX monorepo project. Installation with pnpm succeeded without trouble.

full logs:

pnpm nx deploy outlook-notification-service --all

   ✔    4/4 dependent project tasks succeeded [4 read from cache]

   Hint: you can run the command with --verbose to see the full dependent project outputs

 ————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————


> nx run outlook-notification-service:deploy --all

OptionsParsedExecutorInterface {"all":true,"parseArgs":{"all":true},"sourceRoot":"apps/outlook-notification-service/src","root":"apps/outlook-notification-service"}
nodeCommandWithRelativePath node /Users/jopluysterburg/nn-si-apps/node_modules/aws-cdk/bin/cdk.js deploy
Executing command: node /Users/jopluysterburg/nn-si-apps/node_modules/aws-cdk/bin/cdk.js deploy --all true
Bundling asset jopluysterburg-outlook-notification-service-delta-query/run-cycle-lambda/Code/Stage...
  ...8fd9b885fff3a4aea42171ec022a7be290526cb840f39d5fde19b45/index.js  1.8mb ⚠️
⚡ Done in 320ms
Bundling asset jopluysterburg-outlook-notification-service-delta-query/restart-cycle-lambda/Code/Stage...
  ...d50faf401aaba372e1b18435da90c9514d6f56fb5fda6d330d5c458/index.js  1.8mb ⚠️
⚡ Done in 62ms
Bundling asset jopluysterburg-outlook-notification-service-moev-worker/moev-worker/Code/Stage...
✘ [ERROR] Could not resolve "../build/Release/canvas.node"
    node_modules/.pnpm/[email protected]/node_modules/canvas/lib/bindings.js:3:25:
      3 │ const bindings = require('../build/Release/canvas.node')
        ╵                          ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error
/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/core/lib/asset-staging.js:2
`),localBundling=options.local?.tryBundle(bundleDir,options),!localBundling){const assetStagingOptions={sourcePath:this.sourcePath,bundleDir,...options};switch(options.bundlingFileAccess){case bundling_1.BundlingFileAccess.VOLUME_COPY:new asset_staging_1.AssetBundlingVolumeCopy(assetStagingOptions).run();break;case bundling_1.BundlingFileAccess.BIND_MOUNT:default:new asset_staging_1.AssetBundlingBindMount(assetStagingOptions).run();break}}}catch(err){const bundleErrorDir=bundleDir+"-error";throw fs.existsSync(bundleErrorDir)&&fs.removeSync(bundleErrorDir),fs.renameSync(bundleDir,bundleErrorDir),new Error(`Failed to bundle asset ${this.node.path}, bundle output is located at ${bundleErrorDir}: ${err}`)}if(fs_1.FileSystem.isEmpty(bundleDir)){const outputDir=localBundling?bundleDir:AssetStaging.BUNDLING_OUTPUT_DIR;throw new Error(`Bundling did not produce any output. Check that content is written to ${outputDir}.`)}}calculateHash(hashType,bundling,outputDir){if(hashType==assets_1.AssetHashType.CUSTOM||hashType==assets_1.AssetHashType.SOURCE&&bundling){const hash=crypto.createHash("sha256");return hash.update(this.customSourceFingerprint??fs_1.FileSystem.fingerprint(this.sourcePath,this.fingerprintOptions)),bundling&&hash.update(JSON.stringify(bundling)),hash.digest("hex")}switch(hashType){case assets_1.AssetHashType.SOURCE:return fs_1.FileSystem.fingerprint(this.sourcePath,this.fingerprintOptions);case assets_1.AssetHashType.BUNDLE:case assets_1.AssetHashType.OUTPUT:if(!outputDir)throw new Error(`Cannot use \`${hashType}\` hash type when \`bundling\` is not specified.`);return fs_1.FileSystem.fingerprint(outputDir,this.fingerprintOptions);default:throw new Error("Unknown asset hash type.")}}}_a=JSII_RTTI_SYMBOL_1,AssetStaging[_a]={fqn:"aws-cdk-lib.AssetStaging",version:"2.86.0"},AssetStaging.BUNDLING_INPUT_DIR="/asset-input",AssetStaging.BUNDLING_OUTPUT_DIR="/asset-output",AssetStaging.assetCache=new cache_1.Cache,exports.AssetStaging=AssetStaging;function 
renderAssetFilename(assetHash,extension=""){return`asset.${assetHash}${extension}`}function determineHashType(assetHashType,customSourceFingerprint){const hashType=customSourceFingerprint?assetHashType??assets_1.AssetHashType.CUSTOM:assetHashType??assets_1.AssetHashType.SOURCE;if(customSourceFingerprint&&hashType!==assets_1.AssetHashType.CUSTOM)throw new Error(`Cannot specify \`${assetHashType}\` for \`assetHashType\` when \`assetHash\` is specified. Use \`CUSTOM\` or leave \`undefined\`.`);if(hashType===assets_1.AssetHashType.CUSTOM&&!customSourceFingerprint)throw new Error("`assetHash` must be specified when `assetHashType` is set to `AssetHashType.CUSTOM`.");return hashType}function calculateCacheKey(props){return crypto.createHash("sha256").update(JSON.stringify(sortObject(props))).digest("hex")}function sortObject(object){if(typeof object!="object"||object instanceof Array)return object;const ret={};for(const key of Object.keys(object).sort())ret[key]=sortObject(object[key]);return ret}function singleArchiveFile(directory){if(!fs.existsSync(directory))throw new Error(`Directory ${directory} does not exist.`);if(!fs.statSync(directory).isDirectory())throw new Error(`${directory} is not a directory.`);const content=fs.readdirSync(directory);if(content.length===1){const file=path.join(directory,content[0]),extension=getExtension(content[0]).toLowerCase();if(fs.statSync(file).isFile()&&ARCHIVE_EXTENSIONS.includes(extension))return file}}function determineBundledAsset(bundleDir,outputType){const archiveFile=singleArchiveFile(bundleDir);switch(outputType===bundling_1.BundlingOutput.AUTO_DISCOVER&&(outputType=archiveFile?bundling_1.BundlingOutput.ARCHIVED:bundling_1.BundlingOutput.NOT_ARCHIVED),outputType){case bundling_1.BundlingOutput.NOT_ARCHIVED:return{path:bundleDir,packaging:assets_1.FileAssetPackaging.ZIP_DIRECTORY};case bundling_1.BundlingOutput.ARCHIVED:if(!archiveFile)throw new Error("Bundling output directory is expected to include only a single archive 
file when `output` is set to `ARCHIVED`");return{path:archiveFile,packaging:assets_1.FileAssetPackaging.FILE,extension:getExtension(archiveFile)}}}function getExtension(source){for(const ext of ARCHIVE_EXTENSIONS)if(source.toLowerCase().endsWith(ext))return ext;return path.extname(source)}
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          ^
Error: Failed to bundle asset jopluysterburg-outlook-notification-service-moev-worker/moev-worker/Code/Stage, bundle output is located at /Users/jopluysterburg/nn-si-apps/apps/outlook-notification-service/cdk.out/bundling-temp-d584e2c42df1368879da8c18f61404b80a3bf918e8d5566b694c054423a0bcef-error: Error: bash -c pnpm exec -- esbuild --bundle "/Users/jopluysterburg/nn-si-apps/apps/outlook-notification-service/src/lambda/moev-worker.ts" --target=node18 --platform=node --outfile="/Users/jopluysterburg/nn-si-apps/apps/outlook-notification-service/cdk.out/bundling-temp-d584e2c42df1368879da8c18f61404b80a3bf918e8d5566b694c054423a0bcef/index.js" --external:@aws-sdk/* run in directory /Users/jopluysterburg/nn-si-apps exited with status 1
    at AssetStaging.bundle (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/core/lib/asset-staging.js:2:603)
    at AssetStaging.stageByBundling (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/core/lib/asset-staging.js:1:4544)
    at stageThisAsset (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/core/lib/asset-staging.js:1:2005)
    at Cache.obtain (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/core/lib/private/cache.js:1:242)
    at new AssetStaging (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/core/lib/asset-staging.js:1:2400)
    at new Asset (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/aws-s3-assets/lib/asset.js:1:736)
    at AssetCode.bind (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/aws-lambda/lib/code.js:1:4628)
    at new Function (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/aws-lambda/lib/function.js:1:7479)
    at new NodejsFunction (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/aws-lambda-nodejs/lib/function.js:1:1174)
    at new WorkerStack (/Users/jopluysterburg/nn-si-apps/apps/outlook-notification-service/lib/worker-stack.ts:62:5)
npm notice
npm notice New major version of npm available! 8.19.3 -> 10.0.0
npm notice Changelog: <https://github.com/npm/cli/releases/tag/v10.0.0>
npm notice Run `npm install -g npm@10.0.0` to update!
npm notice
Subprocess exited with error 1

 ————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 >  NX   Ran target deploy for project outlook-notification-service and 4 task(s) they depend on (18s)

         With additional flags:
           --all=true

    ✖    1/5 failed
    ✔    4/5 succeeded [4 read from cache]

Using this library in the browser

Hi! Thanks a lot for this library. I'm using PDFExtract.extractBuffer from the browser and it works perfectly to find text in a PDF. The only problem is the stubs added for the Node environment. I understand this library is meant to be used in Node but it basically eases PDF.js usage for my use case a lot.

Would you consider adding stubs only if they are not present already or making a different entry point for browser? (without 'fs' and 'stubs' dependencies).

Otherwise, could you please explain what you removed from PDF.js to make it more lightweight? Thanks!

json data issue

For some PDFs the extracted JSON items are ordered vertically, for others horizontally, and for some the order is mixed. Is it possible to figure out which ordering applies?

Is there a way to determine styles of the text?

I need to extract some relevant text that is mostly bold in the PDF. But I cannot find an obvious indication in the output of whether the text is bold or has any other style applied to it. It would be nice to have that.

Synchronous version

I am trying to extract text from multiple one-page files with a loop to a new file path but tasks get completed out of order. Also, some loops return undefined because the promise did not resolve yet. I tried chaining thens and async/awaits but I wasn't able to get it to work. This async version would be optimal for an expert but a synchronous version would be perfect for the unfamiliar in avoiding the asynchronous headaches!
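Out-of-order results and undefined values in a loop often come from starting every extraction without awaiting it. A minimal sketch of processing the files strictly one after another with for...of and await follows. The name extractInOrder is hypothetical, and extractFn stands in for something like (p) => pdfExtract.extract(p, {}), assuming extract() returns a promise when no callback is passed, as the await-based snippet elsewhere on this page suggests.

```javascript
// Hypothetical helper: run an async extraction for each path, one at a time,
// and collect the results in the same order as the input paths.
async function extractInOrder(paths, extractFn) {
  const results = [];
  for (const path of paths) {
    // await inside for...of pauses the loop until this file is done
    const data = await extractFn(path);
    results.push({ path, data });
  }
  return results;
}

// Intended use (not executed here):
// const results = await extractInOrder(files, (p) => pdfExtract.extract(p, {}));
```

The code is still asynchronous, but the caller gets one ordered array once everything has finished, which removes most of the ordering headaches described above.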

Get position / coordinates also for substrings / parts of the text-chunks extracted from the PDF-file?

pdf.js-extract (and also pdf.js on its own) makes it possible to get the x- and y-coordinates as well as the width and height of the text chunks within the PDF file. But is there any way to also get the coordinates of parts of these chunks, i.e. substrings of the str string of the content items extracted via pdf.js(-extract)? For instance, in the "Hello, world!" example I would like to know the x-coordinate of where "Hello" ends or where "world" starts.

Is there any way more comfortable than trying to calculate this via the width of the single characters of the given font?

The reason I'm trying to do this is that I want to search for a specific text in the PDF-file and find out its coordinates in order to be able to do something with it, draw a rectangle around it to highlight it for example. Any advice on how to achieve this would be welcome. Thanks in advance!
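Without per-glyph font metrics, a rough approximation is to assume that every character in a chunk has the same width and to scale the chunk's width accordingly. The sketch below does exactly that; it is only an estimate for proportional fonts, and the helper name approximateSubstringBox is hypothetical. The item shape (x, y, width, height, str) matches the content items shown elsewhere on this page.

```javascript
// Hypothetical helper: estimate the bounding box of the first occurrence of
// `needle` inside a content item, assuming all characters are equally wide.
function approximateSubstringBox(item, needle) {
  const start = item.str.indexOf(needle);
  if (start === -1 || item.str.length === 0) return null;
  const charWidth = item.width / item.str.length;
  return {
    x: item.x + start * charWidth,
    y: item.y,
    width: needle.length * charWidth,
    height: item.height,
  };
}

const chunk = { x: 10, y: 20, width: 130, height: 12, str: "Hello, world!" };
console.log(approximateSubstringBox(chunk, "world"));
```

For a highlight rectangle, some padding around the estimated box would absorb part of the error.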

Is there a way to extract links from a PDF?

Extract images?

It would be nice if the lib could extract a list of images (identifiers/pathnames by page) and offer a way to extract some of them. This is especially useful for detecting when a PDF has no text but many images, so that a separate OCR job can be started.

Thank you for your work 👍.

Coordinate system

Is there a specific reason the coordinate system seems to be flipped? The Y axis runs from top to bottom while the X axis runs from left to right. You can see how initially this might be confusing to folks new to this package. Was there a specific reason why you opted to not have the axes share the same origin?

0
| Y
|
|
| X
↓ -------------→
0

Missing information

The output of parsing of my pdf file is missing lots of information.
201800000893524691.pdf

normalizeWhitespace parameter for getTextContent() option

Can we expand the options so that getTextContent() can run with the normalizeWhitespace parameter?

pdf.js-extract/lib/index.js

Line 66 in b1babe1

return page.getTextContent().then(function (content) {

so it would look like :

const normalizeWhitespaceParam = options && options.normalizeWhitespace === true ? true : false;
return page.getTextContent({ normalizeWhitespace: normalizeWhitespaceParam }).then(function (content) {

source: https://github.com/mozilla/pdf.js/blob/b2e7d0c89b76e228e49c7cee759873322a442f62/src/display/api.js#L779
thanks

Is there a way to get the fill color for text?

Basically the title. I am using TypeScript, so maybe the types are not up to date, but I don't see a way to get the color information for a text item.

Support in React Native

How can this library be used in a React Native application?

Get Coordinates of Each word.

Hi,
Is it possible to get the coordinates of each word in the PDF? The "Hello, world!" output is a chunk of words; I want to extract each word as a separate item, i.e. Hello; ,; world; ! all separate. Is it possible?

page order changed

Howdy,
I recently upgraded from 0.1.1 to 0.1.3 and have found that it extracts pages in a different order on some PDFs. Have you encountered this? Is it a pdf.js-extract issue (e.g. lib/index.js) or an issue in pdf.js itself?

I've just tested, and 0.1.2 has the same page ordering. What commitId corresponds to v0.1.2?

Error when using library in a serverless function, bundled via webpack

I am trying to use this library in a project that is deployed to AWS Lambda. The code is bundled via webpack (serverless-webpack). During execution I get the following error:

ERROR	webpack://herakles/node_modules/pdf.js-extract/lib/pdfjs/pdf.js:14614
          _this10._readyCapability.reject(new Error("Setting up fake worker failed: \"".concat(reason.message, "\".")));
                                          ^

Error: Setting up fake worker failed: "Cannot find module './pdf.worker.js'
Require stack:
- /var/task/src/verify-pdf.js
- /var/runtime/UserFunction.js
- /var/runtime/index.js".
    at null.<anonymous> (webpack://herakles/node_modules/pdf.js-extract/lib/pdfjs/pdf.js:14614:43)
    at processTicksAndRejections (internal/process/task_queues.js:95:5)
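The trace suggests that the bundled pdf.js tries to require './pdf.worker.js' at runtime and that the file is not part of the webpack output. One approach that may help, sketched here as an untested assumption for this project, is to leave pdf.js-extract out of the bundle through webpack's externals option, so that its files are resolved from node_modules at runtime. This only works if node_modules is shipped with the function, which the deployment setup has to take care of.

```javascript
// webpack.config.js (fragment; a sketch, not verified against this project)
const config = {
  target: "node",
  externals: {
    // Keep the library as a plain runtime require() so that its own files,
    // including pdf.worker.js, are loaded from node_modules.
    "pdf.js-extract": "commonjs pdf.js-extract",
  },
};

module.exports = config;
```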

Error: No "GlobalWorkerOptions.workerSrc" specified.

Hello,

I'm having an issue trying to use extractBuffer inside of a jest test. Any call results in the error
Error: No "GlobalWorkerOptions.workerSrc" specified.

I've even tried adding these 2 lines to the lib/index.js file
const worker = require("./pdfjs/pdf.worker.js");
pdfjsLib.GlobalWorkerOptions.workerSrc = worker

which just results in a promise that never resolves.

I'm kind of at a loss here; any insight would be appreciated.

Thanks

Safeguard stubbing in case of shared library

While using this library I discovered a small issue: We have a monorepo with a shared package that contains domain logic, which is partly used by our backend and partly by a react frontend. This shared package uses pdf.js-extract for server-side stuff. Since we also import some typings and other methods inside our react client (that runs in a browser) we get several assertion errors from this line

pdf.js-extract/lib/index.js

Line 6 in 1df4d00

require("./pdfjs/domstubs.js").setStubs(global);

In the end we don't consume pdf.js-extract in our client, but the above line is called nonetheless because of the shared package between backend and frontend. Would it be possible to safeguard the stubbing by checking whether the file is run inside Node.js?

if (typeof window === 'undefined' || typeof process === 'object') {
  require("./pdfjs/domstubs.js").setStubs(global);
}

In-the-middle word breaking on diacritics

I have such output:

[
  {
    x: 214.17,
    y: 83.52999999999997,
    str: 'PE',
    dir: 'ltr',
    width: 10.005,
    height: 7.5,
    fontName: 'Helvetica'
  },
  {
    x: 224.18,
    y: 83.52999999999997,
    str: 'ŁNOMOCNICTWO / ',
    dir: 'ltr',
    width: 72.47999999999999,
    height: 7.5,
    fontName: 'g_d0_f1'
  }
]

It's supposed to be found as one word, but it seems the library breaks it into two, treating the letter with the diacritic as a special symbol. Is there any way to get around this issue? It should be 'PEŁNOMOCNICTWO' as one word.

Same with other words that include diacritics.
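In the output above the second item starts almost exactly where the first one ends (214.17 + 10.005 ≈ 224.18) and both share the same y, so one possible workaround is to post-process the items and merge neighbours that touch on the same line. The sketch below assumes the items arrive in reading order; the helper name mergeAdjacentItems and the 0.5 tolerance are hypothetical choices.

```javascript
// Hypothetical helper: merge consecutive items that sit on the same line and
// whose boxes touch horizontally (within `tolerance` units).
function mergeAdjacentItems(items, tolerance = 0.5) {
  const merged = [];
  for (const item of items) {
    const prev = merged[merged.length - 1];
    const sameLine = prev && Math.abs(prev.y - item.y) <= tolerance;
    const touching = prev && Math.abs(prev.x + prev.width - item.x) <= tolerance;
    if (sameLine && touching) {
      prev.str += item.str;
      prev.width = item.x + item.width - prev.x; // span both boxes
    } else {
      merged.push({ ...item }); // copy so the input is not mutated
    }
  }
  return merged;
}
```

The merged item keeps the fontName of its first part, so this loses the information that the font changed mid-word.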

How to extract columns?

I am trying to extract the columns, but I cannot find a documented way to do so anywhere, the way there is for extracting rows. Can anyone provide a working example or a walkthrough for this?
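One way to approximate columns from the raw content items is to group them by their x coordinate. The sketch below assumes left-aligned columns, so that all items of a column start at roughly the same x; the helper name groupIntoColumns and the tolerance of 2 units are hypothetical.

```javascript
// Hypothetical helper: bucket items whose x is within `tolerance` of a
// column's first x, then order each column from top to bottom.
function groupIntoColumns(items, tolerance = 2) {
  const columns = [];
  const sorted = [...items].sort((a, b) => a.x - b.x);
  for (const item of sorted) {
    const column = columns.find((c) => Math.abs(c.x - item.x) <= tolerance);
    if (column) {
      column.items.push(item);
    } else {
      columns.push({ x: item.x, items: [item] });
    }
  }
  for (const column of columns) {
    column.items.sort((a, b) => a.y - b.y);
  }
  return columns;
}

// Intended use (not executed here):
// const columns = groupIntoColumns(data.pages[0].content);
```

Right-aligned or centered columns would need a different key, for example x + width or the horizontal midpoint.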

Setting up fake worker failed

Hi! First of all, thank you for your library.

I'm using Next.js and calling the pdf.js-extract library from the pages/api folder.
It works perfectly in development, but it crashes in production with the following error:

Setting up fake worker failed: "Cannot find module './pdf.worker.js' Require stack: - /var/task/node_modules/pdf.js-extract/lib/pdfjs/pdf.js - /var/task/node_modules/pdf.js-extract/lib/cmap-reader.js - /var/task/node_modules/pdf.js-extract/lib/index.js - /var/task/.next/server/pages/api/notas-fiscais/extract.js - /var/task/node_modules/next/dist/server/next-server.js - /var/task/___next_launcher.cjs".

This is my package.json:

{
  "name": "rbna-automations",
  "version": "0.1.0",
  "private": true,
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "next start",
    "lint": "next lint"
  },
  "dependencies": {
    "@types/node": "18.15.11",
    "@types/react": "18.0.35",
    "@types/react-dom": "18.0.11",
    "devextreme": "22.2.5",
    "devextreme-react": "22.2.5",
    "eslint": "8.38.0",
    "eslint-config-next": "13.3.0",
    "mindee": "^3.7.3",
    "multiparty": "^4.2.3",
    "next": "13.3.0",
    "openai": "^3.2.1",
    "pdf.js-extract": "^0.2.1",
    "react": "18.2.0",
    "react-dom": "18.2.0",
    "typescript": "5.0.4",
    "xlsx": "https://cdn.sheetjs.com/xlsx-0.19.3/xlsx-0.19.3.tgz"
  },
  "devDependencies": {
    "@types/multiparty": "^0.0.33"
  }
}

I found this thread: mozilla/pdf.js#12066, but couldn't figure out how to fix the problem in the context of pdf.js-extract.

Thank you very much!
