ffalt / pdf.js-extract
Node.js library for extracting data from PDF files
License: Other
eg.
pdfExtract.extract(Buffer.from(pdf.Body), options, (err, data) => {
  if (err) return console.log(err);
  console.log(data);
});
I found that pdf.js-extract has a problem with Cyrillic script. Is there any workaround for that?
Impossible to find any red text color in the pdf file.
I was working with pdfjs-dist and then discovered another library called pdf.js-extract that I was curious to try out. However, I can't seem to retrieve coordinates in pdf.js-extract like I can in pdfjs-dist.
With pdfjs-dist, we get information like this:
{
  "str": "MY TEXT",
  "dir": "ltr",
  "width": 35.11999999999999,
  "height": 8,
  "transform": [8, 0, 0, 8, 279.94, 473.06],
  "fontName": "g_d0_f11",
  "hasEOL": false
}
But with pdf.js-extract, I'm getting this:
{
  "x": 279.94,
  "y": 375.45,
  "str": "MY TEXT",
  "dir": "ltr",
  "width": 35.11999999999999,
  "height": 8,
  "fontName": "Helvetica"
}
The two outputs differ. In my use case, I need the Y-coordinate of words in a PDF file. In pdfjs-dist, transform[5] provides the Y-coordinate (473.06), which I use for PDF modification. In pdf.js-extract, however, the Y-value is 375.45, which is different.
Can you help me figure out what I can do in this situation?
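A possible way to reconcile the two values, offered as a sketch and not as the library's documented behaviour: another report on this page notes that pdf.js-extract's Y axis runs from top to bottom, so the y value may simply be the page height minus transform[5]. Under that assumption the two sample values add up to a page height of about 848.51. The function names below are hypothetical, and the page height would have to come from your own extraction result (the library appears to report it per page, which is worth verifying against your output).

```javascript
// Convert a top-down y value back to PDF user space (origin at bottom-left).
// Assumption: y was computed as pageHeight - transform[5] for an unrotated page.
function toPdfUserSpaceY(topDownY, pageHeight) {
  return pageHeight - topDownY;
}

// Infer the page height implied by one matching pair of values.
function impliedPageHeight(topDownY, userSpaceY) {
  return topDownY + userSpaceY;
}

// Sample values taken from the two outputs above.
const inferredHeight = impliedPageHeight(375.45, 473.06);
const userSpaceY = toPdfUserSpaceY(375.45, inferredHeight);
console.log(userSpaceY.toFixed(2)); // prints "473.06"
```

If the pages are rotated, this simple subtraction will not hold and the full viewport transform is needed.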
I tried to call pdfExtract.extract() and got this error message: 'The Array.prototype contains unexpected enumerable properties: clone, last, first, max, min, insert; thus breaking e.g. for...in iteration of Arrays.'
Hi,
Thanks a lot for creating this library. I have been using it for more than a year now. I was working with version 0.1.5 and everything was working fine: the number of divs matched those created by PDF.js in the web browser.
But after upgrading to version 0.2.0, I noticed that the number of divs generated is significantly higher than with 0.1.5, so they no longer match the divs generated by PDF.js in the web browser. Please check whether this is a bug or whether I am missing something.
Hi, this library is cool. But how can I extract Arabic text as a whole, just like ordinary characters?
I'd appreciate any leads on this.
Thanks.
Thank you for creating this project.
It works perfectly.
Is there a way to convert the extracted json to pdf?
(node:32392) UnhandledPromiseRejectionWarning: Error: ENOENT: no such file or directory, open '//../base/shared/util.js'
at Object.openSync (fs.js:462:3)
at Object.readFileSync (fs.js:364:35)
at D:\Users\DSingh69\MyWork\ProjectsCode\NELP\CART_T\API\dist\index.js:341:847420
at Array.forEach ()
import { PDFExtract, PDFExtractOptions } from 'pdf.js-extract';

const pdfExtract = new PDFExtract();
const options: PDFExtractOptions = { };

async function main() {
  const res = await pdfExtract.extract('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf', options);
}

main()
[Error: ENOENT: no such file or directory, open 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf'] {
errno: -2,
code: 'ENOENT',
syscall: 'open',
path: 'https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf'
}
Node v20
"pdf.js-extract": "^0.2.1"
Am I missing something?
I also tried, but same error:
import { readFile } from 'fs/promises';
import { PDFExtract, PDFExtractOptions } from 'pdf.js-extract';

const buffer = await readFile('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf');
pdfExtract.extractBuffer(buffer, options, (err, data) => {
  if (err) {
    return console.error(err);
  }
  console.log(data);
});
Does it work only with local files?
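The ENOENT errors above suggest that both extract() and readFile() treat the string as a local path, so a URL is never downloaded. One workaround, sketched here under the assumption of Node 18+ (where fetch is a global), is to download the file into a Buffer first and pass that to extractBuffer. The helper names isRemoteSource and loadPdfBuffer are hypothetical and not part of pdf.js-extract.

```javascript
// Decide whether a source string is an http(s) URL rather than a local path.
function isRemoteSource(source) {
  return /^https?:\/\//i.test(source);
}

// Load a PDF into a Buffer from either a URL or a local path.
// Assumes Node 18+, where fetch is available as a global.
async function loadPdfBuffer(source) {
  if (isRemoteSource(source)) {
    const response = await fetch(source);
    if (!response.ok) {
      throw new Error(`Download failed: ${response.status} ${response.statusText}`);
    }
    return Buffer.from(await response.arrayBuffer());
  }
  // Dynamic import keeps the sketch usable from both CommonJS and ES modules.
  const { readFile } = await import('node:fs/promises');
  return readFile(source);
}

// Usage sketch (pdfExtract and options as in the snippets above):
// const buffer = await loadPdfBuffer('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf');
// pdfExtract.extractBuffer(buffer, options, (err, data) => { /* ... */ });
```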
I am trying to extract individual words and their coordinates from a PDF file. I have tried it on a few files. On one of the files, it does the extraction perfectly. However, on another file (which only has one line in it), it returns the entire sentence as a single "word" without splitting out the individual words or their coordinates. I made this PDF myself by creating a file in Microsoft Word and exporting it as a PDF. I've added a screenshot of what I get when I run
pdfExtract.extract(filePath, options, (err, data) => {
  if (err) {
    return console.log(err);
  }
  data.pages.forEach(function(page, index, object) {
    page.content.forEach(function(content, index, object) {
      console.log(content);
    });
  });
});
A month ago I used version 0.2.0 of the library and it worked really well. But a week ago, when I used it again, it returned an error like this:
Error
    at BaseExceptionClosure (\node_modules\pdf.js-extract\lib\pdfjs\pdf.js:565:29)
    at Array. (\node_modules\pdf.js-extract\lib\pdfjs\pdf.js:568:2)
    at w_pdfjs_require (\node_modules\pdf.js-extract\lib\pdfjs\pdf.js:24196:41)
    at \node_modules\pdf.js-extract\lib\pdfjs\pdf.js:24425:13
    at \node_modules\pdf.js-extract\lib\pdfjs\pdf.js:24469:3
    at \node_modules\pdf.js-extract\lib\pdfjs\pdf.js:24472:12
    at webpackUniversalModuleDefinition (\node_modules\pdf.js-extract\lib\pdfjs\pdf.js:25:20)
    at Object. (\node_modules\pdf.js-extract\lib\pdfjs\pdf.js:32:3)
    at Module._compile (internal/modules/cjs/loader.js:1063:30)
    at Module._extensions..js (internal/modules/cjs/loader.js:1092:10)
    at Object.require.extensions. [as .js] (\node_modules\babel-watch\runner.js:64:7)
    at Module.load (internal/modules/cjs/loader.js:928:32)
    at Function.Module._load (internal/modules/cjs/loader.js:769:14)
    at Module.require (internal/modules/cjs/loader.js:952:19)
    at require (internal/modules/cjs/helpers.js:88:18)
    at Object. (\node_modules\pdf.js-extract\lib\cmap-reader.js:2:18) {
  message: 'The Array.prototype contains unexpected enumerable properties: clone, last, first, max, min, insert; thus breaking e.g. for...in iteration of Arrays. ',
  name: 'UnknownErrorException',
  details: 'Error: The Array.prototype contains unexpected enumerable properties: clone, last, first, max, min, insert; thus breaking e.g. for...in iteration of Arrays.
My code:
const pdfExtract = new PDFExtract();
const options = {};
pdfExtract.extract(file, options, (err, extractResult) => {
  if (err) {
    console.error(err);
  }
  // ...do something
});
My project is using Node version 14.15.3
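The message itself points at the likely cause: something in the project (often a utility library or polyfill) has added methods such as clone, last and first to Array.prototype by plain assignment, which makes them enumerable. The following is a diagnostic sketch with hypothetical helper names, not a fix shipped by pdf.js-extract. It lists the offending properties and re-defines them as non-enumerable so they stay callable but no longer appear in for...in loops.

```javascript
// List properties that for...in would visit on any array because they were
// added to Array.prototype (or further up the chain) as enumerable.
function findEnumerableArrayPrototypeProps() {
  const found = [];
  for (const key in Array.prototype) {
    found.push(key);
  }
  return found;
}

// Re-define each such own property as non-enumerable so it stays callable
// but is skipped by for...in iteration.
function hideEnumerableArrayPrototypeProps() {
  for (const key of findEnumerableArrayPrototypeProps()) {
    const descriptor = Object.getOwnPropertyDescriptor(Array.prototype, key);
    if (descriptor && descriptor.configurable) {
      Object.defineProperty(Array.prototype, key, { ...descriptor, enumerable: false });
    }
  }
}

console.log(findEnumerableArrayPrototypeProps());
```

Because the stack trace shows the check firing while the module is being loaded, a workaround like this would presumably have to run before the first require of pdf.js-extract and after the code that extends Array.prototype.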
TypeError: Cannot read property 'src' of null
at D:\Projects\PPT-PDF\ppt-pdf\node_modules\pdfjs-dist-for-node\build\pdf.combined.js:44786:53
at Object. (D:\Projects\PPT-PDF\ppt-pdf\node_modules\pdfjs-dist-for-node\build\pdf.combined.js:
at Module._compile (module.js:624:30)
at Object.Module._extensions..js (module.js:635:10)
at Module.load (module.js:545:32)
at tryModuleLoad (module.js:508:12)
at Function.Module._load (module.js:500:3)
at Module.require (module.js:568:17)
at require (internal/module.js:11:18)
at Object. (D:\Projects\PPT-PDF\ppt-pdf\node_modules\pdf.js-extract\lib\pdf-extract.js:15:16)
When using this library in Angular 6, I get these errors during compilation. Any idea how to resolve them?
ERROR in ./node_modules/pdf.js-extract/lib/pdfjs/pdf.js
Module not found: Error: Can't resolve 'http' in 'F:.........\node_modules\pdf.js-extract\lib\pdfjs'
ERROR in ./node_modules/pdf.js-extract/lib/pdfjs/pdf.js
Module not found: Error: Can't resolve 'https' in 'F:.........\node_modules\pdf.js-extract\lib\pdfjs'
Thanks,
-Deepak
Not really working. I just tried to copy the simple example into an electron application and I got the error: "Setting up fake worker failed: "Cannot read property 'WorkerMessageHandler' of undefined""
Getting the error below when trying to compile this.
error - ./node_modules/pdf.js-extract/lib/cmap-reader.js:1:0
Module not found: Can't resolve 'fs'
Installation as such is working fine, no missing dependencies are reported.
Hello!
First of all great work!
I encountered an issue extracting data from a PDF downloaded from the print dialog. In the metadata the producer is "Microsoft: Print To PDF". I did not have any issues with other producers such as "iText 4.2.0 by 1T3XT".
Is there any solution?
I am using the PDFExtract class and its extract method to get the content of every page.
Version: "pdf.js-extract": "^0.2.1"
Thanks,
Vasilis
I use this package to get the text out of pdfs to make them searchable. Occasionally some characters are extracted as unknown symbols (displayed as boxes).
The following characters display as 003 on the pdf but for some reason are extracted as \x00 which is a hex code for null I think.
{
  x: 139.8924901141838,
  y: 192.57599427999992,
  str: '\x00',
  dir: 'ltr',
  width: 5.62011142035425,
  height: 10.504880634296622,
  fontName: 'g_d0_f2'
},
{
  x: 145.51259390011606,
  y: 192.57599427999992,
  str: '\x00',
  dir: 'ltr',
  width: 5.62011142035425,
  height: 10.504880634296622,
  fontName: 'g_d0_f2'
},
{
  x: 151.1326976860483,
  y: 192.57599427999992,
  str: '\x00',
  dir: 'ltr',
  width: 5.62011142035425,
  height: 10.504880634296622,
  fontName: 'g_d0_f2'
}
Getting this while trying to use pdf.js-extract npm library from NodeJS. Using it to extract text from PDF from backend
Hi @ffalt, thank you a lot for this project. I have successfully been using your extractBuffer function in a browser environment.
Working with pdfjs-dist V4.0.269 I noticed that the y coordinate is slightly wrong. If you consider upgrading pdfjs I had success calculating the y coordinate in the following way:
page.getTextContent().then((content) => {
  // Content contains lots of information about the text layout and styles, but we need only strings at the moment
  pag.content = content.items.map((item) => {
    const tx = Util.transform(viewport.transform, item.transform);
    return {
      x: tx[4],
      y: tx[5] - item.height,
      str: item.str,
      dir: item.dir,
      width: item.width,
      height: item.height,
      fontName: item.fontName
    };
  });
})
This would replace the block that you currently have here
I hope this will be of help
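To make the arithmetic in that snippet concrete, here is a standalone sketch of the matrix product that, to my understanding, Util.transform in pdf.js performs on two six-element affine matrices. The viewport matrix and page height below are assumptions (an unrotated page at scale 1, with a hypothetical height of 848.51), and the item transform is borrowed from an earlier report on this page.

```javascript
// Multiply two affine matrices given as [a, b, c, d, e, f].
// Written to mirror the product Util.transform is understood to compute.
function transform(m1, m2) {
  return [
    m1[0] * m2[0] + m1[2] * m2[1],
    m1[1] * m2[0] + m1[3] * m2[1],
    m1[0] * m2[2] + m1[2] * m2[3],
    m1[1] * m2[2] + m1[3] * m2[3],
    m1[0] * m2[4] + m1[2] * m2[5] + m1[4],
    m1[1] * m2[4] + m1[3] * m2[5] + m1[5],
  ];
}

// Assumed viewport matrix for an unrotated page at scale 1:
// keep x, flip y, then shift by the page height.
const assumedPageHeight = 848.51; // hypothetical value
const viewportTransform = [1, 0, 0, -1, 0, assumedPageHeight];
const itemTransform = [8, 0, 0, 8, 279.94, 473.06];

const tx = transform(viewportTransform, itemTransform);
console.log(tx[4].toFixed(2), tx[5].toFixed(2)); // prints "279.94 375.45"
```

With these inputs, tx[5] is the baseline measured from the top of the page, so subtracting item.height as in the snippet above would move the value to the top edge of the text.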
Upon deployment of my lambda function to AWS I receive the following error:
[ERROR] Could not resolve "../build/Release/canvas.node"
node_modules/.pnpm/[email protected]/node_modules/canvas/lib/bindings.js:3:25:
3 │ const bindings = require('../build/Release/canvas.node')
╵ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
I am currently working on a MacBook running macOS Ventura 13.4 in an Nx monorepo project. Installation with pnpm succeeded without trouble.
full logs:
pnpm nx deploy outlook-notification-service --all
✔ 4/4 dependent project tasks succeeded [4 read from cache]
Hint: you can run the command with --verbose to see the full dependent project outputs
————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
> nx run outlook-notification-service:deploy --all
OptionsParsedExecutorInterface {"all":true,"parseArgs":{"all":true},"sourceRoot":"apps/outlook-notification-service/src","root":"apps/outlook-notification-service"}
nodeCommandWithRelativePath node /Users/jopluysterburg/nn-si-apps/node_modules/aws-cdk/bin/cdk.js deploy
Executing command: node /Users/jopluysterburg/nn-si-apps/node_modules/aws-cdk/bin/cdk.js deploy --all true
Bundling asset jopluysterburg-outlook-notification-service-delta-query/run-cycle-lambda/Code/Stage...
...8fd9b885fff3a4aea42171ec022a7be290526cb840f39d5fde19b45/index.js 1.8mb ⚠️
⚡ Done in 320ms
Bundling asset jopluysterburg-outlook-notification-service-delta-query/restart-cycle-lambda/Code/Stage...
...d50faf401aaba372e1b18435da90c9514d6f56fb5fda6d330d5c458/index.js 1.8mb ⚠️
⚡ Done in 62ms
Bundling asset jopluysterburg-outlook-notification-service-moev-worker/moev-worker/Code/Stage...
✘ [ERROR] Could not resolve "../build/Release/canvas.node"
node_modules/.pnpm/[email protected]/node_modules/canvas/lib/bindings.js:3:25:
3 │ const bindings = require('../build/Release/canvas.node')
╵ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 error
/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/core/lib/asset-staging.js:2
`),localBundling=options.local?.tryBundle(bundleDir,options),!localBundling){const assetStagingOptions={sourcePath:this.sourcePath,bundleDir,...options};switch(options.bundlingFileAccess){case bundling_1.BundlingFileAccess.VOLUME_COPY:new asset_staging_1.AssetBundlingVolumeCopy(assetStagingOptions).run();break;case bundling_1.BundlingFileAccess.BIND_MOUNT:default:new asset_staging_1.AssetBundlingBindMount(assetStagingOptions).run();break}}}catch(err){const bundleErrorDir=bundleDir+"-error";throw fs.existsSync(bundleErrorDir)&&fs.removeSync(bundleErrorDir),fs.renameSync(bundleDir,bundleErrorDir),new Error(`Failed to bundle asset ${this.node.path}, bundle output is located at ${bundleErrorDir}: ${err}`)}if(fs_1.FileSystem.isEmpty(bundleDir)){const outputDir=localBundling?bundleDir:AssetStaging.BUNDLING_OUTPUT_DIR;throw new Error(`Bundling did not produce any output. Check that content is written to ${outputDir}.`)}}calculateHash(hashType,bundling,outputDir){if(hashType==assets_1.AssetHashType.CUSTOM||hashType==assets_1.AssetHashType.SOURCE&&bundling){const hash=crypto.createHash("sha256");return hash.update(this.customSourceFingerprint??fs_1.FileSystem.fingerprint(this.sourcePath,this.fingerprintOptions)),bundling&&hash.update(JSON.stringify(bundling)),hash.digest("hex")}switch(hashType){case assets_1.AssetHashType.SOURCE:return fs_1.FileSystem.fingerprint(this.sourcePath,this.fingerprintOptions);case assets_1.AssetHashType.BUNDLE:case assets_1.AssetHashType.OUTPUT:if(!outputDir)throw new Error(`Cannot use \`${hashType}\` hash type when \`bundling\` is not specified.`);return fs_1.FileSystem.fingerprint(outputDir,this.fingerprintOptions);default:throw new Error("Unknown asset hash type.")}}}_a=JSII_RTTI_SYMBOL_1,AssetStaging[_a]={fqn:"aws-cdk-lib.AssetStaging",version:"2.86.0"},AssetStaging.BUNDLING_INPUT_DIR="/asset-input",AssetStaging.BUNDLING_OUTPUT_DIR="/asset-output",AssetStaging.assetCache=new cache_1.Cache,exports.AssetStaging=AssetStaging;function 
renderAssetFilename(assetHash,extension=""){return`asset.${assetHash}${extension}`}function determineHashType(assetHashType,customSourceFingerprint){const hashType=customSourceFingerprint?assetHashType??assets_1.AssetHashType.CUSTOM:assetHashType??assets_1.AssetHashType.SOURCE;if(customSourceFingerprint&&hashType!==assets_1.AssetHashType.CUSTOM)throw new Error(`Cannot specify \`${assetHashType}\` for \`assetHashType\` when \`assetHash\` is specified. Use \`CUSTOM\` or leave \`undefined\`.`);if(hashType===assets_1.AssetHashType.CUSTOM&&!customSourceFingerprint)throw new Error("`assetHash` must be specified when `assetHashType` is set to `AssetHashType.CUSTOM`.");return hashType}function calculateCacheKey(props){return crypto.createHash("sha256").update(JSON.stringify(sortObject(props))).digest("hex")}function sortObject(object){if(typeof object!="object"||object instanceof Array)return object;const ret={};for(const key of Object.keys(object).sort())ret[key]=sortObject(object[key]);return ret}function singleArchiveFile(directory){if(!fs.existsSync(directory))throw new Error(`Directory ${directory} does not exist.`);if(!fs.statSync(directory).isDirectory())throw new Error(`${directory} is not a directory.`);const content=fs.readdirSync(directory);if(content.length===1){const file=path.join(directory,content[0]),extension=getExtension(content[0]).toLowerCase();if(fs.statSync(file).isFile()&&ARCHIVE_EXTENSIONS.includes(extension))return file}}function determineBundledAsset(bundleDir,outputType){const archiveFile=singleArchiveFile(bundleDir);switch(outputType===bundling_1.BundlingOutput.AUTO_DISCOVER&&(outputType=archiveFile?bundling_1.BundlingOutput.ARCHIVED:bundling_1.BundlingOutput.NOT_ARCHIVED),outputType){case bundling_1.BundlingOutput.NOT_ARCHIVED:return{path:bundleDir,packaging:assets_1.FileAssetPackaging.ZIP_DIRECTORY};case bundling_1.BundlingOutput.ARCHIVED:if(!archiveFile)throw new Error("Bundling output directory is expected to include only a single archive 
file when `output` is set to `ARCHIVED`");return{path:archiveFile,packaging:assets_1.FileAssetPackaging.FILE,extension:getExtension(archiveFile)}}}function getExtension(source){for(const ext of ARCHIVE_EXTENSIONS)if(source.toLowerCase().endsWith(ext))return ext;return path.extname(source)}
^
Error: Failed to bundle asset jopluysterburg-outlook-notification-service-moev-worker/moev-worker/Code/Stage, bundle output is located at /Users/jopluysterburg/nn-si-apps/apps/outlook-notification-service/cdk.out/bundling-temp-d584e2c42df1368879da8c18f61404b80a3bf918e8d5566b694c054423a0bcef-error: Error: bash -c pnpm exec -- esbuild --bundle "/Users/jopluysterburg/nn-si-apps/apps/outlook-notification-service/src/lambda/moev-worker.ts" --target=node18 --platform=node --outfile="/Users/jopluysterburg/nn-si-apps/apps/outlook-notification-service/cdk.out/bundling-temp-d584e2c42df1368879da8c18f61404b80a3bf918e8d5566b694c054423a0bcef/index.js" --external:@aws-sdk/* run in directory /Users/jopluysterburg/nn-si-apps exited with status 1
at AssetStaging.bundle (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/core/lib/asset-staging.js:2:603)
at AssetStaging.stageByBundling (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/core/lib/asset-staging.js:1:4544)
at stageThisAsset (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/core/lib/asset-staging.js:1:2005)
at Cache.obtain (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/core/lib/private/cache.js:1:242)
at new AssetStaging (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/core/lib/asset-staging.js:1:2400)
at new Asset (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/aws-s3-assets/lib/asset.js:1:736)
at AssetCode.bind (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/aws-lambda/lib/code.js:1:4628)
at new Function (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/aws-lambda/lib/function.js:1:7479)
at new NodejsFunction (/Users/jopluysterburg/nn-si-apps/node_modules/.pnpm/[email protected][email protected]/node_modules/aws-cdk-lib/aws-lambda-nodejs/lib/function.js:1:1174)
at new WorkerStack (/Users/jopluysterburg/nn-si-apps/apps/outlook-notification-service/lib/worker-stack.ts:62:5)
npm notice
npm notice New major version of npm available! 8.19.3 -> 10.0.0
npm notice Changelog: <https://github.com/npm/cli/releases/tag/v10.0.0>
npm notice Run `npm install -g [email protected]` to update!
npm notice
Subprocess exited with error 1
————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————
> NX Ran target deploy for project outlook-notification-service and 4 task(s) they depend on (18s)
With additional flags:
--all=true
✖ 1/5 failed
✔ 4/5 succeeded [4 read from cache]
Hi! Thanks a lot for this library. I'm using PDFExtract.extractBuffer from the browser and it works perfectly to find text in a PDF. The only problem is the stubs added for the Node environment. I understand this library is meant to be used in Node, but it makes PDF.js much easier to use for my use case.
Would you consider adding stubs only if they are not present already or making a different entry point for browser? (without 'fs' and 'stubs' dependencies).
Otherwise, could you please explain what you removed from PDF.js to make it more lightweight? Thanks!
Some of the PDF-to-JSON output is laid out vertically, some horizontally, and some is a mix of both. Is it possible to figure out which is which?
I need to extract some relevant text which are mostly bold in the pdf. But I do not find an obvious way where it indicates if the text is bold or has any other styles applied to it. Would be nice if it could have that.
I am trying to extract text from multiple one-page files with a loop to a new file path but tasks get completed out of order. Also, some loops return undefined because the promise did not resolve yet. I tried chaining thens and async/awaits but I wasn't able to get it to work. This async version would be optimal for an expert but a synchronous version would be perfect for the unfamiliar in avoiding the asynchronous headaches!
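A common pattern for this kind of problem is to await each extraction inside a for...of loop, which finishes one file before starting the next. The sketch below is generic: extractOne is a hypothetical stand-in for whatever returns a promise per file (for example a wrapper around pdfExtract.extract, which another snippet on this page awaits directly).

```javascript
// Run an async extraction for each path one at a time, preserving input order.
// extractOne is a hypothetical callback such as (p) => pdfExtract.extract(p, {}).
async function extractInOrder(paths, extractOne) {
  const results = [];
  for (const path of paths) {
    // await inside for...of finishes each file before starting the next
    results.push(await extractOne(path));
  }
  return results;
}

// Alternative: start all files at once. Promise.all still returns
// the results in the same order as the input array.
function extractAll(paths, extractOne) {
  return Promise.all(paths.map((path) => extractOne(path)));
}
```

Note that array.forEach with an async callback does not wait for the callbacks to finish, which is one frequent source of undefined results in loops like the one described.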
pdf.js-extract (and also pdf.js on its own) makes it possible to get the x- and y-coordinates as well as the width and height of the text chunks within a PDF file. But is there any way to also get the coordinates of parts of these chunks, i.e. substrings of the str string of the content items extracted via pdf.js(-extract)? In the "Hello, world!" example, I would like to know the x-coordinate of where "Hello" ends or where "world" starts.
Is there any way more comfortable than trying to calculate this via the width of the single characters of the given font?
The reason I'm trying to do this is that I want to search for a specific text in the PDF-file and find out its coordinates in order to be able to do something with it, draw a rectangle around it to highlight it for example. Any advice on how to achieve this would be welcome. Thanks in advance!
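One crude fallback that avoids per-glyph font metrics is to split an item's width in proportion to character counts. This is only a sketch with a hypothetical function name: it assumes horizontal left-to-right text and treats every character as equally wide, so the boxes will drift for proportional fonts.

```javascript
// Approximate per-word boxes inside one extracted item by splitting its width
// in proportion to character counts. Assumes every character is equally wide.
function approximateWordBoxes(item) {
  const charWidth = item.str.length > 0 ? item.width / item.str.length : 0;
  const boxes = [];
  const wordPattern = /\S+/g;
  let match;
  while ((match = wordPattern.exec(item.str)) !== null) {
    boxes.push({
      str: match[0],
      x: item.x + match.index * charWidth,
      y: item.y,
      width: match[0].length * charWidth,
      height: item.height,
    });
  }
  return boxes;
}

// Made-up item: 13 characters across a width of 130 gives 10 units per character.
const sampleBoxes = approximateWordBoxes({ x: 100, y: 50, str: 'Hello, world!', width: 130, height: 10 });
console.log(sampleBoxes.map((box) => `${box.str} @ ${box.x}`));
```

For highlighting, the approximate boxes can be padded slightly to hide the error introduced by the equal-width assumption.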
Would be nice if the lib could extract a list of images (identifier/pathnames by pages) and give a way to extract some of them. Especially useful to see when a pdf has 0 text but many images, then a OCR work aside can be started.
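Until something like that exists, the "no text on this page" half of the request can be approximated from the existing output. The sketch below uses a hypothetical helper name and relies only on the data.pages[].content[].str shape that appears in other snippets on this page.

```javascript
// Return the indexes of pages whose content has no non-whitespace text.
// Expects the result shape used elsewhere on this page: data.pages[].content[].str
function pagesWithoutText(data) {
  const empty = [];
  data.pages.forEach((page, index) => {
    const hasText = page.content.some((item) => item.str.trim().length > 0);
    if (!hasText) empty.push(index);
  });
  return empty;
}
```

Pages returned by this helper would be the candidates to hand over to a separate OCR step.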
Thank you for your work 👍.
Is there a specific reason the coordinate system seems to be flipped? The Y axis runs from top to bottom while the X axis runs from left to right. You can see how initially this might be confusing to folks new to this package. Was there a specific reason why you opted to not have the axes share the same origin?
0
| Y
|
|
| X
↓ -------------→
0
The output of parsing of my pdf file is missing lots of information.
201800000893524691.pdf
Can we expand the options so that getTextContent() can run with the normalizeWhitespace parameter?
Line 66 in b1babe1
so it would look like :
const normalizeWhitespaceParam = options && options.normalizeWhitespace === true ? true : false;
return page.getTextContent({ normalizeWhitespace: normalizeWhitespaceParam }).then(function (content) {
source: https://github.com/mozilla/pdf.js/blob/b2e7d0c89b76e228e49c7cee759873322a442f62/src/display/api.js#L779
thanks
Basically the title. I am using TypeScript, so maybe the types are not up to date, but I don't see a way to get the color information for a text item.
How can I use this library in a React Native application?
Hi,
Is it possible to get the coordinates of each word in the PDF? The "Hello, world!" output is a chunk of words; I want to extract each word as one separate item, i.e. "Hello", ",", "world" and "!" all separate. Is it possible?
Howdy,
I recently upgraded from 0.1.1 to 0.1.3 and have found that it extracts pages in a different order on some PDFs. Have you encountered this? Is it a pdf.js-extract issue (e.g. lib/index.js) or pdf.js itself?
I've just tested, and 0.1.2 has the same page ordering. What commitId corresponds to v0.1.2?
I try to use this library in a project that is deployed to AWS lambda. The code is bundled via webpack (serverless-webpack). While executing I get the following error:
ERROR webpack://herakles/node_modules/pdf.js-extract/lib/pdfjs/pdf.js:14614
_this10._readyCapability.reject(new Error("Setting up fake worker failed: \"".concat(reason.message, "\".")));
^
Error: Setting up fake worker failed: "Cannot find module './pdf.worker.js'
Require stack:
- /var/task/src/verify-pdf.js
- /var/runtime/UserFunction.js
- /var/runtime/index.js".
at null.<anonymous> (webpack://herakles/node_modules/pdf.js-extract/lib/pdfjs/pdf.js:14614:43)
at processTicksAndRejections (internal/process/task_queues.js:95:5)
Hello,
I'm having an issue trying to use extractBuffer inside of a Jest test. Any call results in the error
Error: No "GlobalWorkerOptions.workerSrc" specified.
I've even tried adding these 2 lines to the lib/index.js file
const worker = require("./pdfjs/pdf.worker.js");
pdfjsLib.GlobalWorkerOptions.workerSrc = worker
which just results in a promise that never resolves.
I'm kind of at a loss here; any insight would be appreciated.
Thanks
While using this library I discovered a small issue: We have a monorepo with a shared package that contains domain logic, which is partly used by our backend and partly by a react frontend. This shared package uses pdf.js-extract for server-side stuff. Since we also import some typings and other methods inside our react client (that runs in a browser) we get several assertion errors from this line
Line 6 in 1df4d00
In the end we don't consume pdf.js-extract in our client, but the above line is called nonetheless because of the shared package between backend/frontend. Would it be possible to safeguard the stubbing by checking whether the file is run inside Node.js?
if (typeof window === 'undefined' || typeof process === 'object') {
require("./pdfjs/domstubs.js").setStubs(global);
}
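One caveat with the proposed condition: it is true whenever either half holds, and some browser bundles ship a process shim, so typeof process === 'object' may also pass in a browser. A stricter check, sketched here with a hypothetical helper name, looks for process.versions.node, which such shims typically leave unset.

```javascript
// True when running under Node.js itself, not under a bundler's process shim.
function isNodeRuntime() {
  return (
    typeof process !== 'undefined' &&
    process.versions != null &&
    process.versions.node != null
  );
}

// The guarded call would then look like:
// if (isNodeRuntime()) { require("./pdfjs/domstubs.js").setStubs(global); }
console.log(isNodeRuntime());
```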
I get output like this:
[
  {
    x: 214.17,
    y: 83.52999999999997,
    str: 'PE',
    dir: 'ltr',
    width: 10.005,
    height: 7.5,
    fontName: 'Helvetica'
  },
  {
    x: 224.18,
    y: 83.52999999999997,
    str: 'ŁNOMOCNICTWO / ',
    dir: 'ltr',
    width: 72.47999999999999,
    height: 7.5,
    fontName: 'g_d0_f1'
  }
]
It is supposed to be found as one word, but the library seems to break it in two, treating the Ł as a special symbol (note that the second chunk also has a different fontName). Is there any way to get around this issue? It should be 'PEŁNOMOCNICTWO' as one word.
Same with other words that include diacritics.
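A possible post-processing workaround, given as a sketch and not as a library feature: merge consecutive items that share a y value and whose horizontal gap is close to zero. In the output above the first item ends at about 224.175 and the second starts at 224.18. The function name and the maxGap tolerance are hypothetical, and the sketch assumes items arrive in reading order.

```javascript
// Merge consecutive items that sit on the same baseline and touch horizontally.
// maxGap is a hypothetical tolerance in page units; tune it for your documents.
function mergeAdjacentItems(items, maxGap = 0.5) {
  const merged = [];
  for (const item of items) {
    const prev = merged[merged.length - 1];
    const sameLine = prev !== undefined && Math.abs(prev.y - item.y) < 0.01;
    const gap = prev !== undefined ? item.x - (prev.x + prev.width) : Infinity;
    if (sameLine && gap >= -maxGap && gap <= maxGap) {
      // Extend the previous item; it keeps its own fontName and height.
      prev.str += item.str;
      prev.width = item.x + item.width - prev.x;
    } else {
      merged.push({ ...item });
    }
  }
  return merged;
}
```

Words separated by a real space normally show a gap of roughly one space width, so a small tolerance should leave them apart.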
I am trying to extract the columns, but I cannot find a documented way to do so anywhere, the way there is for extracting rows. Can anyone provide a working example or a walkthrough for this?
Hi! First of all, thank you for your library.
I'm using Next.js and using the pdf.js-extract library on the pages/api folder.
It works perfectly on development, but it crashes on production with the following error:
Setting up fake worker failed: "Cannot find module './pdf.worker.js' Require stack: - /var/task/node_modules/pdf.js-extract/lib/pdfjs/pdf.js - /var/task/node_modules/pdf.js-extract/lib/cmap-reader.js - /var/task/node_modules/pdf.js-extract/lib/index.js - /var/task/.next/server/pages/api/notas-fiscais/extract.js - /var/task/node_modules/next/dist/server/next-server.js - /var/task/___next_launcher.cjs".
This is my package.json:
{
  "name": "rbna-automations",
  "version": "0.1.0",
  "private": true,
  "scripts": {
    "dev": "next dev",
    "build": "next build",
    "start": "next start",
    "lint": "next lint"
  },
  "dependencies": {
    "@types/node": "18.15.11",
    "@types/react": "18.0.35",
    "@types/react-dom": "18.0.11",
    "devextreme": "22.2.5",
    "devextreme-react": "22.2.5",
    "eslint": "8.38.0",
    "eslint-config-next": "13.3.0",
    "mindee": "^3.7.3",
    "multiparty": "^4.2.3",
    "next": "13.3.0",
    "openai": "^3.2.1",
    "pdf.js-extract": "^0.2.1",
    "react": "18.2.0",
    "react-dom": "18.2.0",
    "typescript": "5.0.4",
    "xlsx": "https://cdn.sheetjs.com/xlsx-0.19.3/xlsx-0.19.3.tgz"
  },
  "devDependencies": {
    "@types/multiparty": "^0.0.33"
  }
}
I found this thread: mozilla/pdf.js#12066, but couldn't figure out how to fix the problem in the context of pdf.js-extract.
Thank you very much!