jzillmann / pdf-to-markdown
A PDF to Markdown converter
Home Page: https://pdf2md.morethan.io
License: MIT License
Right now the following pieces are in place:
snowpack.config.js
mount: {
...
'node_modules/pdfjs-dist/es5/build/': { url: '/worker', static: true },
},
ui/src/store.js
pdfjs.GlobalWorkerOptions.workerSrc = `worker/pdf.worker.min.js`;
This works in dev mode, but for a production deployment (npm run build) one has to copy node_modules/pdfjs-dist/build/pdf.worker.min.js to worker/pdf.worker.min.js.
There has to be a better way!
There is related documentation and conversation:
But I have failed so far with every approach...
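One way to automate the manual copy step described above is a small Node script that runs after the build. This is only a sketch: the helper name `copyWorker` is hypothetical, and the output directory is passed in as a parameter because the real build output location is an assumption here.

```javascript
// Hypothetical post-build helper, not part of the repository.
const fs = require('fs');
const path = require('path');

// Copies the PDF.js worker file into a target directory, creating it if needed.
// Returns the path of the copied file.
function copyWorker(sourceFile, targetDir) {
  fs.mkdirSync(targetDir, { recursive: true });
  const targetFile = path.join(targetDir, path.basename(sourceFile));
  fs.copyFileSync(sourceFile, targetFile);
  return targetFile;
}
```

Calling `copyWorker('node_modules/pdfjs-dist/build/pdf.worker.min.js', 'build/worker')` from a script registered under npm's `postbuild` hook would run it after every `npm run build`; the `build` directory name is assumed, not taken from the project.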
This is only praise: this converter works really well!
Thanks for the good work, it saves me the time I used to spend converting by hand.
Hello @jzillmann,
@opengovsg (The Open Government Products division of the Singapore Government) forked this repo a while ago, and created an npm package and CLI tool thanks to the code that you wrote.
As you are the original author of the codebase, we would like to request an MIT license for the following derivative works:
We feel that a more permissive license than the current GPL 3.0 one would encourage more active use and maintenance of the derivative forks. For clarity, we are happy that the original repo remains on GPL 3.0, so long as @opengovsg has your permission and a license from you to publish our derivative work as MIT.
If you have any queries, do reply to this issue!
The pdf parsing of https://homepages.cwi.nl/~lex/files/dict.pdf doesn't look very appealing.
Things I have already noticed:
Currently we can:
Example.pdf
Would be nice to have a 3rd option where one can enter a URL.
PDF.js can already source from a URL (we're doing it for (2) already), so this should be purely UI.
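As a rough illustration of the non-UI part of this request, the text from a URL field could be validated and turned into the parameter object that PDF.js `getDocument` accepts. The helper name `toPdfSource` is hypothetical, and restricting input to http/https is an assumption made for this sketch.

```javascript
// Hypothetical helper: turns user input from a URL field into a PDF.js source.
function toPdfSource(input) {
  const url = new URL(input.trim()); // throws a TypeError on malformed input
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    throw new Error(`Unsupported protocol: ${url.protocol}`);
  }
  // PDF.js getDocument accepts a parameter object with a `url` property.
  return { url: url.href };
}
```

The result could then be passed to `getDocument`. Note that a browser will only fetch a cross-origin PDF if the remote server allows it via CORS, so some URLs would still fail.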
C:\pdf-to-markdown>npm run build
[email protected] build C:\pdf-to-markdown
webpack
Hash: 396f0bfb9d565b6f60f0
Version: webpack 1.15.0
Time: 362ms
Asset Size Chunks Chunk Names
index.html 457 bytes [emitted]
bundle.worker.js 1.54 MB [emitted]
favicons/favicon.ico 318 bytes [emitted]
[0] ./src/javascript/index.jsx 0 bytes [built] [failed]
ERROR in ./src/javascript/index.jsx
Module parse failed: C:\pdf-to-markdown\src\javascript\index.jsx Unexpected token (11:20)
You may need an appropriate loader to handle this file type.
SyntaxError: Unexpected token (11:20)
at Parser.pp$4.raise (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:2221:15)
at Parser.pp.unexpected (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:603:10)
at Parser.pp$3.parseExprAtom (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1822:12)
at Parser.pp$3.parseExprSubscripts (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1715:21)
at Parser.pp$3.parseMaybeUnary (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1692:19)
at Parser.pp$3.parseExprOps (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1637:21)
at Parser.pp$3.parseMaybeConditional (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1620:21)
at Parser.pp$3.parseMaybeAssign (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1597:21)
at Parser.pp$3.parseExprList (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:2165:22)
at Parser.pp$3.parseSubscripts (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1741:35)
at Parser.pp$3.parseExprSubscripts (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1718:17)
at Parser.pp$3.parseMaybeUnary (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1692:19)
at Parser.pp$3.parseExprOps (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1637:21)
at Parser.pp$3.parseMaybeConditional (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1620:21)
at Parser.pp$3.parseMaybeAssign (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1597:21)
at Parser.pp$3.parseExpression (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1573:21)
at Parser.pp$1.parseStatement (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:727:47)
at Parser.pp$1.parseBlock (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:981:25)
at Parser.pp$3.parseFunctionBody (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:2105:24)
at Parser.pp$1.parseFunction (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1065:10)
at Parser.pp$1.parseFunctionStatement (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:818:17)
at Parser.pp$1.parseStatement (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:694:19)
at Parser.pp$1.parseTopLevel (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:638:25)
at Parser.parse (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:516:17)
at Object.parse (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:3098:39)
at Parser.parse (C:\pdf-to-markdown\node_modules\webpack\lib\Parser.js:902:15)
at NormalModule.<anonymous> (C:\pdf-to-markdown\node_modules\webpack\lib\NormalModule.js:104:16)
at NormalModule.onModuleBuild (C:\pdf-to-markdown\node_modules\webpack-core\lib\NormalModuleMixin.js:310:10)
at nextLoader (C:\pdf-to-markdown\node_modules\webpack-core\lib\NormalModuleMixin.js:275:25)
at C:\pdf-to-markdown\node_modules\webpack-core\lib\NormalModuleMixin.js:259:5
You should consider changing the URL of your online website https://pdf2md.morethan.io/, because pdf2md is the name of another npm package.
The version at https://pdf2md.morethan.io/ doesn't remove any headers/footers for me, but the same PDF uploaded to https://jzillmann.github.io/pdf-to-markdown-staging/ does. I just can't figure out how to get the output from the github.io version with the repetitive elements removed.
I hope a download function can be added to the online version.
Hi!
This version can only convert text. Is there any way or option to convert media (images, ...) as well?
Thanks
With the PDF file https://homepages.cwi.nl/~lex/files/dict.pdf (a ~220-page math textbook) I get the following error in the console, and no markdown file to download:
pdf2md.morethan.io/:1 Uncaught (in promise) Unsupported headline level: 11 (supported are 1-6)
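The error message states that only headline levels 1-6 are supported. One possible fallback, shown here purely as a sketch and not as the project's actual handling, is to clamp any detected level into that range before emitting the Markdown headline. The names `clampHeadlineLevel` and `toHeadline` are hypothetical.

```javascript
const MAX_HEADLINE_LEVEL = 6; // Markdown headlines run from # to ######

// Hypothetical helper: clamps a detected level into the supported 1-6 range.
function clampHeadlineLevel(level) {
  return Math.min(Math.max(level, 1), MAX_HEADLINE_LEVEL);
}

// Renders a headline line, falling back to the deepest level instead of throwing.
function toHeadline(level, text) {
  return `${'#'.repeat(clampHeadlineLevel(level))} ${text}`;
}
```

Clamping keeps the conversion running but flattens deep hierarchies; rendering levels beyond 6 as plain or bold paragraphs would be an alternative design.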
Hello!
I was wondering if it would be possible to create a standalone version of pdf-to-markdown to use in other projects. The current project includes the generation of the page etc as well. Just the converter would be nice to have.
Example
PdfDocument = PDFJS.getDocument(...);
Converter = new PdfToMarkDown();
var Markdown = Converter.makeMarkdown(PdfDocument);
Would be forever grateful! Thanks in advance.
I converted a very big PDF (180 MB) in seconds. Thanks
Peter
Hello Johannes,
Thanks for pdf-to-markdown.
I was trying to install it on my local machine but didn't succeed in getting it running. Is that possible?
I cloned the git repository (git clone https://github.com/jzillmann/pdf-to-markdown) and then ran a few npm commands (npm install, npm lint, ...) but, once done, how can I start the interface?
The src/index.html static page stays with the empty <div id="main"/> (which seems logical), but how can I install and run it locally?
Thanks a lot in advance !
The paragraphs are broken into several short lines, but I know this is not a problem, as Markdown will always treat consecutive lines as a single paragraph. BUT, in case you want to add some functionality to create very long lines (aka paragraphs), I've written this shell script. Perhaps you could translate it to JavaScript if it looks interesting. It works well, although it's not perfect.
GitHub Gist: paragrapher
The purpose of this script is to analyze plain text files (with or without the ".txt" extension) looking for broken paragraphs, i.e., paragraphs split across more than one line, and join each of them into a single very long line.
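The idea of joining broken paragraphs can be sketched in JavaScript as follows. This is not a translation of the linked shell script, only a minimal illustration under the assumption that paragraphs are separated by one or more blank lines; the function name `joinParagraphs` is hypothetical.

```javascript
// Joins hard-wrapped lines so that each paragraph becomes one long line.
// Assumption: a blank line (possibly containing whitespace) separates paragraphs.
function joinParagraphs(text) {
  return text
    .replace(/\r\n/g, '\n')               // normalize Windows line endings
    .split(/\n\s*\n/)                     // split into paragraphs at blank lines
    .map((paragraph) =>
      paragraph
        .split('\n')
        .map((line) => line.trim())
        .filter(Boolean)                  // drop empty leftovers
        .join(' ')                        // one long line per paragraph
    )
    .filter(Boolean)                      // drop empty paragraphs
    .join('\n\n');
}
```

Words hyphenated across a line break are not rejoined by this sketch; handling them would need an extra rule.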
I love pdf2markdown so much, and very much appreciate that it's an online tool (I haven't figured out how to make the code run in linux, so for now I am content with the website).
I was wondering, though: are there any plans to integrate PDF to Markdown into a PKM (personal knowledge management) tool like Obsidian? I have to convert a boatload of PDFs to Markdown to put them in Obsidian and, from a purely self-serving perspective, it would vastly improve my efficiency if such a conversion tool existed within the app.
I'm sure you have enough on your plate, but I thought I'd at least ask. Also, is there any way to provide a little financial support for all your work? I relied on pdf2markdown a great deal for a recent project, and will continue to sing its praises and use it in the future. I even wrote a little article on what I was working on and how helpful pdf2markdown was to achieving my goals. https://careylening.substack.com/p/the-power-of-links-and-second-brains-d1d
Hi team, I tried your converter but nothing showed up on your screen.
Thanks for the chance though. If you know what happened, let me know and I will be back, as I have many PDFs that have to be converted and Chat40 is waiting for me.
Here is my page:
https://pdf2md.morethan.io/#:~:text=This%20tool%20converts%20a%20PDF,different%20tools%20and%20different%20ages.
It's probably my fault, as I am one of those humans who is allergic to code,
and my neurons simply refuse to accept any I try to give them... ;)
Waiting in anticipation of a successful solution, as I really need you badly.
Thanks for the opportunity.
cheers Kev Borg
Kiwi
Thank you for this package. I was interested in converting a PDF with web hyperlinks into markdown but they were not shown in the output. This is expected, right?
Would be nice to be able to use pdf-to-markdown from command line!
Uncaught TypeError: Cannot set properties of null (setting 'innerHTML')
at jqmini.js:1:53
VM96:405 Edge Translation started
single-file-hooks-frames.js:1 OTS parsing error: glyf: Failed to parse table
e.FontFace @ single-file-hooks-frames.js:1
bundle.js:2 Warning: Failed to load font 'g_d0_f956': 'SyntaxError: Invalid font data in ArrayBuffer.'.
single-file-hooks-frames.js:1 OTS parsing error: glyf: Failed to parse table
e.FontFace @ single-file-hooks-frames.js:1
bundle.js:2 Warning: Failed to load font 'g_d0_f961': 'SyntaxError: Invalid font data in ArrayBuffer.'.
single-file-hooks-frames.js:1 OTS parsing error: glyf: Failed to parse table
e.FontFace @ single-file-hooks-frames.js:1
bundle.js:2 Warning: Failed to load font 'g_d0_f981': 'SyntaxError: Invalid font data in ArrayBuffer.'.
single-file-hooks-frames.js:1 OTS parsing error: glyf: Failed to parse table
e.FontFace @ single-file-hooks-frames.js:1
bundle.js:2 Warning: Failed to load font 'g_d0_f991': 'SyntaxError: Invalid font data in ArrayBuffer.'.
pdf2md.morethan.io/:1 Uncaught (in promise) Unsupported headline level: 7 (supported are 1-6)
2pdf2md.morethan.io/:1 Uncaught (in promise) Error: Could not establish connection. Receiving end does not exist.
bundle.js:2 Uncaught TypeError: Cannot read properties of null (reading 'getHostNode')
at Object.getHostNode (bundle.js:2:564541)
at l.getHostNode (bundle.js:2:522842)
at Object.getHostNode (bundle.js:2:564541)
at Object.updateChildren (bundle.js:2:519030)
at Y._reconcilerUpdateChildren (bundle.js:2:560442)
at Y._updateChildren (bundle.js:2:561369)
at Y.updateChildren (bundle.js:2:561267)
at Y._updateDOMChildren (bundle.js:2:537299)
at Y.updateComponent (bundle.js:2:535523)
at Y.receiveComponent (bundle.js:2:535076)
3pdf2md.morethan.io/:1 Uncaught (in promise) Error: Could not establish connection. Receiving end does not exist.
2bundle.js:2 Uncaught TypeError: Cannot read properties of null (reading 'getHostNode')
at Object.getHostNode (bundle.js:2:564541)
at l.getHostNode (bundle.js:2:522842)
at Object.getHostNode (bundle.js:2:564541)
at Object.updateChildren (bundle.js:2:519030)
at Y._reconcilerUpdateChildren (bundle.js:2:560442)
at Y._updateChildren (bundle.js:2:561369)
at Y.updateChildren (bundle.js:2:561267)
at Y._updateDOMChildren (bundle.js:2:537299)
at Y.updateComponent (bundle.js:2:535523)
at Y.receiveComponent (bundle.js:2:535076)
Hi, I want to know how to change "127.0.0.1" to my server's IP and how to change the port (the default port is 8080). Thank you!
Is it possible to add utf-8 support to the app? Trying to convert pdfs with CJK characters ended up with garbled text.
Browser: Avast Secure Browser
OS: Windows 11
Console output:
pdf2md.morethan.io/:1 Uncaught (in promise) Unsupported headline level: 7 (supported are 1-6)
DevTools failed to load source map: Could not load content for https://pdf2md.morethan.io/bootstrap.css.map: HTTP error: status code 404, net::ERR_HTTP_RESPONSE_CODE_FAILURE