jzillmann / pdf-to-markdown Goto Github PK


1.1K 18.0 183.0 189.65 MB

A PDF to Markdown converter

Home Page: https://pdf2md.morethan.io

License: MIT License

HTML 0.28% JavaScript 99.49% CSS 0.23%

pdf-converter markdown

pdf-to-markdown's People

Stargazers

Watchers

Forkers

illusionfield stweil yasuotanaka mattbailey dagfans happy-ferret kadesnikov rifleguy orchestor dej611 betojsx nguoianphu yuanfuiw marvindanig studocu d-feller yesoreyeram opengovsg trumancfy raymondwo alex-linhares asa1230 hannahromney davan690 beardedginger24 anastasiya-sk worldoflinux martinandersen3d sliceofbytes gryn010 smitnald wugenqiang csrudolflai fuxin123z cjsteel guillaumeisabellex ahandsel tangqianyu iprojas dazegithub kumakichi uncodedtech yixingwang0517 fabriciocgf swipswaps rambo2015 docuprint xdshivani gscholer oddaolse timurguseynov web-dev-collaborative ng572 darkcheftar beethovenvirus extratone lyhiving laudehenri matkoclaudio flo223 deltapains side-projects-42 zerogau vanuo1013 guopengcheng12138 lj12306 yannoudou uozooo tedxnet zhizhizhikang bdfinst hippietilley uakbr chenfan0 webstorage119 blackfishdel seanvosler thomasbiddle gsyx666 jessmtermini hurricanemark joyeoffice eularinani modern-z nath1as visaodeempresa gerwim jpollard-cs hirajanwin davidlama freitzzz k1000 tj1456 c4tom tlwzzy tinyx3k that-one-arab rescenic prasanth595 fcgca

pdf-to-markdown's Issues

Updating pdfjs from 2.6.347 to 2.7.570: Uncaught TypeError: Cannot set property 'workerSrc' of undefined

Update pdfjs, e.g. npm remove --save pdfjs-dist && npm install --save pdfjs-dist
Run the application: npm start
=> Error: Uncaught TypeError: Cannot set property 'workerSrc' of undefined

Screenshots from Chrome & Safari:

V2: Include PDF.js web-worker in a way that works with bundling

Right now following pieces are in place:

In snowpack.config.js

  mount: {
        ...      
        'node_modules/pdfjs-dist/es5/build/': { url: '/worker', static: true },
    },

And then in ui/src/store.js

pdfjs.GlobalWorkerOptions.workerSrc = `worker/pdf.worker.min.js`;

This works in dev mode, but for a production deployment (npm run build) one has to copy node_modules/pdfjs-dist/build/pdf.worker.min.js to worker/pdf.worker.min.js.

There has to be a better way!

There is related documentation and conversation:

But I have failed so far with every approach...
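The manual copy step described above could be automated with a small Node script that runs after the build. This is only a sketch, not the project's actual solution: the function name copyWorker and the output directory name passed to it are assumptions, and only the worker path under node_modules/pdfjs-dist/build/ comes from the issue text.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical helper: copies the PDF.js worker file into <outputDir>/worker/
// so that a workerSrc of `worker/pdf.worker.min.js` resolves in production.
function copyWorker(sourceFile, outputDir) {
  const target = path.join(outputDir, 'worker', path.basename(sourceFile));
  // Create <outputDir>/worker if it does not exist yet.
  fs.mkdirSync(path.dirname(target), { recursive: true });
  fs.copyFileSync(sourceFile, target);
  return target;
}

// Example call (the 'build' output directory is an assumption):
// copyWorker('node_modules/pdfjs-dist/build/pdf.worker.min.js', 'build');
```

A script like this could be called from a postbuild entry in package.json, which removes the manual step but still does not make the bundler itself aware of the worker.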

Is it possible to detect highlighted sections (annotations) on a pdf and preserve that in md?

Cool!

This is only praise: this converter works really well!
Thanks for the good work. It saves me the time I used to spend converting by hand.

Grant @opengovsg MIT license for pdf2md fork

Hello @jzillmann,

@opengovsg (The Open Government Products division of the Singapore Government) forked this repo a while ago, and created an npm package and CLI tool thanks to the code that you wrote.

As you are the original author of the codebase, we would like to request an MIT license for the following derivative works:

https://github.com/opendocsg/pdf2md
https://github.com/opendocsg/pdf2md-web
@opengovsg equivalents of the repos above, to facilitate the transfer of those repos from @opendocsg to @opengovsg

We feel that a more permissive license than the current GPL 3.0 one would encourage more active use and maintenance of the derivative forks. For clarity, we are happy that the original repo remains on GPL 3.0, so long as @opengovsg has your permission and a license from you to publish our derivative work as MIT.

If you have any queries, do reply to this issue!

Parsing Metadata doesn't finish

After uploading my markdown file, the page keeps displaying the following screen and never finishes.

Nothing to download here for me - nor any progress.

Improve parsing on dict.pdf

The pdf parsing of https://homepages.cwi.nl/~lex/files/dict.pdf doesn't look very appealing.

Things I already noticed:

No TOC display (and strange header size detection, see #21)
Characters are clumped together that shouldn't be
Page numbers are not detected and removed

Read PDFs from URL option

Currently we can:

1. Drop or browse PDFs
2. Open the Example.pdf

Would be nice to have a 3rd option where one can enter a URL.

PDF.js already can source from a URL (we're doing it for (2) already), so this should be purely UI.
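A minimal sketch of what the non-UI part of this request might look like is shown here. The function name loadPdfFromUrl is hypothetical, and the PDF.js getDocument function is passed in as a parameter (an assumption made so that the example stays self-contained); the sketch only relies on getDocument accepting a URL string and returning a loading task with a promise property.

```javascript
// Hypothetical helper: validates a user-entered URL and hands it to PDF.js.
// `getDocument` is expected to behave like pdfjs.getDocument, i.e. accept a
// URL string and return a loading task exposing a `promise` property.
function loadPdfFromUrl(input, getDocument) {
  // `new URL` throws a TypeError if the input is not a valid absolute URL.
  const url = new URL(input.trim());
  if (url.protocol !== 'http:' && url.protocol !== 'https:') {
    throw new Error('Only http and https URLs are supported: ' + input);
  }
  return getDocument(url.href).promise;
}

// Example call (assumes `pdfjs` is the imported pdfjs-dist module):
// loadPdfFromUrl('https://example.com/file.pdf', pdfjs.getDocument)
//   .then((pdfDocument) => console.log(pdfDocument.numPages));
```

Note that in a browser the remote server would also have to allow cross-origin requests, which a text field alone cannot solve.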

Error when I run "npm run build":

C:\pdf-to-markdown>npm run build

[email protected] build C:\pdf-to-markdown
webpack

Hash: 396f0bfb9d565b6f60f0
Version: webpack 1.15.0
Time: 362ms
Asset Size Chunks Chunk Names
index.html 457 bytes [emitted]
bundle.worker.js 1.54 MB [emitted]
favicons/favicon.ico 318 bytes [emitted]
[0] ./src/javascript/index.jsx 0 bytes [built] [failed]

ERROR in ./src/javascript/index.jsx
Module parse failed: C:\pdf-to-markdown\src\javascript\index.jsx Unexpected token (11:20)
You may need an appropriate loader to handle this file type.
SyntaxError: Unexpected token (11:20)
at Parser.pp$4.raise (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:2221:15)
at Parser.pp.unexpected (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:603:10)
at Parser.pp$3.parseExprAtom (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1822:12)
at Parser.pp$3.parseExprSubscripts (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1715:21)
at Parser.pp$3.parseMaybeUnary (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1692:19)
at Parser.pp$3.parseExprOps (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1637:21)
at Parser.pp$3.parseMaybeConditional (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1620:21)
at Parser.pp$3.parseMaybeAssign (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1597:21)
at Parser.pp$3.parseExprList (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:2165:22)
at Parser.pp$3.parseSubscripts (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1741:35)
at Parser.pp$3.parseExprSubscripts (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1718:17)
at Parser.pp$3.parseMaybeUnary (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1692:19)
at Parser.pp$3.parseExprOps (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1637:21)
at Parser.pp$3.parseMaybeConditional (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1620:21)
at Parser.pp$3.parseMaybeAssign (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1597:21)
at Parser.pp$3.parseExpression (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1573:21)
at Parser.pp$1.parseStatement (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:727:47)
at Parser.pp$1.parseBlock (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:981:25)
at Parser.pp$3.parseFunctionBody (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:2105:24)
at Parser.pp$1.parseFunction (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1065:10)
at Parser.pp$1.parseFunctionStatement (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:818:17)
at Parser.pp$1.parseStatement (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:694:19)
at Parser.pp$1.parseTopLevel (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:638:25)
at Parser.parse (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:516:17)
at Object.parse (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:3098:39)
at Parser.parse (C:\pdf-to-markdown\node_modules\webpack\lib\Parser.js:902:15)
at NormalModule. (C:\pdf-to-markdown\node_modules\webpack\lib\NormalModule.js:104:16)
at NormalModule.onModuleBuild (C:\pdf-to-markdown\node_modules\webpack-core\lib\NormalModuleMixin.js:310:10)
at nextLoader (C:\pdf-to-markdown\node_modules\webpack-core\lib\NormalModuleMixin.js:275:25)
at C:\pdf-to-markdown\node_modules\webpack-core\lib\NormalModuleMixin.js:259:5

where are the files?

ciao!

how are you?

I see that it parses, but I get no output. Where are the markdown files?

thanks!
ciao!

d

Fix: No valid versions available for pdf-to-markdown

You should consider changing the URL of your online website https://pdf2md.morethan.io/ because pdf2md is the name of another npm package.

Some help removing repetitive elements

The version at https://pdf2md.morethan.io/ doesn't remove any header/footers for me, but the same pdf uploaded to https://jzillmann.github.io/pdf-to-markdown-staging/ does. I just can't figure out how to get the output from the github.io version with repetitive elements removed.

download problem

I would like a download function to be added to the online version.

Can not view the converted Markdown result

After I upload the PDF file to https://pdf2md.morethan.io/, I cannot view the converted result.

Is there any option to convert PDF to Markdown with embedded images?

Hi!
This version can only convert to text. Is there any way or option to convert with media (images, etc.)?

Thanks

Error: Unsupported Headline Level 11

With the PDF file https://homepages.cwi.nl/~lex/files/dict.pdf (a ~220 pages math textbook) I get the following error in the console, and no markdown file to download:
pdf2md.morethan.io/:1 Uncaught (in promise) Unsupported headline level: 11 (supported are 1-6)
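One possible way to avoid this exception is to clamp the detected level to the range Markdown supports instead of throwing. The sketch below is an illustration only: the function name headlinePrefix is hypothetical and is not taken from the converter's source, and clamping is just one of several reasonable policies (another would be to emit deeper levels as plain text).

```javascript
// Hypothetical helper: returns the Markdown heading prefix for a detected
// headline level, clamping anything deeper than 6 to level 6.
function headlinePrefix(level) {
  if (!Number.isInteger(level) || level < 1) {
    throw new RangeError('Headline level must be a positive integer: ' + level);
  }
  // Markdown only defines heading levels 1 to 6.
  const clamped = Math.min(level, 6);
  return '#'.repeat(clamped);
}

console.log(headlinePrefix(2));  // prints "##"
console.log(headlinePrefix(11)); // prints "######"
```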

Standalone version

Hello!

I was wondering if it would be possible to create a standalone version of pdf-to-markdown to use in other projects. The current project includes the generation of the page etc as well. Just the converter would be nice to have.

Example

PdfDocument = PDFJS.getDocument(...);  
Converter = new PdfToMarkDown();  
var Markdown = Converter.makeMarkdown(PdfDocument);

Would be forever grateful! Thanks in advance.

Thanks for this amazing tool

I converted a very big PDF (180 MB) in seconds. Thanks
Peter

How can we install it on our localhost ?

Hello Johannes,

Thanks for pdf-to-markdown.

I was trying to install it on my local machine but didn't succeed in getting it running. Is it possible?

I cloned the git repository (git clone https://github.com/jzillmann/pdf-to-markdown) and then ran a few npm commands (npm install, npm lint, ...) but, once done, how can I start the interface?

The src/index.html static page stays with the empty <div id="main"/> (which seems logical) but, yeah, how can I install and run it locally?

Thanks a lot in advance !

Broken paragraphs (enhancement)

The paragraphs are broken into several short lines, but I know this is not a problem, as Markdown always treats two consecutive lines as a single paragraph. BUT, in case you want to add some functionality to create very long lines (aka paragraphs), I've written this shell script. Perhaps you could translate it to JavaScript if it looks interesting. It works well, although it's not perfect.

GitHub Gist: paragrapher

The purpose of this script is to analyze plain text files (with or without the ".txt" extension) looking for broken paragraphs, i.e., paragraphs split across more than one line, and join each of them into a single very long line.
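The idea described above can be sketched in JavaScript as follows. This is not a translation of the linked shell script, whose contents are not shown here; it is a simplified illustration that assumes paragraphs are separated by blank lines, and the function name joinBrokenParagraphs is hypothetical.

```javascript
// Hypothetical helper: joins the lines of each paragraph into one long line.
// Assumes that paragraphs are separated by one or more blank lines.
function joinBrokenParagraphs(text) {
  return text
    .split(/\n\s*\n/)                  // split into paragraphs at blank lines
    .map((paragraph) =>
      paragraph
        .split('\n')
        .map((line) => line.trim())
        .filter((line) => line.length > 0)
        .join(' ')                     // merge the paragraph's lines
    )
    .filter((paragraph) => paragraph.length > 0)
    .join('\n\n');
}

console.log(joinBrokenParagraphs('first\nparagraph\n\nsecond\nparagraph'));
// prints "first paragraph", an empty line, then "second paragraph"
```

A blank-line rule like this would also merge lists and code blocks into single lines, so a real implementation would need extra handling for those cases.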

This is an amazing tool -- I do have one FR though.

I love pdf2markdown so much, and very much appreciate that it's an online tool (I haven't figured out how to make the code run in linux, so for now I am content with the website).

I was wondering though: are there any plans to integrate PDF to Markdown into a PKM (personal knowledge management) tool like Obsidian? I have to convert a boatload of PDFs to markdown to put them in Obsidian and, from a purely self-serving perspective, it would vastly improve my efficiency if such a conversion tool existed within the app.

I'm sure you have enough on your plate, but I thought I'd at least ask. Also, is there any way to provide a little financial support for all your work? I relied on pdf2markdown a great deal for a recent project, and will continue to sing its praises and use it in the future. I even wrote a little article on what I was working on and how helpful pdf2markdown was to achieving my goals. https://careylening.substack.com/p/the-power-of-links-and-second-brains-d1d

Tried your markdown PDF converter but nothing showed up

Hi team, I tried your converter but nothing showed up on the screen.
Thanks for the chance though, and if you know what happened, let me know and I will be back, as I have many PDFs that have to be converted because Chat40 is waiting for me.

here is my page
https://pdf2md.morethan.io/#:~:text=This%20tool%20converts%20a%20PDF,different%20tools%20and%20different%20ages.

It's probably my fault as i am one of those humans who is allergic to code,
and my Neurons simply Refuse to accept any I try to give them... ;)

waiting in anticipation of a successful solution as i really need you badly.

Thanks for the opportunity.

cheers Kev Borg
Kiwi


Feature Request: Include hyperlink support

Thank you for this package. I was interested in converting a PDF with web hyperlinks into markdown but they were not shown in the output. This is expected, right?

Gather PDF's for test suite

Gather various PDFs that can be used as a test suite
Those should be safe to add to the repository licensing-wise 👀

CLI tool

Would be nice to be able to use pdf-to-markdown from command line!

Can not view the converted Markdown result

Uncaught TypeError: Cannot set properties of null (setting 'innerHTML')
at jqmini.js:1:53
VM96:405 Edge Translation started
single-file-hooks-frames.js:1 OTS parsing error: glyf: Failed to parse table
e.FontFace @ single-file-hooks-frames.js:1
bundle.js:2 Warning: Failed to load font 'g_d0_f956': 'SyntaxError: Invalid font data in ArrayBuffer.'.
single-file-hooks-frames.js:1 OTS parsing error: glyf: Failed to parse table
e.FontFace @ single-file-hooks-frames.js:1
bundle.js:2 Warning: Failed to load font 'g_d0_f961': 'SyntaxError: Invalid font data in ArrayBuffer.'.
single-file-hooks-frames.js:1 OTS parsing error: glyf: Failed to parse table
e.FontFace @ single-file-hooks-frames.js:1
bundle.js:2 Warning: Failed to load font 'g_d0_f981': 'SyntaxError: Invalid font data in ArrayBuffer.'.
single-file-hooks-frames.js:1 OTS parsing error: glyf: Failed to parse table
e.FontFace @ single-file-hooks-frames.js:1
bundle.js:2 Warning: Failed to load font 'g_d0_f991': 'SyntaxError: Invalid font data in ArrayBuffer.'.
pdf2md.morethan.io/:1 Uncaught (in promise) Unsupported headline level: 7 (supported are 1-6)
2pdf2md.morethan.io/:1 Uncaught (in promise) Error: Could not establish connection. Receiving end does not exist.
bundle.js:2 Uncaught TypeError: Cannot read properties of null (reading 'getHostNode')
at Object.getHostNode (bundle.js:2:564541)
at l.getHostNode (bundle.js:2:522842)
at Object.getHostNode (bundle.js:2:564541)
at Object.updateChildren (bundle.js:2:519030)
at Y._reconcilerUpdateChildren (bundle.js:2:560442)
at Y._updateChildren (bundle.js:2:561369)
at Y.updateChildren (bundle.js:2:561267)
at Y._updateDOMChildren (bundle.js:2:537299)
at Y.updateComponent (bundle.js:2:535523)
at Y.receiveComponent (bundle.js:2:535076)
3pdf2md.morethan.io/:1 Uncaught (in promise) Error: Could not establish connection. Receiving end does not exist.
2bundle.js:2 Uncaught TypeError: Cannot read properties of null (reading 'getHostNode')
at Object.getHostNode (bundle.js:2:564541)
at l.getHostNode (bundle.js:2:522842)
at Object.getHostNode (bundle.js:2:564541)
at Object.updateChildren (bundle.js:2:519030)
at Y._reconcilerUpdateChildren (bundle.js:2:560442)
at Y._updateChildren (bundle.js:2:561369)
at Y.updateChildren (bundle.js:2:561267)
at Y._updateDOMChildren (bundle.js:2:537299)
at Y.updateComponent (bundle.js:2:535523)
at Y.receiveComponent (bundle.js:2:535076)

How to change localhost to my server's ip?

Hi, I want to know how to change "127.0.0.1" to my server's IP and how to change the port (the default port is 8080). Thank you!

UTF-8 Support

Is it possible to add utf-8 support to the app? Trying to convert pdfs with CJK characters ended up with garbled text.

Uncaught exception: Unsupported headline level: 7 (supported are 1-6)

Browser: Avast Secure Browser
OS: Windows 11

Console output:
pdf2md.morethan.io/:1 Uncaught (in promise) Unsupported headline level: 7 (supported are 1-6)
DevTools failed to load source map: Could not load content for https://pdf2md.morethan.io/bootstrap.css.map: HTTP error: status code 404, net::ERR_HTTP_RESPONSE_CODE_FAILURE
