pdf-to-markdown's Issues

V2: Include PDF.js web-worker in a way that works with bundling

Right now the following pieces are in place:

  • In snowpack.config.js:
      mount: {
        ...
        'node_modules/pdfjs-dist/es5/build/': { url: '/worker', static: true },
      },
  • And then in ui/src/store.js:
      pdfjs.GlobalWorkerOptions.workerSrc = `worker/pdf.worker.min.js`;

This works in dev mode, but for a production deployment (npm run build) one has to copy node_modules/pdfjs-dist/build/pdf.worker.min.js to worker/pdf.worker.min.js.

There has to be a better way!

There is related documentation and conversation:

But so far I have failed with every approach...
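
One stopgap, short of a bundler-native fix, is to script the manual copy described above so that it runs after every build. The sketch below is a minimal Node script under stated assumptions: the file name `scripts/copy-pdf-worker.js`, the helper `copyWorker`, and the `build/worker` output directory are hypothetical and not part of the repository.

```javascript
// scripts/copy-pdf-worker.js (hypothetical file name)
const fs = require('fs');
const path = require('path');

// Copies a single file into targetDir, creating the directory if needed.
// Returns the path of the copied file.
function copyWorker(sourceFile, targetDir) {
  fs.mkdirSync(targetDir, { recursive: true });
  const targetFile = path.join(targetDir, path.basename(sourceFile));
  fs.copyFileSync(sourceFile, targetFile);
  return targetFile;
}

// Example call; 'build/worker' is an assumed output location:
// copyWorker('node_modules/pdfjs-dist/build/pdf.worker.min.js', 'build/worker');
```

Registered as a `postbuild` script in package.json, npm would run it automatically after `npm run build`. This only automates the workaround; it does not make the bundler aware of the worker.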

Cool!

This is only praise: this converter works really well!
Thanks for the good work. It saves me the time I used to spend converting by hand.

Grant @opengovsg MIT license for pdf2md fork

Hello @jzillmann,

@opengovsg (The Open Government Products division of the Singapore Government) forked this repo a while ago, and created an npm package and CLI tool thanks to the code that you wrote.

As you are the original author of the codebase, we would like to request an MIT license for the following derivative works:

We feel that a more permissive license than the current GPL 3.0 one would encourage more active use and maintenance of the derivative forks. For clarity, we are happy that the original repo remains on GPL 3.0, so long as @opengovsg has your permission and a license from you to publish our derivative work as MIT.

If you have any queries, do reply to this issue!

Parsing Metadata doesn't finish

After uploading my markdown file, the page keeps displaying the "Parsing Metadata" screen and never finishes (screenshot omitted).

There is nothing for me to download, and no progress is shown.

Read PDFs from URL option

Currently we can:

  1. Drop or browse PDFs
  2. Open the Example.pdf

It would be nice to have a third option where one can enter a URL.

PDF.js already can source from a URL (we're doing it for (2) already), so this should be purely UI.
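
A rough sketch of the input handling such an option might need before the URL reaches PDF.js. The helper name `normalizePdfUrl` and the `textField` element are hypothetical; only `getDocument` accepting a URL comes from PDF.js itself.

```javascript
// Returns a normalized http(s) URL string, or null if the input is unusable.
function normalizePdfUrl(input) {
  let parsed;
  try {
    parsed = new URL(String(input).trim());
  } catch (err) {
    return null; // not an absolute URL
  }
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    return null; // only web URLs make sense for a remote PDF
  }
  return parsed.href;
}

// In the app, the result could then be handed to PDF.js, for example:
//   const url = normalizePdfUrl(textField.value); // textField is hypothetical
//   if (url) pdfjs.getDocument(url);
```

Note that a browser can only fetch the PDF if the remote server permits cross-origin requests, so some URLs would still fail even with a working UI.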

Error when I run "npm run build":

C:\pdf-to-markdown>npm run build

[email protected] build C:\pdf-to-markdown
webpack

Hash: 396f0bfb9d565b6f60f0
Version: webpack 1.15.0
Time: 362ms
Asset Size Chunks Chunk Names
index.html 457 bytes [emitted]
bundle.worker.js 1.54 MB [emitted]
favicons/favicon.ico 318 bytes [emitted]
[0] ./src/javascript/index.jsx 0 bytes [built] [failed]

ERROR in ./src/javascript/index.jsx
Module parse failed: C:\pdf-to-markdown\src\javascript\index.jsx Unexpected token (11:20)
You may need an appropriate loader to handle this file type.
SyntaxError: Unexpected token (11:20)
at Parser.pp$4.raise (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:2221:15)
at Parser.pp.unexpected (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:603:10)
at Parser.pp$3.parseExprAtom (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1822:12)
at Parser.pp$3.parseExprSubscripts (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1715:21)
at Parser.pp$3.parseMaybeUnary (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1692:19)
at Parser.pp$3.parseExprOps (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1637:21)
at Parser.pp$3.parseMaybeConditional (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1620:21)
at Parser.pp$3.parseMaybeAssign (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1597:21)
at Parser.pp$3.parseExprList (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:2165:22)
at Parser.pp$3.parseSubscripts (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1741:35)
at Parser.pp$3.parseExprSubscripts (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1718:17)
at Parser.pp$3.parseMaybeUnary (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1692:19)
at Parser.pp$3.parseExprOps (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1637:21)
at Parser.pp$3.parseMaybeConditional (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1620:21)
at Parser.pp$3.parseMaybeAssign (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1597:21)
at Parser.pp$3.parseExpression (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1573:21)
at Parser.pp$1.parseStatement (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:727:47)
at Parser.pp$1.parseBlock (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:981:25)
at Parser.pp$3.parseFunctionBody (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:2105:24)
at Parser.pp$1.parseFunction (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:1065:10)
at Parser.pp$1.parseFunctionStatement (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:818:17)
at Parser.pp$1.parseStatement (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:694:19)
at Parser.pp$1.parseTopLevel (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:638:25)
at Parser.parse (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:516:17)
at Object.parse (C:\pdf-to-markdown\node_modules\webpack\node_modules\acorn\dist\acorn.js:3098:39)
at Parser.parse (C:\pdf-to-markdown\node_modules\webpack\lib\Parser.js:902:15)
at NormalModule.<anonymous> (C:\pdf-to-markdown\node_modules\webpack\lib\NormalModule.js:104:16)
at NormalModule.onModuleBuild (C:\pdf-to-markdown\node_modules\webpack-core\lib\NormalModuleMixin.js:310:10)
at nextLoader (C:\pdf-to-markdown\node_modules\webpack-core\lib\NormalModuleMixin.js:275:25)
at C:\pdf-to-markdown\node_modules\webpack-core\lib\NormalModuleMixin.js:259:5

where are the files?

ciao!

how are you?

I can see that it parses, but I get no output. Where are the markdown files?

thanks!
ciao!

Standalone version

Hello!

I was wondering if it would be possible to create a standalone version of pdf-to-markdown to use in other projects. The current project includes the generation of the page etc as well. Just the converter would be nice to have.

Example

PdfDocument = PDFJS.getDocument(...);  
Converter = new PdfToMarkDown();  
var Markdown = Converter.makeMarkdown(PdfDocument);

Would be forever grateful! Thanks in advance.
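
For illustration only, here is a sketch of the PDF.js half of such a standalone entry point: pulling the raw text of each page out of an already-loaded document. The function name `extractPageTexts` is hypothetical, and this is not the project's conversion logic; turning the text into Markdown (headlines, lists, and so on) would still need the converter's own code on top.

```javascript
// Collects the plain text of every page of a loaded PDF.js document.
// `pdfDocument` is expected to offer numPages and getPage(), as PDF.js
// document objects do; page numbers start at 1.
async function extractPageTexts(pdfDocument) {
  const pages = [];
  for (let pageNumber = 1; pageNumber <= pdfDocument.numPages; pageNumber++) {
    const page = await pdfDocument.getPage(pageNumber);
    const content = await page.getTextContent();
    // Each text item carries its string in `str`; layout info is ignored here.
    pages.push(content.items.map((item) => item.str).join(' '));
  }
  return pages;
}
```

Because the function only relies on `numPages`, `getPage`, and `getTextContent`, it can be exercised with a stub object in tests, without a browser or a real PDF.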

How can we install it on localhost?

Hello Johannes,

Thanks for pdf-to-markdown.

I was trying to install it on my local machine but didn't succeed in getting it running. Is it possible?

I cloned the git repository (git clone https://github.com/jzillmann/pdf-to-markdown) and then ran a few npm commands (npm install, npm lint, ...), but once that is done, how can I start the interface?

The static page src/index.html still shows only the empty <div id="main"/> (which seems logical), so how can I install and run it locally?

Thanks a lot in advance!

Broken paragraphs (enhancement)

The paragraphs are broken into several short lines. I know this is not really a problem, since Markdown always treats consecutive lines as a single paragraph. But in case you want to add functionality to create very long lines (i.e. one line per paragraph), I've written this shell script. Perhaps you could translate it to JavaScript if it looks interesting. It works well, although it's not perfect.

GitHub Gist: paragrapher

The purpose of this script is to analyze plain text files (with or without the ".txt" extension) looking for broken paragraphs, i.e., paragraphs split across more than one line, and join each of them into a single very long line.
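
The basic idea can be sketched in JavaScript as follows. This is not a translation of the gist (whose heuristics are not shown here) but a naive version that assumes blank lines separate paragraphs; the function name `joinParagraphs` is made up for the example.

```javascript
// Joins the lines of each paragraph into one long line.
// Paragraphs are assumed to be separated by one or more blank lines.
function joinParagraphs(text) {
  return text
    .replace(/\r\n/g, '\n')          // normalize Windows line endings
    .split(/\n\s*\n/)                // split into paragraphs at blank lines
    .map((paragraph) =>
      paragraph
        .split('\n')
        .map((line) => line.trim())
        .filter((line) => line.length > 0)
        .join(' ')
    )
    .filter((paragraph) => paragraph.length > 0)
    .join('\n\n');
}
```

Applied to Markdown rather than plain text, this would also merge list items and other line-oriented constructs, so a real implementation would need to skip those.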

This is an amazing tool -- I do have one FR though.

I love pdf2markdown so much, and very much appreciate that it's an online tool (I haven't figured out how to make the code run in linux, so for now I am content with the website).

I was wondering, though: are there any plans to integrate PDF to Markdown into a PKM (personal knowledge management) tool like Obsidian? I have to convert a boatload of PDFs to markdown to put them in Obsidian, and from a purely self-serving perspective, it would vastly improve my efficiency if such a conversion tool existed within the app.

I'm sure you have enough on your plate, but I thought I'd at least ask. Also, is there any way to provide a little financial support for all your work? I relied on pdf2markdown a great deal for a recent project, and will continue to sing its praises and use it in the future. I even wrote a little article on what I was working on and how helpful pdf2markdown was to achieving my goals. https://careylening.substack.com/p/the-power-of-links-and-second-brains-d1d

Tried your markdown PDF converter but nothing showed up

Hi team, I tried your converter but nothing showed up on the screen.
Thanks for the chance though, and if you know what happened let me know and I will be back, as I have many PDFs that have to be converted as Chat40 is waiting for me.

here is my page
https://pdf2md.morethan.io/#:~:text=This%20tool%20converts%20a%20PDF,different%20tools%20and%20different%20ages.

It's probably my fault as i am one of those humans who is allergic to code,
and my Neurons simply Refuse to accept any I try to give them... ;)

waiting in anticipation of a successful solution as i really need you badly.

Thanks for the opportunity.

cheers Kev Borg
Kiwi

Gather PDF's for test suite

  • Gather various PDFs that can be used as a test suite
  • Those should be safe to add to the repository licensing-wise 👀

CLI tool

It would be nice to be able to use pdf-to-markdown from the command line!

Can not view the converted Markdown result

Uncaught TypeError: Cannot set properties of null (setting 'innerHTML')
at jqmini.js:1:53
VM96:405 Edge Translation started
single-file-hooks-frames.js:1 OTS parsing error: glyf: Failed to parse table
e.FontFace @ single-file-hooks-frames.js:1
bundle.js:2 Warning: Failed to load font 'g_d0_f956': 'SyntaxError: Invalid font data in ArrayBuffer.'.
single-file-hooks-frames.js:1 OTS parsing error: glyf: Failed to parse table
e.FontFace @ single-file-hooks-frames.js:1
bundle.js:2 Warning: Failed to load font 'g_d0_f961': 'SyntaxError: Invalid font data in ArrayBuffer.'.
single-file-hooks-frames.js:1 OTS parsing error: glyf: Failed to parse table
e.FontFace @ single-file-hooks-frames.js:1
bundle.js:2 Warning: Failed to load font 'g_d0_f981': 'SyntaxError: Invalid font data in ArrayBuffer.'.
single-file-hooks-frames.js:1 OTS parsing error: glyf: Failed to parse table
e.FontFace @ single-file-hooks-frames.js:1
bundle.js:2 Warning: Failed to load font 'g_d0_f991': 'SyntaxError: Invalid font data in ArrayBuffer.'.
pdf2md.morethan.io/:1 Uncaught (in promise) Unsupported headline level: 7 (supported are 1-6)
2pdf2md.morethan.io/:1 Uncaught (in promise) Error: Could not establish connection. Receiving end does not exist.
bundle.js:2 Uncaught TypeError: Cannot read properties of null (reading 'getHostNode')
at Object.getHostNode (bundle.js:2:564541)
at l.getHostNode (bundle.js:2:522842)
at Object.getHostNode (bundle.js:2:564541)
at Object.updateChildren (bundle.js:2:519030)
at Y._reconcilerUpdateChildren (bundle.js:2:560442)
at Y._updateChildren (bundle.js:2:561369)
at Y.updateChildren (bundle.js:2:561267)
at Y._updateDOMChildren (bundle.js:2:537299)
at Y.updateComponent (bundle.js:2:535523)
at Y.receiveComponent (bundle.js:2:535076)
3pdf2md.morethan.io/:1 Uncaught (in promise) Error: Could not establish connection. Receiving end does not exist.
2bundle.js:2 Uncaught TypeError: Cannot read properties of null (reading 'getHostNode')
at Object.getHostNode (bundle.js:2:564541)
at l.getHostNode (bundle.js:2:522842)
at Object.getHostNode (bundle.js:2:564541)
at Object.updateChildren (bundle.js:2:519030)
at Y._reconcilerUpdateChildren (bundle.js:2:560442)
at Y._updateChildren (bundle.js:2:561369)
at Y.updateChildren (bundle.js:2:561267)
at Y._updateDOMChildren (bundle.js:2:537299)
at Y.updateComponent (bundle.js:2:535523)
at Y.receiveComponent (bundle.js:2:535076)

UTF-8 Support

Is it possible to add UTF-8 support to the app? Trying to convert PDFs with CJK characters ends up with garbled text.
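
One possible cause (an assumption; nothing in this issue confirms it) is that PDF.js is loaded without its CMap files, which it needs to decode many CJK fonts. The sketch below shows how the `cMapUrl` and `cMapPacked` parameters of `getDocument` could be supplied; the helper name `buildLoadParams` and the `cmaps/` location are hypothetical.

```javascript
// Builds the parameter object for pdfjs.getDocument() with CMap support.
// cMapBaseUrl must point at a directory serving the packed CMap files
// (pdfjs-dist ships them in its cmaps/ folder).
function buildLoadParams(url, cMapBaseUrl) {
  const base = cMapBaseUrl.endsWith('/') ? cMapBaseUrl : cMapBaseUrl + '/';
  return {
    url,
    cMapUrl: base,     // PDF.js expects a trailing slash here
    cMapPacked: true,  // the shipped .bcmap files are in packed form
  };
}

// Possible use in the app (the 'cmaps/' path is an assumption):
//   pdfjs.getDocument(buildLoadParams(pdfUrl, 'cmaps/'));
```

If the text is still garbled with CMaps available, the problem more likely lies in the PDF's own font encoding than in UTF-8 handling.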
