jean-humann / docs-to-pdf

Generate PDF for document website 🧑‍🔧

Home Page: https://www.npmjs.com/package/docs-to-pdf

License: MIT License

JavaScript 0.94% TypeScript 10.89% Shell 0.07% Dockerfile 0.22% MDX 0.58% HTML 87.04% CSS 0.25%

documentation docusaurus docusaurus-documentation pdf-generation pdf pdf-converter

docs-to-pdf's Introduction

Docs to PDF

📌 Introduction

docs-to-pdf generates a PDF from a documentation website such as Docusaurus. It is a fork of mr-pdf, which is no longer maintained. Feel free to contribute to this project.

📦 Installation

npm install -g docs-to-pdf

🚀 Quick Start

npx docs-to-pdf --initialDocURLs="https://docusaurus.io/docs/" --contentSelector="article" --paginationSelector="a.pagination-nav__link.pagination-nav__link--next" --excludeSelectors=".margin-vert--xl a,[class^='tocCollapsible'],.breadcrumbs,.theme-edit-this-page" --coverImage="https://docusaurus.io/img/docusaurus.png" --coverTitle="Docusaurus v2"

⚡ Usage

For Docusaurus v2

npx docs-to-pdf docusaurus --initialDocURLs="https://docusaurus.io/docs/"

npx docs-to-pdf --initialDocURLs="https://docusaurus.io/docs/" --contentSelector="article" --paginationSelector="a.pagination-nav__link.pagination-nav__link--next" --excludeSelectors=".margin-vert--xl a,[class^='tocCollapsible'],.breadcrumbs,.theme-edit-this-page" --coverImage="https://docusaurus.io/img/docusaurus.png" --coverTitle="Docusaurus v2"

🍗 CLI Global Options

Option	Required	Description
`--initialDocURLs`	Yes	URL to start generating the PDF from.
`--contentSelector`	No	CSS selector used to find the main content of each page.
`--paginationSelector`	No	CSS selector used to find the link to the next page to print, for looping.
`--excludeURLs`	No	URLs to exclude from the PDF.
`--excludeSelectors`	No	Selectors to exclude from the PDF. Separate selectors with a comma and no space; spaces inside a selector are allowed. ex: `--excludeSelectors=".nav,.next > a"`
`--cssStyle`	No	CSS style to adjust the PDF output. ex: `--cssStyle="body{padding-top: 0;}"` If you are the project owner, you can use `@media print { }` to edit the CSS for the PDF.
`--outputPDFFilename`	No	name of the output PDF file. Default is `docs-to-pdf.pdf`
`--pdfMargin`	No	set margin around PDF file. Separate each margin with comma and no space. ex: `--pdfMargin="10,20,30,40"`. This sets margin `top: 10px, right: 20px, bottom: 30px, left: 40px`
`--paperFormat`	No	PDF paper format. ex: `--paperFormat="A3"`. See the Puppeteer documentation for available formats.
`--disableTOC`	No	Optional toggle to show the table of contents or not
`--coverTitle`	No	Title for the PDF cover.
`--coverImage`	No	`<src>` Image for PDF cover (does not support SVG)
`--coverSub`	No	Subtitle for the PDF cover. Add `<br/>` tags for multiple lines.
`--headerTemplate`	No	HTML template for the print header. See the Puppeteer documentation for details on injecting values.
`--footerTemplate`	No	HTML template for the print footer. See the Puppeteer documentation for details on injecting values.
`--puppeteerArgs`	No	Additional Puppeteer BrowserLaunchArgumentOptions arguments. ex: `--sandbox`. See the Puppeteer documentation.
`--protocolTimeout`	No	Timeout setting for individual protocol calls in milliseconds. If omitted, the default value of 180000 ms (3 min) is used
`--filterKeyword`	No	Only adds pages containing the given meta keyword to the PDF. Makes it possible to generate PDFs of selected pages.
`--baseUrl`	No	Base URL for all relative URLs. Allows rendering the PDF on localhost (CI/GitHub Actions) while referencing the deployed page.
`--excludePaths`	No	URL paths to exclude.
`--restrictPaths`	No	Keep only URL paths with the same root path as `--initialDocURLs`.
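
The comma-separated formats described for `--pdfMargin` and `--excludeSelectors` can be illustrated with a small sketch. The helper names `parsePdfMargin` and `splitSelectors` are hypothetical and are not taken from the docs-to-pdf source; treating a single value such as `"20"` as applying to all four sides is an assumption based on the Docusaurus v1 example, not documented behavior.

```typescript
// Hypothetical helpers; docs-to-pdf's real parsing code may differ.
interface PdfMargin {
  top: string;
  right: string;
  bottom: string;
  left: string;
}

// "10,20,30,40" -> top, right, bottom, left in px.
// Assumption: a single value such as "20" applies to all four sides.
function parsePdfMargin(value: string): PdfMargin {
  const parts = value.split(",");
  if (parts.length === 1) {
    const all = `${parts[0]}px`;
    return { top: all, right: all, bottom: all, left: all };
  }
  if (parts.length !== 4) {
    throw new Error(`expected 1 or 4 comma-separated values, got ${parts.length}`);
  }
  const [top, right, bottom, left] = parts.map((p) => `${p}px`);
  return { top, right, bottom, left };
}

// ".nav,.next > a" -> [".nav", ".next > a"]; spaces inside a selector survive
// because the split happens on commas only.
function splitSelectors(value: string): string[] {
  return value.split(",").filter((s) => s.length > 0);
}
```

Splitting on bare commas is why the table asks for no space after each comma: a space would become part of the next selector.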

Docusaurus Options

Option	Required	Description
`--version`	No	Docusaurus version. Default is 2.
`--builDir`	No	Path to the Docusaurus build directory. Either absolute or relative to the shell's working directory.

🎨 Examples and Demo PDF

Docusaurus v2

https://docusaurus.io/

initialDocURLs: https://docusaurus.io/docs

demoPDF: https://github.com/jean-humann/docs-to-pdf/blob/master/pdf/v2-docusaurus.pdf

command:

npx docs-to-pdf docusaurus --initialDocURLs="https://docusaurus.io/docs/"

npx docs-to-pdf --initialDocURLs="https://docusaurus.io/docs/" --contentSelector="article" --paginationSelector="a.pagination-nav__link.pagination-nav__link--next" --excludeSelectors=".margin-vert--xl a,[class^='tocCollapsible'],.breadcrumbs,.theme-edit-this-page" --coverImage="https://docusaurus.io/img/docusaurus.png" --coverTitle="Docusaurus v2"

Docusaurus v1 - Legacy

https://docusaurus.io/en/

initialDocURLs: https://docusaurus.io/docs/en/installation

demoPDF: https://github.com/jean-humann/docs-to-pdf/blob/master/pdf/v1-docusaurus.pdf

command:

npx docs-to-pdf docusaurus --initialDocURLs="https://docusaurus.io/docs/en/installation" --version=1

npx docs-to-pdf --initialDocURLs="https://docusaurus.io/docs/en/installation" --contentSelector="article" --paginationSelector=".docs-prevnext > a.docs-next" --excludeSelectors=".fixedHeaderContainer,footer.nav-footer,#docsNav,nav.onPageNav,a.edit-page-link,div.docs-prevnext" --cssStyle=".navPusher {padding-top: 0;}" --pdfMargin="20"

PR to add new docs is welcome here... 😸

📄 How `docs-to-pdf` works

Puppeteer can convert HTML to PDF, just as you can print an HTML page from the Chrome browser.
So the idea of docs-to-pdf is to build one big HTML document by looping through the page links, then run page.pdf() from Puppeteer to generate the PDF.
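
The looping idea can be sketched without a browser. This is a minimal illustration, not the project's code: the `DocPage` type, the in-memory `demoSite` map, and `buildCombinedHtml` are hypothetical stand-ins for what Puppeteer would retrieve via `--contentSelector` and `--paginationSelector`.

```typescript
// Hypothetical in-memory stand-in for a documentation site.
interface DocPage {
  html: string;     // content matched by --contentSelector
  nextUrl?: string; // href matched by --paginationSelector, if any
}

// Follow "next" links from the start page and join the content into one document.
function buildCombinedHtml(startUrl: string, site: Map<string, DocPage>): string {
  const sections: string[] = [];
  let url: string | undefined = startUrl;
  while (url !== undefined) {
    const page: DocPage | undefined = site.get(url);
    if (page === undefined) break; // page could not be retrieved
    sections.push(page.html);
    url = page.nextUrl; // undefined on the last page ends the loop
  }
  return sections.join("\n");
}

const demoSite = new Map<string, DocPage>([
  ["/docs/", { html: "<article>Intro</article>", nextUrl: "/docs/install" }],
  ["/docs/install", { html: "<article>Install</article>", nextUrl: "/docs/config" }],
  ["/docs/config", { html: "<article>Config</article>" }],
]);

const combinedHtml = buildCombinedHtml("/docs/", demoSite);
```

In the real tool, the combined HTML would then be loaded into a page and passed to Puppeteer's `page.pdf()`.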

🎉 Thanks

This repo's code comes from https://github.com/KohheePeace/mr-pdf.

Thanks for the awesome code by @KohheePeace, @maxarndt and @aloisklink.

@bojl's approach to generating the TOC was an awesome breakthrough.

docs-to-pdf's Issues

bookmarks

Could generating PDF bookmarks be supported?

Hyperlinks in PDF linking to web documentation

The links (apart from TOC) inside the PDF open up the corresponding web page instead of the PDF page. Is there a way to ensure the links point to the heading in the PDF instead of the web page?

docs-to-pdf runs forever with circular links

Converting https://python.langchain.com/ runs forever because the next link on https://python.langchain.com/docs/expression_language/cookbook/tools leads back to a previous page.
npx docs-to-pdf --initialDocURLs="https://python.langchain.com/" --contentSelector="article" --paginationSelector="a.pagination-nav__link.pagination-nav__link--next" --coverImage="https://upload.wikimedia.org/wikipedia/commons/3/3f/LangChain_logo.png" --coverTitle="LangChain"
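
One possible guard against this, sketched under assumptions: remember every URL already retrieved and stop as soon as a "next" link points to one of them. The function `crawlOrder` and the `nextOf` map are hypothetical and only model the pagination links; this is not how docs-to-pdf is known to handle the case.

```typescript
// Hypothetical: `nextOf` maps each URL to the URL its "next" link points to.
function crawlOrder(startUrl: string, nextOf: Map<string, string>): string[] {
  const visited = new Set<string>();
  const order: string[] = [];
  let url: string | undefined = startUrl;
  // Stop when there is no next link or when the link leads to a visited page.
  while (url !== undefined && !visited.has(url)) {
    visited.add(url);
    order.push(url);
    url = nextOf.get(url);
  }
  return order;
}

const circularLinks = new Map<string, string>([
  ["/a", "/b"],
  ["/b", "/c"],
  ["/c", "/a"], // "next" leads back to an earlier page
]);
const visitedOrder = crawlOrder("/a", circularLinks);
```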

Basic Auth support

Hi Jean,

thanks for creating this project.
It works great for me.

The production version of my documentation is behind a basic auth access.
Would it be possible to add the credentials at startup of the crawler?

Kind regards

ProtocolError: Runtime.callFunctionOn timed out.

Error on generating - timeout

I am trying to generate PDF from

npx docs-to-pdf --initialDocURLs="https://ignatandrei.github.io/RSCG_Examples/v2/docs/List-of-RSCG" --contentSelector="article" --paginationSelector="a.pagination-nav__link.pagination-nav__link--next" --excludeSelectors=".margin-vert--xl a,[class^='tocCollapsible'],.breadcrumbs,.theme-edit-this-page"  --coverTitle="RSCG --protocolTimeout=54000"

Everything goes well until the final step:
[30.08.2023 23:15.27.852] [LOG] Start generating PDF...
[30.08.2023 23:15.27.852] [LOG] Generate cover...
[30.08.2023 23:15.27.852] [LOG] Start generating TOC...
[30.08.2023 23:15.27.958] [LOG] Restructuring the html of a document...
[30.08.2023 23:15.35.378] [LOG] Remove unnecessary HTML...
[30.08.2023 23:15.35.379] [LOG] Scroll to the bottom of the page...
[30.08.2023 23:16.29.393] [ERROR] ProtocolError: Runtime.callFunctionOn timed out. Increase the 'protocolTimeout' setting in launch/connect calls for a higher timeout if needed.
at <instance_members_initializer> (C:\Users\ignat\AppData\Local\npm-cache_npx\c16ac64a6c7aba73\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:49:14)
at new Callback (C:\Users\ignat\AppData\Local\npm-cache_npx\c16ac64a6c7aba73\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:53:16)
at CallbackRegistry.create (C:\Users\ignat\AppData\Local\npm-cache_npx\c16ac64a6c7aba73\node_modules\puppeteer-core\lib\cjs\puppeteer\common\Connection.js:93:26)

Could you please help?

Quick Start example doesn't work

I tried running the example from the README

npx docs-to-pdf --initialDocURLs="https://docusaurus.io/docs/" --contentSelector="article" --paginationSelector="a.pagination-nav__link.pagination-nav__link--next" --excludeSelectors=".margin-vert--xl a,[class^='tocCollapsible'],.breadcrumbs,.theme-edit-this-page" --coverImage="https://docusaurus.io/img/docusaurus.png" --coverTitle="Docusaurus v2"

and I got this error:

[10.10.2023 11:08.19.379] [DEBUG] Using Chromium from /home/kkovacs/.cache/puppeteer/chrome/linux-117.0.5938.149/chrome-linux64/chrome
[10.10.2023 11:08.19.607] [DEBUG] Chrome user data dir: /tmp/puppeteer_dev_chrome_profile-2V52e1
[10.10.2023 11:08.19.646] [LOG]   Retrieving html from https://docusaurus.io/docs/
[10.10.2023 11:08.21.047] [DEBUG] Found 0 elements
[10.10.2023 11:08.21.049] [LOG]   Success
[10.10.2023 11:08.21.051] [LOG]   Retrieving html from https://docusaurus.io/docs/category/getting-started
[10.10.2023 11:08.22.165] [DEBUG] Found 0 elements
[10.10.2023 11:08.22.166] [LOG]   Success


...


[10.10.2023 11:09.23.630] [LOG]   Success
[10.10.2023 11:09.23.634] [LOG]   Retrieving html from https://docusaurus.io/docs/deployment
[10.10.2023 11:09.25.372] [DEBUG] Found 6 elements
[10.10.2023 11:09.25.379] [DEBUG] Clicking summary: How much resource (person-hours, money) am I willing to invest in this?
[10.10.2023 11:09.26.267] [DEBUG] Clicking summary: How much server-side configuration would I need?
[10.10.2023 11:09.27.104] [DEBUG] Clicking summary: Do I have needs to cooperate?
[10.10.2023 11:09.27.944] [DEBUG] Clicking summary: GitHub action files
[10.10.2023 11:09.28.771] [DEBUG] Clicking summary: GitHub action file
[10.10.2023 11:09.28.780] [ERROR] Error: Node is either not clickable or not an Element
    at CdpElementHandle.clickablePoint (/home/kkovacs/.npm/_npx/c16ac64a6c7aba73/node_modules/puppeteer-core/lib/cjs/puppeteer/api/ElementHandle.js:680:23)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async CdpElementHandle.<anonymous> (/home/kkovacs/.npm/_npx/c16ac64a6c7aba73/node_modules/puppeteer-core/lib/cjs/puppeteer/api/ElementHandle.js:258:32)
    at async CdpElementHandle.click (/home/kkovacs/.npm/_npx/c16ac64a6c7aba73/node_modules/puppeteer-core/lib/cjs/puppeteer/api/ElementHandle.js:710:30)
    at async CdpElementHandle.<anonymous> (/home/kkovacs/.npm/_npx/c16ac64a6c7aba73/node_modules/puppeteer-core/lib/cjs/puppeteer/api/ElementHandle.js:261:36)
    at async openDetails (/home/kkovacs/.npm/_npx/c16ac64a6c7aba73/node_modules/docs-to-pdf/lib/utils.js:212:13)
    at async generatePDF (/home/kkovacs/.npm/_npx/c16ac64a6c7aba73/node_modules/docs-to-pdf/lib/utils.js:82:21)

Just wanted to point this out because I'm struggling to get this to work on my own site, so I wanted a working example for reference.

Templates for arguments

--contentSelector="article" --paginationSelector="a.pagination-nav__link.pagination-nav__link--next" --excludeSelectors=".margin-vert--xl a,[class^='tocCollapsible'],.breadcrumbs,.theme-edit-this-page"

This software always requires very long options. They are so long that no one can type them without reading the README. It would be nice if we could shorten them to something like:

--template docusaurus2

Parameterize Arguments to Table of Contents

Would it be possible to parameterize the title Table of Contents when generating the PDF?

An option to control whether all of `<details>` elements are opened

https://docusaurus.io/docs/markdown-features#details

<details> lets us hide content intended only for experts. It would be nice if we could control whether <details> elements are opened.

In the current version, all <details> elements are closed.

For beginners

For experts

Can Puppeteer perform this operation before printing the joined page?

flowchart TD

S(Start) --> F[Find and open closed elements]
F --> C{New closed\nelements appeared?}
C -->|Yes| F
C -->|No| Done(Done)
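
The flowchart above can be sketched as a loop over a simplified model. Everything here is hypothetical: `DetailsNode` stands in for a `<details>` element, and a nested element is assumed to become reachable only after its parent is opened. A real implementation would query the DOM through Puppeteer instead.

```typescript
// Hypothetical model of nested <details>: a child only becomes visible
// (and therefore clickable) once its parent has been opened.
interface DetailsNode {
  open: boolean;
  children: DetailsNode[];
}

// Closed nodes that are currently reachable through open ancestors.
function visibleClosed(nodes: DetailsNode[]): DetailsNode[] {
  const found: DetailsNode[] = [];
  for (const node of nodes) {
    if (!node.open) {
      found.push(node);
    } else {
      found.push(...visibleClosed(node.children));
    }
  }
  return found;
}

// Repeats "find and open" until a pass finds nothing new; returns the pass count.
function openAllDetails(roots: DetailsNode[]): number {
  let passes = 0;
  for (;;) {
    const closed = visibleClosed(roots);
    if (closed.length === 0) return passes;
    for (const node of closed) node.open = true;
    passes += 1;
  }
}

const detailsTree: DetailsNode[] = [
  { open: false, children: [{ open: false, children: [] }] },
  { open: false, children: [] },
];
const passesNeeded = openAllDetails(detailsTree);
```

With one level of nesting, the loop needs two passes: the first opens the outer elements, the second opens the element that only then became reachable.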

Can you provide a Docker image?

Clean puppeteer_dev_chrome_profile

Puppeteer saves many gigabytes in the tmp folder and never clears them. I ran out of disk space. It would be nice if this were cleaned up.
puppeteer/puppeteer#1791 (comment)

Option to restrict the subpath range

npx docs-to-pdf --initialDocURLs="https://docusaurus.io/docs/markdown-features" --contentSelector="article" --paginationSelector="a.pagination-nav__link.pagination-nav__link--next" --excludeSelectors=".margin-vert--xl a,[class^='tocCollapsible'],.breadcrumbs,.theme-edit-this-page" --coverImage="https://docusaurus.io/img/docusaurus.png" --coverTitle="Docusaurus v2"
[13.08.2023 17:17.08.551] [DEBUG] Using Chromium from C:\Program Files\Google\Chrome\Application\chrome.exe
[13.08.2023 17:17.08.781] [DEBUG] Chrome user data dir: C:\Users\tatsu\AppData\Local\Temp\puppeteer_dev_chrome_profile-wjQgPd
[13.08.2023 17:17.08.870] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features
[13.08.2023 17:17.10.684] [LOG]   Success
[13.08.2023 17:17.10.689] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/react
[13.08.2023 17:17.12.843] [LOG]   Success
[13.08.2023 17:17.12.844] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/tabs
[13.08.2023 17:17.14.508] [LOG]   Success
[13.08.2023 17:17.14.510] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/code-blocks
[13.08.2023 17:17.16.113] [LOG]   Success
[13.08.2023 17:17.16.114] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/admonitions
[13.08.2023 17:17.17.707] [LOG]   Success
[13.08.2023 17:17.17.711] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/toc
[13.08.2023 17:17.19.122] [LOG]   Success
[13.08.2023 17:17.19.127] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/assets
[13.08.2023 17:17.21.602] [LOG]   Success
[13.08.2023 17:17.21.603] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/links
[13.08.2023 17:17.23.143] [LOG]   Success
[13.08.2023 17:17.23.144] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/plugins
[13.08.2023 17:17.24.639] [LOG]   Success
[13.08.2023 17:17.24.641] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/math-equations
[13.08.2023 17:17.26.649] [LOG]   Success
[13.08.2023 17:17.26.650] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/diagrams
[13.08.2023 17:17.28.193] [LOG]   Success
[13.08.2023 17:17.28.194] [LOG]   Retrieving html from https://docusaurus.io/docs/markdown-features/head-metadata
[13.08.2023 17:17.29.655] [LOG]   Success
[13.08.2023 17:17.29.658] [LOG]   Retrieving html from https://docusaurus.io/docs/styling-layout
[13.08.2023 17:17.30.985] [LOG]   Success
[13.08.2023 17:17.30.987] [LOG]   Retrieving html from https://docusaurus.io/docs/swizzling
[13.08.2023 17:17.32.235] [LOG]   Success
︙

Is there an option to prevent this software from fetching pages outside https://docusaurus.io/docs/markdown-features?
This can't be covered by --excludeURLs.
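
The kind of check being asked for can be sketched as a path-prefix test. The helper `isWithinRootPath` is hypothetical; it illustrates the idea behind an option like `--restrictPaths` from the options table, but it is not taken from the docs-to-pdf source.

```typescript
// Hypothetical helper: keep only URLs on the same origin whose path sits
// under the path of the initial URL.
function isWithinRootPath(candidate: string, initialDocURL: string): boolean {
  const root = new URL(initialDocURL);
  const url = new URL(candidate);
  if (url.origin !== root.origin) return false;
  // Drop trailing slashes so "/docs/" and "/docs" behave the same.
  const rootPath = root.pathname.replace(/\/+$/, "");
  return url.pathname === rootPath || url.pathname.startsWith(rootPath + "/");
}

const rootUrl = "https://docusaurus.io/docs/markdown-features";
const keepsReact = isWithinRootPath("https://docusaurus.io/docs/markdown-features/react", rootUrl);
const keepsStyling = isWithinRootPath("https://docusaurus.io/docs/styling-layout", rootUrl);
```

Comparing against `rootPath + "/"` rather than the bare prefix avoids accepting sibling paths that merely start with the same characters.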

Non square images seem to be squashed

This is great!

I've come across an issue when I try to use a non-square image on the cover: it seems to be turned into a square.

Incorrect requirements documented

The docs state --initialDocURLs as the only required parameter. That's incorrect.

Add support for Docusaurus v3.x

I think the title says it all.

Error: Could not find Google Chrome executable for channel 'stable' at '/opt/google/chrome/chrome'.

Since v0.3.1 this error shows up on start, with v0.3.0 everything works fine

Line Break Control / Prevent page breaks after headers

A lot of my pages break at suboptimal places.

I would love to be able to make sure that a header is never the last thing printed on a page.

how to inject vars into html template

As per https://pptr.dev/api/puppeteer.pdfoptions/#properties, how do you pass the following values to --headerTemplate?

- date: formatted print date
- title: document title
- url: document location
- pageNumber: current page number
- totalPages: total pages in the document

Is it `--headerTemplate="${date}"`, etc.?
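
For what it's worth, Puppeteer's PDF options describe injection through class names rather than `${...}` placeholders: an element whose class is `date`, `title`, `url`, `pageNumber`, or `totalPages` gets the corresponding value filled in. A minimal sketch of building such a template follows; the helper names and the layout are hypothetical, and whether docs-to-pdf applies any extra processing to the template is not confirmed here.

```typescript
// Puppeteer fills elements carrying these class names; `${date}`-style
// placeholders are not part of that mechanism.
type InjectedField = "date" | "title" | "url" | "pageNumber" | "totalPages";

function injectedSpan(field: InjectedField): string {
  return `<span class="${field}"></span>`;
}

// Hypothetical layout: title on the left, "page / total" on the right.
function buildHeaderTemplate(): string {
  return (
    `<div style="font-size:10px; width:100%; display:flex; justify-content:space-between; padding:0 10mm;">` +
    `${injectedSpan("title")}<span>${injectedSpan("pageNumber")} / ${injectedSpan("totalPages")}</span>` +
    `</div>`
  );
}

const headerTemplate = buildHeaderTemplate();
```

The resulting string would be passed as the value of `--headerTemplate`, quoted for the shell. An explicit `font-size` is included because header text can otherwise render too small to read.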

How to disable cover and TOC title

Without the coverTitle, coverImage and coverSub options, a blank cover is still generated.
The TOC title "Table of contents:" cannot be modified or disabled.

Search / Select in Mac Preview not working

Hi @jean-humann

I just noticed something very strange. When I open the generated PDF in Firefox, I can select and search text just fine. However, when I open the same file in Mac Preview, the text is not correctly selectable.

Here a video showing it with the example pdf.

Screenshot_2023-08-10_000075.mp4

When I try the same with the PDFs generated by marp, which also uses Puppeteer/Chromium to generate PDFs from HTML, everything works fine. @yhatt do you maybe have some idea on this?

Best codingluke

Error: Node is either not clickable or not an Element when <details> is inside <tabs>

Hello!

I have a page with <tabs>, one of which contains <details>.

Last logs before the error:

[LOG]   Retrieving html from <page url>
[DEBUG] Found 1 elements
[DEBUG] Clicking summary: <element name>

and then the error:

Error: Node is either not clickable or not an Element
    at CdpElementHandle.clickablePoint (C:\Users\user\AppData\Roaming\npm\node_modules\docs-to-pdf\node_modules\puppeteer-core\lib\cjs\puppeteer\api\ElementHandle.js:682:23)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
    at async CdpElementHandle.<anonymous> (C:\Users\user\AppData\Roaming\npm\node_modules\docs-to-pdf\node_modules\puppeteer-core\lib\cjs\puppeteer\api\ElementHandle.js:259:32)
    at async CdpElementHandle.click (C:\Users\user\AppData\Roaming\npm\node_modules\docs-to-pdf\node_modules\puppeteer-core\lib\cjs\puppeteer\api\ElementHandle.js:712:30)
    at async CdpElementHandle.<anonymous> (C:\Users\user\AppData\Roaming\npm\node_modules\docs-to-pdf\node_modules\puppeteer-core\lib\cjs\puppeteer\api\ElementHandle.js:262:36)
    at async openDetails (C:\Users\user\AppData\Roaming\npm\node_modules\docs-to-pdf\lib\utils.js:212:13)
    at async generatePDF (C:\Users\user\AppData\Roaming\npm\node_modules\docs-to-pdf\lib\utils.js:82:21)

Idea: Align headers level to the sidebar nesting, or make page level configurable by meta keywords

At the moment, when generating a PDF from a website, every subpage starts with an <h1>. However, on the website some pages are nested under higher-level pages.

For example:

Here, "getting started" is the entry point and has multiple subpages such as "installation" and "configuration".

I wonder whether it would be worthwhile to find out whether a page is a parent or a child and automatically shift the heading level down by one when it is a child. On "installation", the <h1> would become an <h2>, and so on.
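
A rough sketch of the heading shift, assuming the nesting depth of each page is already known (for example from the sidebar). The function `demoteHeadings` is hypothetical, and rewriting tags with a regular expression is a simplification; a real implementation would more likely change the elements in the DOM.

```typescript
// Hypothetical helper: shift every heading in an HTML fragment down by
// `depth` levels, capping at <h6>.
function demoteHeadings(html: string, depth: number): string {
  return html.replace(/<(\/?)h([1-6])\b/gi, (_match, slash: string, level: string) => {
    const shifted = Math.min(6, Number(level) + depth);
    return `<${slash}h${shifted}`;
  });
}

// A child page one level below "getting started".
const childPage = '<h1 id="x">Installation</h1><h2>Steps</h2>';
const demoted = demoteHeadings(childPage, 1);
```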

💡 We could also manage this with meta keywords, so it would be manually configurable per page :)
Together with the bookmarks enhancement, this would make it superior to Word and Google Docs.

What do you think?