POC doc88.com Downloader

This is a proof-of-concept (POC) downloader for documents hosted on doc88.com. It saves the pages of a given document as PNG or JPEG images. It has no dependencies: it is a short piece of JavaScript that you paste into the Console of your browser's Developer Tools. It was tested in Chrome and Firefox.

Instructions

The download procedure is a bit of a PITA, but hey… it's a POC.

  1. Navigate to the desired document in your browser.

  2. Make sure the browser's zoom level is set to 100%. In testing, zoom levels below 100% appeared to reduce the quality of the captured pages.

  3. Scroll through all the pages in the document, one by one, and make sure all of them have loaded. Depending on the document this might be the most arduous part of the process.

  4. Open Developer Tools (e.g. press Ctrl+Shift+I).

  5. Switch to the JavaScript Console.

  6. For PNGs, paste this JavaScript into the Console and confirm with Enter.

  7. Alternatively, for JPEGs, paste this JavaScript into the Console and confirm with Enter.

  8. Download pages in batches. Type:

    downloadPages(1, 10)

    in Console and hit Enter to download pages 1 through 10.

    • ℹ It is advised to download 10 pages at a time. After saving a batch of pages simply enter downloadPages(11, 20) to download pages 11 through 20, and so on.
    • ℹ In Chrome, the first time you download a batch of pages you may see a popup stating that "This site is attempting to download multiple files". You have to allow it, because each page is downloaded as a separate file.
  9. Make sure all desired pages were downloaded correctly.
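
The batch boundaries used in step 8 can be worked out mechanically. The following is a minimal sketch and not part of the original script: batchRanges and printBatchCalls are hypothetical helper names introduced here, and the sketch assumes that downloadPages(from, to) takes inclusive page numbers, as described above. It only prints the calls to type and does not trigger any downloads itself.

```javascript
// Hypothetical helper (not part of the downloader script):
// split pages 1..totalPages into consecutive inclusive [from, to] batches.
function batchRanges(totalPages, batchSize = 10) {
  if (!Number.isInteger(totalPages) || totalPages < 1) {
    throw new RangeError("totalPages must be a positive integer");
  }
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const ranges = [];
  for (let from = 1; from <= totalPages; from += batchSize) {
    // The last batch may be shorter than batchSize.
    ranges.push([from, Math.min(from + batchSize - 1, totalPages)]);
  }
  return ranges;
}

// Hypothetical helper: print the downloadPages calls to enter one at a time.
// Nothing is downloaded here.
function printBatchCalls(totalPages, batchSize = 10) {
  for (const [from, to] of batchRanges(totalPages, batchSize)) {
    console.log(`downloadPages(${from}, ${to})`);
  }
}
```

For a 25-page document, printBatchCalls(25) prints downloadPages(1, 10), downloadPages(11, 20) and downloadPages(21, 25), which you can then enter one at a time, saving each batch before starting the next.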

That's it!

Converting downloaded images back to a PDF

Under Linux you can easily convert the downloaded images back to a PDF. You will need the ImageMagick package first:

sudo apt-get install imagemagick

Then, in the directory containing the images, run the following command, which produces output.pdf from them:

convert $(ls -1v *.jpg *.png 2>/dev/null | tr '\n' ' ') output.pdf

If you also want to OCR the PDF (recognize its text and make it searchable), install the OCRmyPDF package:

sudo apt-get install ocrmypdf

Then, in the directory containing the PDF, run the following command, which performs text recognition on output.pdf and makes it searchable in place:

ocrmypdf output.pdf output.pdf

Contributors

apankowski
