blinkist-scraper

A Python script to download book summaries and audio from Blinkist and generate some pretty output files.

Installation / Requirements

pip install -r requirements.txt

This script uses ChromeDriver to automate the Google Chrome browser, so Google Chrome needs to be installed for the script to work.

Usage

usage: main.py [-h] [--language {en,de}] [--match-language]
               [--cooldown COOLDOWN] [--headless] [--audio] [--concat-audio]
               [--no-scrape] [--book BOOK] [--book-category BOOK_CATEGORY]
               [--categories CATEGORIES [CATEGORIES ...]]
               [--ignore-categories IGNORE_CATEGORIES [IGNORE_CATEGORIES ...]]
               [--create-html] [--create-epub] [--create-pdf]
               email password
                                                                              
Example with non-optional arguments                                           
                                                                              
positional arguments:                                                         
  email                The email to log into your premium Blinkist account    
  password             The password to log into your premium Blinkist account 
                                                                              
optional arguments:                                                           
  -h, --help           show this help message and exit
  --language {en,de}   The language to scrape books in - either 'en' for
                       English or 'de' for German (default en)
  --match-language     Skip scraping books if not in the requested language
                       (not all books are available in German, default false)
  --cooldown COOLDOWN  Seconds to wait between scraping books and downloading
                       audio files. Can't be smaller than 1 (default 1)
  --headless           Start the automated web browser in headless mode. Works
                       only if you have already logged in once (default false)
  --audio              Download the audio blinks for each book (default true)               
  --concat-audio       Concatenate the audio blinks into a single file and tag
                       it. Requires ffmpeg (default false)                                    
  --no-scrape          Don't scrape the website, only process existing json   
                       files in the dump folder (default false)
  --book BOOK          Scrape this book only; takes the Blinkist URL for the
                       book (e.g. https://www.blinkist.com/en/books/... or
                       https://www.blinkist.com/en/nc/reader/...)
  --book-category BOOK_CATEGORY
                        When scraping a single book, categorize it under this
                        category (works with '--book' only)
  --categories CATEGORIES [CATEGORIES ...]
                        Only the categories whose label contains at least one
                        string here will be scraped. Case-insensitive; use
                        spaces to separate categories. (e.g. "--categories
                        entrep market" will only scrape books under
                        "Entrepreneurship" and "Marketing & Sales")
  --ignore-categories IGNORE_CATEGORIES [IGNORE_CATEGORIES ...]
                        If a category label contains any of these strings,
                        books under that category will not be scraped.
                        Case-insensitive; use spaces to separate categories.
                        (e.g. "--ignore-categories entrep market" will skip
                        scraping of "Entrepreneurship" and "Marketing & Sales")
  --create-html        Generate a formatted html document for the book (default true)        
  --create-epub        Generate a formatted epub document for the book (default true)       
  --create-pdf         Generate a formatted pdf document for the book.
                       Requires wkhtmltopdf (default false)                                   

Basic usage

Run python main.py email password, where email and password are the login details for your premium Blinkist account.

The script uses Selenium with a Chrome driver to scrape the site. Blinkist uses captchas on login, so on the first run the script will wait for the user to solve the captcha and log in (although the email and password fields are filled in automatically from the arguments) - the session cookies are stored so the script can afterwards be run in headless mode with the appropriate flag. The output files are stored in the books folder, arranged in subfolders by category and by the book's title and author.
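For example, to scrape the German catalogue, concatenate the audio blinks and also produce a .pdf (the email and password below are placeholders):

python main.py you@example.com yourpassword --language de --match-language --concat-audio --create-pdf

A single book can also be scraped and filed under a category of your choice:

python main.py you@example.com yourpassword --book https://www.blinkist.com/en/books/... --book-category "Entrepreneurship"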

Customizing HTML output

The script builds a nice-looking html version of the book by using the 'book.html' and 'chapter.html' files in the 'templates' folder as a base. Every parameter between curly braces in those files (e.g. {title}) is replaced by the appropriate value from the book metadata (dumped in the dump folder upon scraping), following a 1-to-1 naming convention with the json parameters (e.g. {title} will be replaced by the title parameter, {who_should_read} by the who_should_read one, and so on).

The special field {__chapters__} is replaced with all the book's chapters. Chapters are created by parsing each chapter object in the book metadata and using the chapter.html template file in the same fashion, replacing tokens with the parameters inside the chapter object.
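As a rough sketch of the substitution described above (not the script's actual implementation - the template paths and json field names are taken from this README, everything else is assumed):

import json

# Minimal templating sketch: fill chapter.html once per chapter object, then
# drop the result into book.html via the special {__chapters__} token.
def render_book(dump_json_path):
    with open(dump_json_path, encoding="utf-8") as f:
        book = json.load(f)
    with open("templates/book.html", encoding="utf-8") as f:
        book_template = f.read()
    with open("templates/chapter.html", encoding="utf-8") as f:
        chapter_template = f.read()

    # Build the {__chapters__} block from each chapter object
    chapters_html = ""
    for chapter in book.get("chapters", []):
        rendered = chapter_template
        for key, value in chapter.items():
            rendered = rendered.replace("{" + key + "}", str(value))
        chapters_html += rendered

    # Replace the special token, then the top-level metadata tokens
    html = book_template.replace("{__chapters__}", chapters_html)
    for key, value in book.items():
        if isinstance(value, str):
            html = html.replace("{" + key + "}", value)
    return html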

Generating .pdf

Add the --create-pdf argument to the script to generate a .pdf file from the .html one. This requires the wkhtmltopdf tool to be installed and present in the PATH.
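For example (placeholders as above):

python main.py you@example.com yourpassword --create-pdf

The conversion itself is roughly what a direct invocation like wkhtmltopdf book.html book.pdf would produce, run against the generated html files.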

Downloading audio

The script downloads audio blinks as well. This is done by waiting for a request to Blinkist's audio endpoint in their library API for the first chapter's audio blink, which is sent as soon as the user navigates to a book's reader page; the valid request's headers are then re-used to build additional requests for the rest of the chapters' audio files. The files are downloaded as .m4a.
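A minimal sketch of that header re-use, assuming a requests-based download; the endpoint pattern, response shape and chapter ids are hypothetical placeholders, since the real values come from the captured browser request and the book dump:

import requests

# Sketch only: captured_headers are the headers of the intercepted (valid)
# audio request; endpoint_template and chapter_ids stand in for the values
# the script derives from that request and the book metadata.
def download_audio_blinks(captured_headers, endpoint_template, chapter_ids, out_dir="."):
    for chapter_id in chapter_ids:
        response = requests.get(endpoint_template.format(chapter_id=chapter_id),
                                headers=captured_headers)
        response.raise_for_status()
        audio_url = response.json()["url"]  # assumed response shape
        audio = requests.get(audio_url, headers=captured_headers)
        audio.raise_for_status()
        with open(f"{out_dir}/{chapter_id}.m4a", "wb") as f:
            f.write(audio.content)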

Concatenating audio files

Add the --concat-audio argument to the script to concatenate the individual audio blinks into a single file and tag it with the appropriate book title and author. This requires the ffmpeg tool to be installed and present in the PATH.
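The result is conceptually what ffmpeg's concat demuxer produces; a manual equivalent would look something like the following, with blinks.txt listing one file 'chapter_N.m4a' line per blink (file names and tag values are placeholders):

ffmpeg -f concat -safe 0 -i blinks.txt -c copy -metadata title="Book Title" -metadata artist="Author Name" "Book Title.m4a"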

Processing book dumps with no scraping

During scraping, the script saves every book's metadata in json files inside the dump folder. Those can be used by the script to re-generate the .html, .epub and .pdf output files without having to scrape the website again. To do so, pass the --no-scrape argument to the script.
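For example, to rebuild the .pdf files from existing dumps (the positional email and password arguments are still required by the command line; placeholders as above):

python main.py you@example.com yourpassword --no-scrape --create-pdf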

Quirks & known Bugs
