
gutenberg's Introduction

Gutenberg Offline

This scraper downloads the whole Project Gutenberg library and puts it in a ZIM file, a clean and user friendly format for storing content for offline usage.


Warning

This scraper is now known to have a serious flaw: a critical bug (#219) has been discovered which leads to incomplete archives. Work on #97 (a complete rewrite of the scraper logic) now seems mandatory to fix these problems. However, we currently lack the bandwidth to make these changes. Help is of course welcome, but be warned that this is a significant project (at least 10 person-days to change the scraper logic so that the issue can be fixed, and probably double that, since humans are always bad at estimations).

Coding guidelines

Main coding guidelines come from the openZIM Wiki

Setting up the environment

Here we will set up everything needed to run the source version on your machine, assuming you want to modify it. If you simply want to run the tool, you should either install the PyPI package or use the Docker image. The Docker image can also be used for development, but it needs a bit of tweaking for live reload of your code modifications.

Install the dependencies

First, ensure you use the proper Python version, in line with the requirement in pyproject.toml (you might for instance use pyenv to manage multiple Python versions in parallel).

You then need to install the various tools/libraries needed by the scraper.

GNU/Linux

sudo apt-get install python-pip python-dev libxml2-dev libxslt-dev advancecomp jpegoptim pngquant p7zip-full gifsicle curl zip zim-tools

macOS

brew install advancecomp jpegoptim pngquant p7zip gifsicle

Setup the package

First, clone this repository.

git clone git@github.com:kiwix/gutenberg.git
cd gutenberg

If you do not already have it on your system, install hatch to build the software and manage virtual environments (you might be interested in our detailed Developer Setup as well).

pip3 install hatch

Start a hatch shell: this will install the software, including its dependencies, in an isolated virtual environment.

hatch shell

That's it. You can now run gutenberg2zim from your terminal.

Getting started

After setting up the whole environment you can just run the main script gutenberg2zim. It will download, process and export the content.

./gutenberg2zim

Arguments

You can also specify parameters to customize the content. Only want books with IDs 100-200? Books only in French? In English? Or both? No problem! You can also include or exclude book formats, add bookshelves, and enable search by title to enrich the user experience.

./gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search

This will download books in English and French that have IDs 100 to 200, in the HTML (default) and PDF formats.

You can find the full arguments list below:

-h --help                       Display this help message
-y --wipe-db                    Empty cached book metadata
-F --force                      Redo step even if target already exists

-l --languages=<list>           Comma-separated list of lang codes to filter export to (preferably ISO 639-1, else ISO 639-3)
-f --formats=<list>             Comma-separated list of formats to filter export to (epub, html, pdf, all)

-e --static-folder=<folder>     Use/write static HTML to this folder
-z --zim-file=<file>            Write ZIM into this file path
-t --zim-title=<title>          Set ZIM title
-n --zim-desc=<description>     Set ZIM description
-d --dl-folder=<folder>         Folder to read/write downloaded ebooks
-u --rdf-url=<url>              Alternative rdf-files.tar.bz2 URL
-b --books=<ids>                Execute the processes for specific books, separated by commas, or dashes for intervals
-c --concurrency=<nb>           Number of concurrent processes for processing tasks
--dlc=<nb>                      Number of concurrent *download* processes (overrides --concurrency); useful if the server blocks high-rate requests
-m --one-language-one-zim=<folder> When more than 1 language, do one ZIM per language (and one with all)
--no-index                      Do NOT create full-text index within ZIM file
--check                         Check dependencies
--prepare                       Download rdf-files.tar.bz2
--parse                         Parse all RDF files and fill-up the DB
--download                      Download ebooks based on filters
--zim                           Create a ZIM file
--title-search                  Add field to search a book by title and directly jump to it
--bookshelves                   Add bookshelves
--optimization-cache=<url>      URL with credentials to S3 bucket for using as optimization cache
--use-any-optimized-version     Try to use any optimized version found on optimization cache

Screenshots

License

GPLv3 or later, see LICENSE for more details.

gutenberg's People

Contributors

aroradhruv2308, aschlumpf, avgp, benoit74, dattaz, elfkuzco, giovannipessiva, jhellingman, jtaylor351, kelson42, nemobis, rashiq, rgaudin, satyamtg, seb35, theparthshukla, vlee1776, zion-fung


gutenberg's Issues

Create a HTML redirection based on author

In the Kiwix search bar it is not possible to get suggestions based on authors. It would be good to create an HTML redirection to books & covers following the scheme "author + title".

Scraper now dies

$gutenberg2zim -l en --dlc=1

[pdf] not avail. for #12506# Critiques and Addresses
                html already exists at dl-cache/12506.html
        Downloading content files for Book #12507
                epub already exists at dl-cache/12507.epub
[pdf] not avail. for #12507# The History of the Rise, Progress and Accomplishment of the Abolition of the African Slave Trade by the British Parliament (1808)
                html already exists at dl-cache/12507.html
        Downloading content files for Book #12508
                epub already exists at dl-cache/12508.epub
[pdf] not avail. for #12508# The Meaning of Good—A Dialogue
                html already exists at dl-cache/12508.html
        Downloading content files for Book #12509
                epub already exists at dl-cache/12509.epub
[pdf] not avail. for #12509# The Moon Rock
                html already exists at dl-cache/12509.html
        Downloading content files for Book #12510
                epub already exists at dl-cache/12510.epub
[pdf] not avail. for #12510# Targum
                html already exists at dl-cache/12510.html
        Downloading content files for Book #12511
                epub already exists at dl-cache/12511.epub
[pdf] not avail. for #12511# Blackwood's Edinburgh Magazine — Volume 53, No. 332, June, 1843
                html already exists at dl-cache/12511.html
        Downloading content files for Book #12512
                epub already exists at dl-cache/12512.epub
Segmentation fault (core dumped)

Make cookie per-project

The cookie is scoped to the gutenberg domain and is thus shared across all Gutenberg-related ZIM files on kiwix-serve.
This might lead to UI glitches and errors.
It should be isolated per project.

Ebook dead links inserted in HTML

If a book has, for example, a PDF version, but for any reason that file is not available in the dl-cache/static directory, the HTML still links to it. To avoid dead links, this should not be the case.

An example: with our current code (dcefb5f), for the following book the PDF is not downloaded and consequently not available, but the generated HTML still provides links/icons pointing to the PDF version. All these links are dead and should not be inserted.
http://www.gutenberg.org/ebooks/11
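
A minimal sketch of the kind of guard that could prevent this, assuming a hypothetical helper and the dl-cache/<id>.<ext> naming visible in the logs elsewhere in this tracker; the HTML template would then only render icons for the formats this returns:

    from pathlib import Path

    def available_formats(book_id, formats, cache_dir="dl-cache"):
        # Hypothetical helper: keep only the formats whose file is actually
        # present in the download cache (naming follows dl-cache/<id>.<ext>).
        cache = Path(cache_dir)
        return [fmt for fmt in formats if (cache / f"{book_id}.{fmt}").exists()]

    # e.g. available_formats(11, ["epub", "pdf", "html"]) would drop "pdf"
    # when dl-cache/11.pdf is missing, so no dead icon gets rendered.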

Crash by exporting #46860

By exporting only books in Portuguese I have found this bug:

$ rm -rf static/ ; ./dump-gutenberg.py --books=46860 --export
EXPORTING ebooks to static folder (and JSON)
[46860]
Filtered book collection size: 1
Filtered book collection, PDF: 0
Filtered book collection, ePUB: 1
Filtered book collection, HTML: 1
Dumping full_by_popularity.js
Dumping full_by_title.js
Dumping lang_pt_by_popularity.js
Dumping lang_pt_by_title.js
Dumping authors_lang_pt.js
Dumping auth_80_by_popularity.js
Dumping auth_80_by_title.js
Dumping authors.js
Dumping languages.js
Dumping main_languages.js
Exporting Book #46860.
Exporting to static/O Napoleão de Nothing Hill.46860.html
Copying companion file to 46860_image002.jpg
Copying /media/data/gutenberg/static/46860_image002.jpg
Copying companion file to 46860_image008.jpg
Copying /media/data/gutenberg/static/46860_image008.jpg
Copying companion file to 46860_image001.jpg
Copying /media/data/gutenberg/static/46860_image001.jpg
Copying companion file to 46860_cc0.png
Copying /media/data/gutenberg/static/46860_cc0.png
Copying companion file to 46860_image003.jpg
Copying /media/data/gutenberg/static/46860_image003.jpg
Copying companion file to 46860_image007.jpg
Copying /media/data/gutenberg/static/46860_image007.jpg
Copying companion file to 46860_image004.jpg
Copying /media/data/gutenberg/static/46860_image004.jpg
Copying companion file to 46860_image006.jpg
Copying /media/data/gutenberg/static/46860_image006.jpg
Copying companion file to 46860_cover.gif
Copying /media/data/gutenberg/static/46860_cover.gif
Copying companion file to 46860_image055.jpg
Copying /media/data/gutenberg/static/46860_image055.jpg
Copying format file to O Napoleão de Nothing Hill.46860.epub
Creating ePUB at /tmp/tmpAnHaAJ.epub
Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 137, in main
only_books=BOOKS)
File "/media/data/gutenberg/gutenberg/export.py", line 155, in export_all_books
books=books)
File "/media/data/gutenberg/gutenberg/export.py", line 557, in export_book_to
archive_name_for(book, format))
File "/media/data/gutenberg/gutenberg/export.py", line 524, in handle_companion_file
optimize_epub(src, tmp_epub.name)
File "/media/data/gutenberg/gutenberg/export.py", line 491, in optimize_epub
with open(opff, 'r') as fd:
IOError: [Errno 2] No such file or directory: u'/tmp/tmpM3AA0L/46860/content.opf'

Crash with invalid epub

            Copying companion file to 28969_wilhelm_tell_schiller_friedrich.epub
            Creating ePUB at /media/data/gutenberg/tmp/tmpFYw00B.epub
            Copying companion file to 28969_pg6780-images.mobi
            shitty ext: /media/data/gutenberg/static/28969_pg6780-images.mobi
            Copying /media/data/gutenberg/static/28969_pg6780-images.mobi
            Exporting HTML file to /media/data/gutenberg/static/28969_6787-h.htm
            Copying companion file to 28969_love_and_intrigue_schiller_friedrich.epub
            Creating ePUB at /media/data/gutenberg/tmp/tmpZq0rSI.epub
            Copying companion file to 28969_the_thirty_years_war_complete_schiller_friedrich.epub
            Creating ePUB at /media/data/gutenberg/tmp/tmp9p2Zzv.epub
            Copying companion file to 28969_3pb216.jpg
            Copying /media/data/gutenberg/static/28969_3pb216.jpg
            Copying companion file to 28969_the_piccolomini_schiller_friedrich.epub
            Creating ePUB at /media/data/gutenberg/tmp/tmphjKR5d.epub
            Copying companion file to 28969_pg6790-images.epub
            Creating ePUB at /media/data/gutenberg/tmp/tmpCyYXya.epub

Traceback (most recent call last):
File "./dump-gutenberg.py", line 154, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 141, in main
only_books=BOOKS)
File "/media/data/gutenberg/gutenberg/export.py", line 155, in export_all_books
books=books)
File "/media/data/gutenberg/gutenberg/export.py", line 548, in export_book_to
handle_companion_file(fname)
File "/media/data/gutenberg/gutenberg/export.py", line 524, in handle_companion_file
optimize_epub(src, tmp_epub.name)
File "/media/data/gutenberg/gutenberg/export.py", line 440, in optimize_epub
with zipfile.ZipFile(src, 'r') as zf:
File "/usr/lib/python2.7/zipfile.py", line 714, in init
self._GetContents()
File "/usr/lib/python2.7/zipfile.py", line 748, in _GetContents
self._RealGetContents()
File "/usr/lib/python2.7/zipfile.py", line 763, in _RealGetContents
raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file

Too long filename

    Exporting Book #2810.
            Exporting to static/Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum—the New York County court house bootblack stand; Recorded by William L. Riordon.2810.html

Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 137, in main
only_books=BOOKS)
File "/media/data/gutenberg/gutenberg/export.py", line 149, in export_all_books
books=books)
File "/media/data/gutenberg/gutenberg/export.py", line 365, in export_book_to
with open(article_fpath, 'w') as f:
IOError: [Errno 36] File name too long: u'static/Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum\u2014the New York County court house bootblack stand; Recorded by William L. Riordon.2810.html'
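
A sketch of one possible fix, assuming the article path is built as <title>.<id>.<ext>: truncate the title part so the whole name stays under the usual 255-byte filesystem limit (the helper name is hypothetical, not the scraper's actual code):

    def safe_filename(title, book_id, ext, max_bytes=255):
        # Keep the ".<id>.<ext>" suffix intact and cut the title so the full
        # name fits within max_bytes (255 bytes is the usual per-name limit
        # on ext4 and most other filesystems).
        suffix = ".{}.{}".format(book_id, ext)
        budget = max_bytes - len(suffix.encode("utf-8"))
        truncated = title.encode("utf-8")[:budget]
        # errors="ignore" drops a multi-byte character cut in half at the edge.
        return truncated.decode("utf-8", errors="ignore") + suffix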

PDF not always downloaded correctly

For the following book, a PDF is available:
http://www.gutenberg.org/ebooks/11

Have a look on the mirror:
http://gutenberg.readingroo.ms/1/11/

But the script seems to be unable to download it:
$ rm -rf static/ ; ./dump-gutenberg.py --keep-db --download --books=11
DOWNLOADING ebooks from mirror using filters
[11]
Downloading content files for Book #11
epub already exists at dl-cache/11.epub
[pdf] Requesting URLs for #11# Alice's Adventures in Wonderland
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /cache/generated/11/11-pdf.pdf HTTP/1.1" 404 227
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /cache/generated/11/11.pdf HTTP/1.1" 404 223
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /cache/generated/11/pg11.pdf HTTP/1.1" 404 225
NO FILE FOR #11/pdf
[u'http://gutenberg.readingroo.ms/cache/generated/11/pg11.pdf',
u'http://gutenberg.readingroo.ms/cache/generated/11/11.pdf',
u'http://gutenberg.readingroo.ms/cache/generated/11/11-pdf.pdf']
html already exists at dl-cache/11.html
(gut)kelson@zimfarm:/media/data/gutenberg$ ls -la http://gutenberg.readingroo.ms/cache/generated/11/11.pdf
ls: cannot access http://gutenberg.readingroo.ms/cache/generated/11/11.pdf: No such file or directory

Language filter does not apply correctly

If you have a multilanguage Gutenberg ZIM file, then you can filter by:

  • language
  • author

But, if you:
1 - Choose a language
2 - Choose an author

You get all the books of this author, regardless of the language.

The language filter should apply to the list of books per author, so that we only get books by this author in the selected language.
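
A short sketch of the expected behaviour, assuming hypothetical book records with author and language fields (the real filtering happens client-side, so this only illustrates the intended logic):

    def books_for_author(books, author, language=None):
        # Expected behaviour: the language filter stacks on top of the
        # author filter instead of being ignored.
        selected = [b for b in books if b["author"] == author]
        if language is not None:
            selected = [b for b in selected if b["language"] == language]
        return selected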

Provide a categorization by genre

Searching the Gutenberg ZIM is hard, and browsing it is impossible if you don't know what you are looking for. I therefore wonder whether it is possible to add genre buttons, like "Crime", "Fantasy", "Erotic literature", etc. Another idea could be to add top lists: best rated crime fiction, most downloaded crime fiction, etc. There are no summaries of the content of each book, and that would have been nice, but I guess that's impossible as such things are not present on the Gutenberg homepage either.

First reported at https://sourceforge.net/p/kiwix/feature-requests/988/

Site just displays the loading spinner

The commit e993ffcb294e7f2cbe02a55305e7f443f8301d01 by @kelson42 broke the site. It's only displaying the loading-spinner now.

(Screenshot: loading spinner, 2014-07-19 22:05.)

Reproduce it by exporting the site:

./dump-gutenberg.py --export -l en

I have tested this on Chrome, Firefox and Safari.

Merge with new_design branch

@kelson42 @Seb35 @rashiq,
the rewrite of the HTML/CSS in a cleaner way is done. It's all in the new_design branch.
Unfortunately, it started right after the hackathon so there is a lot to merge.
I am leaving now and won't have access to a computer for a long time, so one of you has to work on it.

The new code is based on purecss for reset, grids and forms. It has been tested thoroughly on both desktop and mobile and it works fine. @rashiq, if you need to tweak the UI, do it according to the CSS file rules (no px sizes; the less CSS the better).

A couple of notes:

  • the Python should merge easily, as it has only been slightly edited.

  • don't try to merge the templates (all of them), style.css or tools.js. It just won't work and would introduce new problems.

  • instead, port back any commit you made that matters: localization, string fixes, etc.

  • @rashiq please see my comment on the zimwriterfs avoidance commit.

    Good luck; I'll catch up when possible.

--keep-db should be replaced

Parsing all the RDF files is a really long process, even on a good computer. The default behaviour should not be to remove and reparse everything.

If no action is given (--parse, --download, --export, --do-everything, ...) then all the steps should be done and the script should work in update mode, meaning that only missing books/RDF files are parsed. If --books is specified, only the corresponding books are parsed.

The user should still have the ability to erase the DB, so we should provide a new action called --erase-db
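
A minimal sketch of that update mode, assuming we can list the book IDs already in the database and those present in the rdf-files folder (function and variable names are hypothetical):

    def books_to_parse(rdf_ids, db_ids, only_books=None):
        # Update mode: only parse RDF files for books missing from the DB,
        # optionally restricted to the IDs passed via --books.
        missing = set(rdf_ids) - set(db_ids)
        if only_books:
            missing &= set(only_books)
        return sorted(missing)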

Crash during download

"GET /etext93/24006-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext05/24006-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext01/24006-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext00/24006-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext02/24006-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext04/24006-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext95/24006-h.zip HTTP/1.1" 404 217
Downloading content files for Book #24010
[epub] Requesting URLs for #24010# The Gods are Athirst
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /cache/generated/24010/pg24010.epub HTTP/1.1" 200 207009
[pdf] not avail. for #24010# The Gods are Athirst
[html] Requesting URLs for #24010# The Gods are Athirst
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext92/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext90/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext96/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext94/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /2/4/0/1/24010/24010-h.html HTTP/1.1" 404 224
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext98/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /2/4/0/1/24010/24010-h.htm HTTP/1.1" 404 223
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /cache/generated/24010/pg24010.html.utf8 HTTP/1.1" 404 237
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext00/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext93/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext91/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /2/4/0/1/24010/24010-h.zip HTTP/1.1" 200 506925
Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 129, in main
only_books=BOOKS)
File "/media/data/gutenberg/gutenberg/download.py", line 200, in download_all_books
download_cache=download_cache)
File "/media/data/gutenberg/gutenberg/download.py", line 46, in handle_zipped_epub
if not is_safe(n)]):
File "/media/data/gutenberg/gutenberg/download.py", line 34, in is_safe
if path(fname).basename() == fname:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)
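
The crash comes from comparing a byte string containing a non-ASCII character (0xe9, "é") against a path under Python 2's implicit ASCII decoding. A sketch of a unicode-safe variant of the check, not the project's actual fix:

    import os

    def is_safe(fname):
        # Zip member names can be byte strings on Python 2; decode them
        # explicitly (UTF-8 first, then the ZIP spec's cp437 fallback)
        # before doing any path comparison.
        if isinstance(fname, bytes):
            try:
                fname = fname.decode("utf-8")
            except UnicodeDecodeError:
                fname = fname.decode("cp437")
        return os.path.basename(fname) == fname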

New crash by exporting a book

    Exporting Book #12018.
            Exporting to static/Notes and Queries, Number 17, February 23, 1850.12018.html
            Copying format file to Notes and Queries, Number 17, February 23, 1850.12018.epub
            Creating ePUB at /tmp/tmp9J0uzM.epub
            Exporting to static/Notes and Queries, Number 17, February 23, 1850_cover.12018.html
    Exporting Book #12019.
            Exporting to static/Queen Hortense: A Life Picture of the Napoleonic Era.12019.html

Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 137, in main
only_books=BOOKS)
File "/media/data/gutenberg/gutenberg/export.py", line 155, in export_all_books
books=books)
File "/media/data/gutenberg/gutenberg/export.py", line 378, in export_book_to
new_html = update_html_for_static(book=book, html_content=html)
File "/media/data/gutenberg/gutenberg/export.py", line 275, in update_html_for_static
[1 for e in body.children
AttributeError: 'NoneType' object has no attribute 'children'
-rw-rw-r-- 1 kelson kelson 428404 Sep 29 12:20 static/authors.js
-rw-rw-r-- 1 kelson kelson 428404 Sep 29 12:15 static/authors_lang_en.js
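
The traceback shows the export walking body.children on an HTML file that has no <body> element at all. A sketch of the missing guard, assuming BeautifulSoup is the parser (as the .children attribute suggests); this is an illustration, not the project's actual fix:

    from bs4 import BeautifulSoup

    def body_has_content(html_content):
        # Guard sketch: some Gutenberg HTML files have no <body> element,
        # so check for None before iterating over its children.
        soup = BeautifulSoup(html_content, "lxml")
        body = soup.find("body")
        if body is None:
            return False
        return any(True for _ in body.children)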

make download book icon clearer

Consider using an icon that has a down arrow or something more easily recognizable; otherwise someone may hit it accidentally, not knowing what it is (like I did).

Dead links in HTML ebooks

In "Adventures of Huckleberry Finn" (#76), the HTML version of the book has a box/div "Format Choice" a the top of the page.

In that box, the second link with the label "2. A file with images which automatically accomodate to any screen size; this is the best choice for the small screens of Tablets and Smart Phones." is a dead link.

On the online version this link is not dead:
http://www.gutenberg.org/files/76/76-h/76-h.htm

Browser back should work correctly

Steps:
1 - Load page 2 of the results (the default page)
2 - Click to see the cover page of any result
3 - Click the browser back button

You get:
page 1 of the results

You want:
page 2 of the results

So the filtering is kept, but not the pagination.

Too many SQL Variables on Export

root@gutenberg:/var/www/gutenberg# ./dump-gutenberg.py --export
EXPORTING ebooks to static folder (and JSON)
Filtered book collection size: 45463
Filtered book collection, PDF: 962
Filtered book collection, ePUB: 45326
Filtered book collection, HTML: 45294
Dumping full_by_popularity.js
Dumping full_by_title.js
Dumping lang_en_by_popularity.js
Dumping lang_en_by_title.js
Dumping authors_lang_en.js
Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 137, in main
only_books=BOOKS)
File "/var/www/gutenberg/gutenberg/export.py", line 96, in export_all_books
formats=formats)
File "/var/www/gutenberg/gutenberg/export.py", line 569, in export_to_json_helpers
Author.first_names.asc())],
File "/usr/local/lib/python2.7/dist-packages/peewee.py", line 2139, in iter
return iter(self.execute())
File "/usr/local/lib/python2.7/dist-packages/peewee.py", line 2132, in execute
self._qr = ResultWrapper(model_class, self._execute(), query_meta)
File "/usr/local/lib/python2.7/dist-packages/peewee.py", line 1838, in _execute
return self.database.execute_sql(sql, params, self.require_commit)
File "/usr/local/lib/python2.7/dist-packages/peewee.py", line 2414, in execute_sql
self.commit()
File "/usr/local/lib/python2.7/dist-packages/peewee.py", line 2283, in exit
reraise(new_type, new_type(*exc_value.args), traceback)
File "/usr/local/lib/python2.7/dist-packages/peewee.py", line 2406, in execute_sql
cursor.execute(sql, params or ())
peewee.OperationalError: too many SQL variables
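
SQLite caps the number of bound parameters per statement (999 in older builds), so a WHERE ... IN (...) over tens of thousands of book IDs blows past it. A sketch of the usual workaround, splitting the ID list into chunks and merging the per-chunk query results (not the project's actual fix):

    def chunked(ids, size=500):
        # Yield slices small enough to stay under SQLite's parameter limit.
        for i in range(0, len(ids), size):
            yield ids[i:i + size]

    # Run the peewee query once per chunk, e.g.
    #   for chunk in chunked(book_ids):
    #       results.extend(Book.select().where(Book.id << chunk))
    # then merge/sort the results in Python.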

Creation process dies too often because of connectivity problems

For example:

[html] Requesting URLs for #6511# The suppressed Gospels and Epistles of the original New Testament of Jesus the Christ, Volume 5, St. Paul
Starting new HTTP connection (1): gutenberg.readingroo.ms
http://gutenberg.readingroo.ms:80 "GET /etext03/9962-h.zip HTTP/1.1" 404 216
Starting new HTTP connection (1): gutenberg.readingroo.ms
        Downloading content files for Book #12415
Traceback (most recent call last):
  File "/usr/local/bin/gutenberg2zim", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/src/gutenberg2zim/gutenberg2zim", line 192, in <module>
    main(docopt(help, version=0.1))
  File "/src/gutenberg2zim/gutenberg2zim", line 160, in main
    force=FORCE)
  File "/src/gutenberg2zim/gutenbergtozim/download.py", line 225, in download_all_books
    Pool(concurrency).map(dlb, available_books)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
requests.exceptions.ConnectionError: HTTPConnectionPool(host='gutenberg.readingroo.ms', port=80): Max retries exceeded with url: /etext05/741.html.images (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fcdd9e5cb90>: Failed to establish a new connection: [Errno 110] Connection timed out',))
[epub] Requesting URLs for #12415# Byways Around San Francisco Bay
http://gutenberg.readingroo.ms:80 "GET /etext05/5805.html.noimages HTTP/1.1" 404 224
http://gutenberg.readingroo.ms:80 "GET /etext94/4343-h.htm HTTP/1.1" 404 216
Starting new HTTP connection (1): gutenberg.readingroo.ms
Starting new HTTP connection (1): gutenberg.readingroo.ms
Starting new HTTP connection (1): gutenberg.readingroo.ms
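
A sketch of a retry wrapper around the HTTP requests, so a transient connection error does not take down the whole multiprocessing pool (the parameters are placeholders, not existing options):

    import time
    import requests

    def get_with_retries(url, attempts=5, backoff=2.0, timeout=30):
        # Retry transient connection failures with exponential backoff
        # instead of letting the exception bubble up and kill the pool.
        for attempt in range(attempts):
            try:
                return requests.get(url, timeout=timeout)
            except requests.exceptions.ConnectionError:
                if attempt == attempts - 1:
                    raise
                time.sleep(backoff ** attempt)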

Smarter download approach

To download each book, gutenberg2zim tries a list of URL patterns where the book file(s) might be. We have no way to know, for sure, where the files are.

The reason is that, over time, the Gutenberg project has chosen many different methodologies.

Therefore, to download a file, the script has a hardcoded list of potential URLs and simply goes through each of them until one succeeds.

Most of the time, the script makes more than 4-5 tries before hitting the right one. Each try costs time, 0.5-1 second.

That's why it would be nice to make a smart guess about which URL pattern in the list is most likely to work. Here is a proposed method:

each time a URL pattern has proven successful for a book, move it to the top of the list, and always try the patterns from top to bottom.

By doing so, we could have a much faster download process.
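
A small sketch of that move-to-front idea, assuming the patterns are format strings with an {id} placeholder (which is an assumption about how the scraper stores them, not its current behaviour):

    class UrlPatternGuesser:
        # Keep URL patterns ordered by recent success: the pattern that
        # worked last is tried first for the next book.

        def __init__(self, patterns):
            self.patterns = list(patterns)

        def candidates(self, book_id):
            # (pattern, url) pairs to try, in the current priority order.
            return [(p, p.format(id=book_id)) for p in self.patterns]

        def record_success(self, pattern):
            # Move the winning pattern to the front of the list.
            self.patterns.remove(pattern)
            self.patterns.insert(0, pattern)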

On cover page, clickable author

On any cover page, the author is given. This author should be a link, and clicking on it should provide the list of books of this author.

Update/force usage of Python3

Looks like the process now segfaults with Python 2. Do not forget the publication to PyPI and updating the Dockerfile.

Updating Styles

I am looking to update the styles of the pages (like changing the header text or something) but I don't want to reprocess all the books, just the html generated for each book and Home.html... would I use -k? :P Since it takes a bit to rebuild, I would rather not just "try and see"

Provide --withFullTextIndex option

This zimwriterfs option computes a full-text index directly integrated into the ZIM file. Ideally, this option should also be provided by gutenberg2zim.

Using --books only should automatically do everything

Specifying the list of additional steps to run should be optional; without any other additional arguments, all the steps should be done.

Currently it dies:
./dump-gutenberg.py -k --books=10003
CHECKING for dependencies on the system
PREPARING rdf-files cache from http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
rdf-files.tar.bz2 already exists in rdf-files.tar.bz2
RDF-files folder already exists in rdf-files
PARSING rdf-files in rdf-files
Setting up the database
license table already exists.
format table already exists.
author table already exists.
book table already exists.
bookformat table already exists.
Looping throught RDF files in rdf-files
Parsing file rdf-files/10003/pg10003.rdf
Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 121, in main
parse_and_fill(rdf_path=RDF_FOLDER, only_books=BOOKS)
File "/media/data/projs/gutenberg/gutenberg/rdf.py", line 76, in parse_and_fill
parse_and_process_file(fpath)
File "/media/data/projs/gutenberg/gutenberg/rdf.py", line 94, in parse_and_process_file
save_rdf_in_database(parser)
File "/media/data/projs/gutenberg/gutenberg/rdf.py", line 215, in save_rdf_in_database
downloads=parser.downloads
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 3038, in create
inst.save(force_insert=True)
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 3163, in save
pk_from_cursor = self.insert(**field_dict).execute()
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2247, in execute
return self.database.last_insert_id(self._execute(), self.model_class)
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 1838, in _execute
return self.database.execute_sql(sql, params, self.require_commit)
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2414, in execute_sql
self.commit()
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2283, in exit
reraise(new_type, new_type(*exc_value.args), traceback)
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2406, in execute_sql
cursor.execute(sql, params or ())
peewee.IntegrityError: UNIQUE constraint failed: book.id

Be able to avoid using /tmp as temporary directory

It seems the tempfile module uses /tmp as an exchange directory. The problem is that in our case /tmp is not on the same disk and not in the disk array. That definitely slows down the whole process. It would be good to be able to specify an alternative tmp directory (or create one in the same directory structure).
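
A sketch of how the tempfile module could be pointed at a directory on the same disk, assuming a hypothetical tmp sub-directory next to the working tree rather than an existing scraper option:

    import os
    import tempfile

    # Redirect all later mkstemp()/NamedTemporaryFile() calls away from /tmp.
    workdir_tmp = os.path.join(os.getcwd(), "tmp")
    os.makedirs(workdir_tmp, exist_ok=True)
    tempfile.tempdir = workdir_tmp

    with tempfile.NamedTemporaryFile(suffix=".epub", delete=False) as tmp_epub:
        print(tmp_epub.name)  # now lives under ./tmp instead of /tmp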

Keep track of image optimisation

For now it's impossible to know the optimisation status image by image. This is a problem if you want to resume a process (since you don't know, you either skip everything or redo everything).
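
A minimal sketch of one way to keep that trace, using a hypothetical JSON status file next to the cache (not an existing feature of the scraper):

    import json
    from pathlib import Path

    STATUS_FILE = Path("optimization-status.json")  # hypothetical location

    def load_optimized():
        # Set of image names already optimised in a previous run.
        if STATUS_FILE.exists():
            return set(json.loads(STATUS_FILE.read_text()))
        return set()

    def mark_optimized(done, image_name):
        # Record one more optimised image so a resumed run can skip it.
        done.add(image_name)
        STATUS_FILE.write_text(json.dumps(sorted(done)))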

[kinda offtopic] problem installing virtualenvwrapper

Want to check out this project and possibly improve things but I'm stuck with pip install virtualenvwrapper:

The package setup script has attempted to modify files on your system
that are not within the EasyInstall build area, and has been aborted.

This is with virtualenvwrapper-4.3.1.tar.gz. I haven't found an appropriate place to report this to the project (or found another solution) and I want to store this piece of info somewhere for now (before I go to sleep).

Exporting crash, "peewee.OperationalError: too many SQL variables"

command:
./dump-gutenberg.py -l fr,es -f pdf,epub

log:

GET /4/6/3/1/46314/46314-h.htm HTTP/1.1" 404 252
EXPORTING ebooks to static folder (and JSON)
        Filtered book collection size: 2808
        Filtered book collection, PDF: 45
        Filtered book collection, ePUB: 2778
        Filtered book collection, HTML: 2786
                Dumping full_by_popularity.js
                Dumping full_by_title.js
                Dumping lang_fr_by_popularity.js
                Dumping lang_fr_by_title.js
                Dumping authors_lang_fr.js
                Dumping lang_es_by_popularity.js
                Dumping lang_es_by_title.js
                Dumping authors_lang_es.js
Traceback (most recent call last):
  File "./dump-gutenberg.py", line 150, in <module>
    main(docopt(help, version=0.1))
  File "./dump-gutenberg.py", line 137, in main
    only_books=BOOKS)
  File "/media/data/tmp/gutenberg/gutenberg/export.py", line 96, in export_all_books
    formats=formats)
  File "/media/data/tmp/gutenberg/gutenberg/export.py", line 576, in export_to_json_helpers
    for author in authors:
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2139, in __iter__
    return iter(self.execute())
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2132, in execute
    self._qr = ResultWrapper(model_class, self._execute(), query_meta)
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 1838, in _execute
    return self.database.execute_sql(sql, params, self.require_commit)
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2414, in execute_sql
    self.commit()
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2283, in __exit__
    reraise(new_type, new_type(*exc_value.args), traceback)
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2406, in execute_sql
    cursor.execute(sql, params or ())
peewee.OperationalError: too many SQL variables

Thumbs.db files should not be in ZIM files (static directory)

In the download cache (dl-cache) we have a few Thumbs.db files (created by the Windows file explorer). These files should not be copied to the static directory.

dl-cache/37293_Thumbs.db
dl-cache/10912_Thumbs.db
dl-cache/22189_Thumbs.db
dl-cache/16275_Thumbs.db
dl-cache/14472_Thumbs.db
dl-cache/26998_Thumbs.db
dl-cache/11496_Thumbs.db
dl-cache/31242_Thumbs.db
dl-cache/37149_Thumbs.db
dl-cache/37243_Thumbs.db
dl-cache/10862_Thumbs.db
dl-cache/7005_Thumbs.db
dl-cache/37185_Thumbs.db
dl-cache/10796_Thumbs.db
dl-cache/10014_Thumbs.db
dl-cache/11171_Thumbs.db
dl-cache/19149_Thumbs.db
dl-cache/33269_Thumbs.db
dl-cache/37156_Thumbs.db
dl-cache/10015_Thumbs.db
dl-cache/12536_Thumbs.db
dl-cache/37063_Thumbs.db
dl-cache/30526_Thumbs.db
dl-cache/32988_Thumbs.db
dl-cache/16499_Thumbs.db
dl-cache/10013_Thumbs.db
dl-cache/10008_Thumbs.db
dl-cache/33225_Thumbs.db
dl-cache/12848_Thumbs.db
dl-cache/18153_Thumbs.db
dl-cache/33205_Thumbs.db
dl-cache/37164_Thumbs.db
dl-cache/13549_Thumbs.db
dl-cache/10830_Thumbs.db
dl-cache/31168_Thumbs.db
dl-cache/37398_Thumbs.db
dl-cache/32397_Thumbs.db
dl-cache/6961_Thumbs.db
dl-cache/10018_Thumbs.db
dl-cache/36891_Thumbs.db
dl-cache/12278_Thumbs.db
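
A one-line filter of the kind that would prevent this, with the exclusion set kept as a module constant (a sketch, not the scraper's actual code):

    import os

    EXCLUDED_BASENAMES = {"Thumbs.db"}

    def should_copy(fname):
        # Skip OS metadata files when copying companion files from
        # dl-cache to the static directory.
        return os.path.basename(fname) not in EXCLUDED_BASENAMES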

Implement --zim-title and --zim-desc

For now it seems to fail:

./dump-gutenberg.py --keep-db --export --zim-title="Bibliothèque du projet Gutenberg" --zim-desc="Premier producteur d’ebooks gratuits" --zim --languages=fr
Usage: dump-gutenberg.py [-k] [-l LANGS] [-f FORMATS] [-r RDF_FOLDER] [-m URL_MIRROR] [-d CACHE_PATH] [-e STATIC_PATH] [-z ZIM_PATH] [-u RDF_URL] [-b BOOKS] [--prepare] [--parse] [--download] [--export] [--zim] [--complete]
