
gutenberg's Introduction

Gutenberg Offline

This scraper downloads the whole Project Gutenberg library and puts it in a ZIM file, a clean and user friendly format for storing content for offline usage.


Warning

This scraper is now known to have a serious flaw: a critical bug (#219) has been discovered which leads to incomplete archives. Work on #97 (a complete rewrite of the scraper logic) now seems mandatory to fix these problems. However, we currently lack the bandwidth to make these changes. Help is of course welcome, but be warned that this is a significant project (at least 10 person-days to change the scraper logic so that the issue can be fixed, and probably double that, since humans are always bad at estimations).

Coding guidelines

Main coding guidelines come from the openZIM Wiki

Setting up the environment

Here we will set up everything needed to run the source version on your machine, assuming you want to modify it. If you simply want to run the tool, you should either install the PyPI package or use the Docker image. The Docker image can also be used for development, but it needs a bit of tweaking for live reload of your code modifications.

Install the dependencies

First, ensure you use the proper Python version, in line with the requirement in pyproject.toml (you might for instance use pyenv to manage multiple Python versions in parallel).

You then need to install the various tools/libraries needed by the scraper.

GNU/Linux

sudo apt-get install python-pip python-dev libxml2-dev libxslt-dev advancecomp jpegoptim pngquant p7zip-full gifsicle curl zip zim-tools

macOS

brew install advancecomp jpegoptim pngquant p7zip gifsicle

Setup the package

First, clone this repository.

git clone git@github.com:kiwix/gutenberg.git
cd gutenberg

If you do not already have it on your system, install hatch to build the software and manage virtual environments (you might be interested in our detailed Developer Setup as well).

pip3 install hatch

Start a hatch shell: this will install the software, including its dependencies, in an isolated virtual environment.

hatch shell

That's it. You can now run gutenberg2zim from your terminal.

Getting started

After setting up the whole environment you can just run the main script gutenberg2zim. It will download, process and export the content.

./gutenberg2zim

Arguments

You can also specify parameters to customize the content. Only want books with IDs 100-200? Books only in French? In English? Or both? No problem! You can also include or exclude book formats, add bookshelves, and enable search by title to enrich the user experience.

./gutenberg2zim -l en,fr -f pdf --books 100-200 --bookshelves --title-search

This will download books in English and French that have IDs 100 to 200, in the HTML (default) and PDF formats.

You can find the full arguments list below:

-h --help                       Display this help message
-y --wipe-db                    Empty cached book metadata
-F --force                      Redo step even if target already exists

-l --languages=<list>           Comma-separated list of lang codes to filter export to (preferably ISO 639-1, else ISO 639-3)
-f --formats=<list>             Comma-separated list of formats to filter export to (epub, html, pdf, all)

-e --static-folder=<folder>     Use/write static HTML to this folder
-z --zim-file=<file>            Write ZIM into this file path
-t --zim-title=<title>          Set ZIM title
-n --zim-desc=<description>     Set ZIM description
-d --dl-folder=<folder>         Folder to read/write downloaded ebooks
-u --rdf-url=<url>              Alternative rdf-files.tar.bz2 URL
-b --books=<ids>                Execute the processes for specific books, separated by commas, or dashes for intervals
-c --concurrency=<nb>           Number of concurrent processes for processing tasks
--dlc=<nb>                      Number of concurrent *download* processes (overrides --concurrency); useful if the server blocks high-rate requests
-m --one-language-one-zim=<folder> When more than 1 language, do one ZIM per language (and one with all)
--no-index                      Do NOT create full-text index within ZIM file
--check                         Check dependencies
--prepare                       Download rdf-files.tar.bz2
--parse                         Parse all RDF files and fill-up the DB
--download                      Download ebooks based on filters
--zim                           Create a ZIM file
--title-search                  Add field to search a book by title and directly jump to it
--bookshelves                   Add bookshelves
--optimization-cache=<url>      URL with credentials to S3 bucket for using as optimization cache
--use-any-optimized-version     Try to use any optimized version found on optimization cache

Screenshots

License

GPLv3 or later, see LICENSE for more details.

gutenberg's People

Contributors

aroradhruv2308, aschlumpf, avgp, benoit74, dattaz, elfkuzco, giovannipessiva, jhellingman, jtaylor351, kelson42, nemobis, rashiq, rgaudin, satyamtg, seb35, theparthshukla, vlee1776, zion-fung


gutenberg's Issues

Create a HTML redirection based on author

In the Kiwix search bar it is not possible to get suggestions based on authors. It would be good to create an HTML redirection to books & covers following the scheme "author + title".

Scraper now dies

$gutenberg2zim -l en --dlc=1

[pdf] not avail. for #12506# Critiques and Addresses
                html already exists at dl-cache/12506.html
        Downloading content files for Book #12507
                epub already exists at dl-cache/12507.epub
[pdf] not avail. for #12507# The History of the Rise, Progress and Accomplishment of the Abolition of the African Slave Trade by the British Parliament (1808)
                html already exists at dl-cache/12507.html
        Downloading content files for Book #12508
                epub already exists at dl-cache/12508.epub
[pdf] not avail. for #12508# The Meaning of Good—A Dialogue
                html already exists at dl-cache/12508.html
        Downloading content files for Book #12509
                epub already exists at dl-cache/12509.epub
[pdf] not avail. for #12509# The Moon Rock
                html already exists at dl-cache/12509.html
        Downloading content files for Book #12510
                epub already exists at dl-cache/12510.epub
[pdf] not avail. for #12510# Targum
                html already exists at dl-cache/12510.html
        Downloading content files for Book #12511
                epub already exists at dl-cache/12511.epub
[pdf] not avail. for #12511# Blackwood's Edinburgh Magazine — Volume 53, No. 332, June, 1843
                html already exists at dl-cache/12511.html
        Downloading content files for Book #12512
                epub already exists at dl-cache/12512.epub
Segmentation fault (core dumped)

Make cookie per-project

The cookie is scoped to the gutenberg domain and is thus shared across all Gutenberg-related ZIM files on kiwix-serve.
This might lead to UI glitches and errors.
It should be isolated per project.

Ebook dead links inserted in HTML

If a book has, for example, a PDF version, but for any reason that file is not available in the dl-cache/static directory, the HTML still links to it. To avoid dead links, this should not be the case.

An example: with our current code (dcefb5f), for the following book the PDF is not downloaded and consequently not available, but the generated HTML still provides links/icons pointing to the PDF version. All these links are dead and should not be inserted.
http://www.gutenberg.org/ebooks/11
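
A minimal sketch of the kind of guard that could prevent this, assuming a hypothetical helper and the dl-cache/<id>.<ext> naming visible in the logs elsewhere in this tracker; the HTML template would then only render icons for the formats this returns:

    from pathlib import Path

    def available_formats(book_id, formats, cache_dir="dl-cache"):
        # Hypothetical helper: keep only the formats whose file is actually
        # present in the download cache (naming follows dl-cache/<id>.<ext>).
        cache = Path(cache_dir)
        return [fmt for fmt in formats if (cache / f"{book_id}.{fmt}").exists()]

    # e.g. available_formats(11, ["epub", "pdf", "html"]) would drop "pdf"
    # when dl-cache/11.pdf is missing, so no dead icon gets rendered.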

Crash by exporting #46860

By exporting only books in Portuguese I have found this bug:

$ rm -rf static/ ; ./dump-gutenberg.py --books=46860 --export
EXPORTING ebooks to static folder (and JSON)
[46860]
Filtered book collection size: 1
Filtered book collection, PDF: 0
Filtered book collection, ePUB: 1
Filtered book collection, HTML: 1
Dumping full_by_popularity.js
Dumping full_by_title.js
Dumping lang_pt_by_popularity.js
Dumping lang_pt_by_title.js
Dumping authors_lang_pt.js
Dumping auth_80_by_popularity.js
Dumping auth_80_by_title.js
Dumping authors.js
Dumping languages.js
Dumping main_languages.js
Exporting Book #46860.
Exporting to static/O Napoleão de Nothing Hill.46860.html
Copying companion file to 46860_image002.jpg
Copying /media/data/gutenberg/static/46860_image002.jpg
Copying companion file to 46860_image008.jpg
Copying /media/data/gutenberg/static/46860_image008.jpg
Copying companion file to 46860_image001.jpg
Copying /media/data/gutenberg/static/46860_image001.jpg
Copying companion file to 46860_cc0.png
Copying /media/data/gutenberg/static/46860_cc0.png
Copying companion file to 46860_image003.jpg
Copying /media/data/gutenberg/static/46860_image003.jpg
Copying companion file to 46860_image007.jpg
Copying /media/data/gutenberg/static/46860_image007.jpg
Copying companion file to 46860_image004.jpg
Copying /media/data/gutenberg/static/46860_image004.jpg
Copying companion file to 46860_image006.jpg
Copying /media/data/gutenberg/static/46860_image006.jpg
Copying companion file to 46860_cover.gif
Copying /media/data/gutenberg/static/46860_cover.gif
Copying companion file to 46860_image055.jpg
Copying /media/data/gutenberg/static/46860_image055.jpg
Copying format file to O Napoleão de Nothing Hill.46860.epub
Creating ePUB at /tmp/tmpAnHaAJ.epub
Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 137, in main
only_books=BOOKS)
File "/media/data/gutenberg/gutenberg/export.py", line 155, in export_all_books
books=books)
File "/media/data/gutenberg/gutenberg/export.py", line 557, in export_book_to
archive_name_for(book, format))
File "/media/data/gutenberg/gutenberg/export.py", line 524, in handle_companion_file
optimize_epub(src, tmp_epub.name)
File "/media/data/gutenberg/gutenberg/export.py", line 491, in optimize_epub
with open(opff, 'r') as fd:
IOError: [Errno 2] No such file or directory: u'/tmp/tmpM3AA0L/46860/content.opf'

Crash with invalid epub

            Copying companion file to 28969_wilhelm_tell_schiller_friedrich.epub
            Creating ePUB at /media/data/gutenberg/tmp/tmpFYw00B.epub
            Copying companion file to 28969_pg6780-images.mobi
            shitty ext: /media/data/gutenberg/static/28969_pg6780-images.mobi
            Copying /media/data/gutenberg/static/28969_pg6780-images.mobi
            Exporting HTML file to /media/data/gutenberg/static/28969_6787-h.htm
            Copying companion file to 28969_love_and_intrigue_schiller_friedrich.epub
            Creating ePUB at /media/data/gutenberg/tmp/tmpZq0rSI.epub
            Copying companion file to 28969_the_thirty_years_war_complete_schiller_friedrich.epub
            Creating ePUB at /media/data/gutenberg/tmp/tmp9p2Zzv.epub
            Copying companion file to 28969_3pb216.jpg
            Copying /media/data/gutenberg/static/28969_3pb216.jpg
            Copying companion file to 28969_the_piccolomini_schiller_friedrich.epub
            Creating ePUB at /media/data/gutenberg/tmp/tmphjKR5d.epub
            Copying companion file to 28969_pg6790-images.epub
            Creating ePUB at /media/data/gutenberg/tmp/tmpCyYXya.epub

Traceback (most recent call last):
File "./dump-gutenberg.py", line 154, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 141, in main
only_books=BOOKS)
File "/media/data/gutenberg/gutenberg/export.py", line 155, in export_all_books
books=books)
File "/media/data/gutenberg/gutenberg/export.py", line 548, in export_book_to
handle_companion_file(fname)
File "/media/data/gutenberg/gutenberg/export.py", line 524, in handle_companion_file
optimize_epub(src, tmp_epub.name)
File "/media/data/gutenberg/gutenberg/export.py", line 440, in optimize_epub
with zipfile.ZipFile(src, 'r') as zf:
File "/usr/lib/python2.7/zipfile.py", line 714, in init
self._GetContents()
File "/usr/lib/python2.7/zipfile.py", line 748, in _GetContents
self._RealGetContents()
File "/usr/lib/python2.7/zipfile.py", line 763, in _RealGetContents
raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file

Too long filename

    Exporting Book #2810.
            Exporting to static/Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum—the New York County court house bootblack stand; Recorded by William L. Riordon.2810.html

Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 137, in main
only_books=BOOKS)
File "/media/data/gutenberg/gutenberg/export.py", line 149, in export_all_books
books=books)
File "/media/data/gutenberg/gutenberg/export.py", line 365, in export_book_to
with open(article_fpath, 'w') as f:
IOError: [Errno 36] File name too long: u'static/Plunkitt of Tammany Hall: a series of very plain talks on very practical politics, delivered by ex-Senator George Washington Plunkitt, the Tammany philosopher, from his rostrum\u2014the New York County court house bootblack stand; Recorded by William L. Riordon.2810.html'
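
A sketch of one possible fix, assuming the article path is built as <title>.<id>.<ext>: truncate the title part so the whole name stays under the usual 255-byte filesystem limit (the helper name is hypothetical, not the scraper's actual code):

    def safe_filename(title, book_id, ext, max_bytes=255):
        # Keep the ".<id>.<ext>" suffix intact and cut the title so the full
        # name fits within max_bytes (255 bytes is the usual per-name limit
        # on ext4 and most other filesystems).
        suffix = ".{}.{}".format(book_id, ext)
        budget = max_bytes - len(suffix.encode("utf-8"))
        truncated = title.encode("utf-8")[:budget]
        # errors="ignore" drops a multi-byte character cut in half at the edge.
        return truncated.decode("utf-8", errors="ignore") + suffix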

PDF not always downloaded correctly

For the following book, a PDF is available:
http://www.gutenberg.org/ebooks/11

Have a look on the mirror:
http://gutenberg.readingroo.ms/1/11/

But the script seems to be unable to download it:
$ rm -rf static/ ; ./dump-gutenberg.py --keep-db --download --books=11
DOWNLOADING ebooks from mirror using filters
[11]
Downloading content files for Book #11
epub already exists at dl-cache/11.epub
[pdf] Requesting URLs for #11# Alice's Adventures in Wonderland
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /cache/generated/11/11-pdf.pdf HTTP/1.1" 404 227
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /cache/generated/11/11.pdf HTTP/1.1" 404 223
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /cache/generated/11/pg11.pdf HTTP/1.1" 404 225
NO FILE FOR #11/pdf
[u'http://gutenberg.readingroo.ms/cache/generated/11/pg11.pdf',
u'http://gutenberg.readingroo.ms/cache/generated/11/11.pdf',
u'http://gutenberg.readingroo.ms/cache/generated/11/11-pdf.pdf']
html already exists at dl-cache/11.html
(gut)kelson@zimfarm:/media/data/gutenberg$ ls -la http://gutenberg.readingroo.ms/cache/generated/11/11.pdf
ls: cannot access http://gutenberg.readingroo.ms/cache/generated/11/11.pdf: No such file or directory

Language filter does not apply correctly

If you have a multilanguage Gutenberg ZIM file, then you can filter by:

  • language
  • author

But, if you:
1 - Choose a language
2 - Choose an author

You get all the books of this author, regardless of the language.

The language filter should apply to the list of books per author, so that we only get books by this author in the selected language.
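
A short sketch of the expected behaviour, assuming hypothetical book records with author and language fields (the real filtering happens client-side, so this only illustrates the intended logic):

    def books_for_author(books, author, language=None):
        # Expected behaviour: the language filter stacks on top of the
        # author filter instead of being ignored.
        selected = [b for b in books if b["author"] == author]
        if language is not None:
            selected = [b for b in selected if b["language"] == language]
        return selected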

Provide a categorization by genre

Searching the Gutenberg ZIM is hard, and browsing it is impossible if you don't know what you are looking for. I therefore wonder whether it is possible to add genre buttons, like "Crime", "Fantasy", "Erotic literature", etc. Another idea could be to add top lists: best rated crime fiction, most downloaded crime fiction, etc. There are no summaries of the content of each book, and that would have been nice, but I guess that's impossible as such things are not present on the Gutenberg homepage either.

First reported at https://sourceforge.net/p/kiwix/feature-requests/988/

Site just displays the loading spinner

The commit e993ffcb294e7f2cbe02a55305e7f443f8301d01 by @kelson42 broke the site. It's only displaying the loading-spinner now.

(Screenshot: loading spinner, 2014-07-19 22:05.)

Reproduce it by exporting the site:

./dump-gutenberg.py --export -l en

I have tested this on Chrome, Firefox and Safari.

Merge with new_design branch

@kelson42 @Seb35 @rashiq,
the rewrite of the HTML/CSS in a cleaner way is done. It's all in the new_design branch.
Unfortunately, it started right after the hackathon so there is a lot to merge.
I am leaving now and won't have access to a computer for a long time, so one of you has to work on it.

The new code is based on purecss for reset, grids and forms. It has been tested thoroughly on both desktop and mobile and it works fine. @rashiq, if you need to tweak the UI, do it according to the CSS file rules (no px sizes; the less CSS the better).

A couple of notes:

  • the Python should merge easily, as it has only been slightly edited.

  • don't try to merge the templates (all of them), style.css or tools.js. It just won't work and would introduce new problems.

  • instead, port back any commit you made that matters: localization, string fixes, etc.

  • @rashiq please see my comment on the zimwriterfs avoidance commit.

    Good luck; I'll catch up when possible.

--keep-db should be replaced

Parsing all the RDF files is a really long process, even on a good computer. The default behaviour should not be to remove and reparse everything.

If no action is given (--parse, --download, --export, --do-everything, ...) then all the steps should be done and the script should work in update mode, meaning that only missing books/RDF files are parsed. If --books is specified, only the corresponding books are parsed.

The user should still have the ability to erase the DB, so we should provide a new action called --erase-db
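
A minimal sketch of that update mode, assuming we can list the book IDs already in the database and those present in the rdf-files folder (function and variable names are hypothetical):

    def books_to_parse(rdf_ids, db_ids, only_books=None):
        # Update mode: only parse RDF files for books missing from the DB,
        # optionally restricted to the IDs passed via --books.
        missing = set(rdf_ids) - set(db_ids)
        if only_books:
            missing &= set(only_books)
        return sorted(missing)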

Crash during download

"GET /etext93/24006-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext05/24006-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext01/24006-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext00/24006-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext02/24006-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext04/24006-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext95/24006-h.zip HTTP/1.1" 404 217
Downloading content files for Book #24010
[epub] Requesting URLs for #24010# The Gods are Athirst
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /cache/generated/24010/pg24010.epub HTTP/1.1" 200 207009
[pdf] not avail. for #24010# The Gods are Athirst
[html] Requesting URLs for #24010# The Gods are Athirst
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext92/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext90/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext96/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext94/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /2/4/0/1/24010/24010-h.html HTTP/1.1" 404 224
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext98/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /2/4/0/1/24010/24010-h.htm HTTP/1.1" 404 223
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /cache/generated/24010/pg24010.html.utf8 HTTP/1.1" 404 237
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext00/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext93/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /etext91/24010-h.zip HTTP/1.1" 404 217
Starting new HTTP connection (1): gutenberg.readingroo.ms
"GET /2/4/0/1/24010/24010-h.zip HTTP/1.1" 200 506925
Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 129, in main
only_books=BOOKS)
File "/media/data/gutenberg/gutenberg/download.py", line 200, in download_all_books
download_cache=download_cache)
File "/media/data/gutenberg/gutenberg/download.py", line 46, in handle_zipped_epub
if not is_safe(n)]):
File "/media/data/gutenberg/gutenberg/download.py", line 34, in is_safe
if path(fname).basename() == fname:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)
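
The crash comes from comparing a byte string containing a non-ASCII character (0xe9, "é") against a path under Python 2's implicit ASCII decoding. A sketch of a unicode-safe variant of the check, not the project's actual fix:

    import os

    def is_safe(fname):
        # Zip member names can be byte strings on Python 2; decode them
        # explicitly (UTF-8 first, then the ZIP spec's cp437 fallback)
        # before doing any path comparison.
        if isinstance(fname, bytes):
            try:
                fname = fname.decode("utf-8")
            except UnicodeDecodeError:
                fname = fname.decode("cp437")
        return os.path.basename(fname) == fname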

New crash by exporting a book

    Exporting Book #12018.
            Exporting to static/Notes and Queries, Number 17, February 23, 1850.12018.html
            Copying format file to Notes and Queries, Number 17, February 23, 1850.12018.epub
            Creating ePUB at /tmp/tmp9J0uzM.epub
            Exporting to static/Notes and Queries, Number 17, February 23, 1850_cover.12018.html
    Exporting Book #12019.
            Exporting to static/Queen Hortense: A Life Picture of the Napoleonic Era.12019.html

Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 137, in main
only_books=BOOKS)
File "/media/data/gutenberg/gutenberg/export.py", line 155, in export_all_books
books=books)
File "/media/data/gutenberg/gutenberg/export.py", line 378, in export_book_to
new_html = update_html_for_static(book=book, html_content=html)
File "/media/data/gutenberg/gutenberg/export.py", line 275, in update_html_for_static
[1 for e in body.children
AttributeError: 'NoneType' object has no attribute 'children'
-rw-rw-r-- 1 kelson kelson 428404 Sep 29 12:20 static/authors.js
-rw-rw-r-- 1 kelson kelson 428404 Sep 29 12:15 static/authors_lang_en.js
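
The traceback shows the export walking body.children on an HTML file that has no <body> element at all. A sketch of the missing guard, assuming BeautifulSoup is the parser (as the .children attribute suggests); this is an illustration, not the project's actual fix:

    from bs4 import BeautifulSoup

    def body_has_content(html_content):
        # Guard sketch: some Gutenberg HTML files have no <body> element,
        # so check for None before iterating over its children.
        soup = BeautifulSoup(html_content, "lxml")
        body = soup.find("body")
        if body is None:
            return False
        return any(True for _ in body.children)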

make download book icon clearer

Consider using an icon that has a down arrow or something more easily recognizable; otherwise someone may hit it accidentally, not knowing what it is (like I did).

Dead links in HTML ebooks

In "Adventures of Huckleberry Finn" (#76), the HTML version of the book has a box/div "Format Choice" a the top of the page.

In that box, the second link with the label "2. A file with images which automatically accomodate to any screen size; this is the best choice for the small screens of Tablets and Smart Phones." is a dead link.

On the online version this link is not dead:
http://www.gutenberg.org/files/76/76-h/76-h.htm

Browser back should work correctly

Steps:
1 - Load page 2 of the results (the default page)
2 - Click to see the cover page of any result
3 - Click the browser back button

You get:
page 1 of the results

You want:
page 2 of the results

So the filtering is kept, but not the pagination.

Too many SQL Variables on Export

root@gutenberg:/var/www/gutenberg# ./dump-gutenberg.py --export
EXPORTING ebooks to static folder (and JSON)
Filtered book collection size: 45463
Filtered book collection, PDF: 962
Filtered book collection, ePUB: 45326
Filtered book collection, HTML: 45294
Dumping full_by_popularity.js
Dumping full_by_title.js
Dumping lang_en_by_popularity.js
Dumping lang_en_by_title.js
Dumping authors_lang_en.js
Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 137, in main
only_books=BOOKS)
File "/var/www/gutenberg/gutenberg/export.py", line 96, in export_all_books
formats=formats)
File "/var/www/gutenberg/gutenberg/export.py", line 569, in export_to_json_helpers
Author.first_names.asc())],
File "/usr/local/lib/python2.7/dist-packages/peewee.py", line 2139, in iter
return iter(self.execute())
File "/usr/local/lib/python2.7/dist-packages/peewee.py", line 2132, in execute
self._qr = ResultWrapper(model_class, self._execute(), query_meta)
File "/usr/local/lib/python2.7/dist-packages/peewee.py", line 1838, in _execute
return self.database.execute_sql(sql, params, self.require_commit)
File "/usr/local/lib/python2.7/dist-packages/peewee.py", line 2414, in execute_sql
self.commit()
File "/usr/local/lib/python2.7/dist-packages/peewee.py", line 2283, in exit
reraise(new_type, new_type(*exc_value.args), traceback)
File "/usr/local/lib/python2.7/dist-packages/peewee.py", line 2406, in execute_sql
cursor.execute(sql, params or ())
peewee.OperationalError: too many SQL variables
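
SQLite caps the number of bound parameters per statement (999 in older builds), so a WHERE ... IN (...) over tens of thousands of book IDs blows past it. A sketch of the usual workaround, splitting the ID list into chunks and merging the per-chunk query results (not the project's actual fix):

    def chunked(ids, size=500):
        # Yield slices small enough to stay under SQLite's parameter limit.
        for i in range(0, len(ids), size):
            yield ids[i:i + size]

    # Run the peewee query once per chunk, e.g.
    #   for chunk in chunked(book_ids):
    #       results.extend(Book.select().where(Book.id << chunk))
    # then merge/sort the results in Python.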

Creation process dies too often because of connectivity problems

For example:

[html] Requesting URLs for #6511# The suppressed Gospels and Epistles of the original New Testament of Jesus the Christ, Volume 5, St. Paul
Starting new HTTP connection (1): gutenberg.readingroo.ms
http://gutenberg.readingroo.ms:80 "GET /etext03/9962-h.zip HTTP/1.1" 404 216
Starting new HTTP connection (1): gutenberg.readingroo.ms
        Downloading content files for Book #12415
Traceback (most recent call last):
  File "/usr/local/bin/gutenberg2zim", line 6, in <module>
    exec(compile(open(__file__).read(), __file__, 'exec'))
  File "/src/gutenberg2zim/gutenberg2zim", line 192, in <module>
    main(docopt(help, version=0.1))
  File "/src/gutenberg2zim/gutenberg2zim", line 160, in main
    force=FORCE)
  File "/src/gutenberg2zim/gutenbergtozim/download.py", line 225, in download_all_books
    Pool(concurrency).map(dlb, available_books)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 251, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 567, in get
    raise self._value
requests.exceptions.ConnectionError: HTTPConnectionPool(host='gutenberg.readingroo.ms', port=80): Max retries exceeded with url: /etext05/741.html.images (Caused by NewConnectionError('<requests.packages.urllib3.connection.HTTPConnection object at 0x7fcdd9e5cb90>: Failed to establish a new connection: [Errno 110] Connection timed out',))
[epub] Requesting URLs for #12415# Byways Around San Francisco Bay
http://gutenberg.readingroo.ms:80 "GET /etext05/5805.html.noimages HTTP/1.1" 404 224
http://gutenberg.readingroo.ms:80 "GET /etext94/4343-h.htm HTTP/1.1" 404 216
Starting new HTTP connection (1): gutenberg.readingroo.ms
Starting new HTTP connection (1): gutenberg.readingroo.ms
Starting new HTTP connection (1): gutenberg.readingroo.ms
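
A sketch of a retry wrapper around the HTTP requests, so a transient connection error does not take down the whole multiprocessing pool (the parameters are placeholders, not existing options):

    import time
    import requests

    def get_with_retries(url, attempts=5, backoff=2.0, timeout=30):
        # Retry transient connection failures with exponential backoff
        # instead of letting the exception bubble up and kill the pool.
        for attempt in range(attempts):
            try:
                return requests.get(url, timeout=timeout)
            except requests.exceptions.ConnectionError:
                if attempt == attempts - 1:
                    raise
                time.sleep(backoff ** attempt)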

Smarter download approach

To download each book, gutenberg2zim tries a list of URL patterns where the book file(s) might be. We have no way to know, for sure, where the files are.

The reason is that, over time, the Gutenberg project has chosen many different methodologies.

Therefore, to download a file, the script has a hardcoded list of potential URLs and simply goes through each of them until one succeeds.

Most of the time, the script makes more than 4-5 tries before hitting the right one. Each try costs time, 0.5-1 second.

That's why it would be nice to make a smart guess about which URL pattern in the list is most likely to work. Here is a proposed method:

each time a URL pattern has proven successful for a book, move it to the top of the list, and always try the patterns from top to bottom.

By doing so, we could have a much faster download process.
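
A small sketch of that move-to-front idea, assuming the patterns are format strings with an {id} placeholder (which is an assumption about how the scraper stores them, not its current behaviour):

    class UrlPatternGuesser:
        # Keep URL patterns ordered by recent success: the pattern that
        # worked last is tried first for the next book.

        def __init__(self, patterns):
            self.patterns = list(patterns)

        def candidates(self, book_id):
            # (pattern, url) pairs to try, in the current priority order.
            return [(p, p.format(id=book_id)) for p in self.patterns]

        def record_success(self, pattern):
            # Move the winning pattern to the front of the list.
            self.patterns.remove(pattern)
            self.patterns.insert(0, pattern)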

On cover page, clickable author

On any cover page, the author is given. This author should be a link, and clicking on it should provide the list of books of this author.

Update/force usage of Python3

Looks like the process now segfaults with Python 2. Do not forget the publication to PyPI and updating the Dockerfile.

Updating Styles

I am looking to update the styles of the pages (like changing the header text or something) but I don't want to reprocess all the books, just the html generated for each book and Home.html... would I use -k? :P Since it takes a bit to rebuild, I would rather not just "try and see"

Provide --withFullTextIndex option

This zimwriterfs option computes a full-text index directly integrated into the ZIM file. Ideally, this option should also be provided by gutenberg2zim.

Using --books only should automatically do everything

Specifying the list of additional steps to run should be optional; without any other additional arguments, all the steps should be done.

Currently it dies:
./dump-gutenberg.py -k --books=10003
CHECKING for dependencies on the system
PREPARING rdf-files cache from http://www.gutenberg.org/cache/epub/feeds/rdf-files.tar.bz2
rdf-files.tar.bz2 already exists in rdf-files.tar.bz2
RDF-files folder already exists in rdf-files
PARSING rdf-files in rdf-files
Setting up the database
license table already exists.
format table already exists.
author table already exists.
book table already exists.
bookformat table already exists.
Looping throught RDF files in rdf-files
Parsing file rdf-files/10003/pg10003.rdf
Traceback (most recent call last):
File "./dump-gutenberg.py", line 150, in
main(docopt(help, version=0.1))
File "./dump-gutenberg.py", line 121, in main
parse_and_fill(rdf_path=RDF_FOLDER, only_books=BOOKS)
File "/media/data/projs/gutenberg/gutenberg/rdf.py", line 76, in parse_and_fill
parse_and_process_file(fpath)
File "/media/data/projs/gutenberg/gutenberg/rdf.py", line 94, in parse_and_process_file
save_rdf_in_database(parser)
File "/media/data/projs/gutenberg/gutenberg/rdf.py", line 215, in save_rdf_in_database
downloads=parser.downloads
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 3038, in create
inst.save(force_insert=True)
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 3163, in save
pk_from_cursor = self.insert(**field_dict).execute()
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2247, in execute
return self.database.last_insert_id(self._execute(), self.model_class)
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 1838, in _execute
return self.database.execute_sql(sql, params, self.require_commit)
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2414, in execute_sql
self.commit()
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2283, in exit
reraise(new_type, new_type(*exc_value.args), traceback)
File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2406, in execute_sql
cursor.execute(sql, params or ())
peewee.IntegrityError: UNIQUE constraint failed: book.id

Be able to avoid using /tmp as temporary directory

It seems the tempfile module uses /tmp as an exchange directory. The problem is that in our case /tmp is not on the same disk and not in the disk array. That definitely slows down the whole process. It would be good to be able to specify an alternative tmp directory (or create one in the same directory structure).
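
A sketch of how the tempfile module could be pointed at a directory on the same disk, assuming a hypothetical tmp sub-directory next to the working tree rather than an existing scraper option:

    import os
    import tempfile

    # Redirect all later mkstemp()/NamedTemporaryFile() calls away from /tmp.
    workdir_tmp = os.path.join(os.getcwd(), "tmp")
    os.makedirs(workdir_tmp, exist_ok=True)
    tempfile.tempdir = workdir_tmp

    with tempfile.NamedTemporaryFile(suffix=".epub", delete=False) as tmp_epub:
        print(tmp_epub.name)  # now lives under ./tmp instead of /tmp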

Keep track of image optimisation

For now it's impossible to know the optimisation status image by image. This is a problem if you want to resume a process (since you don't know, you either skip everything or redo everything).
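
A minimal sketch of one way to keep that trace, using a hypothetical JSON status file next to the cache (not an existing feature of the scraper):

    import json
    from pathlib import Path

    STATUS_FILE = Path("optimization-status.json")  # hypothetical location

    def load_optimized():
        # Set of image names already optimised in a previous run.
        if STATUS_FILE.exists():
            return set(json.loads(STATUS_FILE.read_text()))
        return set()

    def mark_optimized(done, image_name):
        # Record one more optimised image so a resumed run can skip it.
        done.add(image_name)
        STATUS_FILE.write_text(json.dumps(sorted(done)))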

[kinda offtopic] problem installing virtualenvwrapper

Want to check out this project and possibly improve things but I'm stuck with pip install virtualenvwrapper:

The package setup script has attempted to modify files on your system
that are not within the EasyInstall build area, and has been aborted.

This is with virtualenvwrapper-4.3.1.tar.gz. I haven't found an appropriate place to report this to the project (or found another solution) and I want to store this piece of info somewhere for now (before I go to sleep).

Exporting crash, "peewee.OperationalError: too many SQL variables"

command:
./dump-gutenberg.py -l fr,es -f pdf,epub

log:

GET /4/6/3/1/46314/46314-h.htm HTTP/1.1" 404 252
EXPORTING ebooks to static folder (and JSON)
        Filtered book collection size: 2808
        Filtered book collection, PDF: 45
        Filtered book collection, ePUB: 2778
        Filtered book collection, HTML: 2786
                Dumping full_by_popularity.js
                Dumping full_by_title.js
                Dumping lang_fr_by_popularity.js
                Dumping lang_fr_by_title.js
                Dumping authors_lang_fr.js
                Dumping lang_es_by_popularity.js
                Dumping lang_es_by_title.js
                Dumping authors_lang_es.js
Traceback (most recent call last):
  File "./dump-gutenberg.py", line 150, in <module>
    main(docopt(help, version=0.1))
  File "./dump-gutenberg.py", line 137, in main
    only_books=BOOKS)
  File "/media/data/tmp/gutenberg/gutenberg/export.py", line 96, in export_all_books
    formats=formats)
  File "/media/data/tmp/gutenberg/gutenberg/export.py", line 576, in export_to_json_helpers
    for author in authors:
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2139, in __iter__
    return iter(self.execute())
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2132, in execute
    self._qr = ResultWrapper(model_class, self._execute(), query_meta)
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 1838, in _execute
    return self.database.execute_sql(sql, params, self.require_commit)
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2414, in execute_sql
    self.commit()
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2283, in __exit__
    reraise(new_type, new_type(*exc_value.args), traceback)
  File "/home/kelson/.virtualenvs/gut/local/lib/python2.7/site-packages/peewee.py", line 2406, in execute_sql
    cursor.execute(sql, params or ())
peewee.OperationalError: too many SQL variables

Thumbs.db files should not be in ZIM files (static directory)

In the download cache (dl-cache) we have a few Thumbs.db files (created by the Windows file explorer). These files should not be copied to the static directory.

dl-cache/37293_Thumbs.db
dl-cache/10912_Thumbs.db
dl-cache/22189_Thumbs.db
dl-cache/16275_Thumbs.db
dl-cache/14472_Thumbs.db
dl-cache/26998_Thumbs.db
dl-cache/11496_Thumbs.db
dl-cache/31242_Thumbs.db
dl-cache/37149_Thumbs.db
dl-cache/37243_Thumbs.db
dl-cache/10862_Thumbs.db
dl-cache/7005_Thumbs.db
dl-cache/37185_Thumbs.db
dl-cache/10796_Thumbs.db
dl-cache/10014_Thumbs.db
dl-cache/11171_Thumbs.db
dl-cache/19149_Thumbs.db
dl-cache/33269_Thumbs.db
dl-cache/37156_Thumbs.db
dl-cache/10015_Thumbs.db
dl-cache/12536_Thumbs.db
dl-cache/37063_Thumbs.db
dl-cache/30526_Thumbs.db
dl-cache/32988_Thumbs.db
dl-cache/16499_Thumbs.db
dl-cache/10013_Thumbs.db
dl-cache/10008_Thumbs.db
dl-cache/33225_Thumbs.db
dl-cache/12848_Thumbs.db
dl-cache/18153_Thumbs.db
dl-cache/33205_Thumbs.db
dl-cache/37164_Thumbs.db
dl-cache/13549_Thumbs.db
dl-cache/10830_Thumbs.db
dl-cache/31168_Thumbs.db
dl-cache/37398_Thumbs.db
dl-cache/32397_Thumbs.db
dl-cache/6961_Thumbs.db
dl-cache/10018_Thumbs.db
dl-cache/36891_Thumbs.db
dl-cache/12278_Thumbs.db
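
A one-line filter of the kind that would prevent this, with the exclusion set kept as a module constant (a sketch, not the scraper's actual code):

    import os

    EXCLUDED_BASENAMES = {"Thumbs.db"}

    def should_copy(fname):
        # Skip OS metadata files when copying companion files from
        # dl-cache to the static directory.
        return os.path.basename(fname) not in EXCLUDED_BASENAMES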

Implement --zim-title and --zim-desc

For now it seems to fail:

./dump-gutenberg.py --keep-db --export --zim-title="Bibliothèque du projet Gutenberg" --zim-desc="Premier producteur d’ebooks gratuits" --zim --languages=fr
Usage: dump-gutenberg.py [-k] [-l LANGS] [-f FORMATS] [-r RDF_FOLDER] [-m URL_MIRROR] [-d CACHE_PATH] [-e STATIC_PATH] [-z ZIM_PATH] [-u RDF_URL] [-b BOOKS] [--prepare] [--parse] [--download] [--export] [--zim] [--complete]
