gutenbergtools / ebookconverter

code that orchestrates ebook conversion for project gutenberg

License: GNU General Public License v3.0

Python 32.86% Shell 3.64% HTML 63.50%

ebookconverter's Issues

Create -h.zip after adding headers/footers

I'm not sure if this is for ebookconverter or elsewhere in the processing chain.

For HTML and plain text, ebookmaker adds the header+metadata and footer to posted books.

For HTML, I would like to stop pushing *-h.zip and instead have it created after the header+metadata and footer are added. The *-h.zip can then go in cache/epub/xxx rather than in 1/2/3/...

The *-h.zip is a very useful format since it allows download of the HTML file plus all assets. But I'd like it to have the correct up-to-date metadata from the catalog whenever the other generated formats are built.
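A minimal sketch of how such a zip could be assembled once the header and footer have been added, using only Python's standard zipfile module. The function name build_h_zip and its arguments are hypothetical and not part of ebookconverter; the sketch assumes the finished HTML file and its assets sit together in one directory and that the zip is written somewhere outside that directory.

```python
import zipfile
from pathlib import Path


def build_h_zip(html_dir, zip_path):
    """Zip every file under html_dir, storing paths relative to html_dir.

    Hypothetical helper: assumes zip_path lies outside html_dir, otherwise
    the archive would try to include itself.
    """
    html_dir = Path(html_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(html_dir.rglob("*")):
            if path.is_file():
                # Relative names keep the HTML file's links to its assets valid.
                zf.write(path, path.relative_to(html_dir).as_posix())
    return zip_path
```

Running this as part of the same build that produces the other generated formats would give the zip the same catalog metadata as those formats.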

RDF file needs dcterms:type

Per an email exchange on October 26-28, it seems that the older code that emitted dcterms:type has been replaced, and some functionality was lost (see below).

The desire is for the XML/RDF file to have values that indicate all of the main categories like text, plus the special categories listed under Browsing Options at https://www.gutenberg.org/ebooks:

Audio Book, computer-generated
Audio Book, human-read
Compilations
Data
Music, recorded
Music, Sheet
Other recordings
Pictures, moving
Pictures, still

Email explaining the issue:

Until a month ago, the rdf for these (non-book) files would have had the info
you're looking for in an attribute called dcterms:type.

Before last month, non-book RDF files were those generated by a script that last
ran more than 10 years ago.

A month ago we started re-generating RDF for non-book items in PG with the script
that has been making rdf for book items for the past 10 years, which for unknown
reasons does not emit a dcterms:type attribute.

It should not be a lot of work for us to add this attribute, but it is not likely
to be done until the later part of November, or later depending on resource
availability.

In the meantime, the information you really want is in this file (along with some
info that will never be in rdf files):

https://github.com/gitenberg-dev/gitberg/blob/master/gitenberg/data/missing.tsv

I advise using an RDF parser to read the rdf files; when we update third party
modules in the generating script, I would expect them to reproduce the RDF graph but
not necessarily its xml representation.

Possible timing issue with social media postings

Per some emails on December 9-10, there is an issue where Facebook postings to @gutenberg_new do not have a preview image.

I wonder whether this might be a timing issue, where the posting is sent before the landing page is fully ready.

It will be good to investigate this possibility and, if needed, add a delay before the social media postings.
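One way to test the timing hypothesis, instead of adding a fixed delay, is to poll the landing page until it advertises a preview image and only then post. The sketch below assumes the preview comes from an Open Graph og:image meta tag on the landing page, which is what Facebook reads for link previews. The name wait_for_preview and the fetch and sleep parameters are hypothetical; fetch is any callable that returns the page HTML for a URL.

```python
import re
import time

# Assumption: the landing page exposes its preview image via an og:image tag.
OG_IMAGE = re.compile(r'<meta[^>]+property=["\']og:image["\'][^>]*>', re.IGNORECASE)


def wait_for_preview(fetch, url, attempts=5, delay=60.0, sleep=time.sleep):
    """Return True once the page at url carries an og:image tag.

    Tries up to `attempts` times, sleeping `delay` seconds between tries,
    and returns False if the tag never appears.
    """
    for attempt in range(attempts):
        html = fetch(url)
        if html and OG_IMAGE.search(html):
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

If the function returns False, the posting could be deferred or logged as a warning, which would also confirm whether timing is really the cause.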

Blank ebook pages on gutenberg.org

I've encountered 4 missing ebooks from the gutenberg website: 38200, 57983, 64156, 65643. These pages contain no data, and only have a link to the RDF file. I discovered these because the RDF files also have blank data.

In the July 2021 newsletter, 65643 is listed as, "The lives of celebrated travellers, Vol. 2, by James Augustus St. John"

Related, though I guess it is not an eText release: ID 90907 exists in the rdf-files.tar.zip offline archive and also contains blank data.

P.S. I wasn't sure where to raise this issue so please point me to the correct place if needed.

Notification requests from production team

For the backend notification code (Notifier.py), please consider this request: "Is it possible to have the name of the book and a link to the landing page. For me, that would make these success messages useful and meaningful."
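As a rough illustration of the request, a success message could be built from the ebook number and title, with the landing page link following the https://www.gutenberg.org/ebooks/NNNN pattern. The function success_message is hypothetical and does not reflect how Notifier.py is actually written; the wording of the message is also only a placeholder.

```python
def success_message(ebook_id, title):
    """Build a success notification naming the book and linking its landing page.

    Hypothetical helper; the message wording is illustrative only.
    """
    url = f"https://www.gutenberg.org/ebooks/{ebook_id}"
    return f"Ebook #{ebook_id} processed successfully: {title}\n{url}"
```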

notification lists and messages

gutenbergtools/ebookmaker#158
gutenbergtools/ebookmaker#157
gutenbergtools/ebookmaker#156

Cover image not created from RST source

Sample: https://www.gutenberg.org/ebooks/1399

1/3/9/1399/1399-rst/images/cover.jpg is correctly inserted into the EPUB and MOBI "with images" files.

However, the cover formats are not created in cache/epub/1399, i.e. pg1399.cover.small.jpg and pg1399.cover.medium.jpg. And therefore, there is no cover image to display for the landing page. These supplemental files should be generated.
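A small check like the following could help find books affected by this problem. It only tests for the two file names mentioned above under cache/epub/NNNN and does not generate anything. The function name missing_cover_files and the cache_root argument are assumptions for illustration.

```python
from pathlib import Path

# The two cover sizes named in this issue.
COVER_SIZES = ("small", "medium")


def missing_cover_files(cache_root, ebook_id):
    """Return the expected cover paths that do not exist for one ebook.

    Hypothetical helper: cache_root is the directory that contains epub/.
    """
    book_dir = Path(cache_root) / "epub" / str(ebook_id)
    expected = [book_dir / f"pg{ebook_id}.cover.{size}.jpg" for size in COVER_SIZES]
    return [path for path in expected if not path.is_file()]
```

An empty result means both supplemental cover files are present for that book.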

rdf files won't rebuild unless source file is new

Because of this, the RDF file does not keep up with metadata updated in the cataloging interface.
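One possible fix is to let the rebuild decision consider the catalog as well as the source file. The sketch below is an assumption about how that decision could look, not the current ebookconverter logic. In particular, catalog_mtime assumes the database can report when a book's metadata was last edited, which may not be the case.

```python
def rdf_needs_rebuild(rdf_mtime, source_mtime, catalog_mtime):
    """Decide whether the RDF file is stale.

    All arguments are timestamps (e.g. seconds since the epoch).
    rdf_mtime is None when no RDF file exists yet.
    catalog_mtime is hypothetical: the last time the catalog record changed.
    """
    if rdf_mtime is None:
        return True
    # Rebuild when either the source file or the catalog record is newer.
    return max(source_mtime, catalog_mtime) > rdf_mtime
```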

ebook # for log doesn't get set for txt job if there's no txt source

marc subfields in RDF

Now that subtitles for new books are uniformly being handled using marc subfields in the database, we need to determine the best way to handle them in RDF. Also check OPDS?

duplicate production credits after update; lack of WW blocks further processing

It seems there are two credit entries in the catalog for this book:

Edit	Delete	508 - Creation / Production Credits Note		0	Eric Schmidt (This file was produced from images generously made available by The Internet Archive)
Edit	Delete	508 - Creation / Production Credits Note		0	Eric Schmidt, Wouter Franssen and the Online Distributed Proofreading Team at https://www.pgdp.net (This file was produced from images generously made available by The Internet Archive)

So now you have to tell me what you did, how it was submitted, etc., so I can fix the bug, if that's what it is.

Eric

On Mar 14, 2023, at 6:02 PM, Jacqueline Jeremy wrote:

Hi Eric

The files for Ebook #68957 were replaced--the original files had many
errors, and did not include many illustrations. The autogenerated
files appear to have the right content, but the credit line is not
correct. Can you change it to be the same as in the submitted files, i.e.

Produced by: Eric Schmidt, Wouter Franssen and the Online Distributed
Proofreading Team at https://www.pgdp.net (This file was produced from
images generously made available by The Internet Archive)

move backfile production credits and updates into database

Here's an outline of one possible semiautomated workflow for migrating structured credit and update data into the existing database:

1. extract existing data (from files and database) during a rebuild cycle
2. port data into a spreadsheet
3. munge it a bit
4. use volunteers (or students?) to verify munged data and deal with outliers
5. verify the results of step 4
6. port the verified data back into the database

Need to address various forms of credit/update sources.

credits and update from the database
credits in various forms placed after the *** START OF... marker
credits put in the Gutenberg metadata header before ebookconverter began to handle it, for example: gutenbergtools/ebookmaker#153
revised credits and updates

notifications only for failures

From @gbnewby

I thought this was raised earlier but didn't see an open issue for it.

The production team (whitewashers) consensus is they do not desire to get "Good news, PG whitewasher(s)" emails when a new title is successfully processed on ibiblio.

They only want to get an email if there is an error (not just a warning) during processing.
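A minimal sketch of the "errors only" rule, assuming the processing job reports through Python's standard logging module: a handler records the highest severity seen during a job, and an email is sent only if that reaches ERROR. The names MaxLevelHandler and should_notify are hypothetical and not taken from the existing notification code.

```python
import logging


class MaxLevelHandler(logging.Handler):
    """Remember the highest log level emitted while attached to a logger."""

    def __init__(self):
        super().__init__()
        self.max_level = logging.NOTSET

    def emit(self, record):
        self.max_level = max(self.max_level, record.levelno)


def should_notify(handler):
    """Notify only on errors or worse; warnings and successes stay silent."""
    return handler.max_level >= logging.ERROR
```

The handler would be attached to the job's logger before processing starts and checked once the job finishes.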

Thanks.