ebookconverter's People

Contributors

eshellman

ebookconverter's Issues

Create -h.zip after adding headers/footers

I'm not sure if this is for ebookconverter or elsewhere in the processing chain.

For HTML and plain text, ebookmaker adds the header+metadata and footer to posted books.

For HTML, I would like to stop pushing *-h.zip and instead have it created after the header+metadata and footer are added. The *-h.zip can then go in cache/epub/xxx rather than in 1/2/3/...

The *-h.zip is a very useful format since it allows download of the HTML file plus all assets. But I'd like it to have the correct up-to-date metadata from the catalog whenever the other generated formats are built.
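
A minimal sketch of how the zip could be built once the header and footer are already in place. The function name `build_h_zip`, the `<id>-h.zip` file name, and the `<id>-h/` folder inside the archive are assumptions for illustration, not existing ebookconverter code.

```python
import zipfile
from pathlib import Path


def build_h_zip(html_dir: Path, ebook_id: int, out_dir: Path) -> Path:
    """Zip the already-processed HTML file and its assets into <id>-h.zip.

    html_dir is assumed to hold the HTML with header/footer added, plus
    assets such as images/. out_dir should not be inside html_dir.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    zip_path = out_dir / f"{ebook_id}-h.zip"
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(html_dir.rglob("*")):
            if path.is_file():
                # Store members under "<id>-h/..." so the archive unpacks
                # into a single folder (an assumed layout).
                arcname = Path(f"{ebook_id}-h") / path.relative_to(html_dir)
                archive.write(path, arcname.as_posix())
    return zip_path
```

Running this as the last step of each rebuild would keep the zip in sync with the catalog metadata used for the other generated formats.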

RDF file needs dcterms:type

Per an email exchange on October 26-28, it seems that the older code that emitted dcterms:type has been replaced, and some functionality was lost (see below).

The desire is for the XML/RDF file to have values that indicate all of the main categories like text, plus the special categories listed under Browsing Options at https://www.gutenberg.org/ebooks: Audio Book, computer-generated; Audio Book, human-read; Compilations; Data; Music, recorded; Music, Sheet; Other recordings; Pictures, moving; Pictures, still.

Email explaining the issue:

Until a month ago, the rdf for these (non-book) files would have had the info
you're looking for in an attribute called dcterms:type.

Before last month, non-book RDF files were those generated by a script that last
ran more than 10 years ago.

A month ago we started re-generating RDF for non-book items in PG with the script
that has been making rdf for book items for the past 10 years, which for unknown
reasons does not emit a dcterms:type attribute.

It should not be a lot of work for us to add this attribute, but it is not likely
to be done until the later part of November, or later depending on resource
availability.

In the meantime, the information you really want is in this file (along with some
info that will never be in rdf files):

https://github.com/gitenberg-dev/gitberg/blob/master/gitenberg/data/missing.tsv

I advise using an RDF parser to read the rdf files; when we update third party
modules in the generating script, I would expect them to reproduce the RDF graph but
not necessarily its xml representation.

Possible timing issue with social media postings

Per some emails on December 9-10, there is an issue where Facebook postings to @gutenberg_new do not have a preview image.

I wonder whether this might be a timing issue, where the posting is sent before the landing page is fully ready.

It would be good to investigate this possibility and, if needed, add a delay before the social media postings.
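
Rather than a fixed delay, one option is to poll the landing page until a preview image is advertised. The sketch below assumes the preview comes from an Open Graph `og:image` meta tag, which is what Facebook reads; whether the landing page emits that tag, and the `wait_for_preview` helper itself, are assumptions. The page fetch is passed in as a callable so the posting code can supply its own HTTP client.

```python
import re
import time
from typing import Callable, Optional

# Matches a <meta ... property="og:image" ...> tag (assumed preview source).
OG_IMAGE = re.compile(r'<meta[^>]+property=["\']og:image["\'][^>]*>', re.IGNORECASE)


def wait_for_preview(
    fetch: Callable[[], Optional[str]],
    attempts: int = 5,
    delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True once the fetched page contains an og:image tag.

    fetch() returns the landing page HTML, or None if it is not available yet.
    Gives up and returns False after the given number of attempts.
    """
    for attempt in range(attempts):
        html = fetch()
        if html and OG_IMAGE.search(html):
            return True
        if attempt < attempts - 1:
            sleep(delay)  # wait before checking the page again
    return False
```

The posting step would then run only when this returns True, and could log a warning otherwise.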

Blank ebook pages on gutenberg.org

I've encountered 4 missing ebooks from the gutenberg website: 38200, 57983, 64156, 65643. These pages contain no data, and only have a link to the RDF file. I discovered these because the RDF files also have blank data.

In the July 2021 newsletter, 65643 is listed as, "The lives of celebrated travellers, Vol. 2, by James Augustus St. John"

Related, though I guess not an eText release: ID 90907 exists in the rdf-files.tar.zip offline archive and also contains blank data.

P.S. I wasn't sure where to raise this issue so please point me to the correct place if needed.

Notification requests from production team

For the backend notification code (Notifier.py), please consider this request: "Is it possible to have the name of the book and a link to the landing page. For me, that would make these success messages useful and meaningful."
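
A small sketch of what such a message could look like. The function name `success_message` and the wording are hypothetical and not taken from Notifier.py; the landing page URL follows the https://www.gutenberg.org/ebooks/<id> pattern used on the site.

```python
LANDING_PAGE = "https://www.gutenberg.org/ebooks/{ebook_id}"


def success_message(ebook_id: int, title: str) -> str:
    """Build a success notification that names the book and links to it."""
    url = LANDING_PAGE.format(ebook_id=ebook_id)
    return (
        f'Ebook #{ebook_id}, "{title}", was processed successfully.\n'
        f"Landing page: {url}"
    )
```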

Cover image not created from RST source

Sample: https://www.gutenberg.org/ebooks/1399

1/3/9/1399/1399-rst/images/cover.jpg is correctly inserted into the EPUB and MOBI "with images" files.

However, the cover formats are not created in cache/epub/1399, i.e. pg1399.cover.small.jpg and pg1399.cover.medium.jpg. And therefore, there is no cover image to display for the landing page. These supplemental files should be generated.
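
To find other books with the same problem, a check along these lines could list the cover files that are missing from the cache. The helper names and the `cache_root` argument are assumptions; the file names follow the pg1399.cover.small.jpg pattern above.

```python
from pathlib import Path
from typing import List

COVER_SIZES = ("small", "medium")


def expected_covers(cache_root: Path, ebook_id: int) -> List[Path]:
    """Return the cover paths expected under cache/epub/<id>/."""
    book_dir = cache_root / "epub" / str(ebook_id)
    return [book_dir / f"pg{ebook_id}.cover.{size}.jpg" for size in COVER_SIZES]


def missing_covers(cache_root: Path, ebook_id: int) -> List[Path]:
    """Return the expected cover files that do not exist on disk."""
    return [p for p in expected_covers(cache_root, ebook_id) if not p.is_file()]
```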

marc subfields in RDF

Now that subtitles for new books are uniformly being handled using marc subfields in the database, we need to determine the best way to handle them in RDF. Also check OPDS?

duplicate production credits after update; lack of WW blocks further processing

It seems there are two credit entries in the catalog for this book:

508 - Creation / Production Credits Note   0 Eric Schmidt (This file was produced from images generously made available by The Internet Archive)
508 - Creation / Production Credits Note   0 Eric Schmidt, Wouter Franssen and the Online Distributed Proofreading Team at https://www.pgdp.net (This file was produced from images generously made available by The Internet Archive)

So now you have to tell me what you did, how it was submitted, etc., so I can fix the bug, if that's what it is.

Eric

On Mar 14, 2023, at 6:02 PM, Jacqueline Jeremy wrote:

Hi Eric

The files for Ebook #68957 were replaced--the original files had many
errors, and did not include many illustrations. The autogenerated
files appear to have the right content, but the credit line is not
correct. Can you change it to be the same as in the submitted files, i.e.

Produced by: Eric Schmidt, Wouter Franssen and the Online Distributed
Proofreading Team at https://www.pgdp.net (This file was produced from
images generously made available by The Internet Archive)

move backfile production credits and updates into database

Here's an outline of one possible semiautomated workflow for migrating structured credit and update data into the existing database:

  1. extract existing data (from files and database) during a rebuild cycle
  2. port data into a spreadsheet.
  3. munge it a bit
  4. use volunteers (or students?) to verify munged data and deal with outliers
  5. verify the results of step 4
  6. port the verified data back into the database

Need to address various forms of credit/update sources.

  1. credits and updates from the database
  2. credits in various forms placed after the *** START OF... marker
  3. credits put in the Gutenberg metadata header before ebookconverter began to handle it, for example: gutenbergtools/ebookmaker#153
  4. revised credits and updates
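
As a rough illustration of the extraction step for the second source above, the sketch below looks for a "Produced by" paragraph shortly after the START OF marker. The helper name, the 2000-character window, and the assumption that the credit begins with "Produced by" are all hypothetical; real files use many other forms, which is why the manual verification steps are still needed.

```python
import re
from typing import Optional

# Only the visible prefix of the marker is matched; the rest of the line varies.
START_MARKER = re.compile(r"^\*\*\* START OF", re.MULTILINE)
# A credit paragraph: "Produced by" up to the next blank line (assumed form).
CREDIT = re.compile(r"^Produced by:?\s+(.+?)(?:\n\s*\n|\Z)", re.MULTILINE | re.DOTALL)


def credit_after_start(text: str, window: int = 2000) -> Optional[str]:
    """Return the first 'Produced by' credit found soon after the marker."""
    marker = START_MARKER.search(text)
    if marker is None:
        return None
    match = CREDIT.search(text, marker.end(), marker.end() + window)
    if match is None:
        return None
    # Collapse line breaks and repeated spaces into single spaces.
    return " ".join(match.group(1).split())
```

Files where this returns None, or where the result disagrees with the database, would go to the outlier pile for volunteers.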

notifications only for failures

From @gbnewby

I thought this was raised earlier but didn't see an open issue for it.

The production team (whitewashers) consensus is they do not desire to get "Good news, PG whitewasher(s)" emails when a new title is successfully processed on ibiblio.

They only want to get an email if there is an error (not just a warning) during processing.

Thanks.
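
One way to get this behaviour with the standard `logging` module is a handler that only collects records at ERROR level or above, so warnings never trigger an email. The class and function names below are hypothetical and do not describe how Notifier.py works today.

```python
import logging
from typing import List, Optional


class ErrorCollector(logging.Handler):
    """Collect only records at ERROR level or above; warnings are ignored."""

    def __init__(self) -> None:
        super().__init__(level=logging.ERROR)
        self.records: List[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


def failure_email_body(collector: ErrorCollector) -> Optional[str]:
    """Return the email text, or None when no email should be sent."""
    if not collector.records:
        return None
    lines = [f"{r.levelname}: {r.getMessage()}" for r in collector.records]
    return "Errors during processing:\n" + "\n".join(lines)
```

The collector would be attached to the processing logger for one book, and the email sent only when `failure_email_body` returns text.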
