gutenbergtools / autocat3

CherryPy App that serves dynamic content for Project Gutenberg

License: GNU General Public License v3.0

Python 66.01% HTML 33.99%

autocat3's Issues

check if we can create a relative cover image

Currently, an absolute url is used for the cover image (reused in metadata)

hidden column / accessibility / validation

There is a hidden column in the download table. Our speculation is that at some point in the past it was added for accessibility reasons.

It would be a good idea to review the website's accessibility with Accessibility Developer Tools or other tools and make changes as necessary.

Deprecate non-generated content when generated content is available

This should happen after ebookconverter issue #38 is completed and in production.

Now that we have confidence in the generated files, I think it's safe to add a little logic to the landing pages:

When there is a generated format, don't list the "as submitted" format.

This mostly applies to plain text and HTML, and eventually to PDF once it is generated. Occasionally we have RST, RDF and other input formats that result in HTML or PDF; those input formats shouldn't be listed on the download page.

Basically, if there is cache/epub/.xxx then 1/2/3/.../.xxx should not be listed.

After this change, "More files..." will be the only place where the as-submitted files will be.
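The rule above could be sketched roughly as follows. This is only an illustration: `BookFile`, `filetype`, and `generated` are hypothetical names, not autocat3's actual data model, and the sketch assumes each file record already knows whether it came from the cache of generated files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookFile:
    """Hypothetical record for one downloadable file of an ebook."""
    url: str
    filetype: str    # e.g. "html", "txt", "pdf"
    generated: bool  # True if the file was produced by the conversion pipeline


def files_to_list(files):
    """Drop an as-submitted file when a generated file of the same type exists."""
    generated_types = {f.filetype for f in files if f.generated}
    return [f for f in files if f.generated or f.filetype not in generated_types]
```

Input-only formats such as RST or RDF are not covered by this same-type check and would need an extra rule, for example a set of file types that are never listed once any generated file exists.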

Subtitle truncation not quite right

The production team reported that truncation on the landing page doesn't seem quite right.

See: https://gutenberg.org/ebooks/71695

Truncation at the top of the landing page is "a", which seems incorrect.
The full subtitle is part of the bibrec & database. This is correct.
Truncation within the HTML & text is "China." This seems better than what's at the top of the landing page.

More breadcrumb complaints from Google Search Console

I'll paste in some screenshots of a new complaint from Google Search Console about breadcrumbs. There were already some recent improvements to breadcrumbs in autocat3, and there seems to be a little further work to do.

https://developers.google.com/search/docs/appearance/structured-data/breadcrumb

Here is the overview image from the report, then one image that shows one problem was fixed, and another image that shows a new problem was introduced.

Missing results in search

I verified this report, which arrived in the Project Gutenberg inbox:

https://www.gutenberg.org/ebooks
Quick Search for "Happened Otherwise"
it finds the book "It Might Have Happened Otherwise".

then
https://www.gutenberg.org/ebooks
search and browse
advanced search
title = "Happened Otherwise"
Search
0 results.

experimenting more:
a single word in the title box "Spell"
found 58 entries but not the book "Learning to Spell"

two words in the title box "to Spell"
found 27 entries, none of them having the string "to Spell"

"Learning to Spell" in the title box: no results.

Quick Search "Learning to Spell" works.

schema.org breadcrumbs markup

Breadcrumbs structured data issues detected in gutenberg.org

To the owner of gutenberg.org:

Search Console has identified that your site is affected by 2 Breadcrumbs structured data issue(s). The following issues were found on your site.

Top critical issues*

Either "name" or "item.name" should be specified (in "itemListElement")

Missing field "position" (in "itemListElement")

*Critical issues prevent your page or feature from appearing in Search results.

We recommend that you fix these issues when possible to enable the best experience and coverage in Google Search.
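For reference, a minimal sketch of breadcrumb data that carries both of the fields named in the report ("name" and "position"). It emits schema.org `BreadcrumbList` as JSON-LD, which is one of the formats Google accepts; the autocat3 templates may well use inline markup instead, and the function names here are hypothetical.

```python
import json


def breadcrumb_list(crumbs):
    """Build a schema.org BreadcrumbList from (name, url) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {
                "@type": "ListItem",
                "position": position,  # 1-based position in the trail
                "name": name,
                "item": url,
            }
            for position, (name, url) in enumerate(crumbs, start=1)
        ],
    }


def breadcrumb_script(crumbs):
    """Serialize the list as a JSON-LD script element for a page template."""
    payload = json.dumps(breadcrumb_list(crumbs), indent=2)
    return f'<script type="application/ld+json">\n{payload}\n</script>'
```

Crumb names that could contain "</" would need escaping before being embedded in a page this way.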

data-vocabulary.org is deprecated

I spent some time with the Google Search Console and it's complaining that data-vocabulary.org is deprecated.

This is used for breadcrumb annotation in autocat3. It's not used for breadcrumbs elsewhere in www.gutenberg.org that I saw.

Here is a page that describes how to transition to schema.org which is the current approach Google recommends: https://magefan.com/blog/data-vocabularyorg-schema-is-deprecated-error-fix-solution

Thanks for taking a look at removing this deprecated schema from autocat3.

add tests

We need to rip out a bunch of code, so the first thing we need to do is write some basic tests to make sure we don't totally screw up.

create a simple test that verifies app instantiation with a known environment
create a travis test script that builds an environment and runs tests

Landing pages and search results should display Title, not Uniform Title

There is a little logic in a few places in autocat3 that displays the uniform title rather than the title, when a uniform title exists. I see this in templates/bibrec.html and AdvSearchPage.py.

This is frequently reported as a problem/anomaly by our users (and cataloger). The situation is that a title search will yield results in a different language, and therefore no visibly matching title words.

Another situation is when an author landing page for an English book displays the title in a language other than English. For example, at https://www.gutenberg.org/ebooks/author/85 you can see a listing for "Quatrevingt-treize. English." This is using the uniform title (MARC field 240, I think). But then the landing page correctly uses the title field (245, I think): https://www.gutenberg.org/ebooks/49372

In short, I concur with our cataloger that we should always display titles, not uniform titles. Titles are more "correct" for the actual book contents. Uniform titles might be good to display as a field in the bibrec section of a landing page, but are not appropriate for search results.

I note that the specific author landing page for Victor Hugo does not come directly from autocat3 (it's a nightly cron job). But here is a quick search yielding the exact same behavior: https://www.gutenberg.org/ebooks/search/?query=a.victor+hugo&submit_search=Go%21 (the nightly cron job leverages the same logic; I can help track that down, if needed, but fixing it in autocat3 might also fix the cron job).

Update OPDS

Per an email exchange between Eric and Greg, we would like to update OPDS to version 2.0.

Our current OPDS is version 0.9 and is not necessarily working properly.

This will yield the IA/OpenLibrary API, which is stable and has Python wrappers.

The goal is for OPDS to serve as the main public-facing API offered by Project Gutenberg.

bibrec tab link to browse (static) pages, not search (autocat3) pages

Would you please update the app to change bibrec pages to link to the /browse tree?

This is for the gutenberg1 branch, currently enacted at dev.gutenberg.org

This should be done for authors (and other creator roles).

For example, see the bibrec here: https://dev.gutenberg.org/ebooks/44125

The author links to: https://dev.gutenberg.org/ebooks/author/42603

And this should instead go to: https://dev.gutenberg.org/browse/authors/p#a42603 (the "p#" is the first letter of the author's last name).
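A minimal sketch of building the requested link, assuming the browse page is keyed by the lowercased first letter of the author's last name and the anchor is "a" followed by the author id, as in the example above. `author_browse_url` is a hypothetical helper, not an existing autocat3 function.

```python
def author_browse_url(author_id, last_name, base="https://dev.gutenberg.org"):
    """Return the static /browse URL for an author.

    Assumes the page letter is the lowercased first character of the last
    name and that the in-page anchor is "a" followed by the author id.
    """
    letter = last_name.strip()[0].lower()
    return f"{base}/browse/authors/{letter}#a{author_id}"
```

For author 42603 with a last name starting with "P", this gives https://dev.gutenberg.org/browse/authors/p#a42603. Empty names, names starting with an accented letter, and the other creator roles would need their own handling.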

OPDS functionality seems to have stopped

We have a few hooks in gutenbergsite for OPDS, and it's evident in the autocat3 code.

Previously we advertised http://m.gutenberg.org/ebooks/?format=opds but https://www.gutenberg.org/ebooks/?format=opds should work as well.

Does it look like this is recoverable? We have a small but doughty fan base that sends messages to the PG helpdesk inbox when OPDS is unavailable. If we add this, I'd like to get the recommended link for the "offline catalogs" page. If it's permanently lost, then I could put a note there as well.

Ordering of creators

We have heard from submitters, and confirmed, that the order of authors and other creators is not preserved in the display of books at www.gutenberg.org.

It seems that order is not part of the catalog database, for the various roles. Here are the roles:

gutenberg=> SELECT role, COUNT(DISTINCT author) AS unique_authors
FROM v_books
WHERE fk_books BETWEEN 60000 AND 69999
GROUP BY role;

I confirmed that the JSON file that is transmitted by dopush puts creators in the same order as submitted - i.e., the submission database for clearances retains order.

It seems the need is to update the catalog database to track the order of creators. Let's figure out how to approach this. I'm flagging this issue for autocat3 since that's where the landing page display happens. The input of the JSON to the catalog database is another component, and ebookmaker also consumes this to place the metadata in the headers of generated files.
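One possible approach, sketched under loud assumptions: record each creator's index from the submitted JSON when loading the catalog, then sort on it for display. The `position` key and the dictionary shape are hypothetical; the real JSON layout and catalog schema are not shown in this issue.

```python
def with_positions(creators):
    """Attach a 1-based position to each creator, in submitted order."""
    return [dict(creator, position=i) for i, creator in enumerate(creators, start=1)]


def in_display_order(rows):
    """Sort rows read back from the catalog by their stored position."""
    return sorted(rows, key=lambda row: row["position"])
```

The stored position would then be available both to the landing page and to ebookmaker when it writes metadata into generated file headers.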

Quell duplicates when images & no images are the same

In an eBook landing page, such as https://www.gutenberg.org/ebooks/60225, there are two rows each with links for Kindle and EPUB: One for "images" and one for "no images."

However, if there are no images in the source (HTML, text, or RST), then the two linked files are identical. In this case, there should only be ONE row with a link for Kindle, and another with a link for EPUB. This is what I'd like to see.
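A minimal sketch of the check, assuming the "images" and "no images" files really are byte-identical when the source has no images. That is an assumption: if the generated archives embed timestamps or other metadata, a content-level comparison would be needed instead. The function names are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def rows_to_show(images_path, noimages_path):
    """Return one path if both files are byte-identical, otherwise both."""
    if file_digest(images_path) == file_digest(noimages_path):
        return [images_path]
    return [images_path, noimages_path]
```

Hashing on every page request would be wasteful; in practice the comparison would more likely run once when the files are generated, with the result stored alongside the file records.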

Future considerations: Stopping creation of duplicates, in ebookmaker, will likely also "fix" the extra row. And, someday we hope to have page covers generated for every eBook, and then there will always be an "images" version (and we can decide then whether to still make a "no images" when the only difference is a single JPG).

Remove "updated" entries from database and instead populate generated files with update dates from file timestamps

Predecessor discussion & background is here: gutenbergtools/libgutenberg#36 (comment)

My goals are:

1. Continue to include formatted "Most recently updated: ..." fields in generated files' headers, just like this: https://github.com/gutenbergtools/libgutenberg/assets/926513/464b6c07-2d31-4cae-a5f0-e58ee7c20151
2. Stop storing "Updated: Month Day, Year" or similar data in the 508 field in the catalog database. All those entries will be removed, so that only actual credit lines are stored in the catalog database 508 field. (We will permanently store those among the cache/feeds/ location, so people interested in that revision history can have it.)
3. Include the update date on the landing page (via autocat3), similarly to how it appears now on landing pages such as https://www.gutenberg.org/ebooks/10000, but not with the "Credit" table label. Perhaps simply, "Updated" in the left-hand column, and a date like "January 2, 2024" in the right-hand column.

Based on some discussion with Eric, my understanding is that code already exists to perform the first goal by looking at the file timestamps. Basically, files that have been updated some time after initial posting (say, at least 14 days - TBC) should have a "^tMost recently updated: Month Day, Year" field in the header. Here ^t is a tab or similar indentation under the "Release date:" line, and "Month Day, Year" is something like "January 2, 2024".

There should be at most one "Most recently updated: ..." line in the header. We don't need to track previous revisions; those are handled via the "old/" subdirectory or in the per-book GitHub repository. If/when the source file (HTML or text, usually, in the 1/2/3 filesystem) is updated, then the file timestamp is updated and a new date will be put into the "Most recently updated:" header the next time generated files are rebuilt.
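The timestamp rule described above might look roughly like this. It is a sketch, not the existing code mentioned in the discussion: `updated_header` is a hypothetical name, the 14-day threshold is the unconfirmed figure from this issue, and a tab is assumed for the indentation.

```python
from datetime import date

MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)
MIN_DAYS = 14  # threshold from the discussion above, still to be confirmed


def updated_header(release, modified, min_days=MIN_DAYS):
    """Return the header line, or None if the update is too close to release."""
    if (modified - release).days < min_days:
        return None
    stamp = f"{MONTHS[modified.month - 1]} {modified.day}, {modified.year}"
    return f"\tMost recently updated: {stamp}"
```

The `modified` date could be taken from the source file with `date.fromtimestamp(os.path.getmtime(path))`. Because the function returns a single line or nothing, the header never accumulates more than one update entry.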

Incomplete search results on advanced search screen

Based on some email correspondence & clarification, the root cause may be an incomplete deletion or other inconsistency or problem in the catalog database.

To demonstrate the problem: There are three books by the same author. The most recent two were posted at the end of January.

The three books are correctly linked to the author:
https://www.gutenberg.org/ebooks/author/56829

Here they are:
https://www.gutenberg.org/ebooks/54449
https://www.gutenberg.org/ebooks/72833
https://www.gutenberg.org/ebooks/72841 <-- doesn't show up in advanced search

In the Advanced Search pane on https://www.gutenberg.org/ebooks, enter:

Author: topelius
Title: vanhoja

...and the message is "3 books found" but only the first two are listed.

This doesn't seem to be a general problem impacting all advanced search results. I tried author=shakespeare and title=nothing, and was correctly presented with 11 books (including one audiobook).