gutenbergtools / autocat3
CherryPy App that serves dynamic content for Project Gutenberg
License: GNU General Public License v3.0
Currently, an absolute URL is used for the cover image (reused in metadata).
There is a hidden column in the download table. We speculate that it was added at some point in the past for accessibility reasons.
It would be a good idea to review the website's accessibility with Accessibility Developer Tools or other tools and make changes as necessary.
This should happen after ebookconverter issue #38 is completed and in production.
Now that we have confidence in the generated files, I think it's safe to add a little logic to the landing pages:
When there is a generated format, don't list the "as submitted" format.
This mostly applies to plain text and HTML and, eventually, to PDF once it is generated. Occasionally we have RST, RDF and other input formats that result in HTML or PDF - those input formats shouldn't be listed on the download page.
Basically, if there is cache/epub/.xxx then 1/2/3/.../.xxx should not be listed.
After this change, "More files..." will be the only place where the as-submitted files will be.
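A minimal sketch of the extension rule above, in Python. The `File` record, its `generated` flag, and the example paths are assumptions made for illustration; they are not autocat3's actual data model. The sketch only covers the "same extension exists in cache/epub" case, not input formats such as RST or RDF that produce a different output format.

```python
from dataclasses import dataclass
import posixpath


@dataclass(frozen=True)
class File:
    path: str        # hypothetical: path relative to the files root
    generated: bool  # hypothetical: True for files under cache/epub/


def visible_files(files):
    """Drop as-submitted files whose extension also exists as a generated file."""
    generated_exts = {
        posixpath.splitext(f.path)[1].lower() for f in files if f.generated
    }
    return [
        f for f in files
        if f.generated
        or posixpath.splitext(f.path)[1].lower() not in generated_exts
    ]


# Made-up paths, for illustration only.
files = [
    File("cache/epub/1/pg1.txt", generated=True),
    File("1/1.txt", generated=False),
    File("1/1.rst", generated=False),
]
for f in visible_files(files):
    print(f.path)
```

With this rule the as-submitted `.txt` is hidden because a generated `.txt` exists, while the `.rst` is still listed until a separate rule handles input-only formats.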
The production team reported that truncation on the landing page doesn't seem quite right.
See: https://gutenberg.org/ebooks/71695
Truncation at the top of the landing page is "a", which seems incorrect.
The full subtitle is part of the bibrec & database. This is correct.
Truncation within the HTML & text is "China." This seems better than what's at the top of the landing page.
I'll paste in some screenshots of a new complaint from Google Search Console about breadcrumbs. There were already some recent improvements to breadcrumbs in autocat3, and there seems to be a little further work to do.
https://developers.google.com/search/docs/appearance/structured-data/breadcrumb
Here is the overview image from the report, then one image that shows one problem was fixed, and another image that shows a new problem was introduced.
I verified this report, which arrived in the Project Gutenberg inbox:
https://www.gutenberg.org/ebooks
Quick Search for "Happened Otherwise"
it finds the book "It Might Have Happened Otherwise".
then
https://www.gutenberg.org/ebooks
search and browse
advanced search
title = "Happened Otherwise"
Search
0 results.
experimenting more:
a single word in the title box "Spell"
found 58 entries but not the book "Learning to Spell"
two words in the title box "to Spell"
found 27 entries, none of them having the string "to Spell"
"Learning to Spell" in the title box: no results.
Quick Search "Learning to Spell" works.
Breadcrumbs structured data issues detected in gutenberg.org
To the owner of gutenberg.org:
Search Console has identified that your site is affected by 2 Breadcrumbs structured data issue(s). The following issues were found on your site.
Top critical issues*
Either "name" or "item.name" should be specified (in "itemListElement")
Missing field "position" (in "itemListElement")
*Critical issues prevent your page or feature from appearing in Search results.
We recommend that you fix these issues when possible to enable the best experience and coverage in Google Search.
I spent some time with the Google Search Console, and it's complaining that data-vocabulary.org is deprecated.
This is used for breadcrumb annotation in autocat3. It's not used for breadcrumbs elsewhere in www.gutenberg.org that I saw.
Here is a page that describes how to transition to schema.org which is the current approach Google recommends: https://magefan.com/blog/data-vocabularyorg-schema-is-deprecated-error-fix-solution
Thanks for taking a look at removing this deprecated schema from autocat3.
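As a rough illustration of what schema.org breadcrumb markup looks like, here is a Python sketch that builds a `BreadcrumbList` as JSON-LD, with the `name` and `position` fields that Search Console reports as missing. The function name and the two example crumbs are assumptions; autocat3 may prefer to emit the equivalent microdata or RDFa attributes directly in its templates.

```python
import json


def breadcrumb_jsonld(crumbs):
    """Build a schema.org BreadcrumbList from (name, url) pairs."""
    return {
        "@context": "https://schema.org",
        "@type": "BreadcrumbList",
        "itemListElement": [
            {
                "@type": "ListItem",
                "position": position,  # 1-based position in the trail
                "name": name,          # human-readable label for the crumb
                "item": url,
            }
            for position, (name, url) in enumerate(crumbs, start=1)
        ],
    }


# Hypothetical crumbs, for illustration only.
crumbs = [
    ("Project Gutenberg", "https://www.gutenberg.org/"),
    ("Books", "https://www.gutenberg.org/ebooks/"),
]
print(json.dumps(breadcrumb_jsonld(crumbs), indent=2))
```

The resulting JSON would be embedded in the page inside a `<script type="application/ld+json">` element.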
We need to rip out a bunch of code, so the first thing we need to do is write some basic tests to make sure we don't totally screw up.
There is a little logic in a few places in autocat3 that displays the uniform title rather than the title, when a uniform title exists. I see this in templates/bibrec.html and AdvSearchPage.py.
This is frequently reported as a problem/anomaly by our users (and cataloger). The situation is that a title search will yield results in a different language, and therefore no visibly matching title words.
Another situation is when an author landing page for an English book displays the title in a language other than English. For example, https://www.gutenberg.org/ebooks/author/85 .. you can see a listing for "Quatrevingt-treize. English." This is using the uniform title (MARC field 240). But then the landing page correctly uses the title field (MARC field 245): https://www.gutenberg.org/ebooks/49372
In short, I concur with our cataloger that we should always display titles, not uniform titles. Titles are more "correct" for the actual book contents. Uniform titles might be good to display as a field in the bibrec section of a landing page, but are not appropriate for search results.
I note that the specific author landing page for Victor Hugo does not come directly from autocat3 (it's a nightly cron job). But here is a quick search yielding the exact same behavior: https://www.gutenberg.org/ebooks/search/?query=a.victor+hugo&submit_search=Go%21 .. the nightly cron job leverages the same logic (I can help track that down, if needed, but fixing in autocat3 might also fix the cron job).
Per an email exchange between Eric and Greg, we would like to update OPDS to version 2.0.
Our current OPDS is version 0.9 and is not necessarily working properly.
This will yield the IA/OpenLibrary API, which is stable and has Python wrappers.
The goal is for OPDS to serve as the main public-facing API offered by Project Gutenberg.
Would you please update the app to change bibrec pages to link to the /browse tree?
This is for the gutenberg1 branch, currently enacted at dev.gutenberg.org
This should be done for authors (and other creator roles).
For example, see the bibrec here: https://dev.gutenberg.org/ebooks/44125
The author links to: https://dev.gutenberg.org/ebooks/author/42603
And this should instead go to: https://dev.gutenberg.org/browse/authors/p#a42603 (the "p#" is the first letter of the author's last name).
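A small Python sketch of how such a link could be built. The function name, the `sort_name` argument ("Last, First"), and the `other` fallback for names that do not start with an ASCII letter are assumptions, not existing autocat3 code.

```python
def author_browse_url(author_id, sort_name, base="https://dev.gutenberg.org"):
    """Link an author to their anchor in the /browse/authors tree."""
    # First ASCII letter of the sort name ("Last, First"), lowercased.
    # Assumption: names with no ASCII letter go to a catch-all "other" page.
    first = next(
        (c for c in sort_name.lower() if c.isascii() and c.isalpha()),
        "other",
    )
    return f"{base}/browse/authors/{first}#a{author_id}"


# "Penname, Example" is a made-up sort name for author 42603.
print(author_browse_url(42603, "Penname, Example"))
```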
We have a few hooks in gutenbergsite for OPDS, and it's evident in the autocat3 code.
Previously we advertised http://m.gutenberg.org/ebooks/?format=opds but https://www.gutenberg.org/ebooks/?format=opds should work as well.
Does it look like this is recoverable? We have a small but doughty fan base that sends messages to the PG helpdesk inbox when OPDS is unavailable. If we add this, I'd like to get the recommended link for the "offline catalogs" page. If it's permanently lost, then I could put a note there as well.
We have heard via submitters, and confirmed, that the order of authors and other creators is not preserved in the display of books at www.gutenberg.org
It seems that order is not part of the catalog database, for the various roles. Here are the roles:
gutenberg=> SELECT role, COUNT(DISTINCT author) AS unique_authors
FROM v_books
WHERE fk_books BETWEEN 60000 AND 69999
GROUP BY role;
I confirmed that the JSON file that is transmitted by dopush puts creators in the same order as submitted - i.e., the submission database for clearances retains order.
It seems the need is to update the catalog database to track the order of creators. Let's figure out how to approach this. I'm flagging this issue for autocat3 since that's where the landing page display happens. The input of the JSON to the catalog database is another component, and ebookmaker also consumes this to place the metadata in the headers of generated files.
In an eBook landing page, such as https://www.gutenberg.org/ebooks/60225, there are two rows each with links for Kindle and EPUB: One for "images" and one for "no images."
However, if there are no images in the source (HTML, text, or RST), then the two linked files are identical. In this case, there should only be ONE row with a link for Kindle, and another with a link for EPUB. This is what I'd like to see.
Future considerations: Stopping creation of duplicates, in ebookmaker, will likely also "fix" the extra row. And, someday we hope to have page covers generated for every eBook, and then there will always be an "images" version (and we can decide then whether to still make a "no images" when the only difference is a single JPG).
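One way to collapse the duplicate rows, sketched in Python under the assumption that the "images" and "no images" files are byte-for-byte identical when the source has no images (if the builds differ only in archive timestamps, a content hash will not match and another comparison is needed). The `(label, path)` row shape and the function names are hypothetical.

```python
import hashlib


def file_digest(path):
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def dedupe_rows(rows):
    """rows: list of (label, path). Keep the first row for each distinct content."""
    seen = set()
    kept = []
    for label, path in rows:
        digest = file_digest(path)
        if digest not in seen:
            seen.add(digest)
            kept.append((label, path))
    return kept
```

Because the Kindle and EPUB files differ from each other, this leaves one row per format whenever the two variants of a format are identical.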
Predecessor discussion & background is here: gutenbergtools/libgutenberg#36 (comment)
My goals are:
Based on some discussion with Eric, my understanding is that code already exists to perform #1 by looking at the file timestamps. Basically, files that have been updated some time after initial posting (say, at least 14 days - TBC) should have a "^tMost recently updated: Month Day, Year" field in the header. Where ^t is a tab or similar indentation under the "Release date:" and "Month Day, Year" is something like "January 2, 2024".
There should only be zero or one "Most recently updated: ..." lines in the header. We don't need to track previous revisions - those are handled via the "old/" subdirectory or in the per-book github repository. If/when the source file (HTML or text, usually, in the 1/2/3 filesystem) is updated, then the file timestamp is updated and a new date will be put into the "Most recently updated:" header the next time generated files are rebuilt.
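The rule could look roughly like this in Python. This is a sketch, not the existing timestamp code mentioned above: the function name and arguments are hypothetical, the 14-day threshold is the still-to-be-confirmed value from the discussion, and the month name assumes an English locale.

```python
from datetime import datetime, timedelta

MIN_GAP = timedelta(days=14)  # threshold from the discussion, to be confirmed


def updated_header_line(release_date, source_mtime, min_gap=MIN_GAP):
    """Return the single header line, or None if the update came too soon."""
    if source_mtime - release_date < min_gap:
        return None
    # Build "January 2, 2024" by hand, since %d would zero-pad the day.
    stamp = f"{source_mtime:%B} {source_mtime.day}, {source_mtime.year}"
    return f"\tMost recently updated: {stamp}"


line = updated_header_line(datetime(2023, 12, 1), datetime(2024, 1, 2))
print(repr(line))
```

Since the function returns at most one line and always derives it from the current file timestamp, rebuilding the generated files replaces any earlier date rather than accumulating a history.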
Based on some email correspondence & clarification, the root cause may be an incomplete deletion or other inconsistency or problem in the catalog database.
To demonstrate the problem: There are three books by the same author. The most recent two were posted at the end of January.
The three books are correctly linked to the author:
https://www.gutenberg.org/ebooks/author/56829
Here they are:
https://www.gutenberg.org/ebooks/54449
https://www.gutenberg.org/ebooks/72833
https://www.gutenberg.org/ebooks/72841 <-- doesn't show up in advanced search
In the Advanced Search pane on https://www.gutenberg.org/ebooks, enter:
Author: topelius
Title: vanhoja
...and the message is "3 books found" but only the first two are listed.
This doesn't seem to be a general problem impacting all advanced search results. I tried author=shakespeare and title=nothing, and was correctly presented with 11 books (including one audiobook).