
Gutenberg, dammit

By Allison Parrish

Gutenberg, dammit is a corpus of every plaintext file in Project Gutenberg (up until June 2016), organized in a consistent fashion, with (mostly?) consistent metadata. The intended purpose of the corpus is to make it really easy to do creative things with this wonderful and amazing body of freely-available text.

Download the corpus here.

The name of the corpus was inspired by Leonard Richardson's Unicode, dammit.

The code in this repository relies on the data prepared by the GutenTag project (Brooke et al. 2015) and is partially based on the GutenTag source code.

NOTE: Not all of the works in Project Gutenberg are in the public domain. Check the Copyright Status field in the metadata for each work you plan on using to be sure. I believe that all of the files in the corpus are redistributable, but it might not be okay for you to "reuse" any works in the corpus that are not in the public domain.

Working with the corpus

The gutenbergdammit.ziputils module has some functions for working with the corpus file in situ using Python's zipfile library, so you don't even have to decompress the file and make a big mess on your hard drive. You can copy/paste these functions, use them as a reference in your own implementation, or use them directly by installing this package from the repo:

pip install https://github.com/aparrish/gutenberg-dammit/archive/master.zip

First, download the ZIP archive and put it in the same directory as your Python code. Then, to (e.g.) retrieve the text of one particular file from the corpus:

>>> from gutenbergdammit.ziputils import retrieve_one
>>> text = retrieve_one("gutenberg-dammit-files-v002.zip", "123/12345.txt")
>>> text[:50]
'[Illustration: "I saw there something missing from'

To retrieve the metadata file:

>>> from gutenbergdammit.ziputils import loadmetadata
>>> metadata = loadmetadata("gutenberg-dammit-files-v002.zip")
>>> metadata[456]['Title']
['Essays in the Art of Writing']

To search for and retrieve files whose metadata contains particular strings:

>>> from gutenbergdammit.ziputils import searchandretrieve
>>> for info, text in searchandretrieve("gutenberg-dammit-files-v002.zip", {'Title': 'Made Easy'}):
...     print(info['Title'][0], len(text))
...
Entertaining Made Easy 108314
Reading Made Easy for Foreigners - Third Reader 209964
The Art of Cookery Made Easy and Refined 262990
Shaving Made Easy	What the Man Who Shaves Ought to Know 44982
Writing and Drawing Made Easy, Amusing and Instructive	Containing The Whole Alphabet in all the Characters now	us'd, Both in Printing and Penmanship 10036
Etiquette Made Easy 119770

Details

The corpus is arranged as multiple subdirectories, each named with the first three digits of the zero-padded number identifying a Gutenberg book. The plain text file for each book whose padded ID begins with those digits lives in that directory. For example, the book with Gutenberg ID 12345 has the relative path 123/12345.txt. This path fragment is present in the metadata for each file as the gd-path attribute; see below for more details. (Splitting up the files like this is a compromise: each file stays easy to access programmatically, while life is a little easier if you're poking around with your file browser or ls.)
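
If you ever need to compute that path from a bare Gutenberg ID, it's just the zero-padded number sliced up. A minimal sketch (path_for_id is an illustrative helper; in practice you can read gd-path straight from the metadata):

def path_for_id(gutenberg_id):
    # zero-pad to five digits, matching the gd-num-padded field
    padded = str(gutenberg_id).zfill(5)
    return padded[:3] + "/" + padded + ".txt"

print(path_for_id(12345))  # 123/12345.txt
print(path_for_id(456))    # 004/00456.txt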

The files themselves have had Project Gutenberg boilerplate headers and footers stripped away for your convenience. (The code used to strip the boilerplate is copied from GutenTag.) You may still want to do your own sanity check on any files that matter to you, to confirm that they contain what you expect.
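
If you want a quick (admittedly crude) automated version of that check, you can scan a retrieved file for tell-tale boilerplate phrases; the marker strings below are only illustrative, since the exact wording varies between Gutenberg releases:

from gutenbergdammit.ziputils import retrieve_one

text = retrieve_one("gutenberg-dammit-files-v002.zip", "142/14293.txt")
# illustrative markers only; real boilerplate wording varies by release
markers = ["PROJECT GUTENBERG EBOOK", "PROJECT GUTENBERG-TM"]
for marker in markers:
    if marker in text.upper():
        print("possible leftover boilerplate:", marker)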

Metadata

The gutenberg-metadata.json file in the zip is a big JSON file with metadata on each book. It contains a list of JSON objects, one per book, with the following format:

{
    "Author": [ "Robert Carlton Brown" ],
    "Author Birth": [ 1886 ],
    "Author Death": [ 1959 ],
    "Author Given": [ "Robert Carlton" ],
    "Author Surname": [ "Brown" ],
    "Copyright Status": [ "Not copyrighted in the United States." ],
    "Language": [ "English" ],
    "LoC Class": [ "SF: Agriculture: Animal culture" ],
    "Num": "14293",
    "Subject": [ "Cookery (Cheese)", "Cheese" ],
    "Title": [ "The Complete Book of Cheese" ],
    "charset": "iso-8859-1",
    "gd-num-padded": "14293",
    "gd-path": "142/14293.txt",
    "href": "/1/4/2/9/14293/14293_8.zip"
}

The capitalized fields correspond to the fields in the official Project Gutenberg metadata, with information about the author broken out into the birth/death/given/surname fields when possible. Fields are presented as lists to accommodate books that (e.g.) have more than one author or title.

The lower-case fields are metadata specific to this corpus, explained below:

  • charset: The character set of the original file. All of the files in the ZIP are in UTF-8 encoding, so this is only helpful if (e.g.) you're using the metadata to refer back to the original file on the Gutenberg website.
  • gd-num-padded: The book number ("Gutenberg ID") left-padded to five digits with zeros.
  • gd-path: The path to the file inside the Gutenberg Dammit zip file, to be appended to the gutenberg-dammit-files/ directory present in the zip file itself.
  • href: The path to the file in the original GutenTag corpus.

NOTE: Not all records have every field, and not every field is guaranteed to be non-empty.
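
Given that, it's safest to access fields defensively. A short sketch using loadmetadata and dict defaults:

from gutenbergdammit.ziputils import loadmetadata

metadata = loadmetadata("gutenberg-dammit-files-v002.zip")
for record in metadata:
    # any field may be missing or empty, so supply defaults
    titles = record.get("Title") or ["(untitled)"]
    languages = record.get("Language") or []
    if "English" in languages:
        print(titles[0], record.get("gd-path", ""))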

What was included, what was left out

First off, Gutenberg, dammit is based on files from Project Gutenberg, and doesn't include files from any of the related international projects (e.g. Project Gutenberg Canada, Project Gutenberg Australia).

Only Gutenberg items with plaintext files are included in this corpus. It doesn't include audiobooks, and it doesn't include any books available only in formats other than plain text (e.g., PDF or HTML).

In some cases, documents that are primarily available in some non-plaintext format will include a "stub" text file that just tells the reader to look at the other file. No attempt has been made to systematically exclude these from the present corpus.

Project Gutenberg includes a number of documents with content that is offensive. Given their possible academic and historical value, no effort has been made to systematically exclude these documents from this corpus. Please take care when including such documents (and portions thereof) in any analysis or creative reinterpretations. Just because a book is in the public domain doesn't mean you always have a right to use its words.

Character encodings

The included text files are all encoded as UTF-8. When converting from the original Project Gutenberg files, decoding is first attempted using the encoding declared in the file's metadata; if that fails, chardet's detect function is used to determine the most likely encoding, and decoding is attempted with that instead. If Python still raises an error when decoding with chardet's guess, ISO-8859-1 is tried as a last resort. If none of these attempts succeeds, the file is left out of the archive.
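
A sketch of that cascade (an illustration of the logic, not the actual corpus-building code; declared_encoding stands in for whatever the metadata declares):

import chardet

def decode_gutenberg(raw_bytes, declared_encoding):
    # 1. try the encoding declared in the Gutenberg metadata
    try:
        return raw_bytes.decode(declared_encoding)
    except (UnicodeDecodeError, LookupError, TypeError):
        pass
    # 2. fall back to chardet's best guess
    guess = chardet.detect(raw_bytes)["encoding"]
    if guess:
        try:
            return raw_bytes.decode(guess)
        except (UnicodeDecodeError, LookupError):
            pass
    # 3. last resort; files that still couldn't be decoded were dropped
    return raw_bytes.decode("iso-8859-1")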

How to Gutenberg, dammit from scratch

If you just want to use the corpus, don't bother with any of the content that follows. If you want to be able to recreate the process of how I made the corpus, read on.

The scripts in this repository work on the files prepared by GutenTag. In order to use the scripts, you'll need to download their corpus ("Our (full) Project Gutenberg Corpus", ~7 GB ZIP file) and unzip it into a directory on your system.

The included module gutenbergdammit/build.py is designed to be used as a command-line script. Run it on the command line like so:

python -m gutenbergdammit.build --src-path=<path to your gutentag download> \
    --dest-path=output --metadata-file=output/gutenberg-metadata.json

Help on the options:

Usage: build.py [options]

Options:
-h, --help            show this help message and exit
-s SRC_PATH, --src-path=SRC_PATH
                        path to GutenTag dump
-d DEST_PATH, --dest-path=DEST_PATH
                        path to output (will be created if it doesn't exist)
-m METADATA_FILE, --metadata-file=METADATA_FILE
                        path to metadata file for output (will be overwritten)
-l LIMIT, --limit=LIMIT
                        limit to n entries (good for testing)
-o OFFSET, --offset=OFFSET
                        start at index n (good for testing)

The --limit and --offset options are not required, and, if omitted, the tool will default to processing the entire archive.
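
For example, to do a quick test run over just the first 100 entries:

python -m gutenbergdammit.build --src-path=<path to your gutentag download> \
    --dest-path=output --metadata-file=output/gutenberg-metadata.json \
    --limit=100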

Notes on implosion

Python's zipfile module doesn't support the compression algorithm used on some of the files in the Gutenberg archive ("implosion"). Whoops. Included in the repository is a script that unzips and re-zips these files using a modern compression algorithm. To run it:

python -m gutenbergdammit.findbadzips --src-path=<gutentag_dump> --fix

This will modify the ~100 files in your GutenTag dump with broken ZIP compression, and save copies of the originals (with -orig at the end of the filename). Leave off --fix to do a dry run (i.e., just show which files are bad, don't fix them).

To use this script, you'll need to have the zip and unzip binaries on your system and in your path. It also probably assumes UNIX-ey paths (i.e., separated with slashes), but a lot of stuff in here does. Pull requests welcome.
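
Incidentally, the offending entries can be spotted from Python itself, since zipfile can list an archive's contents (and each entry's compression method) even when it can't decompress them; "implode" is method number 6 in the ZIP spec. This is just a sketch, not what findbadzips actually does:

import zipfile

IMPLODE = 6  # ZIP compression method number for "implode"

def has_imploded_entries(path):
    # zipfile can read the table of contents even when it can't
    # decompress the entries themselves
    with zipfile.ZipFile(path) as zf:
        return any(info.compress_type == IMPLODE for info in zf.infolist())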

Next steps

  • Rework this process so it can construct a similarly-organized archive starting with a straight-up mirror of Project Gutenberg (rather than the GutenTag corpus, which is a combination of the 2010 DVD ISO and I think more recent entries collected via web scraping?)
  • Implement a process for adding newer files to the corpus (by looking at the RSS feed?)
  • Make the corpus zip file into a torrent or something so I'm not paying for every download

Works cited

Brooke, Julian, et al. “GutenTag: An NLP-Driven Tool for Digital Humanities Research in the Project Gutenberg Corpus.” CLfL@NAACL-HLT, 2015, pp. 42–47.

Version history

  • v0.0.2 (2018-08-11): Fixed character encoding problems and released a new version of the archive with the corrected encodings.
  • v0.0.1 (2018-08-10): Initial release.

License

In accordance with GutenTag's license:

This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-sa/4.0/ or send a letter to Creative Commons, 444 Castro Street, Suite 900, Mountain View, California, 94041, USA.


Known issues

cache/store chardet results per file

The chardet detection works pretty well but also takes a long time—I didn't time it exactly, but it felt like it added at least an hour to the time it took the corpus-building process to run on my MacBook Air. Since these files aren't going to change, it makes sense to pre-build and cache the results so that subsequent runs of the corpus-building script don't need to re-run the detection process.
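
Something along these lines would do it (a sketch; chardet-cache.json is a hypothetical filename):

import json
import os

import chardet

CACHE_PATH = "chardet-cache.json"  # hypothetical cache file
_cache = None

def detect_cached(path):
    # run chardet on a file, remembering the verdict across runs
    global _cache
    if _cache is None:
        if os.path.exists(CACHE_PATH):
            with open(CACHE_PATH) as fh:
                _cache = json.load(fh)
        else:
            _cache = {}
    if path not in _cache:
        with open(path, "rb") as fh:
            _cache[path] = chardet.detect(fh.read())["encoding"]
        with open(CACHE_PATH, "w") as fh:
            json.dump(_cache, fh)
    return _cache[path]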

remove dependency on GutenTag

GutenTag is an amazing project but I really only used its code and corpus as a way to quickly "bootstrap" the necessary code and files. I don't think it's a sustainable foundation moving forward, especially for keeping the files in this corpus up-to-date with the latest releases (and metadata changes) on Project Gutenberg itself. My best idea so far is to set up a Project Gutenberg mirror and modify the code to work directly on the files from the mirror, but that obviously takes a lot of effort (and hard drive space). Open to other suggestions.

check file extension when retrieving files from zip

A small subset of files (e.g. etext98/sesli10.zip, etext04/stryb10.zip) have a JPEG as the first entry in their ZIP file from the ISO, which the code blithely interprets as a text file (since it's only looking for the first entry in the ZIP, see this line). It should probably only look at files with a particular extension (i.e., .txt).
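
A fix along those lines might look something like this (a sketch; I haven't checked it against the actual retrieval code):

import zipfile

def first_text_entry(path):
    # return the bytes of the first .txt entry, skipping JPEGs
    # and other non-text files
    with zipfile.ZipFile(path) as zf:
        for name in zf.namelist():
            if name.lower().endswith(".txt"):
                return zf.read(name)
    raise ValueError("no .txt entry in " + path)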

update metadata?

In some cases, it looks like the metadata from the GutenTag dump (itself based on the DVD ISO) is out-of-date with the live Project Gutenberg site. For example, Coleridge's Complete Poetical Works has a subject tag on the live site, but that subject tag is missing in the GutenTag HTML metadata (and thus from the metadata in the Gutenberg, dammit archive). Fixing this might depend on a fix for #3, but could also possibly be fixed by just using the most up-to-date RDFs from the catalog data?

metadata missing some titles

I dunno whether this is a helpful kind of issue to report here or if it just reflects some upstream problem, but reporting here is pretty easy, so:

The metadata has "?" for some titles. When I searched the internet for those works, I found some of the titles:

50624: Lorenzo de' Medici, the Magnificent (vol. 1 of 2)
50625: Lorenzo de' Medici, the Magnificent (vol. 2 of 2)
51307: This House to Let
51950: The Prodigal Son

When I eyeball these records, it looks like they're also missing Author info, maybe other stuff? I stopped looking.

UnicodeDecodeError/NameError when installing from setup.py (Windows-related?)

First of all, I wanted to say that this package is awesome! I have been wanting to interact with the Gutenberg corpus for a while now, but I always ended up running into obstacles and giving up prematurely. I'm glad someone else beat me to the punch!

So I've had some issues getting this package up and running on Windows. I'm running a 64-bit install of Python 3.6.5, for reference.

I initially cloned the repository and then attempted to install from setup.py. This is the output I saw:

$ python setup.py install
Traceback (most recent call last):
  File "setup.py", line 4, in <module>
    readme = readme_file.read()
  File "C:\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 10949: character maps to <undefined>

I think that this issue is Windows-related. The open() function on line 3 seems to be defaulting to the console's encoding of Windows-1252 rather than UTF-8. I was able to fix it by specifying the encoding on line 3.

from setuptools import setup
    
with open('README.md', encoding='UTF-8') as readme_file:
    readme = readme_file.read()
...

After changing line 3, I tried installing again and received the following output:

$ python setup.py install
Traceback (most recent call last):
  File "setup.py", line 14, in <module>
    packages=setuptools.find_packages(),
NameError: name 'setuptools' is not defined

This issue seems more related to the version of Python I'm running (at least, I doubt it is platform dependent like the UnicodeDecodeError). I was able to fix it by adding an explicit import at the top of the file:

import setuptools
from setuptools import setup
...

I'm willing to submit a pull request with the above changes if they all seem fine to you.

Around 50 French files have broken encodings

Hello,

For those of you who intend to use French documents in this corpus, know that of the 2,647 French books included, 49 have broken encodings: all accented letters have been stripped. A quick way to find the culprits is to search a book for the letter 'é'; the affected files won't contain it.

at least one file's utf-8 encoding is wrong, presumably more?

Hi, thanks for this excellent work!

I suspect it's not an isolated incident, but don't presently have anything beyond a single anecdote:

  • 夢溪筆談 is valid UTF-8 Chinese text on the Project Gutenberg website.
  • But file 073/07317.txt in the gutenberg-dammit corpus is valid UTF-8 gibberish.
  • If you take the gutenberg-dammit file and re-encode its text as Latin-1, you end up with bytes that chardet identifies as Big5 text. This appears to be mostly correct, except that there is some garbage in it, so it cannot be recoded cleanly by any of the few tools I tried. (The conversion is sketched below.)
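
For reference, the conversion described in that last bullet can be reproduced like so (a sketch that assumes the text round-trips through Latin-1 as described; expect replacement characters where the garbage bytes are):

from gutenbergdammit.ziputils import retrieve_one

garbled = retrieve_one("gutenberg-dammit-files-v002.zip", "073/07317.txt")
raw = garbled.encode("latin-1")              # undo the mistaken Latin-1 step
text = raw.decode("big5", errors="replace")  # recover (most of) the Big5 text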

Anyway that's the data I have for now…
