
gutenberg's People

Contributors

andrewyang96, bwindsor22, c-w, cpeel, hugovk, ikarth, lifuhuang, lissyx, masterodin, sethwoodworth, srisi


gutenberg's Issues

How can I lend a hand?

I've been following your project since it appeared on HN about a month ago, through the subsequent move to GitHub and your recent work using SPARQL for queries.

I have a big need for your metadata parser: I forked PG to GitHub and want a better way to access and handle book metadata.

I notice that you've really cleaned up the repo and reorganized. Could you give me a rough walkthrough of how you want new methods organized for the library search? I would like to make as much of the original RDF exposed via your library as possible.

Looking for a new maintainer

I'm no longer regularly working with Project Gutenberg data, so I'm no longer in the position to maintain this project. As such, I'm looking for a new maintainer.

What do you get as the maintainer?

  • Work on an actively used library.
  • Python 2/3 codebase.
  • Reasonable test coverage.
  • Some low-hanging-fruit features to implement and bugs to fix, so lots of stuff to do :)

If you are interested in taking over as the maintainer of the Gutenberg project, reply to this issue.

How do I get the id of each book?

Hello,
I have been looking for an intuitive way to get the ID of each book.

Getting the ID from the webpage of each book doesn't seem to work.
When I run text = strip_headers(load_etext(17384)).strip(), it says the book doesn't exist.

One way would be to look at the catalogs, e.g.
http://www.gutenberg.org/dirs/GUTINDEX.1996
However, these indexes are incomplete, and there are too many files.

Ideally I would like a way to search with some keywords, get a list of books, and then use the title or identifier to get the text out.
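For what it's worth, a minimal sketch of that workflow using only calls that appear elsewhere on this page; note that get_etexts matches exact metadata values, so this is a title lookup rather than a true keyword search:

    from gutenberg.acquire import load_etext
    from gutenberg.cleanup import strip_headers
    from gutenberg.query import get_etexts, get_metadata

    # Look up IDs by exact title, then fetch and clean each text.
    for etext_id in get_etexts('title', 'Moby Dick; Or, The Whale'):
        print(etext_id, get_metadata('author', etext_id))
        text = strip_headers(load_etext(etext_id)).strip()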

Cache creation warning, how to use BSD-DB?

Hi, I followed your command/script to get the metadata...

from gutenberg.acquire import get_metadata_cache
cache = get_metadata_cache()
cache.populate()

And I get the following, so I wonder how to run it so that it stores the data in the faster DB backend:
WARNING:root:Unable to create cache based on BSD-DB. Falling back to SQLite backend. Performance may be degraded significantly.

Thanks for sharing this!
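For what it's worth, a sketch of selecting the backend explicitly instead of relying on the fallback. set_metadata_cache and SleepycatMetadataCache are assumptions based on later versions of the library (SqliteMetadataCache does appear in a traceback further down this page); having bsddb3 installed is the actual prerequisite for the fast path:

    from gutenberg.acquire import set_metadata_cache
    from gutenberg.acquire.metadata import SleepycatMetadataCache

    # Build the BSD-DB-backed cache explicitly; if bsddb3 is missing this
    # fails loudly instead of silently falling back to the slower SQLite.
    cache = SleepycatMetadataCache('/path/to/cache')
    cache.populate()
    set_metadata_cache(cache)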

Include note about Berkeley-DB version in readme

bsddb3 does not work with Berkeley-DB v6.x because the latter changed its license to AGPL3. The readme should therefore recommend installing a Berkeley-DB version between 4.8.30 and 5.x (and perhaps include notes on installing Berkeley-DB for Mac/Linux/Windows).

The output of trying to install bsddb3:

λ ~/Github/gutenberg/ master pip3 install -r requirements-py3.pip
Collecting bsddb3>=6.1.0 (from -r requirements-py3.pip (line 1))
  Using cached bsddb3-6.2.1.tar.gz
    Complete output from command python setup.py egg_info:
    Trying to use the Berkeley DB you specified...
    Detected Berkeley DB version 6.1 from db.h

    ******* COMPILATION ABORTED *******

    You are linking a Berkeley DB version licensed under AGPL3 or have a commercial license.

    AGPL3 is a strong copyleft license and derivative works must be equivalently licensed.

    You have two choices:

      1. If your code is AGPL3 or you have a commercial Berkeley DB license from Oracle, please, define the environment variable 'YES_I_HAVE_THE_RIGHT_TO_USE_THIS_BERKELEY_DB_VERSION' to any value, and try to install this python library again.

      2. In any other case, you have to link to a previous version of Berkeley DB. Remove Berlekey DB version 6.x and let this python library try to locate an older version of the Berkeley DB library in your system. Alternatively, you can define the environment variable 'BERKELEYDB_DIR', or 'BERKELEYDB_INCDIR' and 'BERKELEYDB_LIBDIR', with the path of the Berkeley DB you want to use and try to install this python library again.

    Sorry for the inconvenience. I am trying to protect you.

    More details:

        https://forums.oracle.com/message/11184885
        http://lists.debian.org/debian-legal/2013/07/

    ******* COMPILATION ABORTED *******


    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/nj/knbr_l_s64b08h8blk4xbhnw0000gn/T/pip-build-q2wg_jiu/bsddb3/

Additionally, I believe the Python 2 bindings are only compatible up to 4.8.

Remove support for python 3.2

The library has enough incompatibilities with Python 3.2 that it's probably best to just drop it. The current blocker is that the requests library doesn't support it (thread on that here). A good deal of other popular packages have already dropped support, so it's not unreasonable to do the same here (and there shouldn't really be any modern Linux distro still shipping it).

Migrate RDFlib backend away from BSD-DB

Issues #26 and #28 show that there are some problems with RDFlib's default BSD-DB backend under Python 3 and on Windows. This makes it worthwhile to spend some time investigating a switch away from BSD-DB towards an alternative data store, e.g. SQLite via RDFlib-SQLAlchemy.

Cannot install on Python 2.7 due to bsddb

  • macOS Sierra
  • Python 2.7.12
  • Gutenberg 0.4.2
$ pip install gutenberg
Collecting gutenberg
  Downloading Gutenberg-0.4.2.tar.gz
Collecting bsddb3>=6.1.0 (from gutenberg)
  Using cached bsddb3-6.2.1.tar.gz
    Complete output from command python setup.py egg_info:
    Can't find a local Berkeley DB installation.
    (suggestion: try the --berkeley-db=/path/to/bsddb option)

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/zm/drwrq_ld04s7vk50p00c9lnw0000gn/T/pip-build-TGvyoe/bsddb3/

The installation instructions in the README say bsddb is only needed for Python 3:

Installation

This project is on PyPI, so I'd recommend that you just install everything from there using your favourite Python package manager.

pip install gutenberg

...

Python 3

This package depends on BSD-DB. The bsddb module was removed from the Python standard library after version 2.7. This means that if you wish to use gutenberg on Python 3, you will need to manually install BSD-DB.

Is that not the case? Is bsddb3 needed for Python 2?

Make `get_etexts` case insensitive

Possible implementation: normalize the case of all literals added to the meta-data RDF graph on creation (e.g. lowercase everything) and apply the same normalization to get_etexts query values.
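A minimal sketch of that proposal, assuming a helper applied symmetrically on cache creation and on query (this is a proposal, not the current implementation):

    # Proposal sketch: case-fold metadata literals once when triples are
    # added to the graph, and again on every query value, so matching
    # stays an exact (indexable) lookup rather than a scan.
    def _normalize(value):
        return value.lower() if isinstance(value, str) else value

    # get_etexts('title', 'MOBY DICK; OR, THE WHALE') would then query
    # with 'moby dick; or, the whale' and match the normalized literal.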

The cache is invalid or not created

Hi guys, I built the cache and used it well, but after upgrading my computer's OS I started getting this error when invoking the get_metadata method, even though the cache is in the same place with the same permissions (everyone can read+write). Any hints?

"The cache is invalid or not created"

I'd like to avoid having to recreate the cache, as it may take hours on end to download.

Thanks!

strip_headers failure cases

Here's a list of files that have some degree of failure in strip_headers(): https://gist.github.com/ikarth/49e10e7b5a66fe8d6732

This was made by grepping over the English-language, public domain text files from the subset corpus I've been assembling (April 2010 DVD + another archive from 2013) for the first mention of "Project Gutenberg".

There are a few false positives like "THE COMPLETE PROJECT GUTENBERG WORKS OF GEORGE MEREDITH" and a few transcriber notes, but the most common bits to stay untouched are the notes in illustrated books saying an HTML version with pictures is available.

Removing or not removing some of these may be an aesthetic call, but I figured the data might be useful.

Windows: bsddb error in rdflib

The library installs, and text = strip_headers(load_etext(2701)).strip() works fine.

But when I run:

from gutenberg.query import get_metadata
print(get_metadata("title", 2701))

I get:

Traceback (most recent call last):
  File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\query\api.py", line 101, in _metadata
    return cls.__metadata
AttributeError: type object 'TitleExtractor' has no attribute '_MetadataExtractor__metadata'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\query\api.py", line 34, in get_metadata
    metadata_values = MetadataExtractor.get(feature_name).get_metadata(etextno)
  File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\query\extractors.py", line 30, in get_metadata
    query = cls._metadata()[etext:cls.predicate():]
  File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\query\api.py", line 103, in _metadata
    cls.__metadata = load_metadata()
  File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\acquire\metadata.py", line 130, in load_metadata
    return _open_or_create_metadata_graph()
  File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\acquire\metadata.py", line 108, in _open_or_create_metadata_graph
    _METADATA_DATABASE_SINGLETON = _create_metadata_graph()
  File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\acquire\metadata.py", line 91, in _create_metadata_graph
    return Graph(store=store, identifier='urn:gutenberg:metadata')
  File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\rdflib\graph.py", line 312, in __init__
    self.__store = store = plugin.get(store, Store)()
  File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\rdflib\plugins\sleepycat.py", line 52, in __init__
    "Unable to import bsddb/bsddb3, store is unusable.")
ImportError: Unable to import bsddb/bsddb3, store is unusable.

Near as I can tell, it seems to be an error with RDFlib relying on bsddb3, and bsddb3 not maintaining their Windows package properly.

Trying to install bsddb3 directly with pip install bsddb3 gets me:

Collecting bsddb3
  Using cached bsddb3-6.1.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "C:\Users\User\AppData\Local\Temp\pip-build-uvi4bvaa\bsddb3\setup.py", line 42, in <module>
        import setup3
      File "C:\Users\User\AppData\Local\Temp\pip-build-uvi4bvaa\bsddb3\setup3.py", line 375, in <module>
        with open(os.path.join(incdir, 'db.h'), 'r') as f :
    FileNotFoundError: [Errno 2] No such file or directory: 'db/include\\db.h'
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\User\AppData\Local\Temp\pip-build-uvi4bvaa\bsddb3

I'm using Windows 7 and Python 3.4 x64.

Is there any easy way to switch the backend persistence store to, say, RDFLib-SQLAlchemy?

Fuzzy Searches on Author

It seems that at different times in this project's development, fuzzy searches on an author's name have been discussed and were even implemented at one stage (using WHERE author LIKE in https://github.com/c-w/Gutenberg/blob/710052ce5cab7ea45b101ab756c7f1b29091236a/gutenberg/corpus.py#L48), so I apologise if this is redundant.

My use case for this project is essentially grabbing the text content of texts written by a given author. Judging by the README, get_etexts("author", "Melville, Hermann") provides the file IDs, which can then be easily downloaded.

However, get_etexts("author", "Melville, Herman") (i.e dropping the second "n", as the author's name is actually shown at http://www.gutenberg.org/ebooks/author/9) returns an empty frozenset.

Should I assume that this project currently does not support this feature? If not, I'd be happy to contribute, but the current RDF-based implementation isn't something I currently have any experience with.

Implement fuzzy matching for text retrieval

It would be great if there was an option to make the text retrieval functions on the Corpus class (like texts_for_author) perform fuzzy matching so that small spelling mistakes can automatically be corrected.
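A minimal sketch of one way to get there using only the standard library; this is an illustration, not the project's plan:

    import difflib

    def fuzzy_pick(query, candidates, cutoff=0.8):
        # Return the candidate spellings closest to the query string.
        return difflib.get_close_matches(query, candidates, n=5, cutoff=cutoff)

    # fuzzy_pick('Melville, Hermann', all_author_names) would surface
    # 'Melville, Herman', so the exact-match lookup could be retried with
    # the corrected spelling. all_author_names is a hypothetical list of
    # every author literal in the metadata graph.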

New PyPI release?

Any chance of a new release? It's been over 8 months since the last one on PyPI and I just ran into the same problem and was about to submit a fix for #19 but discovered it's already fixed :)

Thanks!

Add more tests

The library is in dire need of more (doc|unit|regression) tests.

Performance optimization of get_etexts

The get_etexts function is very slow.

This is likely caused by the fact that with the current implementation, a call to get_etexts(A, B) finds all the triples (s, p, o) for which p==A and then discards all such triples for which o != B in a linear scan. Given that the set {(s, p, o) | p==A} is likely quite large, the linear scan is expensive.
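To illustrate with rdflib calls (g stands for the metadata Graph, A and B for the feature predicate and query value, as above; a sketch, not the library's code):

    # Current approach, roughly: enumerate every triple with predicate A,
    # then discard non-matching objects in a Python-level linear scan.
    matches = frozenset(
        s for s, o in g.subject_objects(predicate=A) if o == B)

    # Filtering on both predicate and object inside the store lets it use
    # its own (p, o) indexes instead of scanning in Python.
    matches = frozenset(g.subjects(predicate=A, object=B))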

Update urls to not use bitbucket

The URL in setup.py and the one on the PyPI page both point to the now-defunct Bitbucket repo and should probably be updated.

On pypi:

git clone https://bitbucket.org/c-w/gutenberg.git && cd gutenberg

in setup.py:

    url='https://bitbucket.org/c-w/gutenberg/',

API v2

Now that we have a couple of use-cases for the Gutenberg library, I figure it would be a good time to refactor the API. The current implementation was okay to get off the ground; however, it has turned out to be hard to test and over-complicated.

@MasterOdin, @sethwoodworth:

  • What do you think about this?
  • What are your use-cases for this library i.e. what functionality do you really need?

Currently I'm thinking of a really simple API:

type Text = String
type Attribute = (String, String)
type Uid = Integer

_text_for_uid :: Uid -> Text -- returns the text for a given UID
_attributes_for_uid :: Uid -> [Attribute] -- returns the properties of a given UID, e.g. author or title

texts_for_attribute :: Attribute -> [Text] -- returns all the texts for a given attribute-value combination such as ('author', 'Jules Verne'), ('year', '1996') or ('title', 'Moby Dick')

_attributes_for_uid parses the Gutenberg meta-data tarball using RDF to extract all attributes for all the texts. _text_for_uid takes care of downloading the texts (and/or storing them on disk). The last function is trivially defined in terms of the first two.
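To make the sketch concrete, here is a speculative Python rendering. The function names come straight from the signatures above, the bodies delegate to calls that already appear on this page, and none of this is a published interface:

    from gutenberg.acquire import load_etext
    from gutenberg.query import get_etexts, get_metadata

    def _text_for_uid(uid):
        # Uid -> Text: download (or load the cached copy of) a text.
        return load_etext(uid)

    def _attributes_for_uid(uid, features=('author', 'title')):
        # Uid -> [Attribute]: (name, value) pairs from the RDF metadata.
        return [(f, v) for f in features for v in get_metadata(f, uid)]

    def texts_for_attribute(attribute):
        # Attribute -> [Text]: all texts for one (name, value) pair.
        feature, value = attribute
        return [_text_for_uid(uid) for uid in get_etexts(feature, value)]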

MemoryError: (12, 'Not enough space -- unable to allocate memory for mutex; resize mutex region')

I'm getting a MemoryError when using get_metadata. It was working for me earlier. I have about 7.5GB free on C:\, 53GB free on E:\, and 4GB of free RAM.

C:\stufftodelete>pip freeze | grep Gutenberg
Gutenberg==0.4.0

C:\stufftodelete>python
Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from gutenberg.acquire import load_etext
INFO:rdflib:RDFLib Version: 4.2.1
>>> from gutenberg.cleanup import strip_headers
>>> text = strip_headers(load_etext(2701)).strip()
>>> len(text)
1237486
>>> text[:100]
u"MOBY DICK; OR THE WHALE\r\n\r\nBy Herman Melville\r\n\r\n\r\n\r\n\r\nOriginal Transcriber's Notes:\r\n\r\nThis text is
"
>>> 
>>> from gutenberg.query import get_etexts
>>> from gutenberg.query import get_metadata
>>> 
>>> get_metadata('title', 2701)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\gutenberg\query\api.py", line 34, in get_metadata
    metadata_values = MetadataExtractor.get(feature_name).get_metadata(etextno)
  File "C:\Python27\lib\site-packages\gutenberg\query\extractors.py", line 30, in get_metadata
    query = cls._metadata()[etext:cls.predicate():]
  File "C:\Python27\lib\site-packages\gutenberg\query\api.py", line 103, in _metadata
    cls.__metadata = load_metadata()
  File "C:\Python27\lib\site-packages\gutenberg\acquire\metadata.py", line 130, in load_metadata
    return _open_or_create_metadata_graph()
  File "C:\Python27\lib\site-packages\gutenberg\acquire\metadata.py", line 112, in _open_or_create_metadata_graph
    _METADATA_DATABASE_SINGLETON.open(_METADATA_CACHE, create=False)
  File "C:\Python27\lib\site-packages\rdflib\graph.py", line 376, in open
    return self.__store.open(configuration, create)
  File "C:\Python27\lib\site-packages\rdflib\plugins\sleepycat.py", line 89, in open
    db_env = self._init_db_environment(homeDir, create)
  File "C:\Python27\lib\site-packages\rdflib\plugins\sleepycat.py", line 75, in _init_db_environment
    db_env.open(homeDir, ENVFLAGS | db.DB_CREATE)
MemoryError: (12, 'Not enough space -- unable to allocate memory for mutex; resize mutex region')
>>> 
>>> get_metadata('author', 2701)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\gutenberg\query\api.py", line 34, in get_metadata
    metadata_values = MetadataExtractor.get(feature_name).get_metadata(etextno)
  File "C:\Python27\lib\site-packages\gutenberg\query\extractors.py", line 31, in get_metadata
    return frozenset(result.toPython() for result in query)
  File "C:\Python27\lib\site-packages\gutenberg\query\extractors.py", line 31, in <genexpr>
    return frozenset(result.toPython() for result in query)
  File "C:\Python27\lib\site-packages\rdflib\graph.py", line 634, in objects
    for s, p, o in self.triples((subject, predicate, None)):
  File "C:\Python27\lib\site-packages\rdflib\graph.py", line 421, in triples
    for _s, _o in p.eval(self, s, o):
  File "C:\Python27\lib\site-packages\rdflib\paths.py", line 227, in _eval_seq
    for s, o in evalPath(graph, (subj, paths[0], None)):
  File "C:\Python27\lib\site-packages\rdflib\paths.py", line 425, in <genexpr>
    return ((s, o) for s, p, o in graph.triples(t))
  File "C:\Python27\lib\site-packages\rdflib\graph.py", line 424, in triples
    for (s, p, o), cg in self.__store.triples((s, p, o), context=self):
  File "C:\Python27\lib\site-packages\rdflib\plugins\sleepycat.py", line 371, in triples
    assert self.__open, "The Store must be open."
AssertionError: The Store must be open.
>>>

Any ideas what's wrong?

Package on PyPI is missing requirements.pip

root@ol-dev:/openlibrary/Gutenberg# pip --no-cache-dir install gutenberg==0.4.1
Collecting gutenberg==0.4.1
/usr/local/lib/python2.7/dist-packages/pip-7.1.2-py2.7.egg/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
  InsecurePlatformWarning
  Downloading Gutenberg-0.4.1.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 20, in <module>
      File "/tmp/pip-build-3CKm7N/gutenberg/setup.py", line 19, in <module>
        install_requires=list(line.strip() for line in open('requirements.pip')))
    IOError: [Errno 2] No such file or directory: 'requirements.pip'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-3CKm7N/gutenberg

A fresh download of this repo + python setup.py sdist shows that the file is correctly included in the .tar.gz, but the PyPI one is clearly missing it:

root@ol-dev:/tmp# wget https://pypi.python.org/packages/source/G/Gutenberg/Gutenberg-0.4.1.tar.gz
--2016-01-03 15:13:19--  https://pypi.python.org/packages/source/G/Gutenberg/Gutenberg-0.4.1.tar.gz
Resolving pypi.python.org (pypi.python.org)... 23.235.47.223
Connecting to pypi.python.org (pypi.python.org)|23.235.47.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9933 (9.7K) [application/octet-stream]
Saving to: 'Gutenberg-0.4.1.tar.gz'

100%[==========================================================>] 9,933       --.-K/s   in 0s

2016-01-03 15:13:19 (97.7 MB/s) - 'Gutenberg-0.4.1.tar.gz' saved [9933/9933]

root@ol-dev:/tmp# tar tvfz Gutenberg-0.4.1.tar.gz | grep requirements
root@ol-dev:/tmp#

nosetest failures: UnicodeDecodeError

Downloading the source and running nosetests, I get eight UnicodeDecodeError test failures.

======================================================================                                        
ERROR: test_load_metadata (tests.test_acquire.TestLoadMetadata)                                               
----------------------------------------------------------------------                                        
Traceback (most recent call last):                                                                            
  File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\_util.py", line 41, in setUp                      
    self.cache.populate()                                                                                     
  File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\_util.py", line 58, in populate                   
    data = u('\n').join(item.rdf() for item in self.sample_data_factory())                                    
  File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\_util.py", line 58, in <genexpr>                  
    data = u('\n').join(item.rdf() for item in self.sample_data_factory())                                    
  File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\_sample_metadata.py", line 83, in all             
    yield SampleMetaData.for_etextno(int(etextno))                                                            
  File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\_sample_metadata.py", line 77, in for_etextno     
    metadata = _load_metadata(etextno)                                                                        
  File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\_sample_metadata.py", line 93, in _load_metadata  
    return json.load(open(data_path))                                                                         
  File "c:\tools\anaconda3\envs\genmoenv\Lib\json\__init__.py", line 265, in load                             
    return loads(fp.read(),                                                                                   
  File "f:\isaac\dev\nanogenmo2016\projects\virtualenvgutenberg\lib\encodings\cp1252.py", line 23, in decode  
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]                                         
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 56: character maps to <undefined>      

(I also get import sqlalchemy errors, which I'm assuming are unrelated.)

======================================================================
ERROR: test_delete (tests.test_metadata_cache.TestSqlite)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\test_metadata_cache.py", line 111, in setUp
    self.cache = SqliteMetadataCache(self.local_storage)
  File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\gutenberg\acquire\metadata.py", line 205, in __init__
    store = plugin.get('SQLAlchemy', Store)(identifier=_DB_IDENTIFIER)
  File "f:\isaac\dev\nanogenmo2016\projects\virtualenvgutenberg\lib\site-packages\rdflib\plugin.py", line 104, in get
    return p.getClass()
  File "f:\isaac\dev\nanogenmo2016\projects\virtualenvgutenberg\lib\site-packages\rdflib\plugin.py", line 81, in getClass
    self._class = self.ep.load()
  File "f:\isaac\dev\nanogenmo2016\projects\virtualenvgutenberg\lib\site-packages\pkg_resources\__init__.py", line 2249, in load
    return self.resolve()
  File "f:\isaac\dev\nanogenmo2016\projects\virtualenvgutenberg\lib\site-packages\pkg_resources\__init__.py", line 2255, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "f:\isaac\dev\nanogenmo2016\projects\virtualenvgutenberg\lib\site-packages\rdflib_sqlalchemy\SQLAlchemy.py", line 6, in <module>
    from . import sqlalchemy
ImportError: cannot import name 'sqlalchemy'

Use bsddb3 for Python 2.7

While Python 2.7 does come with a built-in bsddb module, it was deprecated in 2.6. It seems to me that we should just migrate wholly to bsddb3 for all supported versions, which would also simplify the project's setup_requires and allow us to add a Pipfile.

Thoughts @c-w, @hugovk?

Expose more meta-data

Currently this library only makes use of the author and title meta-data exposed by Project Gutenberg and does not leverage information such as genre, publication date, etc.

Making this information usable by the library is a pretty straightforward three-step process:

  1. The TextSource.textinfo_converter method needs to be extended to parse the new meta-data attributes.
  2. The new attributes need to be wired through to the TextInfo class.
  3. A new method leveraging the new meta-data source should be added to the Corpus class (such as texts_for_genre or texts_for_year).

Make TextSource state-aware

The TextSource object should probably track its state so that it only yields every text once (unless explicitly requested to re-yield all texts from the start).

load_etext() returns accent-pruned version of text

I came across this issue and I am not sure if it's a bug in the lib or some process on Gutenberg itself. The eBook with id 10160 can be seen at http://www.gutenberg.org/cache/epub/10160/pg10160.txt and it does contain accents. Hosted in the cache there are two versions of the text file:

Now, I am not sure if it's expected or not, but the one named 10160.txt does not contain any accents while the 10160-8.txt file does. Changing the order of the extensions in

extensions = ('.txt', '-8.txt', '-0.txt')

helps work around the issue.
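For reference, a sketch of the reordering described above; the tuple lives in gutenberg/acquire/text.py per the tracebacks elsewhere on this page, and the exact order chosen here is illustrative:

    # Try the 8-bit (-8) and UTF-8 (-0) variants before the plain ASCII
    # transcription so that accented characters survive when available.
    extensions = ('-8.txt', '-0.txt', '.txt')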

Cannot install on Windows to get to back up SQL option

Hi,

I'm a newbie, but I've spent a few hours trying to work through documentation and forums, and so far I have not found a workaround for the Berkeley DB dependency. It will not let me install gutenberg at all, so I can't make use of the backup SQL option you mentioned putting in. Have others noted this problem, and is there a way to install more directly than pip that would allow me to install it?

Thank you for your time,
Liz

Cheeseshop version fails installation because of missing 'README.rst'

Cheeseshop version fails installation because of missing 'README.rst'. Aren't you missing a MANIFEST.in with "include README.rst"?

The error message is here:

Downloading Gutenberg-0.2.1.tar.gz
Running setup.py (path:/tmp/pip_build_root/Gutenberg/setup.py) egg_info for package Gutenberg
Traceback (most recent call last):
  File "<string>", line 17, in <module>
  File "/tmp/pip_build_root/Gutenberg/setup.py", line 15, in <module>
    long_description=open('README.rst').read(),
IOError: [Errno 2] No such file or directory: 'README.rst'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 17, in <module>
  File "/tmp/pip_build_root/Gutenberg/setup.py", line 15, in <module>
    long_description=open('README.rst').read(),
IOError: [Errno 2] No such file or directory: 'README.rst'

pip install gutenberg on debian looks for /usr/include/db.h

I have solved it, finally: once you install libdb5.3-dev, it is OK. Note that libdb5.1-dev is no longer available, and libdb5.3 on its own is not the same: it does not create the /usr/include/db.h file.

I am now gonna test and use your module.
Thanks a lot in advance!! :)

I do not know whether this is really an issue or my lack of skills. I tested:
apt-get install libdb5.3
which gives:
libdb5.3 is already the newest version.

This is the output of the command line:

root@RoseWoodSamDebian:/home/arthurx/Python_course/google-python-exercises/basic# pip install gutenberg
Downloading/unpacking gutenberg
Downloading Gutenberg-0.4.2.tar.gz
Running setup.py (path:/tmp/pip-build-ZWxwvG/gutenberg/setup.py) egg_info for package gutenberg

Downloading/unpacking bsddb3>=6.1.0 (from gutenberg)
Downloading bsddb3-6.2.1.tar.gz (228kB): 228kB downloaded
Running setup.py (path:/tmp/pip-build-ZWxwvG/bsddb3/setup.py) egg_info for package bsddb3
Traceback (most recent call last):
  File "<string>", line 17, in <module>
  File "/tmp/pip-build-ZWxwvG/bsddb3/setup.py", line 40, in <module>
    import setup2
  File "setup2.py", line 332, in <module>
    with open(os.path.join(incdir, 'db.h'), 'r') as f :
IOError: [Errno 2] No such file or directory: '/usr/include/db.h'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
  File "<string>", line 17, in <module>
  File "/tmp/pip-build-ZWxwvG/bsddb3/setup.py", line 40, in <module>
    import setup2
  File "setup2.py", line 332, in <module>
    with open(os.path.join(incdir, 'db.h'), 'r') as f :
IOError: [Errno 2] No such file or directory: '/usr/include/db.h'


Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip-build-ZWxwvG/bsddb3
Storing debug log for failure in /root/.pip/pip.log
root@RoseWoodSamDebian:/home/arthurx/Python_course/google-python-exercises/basic#

strip_headers strips too much of the complete works of Shakespeare

https://www.gutenberg.org/cache/epub/100/pg100.txt

The Project Gutenberg EBook of The Complete Works of William Shakespeare, by William Shakespeare

Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from gutenberg.acquire import load_etext
INFO:rdflib:RDFLib Version: 4.2.1
>>> from gutenberg.cleanup import strip_headers
>>> text = strip_headers(load_etext(100)).strip()
>>> len(text)
5373
>>> len(etext)
5589915

etext (the raw, unstripped load_etext(100) output) contains the full text, from The Project Gutenberg EBook of The Complete Works of William Shakespeare up to and including *** END: FULL LICENSE ***.

But text only contains the text from *Project Gutenberg is proud to cooperate with The World Library* up to and including ["Small Print" V.12.08.93].

New Version

@c-w Do we want to release a new version now with the previous two PRs? Since we're moving to a new version of a dependency and adding a new flag, I'd say it should be 0.5.0 release, but I leave the final decision to you. I'm assuming that all new releases will go through you?

Note: you should be able to just draft a release through GitHub, and once it successfully builds on Travis it'll be auto-published to PyPI, so it's nice and easy to do all through CI tools.

More liberal on the requirements.txt

There are explicit version requirements in requirements.txt. It would be better if they were specified as looser constraints, e.g. just requiring at least a given version.

Remove u() shim

from six import u is considered unsafe. In #49 we discussed that we could remove this cumbersome workaround that was only needed to support Python 3.2. Now that 3.2 support has been dropped, the shim should be removed (the six docs note it is not ideal under Python 2 either).

Can't download the newest uploads

The two most recent uploads to Project Gutenberg are:

https://www.gutenberg.org/ebooks/50334
https://www.gutenberg.org/ebooks/50333

This code:

from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers

text = strip_headers(load_etext(50333)).strip()
print(text[:10])
text = strip_headers(load_etext(50334)).strip()
print(text[:10])

Produces:

Traceback (most recent call last):
  File "C:\temp\test.py", line 6, in <module>
    text = strip_headers(load_etext(50333)).strip()
  File "C:\Python27\lib\site-packages\gutenberg\acquire\text.py", line 71, in load_etext
    download_uri = _format_download_uri(etextno)
  File "C:\Python27\lib\site-packages\gutenberg\acquire\text.py", line 55, in _format_download_uri
    raise ValueError('download URI for {} not supported'.format(etextno))
ValueError: download URI for 50333 not supported

That's because it tries to download these but they don't exist:

http://www.gutenberg.lib.md.us/5/0/3/3/50333/50333.txt
http://www.gutenberg.lib.md.us/5/0/3/3/50333/50333-8.txt

Their Plain Text UTF-8 links look like this:

https://www.gutenberg.org/files/50334/50334-0.txt
https://www.gutenberg.org/cache/epub/50333/pg50333.txt

Is it because they're so new they've not been mirrored to http://www.gutenberg.lib.md.us, or something else?

Unexpected metadata for non-existent ebooks

The metadata database contains records for non-existent ebooks.

Exists: https://www.gutenberg.org/ebooks/1
Doesn't exist: https://www.gutenberg.org/ebooks/182

Example code:

from __future__ import print_function
from gutenberg.query import get_metadata

def get_all_metadata(etextno):
    for feature_name in ['author', 'formaturi', 'language',
                         'rights', 'subject', 'title']:
        print("{}\t{}\t{}".format(
            etextno,
            feature_name,
            get_metadata(feature_name, etextno)))
    print()

get_all_metadata(1)  # US Declaration of Independence
get_all_metadata(182)  # no such ebook

Actual output:

1       author  frozenset([u'United States President (1801-1809)'])
1       formaturi       frozenset([u'http://www.gutenberg.org/ebooks/1.txt.utf-8', u'http://www.gutenberg.org/ebooks/1.e
pub.noimages', u'http://www.gutenberg.org/6/5/2/6527/6527-t/6527-t.tex', u'http://www.gutenberg.org/ebooks/1.html.noimag
es', u'http://www.gutenberg.org/files/1/1.zip', u'http://www.gutenberg.org/ebooks/1.epub.images', u'http://www.gutenberg
.org/ebooks/1.rdf', u'http://www.gutenberg.org/ebooks/1.kindle.noimages', u'http://www.gutenberg.org/files/1/1.txt', u'h
ttp://www.gutenberg.org/ebooks/1.html.images', u'http://www.gutenberg.org/6/5/2/6527/6527-t.zip', u'http://www.gutenberg
.org/ebooks/1.kindle.images'])
1       language        frozenset([u'en'])
1       rights  frozenset([u'Public domain in the USA.'])
1       subject frozenset([u'E201', u'United States. Declaration of Independence', u'United States -- History -- Revolut
ion, 1775-1783 -- Sources', u'JK'])
1       title   frozenset([u'The Declaration of Independence of the United States of America'])

182     author  frozenset([])
182     formaturi       frozenset([])
182     language        frozenset([u'en'])
182     rights  frozenset([u'None'])
182     subject frozenset([])
182     title   frozenset([])

Expected output:

I'd expect get_metadata("language", 182) and get_metadata("rights", 182) to both return frozenset([]) instead of frozenset([u'en']) and frozenset([u'None']).

Or better, as there's no such ebook, perhaps it should return None or raise an exception: maybe IndexError or something custom like NoEbookIndex. Or just don't add it to the database in the first place, and let the lookup raise whatever it would raise when an index is not found.
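In the meantime, a client-side sketch of an existence check based on the output above (real ebooks have at least one formaturi, the phantom records have none); ebook_exists is a hypothetical helper:

    from gutenberg.query import get_metadata

    def ebook_exists(etextno):
        # Phantom records like 182 come back with an empty formaturi set.
        return bool(get_metadata('formaturi', etextno))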

frozenset on query

The query example (print(get_etexts('title', 'Moby Dick; Or, The Whale'))) is returning frozenset([]) for me. Is this a current bug, or has pip just set this up wrongly?

Better Gutenberg type handling

The RDF files for Gutenberg contain things such as Image, Sound, etc., and right now they're all being treated as TextInfo objects. This is not optimal: if I query for "United States", I don't particularly want "Amendments to the United States Constitution, Reading", since it's not really a book and the file is predominantly just license stuff. It's not possible right now for an end user to know these things and filter as appropriate.

Additionally, some of these audio/image folders use uid-readme.txt instead of just uid.txt, so knowing the type would also allow pulling more files successfully: when looking for the file, look for uid.txt first and then uid-readme.txt (but never do that for regular text).

There are a couple of options, I think:

  1. In textsource._raw_source, filter out anything that isn't Text:

         ebook = next(iter(graph.query('''
             SELECT
                 ?type
             WHERE {
                 ?ebook dcterms:type [ rdf:value ?type ].
             }
             LIMIT 1
         ''')))
         if ebook.type != rdflib.Literal('Text'):
             continue

  2. Add new metadata to TextInfo named "type". Then a check could be added to _fulltext to prevent type != 'text', while adding something like _fullimage and similar for the other types?
     (Would TextInfo still be the best name then, as it expands beyond text?)

Potential types (uid example):

  • Sound (10759)
  • Collection (10802)
  • Image (11001)
  • StillImage (114)
  • MovingImage (116)
  • DataSet (11775)

Performance optimization of metadata loading

The load_metadata function that makes the Project Gutenberg meta-data RDF graph available to the meta-data extractors takes forever to run, since it loads ~130MB of data into memory from a flat file (~850MB uncompressed).

A low-cost way to address this is to investigate database-backed stores for the RDF graph instead of loading it all into memory.

Alternative ways to tackle the issue include sharding the meta-data file along multiple dimensions (e.g. text identifier, author, etc.), but this would require adding shard-creation code to every MetadataExtractor, i.e. extending the library would involve more work in the future.
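A sketch of the low-cost, database-backed option, modeled on the rdflib SQLAlchemy plugin usage that shows up in tracebacks elsewhere on this page (the connection string and identifier are illustrative, and the rdflib-sqlalchemy package must be installed):

    from rdflib import Graph, plugin
    from rdflib.store import Store

    # Open the metadata graph against a SQLite-backed store so queries go
    # to the database instead of an in-memory parse of the ~850MB file.
    store = plugin.get('SQLAlchemy', Store)(identifier='gutenberg')
    graph = Graph(store=store, identifier='gutenberg')
    graph.open('sqlite:///metadata.sqlite', create=True)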

Download URI for {} not supported

This error was mentioned in issue #23, which was closed, but I'm running into it for what seems to be a different reason, so I thought I would create a new one.

I'm having trouble with same error as in #23 for The Iliad here: https://www.gutenberg.org/ebooks/6130

File "C:\Python27\lib\site-packages\gutenberg\acquire\text.py", line 72, in load_etext
    download_uri = _format_download_uri(etextno)
  File "C:\Python27\lib\site-packages\gutenberg\acquire\text.py", line 56, in _format_download_uri
    raise ValueError('download URI for {0} not supported'.format(etextno))
ValueError: download URI for 2000 not supported

But this text doesn't seem to be super recent; it was uploaded in 2004. To check and make sure that the error didn't happen for all books, I tried a couple of others.
http://www.gutenberg.org/ebooks/16452, another version of The Iliad, has the same URI error, as does Welsh Fairy Tales (http://www.gutenberg.org/ebooks/9368).
Books with bibrec numbers 7250, 5000, 3000, 2000, and 1300 all have the same error.

On The Origin of Species, http://www.gutenberg.org/ebooks/1228, works just fine, but The First Part of Henry the Sixth (http://www.gutenberg.org/ebooks/1100), which chronologically predates On The Origin of Species, raises the URI error as well. That makes it seem unlikely to me that it's an issue with recency of upload (as in #23).

Any idea what the issue might be?

Native support for complex queries

Originally reported by @MasterOdin in #11

I would argue that it would be important for get_etexts to not only support one feature/value pair but potentially multiple.

A potential use case that illustrates this: how would I get only the German books by the author "Various"?

The solution as proposed by your API would be:

texts = get_etexts('author', 'Various')
final_list = []
for text in texts:
    # get_metadata returns a frozenset of values, so test membership;
    # skip anything that isn't German.
    if 'german' not in get_metadata('language', text):
        continue
    final_list.append(text)

which is somewhat weird as I'd kind of like the API to handle this internally (especially if I want to get even more specific with criteria and don't want to build up that if statement!)

So I'd say maybe change get_etexts to support passing in either two strings for one feature or, probably easier, a dictionary, which would allow for any number of criteria:

texts = get_etexts({'author':'various','language':'german'})
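A sketch of how that dictionary form could be layered over the existing single-pair get_etexts; get_etexts_multi is a hypothetical helper, not part of the library:

    from gutenberg.query import get_etexts

    def get_etexts_multi(criteria):
        # Intersect the result sets of each (feature, value) pair.
        results = None
        for feature, value in criteria.items():
            matches = set(get_etexts(feature, value))
            results = matches if results is None else results & matches
        return frozenset(results or ())

    texts = get_etexts_multi({'author': 'Various', 'language': 'de'})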

get_metadata('author') behavior (alias vs. name)

I have installed the library via pip with python2.7 on ubuntu 14.04.

I was wondering if this kind of author behavior is to be expected for some titles:

>>> from gutenberg.query import get_etexts
INFO:rdflib:RDFLib Version: 4.2.1
>>> from gutenberg.query import get_metadata
>>> print(get_metadata('title', 1)) 
frozenset([u'The Declaration of Independence of the United States of America'])
>>> print(get_metadata('author', 1)) 
frozenset([u'United States President (1801-1809)'])

This seems like the relevant part of the RDF from the dump I downloaded:

    <dcterms:creator>
      <pgterms:agent rdf:about="2009/agents/1638">
        <pgterms:webpage rdf:resource="http://en.wikipedia.org/wiki/Thomas_Jefferson"/>
        <pgterms:birthdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1743</pgterms:birthdate>
        <pgterms:name>Jefferson, Thomas</pgterms:name>
        <pgterms:deathdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1826</pgterms:deathdate>
        <pgterms:alias>United States President (1801-1809)</pgterms:alias>
      </pgterms:agent>
    </dcterms:creator>

Unicode error

Hi there,

I'm working on a multilang project using the gutenberg book module to test things, on Python 2.7 (latest release):

pip install gutenberg --upgrade
Requirement already up-to-date: gutenberg in /home/aloha/code/Narralyzer/env/lib/python2.7/site-packages

gutenberg_test_id = 31727
print(strip_headers(load_etext(gutenberg_test_id)).strip()[100:200])

also:

python -m gutenberg.acquire.text 31727 test.txt

output:

Problem wird, offenbart. Aber gleichzeitig haftete diesem merkw�rdigen
und tiefsinnigen Erleber aller begrifflichen Probleme die
verh�ngnisvolle Schw�che an, da� er sofort jeden Boden verlor, sowie er
aus dem Kreis seiner innerlichsten Spekulation heraustrat in die

source:
http://www.gutenberg.org/ebooks/31727

Great module, keep it up!

never mind

sys.getdefaultencoding()
'ascii'
;)

get text by language fails

I populated my cache with the default Berkeley DB, but when I do

eng = get_etexts('language','en') 
or get_etexts('language','eng')
or get_etexts('language','En')

The query returns the empty frozenset.

Is this a known bug?
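One way to sanity-check this, using only calls shown elsewhere on this page: inspect how the language is actually stored for a known book (2701 is Moby Dick in the sessions above), then query with exactly that literal, since values are matched exactly:

    from gutenberg.query import get_etexts, get_metadata

    # See what the metadata stores for a known book first...
    print(get_metadata('language', 2701))   # e.g. frozenset([u'en'])
    # ...then query with that exact value; 'En' or 'eng' won't match 'en'.
    print(len(get_etexts('language', 'en')))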
