c-w / gutenberg
A simple interface to the Project Gutenberg corpus.
License: Apache License 2.0
I've been following your project since it hit HN about a month ago, through the subsequent move to GitHub and your recent work to use SPARQL for queries.
I have a big need for your metadata parser. I forked PG to Github and I want a better way to access and handle book metadata.
I notice that you've really cleaned up the repo and reorganized. Could you give me a rough walkthrough of how you want new methods organized for the library search? I would like to make as much of the original RDF exposed via your library as possible.
I'm no longer regularly working with Project Gutenberg data, so I'm no longer in the position to maintain this project. As such, I'm looking for a new maintainer.
What do you get as the maintainer?
If you are interested in taking over as the maintainer of the Gutenberg project, reply to this issue.
Hello
I have been looking for ways to get ids of each book in an intuitive way.
Getting the id from the webpage of each book doesn't seem to work.
When I run 'text = strip_headers(load_etext(17384)).strip()', it says the book doesn't exist.
One way would be to look at catalogs.
http://www.gutenberg.org/dirs/GUTINDEX.1996
However, these indices are not complete, and there are too many files.
Ideally, I would like a way to search with some keywords, get a list of books, and then, using a title or identifier, get the text out.
Hi, I followed your command/script to get the metadata...
from gutenberg.acquire import get_metadata_cache
cache = get_metadata_cache()
cache.populate()
And I get this, so I wonder how to run it so that it stores the data in the DB faster:
WARNING:root:Unable to create cache based on BSD-DB. Falling back to SQLite backend. Performance may be degraded significantly.
Thanks for sharing this!
When I download a book, e.g.,
python3 -m gutenberg.acquire.text 2000 2000.txt
All the accented characters are replaced with u'\ufffd'. When I download the same file from Gutenberg directly (https://www.gutenberg.org/ebooks/2000.txt.utf-8), it is UTF-8 and all characters are intact.
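For what it's worth, u'\ufffd' is the Unicode replacement character, the typical symptom of bytes being decoded with the wrong codec or with errors='replace'. A minimal illustration of the symptom, unrelated to the library's internals (the sample word is made up):

```python
# A Latin-1 encoded 'ñ' byte is not a valid UTF-8 sequence on its own,
# so a lenient UTF-8 decode turns it into the U+FFFD replacement character:
raw = u"Espa\u00f1a".encode("latin-1")           # b'Espa\xf1a'
decoded = raw.decode("utf-8", errors="replace")
print(decoded)  # 'Espa\ufffda' -- the accented character is lost
```

So the question is whether the library fetches a non-UTF-8 file, or fetches the UTF-8 file but decodes it with the wrong codec.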
bsddb3 does not work with Berkeley-DB v6.x because the latter changed its license to AGPL3. The README should therefore state somewhere that installing a Berkeley-DB version between 4.8.30 and 5.x is recommended (and perhaps include notes on installing Berkeley-DB for Mac/Linux/Windows?).
The output of trying to install bsddb3:
λ ~/Github/gutenberg/ master pip3 install -r requirements-py3.pip
Collecting bsddb3>=6.1.0 (from -r requirements-py3.pip (line 1))
Using cached bsddb3-6.2.1.tar.gz
Complete output from command python setup.py egg_info:
Trying to use the Berkeley DB you specified...
Detected Berkeley DB version 6.1 from db.h
******* COMPILATION ABORTED *******
You are linking a Berkeley DB version licensed under AGPL3 or have a commercial license.
AGPL3 is a strong copyleft license and derivative works must be equivalently licensed.
You have two choices:
1. If your code is AGPL3 or you have a commercial Berkeley DB license from Oracle, please, define the environment variable 'YES_I_HAVE_THE_RIGHT_TO_USE_THIS_BERKELEY_DB_VERSION' to any value, and try to install this python library again.
2. In any other case, you have to link to a previous version of Berkeley DB. Remove Berlekey DB version 6.x and let this python library try to locate an older version of the Berkeley DB library in your system. Alternatively, you can define the environment variable 'BERKELEYDB_DIR', or 'BERKELEYDB_INCDIR' and 'BERKELEYDB_LIBDIR', with the path of the Berkeley DB you want to use and try to install this python library again.
Sorry for the inconvenience. I am trying to protect you.
More details:
https://forums.oracle.com/message/11184885
http://lists.debian.org/debian-legal/2013/07/
******* COMPILATION ABORTED *******
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/nj/knbr_l_s64b08h8blk4xbhnw0000gn/T/pip-build-q2wg_jiu/bsddb3/
Additionally, I believe the Python 2 bindings are only compatible up to Berkeley-DB 4.8.
The library has enough incompatibilities with Python 3.2 that it's probably best to just drop it. One current issue is that the requests library doesn't support it (thread on that here). A good number of other popular packages have already dropped support, so it's not unreasonable to do the same here (and there shouldn't really be any modern Linux distro still using it).
Issues #26 and #28 show that there are some problems with RDFlib's default BSD-DB backend under Python 3 and on Windows. This makes it worthwhile to spend some time investigating a switch away from BSD-DB towards an alternative data store, e.g. SQLite via RDFLib-SQLAlchemy.
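To illustrate what a SQLite-backed store buys, here is a toy sketch using only the standard library; the table schema and predicate names are made up for the example and are not the RDFLib-SQLAlchemy schema:

```python
import sqlite3

# Toy triple table standing in for an RDF store; an index on (p, o)
# makes "which subjects have this predicate/object?" queries cheap,
# and the data lives on disk instead of in a BSD-DB environment.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
conn.execute("CREATE INDEX idx_po ON triples (p, o)")
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", [
    ("ebook:2701", "dc:title", "Moby Dick; Or, The Whale"),
    ("ebook:2701", "dc:creator", "Melville, Herman"),
])
rows = conn.execute("SELECT s FROM triples WHERE p = ? AND o = ?",
                    ("dc:creator", "Melville, Herman")).fetchall()
print(rows)  # [('ebook:2701',)]
```

SQLite ships with Python, so this sidesteps the Berkeley-DB installation problems entirely, at some cost in query performance.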
$ pip install gutenberg
Collecting gutenberg
Downloading Gutenberg-0.4.2.tar.gz
Collecting bsddb3>=6.1.0 (from gutenberg)
Using cached bsddb3-6.2.1.tar.gz
Complete output from command python setup.py egg_info:
Can't find a local Berkeley DB installation.
(suggestion: try the --berkeley-db=/path/to/bsddb option)
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/zm/drwrq_ld04s7vk50p00c9lnw0000gn/T/pip-build-TGvyoe/bsddb3/
The installation instructions in the README say bsddb is only needed for Python 3:
Installation
This project is on PyPI, so I'd recommend that you just install everything from there using your favourite Python package manager.
pip install gutenberg
...
Python 3
This package depends on BSD-DB. The bsddb module was removed from the Python standard library since version 2.7. This means that if you wish to use gutenberg on Python 3, you will need to manually install BSD-DB.
Is that not the case? Is bsddb3 needed for Python 2?
For example:
0.54s$ flake8 gutenberg
gutenberg/acquire/metadata.py:66:9: E722 do not use bare except
Possible implementation: normalize the case of all literals added to the meta-data RDF graph on creation (e.g. lowercase everything) and apply the same normalization to get_etexts query values.
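A sketch of that normalization idea (the normalize helper is hypothetical, not part of the library):

```python
def normalize(value):
    """Lowercase and trim a metadata literal so lookups are case-insensitive."""
    return value.strip().lower()

# Applied both when literals enter the graph and to get_etexts query values,
# differently-cased queries resolve to the same key:
assert normalize("  Moby Dick; Or, The Whale ") == normalize("moby dick; or, the whale")
```

The same function would have to run in both places, otherwise stored and queried values drift apart.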
Hi guys, I got the cache ... used it well, but after upgrading my computer's OS, I started getting this error. I know I have the cache in the same place, same permissions (everyone can read+write), and yet I get this error when invoking the get_metadata method. Any hints?
"The cache is invalid or not created"
I'd like to avoid having to recreate the cache, as it may take hours on end to download.
Thanks!
Could you add the ability to query by language?
Here's a list of files that have some degree of failure in strip_headers(): https://gist.github.com/ikarth/49e10e7b5a66fe8d6732
This was made by grepping over the English-language, public domain text files from the subset corpus I've been assembling (April 2010 DVD + another archive from 2013) for the first mention of "Project Gutenberg".
There are a few false positives, like "THE COMPLETE PROJECT GUTENBERG WORKS OF GEORGE MEREDITH", and a few transcriber notes, but the most common bits to remain untouched are the illustrated books that have a note about an HTML version with pictures being available.
Removing or not removing some of these may be an aesthetic call, but I figured the data might be useful.
The library installs, and text = strip_headers(load_etext(2701)).strip() works fine.
But when I run:
from gutenberg.query import get_metadata
print(get_metadata("title", 2701))
I get:
Traceback (most recent call last):
File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\query\api.py", line 101, in _metadata
return cls.__metadata
AttributeError: type object 'TitleExtractor' has no attribute '_MetadataExtractor__metadata'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\query\api.py", line 34, in get_metadata
metadata_values = MetadataExtractor.get(feature_name).get_metadata(etextno)
File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\query\extractors.py", line 30, in get_metadata
query = cls._metadata()[etext:cls.predicate():]
File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\query\api.py", line 103, in _metadata
cls.__metadata = load_metadata()
File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\acquire\metadata.py", line 130, in load_metadata
return _open_or_create_metadata_graph()
File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\acquire\metadata.py", line 108, in _open_or_create_metadata_graph
_METADATA_DATABASE_SINGLETON = _create_metadata_graph()
File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\gutenberg\acquire\metadata.py", line 91, in _create_metadata_graph
return Graph(store=store, identifier='urn:gutenberg:metadata')
File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\rdflib\graph.py", line 312, in __init__
self.__store = store = plugin.get(store, Store)()
File "F:\Isaac\Dev\nanogenmo2015\pyenv\lib\site-packages\rdflib\plugins\sleepycat.py", line 52, in __init__
"Unable to import bsddb/bsddb3, store is unusable.")
ImportError: Unable to import bsddb/bsddb3, store is unusable.
Near as I can tell, it seems to be an error with RDFlib relying on bsddb3, and bsddb3 not maintaining their Windows package properly.
Trying to install bsddb3 directly with pip install bsddb3
gets me:
Collecting bsddb3
Using cached bsddb3-6.1.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "C:\Users\User\AppData\Local\Temp\pip-build-uvi4bvaa\bsddb3\setup.py", line 42, in <module>
import setup3
File "C:\Users\User\AppData\Local\Temp\pip-build-uvi4bvaa\bsddb3\setup3.py", line 375, in <module>
with open(os.path.join(incdir, 'db.h'), 'r') as f :
FileNotFoundError: [Errno 2] No such file or directory: 'db/include\\db.h'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\User\AppData\Local\Temp\pip-build-uvi4bvaa\bsddb3
I'm using Windows 7 and Python 3.4 x64.
Is there any easy way to switch the backend persistence store to, say, RDFLib-SQLAlchemy?
It looks like it is only using one core while creating the database. Would multi-threading speed this up?
It seems that at different times in this project's development, fuzzy searches on an author's name have been discussed and were even implemented at one stage (using WHERE author LIKE in https://github.com/c-w/Gutenberg/blob/710052ce5cab7ea45b101ab756c7f1b29091236a/gutenberg/corpus.py#L48), so I apologise if this is redundant.
My use case for this project is essentially grabbing the text content of texts written by a given author. Judging by the README, get_etexts("author", "Melville, Hermann") is able to provide the file IDs, which can then be easily downloaded.
However, get_etexts("author", "Melville, Herman") (i.e. dropping the second "n", as the author's name is actually shown at http://www.gutenberg.org/ebooks/author/9) returns an empty frozenset.
Should I assume that this project currently does not support this feature? If not, I'd be happy to contribute, but the current RDF-based implementation isn't something I currently have any experience with.
It would be great if there was an option to make the text retrieval functions on the Corpus class (like texts_for_author) perform fuzzy matching so that small spelling mistakes can automatically be corrected.
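One low-effort way to get such fuzzy matching is the standard library's difflib; this is a sketch (the author list and cutoff are illustrative, and this is not how the library currently works):

```python
import difflib

authors = ["Melville, Herman", "Verne, Jules", "Austen, Jane"]

def fuzzy_author(query, candidates, cutoff=0.8):
    """Return the closest-matching author name, or None if nothing is close."""
    matches = difflib.get_close_matches(query, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(fuzzy_author("Melville, Hermann", authors))  # Melville, Herman
```

A query function could resolve the user's input to the canonical name first and then run the exact lookup it already has.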
Your code doesn't work in Python 2.6. Please add a warning in your documentation.
Any chance of a new release? It's been over 8 months since the last one on PyPI and I just ran into the same problem and was about to submit a fix for #19 but discovered it's already fixed :)
Thanks!
The library is in dire need of more (doc|unit|regression) tests.
The get_etexts function is very slow. This is likely caused by the fact that with the current implementation, a call to get_etexts(A, B) finds all the triples (s, p, o) for which p == A and then discards all such triples for which o != B in a linear scan. Given that the set {(s, p, o) | p == A} is likely quite large, the linear scan is expensive.
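An inverted index over (predicate, object) pairs would turn that linear scan into a single dictionary lookup. A sketch with made-up triples (not the library's data structures):

```python
from collections import defaultdict

triples = [
    (2701, "title", "Moby Dick; Or, The Whale"),
    (2701, "author", "Melville, Herman"),
    (1342, "title", "Pride and Prejudice"),
]

# Build once: (p, o) -> set of subjects.
index = defaultdict(set)
for s, p, o in triples:
    index[(p, o)].add(s)

# A get_etexts(A, B)-style lookup is now a single dict access
# instead of a scan over every triple with p == A:
print(index[("title", "Moby Dick; Or, The Whale")])  # {2701}
```

The index costs one pass over the data and some memory, but every subsequent query is O(1) in the number of triples.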
I feel like you should probably update the URL in setup.py, as well as on the PyPI page; both point to the now-defunct Bitbucket repo.
On pypi:
git clone https://[email protected]/c-w/gutenberg.git && cd gutenberg
in setup.py:
url='https://bitbucket.org/c-w/gutenberg/',
Now that we have a couple of use-cases for the Gutenberg library, I figure it would be a good time to refactor the API. The current implementation was okay to get off the ground, however, the current API also turns out to be hard to test and over-complicated.
Currently I'm thinking of a really simple API:
type Text = String
type Attribute = (String, String)
type Uid = Integer
_text_for_uid :: Uid -> Text -- returns the text for a given UID
_attributes_for_uid :: Uid -> [Attribute] -- returns the properties of a given UID e.g. author,
texts_for_attribute :: Attribute -> [Text] -- returns all the texts for a given attribute-value combination such as ('author', 'Jules Verne'), ('year', '1996') or ('title', 'Moby Dick')
_attributes_for_uid parses the Gutenberg meta-data tarball using RDF to extract all attributes for all the texts. _text_for_uid takes care of downloading the texts (and/or storing them on disk). The last function is trivially defined in terms of the first two.
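In Python terms, the proposed API might look like the sketch below. The in-memory data is a stand-in for the real corpus; only the shapes of the three functions are taken from the signatures above:

```python
# Toy data standing in for the Gutenberg corpus (illustrative only).
_TEXTS = {1: "text of uid 1", 2: "text of uid 2"}
_ATTRS = {
    1: [("author", "Melville, Herman"), ("title", "Moby Dick")],
    2: [("author", "Verne, Jules")],
}

def _text_for_uid(uid):
    """Return the text for a given UID (here: an in-memory lookup)."""
    return _TEXTS[uid]

def _attributes_for_uid(uid):
    """Return the attributes of a given UID, e.g. author."""
    return _ATTRS[uid]

def texts_for_attribute(attribute):
    """Trivially defined in terms of the first two functions."""
    return [_text_for_uid(uid) for uid in sorted(_ATTRS)
            if attribute in _attributes_for_uid(uid)]

print(texts_for_attribute(("author", "Verne, Jules")))  # ['text of uid 2']
```

Keeping the public surface to just texts_for_attribute (with the other two private) would make the API easy to test in isolation.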
I'm getting a MemoryError when using get_metadata. It was working for me earlier. I have about 7.5GB free on C:\, 53GB free on E:\, and 4GB of free RAM.
C:\stufftodelete>pip freeze | grep Gutenberg
Gutenberg==0.4.0
C:\stufftodelete>python
Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from gutenberg.acquire import load_etext
INFO:rdflib:RDFLib Version: 4.2.1
>>> from gutenberg.cleanup import strip_headers
>>> text = strip_headers(load_etext(2701)).strip()
>>> len(text)
1237486
>>> text[:100]
u"MOBY DICK; OR THE WHALE\r\n\r\nBy Herman Melville\r\n\r\n\r\n\r\n\r\nOriginal Transcriber's Notes:\r\n\r\nThis text is
"
>>>
>>> from gutenberg.query import get_etexts
>>> from gutenberg.query import get_metadata
>>>
>>> get_metadata('title', 2701)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\gutenberg\query\api.py", line 34, in get_metadata
metadata_values = MetadataExtractor.get(feature_name).get_metadata(etextno)
File "C:\Python27\lib\site-packages\gutenberg\query\extractors.py", line 30, in get_metadata
query = cls._metadata()[etext:cls.predicate():]
File "C:\Python27\lib\site-packages\gutenberg\query\api.py", line 103, in _metadata
cls.__metadata = load_metadata()
File "C:\Python27\lib\site-packages\gutenberg\acquire\metadata.py", line 130, in load_metadata
return _open_or_create_metadata_graph()
File "C:\Python27\lib\site-packages\gutenberg\acquire\metadata.py", line 112, in _open_or_create_metadata_graph
_METADATA_DATABASE_SINGLETON.open(_METADATA_CACHE, create=False)
File "C:\Python27\lib\site-packages\rdflib\graph.py", line 376, in open
return self.__store.open(configuration, create)
File "C:\Python27\lib\site-packages\rdflib\plugins\sleepycat.py", line 89, in open
db_env = self._init_db_environment(homeDir, create)
File "C:\Python27\lib\site-packages\rdflib\plugins\sleepycat.py", line 75, in _init_db_environment
db_env.open(homeDir, ENVFLAGS | db.DB_CREATE)
MemoryError: (12, 'Not enough space -- unable to allocate memory for mutex; resize mutex region')
>>>
>>> get_metadata('author', 2701)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\gutenberg\query\api.py", line 34, in get_metadata
metadata_values = MetadataExtractor.get(feature_name).get_metadata(etextno)
File "C:\Python27\lib\site-packages\gutenberg\query\extractors.py", line 31, in get_metadata
return frozenset(result.toPython() for result in query)
File "C:\Python27\lib\site-packages\gutenberg\query\extractors.py", line 31, in <genexpr>
return frozenset(result.toPython() for result in query)
File "C:\Python27\lib\site-packages\rdflib\graph.py", line 634, in objects
for s, p, o in self.triples((subject, predicate, None)):
File "C:\Python27\lib\site-packages\rdflib\graph.py", line 421, in triples
for _s, _o in p.eval(self, s, o):
File "C:\Python27\lib\site-packages\rdflib\paths.py", line 227, in _eval_seq
for s, o in evalPath(graph, (subj, paths[0], None)):
File "C:\Python27\lib\site-packages\rdflib\paths.py", line 425, in <genexpr>
return ((s, o) for s, p, o in graph.triples(t))
File "C:\Python27\lib\site-packages\rdflib\graph.py", line 424, in triples
for (s, p, o), cg in self.__store.triples((s, p, o), context=self):
File "C:\Python27\lib\site-packages\rdflib\plugins\sleepycat.py", line 371, in triples
assert self.__open, "The Store must be open."
AssertionError: The Store must be open.
>>>
Any ideas what's wrong?
root@ol-dev:/openlibrary/Gutenberg# pip --no-cache-dir install gutenberg==0.4.1
Collecting gutenberg==0.4.1
/usr/local/lib/python2.7/dist-packages/pip-7.1.2-py2.7.egg/pip/_vendor/requests/packages/urllib3/util/ssl_.py:90: InsecurePlatformWarning: A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately and may cause certain SSL connections to fail. For more information, see https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
InsecurePlatformWarning
Downloading Gutenberg-0.4.1.tar.gz
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "<string>", line 20, in <module>
File "/tmp/pip-build-3CKm7N/gutenberg/setup.py", line 19, in <module>
install_requires=list(line.strip() for line in open('requirements.pip')))
IOError: [Errno 2] No such file or directory: 'requirements.pip'
----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-build-3CKm7N/gutenberg
A fresh download of this repo + python setup.py sdist shows that the file is correctly included in the .tar.gz, but the PyPI one is clearly missing it:
root@ol-dev:/tmp# wget https://pypi.python.org/packages/source/G/Gutenberg/Gutenberg-0.4.1.tar.gz
--2016-01-03 15:13:19-- https://pypi.python.org/packages/source/G/Gutenberg/Gutenberg-0.4.1.tar.gz
Resolving pypi.python.org (pypi.python.org)... 23.235.47.223
Connecting to pypi.python.org (pypi.python.org)|23.235.47.223|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 9933 (9.7K) [application/octet-stream]
Saving to: 'Gutenberg-0.4.1.tar.gz'
100%[==========================================================>] 9,933 --.-K/s in 0s
2016-01-03 15:13:19 (97.7 MB/s) - 'Gutenberg-0.4.1.tar.gz' saved [9933/9933]
root@ol-dev:/tmp# tar tvfz Gutenberg-0.4.1.tar.gz | grep requirements
root@ol-dev:/tmp#
Downloading the source and running nosetests, I get eight UnicodeDecodeError test failures.
======================================================================
ERROR: test_load_metadata (tests.test_acquire.TestLoadMetadata)
----------------------------------------------------------------------
Traceback (most recent call last):
File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\_util.py", line 41, in setUp
self.cache.populate()
File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\_util.py", line 58, in populate
data = u('\n').join(item.rdf() for item in self.sample_data_factory())
File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\_util.py", line 58, in <genexpr>
data = u('\n').join(item.rdf() for item in self.sample_data_factory())
File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\_sample_metadata.py", line 83, in all
yield SampleMetaData.for_etextno(int(etextno))
File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\_sample_metadata.py", line 77, in for_etextno
metadata = _load_metadata(etextno)
File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\_sample_metadata.py", line 93, in _load_metadata
return json.load(open(data_path))
File "c:\tools\anaconda3\envs\genmoenv\Lib\json\__init__.py", line 265, in load
return loads(fp.read(),
File "f:\isaac\dev\nanogenmo2016\projects\virtualenvgutenberg\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 56: character maps to <undefined>
(I also get import sqlalchemy errors, which I'm assuming are unrelated.)
======================================================================
ERROR: test_delete (tests.test_metadata_cache.TestSqlite)
----------------------------------------------------------------------
Traceback (most recent call last):
File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\tests\test_metadata_cache.py", line 111, in setUp
self.cache = SqliteMetadataCache(self.local_storage)
File "F:\Isaac\Dev\nanogenmo2016\projects\Gutenberg\gutenberg\acquire\metadata.py", line 205, in __init__
store = plugin.get('SQLAlchemy', Store)(identifier=_DB_IDENTIFIER)
File "f:\isaac\dev\nanogenmo2016\projects\virtualenvgutenberg\lib\site-packages\rdflib\plugin.py", line 104, in get
return p.getClass()
File "f:\isaac\dev\nanogenmo2016\projects\virtualenvgutenberg\lib\site-packages\rdflib\plugin.py", line 81, in getClass
self._class = self.ep.load()
File "f:\isaac\dev\nanogenmo2016\projects\virtualenvgutenberg\lib\site-packages\pkg_resources\__init__.py", line 2249, in load
return self.resolve()
File "f:\isaac\dev\nanogenmo2016\projects\virtualenvgutenberg\lib\site-packages\pkg_resources\__init__.py", line 2255, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "f:\isaac\dev\nanogenmo2016\projects\virtualenvgutenberg\lib\site-packages\rdflib_sqlalchemy\SQLAlchemy.py", line 6, in <module>
from . import sqlalchemy
ImportError: cannot import name 'sqlalchemy'
Currently this library only makes use of the author and title meta-data exposed by Project Gutenberg and does not leverage information such as genre, publication date, etc.
Making this information usable by the library is a pretty straightforward three-step process:
1. The TextSource.textinfo_converter method needs to be extended to parse the new meta-data attributes.
2. The new attributes need to be added to the TextInfo class.
3. New query methods (such as texts_for_genre or texts_for_year) need to be added to the Corpus class.
The TextSource object should probably track its state so that it only yields every text once (unless explicitly requested to re-yield all texts from the start).
I came across this issue; I am not sure if it's a bug in the lib or some process on Gutenberg itself. The eBook with id 10160 can be seen at http://www.gutenberg.org/cache/epub/10160/pg10160.txt and it does contain accents. Hosted on the cache there are two versions of the text file: 10160.txt and 10160-8.txt (http://aleph.gutenberg.org/1/0/1/6/10160/10160-8.txt). Now, I am not sure if it's expected or not, but the one named 10160.txt does not contain any accents while the 10160-8.txt file does contain them. Changing the order of the extensions in gutenberg/acquire/text.py (line 66 in e76d96e)...
Hi,
I'm a newbie, but I've spent a few hours trying to work through documentation & forums, and so far I have not found a workaround for the Berkeley DB dependency. It will not let me install gutenberg at all, so I can't make use of the backup SQL option you mentioned putting in. Have others noted this problem, and is there a way to install more directly than pip that would allow me to install it?
Thank you for your time,
Liz
Cheeseshop version fails installation because of missing 'README.rst'. Aren't you missing a MANIFEST.in with "include README.rst"?
The error message is here:
Downloading Gutenberg-0.2.1.tar.gz
Running setup.py (path:/tmp/pip_build_root/Gutenberg/setup.py) egg_info for package Gutenberg
Traceback (most recent call last):
File "", line 17, in
File "/tmp/pip_build_root/Gutenberg/setup.py", line 15, in
long_description=open('README.rst').read(),
IOError: [Errno 2] No such file or directory: 'README.rst'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 17, in
File "/tmp/pip_build_root/Gutenberg/setup.py", line 15, in
long_description=open('README.rst').read(),
IOError: [Errno 2] No such file or directory: 'README.rst'
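For the record, a minimal MANIFEST.in along those lines (assuming setuptools' standard sdist behaviour, where non-package files must be listed explicitly) would be:

```
include README.rst
```

With that in place, python setup.py sdist bundles README.rst into the tarball, and the open('README.rst') call in setup.py succeeds at install time.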
I have finally solved it: once you install libdb5.3-dev, it is OK.
But libdb5.1-dev is no longer available, and libdb5.3 on its own is not the same: it does not create the /usr/include/db.h file.
I am now going to test and use your module.
Thanks a lot in advance!! :)
I do not know whether this is really an issue or my lack of skills.
I tested:
apt-get install libdb5.3
this gives:
libdb5.3 is already the newest version.
This is the output of the command line:
root@RoseWoodSamDebian:/home/arthurx/Python_course/google-python-exercises/basic# pip install gutenberg
Downloading/unpacking gutenberg
Downloading Gutenberg-0.4.2.tar.gz
Running setup.py (path:/tmp/pip-build-ZWxwvG/gutenberg/setup.py) egg_info for package gutenberg
Downloading/unpacking bsddb3>=6.1.0 (from gutenberg)
Downloading bsddb3-6.2.1.tar.gz (228kB): 228kB downloaded
Running setup.py (path:/tmp/pip-build-ZWxwvG/bsddb3/setup.py) egg_info for package bsddb3
Traceback (most recent call last):
File "", line 17, in
File "/tmp/pip-build-ZWxwvG/bsddb3/setup.py", line 40, in
import setup2
File "setup2.py", line 332, in
with open(os.path.join(incdir, 'db.h'), 'r') as f :
IOError: [Errno 2] No such file or directory: '/usr/include/db.h'
Complete output from command python setup.py egg_info:
Traceback (most recent call last):
File "", line 17, in
File "/tmp/pip-build-ZWxwvG/bsddb3/setup.py", line 40, in
import setup2
File "setup2.py", line 332, in
with open(os.path.join(incdir, 'db.h'), 'r') as f :
IOError: [Errno 2] No such file or directory: '/usr/include/db.h'
Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip-build-ZWxwvG/bsddb3
Storing debug log for failure in /root/.pip/pip.log
root@RoseWoodSamDebian:/home/arthurx/Python_course/google-python-exercises/basic#
https://www.gutenberg.org/cache/epub/100/pg100.txt
The Project Gutenberg EBook of The Complete Works of William Shakespeare, by William Shakespeare
Python 2.7.9 (default, Dec 10 2014, 12:28:03) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from gutenberg.acquire import load_etext
INFO:rdflib:RDFLib Version: 4.2.1
>>> from gutenberg.cleanup import strip_headers
>>> text = strip_headers(load_etext(100)).strip()
>>> len(text)
5373
>>> len(etext)
5589915
etext contains the full text from "The Project Gutenberg EBook of The Complete Works of William Shakespeare" up to and including *** END: FULL LICENSE ***.
But text only contains the text from *Project Gutenberg is proud to cooperate with The World Library* up to and including ["Small Print" V.12.08.93].
@c-w Do we want to release a new version now with the previous two PRs? Since we're moving to a new version of a dependency and adding a new flag, I'd say it should be a 0.5.0 release, but I leave the final decision to you. I'm assuming that all new releases will go through you?
Note: You should be able to just draft a release through github, and then once it successfully builds on Travis, it'll be auto published to pypi, so it's nice and easy to do all through CI tools.
When the parser tries to read http://www.gutenberg.lib.md.us/1/9/5/8/19581/19581.txt, it throws a UnicodeEncodeError when attempting to write the .gz file, due to some non-ASCII characters in the text (such as '’').
Build is failing on Travis
There are exact version pins in requirements.txt. It would be better if they were specified as less strict constraints, e.g. just requiring at least a given version.
The two most recent uploads to Project Gutenberg are:
https://www.gutenberg.org/ebooks/50334
https://www.gutenberg.org/ebooks/50333
This code:
from gutenberg.acquire import load_etext
from gutenberg.cleanup import strip_headers
text = strip_headers(load_etext(50333)).strip()
print(text[:10])
text = strip_headers(load_etext(50334)).strip()
print(text[:10])
Produces:
Traceback (most recent call last):
File "C:\temp\test.py", line 6, in <module>
text = strip_headers(load_etext(50333)).strip()
File "C:\Python27\lib\site-packages\gutenberg\acquire\text.py", line 71, in load_etext
download_uri = _format_download_uri(etextno)
File "C:\Python27\lib\site-packages\gutenberg\acquire\text.py", line 55, in _format_download_uri
raise ValueError('download URI for {} not supported'.format(etextno))
ValueError: download URI for 50333 not supported
That's because it tries to download these but they don't exist:
http://www.gutenberg.lib.md.us/5/0/3/3/50333/50333.txt
http://www.gutenberg.lib.md.us/5/0/3/3/50333/50333-8.txt
Their Plain Text UTF-8 links look like this:
https://www.gutenberg.org/files/50334/50334-0.txt
https://www.gutenberg.org/cache/epub/50333/pg50333.txt
Is it because they're so new they've not been mirrored to http://www.gutenberg.lib.md.us, or something else?
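For reference, the mirror path scheme can be reconstructed from the failing URLs above; this helper is a hypothetical sketch derived from those URLs, not the library's actual _format_download_uri:

```python
def candidate_uris(etextno):
    """Build the gutenberg.lib.md.us mirror URIs tried for a (multi-digit)
    etext number: all digits but the last become the directory path."""
    root = "http://www.gutenberg.lib.md.us"
    path = "/".join(str(etextno)[:-1])  # e.g. 50333 -> 5/0/3/3
    return ["{0}/{1}/{2}/{2}.txt".format(root, path, etextno),
            "{0}/{1}/{2}/{2}-8.txt".format(root, path, etextno)]

print(candidate_uris(50333)[0])
# http://www.gutenberg.lib.md.us/5/0/3/3/50333/50333.txt
```

The gutenberg.org "Plain Text UTF-8" links quoted above follow a different pattern, which would explain why newly uploaded books fail until (or unless) the mirror picks them up.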
The metadata database contains records for non-existent ebooks.
Exists: https://www.gutenberg.org/ebooks/1
Doesn't exist: https://www.gutenberg.org/ebooks/182
Example code:
from __future__ import print_function
from gutenberg.query import get_metadata
def get_all_metadata(etextno):
    for feature_name in ['author', 'formaturi', 'language',
                         'rights', 'subject', 'title']:
        print("{}\t{}\t{}".format(
            etextno,
            feature_name,
            get_metadata(feature_name, etextno)))
    print()
get_all_metadata(1) # US Declaration of Independence
get_all_metadata(182) # no such ebook
Actual output:
1 author frozenset([u'United States President (1801-1809)'])
1 formaturi frozenset([u'http://www.gutenberg.org/ebooks/1.txt.utf-8', u'http://www.gutenberg.org/ebooks/1.e
pub.noimages', u'http://www.gutenberg.org/6/5/2/6527/6527-t/6527-t.tex', u'http://www.gutenberg.org/ebooks/1.html.noimag
es', u'http://www.gutenberg.org/files/1/1.zip', u'http://www.gutenberg.org/ebooks/1.epub.images', u'http://www.gutenberg
.org/ebooks/1.rdf', u'http://www.gutenberg.org/ebooks/1.kindle.noimages', u'http://www.gutenberg.org/files/1/1.txt', u'h
ttp://www.gutenberg.org/ebooks/1.html.images', u'http://www.gutenberg.org/6/5/2/6527/6527-t.zip', u'http://www.gutenberg
.org/ebooks/1.kindle.images'])
1 language frozenset([u'en'])
1 rights frozenset([u'Public domain in the USA.'])
1 subject frozenset([u'E201', u'United States. Declaration of Independence', u'United States -- History -- Revolut
ion, 1775-1783 -- Sources', u'JK'])
1 title frozenset([u'The Declaration of Independence of the United States of America'])
182 author frozenset([])
182 formaturi frozenset([])
182 language frozenset([u'en'])
182 rights frozenset([u'None'])
182 subject frozenset([])
182 title frozenset([])
Expected output:
I'd expect get_metadata("language", 182) and get_metadata("rights", 182) to both return frozenset([]) instead of frozenset([u'en']) and frozenset([u'None']).
Or better, as there's no such ebook, perhaps it should return None or raise an exception: maybe IndexError or something custom like NoEbookIndex. Or just don't add it to the database in the first place, and let it raise whatever it would raise when an index is not found.
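The custom-exception option could be sketched like this; NoEbookIndex is the name suggested above, and using an empty title set as the existence check is an assumption made for the sketch, not the library's behaviour:

```python
class NoEbookIndex(LookupError):
    """Raised when an etext number has no record in the catalog."""

def checked_metadata(get_metadata, feature_name, etextno):
    """Wrap a get_metadata-style callable so that queries for
    non-existent ebooks raise instead of returning junk values."""
    # Heuristic: a real ebook always has a title, so an empty title
    # result is treated as "no such ebook".
    if not get_metadata("title", etextno):
        raise NoEbookIndex(etextno)
    return get_metadata(feature_name, etextno)
```

This keeps the database untouched and only changes the query path, at the cost of one extra lookup per call.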
The query example (print(get_etexts('title', 'Moby Dick; Or, The Whale'))) is returning frozenset([]) for me. Is this a current bug, or has pip just set this up wrongly?
The RDF files for Gutenberg contain things such as Image, Sound, etc., and right now they're all being treated as TextInfo objects. This might not be optimal: if I query for "United States", I don't particularly want "Amendments to the United States Constitution Reading", as it's not really a book and the file is predominantly just license material. It's not possible right now for an end user to know these things and filter as appropriate.
Additionally, some of these audio/image folders use uid-readme.txt instead of just uid.txt, so knowing that would also allow for pulling more files successfully: when looking for such a file, look for uid.txt first, then uid-readme.txt (but never do that for regular text).
There are a couple of options I think:

    ebook = next(iter(graph.query('''
        SELECT ?type
        WHERE {
            ?ebook dcterms:type [ rdf:value ?type ] .
        }
        LIMIT 1
    ''')))
    if ebook.type != rdflib.Literal('Text'):
        continue
Potential types (uid example):
The load_metadata function that makes the Project Gutenberg metadata RDF graph available to the metadata extractors takes forever to run, as it loads ~130MB of data into memory from a flat file (~850MB uncompressed).
A low-cost way to address this is to investigate database-backed stores for the RDF graph instead of loading it all into memory.
Alternative ways to tackle the issue include sharding the metadata file along multiple dimensions (e.g. text identifier, author, etc.), but this would require adding shard-creation code to every MetadataExtractor, i.e., extending the library would involve more work in the future.
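As an illustration of the database-backed idea (a minimal sketch under simplifying assumptions, not the library's actual storage layer), the metadata triples could live in an indexed SQLite table so that lookups by predicate and object don't require the whole graph in memory:

```python
import sqlite3

# Minimal sketch: store (subject, predicate, object) triples in SQLite
# and answer metadata lookups via an index instead of an in-memory graph.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)')
conn.execute('CREATE INDEX idx_pred_obj ON triples (predicate, object)')

# Toy data standing in for the parsed RDF dump.
triples = [
    ('ebook/1', 'dcterms:title', 'The Declaration of Independence of the United States of America'),
    ('ebook/1', 'dcterms:language', 'en'),
]
conn.executemany('INSERT INTO triples VALUES (?, ?, ?)', triples)

# Lookup: which subjects have language 'en'?
rows = conn.execute(
    'SELECT subject FROM triples WHERE predicate = ? AND object = ?',
    ('dcterms:language', 'en')).fetchall()
print(rows)  # [('ebook/1',)]
```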
This error was mentioned in Issue #23 , which was closed, but I'm running into the same error for what seems to be a different reason, so I thought I would create a new one.
I'm having trouble with same error as in #23 for The Iliad here: https://www.gutenberg.org/ebooks/6130
File "C:\Python27\lib\site-packages\gutenberg\acquire\text.py", line 72, in load_etext
download_uri = _format_download_uri(etextno)
File "C:\Python27\lib\site-packages\gutenberg\acquire\text.py", line 56, in _format_download_uri
raise ValueError('download URI for {0} not supported'.format(etextno))
ValueError: download URI for 2000 not supported
But this text doesn't seem to be super recent - it was uploaded in 2004. To check and make sure that the error didn't happen for all books, I tried a couple of others.
http://www.gutenberg.org/ebooks/16452, another version of The Iliad, has the same URI error, as does Welsh Fairy Tales (http://www.gutenberg.org/ebooks/9368).
Books with bibrec numbers 7250, 5000, 3000, 2000, and 1300 all have the same error.
On The Origin of Species, http://www.gutenberg.org/ebooks/1228, works just fine, but The First Part of Henry the Sixth (http://www.gutenberg.org/ebooks/1100), which is chronologically before On The Origin of Species, raises the URI error as well, which makes it seem unlikely to me that it's an issue with recency of upload (as in #23).
Any idea what the issue might be?
Originally reported by @MasterOdin in #11
I would argue that it is important for get_etexts to support not only one feature/value pair but potentially multiple.
A use case that illustrates this: how would I get only the German books by the author "Various"?
The solution as proposed by your API would be:
texts = get_etexts('author', 'various')
final_list = []
for text in texts:
    if 'german' not in get_metadata('language', text):
        continue
    final_list.append(text)
which is somewhat weird, as I'd like the API to handle this internally (especially if I want even more specific criteria and don't want to keep building up that if statement!).
So I'd suggest changing get_etexts to accept either two strings for a single feature or, probably easier, a dictionary that allows any number of criteria:
texts = get_etexts({'author':'various','language':'german'})
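Until the API supports that, one way to combine criteria is to intersect the result sets of several single-feature queries. This sketch uses plain frozensets standing in for get_etexts results, so it runs without the library installed:

```python
from functools import reduce

def intersect_results(result_sets):
    # Combine several single-feature query results (frozensets of etext
    # ids) by set intersection: an id survives only if it matched every
    # criterion.
    sets = [set(s) for s in result_sets]
    return frozenset(reduce(set.intersection, sets)) if sets else frozenset()

by_author = frozenset([11, 42, 99])     # stand-in for get_etexts('author', 'Various')
by_language = frozenset([42, 99, 123])  # stand-in for get_etexts('language', 'german')
print(sorted(intersect_results([by_author, by_language])))  # [42, 99]
```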
I have installed the library via pip with python2.7 on ubuntu 14.04.
I was wondering if this kind of author behavior is to be expected for some titles:
>>> from gutenberg.query import get_etexts
INFO:rdflib:RDFLib Version: 4.2.1
>>> from gutenberg.query import get_metadata
>>> print(get_metadata('title', 1))
frozenset([u'The Declaration of Independence of the United States of America'])
>>> print(get_metadata('author', 1))
frozenset([u'United States President (1801-1809)'])
These seem like the relevant parts of the RDF from the dump I downloaded:
<dcterms:creator>
<pgterms:agent rdf:about="2009/agents/1638">
<pgterms:webpage rdf:resource="http://en.wikipedia.org/wiki/Thomas_Jefferson"/>
<pgterms:birthdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1743</pgterms:birthdate>
<pgterms:name>Jefferson, Thomas</pgterms:name>
<pgterms:deathdate rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">1826</pgterms:deathdate>
<pgterms:alias>United States President (1801-1809)</pgterms:alias>
</pgterms:agent>
</dcterms:creator>
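For illustration only (a stand-alone sketch of parsing the snippet above with the standard library, not how the gutenberg package reads the dump), pulling pgterms:name rather than the pgterms:alias the library returned:

```python
import xml.etree.ElementTree as ET

PGTERMS = 'http://www.gutenberg.org/2009/pgterms/'
RDFNS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
DCTERMS = 'http://purl.org/dc/terms/'

# Condensed version of the dcterms:creator snippet from the RDF dump.
snippet = ('<rdf:RDF xmlns:rdf="{rdf}" xmlns:dcterms="{dc}" '
           'xmlns:pgterms="{pg}">'
           '<dcterms:creator>'
           '<pgterms:agent rdf:about="2009/agents/1638">'
           '<pgterms:name>Jefferson, Thomas</pgterms:name>'
           '<pgterms:alias>United States President (1801-1809)</pgterms:alias>'
           '</pgterms:agent>'
           '</dcterms:creator>'
           '</rdf:RDF>').format(rdf=RDFNS, dc=DCTERMS, pg=PGTERMS)

root = ET.fromstring(snippet)
# Prefer the canonical pgterms:name over any pgterms:alias.
name = root.find('.//{%s}name' % PGTERMS).text
print(name)  # Jefferson, Thomas
```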
Hi there,
I'm working on a multilingual project and am using the gutenberg book module to test things:
Using python2.7 (latest release)
pip install gutenberg --upgrade
Requirement already up-to-date: gutenberg in /home/aloha/code/Narralyzer/env/lib/python2.7/site-packages
gutenberg_test_id = 31727
print(strip_headers(load_etext(gutenberg_test_id)).strip()[100:200])
also:
python -m gutenberg.acquire.text 31727 test.txt
output:
Problem wird, offenbart. Aber gleichzeitig haftete diesem merkw�rdigen
und tiefsinnigen Erleber aller begrifflichen Probleme die
verh�ngnisvolle Schw�che an, da� er sofort jeden Boden verlor, sowie er
aus dem Kreis seiner innerlichsten Spekulation heraustrat in die
source:
http://www.gutenberg.org/ebooks/31727
Great module, keep it up!
never mind
sys.getdefaultencoding()
'ascii'
;)
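Assuming the ASCII default encoding hinted at above is the culprit on Python 2, one workaround sketch is to round-trip the text with an explicit UTF-8 encoding via io.open instead of relying on sys.getdefaultencoding():

```python
import io
import os
import tempfile

# Round-trip non-ASCII German text with an explicit encoding so the
# umlauts survive regardless of the interpreter's default encoding.
text = u'diesem merkw\u00fcrdigen und tiefsinnigen Erleber'
path = os.path.join(tempfile.mkdtemp(), 'book.txt')

with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)
with io.open(path, 'r', encoding='utf-8') as f:
    restored = f.read()

print(restored == text)  # True
```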
Populated my cache with the default Berkeley DB, but when I do

eng = get_etexts('language', 'en')

or get_etexts('language', 'eng') or get_etexts('language', 'En'), the query returns the empty frozenset.
Is this a known bug?