aparrish / gutenberg-poetry-corpus



A corpus of poetry from Project Gutenberg

Python 13.25% Jupyter Notebook 86.75%


gutenberg-poetry-corpus's Issues

find a more sound methodology for classifying lines as "poetry" and "not poetry"

Right now this is accomplished using a set of checks based on surface-level textual characteristics, and while these checks produce okay results, they're brittle and unsophisticated. Since this is really just a straightforward text classification task, here's what I think is needed, at minimum:

a collection of several thousand (or more?) examples of poem lines and non-poem lines, labelled by hand;
a suite of tests to check the accuracy of any classification method (and tweaks to those methods) against the hand-labelled set;
a statistical model that produces high accuracy on the hand-labelled set.

I suspect that something as simple as a random forest classifier trained on n-grams would produce pretty good results. A side benefit of this would be that the same classifier could likely be used to find stretches of poetry even in Project Gutenberg books that aren't labelled as "Poetry" in the subject metadata.
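The test-suite idea above can be sketched in a few lines. This is only an illustration under stated assumptions: the labelled lines are invented placeholders rather than real hand-labelled data, `looks_like_poetry` is a hypothetical toy baseline (not the checks this repository actually uses), and `evaluate` is a made-up helper name. Any classifier, including a trained statistical model, could be passed in as the `classify` callable.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical hand-labelled examples: (line, is_poetry).
# These are invented placeholders, not lines from the corpus.
LABELLED: List[Tuple[str, bool]] = [
    ("The river bends beneath the silver moon,", True),
    ("And all the night was still.", True),
    ("Soft falls the snow on field and fen", True),
    ("CHAPTER IV", False),
    ("Produced by a team of volunteer proofreaders", False),
    ("[Illustration: The harbor at dusk]", False),
    ("123", False),
    ("He closed the ledger and left the office.", False),
]


def looks_like_poetry(line: str) -> bool:
    """Toy surface-level baseline, assumed for illustration only."""
    stripped = line.strip()
    if not stripped or len(stripped) > 80:
        return False
    if stripped.isupper() or stripped.isdigit():
        return False
    if stripped.startswith("[") and stripped.endswith("]"):
        return False
    # Crude guess: verse lines often end in punctuation.
    return stripped[-1] in ",.;:!?"


def evaluate(
    classify: Callable[[str], bool],
    labelled: Iterable[Tuple[str, bool]],
) -> Tuple[float, List[Tuple[str, bool]]]:
    """Return (accuracy, misclassified examples) for a classifier."""
    examples = list(labelled)
    if not examples:
        raise ValueError("need at least one labelled example")
    errors = [(line, label) for line, label in examples
              if classify(line) != label]
    accuracy = 1 - len(errors) / len(examples)
    return accuracy, errors


if __name__ == "__main__":
    accuracy, errors = evaluate(looks_like_poetry, LABELLED)
    print(f"accuracy: {accuracy:.2f}")
    for line, label in errors:
        print(f"misclassified (expected {label}): {line!r}")
```

Returning the misclassified lines alongside the accuracy makes it easier to see why a given tweak helped or hurt, which is the kind of brittleness the surface-level checks are said to suffer from.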

be more principled about which texts to include

I believe that appropriative "remix" artwork, especially such artwork that "punches up" and/or uses material in the public domain, is fundamentally progressive: a way to loosen the stranglehold of power structures established in culture. In that spirit, the original intention of this corpus was to provide an ecumenical source of copyright-free "raw material" for evocative poetic text generation that has the cadence and form of stereotypical Poetry-with-a-capital-P.

Of course, the idea of "material" being "raw" sometimes serves only to obscure the (sometimes problematic) ways in which a material comes into existence, and textual raw material is no different: the texts in this corpus in particular carry with them the politics and points of view of the people that originally authored them. Though I've made some effort to mitigate this, in some cases text that you get by randomly sampling this corpus will contain offensive content, or works and authors whose viewpoints are unacceptable. The demographic of authors included in the corpus is also very particular (mostly dead white men from America or Great Britain).

It's impossible to completely circumvent this problem, of course (there's no such thing as a neutral corpus), but I do think it's possible to mitigate it, and to appropriately set expectations for users of the corpus, by being more principled about which source texts to include. (This might include introducing texts that are not presently in Project Gutenberg.) I'd like to come up with a list of criteria that determine whether or not a text should be included, with "in the public domain" being the cornerstone.

http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz broken

Looks like the corpus link http://static.decontextualize.com/gutenberg-poetry-v001.ndjson.gz is no longer working. Is the corpus file available anywhere else? Thanks for building this!
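For anyone who does obtain a copy of the file, the `.ndjson.gz` extension suggests gzip-compressed, newline-delimited JSON. A minimal sketch for reading such a file with the standard library follows; the helper name `read_ndjson_gz` and the local path are hypothetical, and the structure of each record is not assumed here.

```python
import gzip
import json
from typing import Iterator


def read_ndjson_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for raw in handle:
            raw = raw.strip()
            if raw:  # skip blank lines
                yield json.loads(raw)


if __name__ == "__main__":
    # Hypothetical local path; adjust to wherever the file was saved.
    for i, record in enumerate(read_ndjson_gz("gutenberg-poetry-v001.ndjson.gz")):
        print(record)
        if i >= 4:  # show only the first few records
            break
```

Reading line by line keeps memory use flat, so the whole corpus never has to be loaded at once.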
