
gutenberg-poetry-corpus's People

Contributors

aparrish


gutenberg-poetry-corpus's Issues

find a more sound methodology for classifying lines as "poetry" and "not poetry"

Right now this is accomplished using a set of checks based on surface-level textual characteristics, and while these checks produce okay results, they're brittle and unsophisticated. Since this is really just a straightforward text classification task, here's what I think is needed, at minimum:

  • a collection of several thousand (or more?) examples of poem lines and non-poem lines, labelled by hand
  • a suite of tests to check the accuracy of any classification method (and tweaks to those methods) against the hand-labelled set
  • a statistical model that produces high accuracy on the hand-labelled set.
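A minimal sketch of what the second bullet could look like, assuming the hand-labelled set is a list of `(line, is_poetry)` pairs. The examples in `LABELLED`, the `surface_check` heuristic, and the `evaluate` helper are all hypothetical stand-ins for illustration; they are not the checks or data this corpus actually uses.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical hand-labelled examples: (line, is_poetry).
# A real set would hold several thousand lines labelled by hand.
LABELLED = [
    ("Shall I compare thee to a summer's day?", True),
    ("Thou art more lovely and more temperate:", True),
    ("And miles to go before I sleep,", True),
    ("CHAPTER II. THE JOURNEY NORTH", False),
    ("Produced by the Online Distributed Proofreading Team", False),
    ("[Illustration: The old mill at dusk]", False),
]


def surface_check(line: str) -> bool:
    """Toy surface-level heuristic (an assumption, not the corpus's rules)."""
    stripped = line.strip()
    if not stripped or len(stripped) > 65:
        return False  # empty or too long to look like a verse line
    if stripped.startswith("[") or stripped.isupper():
        return False  # bracketed notes and all-caps headings
    return True


def evaluate(
    classify: Callable[[str], bool],
    examples: Iterable[Tuple[str, bool]],
) -> Dict[str, float]:
    """Score any line classifier against hand-labelled examples."""
    tp = fp = tn = fn = 0
    for line, is_poetry in examples:
        predicted = classify(line)
        if predicted and is_poetry:
            tp += 1
        elif predicted and not is_poetry:
            fp += 1
        elif not predicted and not is_poetry:
            tn += 1
        else:
            fn += 1
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


if __name__ == "__main__":
    print(evaluate(surface_check, LABELLED))
```

Because `evaluate` only needs a function from a line to a boolean, the same harness could score the existing checks, a tweaked version of them, or a statistical model. On this toy set the heuristic misclassifies the "Produced by" credit line as poetry, which is the kind of brittleness a labelled set would surface.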

I suspect that something as simple as a random forest classifier trained on n-grams would produce pretty good results. A side benefit is that the same classifier could likely be used to find stretches of poetry even in Project Gutenberg books that aren't labelled as "Poetry" in the subject metadata.
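As a rough sketch of that idea, the following uses scikit-learn to train a random forest on character n-gram counts. The choice of character (rather than word) n-grams, the n-gram range, the number of trees, and the tiny made-up training set are all assumptions for illustration; nothing here has been tuned or validated against real data.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline

# Made-up training data; 1 = poetry line, 0 = not poetry.
# A usable model would need thousands of hand-labelled lines.
train_lines = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "And miles to go before I sleep,",
    "CHAPTER II. THE JOURNEY NORTH",
    "Produced by the Online Distributed Proofreading Team",
    "[Illustration: The old mill at dusk]",
]
train_labels = [1, 1, 1, 0, 0, 0]

# Character 1- to 3-grams within word boundaries, keeping case so that
# capitalisation (e.g. all-caps headings) remains visible as a feature.
model = make_pipeline(
    CountVectorizer(analyzer="char_wb", ngram_range=(1, 3), lowercase=False),
    RandomForestClassifier(n_estimators=100, random_state=0),
)
model.fit(train_lines, train_labels)

# Classify unseen lines, e.g. from a book not tagged as "Poetry".
new_lines = [
    "The woods are lovely, dark and deep,",
    "End of the Project Gutenberg EBook",
]
predictions = model.predict(new_lines)
for line, label in zip(new_lines, predictions):
    print(int(label), line)
```

With only six training lines the predictions are not meaningful; the point is the shape of the pipeline, which could be fitted on a real labelled set and then run line by line over any Project Gutenberg text.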

be more principled about which texts to include

I believe that appropriative "remix" artwork, especially such artwork that "punches up" and/or uses material in the public domain, is fundamentally progressive: a way to loosen the stranglehold of power structures established in culture. In that spirit, the original intention of this corpus was to provide an ecumenical source of copyright-free "raw material" for evocative poetic text generation that has the cadence and form of stereotypical Poetry-with-a-capital-P.

Of course, the idea of "material" being "raw" sometimes serves only to obscure the (sometimes problematic) ways in which a material comes into existence, and textual raw material is no different: the texts in this corpus in particular carry with them the politics and points of view of the people who originally authored them. Though I've made some effort to mitigate this, in some cases text that you get by randomly sampling this corpus will contain offensive content, or works and authors whose viewpoints are unacceptable. The demographic of authors included in the corpus is also very particular (mostly dead white men from America or Great Britain).

It's impossible to completely circumvent this problem, of course (there's no such thing as a neutral corpus), but I do think it's possible to mitigate it, and to appropriately set expectations for users of the corpus, by being more principled about which source texts to include. (This might include introducing texts that are not presently in Project Gutenberg.) I'd like to come up with a list of criteria that determine whether or not a text should be included, with "in the public domain" being the cornerstone.
