Comments (12)

leonweber commented on June 8, 2024

Raw texts downloaded and deduplicated with https://github.com/pgcorpus/gutenberg (thanks @clancyoftheoverflow for the hint!) can now be found here: https://huggingface.co/datasets/bigscience-catalogue-data/project-gutenberg/

from data_tooling.

clancyoftheoverflow commented on June 8, 2024

What about this: https://github.com/pgcorpus/gutenberg. It seems it has the whole Gutenberg corpus (plus metadata).

leonweber commented on June 8, 2024

#self-assign

albertvillanova commented on June 8, 2024

Need support for ZIP:

albertvillanova commented on June 8, 2024

Once ZIP support is merged into master, note that the default Datasets "text" packaged module yields each line as a separate example. First 20 examples:

{'text': "\ufeffProject Gutenberg's Martin Hyde, The Duke's Messenger, by John Masefield"}
{'text': ''}
{'text': 'This eBook is for the use of anyone anywhere at no cost and with'}
{'text': 'almost no restrictions whatsoever.  You may copy it, give it away or'}
{'text': 're-use it under the terms of the Project Gutenberg License included'}
{'text': 'with this eBook or online at www.gutenberg.org'}
{'text': ''}
{'text': ''}
{'text': "Title: Martin Hyde, The Duke's Messenger"}
{'text': ''}
{'text': 'Author: John Masefield'}
{'text': ''}
{'text': 'Posting Date: August 24, 2008 [EBook #1274]'}
{'text': 'Release Date: April, 1998'}
{'text': 'Last Updated: March 16, 2018'}
{'text': ''}
{'text': ''}
{'text': 'Language: English'}
{'text': ''}
{'text': 'Character set encoding: UTF-8'}
{'text': ''}
{'text': "*** START OF THIS PROJECT GUTENBERG EBOOK MARTIN HYDE, THE DUKE'S MESSENGER ***"}

It may be worth passing a configuration parameter to split the text into documents or paragraphs (instead of lines), once merged.
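To make the difference between the three granularities concrete, here is a minimal pure-Python sketch that regroups line-level examples like the ones above. The function name `group_lines` and its `by` parameter are hypothetical names used only for illustration; they are not the Datasets API, and the actual configuration option may look different.

```python
from typing import Dict, Iterable, Iterator


def group_lines(lines: Iterable[str], by: str = "paragraph") -> Iterator[Dict[str, str]]:
    """Regroup line-level text into line, paragraph, or document examples.

    `by` is a hypothetical parameter name (assumption), not a Datasets option.
    """
    if by == "line":
        # One example per line, which is what the thread reports as the default.
        for line in lines:
            yield {"text": line}
    elif by == "document":
        # The whole file becomes a single example.
        yield {"text": "\n".join(lines)}
    elif by == "paragraph":
        # Consecutive non-empty lines form one paragraph; blank lines separate them.
        buffer = []
        for line in lines:
            if line.strip():
                buffer.append(line)
            elif buffer:
                yield {"text": "\n".join(buffer)}
                buffer = []
        if buffer:
            yield {"text": "\n".join(buffer)}
    else:
        raise ValueError(f"unknown granularity: {by!r}")


# A few header lines taken from the sample above.
header = [
    "Title: Martin Hyde, The Duke's Messenger",
    "",
    "Author: John Masefield",
    "",
    "Posting Date: August 24, 2008 [EBook #1274]",
    "Release Date: April, 1998",
]
for example in group_lines(header, by="paragraph"):
    print(example)
```

Paragraph grouping here treats any run of blank lines as a single separator, so empty examples such as `{'text': ''}` disappear from the output.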

albertvillanova commented on June 8, 2024

Thanks @leonweber.

lvwerra commented on June 8, 2024

#self-assign

lvwerra commented on June 8, 2024

Added dataset scripts for LM here (en, zh, fr, pt, es):

https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_project_gutenberg
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_fr_project_gutenberg
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_zh_project_gutenberg
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_pt_project_gutenberg
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_es_project_gutenberg

albertvillanova commented on June 8, 2024

Thanks @lvwerra.

Sample:

{
  'text': '\ufeffThe Project Gutenberg EBook of The Magna Carta, by... about new eBooks.\n'
}

albertvillanova commented on June 8, 2024

Sample with meta:

{
  'text': 'The Project Gutenberg EBook of The Magna Carta, by... about new eBooks.',
  'meta': "{'file': 'PG10000_raw.txt'}"
}
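
In the sample above, `meta` is a string holding the repr of a Python dict, not a dict itself. A minimal sketch of decoding it, assuming the field always has this shape (`parse_meta` is a hypothetical helper name):

```python
import ast
from typing import Any, Dict


def parse_meta(example: Dict[str, str]) -> Dict[str, Any]:
    """Decode the stringified `meta` field into a real dict.

    ast.literal_eval only evaluates Python literals, so it is safer than eval()
    for data that comes from a downloaded dataset.
    """
    meta = ast.literal_eval(example["meta"])
    if not isinstance(meta, dict):
        raise ValueError(f"expected a dict literal, got {type(meta).__name__}")
    return meta


# The sample from this thread, with the elided text kept as-is.
sample = {
    "text": "The Project Gutenberg EBook of The Magna Carta, by... about new eBooks.",
    "meta": "{'file': 'PG10000_raw.txt'}",
}
print(parse_meta(sample)["file"])
```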

albertvillanova commented on June 8, 2024

Please note that the text field, regardless of the book's language, always contains English text:

  • metadata in the header
  • license in the footer

Example for Spanish ("es"):

{
  'text': 'The Project Gutenberg EBook of Relacion historica de los sucesos de la\nrebelion de Jose Gabriel Tupac-Amaru en las provincias del Peru, el ano\nde 1780, by Anonymous\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.org\n\n\nTitle: Relacion historica de los sucesos de la rebelion de Jose Gabriel\n       Tupac-Amaru en las provincias del Peru, el ano de 1780\n\nAuthor: Anonymous\n\nRelease Date: November 26, 2003 [EBook #10293]\n\nLanguage: Spanish\n\n\n*** START OF THIS PROJECT GUTENBERG EBOOK RELACION HISTORICA ***\n\n\n\n\nProduced by Miranda van de Heijning, Virginia Paque and PG Distributed\nProofreaders. This file was produced from images generously made\navailable by the Biblioth que nationale de France (BnF/Gallica) at\nhttp://gallica.bnf.fr.\n\n\n\n\n\n[Nota del Transcriptor: Las irregularidades en acentuación y ortografía\nencontradas en este libro son consistentes con la flexibilidad de las\nreglas en uso en 1836, y así no deben ser consideradas "errores" sino\nun elemento del estilo de la época.]\n\n\n     RELACION HISTORICA\n\n     DE LOS\n\n     SUCESOS DE LA REBELION\n\n     DE\n\n     JOSE GABRIEL TUPAC-AMARU,\n\n     EN LAS\n\n     PROVINCIAS DEL PERU,\n\n     EL AÑO DE 1780.\n\n\n\n\n     Primera  Edicion.\n\n     BUENOS-AIRES.\n\n     IMPRENTA DEL ESTADO.\n\n     1836\n\n\n\n\n     DISCURSO PRELIMINAR\n\n     A LA\n\n     REVOLUCION DE TUPAC-AMARU.\n\n\n       *       *       *       *       *\n\n\nLas extorsiones de los corregidores, y la impunidad de que disfrutaban\nen las _Audiencias_, produgeron en 1780 una fuerte conmocion entre...
           ...
           dia 3 de Julio de 1781, con las pocas tropas que le\nhabian quedado: diligencia que no pudo verificar Orellana con el\nvecindario de Puno, que convoyaba hasta el 5 del mismo, así por la\ndetencion que habia hecho, como por haberse visto precisado á seguir una\nmarcha mas lenta, á causa de las dificultades que le ocurrieron, por la\npoca comodidad y proporciones de las familias que le seguian.\n\n\n\n\n\nEnd of the Project Gutenberg EBook of Relacion historica de los sucesos de\nla rebelion de Jose Gabriel Tupac-Amaru en las provincias del Peru,\nel ano de 1780, by Anonymous\n\n*** END OF THIS PROJECT GUTENBERG EBOOK RELACION HISTORICA ***\n\n***** This file should be named 10293-8.txt or 10293-8.zip *****\nThis and all associated files of various formats will be found in:\n        https://www.gutenberg.org/1/0/2/9/10293/\n\nProduced by Miranda van de Heijning, Virginia Paque and PG Distributed\nProofreaders. This file was produced from images generously made\navailable by the Biblioth que nationale de France (BnF/Gallica) at\nhttp://gallica.bnf.fr.\n\n\nUpdated editions will replace the previous one--the old editions\nwill be renamed.\n\nCreating the works from public domain print editions means that no\none owns a United States copyright in these works, so the Foundation\n(and you!) can copy and distribute it in the United States without\npermission and without paying copyright royalties.  Special rules,\nset forth in the General Terms of Use part of this license, apply to\ncopying and distributing Project Gutenberg-tm electronic works to\nprotect the PROJECT GUTENBERG-tm concept and trademark.  Project\nGutenberg is a registered trademark, and may not be used if you\ncharge for the eBooks, unless you receive specific permission...
           ...
           posted since November 2003, with etext numbers OVER #10000, are\nfiled in a different way.  The year of a release date is no longer part\nof the directory path.  The path is based on the etext number (which is\nidentical to the filename).  The path to the file is made up of single\ndigits corresponding to all but the last digit in the filename.  For\nexample an eBook of filename 10234 would be found at:\n\n     https://www.gutenberg.org/1/0/2/3/10234\n\nor filename 24689 would be found at:\n     https://www.gutenberg.org/2/4/6/8/24689\n\nAn alternative method of locating eBooks:\n     https://www.gutenberg.org/GUTINDEX.ALL',
  'meta': "{'file': 'PG10293_raw.txt'}"
}

I'm fixing it.

albertvillanova commented on June 8, 2024

It is fixed:

{'text': 'Produced by Miranda van de Heijning, Virginia Paque and PG Distributed\nProofreaders. This file was produced from images generously made\navailable by the Biblioth que nationale de France (BnF/Gallica) at\nhttp://gallica.bnf.fr.\n\n\n\n\n\n[Nota del Transcriptor: Las irregularidades en acentuación y ortografía\nencontradas en...
          ...
          la\npoca comodidad y proporciones de las familias que le seguian.\n\n\n\n\n\nEnd of the Project Gutenberg EBook of Relacion historica de los sucesos de\nla rebelion de Jose Gabriel Tupac-Amaru en las provincias del Peru,\nel ano de 1780, by Anonymous',
 'meta': "{'file': 'PG10293_raw.txt'}"}
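
The fixed sample starts right after the `*** START OF THIS PROJECT GUTENBERG EBOOK ... ***` line and stops just before the `*** END OF THIS PROJECT GUTENBERG EBOOK ... ***` line. A minimal sketch of that kind of trimming follows. This is not necessarily the fix that was applied to the dataset scripts; `strip_gutenberg_boilerplate` is a hypothetical name, and accepting both "THIS" and "THE" in the marker is an assumption about how the wording varies across files.

```python
import re

# Marker lines as they appear in the samples in this thread.
# Allowing "THE" as well as "THIS" is an assumption about other files.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK[^\n]*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK[^\n]*$", re.MULTILINE
)


def strip_gutenberg_boilerplate(text: str) -> str:
    """Keep only the text between the START and END marker lines.

    If a marker is missing, the corresponding end of the text is left untouched.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    # Search for the END marker only after the START marker.
    end = END_RE.search(text, begin)
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


def clean_example(example: dict) -> dict:
    """Return a copy of a dataset example with the English header and license removed."""
    return {**example, "text": strip_gutenberg_boilerplate(example["text"])}
```

As in the fixed sample, the "End of the Project Gutenberg EBook of ..." line that precedes the END marker is kept, because only the marker lines themselves are used as boundaries.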
