Create dataset from Project Gutenberg (bigscience-workshop/data_tooling)

Comments (12)

leonweber commented on June 8, 2024

Raw texts downloaded and deduplicated with https://github.com/pgcorpus/gutenberg (thanks @clancyoftheoverflow for the hint!) can now be found here: https://huggingface.co/datasets/bigscience-catalogue-data/project-gutenberg/

from data_tooling.

clancyoftheoverflow commented on June 8, 2024

What about this: https://github.com/pgcorpus/gutenberg. It seems to cover the whole Gutenberg corpus (plus metadata).


leonweber commented on June 8, 2024

#self-assign


albertvillanova commented on June 8, 2024

This needs support for ZIP:

huggingface/datasets#3375


albertvillanova commented on June 8, 2024

Once ZIP support is merged into master, note that the default Datasets "text" packaged module yields each line as an example. First 20 examples:

{'text': "\ufeffProject Gutenberg's Martin Hyde, The Duke's Messenger, by John Masefield"}
{'text': ''}
{'text': 'This eBook is for the use of anyone anywhere at no cost and with'}
{'text': 'almost no restrictions whatsoever.  You may copy it, give it away or'}
{'text': 're-use it under the terms of the Project Gutenberg License included'}
{'text': 'with this eBook or online at www.gutenberg.org'}
{'text': ''}
{'text': ''}
{'text': "Title: Martin Hyde, The Duke's Messenger"}
{'text': ''}
{'text': 'Author: John Masefield'}
{'text': ''}
{'text': 'Posting Date: August 24, 2008 [EBook #1274]'}
{'text': 'Release Date: April, 1998'}
{'text': 'Last Updated: March 16, 2018'}
{'text': ''}
{'text': ''}
{'text': 'Language: English'}
{'text': ''}
{'text': 'Character set encoding: UTF-8'}
{'text': ''}
{'text': "*** START OF THIS PROJECT GUTENBERG EBOOK MARTIN HYDE, THE DUKE'S MESSENGER ***"}

It may be worth passing a configuration parameter to split the text into documents or paragraphs (instead of lines), once this is merged:

huggingface/datasets#3442
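
To make the difference between the three granularities concrete, here is a minimal pure-Python sketch. It does not use the Datasets library; the helper `split_text` and its `sample_by` argument are illustrative names assumed for this example only, and the paragraph rule (split on blank lines) is an assumption.

```python
import re


def split_text(raw, sample_by="line"):
    """Split raw text into examples, one dict per example.

    Hypothetical helper for illustration; not part of any library.
    """
    if sample_by == "line":
        # One example per line, mirroring the per-line output shown above.
        parts = raw.splitlines()
    elif sample_by == "paragraph":
        # Assumption: paragraphs are separated by one or more blank lines.
        chunks = re.split(r"\n\s*\n", raw)
        parts = [c.strip("\n") for c in chunks if c.strip()]
    elif sample_by == "document":
        # The whole file becomes a single example.
        parts = [raw]
    else:
        raise ValueError(f"unknown sample_by value: {sample_by!r}")
    return [{"text": part} for part in parts]


raw = "Title: Martin Hyde\n\nAuthor: John Masefield\nLanguage: English\n"
for mode in ("line", "paragraph", "document"):
    print(mode, len(split_text(raw, mode)))
```

With per-line splitting, blank lines become empty examples (as in the listing above), whereas per-document splitting keeps each book as one example.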


albertvillanova commented on June 8, 2024

Thanks @leonweber.


lvwerra commented on June 8, 2024

#self-assign


lvwerra commented on June 8, 2024

Added dataset scripts for LM here (en, zh, fr, pt, es):

https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_project_gutenberg
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_fr_project_gutenberg
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_zh_project_gutenberg
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_pt_project_gutenberg
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_es_project_gutenberg


albertvillanova commented on June 8, 2024

Thanks @lvwerra.

Sample:

{
  'text': '\ufeffThe Project Gutenberg EBook of The Magna Carta, by... about new eBooks.\n'
}


albertvillanova commented on June 8, 2024

Sample with meta:

{
  'text': 'The Project Gutenberg EBook of The Magna Carta, by... about new eBooks.',
  'meta': "{'file': 'PG10000_raw.txt'}"
}
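
Note that in this sample the `meta` value is a string holding a Python-style dict (single quotes), not a nested dict and not valid JSON. A minimal sketch of reading it back, assuming the field always has this shape (`parse_meta` is a hypothetical helper name):

```python
import ast

sample = {
    "text": "The Project Gutenberg EBook of The Magna Carta, by... about new eBooks.",
    "meta": "{'file': 'PG10000_raw.txt'}",
}


def parse_meta(meta):
    """Parse the stringified meta dict (hypothetical helper)."""
    # ast.literal_eval only evaluates Python literals; it does not run code.
    # json.loads would reject this string because of the single quotes.
    parsed = ast.literal_eval(meta)
    if not isinstance(parsed, dict):
        raise ValueError("meta is not a dict literal")
    return parsed


meta = parse_meta(sample["meta"])
print(meta["file"])  # prints PG10000_raw.txt
```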


albertvillanova commented on June 8, 2024

Please note that the text field, regardless of the book's language, always contains English text:

metadata in the header
license in the footer

Example for Spanish ("es"):

{
  'text': 'The Project Gutenberg EBook of Relacion historica de los sucesos de la\nrebelion de Jose Gabriel Tupac-Amaru en las provincias del Peru, el ano\nde 1780, by Anonymous\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever.  You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.org\n\n\nTitle: Relacion historica de los sucesos de la rebelion de Jose Gabriel\n       Tupac-Amaru en las provincias del Peru, el ano de 1780\n\nAuthor: Anonymous\n\nRelease Date: November 26, 2003 [EBook #10293]\n\nLanguage: Spanish\n\n\n*** START OF THIS PROJECT GUTENBERG EBOOK RELACION HISTORICA ***\n\n\n\n\nProduced by Miranda van de Heijning, Virginia Paque and PG Distributed\nProofreaders. This file was produced from images generously made\navailable by the Biblioth que nationale de France (BnF/Gallica) at\nhttp://gallica.bnf.fr.\n\n\n\n\n\n[Nota del Transcriptor: Las irregularidades en acentuación y ortografía\nencontradas en este libro son consistentes con la flexibilidad de las\nreglas en uso en 1836, y así no deben ser consideradas "errores" sino\nun elemento del estilo de la época.]\n\n\n     RELACION HISTORICA\n\n     DE LOS\n\n     SUCESOS DE LA REBELION\n\n     DE\n\n     JOSE GABRIEL TUPAC-AMARU,\n\n     EN LAS\n\n     PROVINCIAS DEL PERU,\n\n     EL AÑO DE 1780.\n\n\n\n\n     Primera  Edicion.\n\n     BUENOS-AIRES.\n\n     IMPRENTA DEL ESTADO.\n\n     1836\n\n\n\n\n     DISCURSO PRELIMINAR\n\n     A LA\n\n     REVOLUCION DE TUPAC-AMARU.\n\n\n       *       *       *       *       *\n\n\nLas extorsiones de los corregidores, y la impunidad de que disfrutaban\nen las _Audiencias_, produgeron en 1780 una fuerte conmocion entre...
           ...
           dia 3 de Julio de 1781, con las pocas tropas que le\nhabian quedado: diligencia que no pudo verificar Orellana con el\nvecindario de Puno, que convoyaba hasta el 5 del mismo, así por la\ndetencion que habia hecho, como por haberse visto precisado á seguir una\nmarcha mas lenta, á causa de las dificultades que le ocurrieron, por la\npoca comodidad y proporciones de las familias que le seguian.\n\n\n\n\n\nEnd of the Project Gutenberg EBook of Relacion historica de los sucesos de\nla rebelion de Jose Gabriel Tupac-Amaru en las provincias del Peru,\nel ano de 1780, by Anonymous\n\n*** END OF THIS PROJECT GUTENBERG EBOOK RELACION HISTORICA ***\n\n***** This file should be named 10293-8.txt or 10293-8.zip *****\nThis and all associated files of various formats will be found in:\n        https://www.gutenberg.org/1/0/2/9/10293/\n\nProduced by Miranda van de Heijning, Virginia Paque and PG Distributed\nProofreaders. This file was produced from images generously made\navailable by the Biblioth que nationale de France (BnF/Gallica) at\nhttp://gallica.bnf.fr.\n\n\nUpdated editions will replace the previous one--the old editions\nwill be renamed.\n\nCreating the works from public domain print editions means that no\none owns a United States copyright in these works, so the Foundation\n(and you!) can copy and distribute it in the United States without\npermission and without paying copyright royalties.  Special rules,\nset forth in the General Terms of Use part of this license, apply to\ncopying and distributing Project Gutenberg-tm electronic works to\nprotect the PROJECT GUTENBERG-tm concept and trademark.  Project\nGutenberg is a registered trademark, and may not be used if you\ncharge for the eBooks, unless you receive specific permission...
           ...
           posted since November 2003, with etext numbers OVER #10000, are\nfiled in a different way.  The year of a release date is no longer part\nof the directory path.  The path is based on the etext number (which is\nidentical to the filename).  The path to the file is made up of single\ndigits corresponding to all but the last digit in the filename.  For\nexample an eBook of filename 10234 would be found at:\n\n     https://www.gutenberg.org/1/0/2/3/10234\n\nor filename 24689 would be found at:\n     https://www.gutenberg.org/2/4/6/8/24689\n\nAn alternative method of locating eBooks:\n     https://www.gutenberg.org/GUTINDEX.ALL',
  'meta': "{'file': 'PG10293_raw.txt'}"
}

I'm fixing it.


albertvillanova commented on June 8, 2024

It is fixed:

{'text': 'Produced by Miranda van de Heijning, Virginia Paque and PG Distributed\nProofreaders. This file was produced from images generously made\navailable by the Biblioth que nationale de France (BnF/Gallica) at\nhttp://gallica.bnf.fr.\n\n\n\n\n\n[Nota del Transcriptor: Las irregularidades en acentuación y ortografía\nencontradas en...
          ...
          la\npoca comodidad y proporciones de las familias que le seguian.\n\n\n\n\n\nEnd of the Project Gutenberg EBook of Relacion historica de los sucesos de\nla rebelion de Jose Gabriel Tupac-Amaru en las provincias del Peru,\nel ano de 1780, by Anonymous',
 'meta': "{'file': 'PG10293_raw.txt'}"}
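
For reference, the kind of trimming shown in the samples above can be sketched as follows. This is only an assumption about how such a fix could work, not the code used in the dataset scripts: it keeps the text between the `*** START OF ... ***` and `*** END OF ... ***` marker lines, and the regular expressions and the name `strip_boilerplate` are illustrative.

```python
import re

# Marker lines as they appear in the samples above. Accepting "THE" as well
# as "THIS" is an assumption made for this sketch.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*?\*\*\*[ \t]*$",
    re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*?\*\*\*[ \t]*$",
    re.MULTILINE,
)


def strip_boilerplate(text):
    """Return the text between the START and END marker lines.

    If a marker is missing, that side of the text is left untouched.
    """
    start = START_RE.search(text)
    begin = start.end() if start else 0
    # Look for the END marker only after the START marker.
    end = END_RE.search(text, begin)
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


raw = (
    "Title: Relacion historica\n\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK RELACION HISTORICA ***\n\n"
    "Produced by ...\n\nBody text.\n\n"
    "End of the Project Gutenberg EBook of Relacion historica\n\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK RELACION HISTORICA ***\n\n"
    "License text ...\n"
)
print(strip_boilerplate(raw))
```

As in the fixed sample, the closing "End of the Project Gutenberg EBook of ..." line sits before the END marker, so a cut at the markers keeps it.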
