Comments (12)
Raw texts downloaded and deduplicated with https://github.com/pgcorpus/gutenberg (thanks @clancyoftheoverflow for the hint!) can now be found here: https://huggingface.co/datasets/bigscience-catalogue-data/project-gutenberg/
from data_tooling.
What about this: https://github.com/pgcorpus/gutenberg? It seems to have the whole Gutenberg corpus (plus metadata).
from data_tooling.
#self-assign
from data_tooling.
Need support for ZIP:
from data_tooling.
Once ZIP support is merged into master, note that the default Datasets "text" packaged module yields each line as an example. The first examples:
{'text': "\ufeffProject Gutenberg's Martin Hyde, The Duke's Messenger, by John Masefield"}
{'text': ''}
{'text': 'This eBook is for the use of anyone anywhere at no cost and with'}
{'text': 'almost no restrictions whatsoever. You may copy it, give it away or'}
{'text': 're-use it under the terms of the Project Gutenberg License included'}
{'text': 'with this eBook or online at www.gutenberg.org'}
{'text': ''}
{'text': ''}
{'text': "Title: Martin Hyde, The Duke's Messenger"}
{'text': ''}
{'text': 'Author: John Masefield'}
{'text': ''}
{'text': 'Posting Date: August 24, 2008 [EBook #1274]'}
{'text': 'Release Date: April, 1998'}
{'text': 'Last Updated: March 16, 2018'}
{'text': ''}
{'text': ''}
{'text': 'Language: English'}
{'text': ''}
{'text': 'Character set encoding: UTF-8'}
{'text': ''}
{'text': "*** START OF THIS PROJECT GUTENBERG EBOOK MARTIN HYDE, THE DUKE'S MESSENGER ***"}
It may be worth passing a configuration parameter to split the text into documents or paragraphs (instead of lines), once merged:
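To make the three granularities concrete, here is a minimal pure-Python sketch of what line, paragraph, and document splitting could produce. The function `split_text` and its `sample_by` argument are hypothetical names used only for illustration; this is not the Datasets implementation, and the paragraph rule (blank lines as separators) is an assumption.

```python
import re


def split_text(text, sample_by="line"):
    """Split raw text into {'text': ...} examples (hypothetical helper)."""
    if sample_by == "line":
        # One example per line, blank lines included, as in the listing above.
        chunks = text.splitlines()
    elif sample_by == "paragraph":
        # Assumption: paragraphs are separated by one or more blank lines.
        chunks = [p.strip("\n") for p in re.split(r"\n\s*\n", text) if p.strip()]
    elif sample_by == "document":
        # The whole file becomes a single example.
        chunks = [text]
    else:
        raise ValueError(f"unknown sample_by value: {sample_by!r}")
    return [{"text": chunk} for chunk in chunks]


# A short excerpt in the shape of the header shown above.
raw = (
    "Title: Martin Hyde, The Duke's Messenger\n"
    "\n"
    "Author: John Masefield\n"
    "\n"
    "Posting Date: August 24, 2008 [EBook #1274]\n"
    "Release Date: April, 1998\n"
)

by_line = split_text(raw, "line")
by_paragraph = split_text(raw, "paragraph")
by_document = split_text(raw, "document")
```

Splitting by document keeps each book as one example, which is probably the more useful unit for language modelling than single lines.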
from data_tooling.
Thanks @leonweber.
from data_tooling.
#self-assign
from data_tooling.
Added dataset scripts for LM here (en, zh, fr, pt, es):
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_project_gutenberg
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_fr_project_gutenberg
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_zh_project_gutenberg
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_pt_project_gutenberg
https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_es_project_gutenberg
from data_tooling.
Thanks @lvwerra.
Sample:
{
'text': '\ufeffThe Project Gutenberg EBook of The Magna Carta, by... about new eBooks.\n'
}
from data_tooling.
Sample with meta:
{
'text': 'The Project Gutenberg EBook of The Magna Carta, by... about new eBooks.',
'meta': "{'file': 'PG10000_raw.txt'}"
}
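Since `meta` is stored as the string form of a Python dict rather than as a nested field, a consumer has to parse it before use. A minimal sketch, assuming the string is always a plain Python literal and that file names follow the `PG<number>_raw.txt` pattern seen in these samples (`parse_meta` and `book_id_from_meta` are hypothetical helpers):

```python
import ast
import re

sample = {
    "text": "The Project Gutenberg EBook of The Magna Carta, by... about new eBooks.",
    "meta": "{'file': 'PG10000_raw.txt'}",
}


def parse_meta(example):
    """Turn the stringified 'meta' field back into a dict."""
    # literal_eval only accepts Python literals, so it does not execute code.
    return ast.literal_eval(example["meta"])


def book_id_from_meta(meta):
    """Extract the numeric ID, assuming the 'PG<number>_raw.txt' naming."""
    match = re.fullmatch(r"PG(\d+)_raw\.txt", meta["file"])
    return int(match.group(1)) if match else None


meta = parse_meta(sample)
book_id = book_id_from_meta(meta)
```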
from data_tooling.
Please note that the text field always contains English text, regardless of the language of the book:
- metadata in the header
- license in the footer
Example for Spanish ("es"):
{
'text': 'The Project Gutenberg EBook of Relacion historica de los sucesos de la\nrebelion de Jose Gabriel Tupac-Amaru en las provincias del Peru, el ano\nde 1780, by Anonymous\n\nThis eBook is for the use of anyone anywhere at no cost and with\nalmost no restrictions whatsoever. You may copy it, give it away or\nre-use it under the terms of the Project Gutenberg License included\nwith this eBook or online at www.gutenberg.org\n\n\nTitle: Relacion historica de los sucesos de la rebelion de Jose Gabriel\n Tupac-Amaru en las provincias del Peru, el ano de 1780\n\nAuthor: Anonymous\n\nRelease Date: November 26, 2003 [EBook #10293]\n\nLanguage: Spanish\n\n\n*** START OF THIS PROJECT GUTENBERG EBOOK RELACION HISTORICA ***\n\n\n\n\nProduced by Miranda van de Heijning, Virginia Paque and PG Distributed\nProofreaders. This file was produced from images generously made\navailable by the Biblioth que nationale de France (BnF/Gallica) at\nhttp://gallica.bnf.fr.\n\n\n\n\n\n[Nota del Transcriptor: Las irregularidades en acentuación y ortografía\nencontradas en este libro son consistentes con la flexibilidad de las\nreglas en uso en 1836, y así no deben ser consideradas "errores" sino\nun elemento del estilo de la época.]\n\n\n RELACION HISTORICA\n\n DE LOS\n\n SUCESOS DE LA REBELION\n\n DE\n\n JOSE GABRIEL TUPAC-AMARU,\n\n EN LAS\n\n PROVINCIAS DEL PERU,\n\n EL AÑO DE 1780.\n\n\n\n\n Primera Edicion.\n\n BUENOS-AIRES.\n\n IMPRENTA DEL ESTADO.\n\n 1836\n\n\n\n\n DISCURSO PRELIMINAR\n\n A LA\n\n REVOLUCION DE TUPAC-AMARU.\n\n\n * * * * *\n\n\nLas extorsiones de los corregidores, y la impunidad de que disfrutaban\nen las _Audiencias_, produgeron en 1780 una fuerte conmocion entre...
...
dia 3 de Julio de 1781, con las pocas tropas que le\nhabian quedado: diligencia que no pudo verificar Orellana con el\nvecindario de Puno, que convoyaba hasta el 5 del mismo, así por la\ndetencion que habia hecho, como por haberse visto precisado á seguir una\nmarcha mas lenta, á causa de las dificultades que le ocurrieron, por la\npoca comodidad y proporciones de las familias que le seguian.\n\n\n\n\n\nEnd of the Project Gutenberg EBook of Relacion historica de los sucesos de\nla rebelion de Jose Gabriel Tupac-Amaru en las provincias del Peru,\nel ano de 1780, by Anonymous\n\n*** END OF THIS PROJECT GUTENBERG EBOOK RELACION HISTORICA ***\n\n***** This file should be named 10293-8.txt or 10293-8.zip *****\nThis and all associated files of various formats will be found in:\n https://www.gutenberg.org/1/0/2/9/10293/\n\nProduced by Miranda van de Heijning, Virginia Paque and PG Distributed\nProofreaders. This file was produced from images generously made\navailable by the Biblioth que nationale de France (BnF/Gallica) at\nhttp://gallica.bnf.fr.\n\n\nUpdated editions will replace the previous one--the old editions\nwill be renamed.\n\nCreating the works from public domain print editions means that no\none owns a United States copyright in these works, so the Foundation\n(and you!) can copy and distribute it in the United States without\npermission and without paying copyright royalties. Special rules,\nset forth in the General Terms of Use part of this license, apply to\ncopying and distributing Project Gutenberg-tm electronic works to\nprotect the PROJECT GUTENBERG-tm concept and trademark. Project\nGutenberg is a registered trademark, and may not be used if you\ncharge for the eBooks, unless you receive specific permission...
...
posted since November 2003, with etext numbers OVER #10000, are\nfiled in a different way. The year of a release date is no longer part\nof the directory path. The path is based on the etext number (which is\nidentical to the filename). The path to the file is made up of single\ndigits corresponding to all but the last digit in the filename. For\nexample an eBook of filename 10234 would be found at:\n\n https://www.gutenberg.org/1/0/2/3/10234\n\nor filename 24689 would be found at:\n https://www.gutenberg.org/2/4/6/8/24689\n\nAn alternative method of locating eBooks:\n https://www.gutenberg.org/GUTINDEX.ALL',
'meta': "{'file': 'PG10293_raw.txt'}"
}
I'm fixing it.
from data_tooling.
It is fixed:
{'text': 'Produced by Miranda van de Heijning, Virginia Paque and PG Distributed\nProofreaders. This file was produced from images generously made\navailable by the Biblioth que nationale de France (BnF/Gallica) at\nhttp://gallica.bnf.fr.\n\n\n\n\n\n[Nota del Transcriptor: Las irregularidades en acentuación y ortografía\nencontradas en...
...
la\npoca comodidad y proporciones de las familias que le seguian.\n\n\n\n\n\nEnd of the Project Gutenberg EBook of Relacion historica de los sucesos de\nla rebelion de Jose Gabriel Tupac-Amaru en las provincias del Peru,\nel ano de 1780, by Anonymous',
'meta': "{'file': 'PG10293_raw.txt'}"}
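For reference, the kind of trimming visible in the fixed sample can be sketched as follows. This is not necessarily the fix that was applied; it is a minimal illustration that assumes the `*** START OF ... ***` and `*** END OF ... ***` marker lines are present in the forms shown above. The function `strip_gutenberg_boilerplate` and the abbreviated input are hypothetical.

```python
import re

# Assumption: marker lines look like the ones in the samples above.
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*\*\*\*[ \t]*$",
    re.MULTILINE,
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THIS|THE) PROJECT GUTENBERG EBOOK.*\*\*\*[ \t]*$",
    re.MULTILINE,
)


def strip_gutenberg_boilerplate(text):
    """Keep only the text between the START and END marker lines."""
    text = text.lstrip("\ufeff")  # drop a leading byte order mark, if any
    start = START_RE.search(text)
    begin = start.end() if start else 0
    end = END_RE.search(text, begin)
    stop = end.start() if end else len(text)
    return text[begin:stop].strip()


# Abbreviated stand-in for PG10293_raw.txt, not the real file contents.
raw = (
    "\ufeffThe Project Gutenberg EBook of Relacion historica, by Anonymous\n"
    "\n"
    "Language: Spanish\n"
    "\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK RELACION HISTORICA ***\n"
    "\n"
    "Produced by Miranda van de Heijning\n"
    "\n"
    "RELACION HISTORICA\n"
    "\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK RELACION HISTORICA ***\n"
    "\n"
    "***** This file should be named 10293-8.txt or 10293-8.zip *****\n"
)

cleaned = strip_gutenberg_boilerplate(raw)
```

Cutting at the marker lines alone still leaves short English lines inside the body, such as the "Produced by" credits and the "End of the Project Gutenberg EBook of" line that remain in the fixed sample.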
from data_tooling.