
openwebtext's People

Contributors

8enmann, ddbourgin, dependabot[bot], isagrue, jcpeterson, johngiorgi, simonfall, smeylan, tilmanrpk


openwebtext's Issues

How to cite this version of openwebtext?

Hi! Do you have a standard BibTeX entry that you recommend for citing your work?
I understand that this is a reproduction of Radford et al., but your work deserves credit.
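
In the absence of an official entry, something along these lines could serve as a starting point; every field below is a placeholder to be confirmed by the maintainers, not an official citation:

@misc{openwebtext,
  title        = {openwebtext: an open reproduction of the WebText corpus},
  author       = {<repository contributors>},
  year         = {<year of the release used>},
  howpublished = {<repository URL>},
  note         = {Accessed <date>}
}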

Idea for further filtering

I've just run a quick filter to find non-English docs and found 5,052 such cases (of the total 8 million).

It's a fairly crude filter, but I haven't seen any false positives:

import re
import datasets

# Load the full training split, then keep only documents that contain none of a
# handful of very common English words (a crude proxy for "not English").
ds = datasets.load_dataset("openwebtext", split="train")
ds_filtered = ds.filter(lambda sample: not re.search("(?i)the|that|and|with|this", sample["text"]))

Samples of the docs are things like this:

[screenshot: samples of the filtered documents]

Printed with:

# Preview the first 400 characters of each filtered document, with newlines shown as " | ".
for doc in ds_filtered:
    print(doc["text"].replace("\n", " | ")[:400])
    print("\n")

Feel free to close if you have no plans for future versions of the dataset, just thought you might like to know.

Getting a datasets.utils.info_utils.NonMatchingSplitsSizesError when downloading the dataset from Hugging Face

Hello,
I am trying to download the openwebtext dataset from Hugging Face, but I keep getting the following error:

Downloading data: 100%|________________________________________________________________________________________________________________| 12.9G/12.9G [25:43<00:00, 8.35MB/s]
/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/download/download_manager.py:527: FutureWarning: 'num_proc' was deprecated in version 2.6.2 and will be removed in 3.0.0. Pass `DownloadConfig(num_proc=<num_proc>)` to the initializer instead.
  warnings.warn(
Extracting data files: 100%|________________________________________________________________________________________________________| 20610/20610 [9:43:42<00:00,  1.70s/it]
Traceback (most recent call last):
  File "ssd_process_data.py", line 485, in <module>
    main()
  File "ssd_process_data.py", line 369, in main
    raw_datasets["train"] = load_dataset(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 985, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 100, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=39769494896, num_examples=8013769, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=39769065791, num_examples=8013740, shard_lengths=[101000, 100000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 101000, 101000, 101000, 101000, 102000, 102000, 100000, 101000, 100000, 101000, 102000, 101000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 101000, 101000, 102000, 101000, 102000, 101000, 101000, 100000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 100000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 102000, 102000, 101000, 101000, 102000, 102000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 102000, 101000, 13740], dataset_name='openwebtext')}]
(ssdlm) sloboda1@dgx02:~/controlled_reduction/decoding_approaches/ssd-lm$ less  /home/nlp/sloboda1/.cache/huggingface/datasets/openwebtext/plain_text/1.0.0/85b3ae7051d2d72e7c5fdf6dfb462603aaa26e9ed506202bf3a24d261c6c40a1_builder.lock

I have tried forcing a re-download of the dataset by passing the download_mode="force_redownload" parameter, but it yields the same error.

I have also tried passing the ignore_verifications=True parameter, but this in turn yielded the following error:

    raw_datasets["train"] = load_dataset(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1754, in load_dataset
    verification_mode = VerificationMode(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 339, in __call__
    return cls.__new__(cls, value)
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 663, in __new__
    raise ve_exc
ValueError: 'none' is not a valid VerificationMode

I am at a loss here and would really appreciate some guidance on how to address this problem.
Thanks.
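
For reference, on recent versions of datasets the ignore_verifications flag has been replaced by verification_mode, which skips the split-size check when set to "no_checks". A minimal sketch, assuming datasets >= 2.9:

from datasets import load_dataset

# Skip split-size/checksum verification so the mismatch above does not abort the load
# (verification_mode replaces the older ignore_verifications flag).
ds = load_dataset(
    "openwebtext",
    split="train",
    verification_mode="no_checks",
)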

Filtering extracted results

This is more of a question than an issue: I noticed that my scrape contains a large number of spurious results like:

Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.

The actual OpenWebText corpus seems pretty clean, so I'm wondering what heuristics, if any, were used to remove these pages in order to reproduce it.

The corpus page mentions post-filtering using fasttext - is this something that will be added to this project at some point?

Finally, the readme implies that bs4 would be a better extractor than newspaper - is that the case? It's not an option in extract_text.py so it's difficult to compare.
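
As a stopgap, a simple substring blacklist catches most of these CAPTCHA and cookie-wall pages. The phrase list below is only an illustration, not the filtering actually used for the original corpus:

# Rough heuristic filter for CAPTCHA / cookie-wall boilerplate pages.
BOILERPLATE_PHRASES = [
    "make sure you're not a robot",
    "please enable cookies",
    "checking your browser before accessing",
]

def looks_like_boilerplate(text):
    lowered = text.lower()
    return any(phrase in lowered for phrase in BOILERPLATE_PHRASES)

print(looks_like_boilerplate("Sorry, we just need to make sure you're not a robot."))  # True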

Why is Newspaper3k used for html scraping?

I noticed newspaper is used for downloading articles when using the raw scraper. Why not use a simpler (and probably less performance-hungry) approach like requests, etc.? It just seems unnecessarily complicated. Is there a specific reason for that?
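
For comparison, the two approaches look roughly like the sketch below: newspaper bundles the download, encoding handling, and article-body extraction, whereas requests only fetches raw HTML, so a separate extractor is still needed on top. The URL is a placeholder:

import requests
from newspaper import Article

url = "https://example.com/some-article"  # placeholder

# newspaper: fetch, parse, and extract the article body in one go
article = Article(url)
article.download()
article.parse()
print(article.text[:200])

# requests: just the raw HTML; extraction (bs4, etc.) is up to the caller
html = requests.get(url, timeout=30).text
print(html[:200])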

pycurl error: transfer closed with X bytes remaining to read

(base) user@desktop:/data/openwebtext$ python fetch_urls.py
Downloaded RS_2012-12.bz2
Downloaded RS_v2_2008-06.xz
Downloaded RS_2012-05.bz2
Downloaded RS_v2_2009-09.xz
Downloaded RS_v2_2007-11.xz
Downloaded RS_v2_2010-05.xz
Downloaded RS_2012-01.bz2
Downloaded RS_2020-04.zst
Downloaded RS_v2_2006-06.xz
Downloaded RS_v2_2010-02.xz
Downloaded RS_v2_2006-02.xz
Downloaded RS_2012-09.bz2
Downloaded RS_v2_2010-04.xz
Traceback (most recent call last):
  File "fetch_urls.py", line 39, in <module>
    main()
  File "fetch_urls.py", line 33, in main
    curl.perform()
pycurl.error: (18, 'transfer closed with 1265259864 bytes remaining to read')
(base) user@desktop:/data/openwebtext$ 
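
In case it helps, pycurl can resume a partial transfer by setting RESUME_FROM_LARGE to the size of the partial file. A rough sketch (download_file is a hypothetical helper, not the function in fetch_urls.py):

import os
import pycurl

def download_file(url, dest):
    # Resume from the end of an existing partial file, if any.
    resume_from = os.path.getsize(dest) if os.path.exists(dest) else 0
    with open(dest, "ab") as f:
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, f)
        c.setopt(pycurl.RESUME_FROM_LARGE, resume_from)
        c.setopt(pycurl.FOLLOWLOCATION, True)
        try:
            c.perform()
        finally:
            c.close()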

Error with get_state in download.py

Hi, I downloaded the pre-filtered URL list from here and then tried to extract the text with download.py, as per the readme:

python download.py url_dumps_deduped/RS_2018-07.xz.deduped.txt \
    --n_procs 40 \
    --scraper bs4 \
    --chunk_size 100000 \
    --compress \
    --timeout 30

For plenty of the .txt files, I get this error:

Traceback (most recent call last):
  File "download.py", line 235, in <module>
    completed_uids, state_fp, prev_cid = get_state(month, args.output_dir)
  File "download.py", line 210, in get_state
    latest_cid = max([int(a.split("-")[-1].split("_")[0]) for a in archives])
ValueError: max() arg is an empty sequence

Is this a known error? I am planning to dig through the code to try and debug it, but I first wanted to see if anyone else is facing this issue and knows the fix or cause.
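
For what it's worth, this seems to happen when a month's output directory exists but contains no chunk archives yet, so the list passed to max() is empty. A possible guard around the line in the traceback (just a sketch against download.py's get_state, not a tested patch; starting the chunk id at 0 is an assumption):

# download.py, get_state(): guard against an empty `archives` list
if archives:
    latest_cid = max(int(a.split("-")[-1].split("_")[0]) for a in archives)
else:
    latest_cid = 0  # assumed starting chunk id when nothing has been written yet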

How to resume download after an error?

Trying to restart a download after a failure (either a pycurl error or a disconnect) throws an error, since the output folder already exists:

(base) user@desktop:/data/openwebtext$ python fetch_urls.py
Traceback (most recent call last):
  File "fetch_urls.py", line 39, in <module>
    main()
  File "fetch_urls.py", line 26, in main
    os.makedirs(OUTPUT_DIR)
  File "/home/user/miniconda3/lib/python3.7/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: 'pushshift_dumps_full'

This means I need to restart the download from the beginning. Is there a way to resume downloading? I already downloaded a large amount of data and I don't want to throw it away.
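
Until resuming is supported in the script itself, two small changes along these lines let fetch_urls.py pick up where it left off: create the output directory with exist_ok=True and skip dump files that already exist locally. This is only a sketch; the urls list is a stand-in for whatever the script actually iterates over, and a file that was only partially downloaded would still need to be deleted and fetched again:

import os

OUTPUT_DIR = "pushshift_dumps_full"
os.makedirs(OUTPUT_DIR, exist_ok=True)  # do not fail if the folder already exists

urls = []  # stand-in for the list of Pushshift dump URLs the script builds
for url in urls:
    dest = os.path.join(OUTPUT_DIR, os.path.basename(url))
    if os.path.exists(dest):
        print("Skipping", os.path.basename(dest), "(already downloaded)")
        continue
    # ... existing pycurl download logic for `url` goes here ...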

BPE

I'd love to help out and implement BPE. Can you give me some pointers to get started?
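
As a starting point, the classic byte-pair merge loop from Sennrich et al. is only a few lines; the toy sketch below shows the training step on a tiny corpus. Note that GPT-2 uses a byte-level variant with a regex pre-tokenizer, which this does not cover:

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Merge every whitespace-separated occurrence of the pair into one symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words as space-separated symbols with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged", best)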

Estimated disk space usage of scraped data?

Hello,

Per the readme downloads are processed a month at a time.

Is there an estimate of the average size of the data scraped in each of these chunks, as well as of the final total size of the scraped results?

It might also be useful to add the total post-scrape disk space requirement to the readme, as disk space can be a prohibitive requirement for some.

Thank you!

Getting the karma score from pushshift

Thanks for making such an awesome library!

I just had a question about how the package gets submissions and comments with karma >= min_karma. My understanding from the Pushshift FAQ is that Pushshift doesn't have live scores. How did you get around that limitation?

Thanks again :)
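
Not sure how the maintainers handle it, but since the monthly dump files embed a score field as of the time the archive was built, one plausible approach is to filter on that snapshot rather than on live scores. A rough sketch over one of the .bz2 submission dumps (filename and threshold are just examples):

import bz2
import json

MIN_KARMA = 3  # example threshold

# Each line of a Pushshift monthly submissions dump is one JSON object; its
# `score` field is a snapshot taken when the dump was built, not a live value.
urls = []
with bz2.open("RS_2012-12.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        post = json.loads(line)
        if post.get("score", 0) >= MIN_KARMA and post.get("url"):
            urls.append(post["url"])

print(len(urls), "submissions with score >=", MIN_KARMA)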

Quick question

In the readme, it is mentioned that after doing all the preprocessing we will be left with 23M "good" links. Two questions:

  1. Is my assumption correct that the 23M good links are not unique?
  2. If they are not, is there any information on how many unique links are present in this dataset? (See the sketch below.)
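
One quick way to check is to count distinct URLs across the deduplicated per-month link files; a small sketch, assuming the url_dumps_deduped/*.deduped.txt layout mentioned in another issue above:

import glob

total = 0
unique_urls = set()

# Count total vs. distinct URLs across the deduplicated link files.
for path in glob.glob("url_dumps_deduped/*.deduped.txt"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if url:
                total += 1
                unique_urls.add(url)

print(total, "links total,", len(unique_urls), "unique")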
