
openwebtext's People

Contributors

8enmann, ddbourgin, dependabot[bot], isagrue, jcpeterson, johngiorgi, simonfall, smeylan, tilmanrpk


openwebtext's Issues

How to cite this version of openwebtext?

Hi! Do you have a standard BibTeX entry that you recommend for citing your work?
I understand that this is a reproduction of Radford et al., but your work deserves credit.
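
In the absence of an official entry, something along these lines could serve as a starting point; every field below is a placeholder to be confirmed by the maintainers, not an official citation:

@misc{openwebtext,
  title        = {openwebtext: an open reproduction of the WebText corpus},
  author       = {<repository contributors>},
  year         = {<year of the release used>},
  howpublished = {<repository URL>},
  note         = {Accessed <date>}
}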

Idea for further filtering

I've just run a quick filter to find non-English docs and found 5,052 such cases (of the total 8 million).

It's a fairly crude filter, but I haven't seen any false positives:

import re
import datasets

# Load the full training split, then keep only documents that contain none of a
# handful of very common English words (a crude proxy for "not English").
ds = datasets.load_dataset("openwebtext", split="train")
ds_filtered = ds.filter(lambda sample: not re.search("(?i)the|that|and|with|this", sample["text"]))

Samples of the docs are things like this:

[screenshot: samples of the filtered documents]

Printed with:

# Preview the first 400 characters of each filtered document, with newlines shown as " | ".
for doc in ds_filtered:
    print(doc["text"].replace("\n", " | ")[:400])
    print("\n")

Feel free to close if you have no plans for future versions of the dataset, just thought you might like to know.

Getting a datasets.utils.info_utils.NonMatchingSplitsSizesError when downloading the dataset from Hugging Face

Hello,
I am trying to download the openwebtext dataset from Hugging Face, but I keep getting the following error:

Downloading data: 100%|________________________________________________________________________________________________________________| 12.9G/12.9G [25:43<00:00, 8.35MB/s]
/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/download/download_manager.py:527: FutureWarning: 'num_proc' was deprecated in version 2.6.2 and will be removed in 3.0.0. Pass `DownloadConfig(num_proc=<num_proc>)` to the initializer instead.
  warnings.warn(
Extracting data files: 100%|________________________________________________________________________________________________________| 20610/20610 [9:43:42<00:00,  1.70s/it]
Traceback (most recent call last):
  File "ssd_process_data.py", line 485, in <module>
    main()
  File "ssd_process_data.py", line 369, in main
    raw_datasets["train"] = load_dataset(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1782, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 872, in download_and_prepare
    self._download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 1649, in _download_and_prepare
    super()._download_and_prepare(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/builder.py", line 985, in _download_and_prepare
    verify_splits(self.info.splits, split_dict)
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/utils/info_utils.py", line 100, in verify_splits
    raise NonMatchingSplitsSizesError(str(bad_splits))
datasets.utils.info_utils.NonMatchingSplitsSizesError: [{'expected': SplitInfo(name='train', num_bytes=39769494896, num_examples=8013769, shard_lengths=None, dataset_name=None), 'recorded': SplitInfo(name='train', num_bytes=39769065791, num_examples=8013740, shard_lengths=[101000, 100000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 101000, 101000, 101000, 101000, 102000, 102000, 100000, 101000, 100000, 101000, 102000, 101000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 101000, 101000, 102000, 101000, 102000, 101000, 101000, 100000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 101000, 100000, 101000, 102000, 101000, 101000, 101000, 101000, 101000, 102000, 102000, 101000, 102000, 101000, 102000, 102000, 101000, 101000, 102000, 102000, 102000, 101000, 102000, 102000, 102000, 101000, 101000, 102000, 101000, 13740], dataset_name='openwebtext')}]
(ssdlm) sloboda1@dgx02:~/controlled_reduction/decoding_approaches/ssd-lm$ less  /home/nlp/sloboda1/.cache/huggingface/datasets/openwebtext/plain_text/1.0.0/85b3ae7051d2d72e7c5fdf6dfb462603aaa26e9ed506202bf3a24d261c6c40a1_builder.lock

I have tried forcing a re-download of the dataset by passing the download_mode="force_redownload" parameter, but it yields the same error.

I have also tried passing the ignore_verifications=True parameter, but this in turn yielded the following error:

    raw_datasets["train"] = load_dataset(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/site-packages/datasets/load.py", line 1754, in load_dataset
    verification_mode = VerificationMode(
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 339, in __call__
    return cls.__new__(cls, value)
  File "/home/nlp/sloboda1/anaconda3/envs/ssdlm/lib/python3.8/enum.py", line 663, in __new__
    raise ve_exc
ValueError: 'none' is not a valid VerificationMode

I am at a loss here and would really appreciate some guidance on how to address this problem.
Thanks.
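
For reference, on recent versions of datasets the ignore_verifications flag has been replaced by verification_mode, which skips the split-size check when set to "no_checks". A minimal sketch, assuming datasets >= 2.9:

from datasets import load_dataset

# Skip split-size/checksum verification so the mismatch above does not abort the load
# (verification_mode replaces the older ignore_verifications flag).
ds = load_dataset(
    "openwebtext",
    split="train",
    verification_mode="no_checks",
)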

Filtering extracted results

This is more of a question than an issue: I noticed that my scrape contains a large number of spurious results like:

Sorry, we just need to make sure you're not a robot. For best results, please make sure your browser is accepting cookies.

The actual OpenWebText corpus seems pretty clean, so I'm wondering what heuristics, if any, were used to remove these pages in order to reproduce it.

The corpus page mentions post-filtering using fasttext - is this something that will be added to this project at some point?

Finally, the readme implies that bs4 would be a better extractor than newspaper - is that the case? It's not an option in extract_text.py so it's difficult to compare.
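
As a stopgap, a simple substring blacklist catches most of these CAPTCHA and cookie-wall pages. The phrase list below is only an illustration, not the filtering actually used for the original corpus:

# Rough heuristic filter for CAPTCHA / cookie-wall boilerplate pages.
BOILERPLATE_PHRASES = [
    "make sure you're not a robot",
    "please enable cookies",
    "checking your browser before accessing",
]

def looks_like_boilerplate(text):
    lowered = text.lower()
    return any(phrase in lowered for phrase in BOILERPLATE_PHRASES)

print(looks_like_boilerplate("Sorry, we just need to make sure you're not a robot."))  # True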

Why is Newspaper3k used for html scraping?

I noticed newspaper is used for downloading articles when using the raw scraper. Why not use a simpler (and probably less performance-hungry) approach like requests, etc.? It just seems unnecessarily complicated. Is there a specific reason for that?
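
For comparison, the two approaches look roughly like the sketch below: newspaper bundles the download, encoding handling, and article-body extraction, whereas requests only fetches raw HTML, so a separate extractor is still needed on top. The URL is a placeholder:

import requests
from newspaper import Article

url = "https://example.com/some-article"  # placeholder

# newspaper: fetch, parse, and extract the article body in one go
article = Article(url)
article.download()
article.parse()
print(article.text[:200])

# requests: just the raw HTML; extraction (bs4, etc.) is up to the caller
html = requests.get(url, timeout=30).text
print(html[:200])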

pycurl error: transfer closed with X bytes remaining to read

(base) user@desktop:/data/openwebtext$ python fetch_urls.py
Downloaded RS_2012-12.bz2
Downloaded RS_v2_2008-06.xz
Downloaded RS_2012-05.bz2
Downloaded RS_v2_2009-09.xz
Downloaded RS_v2_2007-11.xz
Downloaded RS_v2_2010-05.xz
Downloaded RS_2012-01.bz2
Downloaded RS_2020-04.zst
Downloaded RS_v2_2006-06.xz
Downloaded RS_v2_2010-02.xz
Downloaded RS_v2_2006-02.xz
Downloaded RS_2012-09.bz2
Downloaded RS_v2_2010-04.xz
Traceback (most recent call last):
  File "fetch_urls.py", line 39, in <module>
    main()
  File "fetch_urls.py", line 33, in main
    curl.perform()
pycurl.error: (18, 'transfer closed with 1265259864 bytes remaining to read')
(base) user@desktop:/data/openwebtext$ 
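
In case it helps, pycurl can resume a partial transfer by setting RESUME_FROM_LARGE to the size of the partial file. A rough sketch (download_file is a hypothetical helper, not the function in fetch_urls.py):

import os
import pycurl

def download_file(url, dest):
    # Resume from the end of an existing partial file, if any.
    resume_from = os.path.getsize(dest) if os.path.exists(dest) else 0
    with open(dest, "ab") as f:
        c = pycurl.Curl()
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, f)
        c.setopt(pycurl.RESUME_FROM_LARGE, resume_from)
        c.setopt(pycurl.FOLLOWLOCATION, True)
        try:
            c.perform()
        finally:
            c.close()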

Error with get_state in download.py

Hi, I downloaded the pre-filtered URL list from here and then tried to extract the text with download.py, as per the readme:

python download.py url_dumps_deduped/RS_2018-07.xz.deduped.txt \
    --n_procs 40 \
    --scraper bs4 \
    --chunk_size 100000 \
    --compress \
    --timeout 30

For plenty of the .txt files, I get this error:

Traceback (most recent call last):
  File "download.py", line 235, in <module>
    completed_uids, state_fp, prev_cid = get_state(month, args.output_dir)
  File "download.py", line 210, in get_state
    latest_cid = max([int(a.split("-")[-1].split("_")[0]) for a in archives])
ValueError: max() arg is an empty sequence

Is this a known error? I am planning to dig through the code to try and debug it, but I first wanted to see if anyone else is facing this issue and knows the fix or cause.
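
For what it's worth, this seems to happen when a month's output directory exists but contains no chunk archives yet, so the list passed to max() is empty. A possible guard around the line in the traceback (just a sketch against download.py's get_state, not a tested patch; starting the chunk id at 0 is an assumption):

# download.py, get_state(): guard against an empty `archives` list
if archives:
    latest_cid = max(int(a.split("-")[-1].split("_")[0]) for a in archives)
else:
    latest_cid = 0  # assumed starting chunk id when nothing has been written yet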

How to resume download after an error?

Trying to restart a download after a failure (either a pycurl error or a disconnect) throws an error, since the output folder already exists:

(base) user@desktop:/data/openwebtext$ python fetch_urls.py
Traceback (most recent call last):
  File "fetch_urls.py", line 39, in <module>
    main()
  File "fetch_urls.py", line 26, in main
    os.makedirs(OUTPUT_DIR)
  File "/home/user/miniconda3/lib/python3.7/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: 'pushshift_dumps_full'

This means I need to restart the download from the beginning. Is there a way to resume downloading? I already downloaded a large amount of data and I don't want to throw it away.
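
Until resuming is supported in the script itself, two small changes along these lines let fetch_urls.py pick up where it left off: create the output directory with exist_ok=True and skip dump files that already exist locally. This is only a sketch; the urls list is a stand-in for whatever the script actually iterates over, and a file that was only partially downloaded would still need to be deleted and fetched again:

import os

OUTPUT_DIR = "pushshift_dumps_full"
os.makedirs(OUTPUT_DIR, exist_ok=True)  # do not fail if the folder already exists

urls = []  # stand-in for the list of Pushshift dump URLs the script builds
for url in urls:
    dest = os.path.join(OUTPUT_DIR, os.path.basename(url))
    if os.path.exists(dest):
        print("Skipping", os.path.basename(dest), "(already downloaded)")
        continue
    # ... existing pycurl download logic for `url` goes here ...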

BPE

I'd love to help out and implement BPE. Can you give me some pointers to get started?
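
As a starting point, the classic byte-pair merge loop from Sennrich et al. is only a few lines; the toy sketch below shows the training step on a tiny corpus. Note that GPT-2 uses a byte-level variant with a regex pre-tokenizer, which this does not cover:

import re
from collections import Counter

def get_pair_counts(vocab):
    # Count adjacent symbol pairs across the corpus, weighted by word frequency.
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    # Merge every whitespace-separated occurrence of the pair into one symbol.
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    return {pattern.sub("".join(pair), word): freq for word, freq in vocab.items()}

# Toy corpus: words as space-separated symbols with an end-of-word marker.
vocab = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
for _ in range(10):
    pairs = get_pair_counts(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_pair(best, vocab)
    print("merged", best)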

Estimated disk space usage of scraped data?

Hello,

Per the readme downloads are processed a month at a time.

Is there an estimate of the average size of the data scraped in each of these chunks, as well as of the final total size of the scraped results?

It might also be useful to add the total post-scrape disk space requirement to the readme, as disk space can be a prohibitive requirement for some.

Thank you!

Getting the karma score from pushshift

Thanks for making such an awesome library!

I just had a question about how the package gets submissions and comments with karma >= min_karma. My understanding from the Pushshift FAQ is that Pushshift doesn't have live scores. How did you get around that limitation?

Thanks again :)
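
Not sure how the maintainers handle it, but since the monthly dump files embed a score field as of the time the archive was built, one plausible approach is to filter on that snapshot rather than on live scores. A rough sketch over one of the .bz2 submission dumps (filename and threshold are just examples):

import bz2
import json

MIN_KARMA = 3  # example threshold

# Each line of a Pushshift monthly submissions dump is one JSON object; its
# `score` field is a snapshot taken when the dump was built, not a live value.
urls = []
with bz2.open("RS_2012-12.bz2", "rt", encoding="utf-8") as f:
    for line in f:
        post = json.loads(line)
        if post.get("score", 0) >= MIN_KARMA and post.get("url"):
            urls.append(post["url"])

print(len(urls), "submissions with score >=", MIN_KARMA)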

Quick question

In the readme, it is mentioned that after doing all the preprocessing we will be left with 23M "good" links. Two questions:

  1. Is my assumption correct that the 23M good links are not unique?
  2. If they are not, is there any information on how many unique links are present in this dataset? (See the sketch below.)
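
One quick way to check is to count distinct URLs across the deduplicated per-month link files; a small sketch, assuming the url_dumps_deduped/*.deduped.txt layout mentioned in another issue above:

import glob

total = 0
unique_urls = set()

# Count total vs. distinct URLs across the deduplicated link files.
for path in glob.glob("url_dumps_deduped/*.deduped.txt"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            url = line.strip()
            if url:
                total += 1
                unique_urls.add(url)

print(total, "links total,", len(unique_urls), "unique")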
