
dolma's Introduction

Dolma's official logo: "dolma" written in yellow, rounded lowercase letters on a blue background.

Dolma is two things:

  1. Dolma Dataset: an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
  2. Dolma Toolkit: a high-performance toolkit for curating datasets for language modeling -- this repo contains the source code for the Dolma Toolkit.

Dolma Dataset

Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. It was created as a training corpus for OLMo, a language model from the Allen Institute for AI (AI2).

Dolma is available for download on the HuggingFace 🤗 Hub: huggingface.co/datasets/allenai/dolma. Dolma is licensed under ODC-BY; see our blog post for an explanation.
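
For a quick look at the data, a minimal sketch like the following may work, assuming the datasets library is installed; the exact configuration or version name to pass can differ, so check the dataset card on the Hub.

# Hedged sketch: stream a few Dolma documents from the Hugging Face Hub.
# The dataset id comes from the link above; the "text" field follows the
# document format used elsewhere in this repo and may need adjusting.
from datasets import load_dataset

dolma = load_dataset("allenai/dolma", split="train", streaming=True)
for i, doc in enumerate(dolma):
    print(doc["text"][:80].replace("\n", " "))
    if i == 2:
        break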

You can also read more about Dolma in our announcement, as well as by consulting its data sheet.

Dolma Toolkit

This repository houses the Dolma Toolkit, which enables curation of large datasets for (pre)-training ML models. Its key features are:

  1. High Performance ⚡: Can process billions of documents concurrently thanks to built-in parallelism.
  2. Portability 🧳: Works on a single machine, a cluster, or a cloud environment.
  3. Built-In Taggers 🏷: Includes ready-to-use taggers commonly used to curate datasets such as Gopher, C4, and OpenWebText.
  4. Fast Deduplication 🗑: Speedy document deduplication using a Rust Bloom filter.
  5. Extensibility 🧩 & Cloud Support ☁: Supports custom taggers and AWS S3-compatible locations.

To install, simply type pip install dolma in your terminal.

To learn more about how to use the Dolma Toolkit, please visit the documentation.

Citation

If you use the Dolma dataset or toolkit, please cite the following items:

@article{dolma,
  title = {{Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research}},
  author={Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Nathan Lambert and Ian Magnusson and Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and Crystal Nam and Matthew E. Peters and Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and Emma Strubell and Nishant Subramani and Oyvind Tafjord and Pete Walsh and Luke Zettlemoyer and Noah A. Smith and Hannaneh Hajishirzi and Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo},
  year={2024},
  journal={arXiv preprint},
  url={https://arxiv.org/abs/2402.00159}
}

dolma's People

Contributors

arnavic, benbogin, chris-ha458, dependabot[bot], dirkgr, drschwenk, eltociear, ianand, ianmagnusson, kennethenevoldsen, kyleclo, muennighoff, peterbjorgensen, rodneykinney, rohitrathore1, simonw, soldni, undfined, whattabatt

dolma's Issues

Provenance license?

Hi, I'm researching provenance license/consent risk for clients. The risk being managed is "risk of litigation requiring derivative works such as LLMs to be taken down as a result of copyright violation".

I can't immediately find any resources regarding dolma that address this. I can see some ways it could be addressed, for example by only crawling content that carries a clear statement of its license (such as Creative Commons).

Apologies if this was made clear somewhere!

๐Ÿ™ in advanceโ€ฆ

Data sheet link in README is broken

This link is a 404:

You can also read more about Dolma in [our announcement](https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64), as well as by consulting its [data sheet](docs/assets/dolma-datasheet-v0.1.pdf).

docs/assets/dolma-datasheet-v0.1.pdf

make_wikipedia.py: long running time

Hi, Thank you for sharing this outstanding repository!

I have been trying to use scripts/make_wikipedia.py to process a German Wikipedia dump:

python scripts/make_wikipedia.py --output wikipedia --lang de  --date 20240201 --processes 16

Unfortunately, it has been running for several days, and judging from the output it seems to have made very little progress, if I am interpreting it correctly:

[...]
WARNING:root:Template errors in article 'Buckenhof' (395836): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Imsterberg' (395533): title(0) recursion(7929961, 0, 0)
WARNING:root:Template errors in article 'Spardorf' (395843): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Marloffstein' (395848): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Karres' (395572): title(0) recursion(7929961, 0, 0)
[...]

At this speed, it would take weeks to complete. Using htop I can see that all processes are busy, so I don't think this is a multiprocessing problem (#58); however, I am also running it on a Linux machine.

This is likely a problem with the underlying wikiextractor library, but since that project seems to have little to no activity, I am interested in your experience using this script. Is it normal for it to take this long?

not_alphanum_paragraph_v1 tagger takes forever to run on certain inputs.

While running taggers on the hplt dataset, I encountered a problem where the not_alphanum_paragraph_v1 tagger stalls forever. To debug the problem, I created a minimal working example by copy-pasting some code from the TaggerProcessor. I have attached the debugging code in this archive with some text that triggers the problem.
mwe.tar.gz

It looks like long sequences of emojis stall the tagger forever. Here are some timings on emoji text from the hplt dataset:

InputSpec(id='7', text='๐Ÿ˜  ๐Ÿ˜ก', source='hplt1.2', version=None)
took 0.000039 seconds

InputSpec(id='4', text='๐Ÿ˜  ๐Ÿ˜ก ๐Ÿ˜ค ๐Ÿ˜‹ ๐Ÿ˜Ž ๐ŸŒฆ ๐ŸŒง ๐ŸŒœ ๐ŸŒˆ ๐Ÿ ๐ŸŽ…\n\nAnti-Spam: *\nSpรธrgsmรฅl: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 0.000025 seconds

InputSpec(id='11', text='๐Ÿ˜  ๐Ÿ˜ก ๐Ÿ˜ค ๐Ÿ˜‹ ๐Ÿ˜Ž ๐Ÿ˜ด ๐Ÿ˜ˆ ๐Ÿ˜‡ ๐Ÿ˜• ๐Ÿ˜ ๐Ÿ˜‘ ๐Ÿ‘ฒ ๐Ÿ‘ฎ ๐Ÿ’‚ ๐Ÿ‘ถ โค ๐Ÿ’” ๐Ÿ’• ๐Ÿ’˜ ๐Ÿ’Œ ๐Ÿ’‹ ๐ŸŽ ๐Ÿ’ฐ ๐Ÿ’ ๐Ÿ‘ ๐Ÿ‘Ž ๐Ÿ‘Œ โœŒ๏ธ ๐Ÿค˜ ๐Ÿ‘ ๐ŸŽต โ˜•๏ธ ๐Ÿต Anti-Spam: *\nSpรธrgsmรฅl: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 0.000021 seconds

InputSpec(id='5', text='๐Ÿ˜  ๐Ÿ˜ก ๐Ÿ˜ค ๐Ÿ˜‹ ๐Ÿ˜Ž ๐Ÿ˜ด ๐Ÿ˜ˆ ๐Ÿ˜‡ ๐Ÿ˜• ๐Ÿ˜ ๐Ÿ˜‘ ๐Ÿ‘ฒ ๐Ÿ‘ฎ ๐Ÿ’‚ ๐Ÿ‘ถ โค ๐Ÿ’” ๐Ÿ’• ๐Ÿ’˜ ๐Ÿ’Œ ๐Ÿ’‹ ๐ŸŽ ๐Ÿ’ฐ ๐Ÿ’ ๐Ÿ‘ ๐Ÿ‘Ž ๐Ÿ‘Œ โœŒ๏ธ ๐Ÿค˜ ๐Ÿ‘ ๐ŸŽต โ˜•๏ธ ๐Ÿต ๐Ÿบ ๐Ÿท ๐Ÿผ โ˜€๏ธ ๐ŸŒค ๐ŸŒฆ ๐ŸŒง ๐ŸŒœ ๐ŸŒˆ ๐Ÿ ๐ŸŽ…\n\nAnti-Spam: *\nSpรธrgsmรฅl: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 64.204857 seconds

InputSpec(id='3', text='\nGรฆstebogs indlรฆg: *\n๐Ÿ˜„ ๐Ÿ˜ƒ ๐Ÿ˜Š ๐Ÿ˜‰ ๐Ÿ˜ ๐Ÿ˜š ๐Ÿ˜— ๐Ÿ˜œ ๐Ÿ˜› ๐Ÿ˜ณ ๐Ÿ˜ ๐Ÿ˜ฌ ๐Ÿ˜Œ ๐Ÿ˜ž ๐Ÿ˜ข ๐Ÿ˜‚ ๐Ÿ˜ญ ๐Ÿ˜… ๐Ÿ˜“ ๐Ÿ˜ฉ ๐Ÿ˜ฎ ๐Ÿ˜ฑ ๐Ÿ˜  ๐Ÿ˜ก ๐Ÿ˜ค ๐Ÿ˜‹ ๐Ÿ˜Ž ๐Ÿ˜ด ๐Ÿ˜ˆ ๐Ÿ˜‡ ๐Ÿ˜• ๐Ÿ˜ ๐Ÿ˜‘ ๐Ÿ‘ฒ ๐Ÿ‘ฎ ๐Ÿ’‚ ๐Ÿ‘ถ โค ๐Ÿ’” ๐Ÿ’• ๐Ÿ’˜ ๐Ÿ’Œ ๐Ÿ’‹ ๐ŸŽ ๐Ÿ’ฐ ๐Ÿ’ ๐Ÿ‘ ๐Ÿ‘Ž ๐Ÿ‘Œ โœŒ๏ธ ๐Ÿค˜ ๐Ÿ‘ ๐ŸŽต โ˜•๏ธ ๐Ÿต ๐Ÿบ ๐Ÿท ๐Ÿผ โ˜€๏ธ ๐ŸŒค ๐ŸŒฆ ๐ŸŒง ๐ŸŒœ ๐ŸŒˆ ๐Ÿ ๐ŸŽ…\n\nAnti-Spam: *\nSpรธrgsmรฅl: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
... takes 'forever'

It seems to be a bug in the regex Python package. If I swap the regex package for the standard library re package, it takes only milliseconds again. I am not sure which feature of the regex package makes it necessary here, but this bug makes me question whether something similar will happen with other regex queries.
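
As a rough way to compare the two engines on the same input (the pattern and text below are stand-ins, not the tagger's actual expression or the offending hplt document), something like this reproduces the kind of timing comparison described above:

import re
import time

import regex  # third-party package used by the tagger

PATTERN = r"[^\s\w]+"            # hypothetical pattern, for illustration only
TEXT = "😠 😡 😤 " * 1000          # hypothetical emoji-heavy input

for engine in (re, regex):
    compiled = engine.compile(PATTERN)
    start = time.perf_counter()
    compiled.findall(TEXT)
    print(engine.__name__, f"{time.perf_counter() - start:.6f} seconds")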

We encountered the bug while trying to create an overview of the taggers:
centre-for-humanities-computing/danish-foundation-models#207 (comment)

Tokenizer name or path must be found error

I am experiencing an issue while tokenizing the Wikipedia dataset from the getting-started steps. My tokenizer file is in the root of this repository, and my relative path is: dolma/EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json.

The traceback of the error is as follows:

~/dolma$ dolma tokens \
>     --documents "wikipedia/example0/documents/*.gz" \
>     --tokenizer_name_or_path "EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json" \
>     --destination wikipedia/example0/tokens \
>     --processes 16
batch_size: 10000
debug: false
destination: wikipedia/example0/tokens
documents:
- wikipedia/example0/documents/*.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 16
ring_size: 8
seed: 3920
tokenizer:
  bos_token_id: null
  eos_token_id: null
  name_or_path: null
  pad_token_id: null
  segment_before_tokenization: false
tokenizer_name_or_path: dolma/EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json
work_dir:
  input: null
  output: null
Traceback (most recent call last):
  File "/home/TeAmP0is0N/anaconda3/envs/dolma/bin/dolma", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/TeAmP0is0N/dolma/python/dolma/cli/__main__.py", line 91, in main
    return cli.run_from_args(args=args, config=config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/TeAmP0is0N/dolma/python/dolma/cli/__init__.py", line 190, in run_from_args
    return cls.run(parsed_config)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/TeAmP0is0N/dolma/python/dolma/cli/tokenizer.py", line 181, in run
    raise DolmaConfigError("Tokenizer name or path must be provided.")
dolma.core.errors.DolmaConfigError: Tokenizer name or path must be provided.

Change bloom_filter implementation of hash

Currently, bloom_filter.rs uses ahash as its internal hasher.

This is problematic since ahash has an unstable representation:

different computers or computers on different versions of the code will observe different hash values. As such, aHash is not recommended for use other than in-memory maps. Specifically, aHash is not intended for network use or in applications which persist hashed values.

I would love to learn whether the dolma developers have found a way to serialize it while maintaining some kind of portability, but that is not a supported use case, and I feel there is a benefit in moving to a stable hash.

Recommendations

  • Rust's default hasher is reasonably fast and designed to resist HashDoS attacks. Currently it is SipHash-1-3, and it supports keyed hashing (which can be used as a seeded hash).
  • Blake3 is one of the fastest, if not the fastest, cryptographic hashes. It also supports keyed hashing.
  • xxhash (more specifically the xxh3 iteration) is one of the fastest, if not the fastest, hashers that pass SMHasher.
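
To illustrate the portability point in Python terms (an analogy, not dolma code): a keyed, stable hash such as BLAKE2 maps the same input to the same bucket on any machine or version, which is what a persisted Bloom filter needs.

import hashlib

def stable_bucket(paragraph: str, num_bits: int, key: bytes = b"fixed-seed") -> int:
    # BLAKE2b with an explicit key is stable across machines, processes, and
    # library versions, unlike a hasher keyed with a per-process random seed.
    digest = hashlib.blake2b(paragraph.encode("utf-8"), key=key, digest_size=8).digest()
    return int.from_bytes(digest, "little") % num_bits

print(stable_bucket("hello world", num_bits=1 << 20))  # same value on every run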

make_wikipedia in getting_started.md

The command for running make_wikipedia needs to be edited. The one currently in the document:

python scripts/make_wikipedia.py \
  --output wikipedia \
  --languages simple \
  --date 20231001 \
  --lang simple \
  --num_processes 16

should be written as below:

python scripts/make_wikipedia.py \
  --output wikipedia \
  --date 20231001 \
  --lang simple \
  --processes 16

Blank documents in common crawl data

Hi, I've been exploring the common crawl data which I downloaded from huggingface, and I noticed there seem to be a lot of rows with blank text.

For example, in data/common-crawl/cc_en_head/cc_en_head-0000.json.gz, I found that ~12.25% of rows had empty text fields.


Query:

select
count(*) as all_rows
, sum(case when text = '' then 1 else 0 end) as null_text
, sum(case when text = '' then 1 else 0 end)/count(*) as ratio
from cc_en_head_0000
where true 

Example ID:
http://009.housedems.com/article/dems-call-restoration-stolen-wages-mi-workers
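
For reference, the same ratio can be checked locally with a few lines of Python (the path below is just the shard named above):

import gzip
import json

# Count rows with an empty "text" field in one downloaded shard.
path = "data/common-crawl/cc_en_head/cc_en_head-0000.json.gz"
total = empty = 0
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        total += 1
        empty += doc.get("text", "") == ""
print(f"{empty}/{total} rows ({empty / total:.2%}) have empty text")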

make_wikipedia.py fails on linux

Traceback (most recent call last):
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 283, in _multiprocessing_run_all
    multiprocessing.set_start_method("spawn")
  File "/usr/lib/python3.11/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/peter/kode/dolma/scripts/make_wikipedia.py", line 289, in <module>
    main()
  File "/home/peter/kode/dolma/scripts/make_wikipedia.py", line 285, in main
    processor(date=args.date, lang=args.lang)
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 390, in __call__
    fn(
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 285, in _multiprocessing_run_all
    assert multiprocessing.get_start_method() == "spawn", "Multiprocessing start method must be spawn"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Multiprocessing start method must be spawn

The bug can be fixed by calling multiprocessing.set_start_method("spawn") under the script's __main__ guard.

Perhaps dolma's core/parallel.py should use multiprocessing.get_context("spawn") to avoid this.
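
A minimal sketch of the get_context approach (a standalone example, not dolma code):

import multiprocessing

def work(x):
    return x * x

if __name__ == "__main__":
    # An explicit context avoids mutating the global start method, so library
    # code can no longer hit "context has already been set".
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(work, range(8)))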

Dolma stat crashes because number of bins overflows python integer

The NUM_BINS constant in python/dolma/core/analyzer.py is 100k by default, and this value makes the 10**NUM_BINS expression in FixedBucketsValTracker overflow. The _make_tracker function does not use the number of bins from the config but uses the constant value; I guess this is because the counts are then summarized to the correct number of bins at the end?
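
For reference, the overflow is easy to reproduce in isolation (not dolma code, just the failing arithmetic):

NUM_BINS = 100_000
m = 0.5
# Multiplying a float by 10**NUM_BINS forces an int-to-float conversion that
# cannot be represented, matching the traceback below.
k = int(m * 10 ** NUM_BINS)  # OverflowError: int too large to convert to float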

I have tried using the InferBucketsValTracker instead, and it seems to work. However, the bins array in the results is sometimes one element larger than the counts, which is expected if the bins array represents the bin edges, but sometimes bins and counts have the same length, so I am not sure what the bins in the final result represent.

dolma stat --attributes "mC4_da/attributes/v0tags/*.json.gz" --bins 100 --processes 12 --report v0tags_report2
attributes:
- mC4_da/attributes/v0tags/*.json.gz
bins: 100
debug: false
processes: 12
regex: null
report: v0tags_report2
seed: 0
work_dir:
  input: null
  output: null
Found 1,024 files to process
files: 0.00f [00:00, ?f/s]
documents: 0.00d [00:00, ?d/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/peter/kode/dolma_clean/python/dolma/core/parallel.py", line 174, in _process_single_and_save_status
    cls.process_single(
  File "/home/peter/kode/dolma_clean/python/dolma/core/analyzer.py", line 120, in process_single
    trackers.setdefault(f"{attr_name}/score", _make_tracker()).add(score)
  File "/home/peter/kode/dolma_clean/python/dolma/core/binning.py", line 245, in add
    k = int(m * self.n), e
            ~~^~~~~~~~
OverflowError: int too large to convert to float
"""

How is Exact paragraph deduplication performed?

hi, thank you for your great work.

I am wondering how exactly the "Exact paragraph deduplication" operation is carried out.

As I understand it, "Exact paragraph deduplication" follows these steps (a rough sketch in code follows the list):

  1. split each document into paragraphs
  2. detect duplicated paragraphs (using the bloom filter)
  3. remove duplicated paragraphs.
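
Here is that rough sketch, using a plain set in place of the Bloom filter and keeping the first occurrence of each paragraph; this is only my reading of the steps, not necessarily how dolma implements it:

seen = set()

def dedupe_paragraphs(text: str) -> str:
    kept = []
    for paragraph in text.split("\n"):   # 1. split the document into paragraphs
        key = paragraph.strip()
        if key and key in seen:          # 2. detect paragraphs seen before
            continue                     # 3. drop the repeated paragraph
        seen.add(key)
        kept.append(paragraph)
    return "\n".join(kept)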

However, there are a few questions:

  1. for step 3, assume a paragraph is duplicated N times. It is then reasonable to remove N-1 of the duplicates. I am wondering which one of the N paragraphs should be retained and which N-1 should be removed?
  2. removing a paragraph from a document will usually hurt the original document. Will this degrade the data quality?

Only the attributes written by the last tagger in the tagger list gets written in version 1.0.0

Since upgrading dolma to version 1.0.0, I only get the attributes from the last tagger in the list.
I think the problem is here:

# if not set; it will potentially not write to the output stream
# in case a tagger emits no spans
attributes_by_stream[tagger_output.path] = {}

tagger_output.path is the same for all the taggers in the list, but attributes_by_stream[tagger_output.path] is reset to an empty dictionary on each iteration of the loop over taggers, leaving only the attributes from the last tagger in the list.
This bug is not present in version 0.9.4.
I would submit a pull request, but I am not sure what these three lines are supposed to fix.
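
If the intent of those lines is only to ensure the key exists, one possible (untested) change would be to initialize it without overwriting what earlier taggers already wrote:

# if not set, initialize the key without clobbering attributes already
# written by earlier taggers for the same path (untested suggestion)
attributes_by_stream.setdefault(tagger_output.path, {})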

Potential race condition

I have been looking into https://github.com/allenai/dolma/blob/main/src/bloom_filter.rs, specifically how it is thread-safe:

pub fn contains_hashes(&self, hashes: &Vec<u64>) -> bool {
    for hash in hashes {
        let hash = *hash as usize;
        let index = hash / 32 % self.bits.len();
        let bit = hash % 32;
        if self.bits[index].load(Ordering::Relaxed) & (1 << bit) == 0 {
            return false;
        }
    }
    return true;
}

Here, while access to each individual AtomicU32 value within the array is atomic, the method checks a condition over multiple AtomicU32 values without a lock on the Vec. As such, multiple threads could be accessing individual AtomicU32 values independently of each other, for both reads and writes.
If one or more of the values changes due to another thread's write during the execution of contains_hashes, it might lead to an incorrect result.

Consider the scenario where thread A is executing contains_hashes and finds that all the required bits are set. Before thread A completes execution, thread B updates one of the bits that thread A already checked. Since the check is not atomic across the entire vector, thread A's result becomes stale, and the method might return false when it should return true.
This problem is compounded if more threads are changing other items that have already been checked by other threads.
If there is a higher rate of true positives (duplicates), the issue is also exacerbated.
This is particularly concerning since Bloom filters are not supposed to produce false negatives; the problem becomes worse as we increase the number of threads or when the underlying data has a lot of duplicates.

In summary, the code exhibits a race condition where the result of the contains_hashes method can be affected by concurrent updates to the bits vector. The individual accesses to AtomicU32 are atomic, but the method's logic requires a consistent view of multiple AtomicU32 values, and this consistency is not guaranteed.

To correct this, a higher-level synchronization mechanism (e.g., a read-write lock on the whole Vec) might be required to ensure that the entire check operation in contains_hashes is atomic with respect to updates to the bits vector.

This is the approach taken by another Rust crate, ofilter. Relevant source:

pub struct SyncBloom<T> {
    inner: Arc<RwLock<Bloom<T>>>,
}

A similar approach is taken by Google's Guava library.
Their solution is neither complete nor comprehensive, with many tradeoffs in the data structures, the types and methods operating on them, and tests to validate that those tradeoffs do not cause too much error.
(I do not like this approach, but apparently it is good enough for them.)

tagger_modules do not work in current git version

The modules show up with dolma list --tagger_modules mypackage.mymodule, but it crashes if you run dolma tag --tagger_modules mypackage.mymodule ...
The problem is that the tagger modules are not loaded before this part of the code, which instantiates the taggers by name:

for tagger_name in taggers:
    # instantiate the taggers here to make sure they are all valid + download any necessary resources
    tagger = TaggerRegistry.get(tagger_name)
    # delete the tagger after we are done with it so that we don't keep it in memory
    del tagger

This means the dolma tag command crashes with ValueError: Unknown tagger mytagger ...

Some race condition in url taggers

Even with the latest git version, some of the URL taggers crash if I run them with multiprocessing. I can't figure out where this race condition happens. If I run the taggers with --processes 1 first and then with --processes 16, it works.

Support providing streams into mixer via CLI

@IanMagnusson asks

I'm trying to figure out how to mix using the dolma CLI args instead of the config. I want to do something like this, but I can't figure out how to index the streams arg correctly:

dolma mix --streams[0].name "$name" \
    --streams[0].documents "$input_prefix/$file" \
    --streams[0].output.path "$output_prefix/$file" \
    --streams[0].output.max_size_in_bytes 1000000000 \
    --streams[0].attributes s2orc-eval \
    --streams[0].filter.exclude "$@.attributes[?(@.bff_duplicate_paragraph_spans_decontamination && @.bff_duplicate_paragraph_spans_decontamination[0] && @.bff_duplicate_paragraph_spans_decontamination[0][2] >= 1.0)]"

We should support this use case. As a stopgap, we should support echo '{...}' | dolma -c - mix, i.e. allow passing config through stdin.

make_wikipedia.py hardcoded to simple

How to fix:
Change the URL from:
https://dumps.wikimedia.org/simplewiki/{date}/{lang}wiki-{date}-pages-articles-multistream.xml.bz2
to:
https://dumps.wikimedia.org/{lang}wiki/{date}/{lang}wiki-{date}-pages-articles-multistream.xml.bz2

deduplication example does not work

I tested the document deduplication command on a sample dataset and got the following error.

To replicate the problem I encountered, I've generated a dummy dataset consisting of six documents:

import gzip
import json

documents = [
    {"id": str(i), "text": 'dummy', "source": "peS2o"} for i in range(5)
]

documents.append({"id": str(5), "text": 'dummy_test', "source": "peS2o"})

file_path = '/foo/test_on_dummy_dataset/dummy_data.jsonl.gz'

with gzip.open(file_path, 'wt', encoding='UTF-8') as f:
    for document in documents:
        f.write(json.dumps(document) + '\n')

Then tried to run:

RUST_BACKTRACE=full dolma dedupe \
    --documents "/foo/test_on_dummy_dataset/documents/dummy_data.jsonl.gz" \
    --dedupe.documents.attribute_name 'duplicate_documents' \
    --dedupe.skip_empty \
    --bloom_filter.file /tmp/deduper_bloom_filter_test.bin \
    --no-bloom_filter.read_only \
    --bloom_filter.estimated_doc_count '5' \
    --bloom_filter.desired_false_positive_rate '0.0001' \
    --processes 16

(P.S. I have also tried reducing the number of processes to 1; that didn't work either.)

where I obtain the error:

bloom_filter:
  desired_false_positive_rate: 0.0001
  estimated_doc_count: 5
  file: deduper_bloom_filter_test.bin
  read_only: false
  size_in_bytes: 0
dedupe:
  documents:
    attribute_name: duplicate_documents
    key: ???
  name: duplicate_documents
  skip_empty: true
documents:
- /work/github/test_on_dummy_dataset/documents/dummy_data.jsonl.gz
processes: 16
work_dir:
  input: /tmp/dolma-input-4uglco2b
  output: /tmp/dolma-output-wrrv_wfd
[2023-12-18T15:23:01Z INFO  dolma::bloom_filter] Loading bloom filter from "deduper_bloom_filter_test.bin"...
[2023-12-18T15:23:01Z INFO  dolma::deduper] Writing attributes for /work/github/test_on_dummy_dataset/documents/dummy_data.jsonl.gz to /work/github/test_on_dummy_dataset/attributes/duplicate_documents/dummy_data.jsonl.gz.tmp
[2023-12-18T15:23:01Z INFO  dolma::deduper] Writing attributes for /work/github/test_on_dummy_dataset/documents/dummy_data.jsonl.gz to /work/github/test_on_dummy_dataset/attributes/duplicate_documents/dummy_data.jsonl.gz.tmp
thread '<unnamed>' panicked at src/deduper.rs:152:26:
called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: " --> 1:1\n  |\n1 | ???\n  | ^---\n  |\n  = expected chain" }
stack backtrace:
   0:     0x7f01e9d139ec - std::backtrace_rs::backtrace::libunwind::trace::he43a6a3949163f8c
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x7f01e9d139ec - std::backtrace_rs::backtrace::trace_unsynchronized::h50db52ca99f692e7
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x7f01e9d139ec - std::sys_common::backtrace::_print_fmt::hd37d595f2ceb2d3c
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x7f01e9d139ec - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h678bbcf9da6d7d75
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x7f01e9d4297c - core::fmt::rt::Argument::fmt::h3a159adc080a6fc9
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/rt.rs:138:9
   5:     0x7f01e9d4297c - core::fmt::write::hb8eaf5a8e45a738e
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/mod.rs:1094:21
   6:     0x7f01e9d0fd8e - std::io::Write::write_fmt::h9663fe36b2ee08f9
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/io/mod.rs:1714:15
   7:     0x7f01e9d137d4 - std::sys_common::backtrace::_print::hcd4834796ee88ad2
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x7f01e9d137d4 - std::sys_common::backtrace::print::h1360e9450e4f922a
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x7f01e9d14f03 - std::panicking::default_hook::{{closure}}::h2609fa95cd5ab1f4
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:270:22
  10:     0x7f01e9d14c1c - std::panicking::default_hook::h6d75f5747cab6e8d
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:290:9
  11:     0x7f01e9d15489 - std::panicking::rust_panic_with_hook::h57e78470c47c84de
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:707:13
  12:     0x7f01e9d15387 - std::panicking::begin_panic_handler::{{closure}}::h3dfd2453cf356ecb
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:599:13
  13:     0x7f01e9d13f16 - std::sys_common::backtrace::__rust_end_short_backtrace::hdb177d43678e4d7e
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:170:18
  14:     0x7f01e9d150d2 - rust_begin_unwind
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
  15:     0x7f01e9628673 - core::panicking::panic_fmt::hd1e971d8d7c78e0e
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
  16:     0x7f01e9628a7a - core::result::unwrap_failed::hccb456d39e9c31fc
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1652:5
  17:     0x7f01e9709c58 - <F as threadpool::FnBox>::call_box::h26117aa625de9352
  18:     0x7f01e9cc82b0 - std::sys_common::backtrace::__rust_begin_short_backtrace::he93b09d651d5d863
  19:     0x7f01e9cc403a - core::ops::function::FnOnce::call_once{{vtable.shim}}::hcf05def3d47db391
  20:     0x7f01e9d1a955 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::haadd4e5af2ab0d62
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  21:     0x7f01e9d1a955 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::he4ba1fb09c16d807
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  22:     0x7f01e9d1a955 - std::sys::unix::thread::Thread::new::thread_start::he524ecf4b47bee95
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys/unix/thread.rs:108:17
  23:     0x7f01ea51eac3 - <unknown>
  24:     0x7f01ea5b0a40 - <unknown>
  25:                0x0 - <unknown>
[2023-12-18T15:23:01Z INFO  dolma::deduper] Writing bloom filter to "deduper_bloom_filter_test.bin"...
[2023-12-18T15:23:02Z INFO  dolma::deduper] Bloom filter written.
[2023-12-18T15:23:02Z INFO  dolma::deduper] Done!
