
dolma's Introduction

Dolma's official logo: "dolma" written in yellow, rounded lowercase letters on a blue background.

Dolma is two things:

  1. Dolma Dataset: an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials.
  2. Dolma Toolkit: a high-performance toolkit for curating datasets for language modeling -- this repo contains the source code for the Dolma Toolkit.

Dolma Dataset

Dolma is an open dataset of 3 trillion tokens from a diverse mix of web content, academic publications, code, books, and encyclopedic materials. It was created as a training corpus for OLMo, a language model from the Allen Institute for AI (AI2).

Dolma is available for download on the HuggingFace 🤗 Hub: huggingface.co/datasets/allenai/dolma. Dolma is licensed under ODC-BY; see our blog post for an explanation.
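
For a quick look at the data, a minimal sketch like the following may work, assuming the datasets library is installed; the exact configuration or version name to pass can differ, so check the dataset card on the Hub.

# Hedged sketch: stream a few Dolma documents from the Hugging Face Hub.
# The dataset id comes from the link above; the "text" field follows the
# document format used elsewhere in this repo and may need adjusting.
from datasets import load_dataset

dolma = load_dataset("allenai/dolma", split="train", streaming=True)
for i, doc in enumerate(dolma):
    print(doc["text"][:80].replace("\n", " "))
    if i == 2:
        break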

You can also read more about Dolma in our announcement, as well as by consulting its data sheet.

Dolma Toolkit

This repository houses the Dolma Toolkit, which enables curation of large datasets for (pre)-training ML models. Its key features are:

  1. High Performance ⚡: Can process billions of documents concurrently thanks to built-in parallelism.
  2. Portability 🧳: Works on a single machine, a cluster, or a cloud environment.
  3. Built-In Taggers 🏷: Includes ready-to-use taggers commonly used to curate datasets such as Gopher, C4, and OpenWebText.
  4. Fast Deduplication 🗑: Speedy document deduplication using a Rust Bloom filter.
  5. Extensibility 🧩 & Cloud Support ☁: Supports custom taggers and AWS S3-compatible locations.

To install, simply type pip install dolma in your terminal.

To learn more about how to use the Dolma Toolkit, please visit the documentation.

Citation

If you use the Dolma dataset or toolkit, please cite the following items:

@article{dolma,
  title = {{Dolma: An Open Corpus of Three Trillion Tokens for Language Model Pretraining Research}},
  author={Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and Russell Authur and Ben Bogin and Khyathi Chandu and Jennifer Dumas and Yanai Elazar and Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Nathan Lambert and Ian Magnusson and Jacob Morrison and Niklas Muennighoff and Aakanksha Naik and Crystal Nam and Matthew E. Peters and Abhilasha Ravichander and Kyle Richardson and Zejiang Shen and Emma Strubell and Nishant Subramani and Oyvind Tafjord and Pete Walsh and Luke Zettlemoyer and Noah A. Smith and Hannaneh Hajishirzi and Iz Beltagy and Dirk Groeneveld and Jesse Dodge and Kyle Lo},
  year={2024},
  journal={arXiv preprint},
  url={https://arxiv.org/abs/2402.00159}
}

dolma's People

Contributors

arnavic, benbogin, chris-ha458, dependabot[bot], dirkgr, drschwenk, eltociear, ianand, ianmagnusson, kennethenevoldsen, kyleclo, muennighoff, peterbjorgensen, rodneykinney, rohitrathore1, simonw, soldni, undfined, whattabatt

dolma's Issues

Provenance license?

Hi, I'm researching provenance license/consent risk for clients. The risk being managed is "risk of litigation requiring derivative works such as LLMs to be taken down as a result of copyright violation".

I can't immediately find any resources regarding dolma that address this. I can see some ways it could be addressed, for example by only crawling content that carries a clear statement of its license (such as Creative Commons).

Apologies if this was made clear somewhere!

๐Ÿ™ in advanceโ€ฆ

Data sheet link in README is broken

This link is a 404:

You can also read more about Dolma in [our announcement](https://blog.allenai.org/dolma-3-trillion-tokens-open-llm-corpus-9a0ff4b8da64), as well as by consulting its [data sheet](docs/assets/dolma-datasheet-v0.1.pdf).

docs/assets/dolma-datasheet-v0.1.pdf

make_wikipedia.py: long running time

Hi, Thank you for sharing this outstanding repository!

I have been trying to use scripts/make_wikipedia.py to process a German Wikipedia dump:

python scripts/make_wikipedia.py --output wikipedia --lang de  --date 20240201 --processes 16

Unfortunately, it has been running for several days, and judging from the output it seems to have made very little progress, if I am interpreting it correctly:

[...]
WARNING:root:Template errors in article 'Buckenhof' (395836): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Imsterberg' (395533): title(0) recursion(7929961, 0, 0)
WARNING:root:Template errors in article 'Spardorf' (395843): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Marloffstein' (395848): title(0) recursion(96, 0, 0)
WARNING:root:Template errors in article 'Karres' (395572): title(0) recursion(7929961, 0, 0)
[...]

At this speed, it would take weeks to complete. Using htop I can see that all processes are busy, so I don't think this is a multiprocessing problem (#58); however, I am also running it on a Linux machine.

This is likely a problem with the underlying wikiextractor library, but since that project seems to have little to no activity, I am interested in your experience using this script. Is it normal for it to take this long?

not_alphanum_paragraph_v1 tagger takes forever to run on certain inputs.

While running taggers on the hplt dataset, I encountered a problem where the not_alphanum_paragraph_v1 tagger stalls forever. To debug the problem, I created a minimal working example by copy-pasting some code from the TaggerProcessor. I have attached the debugging code in this archive with some text that triggers the problem.
mwe.tar.gz

It looks like long sequences of emojis stall the tagger forever. Here are some timings on emoji text from the hplt dataset:

InputSpec(id='7', text='๐Ÿ˜  ๐Ÿ˜ก', source='hplt1.2', version=None)
took 0.000039 seconds

InputSpec(id='4', text='๐Ÿ˜  ๐Ÿ˜ก ๐Ÿ˜ค ๐Ÿ˜‹ ๐Ÿ˜Ž ๐ŸŒฆ ๐ŸŒง ๐ŸŒœ ๐ŸŒˆ ๐Ÿ ๐ŸŽ…\n\nAnti-Spam: *\nSpรธrgsmรฅl: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 0.000025 seconds

InputSpec(id='11', text='๐Ÿ˜  ๐Ÿ˜ก ๐Ÿ˜ค ๐Ÿ˜‹ ๐Ÿ˜Ž ๐Ÿ˜ด ๐Ÿ˜ˆ ๐Ÿ˜‡ ๐Ÿ˜• ๐Ÿ˜ ๐Ÿ˜‘ ๐Ÿ‘ฒ ๐Ÿ‘ฎ ๐Ÿ’‚ ๐Ÿ‘ถ โค ๐Ÿ’” ๐Ÿ’• ๐Ÿ’˜ ๐Ÿ’Œ ๐Ÿ’‹ ๐ŸŽ ๐Ÿ’ฐ ๐Ÿ’ ๐Ÿ‘ ๐Ÿ‘Ž ๐Ÿ‘Œ โœŒ๏ธ ๐Ÿค˜ ๐Ÿ‘ ๐ŸŽต โ˜•๏ธ ๐Ÿต Anti-Spam: *\nSpรธrgsmรฅl: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 0.000021 seconds

InputSpec(id='5', text='๐Ÿ˜  ๐Ÿ˜ก ๐Ÿ˜ค ๐Ÿ˜‹ ๐Ÿ˜Ž ๐Ÿ˜ด ๐Ÿ˜ˆ ๐Ÿ˜‡ ๐Ÿ˜• ๐Ÿ˜ ๐Ÿ˜‘ ๐Ÿ‘ฒ ๐Ÿ‘ฎ ๐Ÿ’‚ ๐Ÿ‘ถ โค ๐Ÿ’” ๐Ÿ’• ๐Ÿ’˜ ๐Ÿ’Œ ๐Ÿ’‹ ๐ŸŽ ๐Ÿ’ฐ ๐Ÿ’ ๐Ÿ‘ ๐Ÿ‘Ž ๐Ÿ‘Œ โœŒ๏ธ ๐Ÿค˜ ๐Ÿ‘ ๐ŸŽต โ˜•๏ธ ๐Ÿต ๐Ÿบ ๐Ÿท ๐Ÿผ โ˜€๏ธ ๐ŸŒค ๐ŸŒฆ ๐ŸŒง ๐ŸŒœ ๐ŸŒˆ ๐Ÿ ๐ŸŽ…\n\nAnti-Spam: *\nSpรธrgsmรฅl: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
took 64.204857 seconds

InputSpec(id='3', text='\nGรฆstebogs indlรฆg: *\n๐Ÿ˜„ ๐Ÿ˜ƒ ๐Ÿ˜Š ๐Ÿ˜‰ ๐Ÿ˜ ๐Ÿ˜š ๐Ÿ˜— ๐Ÿ˜œ ๐Ÿ˜› ๐Ÿ˜ณ ๐Ÿ˜ ๐Ÿ˜ฌ ๐Ÿ˜Œ ๐Ÿ˜ž ๐Ÿ˜ข ๐Ÿ˜‚ ๐Ÿ˜ญ ๐Ÿ˜… ๐Ÿ˜“ ๐Ÿ˜ฉ ๐Ÿ˜ฎ ๐Ÿ˜ฑ ๐Ÿ˜  ๐Ÿ˜ก ๐Ÿ˜ค ๐Ÿ˜‹ ๐Ÿ˜Ž ๐Ÿ˜ด ๐Ÿ˜ˆ ๐Ÿ˜‡ ๐Ÿ˜• ๐Ÿ˜ ๐Ÿ˜‘ ๐Ÿ‘ฒ ๐Ÿ‘ฎ ๐Ÿ’‚ ๐Ÿ‘ถ โค ๐Ÿ’” ๐Ÿ’• ๐Ÿ’˜ ๐Ÿ’Œ ๐Ÿ’‹ ๐ŸŽ ๐Ÿ’ฐ ๐Ÿ’ ๐Ÿ‘ ๐Ÿ‘Ž ๐Ÿ‘Œ โœŒ๏ธ ๐Ÿค˜ ๐Ÿ‘ ๐ŸŽต โ˜•๏ธ ๐Ÿต ๐Ÿบ ๐Ÿท ๐Ÿผ โ˜€๏ธ ๐ŸŒค ๐ŸŒฆ ๐ŸŒง ๐ŸŒœ ๐ŸŒˆ ๐Ÿ ๐ŸŽ…\n\nAnti-Spam: *\nSpรธrgsmรฅl: Hvad er summen af (total sum of) 9+3\n\nWarning:', source='hplt1.2', version=None)
... takes 'forever'

It seems to be a bug in the regex Python package. If I swap the regex package for the standard library re package, it takes only milliseconds again. I am not sure which feature of the regex package makes it necessary here, but this bug makes me question whether something similar will happen with other regex queries.
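
As a rough way to compare the two engines on the same input (the pattern and text below are stand-ins, not the tagger's actual expression or the offending hplt document), something like this reproduces the kind of timing comparison described above:

import re
import time

import regex  # third-party package used by the tagger

PATTERN = r"[^\s\w]+"            # hypothetical pattern, for illustration only
TEXT = "😠 😡 😤 " * 1000          # hypothetical emoji-heavy input

for engine in (re, regex):
    compiled = engine.compile(PATTERN)
    start = time.perf_counter()
    compiled.findall(TEXT)
    print(engine.__name__, f"{time.perf_counter() - start:.6f} seconds")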

We encountered the bug while trying to create an overview of the taggers:
centre-for-humanities-computing/danish-foundation-models#207 (comment)

Tokenizer name or path must be found error

I am experiencing an issue while tokenizing the Wikipedia dataset from the getting-started steps. My tokenizer file is in the root of this repository, and my relative path is: dolma/EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json.

The traceback of the error is as follows:

~/dolma$ dolma tokens \
>     --documents "wikipedia/example0/documents/*.gz" \
>     --tokenizer_name_or_path "EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json" \
>     --destination wikipedia/example0/tokens \
>     --processes 16
batch_size: 10000
debug: false
destination: wikipedia/example0/tokens
documents:
- wikipedia/example0/documents/*.gz
dryrun: false
dtype: uint16
files_per_process: null
max_size: 1073741824
processes: 16
ring_size: 8
seed: 3920
tokenizer:
  bos_token_id: null
  eos_token_id: null
  name_or_path: null
  pad_token_id: null
  segment_before_tokenization: false
tokenizer_name_or_path: dolma/EleutherAI/allenai_eleuther-ai-gpt-neox-20b-pii-special.json
work_dir:
  input: null
  output: null
Traceback (most recent call last):
  File "/home/TeAmP0is0N/anaconda3/envs/dolma/bin/dolma", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/TeAmP0is0N/dolma/python/dolma/cli/__main__.py", line 91, in main
    return cli.run_from_args(args=args, config=config)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/TeAmP0is0N/dolma/python/dolma/cli/__init__.py", line 190, in run_from_args
    return cls.run(parsed_config)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/TeAmP0is0N/dolma/python/dolma/cli/tokenizer.py", line 181, in run
    raise DolmaConfigError("Tokenizer name or path must be provided.")
dolma.core.errors.DolmaConfigError: Tokenizer name or path must be provided.

Change bloom_filter implementation of hash

Currently, bloom_filter.rs uses ahash as its internal hasher.

This is problematic since ahash has an unstable representation:

different computers or computers on different versions of the code will observe different hash values. As such, aHash is not recommended for use other than in-memory maps. Specifically, aHash is not intended for network use or in applications which persist hashed values.

I would love to learn whether the dolma developers have found a way to serialize it while maintaining some kind of portability, but that is not a supported use case, and I feel there is a benefit in moving to a stable hash.

Recommendations

  • Rust's default hasher is reasonably fast and designed to resist HashDoS attacks. Currently it is SipHash-1-3, and it supports keyed hashing (which can be used as a seeded hash).
  • Blake3 is one of the fastest, if not the fastest, cryptographic hashes. It also supports keyed hashing.
  • xxhash (more specifically the xxh3 iteration) is one of the fastest, if not the fastest, hashers that pass SMHasher.
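
To illustrate the portability point in Python terms (an analogy, not dolma code): a keyed, stable hash such as BLAKE2 maps the same input to the same bucket on any machine or version, which is what a persisted Bloom filter needs.

import hashlib

def stable_bucket(paragraph: str, num_bits: int, key: bytes = b"fixed-seed") -> int:
    # BLAKE2b with an explicit key is stable across machines, processes, and
    # library versions, unlike a hasher keyed with a per-process random seed.
    digest = hashlib.blake2b(paragraph.encode("utf-8"), key=key, digest_size=8).digest()
    return int.from_bytes(digest, "little") % num_bits

print(stable_bucket("hello world", num_bits=1 << 20))  # same value on every run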

make_wikipedia in getting_started.md

The command for running make_wikipedia needs to be edited. The one currently in the document:

python scripts/make_wikipedia.py \
  --output wikipedia \
  --languages simple \
  --date 20231001 \
  --lang simple \
  --num_processes 16

should be written as below:

python scripts/make_wikipedia.py \
  --output wikipedia \
  --date 20231001 \
  --lang simple \
  --processes 16

Blank documents in common crawl data

Hi, I've been exploring the common crawl data which I downloaded from huggingface, and I noticed there seem to be a lot of rows with blank text.

For example, in data/common-crawl/cc_en_head/cc_en_head-0000.json.gz, I found that ~12.25% of rows had empty text fields.


Query:

select
count(*) as all_rows
, sum(case when text = '' then 1 else 0 end) as null_text
, sum(case when text = '' then 1 else 0 end)/count(*) as ratio
from cc_en_head_0000
where true 

Example ID:
http://009.housedems.com/article/dems-call-restoration-stolen-wages-mi-workers
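
For reference, the same ratio can be checked locally with a few lines of Python (the path below is just the shard named above):

import gzip
import json

# Count rows with an empty "text" field in one downloaded shard.
path = "data/common-crawl/cc_en_head/cc_en_head-0000.json.gz"
total = empty = 0
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        doc = json.loads(line)
        total += 1
        empty += doc.get("text", "") == ""
print(f"{empty}/{total} rows ({empty / total:.2%}) have empty text")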

make_wikipedia.py fails on linux

Traceback (most recent call last):
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 283, in _multiprocessing_run_all
    multiprocessing.set_start_method("spawn")
  File "/usr/lib/python3.11/multiprocessing/context.py", line 247, in set_start_method
    raise RuntimeError('context has already been set')
RuntimeError: context has already been set

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/peter/kode/dolma/scripts/make_wikipedia.py", line 289, in <module>
    main()
  File "/home/peter/kode/dolma/scripts/make_wikipedia.py", line 285, in main
    processor(date=args.date, lang=args.lang)
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 390, in __call__
    fn(
  File "/home/peter/kode/dolma/dolma_env/lib/python3.11/site-packages/dolma/core/parallel.py", line 285, in _multiprocessing_run_all
    assert multiprocessing.get_start_method() == "spawn", "Multiprocessing start method must be spawn"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: Multiprocessing start method must be spawn

The bug can be fixed by calling multiprocessing.set_start_method("spawn") under the script's __main__ guard.

Perhaps dolma's core/parallel.py should use multiprocessing.get_context("spawn") to avoid this.
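
A minimal sketch of the get_context approach (a standalone example, not dolma code):

import multiprocessing

def work(x):
    return x * x

if __name__ == "__main__":
    # An explicit context avoids mutating the global start method, so library
    # code can no longer hit "context has already been set".
    ctx = multiprocessing.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(work, range(8)))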

Dolma stat crashes because number of bins overflows python integer

The NUM_BINS constant in python/dolma/core/analyzer.py is 100k by default, and this value makes the 10**NUM_BINS expression in FixedBucketsValTracker overflow. The _make_tracker function does not use the number of bins from the config but uses the constant value; I guess this is because the counts are then summarized to the correct number of bins at the end?
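
For reference, the overflow is easy to reproduce in isolation (not dolma code, just the failing arithmetic):

NUM_BINS = 100_000
m = 0.5
# Multiplying a float by 10**NUM_BINS forces an int-to-float conversion that
# cannot be represented, matching the traceback below.
k = int(m * 10 ** NUM_BINS)  # OverflowError: int too large to convert to float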

I have tried using the InferBucketsValTracker instead, and it seems to work. However, the bins array in the results is sometimes one element larger than the counts, which is expected if the bins array represents the bin edges, but sometimes bins and counts have the same length, so I am not sure what the bins in the final result represent.

dolma stat --attributes "mC4_da/attributes/v0tags/*.json.gz" --bins 100 --processes 12 --report v0tags_report2
attributes:
- mC4_da/attributes/v0tags/*.json.gz
bins: 100
debug: false
processes: 12
regex: null
report: v0tags_report2
seed: 0
work_dir:
  input: null
  output: null
Found 1,024 files to process
files: 0.00f [00:00, ?f/s]
documents: 0.00d [00:00, ?d/s]
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
  File "/usr/lib/python3.11/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
                    ^^^^^^^^^^^^^^^^^^^
  File "/home/peter/kode/dolma_clean/python/dolma/core/parallel.py", line 174, in _process_single_and_save_status
    cls.process_single(
  File "/home/peter/kode/dolma_clean/python/dolma/core/analyzer.py", line 120, in process_single
    trackers.setdefault(f"{attr_name}/score", _make_tracker()).add(score)
  File "/home/peter/kode/dolma_clean/python/dolma/core/binning.py", line 245, in add
    k = int(m * self.n), e
            ~~^~~~~~~~
OverflowError: int too large to convert to float
"""

How is Exact paragraph deduplication performed?

hi, thank you for your great work.

I am wondering how exactly the "Exact paragraph deduplication" operation is carried out.

As I understand it, "Exact paragraph deduplication" follows these steps (a rough sketch in code follows the list):

  1. split each document into paragraphs
  2. detect duplicated paragraphs (using the bloom filter)
  3. remove duplicated paragraphs.
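
Here is that rough sketch, using a plain set in place of the Bloom filter and keeping the first occurrence of each paragraph; this is only my reading of the steps, not necessarily how dolma implements it:

seen = set()

def dedupe_paragraphs(text: str) -> str:
    kept = []
    for paragraph in text.split("\n"):   # 1. split the document into paragraphs
        key = paragraph.strip()
        if key and key in seen:          # 2. detect paragraphs seen before
            continue                     # 3. drop the repeated paragraph
        seen.add(key)
        kept.append(paragraph)
    return "\n".join(kept)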

However, there are a few questions:

  1. for step 3, assume a paragraph is duplicated N times. It is then reasonable to remove N-1 of the duplicates. I am wondering which one of the N paragraphs should be retained and which N-1 should be removed?
  2. removing a paragraph from a document will usually hurt the original document. Will this degrade the data quality?

Only the attributes written by the last tagger in the tagger list gets written in version 1.0.0

Since upgrading dolma to version 1.0.0, I only get the attributes from the last tagger in the list.
I think the problem is here:

# if not set; it will potentially not write to the output stream
# in case a tagger emits no spans
attributes_by_stream[tagger_output.path] = {}

tagger_output.path is the same for all the taggers in the list, but attributes_by_stream[tagger_output.path] is reset to an empty dictionary on each iteration of the loop over taggers, leaving only the attributes from the last tagger in the list.
This bug is not present in version 0.9.4.
I would submit a pull request, but I am not sure what these three lines are supposed to fix.
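
If the intent of those lines is only to ensure the key exists, one possible (untested) change would be to initialize it without overwriting what earlier taggers already wrote:

# if not set, initialize the key without clobbering attributes already
# written by earlier taggers for the same path (untested suggestion)
attributes_by_stream.setdefault(tagger_output.path, {})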

Potential race condition

I have been looking into https://github.com/allenai/dolma/blob/main/src/bloom_filter.rs, specifically how it is thread-safe:

pub fn contains_hashes(&self, hashes: &Vec<u64>) -> bool {
    for hash in hashes {
        let hash = *hash as usize;
        let index = hash / 32 % self.bits.len();
        let bit = hash % 32;
        if self.bits[index].load(Ordering::Relaxed) & (1 << bit) == 0 {
            return false;
        }
    }
    return true;
}

Here, while access to each individual AtomicU32 value within the array is atomic, the method checks a condition over multiple AtomicU32 values without a lock on the Vec. As such, multiple threads could be accessing individual AtomicU32 values independently of each other, for both reads and writes.
If one or more of the values changes due to another thread's write during the execution of contains_hashes, it might lead to an incorrect result.

Consider the scenario where thread A is executing contains_hashes and finds that all the required bits are set. Before thread A completes execution, thread B updates one of the bits that thread A already checked. Since the check is not atomic across the entire vector, thread A's result becomes stale, and the method might return false when it should return true.
This problem is compounded if more threads are changing other items that have already been checked by other threads.
If there is a higher rate of true positives (duplicates), the issue is also exacerbated.
This is particularly concerning since Bloom filters are not supposed to produce false negatives; the problem becomes worse as we increase the number of threads or when the underlying data has a lot of duplicates.

In summary, the code exhibits a race condition where the result of the contains_hashes method can be affected by concurrent updates to the bits vector. The individual accesses to AtomicU32 are atomic, but the method's logic requires a consistent view of multiple AtomicU32 values, and this consistency is not guaranteed.

To correct this, a higher-level synchronization mechanism (e.g., a read-write lock on the whole Vec) might be required to ensure that the entire check operation in contains_hashes is atomic with respect to updates to the bits vector.

This is the approach taken by another Rust crate, ofilter. Relevant source:

pub struct SyncBloom<T> {
    inner: Arc<RwLock<Bloom<T>>>,
}

A similar approach is taken by Google's Guava library.
Their solution is neither complete nor comprehensive, with many tradeoffs in the data structures, the types and methods operating on them, and tests to validate that those tradeoffs do not cause too much error.
(I do not like this approach, but apparently it is good enough for them.)

tagger_modules do not work in current git version

The modules show up with dolma list --tagger_modules mypackage.mymodule, but it crashes if you run dolma tag --tagger_modules mypackage.mymodule ...
The problem is that the tagger modules are not loaded before this part of the code, which instantiates the taggers by name:

for tagger_name in taggers:
    # instantiate the taggers here to make sure they are all valid + download any necessary resources
    tagger = TaggerRegistry.get(tagger_name)
    # delete the tagger after we are done with it so that we don't keep it in memory
    del tagger

This means the dolma tag command crashes with ValueError: Unknown tagger mytagger ...

Some race condition in url taggers

Even with the latest git version, some of the URL taggers crash if I run them with multiprocessing. I can't figure out where this race condition happens. If I run the taggers with --processes 1 first and then with --processes 16, it works.

Support providing streams into mixer via CLI

@IanMagnusson asks

I'm trying to figure out how to mix using the dolma CLI args instead of the config. I want to do something like this, but I can't figure out how to index the streams arg correctly:

dolma mix --streams[0].name "$name" \
    --streams[0].documents "$input_prefix/$file" \
    --streams[0].output.path "$output_prefix/$file" \
    --streams[0].output.max_size_in_bytes 1000000000 \
    --streams[0].attributes s2orc-eval \
    --streams[0].filter.exclude "$@.attributes[?(@.bff_duplicate_paragraph_spans_decontamination && @.bff_duplicate_paragraph_spans_decontamination[0] && @.bff_duplicate_paragraph_spans_decontamination[0][2] >= 1.0)]"

We should support this use case. As a stopgap, we should support echo '{...}' | dolma -c - mix, i.e. allow passing config through stdin.

make_wikipedia.py hardcoded to simple

How to fix:
Change the URL from:
https://dumps.wikimedia.org/simplewiki/{date}/{lang}wiki-{date}-pages-articles-multistream.xml.bz2
to:
https://dumps.wikimedia.org/{lang}wiki/{date}/{lang}wiki-{date}-pages-articles-multistream.xml.bz2

deduplication example does not work

I tested the document deduplication command on a sample dataset and got the following error.

To replicate the problem I encountered, I've generated a dummy dataset consisting of six documents:

import gzip
import json

documents = [
    {"id": str(i), "text": 'dummy', "source": "peS2o"} for i in range(5)
]

documents.append({"id": str(5), "text": 'dummy_test', "source": "peS2o"})

file_path = '/foo/test_on_dummy_dataset/dummy_data.jsonl.gz'

with gzip.open(file_path, 'wt', encoding='UTF-8') as f:
    for document in documents:
        f.write(json.dumps(document) + '\n')

Then tried to run:

RUST_BACKTRACE=full dolma dedupe \
    --documents "/foo/test_on_dummy_dataset/documents/dummy_data.jsonl.gz" \
    --dedupe.documents.attribute_name 'duplicate_documents' \
    --dedupe.skip_empty \
    --bloom_filter.file /tmp/deduper_bloom_filter_test.bin \
    --no-bloom_filter.read_only \
    --bloom_filter.estimated_doc_count '5' \
    --bloom_filter.desired_false_positive_rate '0.0001' \
    --processes 16

(P.S. I have also tried reducing the number of processes to 1; that didn't work either.)

where I obtain the error:

bloom_filter:
  desired_false_positive_rate: 0.0001
  estimated_doc_count: 5
  file: deduper_bloom_filter_test.bin
  read_only: false
  size_in_bytes: 0
dedupe:
  documents:
    attribute_name: duplicate_documents
    key: ???
  name: duplicate_documents
  skip_empty: true
documents:
- /work/github/test_on_dummy_dataset/documents/dummy_data.jsonl.gz
processes: 16
work_dir:
  input: /tmp/dolma-input-4uglco2b
  output: /tmp/dolma-output-wrrv_wfd
[2023-12-18T15:23:01Z INFO  dolma::bloom_filter] Loading bloom filter from "deduper_bloom_filter_test.bin"...
[2023-12-18T15:23:01Z INFO  dolma::deduper] Writing attributes for /work/github/test_on_dummy_dataset/documents/dummy_data.jsonl.gz to /work/github/test_on_dummy_dataset/attributes/duplicate_documents/dummy_data.jsonl.gz.tmp
[2023-12-18T15:23:01Z INFO  dolma::deduper] Writing attributes for /work/github/test_on_dummy_dataset/documents/dummy_data.jsonl.gz to /work/github/test_on_dummy_dataset/attributes/duplicate_documents/dummy_data.jsonl.gz.tmp
thread '<unnamed>' panicked at src/deduper.rs:152:26:
called `Result::unwrap()` on an `Err` value: Custom { kind: Other, error: " --> 1:1\n  |\n1 | ???\n  | ^---\n  |\n  = expected chain" }
stack backtrace:
   0:     0x7f01e9d139ec - std::backtrace_rs::backtrace::libunwind::trace::he43a6a3949163f8c
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/libunwind.rs:93:5
   1:     0x7f01e9d139ec - std::backtrace_rs::backtrace::trace_unsynchronized::h50db52ca99f692e7
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/../../backtrace/src/backtrace/mod.rs:66:5
   2:     0x7f01e9d139ec - std::sys_common::backtrace::_print_fmt::hd37d595f2ceb2d3c
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:67:5
   3:     0x7f01e9d139ec - <std::sys_common::backtrace::_print::DisplayBacktrace as core::fmt::Display>::fmt::h678bbcf9da6d7d75
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:44:22
   4:     0x7f01e9d4297c - core::fmt::rt::Argument::fmt::h3a159adc080a6fc9
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/rt.rs:138:9
   5:     0x7f01e9d4297c - core::fmt::write::hb8eaf5a8e45a738e
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/fmt/mod.rs:1094:21
   6:     0x7f01e9d0fd8e - std::io::Write::write_fmt::h9663fe36b2ee08f9
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/io/mod.rs:1714:15
   7:     0x7f01e9d137d4 - std::sys_common::backtrace::_print::hcd4834796ee88ad2
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:47:5
   8:     0x7f01e9d137d4 - std::sys_common::backtrace::print::h1360e9450e4f922a
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:34:9
   9:     0x7f01e9d14f03 - std::panicking::default_hook::{{closure}}::h2609fa95cd5ab1f4
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:270:22
  10:     0x7f01e9d14c1c - std::panicking::default_hook::h6d75f5747cab6e8d
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:290:9
  11:     0x7f01e9d15489 - std::panicking::rust_panic_with_hook::h57e78470c47c84de
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:707:13
  12:     0x7f01e9d15387 - std::panicking::begin_panic_handler::{{closure}}::h3dfd2453cf356ecb
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:599:13
  13:     0x7f01e9d13f16 - std::sys_common::backtrace::__rust_end_short_backtrace::hdb177d43678e4d7e
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys_common/backtrace.rs:170:18
  14:     0x7f01e9d150d2 - rust_begin_unwind
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/panicking.rs:595:5
  15:     0x7f01e9628673 - core::panicking::panic_fmt::hd1e971d8d7c78e0e
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/panicking.rs:67:14
  16:     0x7f01e9628a7a - core::result::unwrap_failed::hccb456d39e9c31fc
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/core/src/result.rs:1652:5
  17:     0x7f01e9709c58 - <F as threadpool::FnBox>::call_box::h26117aa625de9352
  18:     0x7f01e9cc82b0 - std::sys_common::backtrace::__rust_begin_short_backtrace::he93b09d651d5d863
  19:     0x7f01e9cc403a - core::ops::function::FnOnce::call_once{{vtable.shim}}::hcf05def3d47db391
  20:     0x7f01e9d1a955 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::haadd4e5af2ab0d62
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  21:     0x7f01e9d1a955 - <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once::he4ba1fb09c16d807
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/alloc/src/boxed.rs:2007:9
  22:     0x7f01e9d1a955 - std::sys::unix::thread::Thread::new::thread_start::he524ecf4b47bee95
                               at /rustc/cc66ad468955717ab92600c770da8c1601a4ff33/library/std/src/sys/unix/thread.rs:108:17
  23:     0x7f01ea51eac3 - <unknown>
  24:     0x7f01ea5b0a40 - <unknown>
  25:                0x0 - <unknown>
[2023-12-18T15:23:01Z INFO  dolma::deduper] Writing bloom filter to "deduper_bloom_filter_test.bin"...
[2023-12-18T15:23:02Z INFO  dolma::deduper] Bloom filter written.
[2023-12-18T15:23:02Z INFO  dolma::deduper] Done!
