
huu4ontocord commented on June 2, 2024

@ChenghaoMou this is similar to deduping.


ChenghaoMou commented on June 2, 2024

Arrow expects all features to be of the same type across all datasets:

raise ValueError("Features must match for all datasets")

and meta in v2 has a type of

meta: struct<headers: struct<content-length: int64, content-type: string, warc-block-digest: string, warc-date: timestamp[us], warc-identified-content-language: string, warc-record-id: string, warc-refers-to: string, warc-target-uri: string, warc-type: string>, nb_sentences: int64, offset: int64>

Not all v1 records have a reference (or duplicate, in this case) in v2. Should we just use some dummy values instead?
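
A quick way to confirm the mismatch is to compare the inferred features of the two datasets before concatenating (v1_dataset and v2_dataset are hypothetical handles for the loaded data):

# Print the feature type Arrow inferred for the meta column in each version.
# v1_dataset and v2_dataset are hypothetical names for the loaded datasets.
print(v1_dataset.features["meta"])
print(v2_dataset.features["meta"])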


ChenghaoMou commented on June 2, 2024

Example v2 metadata for reference:

import datetime

a = {
    'headers': {
        'warc-record-id': '<urn:uuid:0a6aab53-ade6-45f7-a55b-6d18967fd6cf>',
        'warc-date': datetime.datetime(2021, 2, 28, 3, 8, 24),
        'content-type': 'text/plain',
        'content-length': 6490,
        'warc-type': 'conversion',
        'warc-identified-content-language': 'spa,lat',
        'warc-refers-to': '<urn:uuid:a9361c95-6475-4585-a410-16bfad4da8be>',
        'warc-target-uri': 'https://www.funerarialanueva.es/obituary/sara-martinez/',
        'warc-block-digest': 'sha1:QI2XEXXTSVWNF3CDLOWTZLQXVLBHO5UJ'
    },
    'offset': 194,
    'nb_sentences': 1
}



ChenghaoMou commented on June 2, 2024

@ontocord by empty you mean something like this?

import datetime

a = {
    'headers': {
        'warc-record-id': '',
        'warc-date': datetime.datetime(1970, 1, 1),
        'content-type': '',
        'content-length': -1,
        'warc-type': '',
        'warc-identified-content-language': '',
        'warc-refers-to': '',
        'warc-target-uri': '',
        'warc-block-digest': ''
    },
    'offset': -1,
    'nb_sentences': -1
}

Arrow checks the schema recursively, so we need to have the same keys and same value types (str, int, or date).
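
One way to get there, as a minimal sketch: pad the v1 records that have no v2 counterpart with the dummy values from the empty record above before concatenating (fill_meta and v1_dataset are hypothetical names):

import datetime

# Dummy meta record matching the v2 struct schema: same keys, same value types.
EMPTY_META = {
    'headers': {
        'warc-record-id': '',
        'warc-date': datetime.datetime(1970, 1, 1),
        'content-type': '',
        'content-length': -1,
        'warc-type': '',
        'warc-identified-content-language': '',
        'warc-refers-to': '',
        'warc-target-uri': '',
        'warc-block-digest': ''
    },
    'offset': -1,
    'nb_sentences': -1
}

def fill_meta(example):
    # Keep the real v2 meta when present; otherwise pad with the dummy record
    # so every example satisfies the same Arrow schema.
    if not example.get('meta'):
        example['meta'] = EMPTY_META
    return example

# v1_dataset = v1_dataset.map(fill_meta)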

I am using link with a use_auth_token=True parameter. I am assuming this is the v2 vs. the v1 in link.


huu4ontocord commented on June 2, 2024

Yes, I think that works fine. An empty record.


huu4ontocord commented on June 2, 2024

I'm not sure what link stands for?


huu4ontocord commented on June 2, 2024

Actually, how are you creating your dataset? If you are simply saving away the meta data in the registry jsonl file, I think the meta field can actually be just a str. And then the user, when using datasets, could parse it as JSON if they wish. In this way, you can have an empty str if you have no JSON for meta. For our purposes, I think the most interesting information is the URI. Maybe the sha1 too.
I am not sure what the WARC language field stands for, but that looks like it's not "es". Is that "spa" for Spanish?


ChenghaoMou commented on June 2, 2024

@ontocord the links are hyperlinks in Markdown; they should be visible in a browser.

I am using datasets because it has a nice implementation of parallelization, and it plays nicely with what we have in the deduplication script.

The parallelized map function in datasets works by splitting a dataset into shards, processing each shard, and merging the results back together at the end by calling concatenate_datasets. The problem is that concatenation enforces the schema check, hence the error.
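
Roughly, a simplified sketch of that mechanism (the real implementation also handles caching and fingerprinting):

from datasets import load_dataset, concatenate_datasets

ds = load_dataset('json', data_files='temp.jsonl')['train']

# What map(..., num_proc=2) does in essence: shard, process, re-concatenate.
num_proc = 2
shards = [ds.shard(num_shards=num_proc, index=i) for i in range(num_proc)]
processed = [s.map(lambda x: {"new_feature": x["meta"]}) for s in shards]
# This is the step that raises "Features must match for all datasets"
# when the processed shards end up with different inferred schemas.
merged = concatenate_datasets(processed)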

Yes, we can avoid that by working directly on the json files, just for the sake of creating our dataset. But if people try to manipulate the dataset after loading and call the map function with num_proc, it will throw this error. The minimal code to reproduce the error is:

# temp.jsonl
{"text": "test", "meta": {}, "id": 0}
{"text": "text", "meta": {"hash": "random"}, "id": 1}

# temp.py
from datasets import load_dataset
dataset = load_dataset('json', data_files='temp.jsonl')
dataset.map(lambda x: {"new_feature": x["meta"]}, num_proc=2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/chenghao/miniforge3/envs/data/lib/python3.9/site-packages/datasets/dataset_dict.py", line 484, in map
    {
  File "/Users/chenghao/miniforge3/envs/data/lib/python3.9/site-packages/datasets/dataset_dict.py", line 485, in <dictcomp>
    k: dataset.map(
  File "/Users/chenghao/miniforge3/envs/data/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2147, in map
    result = concatenate_datasets(transformed_shards)
  File "/Users/chenghao/miniforge3/envs/data/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3651, in concatenate_datasets
    raise ValueError("Features must match for all datasets")
ValueError: Features must match for all datasets


huu4ontocord commented on June 2, 2024

Cool, did you try this:

# temp.jsonl
{"text": "test", "meta": "{}", "id": 0}
{"text": "text", "meta": "{\"hash\": \"random\"}", "id": 1}

# temp.py
from datasets import load_dataset
dataset = load_dataset('json', data_files='/content/temp.jsonl')
ds = dataset.map(lambda x: {"new_feature": x["meta"]}, num_proc=2)

# now you can read json.loads(ds[1]["meta"])

What I propose is to save the meta field data as a str, not an embedded json object.
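
For the writing side, a minimal sketch of producing the registry jsonl with meta serialized to a string (the records here are just the toy examples from above):

import json

records = [
    {"text": "test", "meta": {}, "id": 0},
    {"text": "text", "meta": {"hash": "random"}, "id": 1},
]

with open("temp.jsonl", "w") as f:
    for record in records:
        # Serialize meta to a plain string so Arrow sees a uniform
        # meta: string column, whatever keys each record carries.
        record["meta"] = json.dumps(record["meta"])
        f.write(json.dumps(record) + "\n")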


ChenghaoMou commented on June 2, 2024

That works too. But it also introduces another issue — we need a place to document all the keys available in that JSON string field.


huu4ontocord commented on June 2, 2024

Yes, I agree. I don't know how to use the new-style datasets that just load from a repo. But in the old way, you create metadata info in your dataset loading code. You could also put the documentation in our ac_dc code itself.
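
For example, the documentation could live as a simple constant next to the loading code. META_KEYS is a hypothetical name, and the descriptions are best-effort readings of the WARC header semantics:

# Hypothetical constant documenting the keys that may appear in the
# JSON-encoded meta string. Descriptions are best-effort.
META_KEYS = {
    "headers.warc-record-id": "WARC record identifier (urn:uuid)",
    "headers.warc-date": "Date of the WARC record",
    "headers.content-type": "MIME type of the payload, e.g. text/plain",
    "headers.content-length": "Payload length in bytes",
    "headers.warc-type": "WARC record type, e.g. conversion",
    "headers.warc-identified-content-language": "Detected language code(s), e.g. spa,lat",
    "headers.warc-refers-to": "Record this one was derived from",
    "headers.warc-target-uri": "Original URI of the document",
    "headers.warc-block-digest": "sha1 digest of the record block",
    "offset": "Offset of the document in the source shard",
    "nb_sentences": "Number of sentences in the document",
}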

