
huu4ontocord commented on June 2, 2024

@ChenghaoMou this is similar to deduping.


ChenghaoMou commented on June 2, 2024

Arrow expects all features to be of the same type across all datasets:

raise ValueError("Features must match for all datasets")

and meta in v2 has a type of

meta: struct<headers: struct<content-length: int64, content-type: string, warc-block-digest: string, warc-date: timestamp[us], warc-identified-content-language: string, warc-record-id: string, warc-refers-to: string, warc-target-uri: string, warc-type: string>, nb_sentences: int64, offset: int64>

Not all v1 records have a reference (or duplicate, in this case) in v2. Should we just use some dummy values instead?
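
A quick way to confirm the mismatch is to compare the inferred features of the two datasets before concatenating (v1_dataset and v2_dataset are hypothetical handles for the loaded data):

# Print the feature type Arrow inferred for the meta column in each version.
# v1_dataset and v2_dataset are hypothetical names for the loaded datasets.
print(v1_dataset.features["meta"])
print(v2_dataset.features["meta"])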


ChenghaoMou commented on June 2, 2024

Example v2 metadata for reference:

import datetime

a = {
    'headers': {
        'warc-record-id': '<urn:uuid:0a6aab53-ade6-45f7-a55b-6d18967fd6cf>',
        'warc-date': datetime.datetime(2021, 2, 28, 3, 8, 24),
        'content-type': 'text/plain',
        'content-length': 6490,
        'warc-type': 'conversion',
        'warc-identified-content-language': 'spa,lat',
        'warc-refers-to': '<urn:uuid:a9361c95-6475-4585-a410-16bfad4da8be>',
        'warc-target-uri': 'https://www.funerarialanueva.es/obituary/sara-martinez/',
        'warc-block-digest': 'sha1:QI2XEXXTSVWNF3CDLOWTZLQXVLBHO5UJ'
    },
    'offset': 194,
    'nb_sentences': 1
}



ChenghaoMou commented on June 2, 2024

@ontocord by empty you mean something like this?

import datetime

a = {
    'headers': {
        'warc-record-id': '',
        'warc-date': datetime.datetime(1970, 1, 1),
        'content-type': '',
        'content-length': -1,
        'warc-type': '',
        'warc-identified-content-language': '',
        'warc-refers-to': '',
        'warc-target-uri': '',
        'warc-block-digest': ''
    },
    'offset': -1,
    'nb_sentences': -1
}

Arrow checks the schema recursively, so we need to have the same keys and same value types (str, int, or date).
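
One way to get there, as a minimal sketch: pad the v1 records that have no v2 counterpart with the dummy values from the empty record above before concatenating (fill_meta and v1_dataset are hypothetical names):

import datetime

# Dummy meta record matching the v2 struct schema: same keys, same value types.
EMPTY_META = {
    'headers': {
        'warc-record-id': '',
        'warc-date': datetime.datetime(1970, 1, 1),
        'content-type': '',
        'content-length': -1,
        'warc-type': '',
        'warc-identified-content-language': '',
        'warc-refers-to': '',
        'warc-target-uri': '',
        'warc-block-digest': ''
    },
    'offset': -1,
    'nb_sentences': -1
}

def fill_meta(example):
    # Keep the real v2 meta when present; otherwise pad with the dummy record
    # so every example satisfies the same Arrow schema.
    if not example.get('meta'):
        example['meta'] = EMPTY_META
    return example

# v1_dataset = v1_dataset.map(fill_meta)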

I am using link with a use_auth_token=True parameter. I am assuming this is the v2 vs. the v1 in link.


huu4ontocord commented on June 2, 2024

Yes, I think that works fine. An empty record.


huu4ontocord commented on June 2, 2024

I'm not sure what link stands for?


huu4ontocord commented on June 2, 2024

Actually, how are you creating your dataset? If you are simply saving away the meta data in the registry jsonl file, I think the meta field can actually be just a str. And then the user, when using datasets, could parse it as JSON if they wish. In this way, you can have an empty str if you have no JSON for meta. For our purposes, I think the most interesting information is the URI. Maybe the sha1 too.
I am not sure what the WARC language field stands for, but that looks like it's not "es". Is that "spa" for Spanish?


ChenghaoMou commented on June 2, 2024

@ontocord the links are hyperlinks in Markdown; they should be visible in a browser.

I am using datasets because it has a nice implementation of parallelization, and it plays nicely with what we have in the deduplication script.

The parallelized map function in datasets works by splitting a dataset into shards, processing each shard, and merging the results back together at the end by calling concatenate_datasets. The problem is that concatenation enforces the schema check, hence the error.
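
Roughly, a simplified sketch of that mechanism (the real implementation also handles caching and fingerprinting):

from datasets import load_dataset, concatenate_datasets

ds = load_dataset('json', data_files='temp.jsonl')['train']

# What map(..., num_proc=2) does in essence: shard, process, re-concatenate.
num_proc = 2
shards = [ds.shard(num_shards=num_proc, index=i) for i in range(num_proc)]
processed = [s.map(lambda x: {"new_feature": x["meta"]}) for s in shards]
# This is the step that raises "Features must match for all datasets"
# when the processed shards end up with different inferred schemas.
merged = concatenate_datasets(processed)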

Yes, we can avoid that by working directly on the json files, just for the sake of creating our dataset. But if people try to manipulate the dataset after loading and call the map function with num_proc, it will throw this error. The minimal code to reproduce the error is:

# temp.jsonl
{"text": "test", "meta": {}, "id": 0}
{"text": "text", "meta": {"hash": "random"}, "id": 1}

# temp.py
from datasets import load_dataset
dataset = load_dataset('json', data_files='temp.jsonl')
dataset.map(lambda x: {"new_feature": x["meta"]}, num_proc=2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/chenghao/miniforge3/envs/data/lib/python3.9/site-packages/datasets/dataset_dict.py", line 484, in map
    {
  File "/Users/chenghao/miniforge3/envs/data/lib/python3.9/site-packages/datasets/dataset_dict.py", line 485, in <dictcomp>
    k: dataset.map(
  File "/Users/chenghao/miniforge3/envs/data/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2147, in map
    result = concatenate_datasets(transformed_shards)
  File "/Users/chenghao/miniforge3/envs/data/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3651, in concatenate_datasets
    raise ValueError("Features must match for all datasets")
ValueError: Features must match for all datasets


huu4ontocord commented on June 2, 2024

Cool, did you try this:

# temp.jsonl
{"text": "test", "meta": "{}", "id": 0}
{"text": "text", "meta": "{\"hash\": \"random\"}", "id": 1}

# temp.py
from datasets import load_dataset
dataset = load_dataset('json', data_files='/content/temp.jsonl')
ds = dataset.map(lambda x: {"new_feature": x["meta"]}, num_proc=2)

# now you can read json.loads(ds[1]["meta"])

What I propose is to save the meta field data as a str, not an embedded json object.
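
For the writing side, a minimal sketch of producing the registry jsonl with meta serialized to a string (the records here are just the toy examples from above):

import json

records = [
    {"text": "test", "meta": {}, "id": 0},
    {"text": "text", "meta": {"hash": "random"}, "id": 1},
]

with open("temp.jsonl", "w") as f:
    for record in records:
        # Serialize meta to a plain string so Arrow sees a uniform
        # meta: string column, whatever keys each record carries.
        record["meta"] = json.dumps(record["meta"])
        f.write(json.dumps(record) + "\n")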


ChenghaoMou commented on June 2, 2024

That works too. But it also introduces another issue — we need a place to document all the keys available in that JSON string field.


huu4ontocord commented on June 2, 2024

Yes, I agree. I don't know how to use the new-style datasets that just load from a repo. But in the old way, you create metadata info in your dataset loading code. You could also put the documentation in our ac_dc code itself.
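
For example, the documentation could live as a simple constant next to the loading code. META_KEYS is a hypothetical name, and the descriptions are best-effort readings of the WARC header semantics:

# Hypothetical constant documenting the keys that may appear in the
# JSON-encoded meta string. Descriptions are best-effort.
META_KEYS = {
    "headers.warc-record-id": "WARC record identifier (urn:uuid)",
    "headers.warc-date": "Date of the WARC record",
    "headers.content-type": "MIME type of the payload, e.g. text/plain",
    "headers.content-length": "Payload length in bytes",
    "headers.warc-type": "WARC record type, e.g. conversion",
    "headers.warc-identified-content-language": "Detected language code(s), e.g. spa,lat",
    "headers.warc-refers-to": "Record this one was derived from",
    "headers.warc-target-uri": "Original URI of the document",
    "headers.warc-block-digest": "sha1 digest of the record block",
    "offset": "Offset of the document in the source shard",
    "nb_sentences": "Number of sentences in the document",
}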

