Comments (13)
@ChenghaoMou this is similar to deduping.
Arrow expects all features to have the same type across all datasets:
raise ValueError("Features must match for all datasets")
and meta in v2 has the type
meta: struct<headers: struct<content-length: int64, content-type: string, warc-block-digest: string, warc-date: timestamp[us], warc-identified-content-language: string, warc-record-id: string, warc-refers-to: string, warc-target-uri: string, warc-type: string>, nb_sentences: int64, offset: int64>
Not all v1 data have a reference (or duplicate, in this case) in v2. Should we just use some dummy values instead?
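For reference, a quick way to see the mismatch is to compare the inferred features of the two versions. This is only a sketch; the file names are placeholders, not the real OSCAR paths.
# compare_features.py (sketch; v1_sample.jsonl / v2_sample.jsonl are placeholders)
from datasets import load_dataset

ds_v1 = load_dataset("json", data_files="v1_sample.jsonl", split="train")
ds_v2 = load_dataset("json", data_files="v2_sample.jsonl", split="train")

# If the two meta structs differ, concatenate_datasets will raise
# "Features must match for all datasets".
print(ds_v1.features["meta"])
print(ds_v2.features["meta"])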
Example v2 metadata, for reference:
a = {
'headers': {
'warc-record-id': '<urn:uuid:0a6aab53-ade6-45f7-a55b-6d18967fd6cf>',
'warc-date': datetime.datetime(2021, 2, 28, 3, 8, 24),
'content-type': 'text/plain',
'content-length': 6490,
'warc-type': 'conversion',
'warc-identified-content-language': 'spa,lat',
'warc-refers-to': '<urn:uuid:a9361c95-6475-4585-a410-16bfad4da8be>',
'warc-target-uri': 'https://www.funerarialanueva.es/obituary/sara-martinez/',
'warc-block-digest': 'sha1:QI2XEXXTSVWNF3CDLOWTZLQXVLBHO5UJ'
},
'offset': 194,
'nb_sentences': 1
}
@ontocord by empty you mean something like this?
a = {
'headers': {
'warc-record-id': '',
'warc-date': datetime.datetime(1970, 1, 1),
'content-type': '',
'content-length': -1,
'warc-type': '',
'warc-identified-content-language': '',
'warc-refers-to': '',
'warc-target-uri': '',
'warc-block-digest': ''
},
'offset': -1,
'nb_sentences': -1
}
Arrow checks the schema recursively, so we need to have the same keys and same value types (str, int, or date).
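And for v1 rows without a match in v2, a fill step could substitute that empty record so every row carries the same struct type. A minimal sketch; fill_meta and v2_lookup are hypothetical names, not from the actual script:
# fill_meta.py (sketch; fill_meta and v2_lookup are hypothetical)
import datetime

EMPTY_META = {
    "headers": {
        "warc-record-id": "",
        "warc-date": datetime.datetime(1970, 1, 1),
        "content-type": "",
        "content-length": -1,
        "warc-type": "",
        "warc-identified-content-language": "",
        "warc-refers-to": "",
        "warc-target-uri": "",
        "warc-block-digest": "",
    },
    "offset": -1,
    "nb_sentences": -1,
}

def fill_meta(example, v2_lookup):
    # Use the real v2 meta when a duplicate exists; otherwise fall back to
    # the empty record so Arrow sees one consistent schema.
    example["meta"] = v2_lookup.get(example["id"], EMPTY_META)
    return example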
I am using link with a use_auth_token=True parameter. I am assuming this is the v2 vs. the v1 in link.
Yes, I think that works fine. An empty record.
I'm not sure what link stands for?
Actually, how are you creating your dataset? If you are simply saving away the metadata in the registry jsonl file, I think the meta field can actually be just a str. Users of datasets could then parse it as JSON if they wish. That way, you can have an empty str when there is no JSON for meta. For our purposes, I think the most interesting information is the URI. Maybe the sha1 too. See the sketch below.
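If meta is stored as a string, the registry writer just needs to dump it. A minimal sketch, assuming a hypothetical registry.jsonl and illustrative key values:
# write_registry.py (sketch: store meta as a JSON string, not a nested object)
import json

meta = {"warc-target-uri": "https://example.com/page", "warc-block-digest": "sha1:..."}
record = {"text": "some document text", "meta": json.dumps(meta), "id": 0}

with open("registry.jsonl", "a", encoding="utf-8") as f:
    # json.dumps on the record escapes the inner JSON string automatically.
    f.write(json.dumps(record) + "\n")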
I am not sure what the WARC lang field stands for, but that looks like it's not "es". Is that "spa" for Spanish?
@ontocord links are hyperlinks in Markdown; they should be visible in a browser.
I am using datasets because it has a nice implementation of parallelization, and it plays nicely with what we have in the deduplication script. The parallelized map function in datasets works by splitting the dataset into shards, processing them, and merging them back together at the end by calling concatenate_datasets. The problem is that concatenation enforces the schema check, hence the error.
Yes, we can avoid that by working directly on the JSON files, just for the sake of creating our dataset. But if people try to manipulate the dataset after loading it and call the map function with num_proc, it will throw this error. The minimal code to reproduce the error is:
# temp.jsonl
{"text": "test", "meta": {}, "id": 0}
{"text": "text", "meta": {"hash": "random"}, "id": 1}
# temp.py
from datasets import load_dataset
dataset = load_dataset('json', data_files='temp.jsonl')
dataset.map(lambda x: {"new_feature": x["meta"]}, num_proc=2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/chenghao/miniforge3/envs/data/lib/python3.9/site-packages/datasets/dataset_dict.py", line 484, in map
{
File "/Users/chenghao/miniforge3/envs/data/lib/python3.9/site-packages/datasets/dataset_dict.py", line 485, in <dictcomp>
k: dataset.map(
File "/Users/chenghao/miniforge3/envs/data/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 2147, in map
result = concatenate_datasets(transformed_shards)
File "/Users/chenghao/miniforge3/envs/data/lib/python3.9/site-packages/datasets/arrow_dataset.py", line 3651, in concatenate_datasets
raise ValueError("Features must match for all datasets")
ValueError: Features must match for all datasets
Cool, did you try this:
# temp.jsonl
{"text": "test", "meta": "{}", "id": 0}
{"text": "text", "meta": "{\"hash\": \"random\"}", "id": 1}
# temp.py
from datasets import load_dataset
dataset = load_dataset('json', data_files='/content/temp.jsonl')
ds = dataset.map(lambda x: {"new_feature": x["meta"]}, num_proc=2)
# now you can read json.loads(ds[1]["meta"])
What I propose is to save the meta field data as a str, and not an embedded JSON object.
That works too, but it introduces another issue: we need a place to document all the keys available in that JSON string field.
Yes, I agree. I don't know how to use the new-style datasets that just load from a repo, but in the old way, you create metadata info for your dataset loading code. You could also put the documentation in our ac_dc code itself. A sketch of what that could look like is below.
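One way to document the keys, following the old-style loading-script approach, would be to declare meta as a string in the features and describe its keys next to the schema. A sketch; the description text and key list are illustrative assumptions, not the actual ac_dc code:
# dataset_features.py (sketch: document the meta keys alongside the schema)
import datasets

features = datasets.Features(
    {
        "id": datasets.Value("int64"),
        "text": datasets.Value("string"),
        # meta is a JSON-encoded string; expected keys (illustrative):
        # warc-target-uri, warc-date, warc-block-digest (sha1),
        # warc-identified-content-language, offset, nb_sentences
        "meta": datasets.Value("string"),
    }
)

info = datasets.DatasetInfo(
    description="meta is a JSON-encoded string carrying the WARC header fields.",
    features=features,
)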