Viewer for the Hugging Face datasets library.
streamlit run run.py
or, if you want to view local files:
streamlit run run.py <absolute path to datasets/datasets/>
Viewer for the 🤗 datasets library.
Home Page: https://huggingface.co/datasets/viewer
I'm getting the following error when trying to load the definite_pronoun_resolution dataset:
ArrowInvalid: Column 2: In chunk 0: Invalid: Values Length (2644) is not equal to the length (1) multiplied by the value size (2)
Traceback:
File "/home/sasha/streamlit/lib/streamlit/ScriptRunner.py", line 322, in _run_script
exec(code, module.__dict__)
File "/home/sasha/nlp-viewer/run.py", line 217, in <module>
keys = list(d[0].keys())
File "/home/sasha/.local/share/virtualenvs/lib-ogGKnCK_/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 1024, in __getitem__
format_kwargs=self._format_kwargs,
File "/home/sasha/.local/share/virtualenvs/lib-ogGKnCK_/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 902, in _getitem
outputs = self._unnest(self._data.slice(key, 1).to_pydict())
File "pyarrow/table.pxi", line 1211, in pyarrow.lib.Table.slice
File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table
File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
Hi, not sure if the best place to ask is here or in the nlp repo.
A few weeks ago I contributed the qanta dataset https://huggingface.co/nlp/viewer/?dataset=qanta&config=mode%3Dfull%2Cchar_skip%3D25, but noticed that it can't be displayed. Are there configuration changes on the viewer, or changes in the data, that would let it load?
It looks like the dataset is on the boundary of being renderable, since it loads with the "first" configuration but not with "full". The only difference between these subsets is that the text field holds the content of the full_question field instead of first_sentence. Perhaps an alternative is to change which configuration is the default. Thanks!
For example, https://huggingface.co/datasets/viewer/?dataset=arabic_pos_dialect
This seems to only happen when the text is in a list, so probably an issue with json.dumps()
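A quick illustration of the suspected difference (generic sketch, not the viewer's actual code): a scalar string can be rendered as-is, while a list-valued field serialized with json.dumps comes out as a JSON array literal, which a renderer expecting plain text may display oddly.

```python
import json

# A scalar text field can be shown directly...
text = "a plain sentence"
print(text)                  # a plain sentence

# ...but a list-valued field passed through json.dumps becomes a JSON
# array literal, brackets and quotes included.
tokens = ["a", "plain", "sentence"]
print(json.dumps(tokens))    # ["a", "plain", "sentence"]
```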
For example, https://huggingface.co/datasets/viewer/?dataset=glue&config=mrpc, which should show the mrpc subset of GLUE, shows the cola subset.
Reported here: huggingface/datasets#1996
As reported by @marshmellow77 in huggingface/datasets#2997 (comment), the data viewer still shows incorrect data for the turkish_product_reviews dataset, which was fixed on Jun 22, 2021: huggingface/datasets@16bc665
Link to reproduce: https://huggingface.co/datasets/viewer/?dataset=common_voice
Changing the subset to something different from ab seems to resolve the problem.
I ran datasets-viewer locally and accessed the mrpc subset of the glue dataset.
Then I followed https://huggingface.co/docs/datasets/quicktour.html; in particular, I loaded the same subset + dataset with:
>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')
Then looking at the cache directory, the data seems to be messed up:
$ tree -L 3 ~/.cache/huggingface/datasets/glue/mrpc
/Users/slesage/.cache/huggingface/datasets/glue/mrpc
└── 1.0.0
    ├── LICENSE
    ├── dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad
    │   ├── LICENSE
    │   ├── dataset_info.json
    │   ├── glue-test.arrow
    │   ├── glue-train.arrow
    │   └── glue-validation.arrow
    ├── dataset_info.json
    ├── glue-test.arrow
    ├── glue-train.arrow
    └── glue-validation.arrow

2 directories, 10 files
Also: I originally did it the other way around (first loading the subset from the docs tutorial, then accessing it through a local instance of datasets-viewer) and I got an exception:
2021-07-21 11:34:31.631 Traceback (most recent call last):
File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/streamlit/script_runner.py", line 349, in _run_script
exec(code, module.__dict__)
File "/Users/slesage/hf/datasets-viewer/run.py", line 215, in <module>
dts, fail = get(str(option), str(conf_option.name) if conf_option else None)
File "/Users/slesage/hf/datasets-viewer/run.py", line 148, in get
builder_instance = builder_cls(name=conf, cache_dir=path if path_to_datasets is not None else None)
File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/datasets/builder.py", line 1014, in __init__
super(GeneratorBasedBuilder, self).__init__(*args, **kwargs)
File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/datasets/builder.py", line 269, in __init__
self.info = DatasetInfo.from_directory(self._cache_dir)
File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/datasets/info.py", line 260, in from_directory
with open(os.path.join(dataset_info_dir, config.DATASET_INFO_FILENAME), "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/slesage/.cache/huggingface/datasets/glue/mrpc/1.0.0/dataset_info.json'
because /Users/slesage/.cache/huggingface/datasets/glue/mrpc existed but only contained dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad, while datasets-viewer expected to find dataset_info.json.
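The traceback shows DatasetInfo.from_directory opening dataset_info.json directly in the cache directory it is given, so a cache dir that only contains the hash-named subdirectory fails. A small defensive check mirroring that expectation (a sketch; `has_dataset_info` is a hypothetical helper, and the path layout is taken from the listing above):

```python
import os

def has_dataset_info(cache_dir: str) -> bool:
    # Hypothetical helper: report whether the file the viewer expects
    # (<cache_dir>/dataset_info.json) actually exists.
    return os.path.isfile(os.path.join(cache_dir, "dataset_info.json"))

cache_dir = os.path.expanduser(
    "~/.cache/huggingface/datasets/glue/mrpc/1.0.0"
)
if not has_dataset_info(cache_dir):
    print("dataset_info.json missing; loading from this cache dir will "
          "raise FileNotFoundError")
```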