Viewer for the Hugging Face datasets library.
streamlit run run.py
or, if you want to view local files:
streamlit run run.py <absolute path to datasets/datasets/>
Viewer for the 🤗 datasets library.
Home Page: https://huggingface.co/datasets/viewer
I'm getting the following error when trying to load the definite_pronoun_resolution dataset:
ArrowInvalid: Column 2: In chunk 0: Invalid: Values Length (2644) is not equal to the length (1) multiplied by the value size (2)
Traceback:
File "/home/sasha/streamlit/lib/streamlit/ScriptRunner.py", line 322, in _run_script
exec(code, module.__dict__)
File "/home/sasha/nlp-viewer/run.py", line 217, in <module>
keys = list(d[0].keys())
File "/home/sasha/.local/share/virtualenvs/lib-ogGKnCK_/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 1024, in __getitem__
format_kwargs=self._format_kwargs,
File "/home/sasha/.local/share/virtualenvs/lib-ogGKnCK_/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 902, in _getitem
outputs = self._unnest(self._data.slice(key, 1).to_pydict())
File "pyarrow/table.pxi", line 1211, in pyarrow.lib.Table.slice
File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table
File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
Hi, not sure if the best place to ask is here or in the nlp repo.
A few weeks ago I contributed the qanta dataset https://huggingface.co/nlp/viewer/?dataset=qanta&config=mode%3Dfull%2Cchar_skip%3D25, but noticed that it can't be displayed. Are there configuration changes on the viewer, or changes in the data, that would let it load?
It looks like the dataset is on the boundary of being renderable, since it loads with the "first" configuration but not with "full". The only difference between these subsets is that the text field holds the content of the full_question field instead of first_sentence. Perhaps an alternative is to change which configuration is the default. Thanks!
For example, https://huggingface.co/datasets/viewer/?dataset=arabic_pos_dialect
This seems to only happen when the text is in a list, so probably an issue with json.dumps()
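A quick illustration of the suspected difference (generic sketch, not the viewer's actual code): a scalar string can be rendered as-is, while a list-valued field serialized with json.dumps comes out as a JSON array literal, which a renderer expecting plain text may display oddly.

```python
import json

# A scalar text field can be shown directly...
text = "a plain sentence"
print(text)                  # a plain sentence

# ...but a list-valued field passed through json.dumps becomes a JSON
# array literal, brackets and quotes included.
tokens = ["a", "plain", "sentence"]
print(json.dumps(tokens))    # ["a", "plain", "sentence"]
```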
For example, https://huggingface.co/datasets/viewer/?dataset=glue&config=mrpc, which should show the mrpc subset of GLUE, shows the cola subset.
Reported here: huggingface/datasets#1996
As reported by @marshmellow77 in huggingface/datasets#2997 (comment), the data viewer still shows incorrect data for the turkish_product_reviews dataset, which was fixed on Jun 22, 2021: huggingface/datasets@16bc665
Link to reproduce: https://huggingface.co/datasets/viewer/?dataset=common_voice
Changing the subset to something different from ab seems to resolve the problem.
I ran datasets-viewer locally and accessed the mrpc subset of the glue dataset.
Then I followed https://huggingface.co/docs/datasets/quicktour.html; in particular, I loaded the same subset + dataset with:
>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')
Then looking at the cache directory, the data seems to be messed up:
$ tree -L 3 ~/.cache/huggingface/datasets/glue/mrpc
/Users/slesage/.cache/huggingface/datasets/glue/mrpc
└── 1.0.0
    ├── LICENSE
    ├── dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad
    │   ├── LICENSE
    │   ├── dataset_info.json
    │   ├── glue-test.arrow
    │   ├── glue-train.arrow
    │   └── glue-validation.arrow
    ├── dataset_info.json
    ├── glue-test.arrow
    ├── glue-train.arrow
    └── glue-validation.arrow

2 directories, 10 files
Also: I originally did it the other way around (first loading the subset from the docs tutorial, then accessing it through a local instance of datasets-viewer) and I got an exception:
2021-07-21 11:34:31.631 Traceback (most recent call last):
File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/streamlit/script_runner.py", line 349, in _run_script
exec(code, module.__dict__)
File "/Users/slesage/hf/datasets-viewer/run.py", line 215, in <module>
dts, fail = get(str(option), str(conf_option.name) if conf_option else None)
File "/Users/slesage/hf/datasets-viewer/run.py", line 148, in get
builder_instance = builder_cls(name=conf, cache_dir=path if path_to_datasets is not None else None)
File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/datasets/builder.py", line 1014, in __init__
super(GeneratorBasedBuilder, self).__init__(*args, **kwargs)
File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/datasets/builder.py", line 269, in __init__
self.info = DatasetInfo.from_directory(self._cache_dir)
File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/datasets/info.py", line 260, in from_directory
with open(os.path.join(dataset_info_dir, config.DATASET_INFO_FILENAME), "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/slesage/.cache/huggingface/datasets/glue/mrpc/1.0.0/dataset_info.json'
because /Users/slesage/.cache/huggingface/datasets/glue/mrpc existed but only contained dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad, while datasets-viewer expected to find dataset_info.json.
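The traceback shows DatasetInfo.from_directory opening dataset_info.json directly in the cache directory it is given, so a cache dir that only contains the hash-named subdirectory fails. A small defensive check mirroring that expectation (a sketch; `has_dataset_info` is a hypothetical helper, and the path layout is taken from the listing above):

```python
import os

def has_dataset_info(cache_dir: str) -> bool:
    # Hypothetical helper: report whether the file the viewer expects
    # (<cache_dir>/dataset_info.json) actually exists.
    return os.path.isfile(os.path.join(cache_dir, "dataset_info.json"))

cache_dir = os.path.expanduser(
    "~/.cache/huggingface/datasets/glue/mrpc/1.0.0"
)
if not has_dataset_info(cache_dir):
    print("dataset_info.json missing; loading from this cache dir will "
          "raise FileNotFoundError")
```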