datasets-viewer's Introduction

Hugging Face Datasets Viewer

Viewer for the Hugging Face datasets library.

streamlit run run.py

or, if you want to view local files:

streamlit run run.py <absolute path to datasets/datasets/>
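For reference, run.py is a Streamlit script. A minimal sketch of the pattern it follows (not the actual implementation; the widget names and defaults below are illustrative only):

# Minimal sketch of the viewer pattern, not the actual run.py.
# Pick a dataset and config, load it with the datasets library, and show the first example.
import streamlit as st
from datasets import load_dataset

dataset_name = st.sidebar.text_input("Dataset", "glue")
config_name = st.sidebar.text_input("Config", "mrpc")

d = load_dataset(dataset_name, config_name, split="train")
st.write(list(d[0].keys()))  # column names
st.write(d[0])               # first example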

datasets-viewer's People

Contributors

julien-c, severo, srush

datasets-viewer's Issues

Error viewing definite_pronoun_resolution dataset in viewer

I'm getting the following error when trying to load the definite_pronoun_resolution dataset:

ArrowInvalid: Column 2: In chunk 0: Invalid: Values Length (2644) is not equal to the length (1) multiplied by the value size (2)

Traceback:

File "/home/sasha/streamlit/lib/streamlit/ScriptRunner.py", line 322, in _run_script
    exec(code, module.__dict__)
File "/home/sasha/nlp-viewer/run.py", line 217, in <module>
    keys = list(d[0].keys())
File "/home/sasha/.local/share/virtualenvs/lib-ogGKnCK_/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 1024, in __getitem__
    format_kwargs=self._format_kwargs,
File "/home/sasha/.local/share/virtualenvs/lib-ogGKnCK_/lib/python3.7/site-packages/nlp/arrow_dataset.py", line 902, in _getitem
    outputs = self._unnest(self._data.slice(key, 1).to_pydict())
File "pyarrow/table.pxi", line 1211, in pyarrow.lib.Table.slice
File "pyarrow/public-api.pxi", line 390, in pyarrow.lib.pyarrow_wrap_table
File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status

Dataset too large to browse

Hi, I'm not sure whether the best place to ask is here or the nlp repo.

A few weeks ago I contributed the qanta dataset https://huggingface.co/nlp/viewer/?dataset=qanta&config=mode%3Dfull%2Cchar_skip%3D25, but noticed that it can't be displayed. Are there configuration changes on the viewer, or changes in the data, that would let it load?

It looks like the dataset is on the boundary of being renderable, since it loads with the "first" configuration but not with "full". The only difference between these subsets is that the text field contains the content of the full_question field instead of first_sentence. Perhaps an alternative is to change which configuration is the default. Thanks!
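As a rough illustration, both configurations can be loaded side by side with the datasets library; the config strings below are inferred from the viewer URL and the issue text, so they may need adjusting:

# Sketch: compare the two qanta configurations mentioned above.
# Config names are assumptions based on the viewer URL, not verified.
from datasets import load_dataset

full = load_dataset("qanta", "mode=full,char_skip=25")    # fails to render in the viewer
first = load_dataset("qanta", "mode=first,char_skip=25")  # renders fine

# Same splits and row counts; only the "text" field differs (full question vs. first sentence).
for split in full:
    print(split, full[split].num_rows, first[split].num_rows)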

Cache directory seems to be messed up

I ran datasets-viewer locally and accessed the mrpc subset of the glue dataset.

Then I followed https://huggingface.co/docs/datasets/quicktour.html; in particular, I loaded the same dataset and subset with:

>>> from datasets import load_dataset
>>> dataset = load_dataset('glue', 'mrpc', split='train')

Then looking at the cache directory, the data seems to be messed up:

$ tree -L 3 ~/.cache/huggingface/datasets/glue/mrpc
/Users/slesage/.cache/huggingface/datasets/glue/mrpc
└── 1.0.0
    ├── LICENSE
    ├── dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad
    │   ├── LICENSE
    │   ├── dataset_info.json
    │   ├── glue-test.arrow
    │   ├── glue-train.arrow
    │   └── glue-validation.arrow
    ├── dataset_info.json
    ├── glue-test.arrow
    ├── glue-train.arrow
    └── glue-validation.arrow

2 directories, 10 files

Also: I originally did it the other way around (first loaded the subset from the docs tutorial, then accessed it through a local instance of datasets-viewer) and got an exception:

2021-07-21 11:34:31.631 Traceback (most recent call last):
  File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/streamlit/script_runner.py", line 349, in _run_script
    exec(code, module.__dict__)
  File "/Users/slesage/hf/datasets-viewer/run.py", line 215, in <module>
    dts, fail = get(str(option), str(conf_option.name) if conf_option else None)
  File "/Users/slesage/hf/datasets-viewer/run.py", line 148, in get
    builder_instance = builder_cls(name=conf, cache_dir=path if path_to_datasets is not None else None)
  File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/datasets/builder.py", line 1014, in __init__
    super(GeneratorBasedBuilder, self).__init__(*args, **kwargs)
  File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/datasets/builder.py", line 269, in __init__
    self.info = DatasetInfo.from_directory(self._cache_dir)
  File "/Users/slesage/hf/datasets-viewer/.venv/lib/python3.8/site-packages/datasets/info.py", line 260, in from_directory
    with open(os.path.join(dataset_info_dir, config.DATASET_INFO_FILENAME), "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: '/Users/slesage/.cache/huggingface/datasets/glue/mrpc/1.0.0/dataset_info.json'

because /Users/slesage/.cache/huggingface/datasets/glue/mrpc existed but only contained dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad, while datasets-viewer expected to find dataset_info.json.
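A rough illustration of the mismatch, using the paths from the traceback and the directory contents described above (a sanity check you can run, not library code):

# Sketch: where the viewer looks vs. where load_dataset actually wrote the files.
import os

mrpc_dir = "/Users/slesage/.cache/huggingface/datasets/glue/mrpc"
hash_dir = "dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad"

# What datasets-viewer (via DatasetInfo.from_directory) tried to open and did not find:
print(os.path.exists(os.path.join(mrpc_dir, "1.0.0", "dataset_info.json")))  # False -> FileNotFoundError
# What load_dataset had actually created at that point:
print(os.path.exists(os.path.join(mrpc_dir, hash_dir)))  # True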
