
vega-datasets's Introduction

Vega Datasets


Collection of datasets used in Vega and Vega-Lite examples. This data lives at https://github.com/vega/vega-datasets and https://cdn.jsdelivr.net/npm/vega-datasets.

Common repository for example datasets used by Vega related projects. Keep changes to this repository minimal as other projects (Vega, Vega Editor, Vega-Lite, Polestar, Voyager) use this data in their tests and for examples.

The list of sources is in SOURCES.md.

To access the data in Observable, you can import vega-datasets. Try our example notebook. To access these datasets from Python, you can use the vega_datasets Python package. To access them from Julia, you can use the VegaDatasets.jl package.

Versioning

We use semantic versioning. However, since this package serves datasets, we have additional rules for how we version the data.

We do not change data in patch releases except to resolve formatting issues. Minor releases may change the data, but only in ways that do not change field names or file names; they may also add datasets. Major releases may change file names and file contents, and may remove or update files.

How to use it

HTTP

You can get the data directly via HTTP, served by GitHub or jsDelivr (a fast CDN), for example:

https://vega.github.io/vega-datasets/data/cars.json, or with a pinned version (recommended), such as https://cdn.jsdelivr.net/npm/vega-datasets@2/data/cars.json.

You can find a full listing of the available datasets at https://cdn.jsdelivr.net/npm/vega-datasets/data/.
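Pinning to a version keeps these URLs stable across releases. A minimal sketch of how such a URL is assembled (the datasetUrl helper is hypothetical, not part of this package):

```javascript
// Hypothetical helper (not part of vega-datasets) that assembles a
// version-pinned jsDelivr URL. Pinning only the major version ("@2") relies
// on the versioning rules above: field names and file names change only in
// major releases.
function datasetUrl(name, version = "2") {
  return `https://cdn.jsdelivr.net/npm/vega-datasets@${version}/data/${name}`;
}

console.log(datasetUrl("cars.json"));
// https://cdn.jsdelivr.net/npm/vega-datasets@2/data/cars.json
```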

NPM

Get the data on disk

npm i vega-datasets

Now you have all the datasets in a folder in node_modules/vega-datasets/data/.

Get the URLs or load the data via URL

npm i vega-datasets

Now you can import the package (for example, const data = require('vega-datasets')) and access the URL of any dataset with data[NAME].url. Calling data[NAME]() returns a promise that resolves to the parsed data fetched from that URL. We use d3-dsv to parse CSV files.

Here is a full example:

import data from 'vega-datasets';

const cars = await data['cars.json']();
// equivalent to
// const cars = await (await fetch(data['cars.json'].url)).json();

console.log(cars);

Development process

Install dependencies with yarn.

Release process

To make a release, run npm run release.

vega-datasets's People

Contributors

arvind, avatorl, chanwutk, davidanthoff, dependabot-preview[bot], dependabot[bot], domoritz, eitanlees, greenkeeper[bot], hydrosquall, ionathan, jakevdp, jheer, jwolondon, kanitw, lawlesst, light-and-salt, mcnuttandrew, mcorrell, p42-ai[bot], palewire, pbi-david, rileychang, visnup, willium, ydlamba, yhoonkim


vega-datasets's Issues

License?

Any license info on this repo?

An in-range update of rollup is breaking the build 🚨

The devDependency rollup was updated from 1.9.2 to 1.9.3.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

rollup is a devDependency of this project. It might not break your production code or affect downstream projects, but probably breaks your build or test tools, which may prevent deploying or publishing.

Status Details
  • continuous-integration/travis-ci/push: The Travis CI build could not complete due to an error (Details).
  • Travis CI - Branch: The build errored.

Release Notes for v1.9.3

2019-04-10

Bug Fixes

  • Simplify return expressions that are evaluated before the surrounding function is bound (#2803)

Pull Requests

  • #2803: Handle out-of-order binding of identifiers to improve tree-shaking (@lukastaegert)
Commits

The new version differs by 3 commits.

  • 516a06d 1.9.3
  • a5526ea Update changelog
  • c3d73ff Handle out-of-order binding of identifiers to improve tree-shaking (#2803)

See the full diff

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

build/vega-datasets.min.js is now an iife, breaking require

I noticed that #388 changed build/vega-datasets.min.js to be an IIFE:

{
  file: "build/vega-datasets.min.js",
  format: "iife",
  sourcemap: true,
  name: "vegaDatasets",
  plugins: [terser()],
},

Whereas build/vega-datasets.js is still a UMD:

{
  file: "build/vega-datasets.js",
  format: "umd",
  sourcemap: true,
  name: "vegaDatasets",
},

The problem is that require("vega-datasets") on Observable will use your unpkg/jsdelivr entry point which points to the IIFE, and thus errors:

"unpkg": "build/vega-datasets.min.js",
"jsdelivr": "build/vega-datasets.min.js",

You can see it breaking here:

https://observablehq.com/@vega/vega-lite-input-binding

If you want to drop UMD support, we could fix that notebook (and presumably others) by using ES import instead of require:

world = (await import('vega-datasets')).default['world-110m.json'].url

But if you are supporting IIFE, maybe it’s worth continuing to support UMD for backwards compatibility?
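One way to restore compatibility would be to emit the minified bundle as UMD as well, since UMD also loads fine from a plain script tag. This is a sketch of a possible fix, not necessarily the change that was made:

```javascript
// Hypothetical rollup output entry: switch the minified bundle back to UMD
// so require("vega-datasets") keeps working through the unpkg/jsdelivr
// entry points, while <script> consumers are unaffected.
{
  file: "build/vega-datasets.min.js",
  format: "umd",      // was "iife"
  sourcemap: true,
  name: "vegaDatasets",
  plugins: [terser()],
},
```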

Birdstrikes dataset missing

When trying to load the birdstrikes data, a 404 error is thrown.

vega_datasets: 0.8
Ubuntu 20.04 LTS

Python 3.8.3 (default, May 19 2020, 18:47:26) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.15.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from vega_datasets import data                                                            

In [2]: bird = data.birdstrikes()                                                                 
---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-2-a4e393dbcf55> in <module>
----> 1 bird = data.birdstrikes()

~/miniconda3/envs/cookbook/lib/python3.8/site-packages/vega_datasets/core.py in __call__(self, use_local, **kwargs)
    222             parsed data
    223         """
--> 224         datasource = BytesIO(self.raw(use_local=use_local))
    225 
    226         kwds = self._pd_read_kwds.copy()

~/miniconda3/envs/cookbook/lib/python3.8/site-packages/vega_datasets/core.py in raw(self, use_local)
    201             return pkgutil.get_data("vega_datasets", self.pkg_filename)
    202         else:
--> 203             return urlopen(self.url).read()
    204 
    205     def __call__(self, use_local=True, **kwargs):

~/miniconda3/envs/cookbook/lib/python3.8/urllib/request.py in urlopen(url, data, timeout, cafile, capath, cadefault, context)
    220     else:
    221         opener = _opener
--> 222     return opener.open(url, data, timeout)
    223 
    224 def install_opener(opener):

~/miniconda3/envs/cookbook/lib/python3.8/urllib/request.py in open(self, fullurl, data, timeout)
    529         for processor in self.process_response.get(protocol, []):
    530             meth = getattr(processor, meth_name)
--> 531             response = meth(req, response)
    532 
    533         return response

~/miniconda3/envs/cookbook/lib/python3.8/urllib/request.py in http_response(self, request, response)
    638         # request was successfully received, understood, and accepted.
    639         if not (200 <= code < 300):
--> 640             response = self.parent.error(
    641                 'http', request, response, code, msg, hdrs)
    642 

~/miniconda3/envs/cookbook/lib/python3.8/urllib/request.py in error(self, proto, *args)
    567         if http_err:
    568             args = (dict, 'default', 'http_error_default') + orig_args
--> 569             return self._call_chain(*args)
    570 
    571 # XXX probably also want an abstract factory that knows when it makes

~/miniconda3/envs/cookbook/lib/python3.8/urllib/request.py in _call_chain(self, chain, kind, meth_name, *args)
    500         for handler in handlers:
    501             func = getattr(handler, meth_name)
--> 502             result = func(*args)
    503             if result is not None:
    504                 return result

~/miniconda3/envs/cookbook/lib/python3.8/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 404: Not Found

Movies release dates off by 100 years

There are several movies listed in the movies.json dataset (everything after 2011) that are listed as having come out 100 years after they were actually released. It's a pretty quick fix and I'm wondering if a pull request would be welcome to either fix the data or create a new dataset with corrected data.


An in-range update of rollup is breaking the build 🚨

The devDependency rollup was updated from 1.18.0 to 1.19.0.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

rollup is a devDependency of this project. It might not break your production code or affect downstream projects, but probably breaks your build or test tools, which may prevent deploying or publishing.

Status Details
  • continuous-integration/travis-ci/push: The Travis CI build failed (Details).
  • Travis CI - Branch: The build failed.

Commits

The new version differs by 6 commits.

  • 9af119d 1.19.0
  • b3f361c Update changelog
  • 456f4d2 Avoid variable from empty module name be empty (#3026)
  • 17eaa43 Use id of last module in chunk as name base for auto-generated chunks (#3025)
  • 871bfa0 Switch to a code-splitting build and update dependencies (#3020)
  • 2443783 Unified file emission api (#2999)

See the full diff

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

add vega-datasets JS notebook to docs ...

@domoritz sounds good! I'll do one better for you guys:

I woke up this morning thinking we could use a simple vega datasets preview js notebook :)

https://observablehq.com/@randomfractals/vega-datasets


I'll leave it up to you guys if you want to add this vega datasets preview utility Observable notebook to your editor or this datasets repo readme.md.

This notebook can be used as a supplemental tool for the online Vega editor and for examples that use these data sources, since it is much faster at loading and scrolling data than what GitHub and the Vega editor provide.

I might add something similar to the https://github.com/RandomFractals/vscode-vega-viewer as a split panel in vega chart preview in the next major release.

cc @kanitw @arvind & @jheer

Cheers! 🤗

Originally posted by @RandomFractals in #64 (comment)

Add OHLC Data

I think the addition of an "Open High Low Close" (OHLC) dataset would be useful.

The VL example Candlestick Chart hard-codes data that is found in an earlier Protovis example.

The dataset contains the performance of the Chicago Board Options Exchange Volatility Index (VIX) in the summer of 2009.

I think including this dataset would be especially useful for people in finance.

Possible names: vix.json, vix-ohlc.json or just ohlc.json?

Let me know what you think and I can put together a PR.

An in-range update of rollup is breaking the build 🚨

The devDependency rollup was updated from 1.14.2 to 1.14.3.

🚨 View failing branch.

This version is covered by your current version range and after updating it in your project the build failed.

rollup is a devDependency of this project. It might not break your production code or affect downstream projects, but probably breaks your build or test tools, which may prevent deploying or publishing.

Status Details
  • continuous-integration/travis-ci/push: The Travis CI build could not complete due to an error (Details).
  • Travis CI - Branch: The build errored.

Release Notes for v1.14.3

2019-06-06

Bug Fixes

  • Generate correct external imports when importing from a directory that would be above the root of the current working directory (#2902)

Pull Requests

Commits

The new version differs by 4 commits.

See the full diff

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

Provide a js module

The idea would be that someone can just load vega-datasets and access the datasets (probably with async loading) or their URLs directly.

cars.json contains invalid data

It appears that some invalid data has crept into cars.json. Presumably, this is to highlight Voyager functionality, but it's had the knock-on effect of producing erroneous output with the Vega examples:

Cars Parallel Coords

Perhaps cars.json should contain the clean values, and a secondary duplicate data file introduces the error? Note: I believe this error was also introduced to the Vega test cases, so we'll want to update that too.

What is the origin of the barley dataset?

As I continue to refine and expand the Altair example gallery, the barley dataset has become our standby for stacked bar charts.

It would be nice to fill out its sources entry in the same way we did the wheat dataset. Can someone here verify its origin?

Update sf-temps

I figured once the Seattle temperatures are updated (#127) we should also update the corresponding San Francisco temperatures. I've already downloaded the data from the San Francisco International Airport weather station.


Making a note of it here so we don't forget.

Clean up for 2.0

For the 2.0 release, let's clean up datasets we don't need anymore.

  • Remove graticule
  • Consolidate weather datasets
  • Update the census dataset. #171
  • Update the CO2 dataset

What is the source of wheat.json?

I'm finding it useful for creating simple bar chart examples in Altair. I'm interested to learn more about where the data comes from.

Action required: Greenkeeper could not be activated 🚨

🚨 You need to enable Continuous Integration on Greenkeeper branches of this repository. 🚨

To enable Greenkeeper, you need to make sure that a commit status is reported on all branches. This is required by Greenkeeper because it uses your CI build statuses to figure out when to notify you about breaking changes.

Since we didn’t receive a CI status on the greenkeeper/initial branch, it’s possible that you don’t have CI set up yet. We recommend using Travis CI, but Greenkeeper will work with every other CI service as well.

If you have already set up a CI for this repository, you might need to check how it’s configured. Make sure it is set to run on all new branches. If you don’t want it to run on absolutely every branch, you can whitelist branches starting with greenkeeper/.

Once you have installed and configured CI on this repository correctly, you’ll need to re-trigger Greenkeeper’s initial pull request. To do this, please click the 'fix repo' button on account.greenkeeper.io.

Are you open to including more datasets?

I work in data journalism and I think it would be cool to include some simple but famous datasets from our profession in your examples. If I submitted some would you be open to considering them?

Urls should point to stable source

Right now, the generated URLs point to GitHub, which may change as we change the content of the files in master. Instead, we should generate URLs to jsDelivr and include the version number. This way we never change data without the user knowing.

Can not load earthquakes dataset

I couldn't load the earthquakes dataset, then I tried to manually download the JSON file and read it with pandas; the same error occurred.


CSV parser treats header rows as data

The CSV loader in data.ts uses the following call:
return d3.csvParseRows(await result.text(), (d3 as any).autoType)

However, this method does not treat the header row as field names; that is the documented behavior of csvParseRows (see https://github.com/d3/d3-dsv#dsv_parseRows), which parses every row, including the first, as data.

We need to update vega-datasets to return a properly parsed CSV that uses the header row to determine object field names.
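The distinction can be illustrated without d3 at all. Below is a minimal hand-rolled sketch (ignoring quoting and type inference) of row-oriented versus header-aware parsing; the function names mirror d3-dsv's but the implementations are illustrative only:

```javascript
// Row-oriented parsing returns arrays, with the header row treated as data
// (this is what csvParseRows-style parsing does).
function parseRows(text) {
  return text.trim().split("\n").map((line) => line.split(","));
}

// Header-aware parsing uses the first row as field names and returns
// objects (this is what csvParse-style parsing does).
function parse(text) {
  const [header, ...rows] = parseRows(text);
  return rows.map((row) =>
    Object.fromEntries(header.map((field, i) => [field, row[i]]))
  );
}

const csv = "name,mpg\nchevy,18\nbuick,15";
console.log(parseRows(csv)); // [["name","mpg"],["chevy","18"],["buick","15"]]
console.log(parse(csv));     // [{name:"chevy",mpg:"18"},{name:"buick",mpg:"15"}]
```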

Published build artifacts have the wrong version (2.5.0 instead of 2.5.1)

Looks like 2.5.1 #390 includes the wrong version 2.5.0 in the published artifacts. Search for “version” here:

https://cdn.jsdelivr.net/npm/vega-datasets@2.5.1/build/vega-datasets.module.js

This results in broken URLs that point to 2.5.0, which is missing data. For example, this:

world = (await import('vega-datasets@2.5.1')).default['world-110m.json'].url

https://cdn.jsdelivr.net/npm/vega-datasets@2.5.0/data/world-110m.json

7 datasets that cannot be loaded

Description
Hello, Vega team!

I hope you are doing well. I came across an issue while exploring datasets in the Vega dataset repository. Specifically, I found that 7 datasets in the following directory cannot be loaded using the pd.read_json(url) method:

https://github.com/vega/vega-datasets/tree/main/data

I would greatly appreciate it if you could take a look at this issue and provide a possible solution. If you need any additional information from me, please let me know.

Thank you for your time and attention!

Steps to Reproduce
Go to https://github.com/vega/vega-datasets/tree/main/data
Select any of the following 7 datasets:
https://raw.githubusercontent.com/vega/vega-datasets/main/data/annual-precip.json
https://raw.githubusercontent.com/vega/vega-datasets/main/data/earthquakes.json
https://raw.githubusercontent.com/vega/vega-datasets/main/data/londonBoroughs.json
https://raw.githubusercontent.com/vega/vega-datasets/main/data/londonTubeLines.json
https://raw.githubusercontent.com/vega/vega-datasets/main/data/miserables.json
https://raw.githubusercontent.com/vega/vega-datasets/main/data/us-10m.json
https://raw.githubusercontent.com/vega/vega-datasets/main/data/world-110m.json

Load the dataset using pd.read_json(url) method.
Observe that the dataset cannot be loaded.

Expected behavior
The datasets in the aforementioned directory should be able to be loaded using pd.read_json(url) method.

Actual behavior
7 of the datasets in the directory cannot be loaded using pd.read_json(url) method.

Additional Information
Operating System: Windows
Python version: 3.10.9
Pandas version: 1.5.2

List of errors received:
[['ValueError', 'All arrays must be of the same length'],
 ['ValueError', 'All arrays must be of the same length'],
 ['ValueError', 'All arrays must be of the same length'],
 ['ValueError', 'All arrays must be of the same length'],
 ['ValueError', 'All arrays must be of the same length'],
 ['ValueError', 'Mixing dicts with non-Series may lead to ambiguous ordering.'],
 ['ValueError', 'Mixing dicts with non-Series may lead to ambiguous ordering.']]
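These failures are consistent with the shape of the files rather than with corrupted data: pd.read_json expects an array of records (or a dict of equal-length columns), while these seven files are nested structures, e.g. TopoJSON topologies (world-110m.json, us-10m.json) or node/link graphs (miserables.json). A sketch of the shape check involved (looksTabular and the sample values are hypothetical):

```javascript
// A tabular loader expects an array of record objects. TopoJSON files and
// graph files are single nested objects instead, so a tabular reader fails.
function looksTabular(json) {
  return Array.isArray(json) && json.every((row) => typeof row === "object");
}

// Illustrative shapes, not the real file contents:
const cars = [{ Name: "chevy", Miles_per_Gallon: 18 }];    // array of records
const world = { type: "Topology", objects: {}, arcs: [] }; // TopoJSON
const miserables = { nodes: [], links: [] };               // node/link graph

console.log(looksTabular(cars));       // true
console.log(looksTabular(world));      // false
console.log(looksTabular(miserables)); // false
```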
