datopian / datahub-qa
:package: Bugs, issues and suggestions for datahub.io
Home Page: https://datahub.io/
Hello,
I am just testing / exploring what can be achieved with the datahub.io CLI tool, and I have stumbled upon the following:
With the following minimal data package:
{
  "resources": [
    {
      "name": "test-resource",
      "path": [ "myfile1.csv", "myfile2.csv" ]
    }
  ]
}
and the following folder structure:
./
├── datapackage.json
├── myfile1.csv
└── myfile2.csv
The CLI tool provides the following:
$ data info datapackage.json
> Error! path_.replace is not a function
I get the same error even if `path` is an array with a single item (but not if a single string `path` is provided). This suggests the problem is with the array.
To be fair, I am very new to the frictionless spec, but I think that the above resource descriptor is valid... But the datahub.io tool doesn't seem to like it.
I think it is fine if datahub.io only handles a single (specifically, non-array) `url-or-path`, but it would be useful if this were documented somewhere (or otherwise handled gracefully).
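One way the array case could be handled gracefully is to normalize up front. A minimal sketch in Python (the CLI itself is JavaScript; `normalize_path` is a hypothetical helper, not part of the actual codebase):

```python
def normalize_path(path):
    """Coerce a resource `path` (string or list of strings) to a list.

    The Frictionless spec allows both forms, so code that calls string
    methods such as `.replace` on `path` should normalize first.
    """
    if isinstance(path, str):
        return [path]
    return list(path)
```

With this, `normalize_path("myfile1.csv")` and `normalize_path(["myfile1.csv", "myfile2.csv"])` can be processed by the same downstream code.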
At the moment, neither `data push file.csv` nor `data init file.csv` works when the CSV file is semicolon-separated (as is common in France, for example).
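Detecting the delimiter before parsing would address this. A sketch of the approach using Python's standard `csv.Sniffer` (an illustration only, not what the CLI actually does):

```python
import csv
import io

def read_rows(text, delimiters=",;\t"):
    # Guess the delimiter from a sample of the file, then parse with
    # the detected dialect so semicolon-separated files work too.
    dialect = csv.Sniffer().sniff(text[:1024], delimiters=delimiters)
    return list(csv.reader(io.StringIO(text), dialect))
```

For a French-style file, `read_rows("a;b;c\n1;2;3\n")` yields the same row structure as a comma-separated one.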
At the moment, we support scheduling in the following way: `every 90s`, `every 5m`, `every 2d`… The number is always an integer, the selector is one of `s/m/h/d/w` (second through week), and you can't schedule for less than 60 seconds.
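The rule above can be sketched as a small parser (a hypothetical Python helper for illustration; the real implementation lives elsewhere):

```python
import re

# Seconds per unit for the s/m/h/d/w selectors.
PERIODS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def parse_schedule(text):
    # "every <int><s|m|h|d|w>" -> interval in seconds, minimum 60s.
    match = re.fullmatch(r"every (\d+)([smhdw])", text)
    if not match:
        raise ValueError(f"invalid schedule: {text!r}")
    seconds = int(match.group(1)) * PERIODS[match.group(2)]
    if seconds < 60:
        raise ValueError("cannot schedule for less than 60 seconds")
    return seconds
```

For example, `parse_schedule("every 90s")` returns 90, while `every 30s` is rejected.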
From @zelima :
I see we do not support monthly and annual schedules. Also, some datasets need an annual update: if I want a dataset updated every first of January, I currently have to push it on that exact date. Maybe we should support a "starting date" at some point; it would be useful for that case. E.g., datasets that are updated monthly need to be updated exactly on the first of each month.
As discussed on gitter, the version of datapackage.json included in zip files downloaded from datahub.io lacks basic metadata like title, license, and description.
See a diff here between https://pkgstore.datahub.io/core/registry/6/datapackage.json (left hand side) and the version included in the zip file downloaded (right hand side): http://www.mergely.com/Xo9fTlfZ/
(Sorry for editing your initial message)
The issue is fixed on the assembler side: datopian/assembler#81
But now we need to redeploy all core datasets to put the metadata in place.
http://datahub.io/search and http://datahub.io/blog show me the first few datasets / blog entries, but there is no way to go to the next page to see more datasets / older blog posts.
The current config file has a data structure designed around a single account on datahub. I have just started interacting with the system as a user and I already want two accounts (a pseudo org account, and my own account).
Many CLI config files support this elegantly. I suggest taking a good look at the `gcloud` and `aws` CLIs, both for the user experience of running commands as different users and for the config file itself.
See https://datahub.io/joelgombin/ville_vitry_subventions_2017: the title of the page, which was passed by the CLI tool, has an encoding issue, whereas the preview of the CSV resource is fine.
I have remote data somewhere (a URL) and I want to create a `flow.yaml` so that it scrapes data from that URL and performs some processing steps (e.g., removes some rows). Although I know this functionality is provided by DataHub, I cannot find any guidelines. It would be very useful to have a tutorial or a simple set of steps describing what I should do.
Support inline resource data on data packages.
My question is how common a use case this is ...
It may be a little bit of a pain i suspect because i'm not sure how data package pipelines handles this ...
/cc @pwalsh
The datasets list on my profile does not link to the actual dataset(s).
The only place I can navigate to a dataset is from my event stream in the left-hand column.
Pipeline passed successfully, but there is no preview table for this dataset http://datahub.io/JohnSnowLabs/nys-mathematics-exam.
In the console log: `Uncaught (in promise) TypeError: Cannot read property 'unique' of undefined`
The `geoip2-ipv4` core dataset has not passed the pipeline.
File "/usr/local/lib/python3.6/codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 4681: invalid start byte
On datahub: http://datahub.io/core/geoip2-ipv4
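One defensive pattern for pipelines hitting bytes like the `0x8e` above is an encoding fallback chain. A sketch (the candidate encodings here are an assumption, not what the pipeline is actually configured with):

```python
def decode_lenient(data: bytes) -> str:
    # Try UTF-8 first, then the Windows encoding that defines 0x8e,
    # and finally replace undecodable bytes rather than crashing.
    for encoding in ("utf-8", "cp1252"):
        try:
            return data.decode(encoding)
        except UnicodeDecodeError:
            continue
    return data.decode("utf-8", errors="replace")
```

The byte `0x8e` from the traceback is invalid UTF-8 but decodes as `Ž` in cp1252, so the pipeline would keep going instead of dying.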
I can't simply navigate from anywhere to my profile page. Clicking my avatar does nothing (goes to https://datahub.io/#).
From https://datahub.io/core/s-and-p-500-companies please see https://datahub.io/core/s-and-p-500-companies/r/constituents.json
The JSON link returns CSV instead of JSON (see below)
Thanks
Symbol,Name,Sector
MMM,3M Company,Industrials
ABT,Abbott Laboratories,Health Care
ABBV,AbbVie,Health Care
ACN,Accenture plc,Information Technology
ATVI,Activision Blizzard,Information Technology
AYI,Acuity Brands Inc,Industrials
ADBE,Adobe Systems Inc,Information Technology
AAP,Advance Auto Parts,Consumer Discretionary
AES,AES Corp,Utilities
AET,Aetna Inc,Health Care
AMG,Affiliated Managers Group Inc,Financials
$ data info ./selected-crimes-local-authorities-2012-2015/
# sources/selected-crimes-local-authorities-2012-2015-*
Collection of data about Israeli Police events by local authorities and collection of selected crimes.
Data source: ... see more below
# RESOURCES
┌───────────────────────────┬────────┐
│ Name │ Format │
├───────────────────────────┼────────┤
│ selected_crimes_2012_2015 │ csv │
└───────────────────────────┴────────┘
$ ls -sh selected-crimes-local-authorities-2012-2015/data/
total 78M
78M selected_crimes_2012_2015.csv
E.g. if the CSV file title is `ville_vitry_Subventions_2017_comma_csv`, it gives an error: https://datahub.io/joelgombin/test/pipelines
This EPIC contains all the issues related to the `data push` command.
I can't log out of datahub.io - not in the browser, and not via a CLI command.
I cannot use data from the DataHub by following the instructions. Go to any dataset page on datahub.io and try to use the dataset according to the instructions, e.g. http://datahub.io/core/cofog#python
Relates to #26
As I can't log out of datahub.io, I tried to force it by revoking access to my GitHub account.
After revoking access, I am still logged in, and worse, I can still push data from the CLI.
There are no browser cookies that I can clear to forcibly flush my session either.
I am not sure if there is a potential security issue here, somewhere, or, if it is just breaking an implicit contract of trust to let me sign out.
There are a few links from the recline js demo page that link to the explorer.datahub.io pages.
I'm a first time user of the current CLI. After login, I want to just push something.
mkdir test
cd test
touch test.txt
data push
> Error! ENOENT: no such file or directory, open '/Users/pwalsh/test/datapackage.json'
I'm confused, because the CLI tells me the following about `data push`:
[path] Push data at `path` to the DataHub
I could be a user who does not know what `datapackage.json` is.
As someone involved in the development of our Frictionless Data software, I know we have an infer method for Data Packages, so I wonder why we are not using it.
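For illustration, the core of such an infer step is small. A toy sketch in Python (the real Frictionless libraries do much more, e.g. sampling many rows and handling dates and booleans):

```python
import csv
import io

def infer_fields(csv_text):
    # Read the header, then guess each column's type from the first
    # data row: integer, number, or string.
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    first_row = next(reader, [])

    def guess(value):
        for cast, type_name in ((int, "integer"), (float, "number")):
            try:
                cast(value)
                return type_name
            except ValueError:
                pass
        return "string"

    return [{"name": name, "type": guess(value)}
            for name, value in zip(header, first_row)]
```

Running an infer step like this on push would let the CLI synthesize a starter `datapackage.json` instead of erroring out when none exists.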
May I suggest we use `~/.config` and namespace under there: `~/.config/datahub.io/config.json`. This is a convention, and many CLIs I use follow it (Digital Ocean, Heroku). In general I'm -1 on `~/.datahub.json`.
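The convention is cheap to implement. A sketch of the lookup in Python for brevity (`config_path` is a hypothetical helper; it also honours `XDG_CONFIG_HOME`, as XDG-following tools do):

```python
import os

def config_path():
    # ~/.config/datahub.io/config.json by default, overridable via
    # the XDG_CONFIG_HOME environment variable.
    base = os.environ.get("XDG_CONFIG_HOME") or os.path.expanduser("~/.config")
    return os.path.join(base, "datahub.io", "config.json")
```

Namespacing under a directory also leaves room for multiple profiles later (one file per account), which the flat `~/.datahub.json` layout does not.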
As a user reading the docs, I need the ability to give feedback about what I'm reading immediately, without needing to leave the page, so I can send my ideas & proposals quickly and stay focused on my tasks.
This example may help: https://djbook.ru/rel1.9/intro/tutorial01.html — the user can leave a comment just by clicking in the margin to the left of the text.
I wonder if there is a UTF-8 issue in CSV file output.
See the Åland Islands item on:
https://datahub.io/core/country-list#data
It is fine on that page, but here it looks weird:
https://pkgstore.datahub.io/core/country-list:data_csv/data/data_csv.csv
My guess is that the actual file is correct, but that we may not be setting UTF-8 as the encoding when serving the file from pkgstore. This may be a non-issue, btw (I'm not sure anyone cares about looking at the version on pkgstore).
Currently, `data get` acquires the HTML page for a dataset.
Expected behaviour: `data get` should get the datapackage.zip of the dataset.
Steps to reproduce this error:
Alis-MBP-3:~ alinaqvi$ data get http://datahub.io/core/co2-ppm
Time elapsed: 0.64 s
Dataset/file is saved in "co2-ppm."
Alis-MBP-3:~ alinaqvi$ data --version
0.6.3
Alis-MBP-3:~ alinaqvi$ file co2-ppm.
co2-ppm.: HTML document text, UTF-8 Unicode text, with very long lines
Alis-MBP-3:~ alinaqvi$ mv co2-ppm. co2-ppm.html
Alis-MBP-3:~ alinaqvi$ open co2-ppm.html
which shows:
https://www.dropbox.com/s/abu72lhip7yyvln/Screenshot%202018-01-16%2017.22.28.png?dl=0
So `data get` only obtains the HTML page of the dataset.
As an inexperienced user of the DataHub, I want to know how to handle and/or prepare datasets, so I can make my data clean and clear.
As Rufus, I want to push data packages to e.g. GitHub and have a webhook that auto-triggers an import to the DataHub, so that my DataHub dataset is up to date (example: https://github.com/datasets/registry).
What I want is a webhook and GitHub support for that...
I have this on disk:
test/test.txt
test/datapackage.json
test.txt is an empty file.
datapackage.json is:
{
  "name": "stuff",
  "resources": [
    { "name": "stuff-resource", "data": [1,2,3] }
  ]
}
This is a valid Data Package. It passes `data validate`.
I run `data push` and get:
pwalsh:test pwalsh$ data push
> Error! [object Promise]
I have source data on disk with my Data Package. I refer to it with `sources`.
As an inexperienced user of the DataHub, I want to know how to handle and/or prepare datasets, so I can make my data clean and clear.
Table:
Feature name | supported by automation pipelines
When trying to fetch https://datahub.io/core/country-codes/r/country-codes.json
I get this error:
Cannot GET /core/country-codes/r/data/json/data/country-codes.json
I think this message started popping up today.
Many potential users come from a machine learning context and may be interested in sample machine learning datasets so let's get some up on the DataHub.
See also openml/OpenML#482
Currently datahub.io supports two flavours of view spec: a simple spec, and Vega (v2.x).
The simple spec is suggested to cover the 80/20 usecase, but with real-world data, I've found it not to be as useful for most cases. The data almost always needs some form of aggregation, transformation or graph specific tweaks like human readable labels in place of data keys.
On the other hand, Vega is very powerful, but the spec is harder to write by hand. There are few tools available to help iterate while developing a spec document and resulting visualisation; iterating by making changes and pushing to datahub.io isn't efficient.
Middle ground support for Vega-lite would be very desirable to help bridge the gap between the too-simple 'simple' spec, and the much more powerful, but complex Vega spec.
The problem of working with Vega is further compounded by datahub.io using v2.x, rather than the more recent v3. Online documentation and tools are centered around the newer version. For example, there is an online editor provided by Vega, that will take Vega-Lite, and 'compile' it to Vega, but this transpilation isn't supported by the Vega 2 compatible version of the editor (https://vega.github.io/vega-editor/).
Files in the pkg store are supposed to have CORS support turned on so that cross origin http requests work from javascript. However, it looks like this is not working atm which is breaking this site for example: http://rufuspollock.github.io/imf-weo/
I have an invalid `datapackage.json`. I try to use `data push`.
I get the message `> Error! Unexpected end of JSON input`.
I happen to know as a developer that this error is raised from the method that validates the descriptor. Even without fixing the messages that get thrown by our use of JSON Schema validators on our descriptors, the user experience could be greatly improved by showing the user the context of this error.
Example:
> Running Data Package Validation Step on datapackage.json
> Error! Unexpected end of JSON input
Then I would at least know where the error comes from.
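The suggestion amounts to announcing the step before parsing. A sketch in Python for brevity (the CLI is JavaScript; `load_descriptor` is a hypothetical helper):

```python
import json

def load_descriptor(path):
    # Announce the validation step first, so any JSON error that
    # follows is anchored to the file and step that caused it.
    print(f"> Running Data Package Validation Step on {path}")
    try:
        with open(path) as handle:
            return json.load(handle)
    except json.JSONDecodeError as error:
        raise SystemExit(f"> Error! {path} is not valid JSON: {error}")
```

Even without improving the underlying JSON Schema messages, the user now knows the error came from parsing their descriptor, not from some network or push step.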
At the moment, for "simple" view spec we support line and vertical bar charts. Although these two types are popular and cover most of use-cases, it would be useful to have horizontal bar charts. E.g., consider this example: https://datahub.io/core/gini-index. In the second graph, we have too many countries in x axis so it is not possible to show names for all of them as we have limited width. In such situations, horizontal bar charts could be useful as we're not limited in height.
Currently, if you try to push an Excel file and specify a non-existent sheet, it will not fail and will push data from the first sheet of the Excel file.
Another problem is that even if you select an existing sheet, it pushes the first one, which means we're not able to push any sheet other than the first.
Try to push an Excel file with two sheets named `sheet1` and `sheet2`. Each of the following commands will push the data from the first sheet:
data push myexcel.xlsx --sheets=3
data push myexcel.xlsx --sheets=2
data push myexcel.xlsx --sheets=1,2
data push myexcel.xlsx --sheets=all
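A sketch of how the `--sheets` value could be resolved and validated against the workbook's sheet names (in Python; `parse_sheets` is a hypothetical helper, not the CLI's actual code):

```python
def parse_sheets(option, sheet_names):
    # "all" -> every sheet; otherwise a comma-separated list of
    # 1-based indices. Unknown indices fail loudly instead of
    # silently falling back to the first sheet.
    if option == "all":
        return list(sheet_names)
    indices = [int(part) for part in option.split(",")]
    for index in indices:
        if not 1 <= index <= len(sheet_names):
            raise ValueError(
                f"sheet {index} does not exist "
                f"(workbook has {len(sheet_names)} sheets)")
    return [sheet_names[index - 1] for index in indices]
```

With this, `--sheets=3` on a two-sheet workbook would be an error, and `--sheets=2` would actually select `sheet2`.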
data-cli
The following datasets can't be downloaded:
Airports
https://datahub.io/dataset/global-airports-in-rdf
http://rv1460.1blu.de/datasets/global-airports/global-airports.ttl
SIDER
https://datahub.io/de/dataset/fu-berlin-sider
http://wifo5-03.informatik.uni-mannheim.de/sider/sider_dump.nt.bz2
US security
https://datahub.io/de/dataset/sec-rdfabout
Added a new dataset using:
data push https://raw.githubusercontent.com/okfn/licenses/master/licenses.csv --schedule="every 1d"
It returned: 🙌 your data is published!
Went to the link https://datahub.io/Stephen-Gates/licenses-black-rattlesnake-15 and got: Sorry, this dataset was not found.
This EPIC contains all the issues related to the `data get` command:
`data get` only acquires the HTML page for the dataset itself (#43)

npm install datahub-cli
data --version
data info
> Error! ENOENT: no such file or directory, open '/Users/pwalsh/datapackage.json'
When following the instructions at https://datahub.io/docs/getting-started/installing-data I consciously decided to run `data info` instead of `data info https://datahub.io/core/finance-vix`.
As a user of many CLI tools, I expect to be able to run a command without arguments and get some type of context-driven help, or, at least an error for the missing argument.
The error I do receive is related to a missing configuration step, I suspect, and this is confusing to me, especially because the instructions on this page make no mention of configuration.
I then went to http://datahub.io/docs/features/data-cli, which is linked from the above page, and still do not have any idea how to configure the CLI. `data help` works, but has none of the info I am looking for.
I then ran `data push`, which tells me to log in. This was successful (according to the CLI messaging), but then I ran `data info` again and got the same original error.
Using Ubuntu 16.04.3 LTS (xenial)
Anuar Ustayev @anuveyatsu 08:56
@ppKrauss @rufuspollock this is explained here http://datahub.io/docs/getting-started/installing-data#installing-binaries the problem is with xdg-open library on Linux
Peter @ppKrauss 09:00
Suggestion: change page http://datahub.io/docs/getting-started/installing-data#installing-binaries to link http://datahub.io/docs/getting-started/installing-data#installing-binaries
Hi @anuveyatsu, I did the cp /usr/bin/xdg-open /usr/local/bin/xdg-open; perhaps I need to reboot. For now it has no effect; the login stops at the prompt "? Login with...
❯ Github"
Anuar Ustayev @anuveyatsu 09:04
@ppKrauss so after hitting enter, it doesn’t open your default browser?
Peter @ppKrauss 09:04
Thanks @rufuspollock , I will report at there
@anuveyatsu, after waiting and pressing ENTER ... "> Opening browser and waiting for you to authenticate online"
Error! spawn /home/user/Downloads/working/DATAHUB/xdg-open ENOENT
I tried to run the R code on datahub.io/JohnSnowLabs/community-emergency-response-teams and received a warning and an error.
library("jsonlite")
json_file <- "http://datahub.io/JohnSnowLabs/community-emergency-response-teams/datapackage.json"
json_data <- fromJSON(paste(readLines(json_file), collapse = ""))
#> Warning in readLines(json_file): incomplete final line found on
#> 'http://datahub.io/JohnSnowLabs/community-emergency-response-teams/
#> datapackage.json'
path_to_file = json_data$resources[[1]]$path
#> Error in json_data$resources[[1]]$path: $ operator is invalid for atomic vectors
Hi, in GitHub we have an organisation with some data packages; I would like to publish data under the name of the organisation, not my username.
https://datahub.io/organisation_name/dataset
Thank you
Added by @AcckiyGerman
Users can use their GitHub organisation name to publish data under it.
E.g. @Mikanebu is a member of https://github.com/datopian so it would be great for him to be able to publish data on http://datahub.io/datapian
Using GitHub OAuth scopes (https://developer.github.com/apps/building-oauth-apps/scopes-for-oauth-apps/) we could probably read the list of organisations where the user is a member, and use it when pushing data (or when creating the datahub user?)
Hi Open Data friends,
The Datahub API has been broken for at least 12 days because of a bug in the way HTTP redirects are performed. I posted this issue on the OKFN forum 12 days ago (link), but that issue was closed and I was asked to open a new issue here. So here we go...
The Datahub API uses query parameters to retrieve information, but these parameters are currently being lost because the server drops them in redirects. Here is a particular example; notice that the original request URI contains `?id=270a`, but the redirect URI no longer does:
$ curl -vL "http://datahub.io/api/action/organization_show?id=270a"
> GET /api/action/organization_show?id=270a HTTP/1.1
> Host: datahub.io
> User-Agent: curl/7.53.1
> Accept: */*
< HTTP/1.1 302 Found
< Date: Sun, 03 Sep 2017 06:25:14 GMT
< Content-Type: text/plain; charset=utf-8
< Content-Length: 73
< Connection: keep-alive
< Set-Cookie: __cfduid=d50eba1741b2be0cdef67ac675b9849e11504419913; expires=Mon, 03-Sep-18 06:25:13 GMT; path=/; domain=.datahub.io; HttpOnly
< X-Powered-By: Express
< Location: https://old.datahub.io/api/action/organization_show
< Vary: Accept
< set-cookie: connect.sid=s%3AadA_LdIs0_XUTekr2yRHpLSNhFwsAQLJ.zddMrtw53pGjwb3WzUks6%2F0WrsHlTOzxPjUA5m20vfs; Path=/; Expires=Sun, 03 Sep 2017 06:26:14 GMT; HttpOnly
< Server: cloudflare-nginx
< CF-RAY: 3986a16da09e2b9a-AMS
> GET /api/action/organization_show HTTP/2
> Host: old.datahub.io
> User-Agent: curl/7.53.1
> Accept: */*
< HTTP/2 409
< date: Sun, 03 Sep 2017 06:25:14 GMT
< content-type: application/json;charset=utf-8
< content-length: 160
< set-cookie: __cfduid=d27b5560cea8ffe0a0e91e8e93553f2f51504419914; expires=Mon, 03-Sep-18 06:25:14 GMT; path=/; domain=.datahub.io; HttpOnly
< cache-control: no-cache
< pragma: no-cache
< server: cloudflare-nginx
< cf-ray: 3986a17058780c2f-AMS
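For comparison, building a `Location` header that preserves the query string is a one-liner in most stacks. A Python sketch (illustrative only; the actual server is an Express app):

```python
from urllib.parse import urlsplit, urlunsplit

def redirect_location(request_uri, new_host):
    # Keep the original path *and* query string when redirecting to
    # another host; dropping the query is exactly the bug shown above.
    parts = urlsplit(request_uri)
    return urlunsplit(("https", new_host, parts.path, parts.query, ""))
```

Applied to the request above, the redirect would go to `https://old.datahub.io/api/action/organization_show?id=270a` instead of losing `?id=270a`.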
RFC / Feature Idea: Indication of the time period covered by a dataset
Thinking about adding metadata about which time period time-series datasets cover. What do people think, and any suggestions...?