
datahub-qa's Issues

Datahub API loses query parameters after redirect

Hi Open Data friends,

The Datahub API has been broken for at least 12 days because of a bug in the way HTTP redirects are performed. I posted this issue on the OKFN forum 12 days ago (link), but it was closed and I was asked to open a new issue here. So here we go...

The Datahub API uses query parameters to retrieve information, but these parameters are currently being dropped by the server during redirects. Here is a concrete example; notice that the original request URI contains ?id=270a, but the redirect URI no longer does:

$ curl -vL "http://datahub.io/api/action/organization_show?id=270a"
> GET /api/action/organization_show?id=270a HTTP/1.1
> Host: datahub.io
> User-Agent: curl/7.53.1
> Accept: */*

< HTTP/1.1 302 Found
< Date: Sun, 03 Sep 2017 06:25:14 GMT
< Content-Type: text/plain; charset=utf-8
< Content-Length: 73
< Connection: keep-alive
< Set-Cookie: __cfduid=d50eba1741b2be0cdef67ac675b9849e11504419913; expires=Mon, 03-Sep-18 06:25:13 GMT; path=/; domain=.datahub.io; HttpOnly
< X-Powered-By: Express
< Location: https://old.datahub.io/api/action/organization_show
< Vary: Accept
< set-cookie: connect.sid=s%3AadA_LdIs0_XUTekr2yRHpLSNhFwsAQLJ.zddMrtw53pGjwb3WzUks6%2F0WrsHlTOzxPjUA5m20vfs; Path=/; Expires=Sun, 03 Sep 2017 06:26:14 GMT; HttpOnly
< Server: cloudflare-nginx
< CF-RAY: 3986a16da09e2b9a-AMS

> GET /api/action/organization_show HTTP/2
> Host: old.datahub.io
> User-Agent: curl/7.53.1
> Accept: */*

< HTTP/2 409 
< date: Sun, 03 Sep 2017 06:25:14 GMT
< content-type: application/json;charset=utf-8
< content-length: 160
< set-cookie: __cfduid=d27b5560cea8ffe0a0e91e8e93553f2f51504419914; expires=Mon, 03-Sep-18 06:25:14 GMT; path=/; domain=.datahub.io; HttpOnly
< cache-control: no-cache
< pragma: no-cache
< server: cloudflare-nginx
< cf-ray: 3986a17058780c2f-AMS
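On the server side, the fix is to carry the original query string over to the redirect Location. A minimal sketch of that rule in Python (the function name and the use of urllib are mine, not the datahub codebase):

```python
from urllib.parse import urlsplit, urlunsplit

def redirect_location(request_url, target_base):
    """Rebuild a redirect Location, carrying over the original query string."""
    query = urlsplit(request_url).query
    scheme, netloc, path, _, frag = urlsplit(target_base)
    return urlunsplit((scheme, netloc, path, query, frag))
```

With the example above, redirecting `?id=270a` to old.datahub.io would keep the parameter intact.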

`data get` only acquires the html page for the dataset not the datapackage.zip

Currently, data get acquires the HTML page for a dataset

Expected Behaviour: data get should get the datapackage.zip of the dataset

Steps to reproduce this error:

Alis-MBP-3:~ alinaqvi$ data get http://datahub.io/core/co2-ppm
Time elapsed: 0.64 s
Dataset/file is saved in "co2-ppm."
Alis-MBP-3:~ alinaqvi$ data --version
0.6.3
Alis-MBP-3:~ alinaqvi$ file co2-ppm. 
co2-ppm.: HTML document text, UTF-8 Unicode text, with very long lines
Alis-MBP-3:~ alinaqvi$ mv co2-ppm. co2-ppm.html
Alis-MBP-3:~ alinaqvi$ open co2-ppm.html

which shows:
https://www.dropbox.com/s/abu72lhip7yyvln/Screenshot%202018-01-16%2017.22.28.png?dl=0

So data get only obtains the HTML page of the dataset, not the datapackage.zip.
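Until this is fixed, a quick client-side check can at least detect that an HTML page came back instead of an archive; a small hypothetical helper:

```python
def looks_like_html(payload: bytes) -> bool:
    """Heuristic: does a downloaded body look like an HTML page?"""
    head = payload.lstrip()[:15].lower()
    return head.startswith(b"<!doctype html") or head.startswith(b"<html")
```

This is roughly what the `file` command did in the transcript above, automated on the raw bytes.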

Support inline resource data

Support inline resource data on data packages.

My question is how common a use case this is ...

It may be a bit of a pain, I suspect, because I'm not sure how data package pipelines handles this ...

/cc @pwalsh
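For reference, the Data Package spec allows a resource to carry its rows inline via a `data` property instead of `path`; a minimal descriptor might look like this (the names are illustrative):

```json
{
  "name": "inline-example",
  "resources": [
    {
      "name": "scores",
      "format": "json",
      "data": [
        {"team": "a", "points": 1},
        {"team": "b", "points": 2}
      ]
    }
  ]
}
```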

Guide for typical data wrangling operations.

As a non-experienced user of datahub, I want to know how to handle and/or prepare data sets, so that I can make my data clean and clear.

Acceptance Criteria

  • guide in the datahub.io/docs
  • show this guide to a newcomer who needs help with any of the listed typical operations (from datahub.io/chat)

Typical tasks to cover

Table: feature name | supported by automation pipelines

  • delete first X rows - supported
  • delete last X rows - supported
  • date correction - supported
  • add new header row - supported
  • pivot and unpivot - unpivot supported
  • regex on column - find/replace supported
  • removing extra white spaces - supported (with regex)
  • downloading and extracting zip file - partially supported (only when one file in the archive)
  • remove column - supported
  • read excel file, extract sheet - supported
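Two of the listed operations, sketched on plain lists of rows (these helpers are illustrative, not the pipeline's actual processors):

```python
def drop_rows(rows, first=0, last=0):
    """Delete the first/last N data rows, keeping the header row."""
    end = len(rows) - last
    return [rows[0]] + rows[1 + first:end]

def remove_column(rows, name):
    """Remove a column by its header name."""
    i = rows[0].index(name)
    return [r[:i] + r[i + 1:] for r in rows]
```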

'utf-8' codec can't decode byte

The geoip2-ipv4 core dataset has not passed the pipeline.

File "/usr/local/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8e in position 4681: invalid start byte

On datahub: http://datahub.io/core/geoip2-ipv4
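One common mitigation is to fall back to legacy encodings before giving up; 0x8e is a valid byte in cp1252 (Ž), which suggests the source file is Windows-encoded. A sketch (the fallback order is an assumption, not what the pipeline actually does):

```python
def decode_lenient(raw: bytes) -> str:
    """Try utf-8 first, then common Windows/Latin encodings."""
    for enc in ("utf-8", "cp1252", "latin-1"):
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # last resort: never crash, replace undecodable bytes
    return raw.decode("utf-8", errors="replace")
```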

I can't logout

Description

I can't log out of datahub.io, neither in the browser nor via a CLI command.

Problems with files in sources.

Description

I have source data on disk with my Data Package. I refer to it with sources.

  • Sources do not get uploaded (I can't find them by going to what should be their URL)
  • Sources are not part of the zip download
  • The link for sources assumes a URL, and therefore creates incorrect links (relative to the data package preview page) for sources in the top metadata table of the page

json data is now CSV

From https://datahub.io/core/s-and-p-500-companies please see https://datahub.io/core/s-and-p-500-companies/r/constituents.json

The JSON link returns CSV instead of JSON (see below)

Thanks

Symbol,Name,Sector
MMM,3M Company,Industrials
ABT,Abbott Laboratories,Health Care
ABBV,AbbVie,Health Care
ACN,Accenture plc,Information Technology
ATVI,Activision Blizzard,Information Technology
AYI,Acuity Brands Inc,Industrials
ADBE,Adobe Systems Inc,Information Technology
AAP,Advance Auto Parts,Consumer Discretionary
AES,AES Corp,Utilities
AET,Aetna Inc,Health Care
AMG,Affiliated Managers Group Inc,Financials
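The expected behaviour, sketched with the standard library (purely illustrative, not the datahub implementation): the .json endpoint should serve the same rows as a JSON array of objects keyed by the CSV header.

```python
import csv
import io
import json

def csv_to_json(csv_text: str) -> str:
    """Serve CSV rows as a JSON array of objects keyed by the header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return json.dumps([dict(row) for row in reader])
```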

Ability to give feedback on docs.

As a user reading the docs, I need the ability to give feedback about what I'm reading immediately, without needing to leave the page, so I can send my ideas and proposals quickly and stay focused on my tasks.

Acceptance criteria

  • user can quickly send feedback directly from the doc page (e.g. by selecting the text and pressing the right mouse button, or a similarly simple way)
  • the feedback is directed to a responsible person who can edit the docs.

Tasks

  • find the proper JS plugin
  • integrate it with our docs platform
  • test feedback ability by someone from outside the team

Analysis

This example may help:
https://djbook.ru/rel1.9/intro/tutorial01.html - a user can leave a comment just by clicking on the field to the left of the text.

Sample Machine Learning datasets on DataHub

Many potential users come from a machine learning context and may be interested in sample machine learning datasets, so let's get some up on the DataHub.

See also openml/OpenML#482

Tasks

  • Identify some sample datasets
  • Tabular Data package-ize them
  • Get them into a machine-learning section (maybe create a special org or add these to examples and/or awesome list)
  • Write a tutorial especially explaining how to convert to common formats wanted elsewhere (e.g. ARFF?)

Research

https://github.com/renatopp/arff-datasets

R code shown on the dataset page does not work

I tried to run the R code on datahub.io/JohnSnowLabs/community-emergency-response-teams
and received a warning and an error.

library("jsonlite")

json_file <- "http://datahub.io/JohnSnowLabs/community-emergency-response-teams/datapackage.json"
json_data <- fromJSON(paste(readLines(json_file), collapse = ""))
#> Warning in readLines(json_file): incomplete final line found on
#> 'http://datahub.io/JohnSnowLabs/community-emergency-response-teams/
#> datapackage.json'

path_to_file = json_data$resources[[1]]$path
#> Error in json_data$resources[[1]]$path: $ operator is invalid for atomic vectors

large dataset: page should not crash (or the CLI should not allow pushing)

Reproduction

  • Local datapackage at ./selected-crimes-local-authorities-2012-2015/
$ data info ./selected-crimes-local-authorities-2012-2015/
# sources/selected-crimes-local-authorities-2012-2015-*

Collection of data about Israeli Police events by local authorities and collection of selected crimes.

Data source: ... see more below

# RESOURCES

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Name                      β”‚ Format β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€
β”‚ selected_crimes_2012_2015 β”‚ csv    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  • large dataset
$ ls -sh selected-crimes-local-authorities-2012-2015/data/
total 78M
78M selected_crimes_2012_2015.csv

expected

  • Assuming there is a file size limit, the CLI should not allow pushing the file.
  • Also, when pushing the same data again via the CLI, does it compare checksums, or recopy the whole file each time?
  • If the problem is not a size limit, the dataset page should be displayed correctly.
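On the checksum question: hashing the file locally before pushing would let the CLI skip unchanged uploads even for a 78M file. A sketch of such a check (not the actual CLI logic):

```python
import hashlib

def file_digest(path, algo="md5"):
    """Stream a file through a hash so large files aren't read into memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()
```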

actual

[push] Error on data push

Description

I have this on disk:

test/test.txt
test/datapackage.json

test.txt is an empty file.

datapackage.json is:

{
  "name": "stuff",
  "resources": [
    { "name": "stuff-resource", "data": [1,2,3] }
  ]
}

This is a valid Data Package. It passes data validate.

I run data push and get:

pwalsh:test pwalsh$ data push
> Error! [object Promise]

[push] Validation of datapackage.json on data push is not clear

Description

I have an invalid datapackage.json. I try to use data push.

I get the message > Error! Unexpected end of JSON input.

I happen to know as a developer that this error is raised from the method that validates the descriptor. Even without fixing the messages that get thrown by our use of JSON Schema validators on our descriptors, the user experience could be greatly improved by showing the user the context of this error.

Example:

> Running Data Package Validation Step on datapackage.json
> Error! Unexpected end of JSON input

Then I would at least know where the error comes from.
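A sketch of the suggested improvement: wrap the JSON parse and prefix errors with the step that raised them (the function and message format are illustrative, not data-cli code):

```python
import json

def parse_descriptor(text, source="datapackage.json"):
    """Parse a descriptor, prefixing errors with the validation step context."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        raise ValueError(
            "Validating %s: %s at line %d, column %d"
            % (source, e.msg, e.lineno, e.colno)
        ) from e
```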

[Epic] Getting files and datasets from datahub.io

This EPIC contains all the issues related to the data get command.

Critical

  • data get only acquires the html page for the dataset itself #43

Major

  • getting 403 when requesting data with urllib #81

Minor

  • can't get in NTSF folder #55

Trivial

  • grab files from Github links #56

There are more issues connected to this epic; see the dependency list below.

Data wrangling tips for users

As a non-experienced user of datahub, I want to know how to handle and/or prepare data sets, so that I can make my data clean and clear.

Acceptance Criteria

  • guide article in the datahub.io/docs
  • show this guide to a newcomer who needs help with any of the listed typical operations (from datahub.io/chat)

Tasks

  • delete first X rows
  • delete last X rows
  • date correction
  • add new header row
  • pivot and unpivot
  • regex on column
  • removing extra white spaces
  • downloading and extracting zip file
  • remove column
  • read excel file, extract sheet

Analysis

  • do we need to create a Python lib with these typical functions?
  • or could we integrate these functions into an existing package?

How to push data for an organization

Hi, on GitHub we have an organisation with some data packages. I would like to publish data under the name of the organisation, not my username.

https://datahub.io/organisation_name/dataset

Thank you

expected behavior

Added by @AcckiyGerman

Users can use their GitHub organisation name to publish data under it.
E.g. @Mikanebu is a member of https://github.com/datopian so it would be great for him to publish data on http://datahub.io/datapian

how to implement

Using GitHub OAuth scopes https://developer.github.com/apps/building-oauth-apps/scopes-for-oauth-apps/ we could probably read the list of organisations where the user is a member, and use it when pushing data (or when creating the datahub user?).
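A sketch of that idea: with a token granted the read:org scope, GitHub's `GET /user/orgs` endpoint lists the user's organisations. Building the request (the helper is illustrative; the endpoint and scope are from GitHub's API docs):

```python
from urllib.request import Request

def orgs_request(token):
    """Build a GET /user/orgs request; the token needs the read:org scope."""
    return Request(
        "https://api.github.com/user/orgs",
        headers={
            "Authorization": "token " + token,
            "Accept": "application/vnd.github.v3+json",
        },
    )
```

The response is a JSON array of the organisations the user belongs to, which could be offered as publish targets at push time.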

Trying to use on Ubuntu 16 LTS

Using Ubuntu 16.04.3 LTS (xenial)

Anuar Ustayev @anuveyatsu 08:56
@ppKrauss @rufuspollock this is explained here http://datahub.io/docs/getting-started/installing-data#installing-binaries the problem is with xdg-open library on Linux

Peter @ppKrauss 09:00
Suggestion: change page http://datahub.io/docs/getting-started/installing-data#installing-binaries to link http://datahub.io/docs/getting-started/installing-data#installing-binaries
Hi @anuveyatsu , I did the cp /usr/bin/xdg-open /usr/local/bin/xdg-open, perhaps a reboot is needed. For now, no effect; the login stops at the prompt, "? Login with...
❯ Github"

Anuar Ustayev @anuveyatsu 09:04
@ppKrauss so after hitting enter, it doesn’t open your default browser?

Peter @ppKrauss 09:04
Thanks @rufuspollock , I will report at there
@anuveyatsu , after wait and ENTER ... "> Opening browser and waiting for you to authenticate online

Error! spawn /home/user/Downloads/working/DATAHUB/xdg-open ENOENT

Path arrays fail (using the CLI tool)

Hello,
I am just testing / exploring what can be achieved with the datahub.io CLI tool and I have stumbled over the following:

With the following minimal data package:

{
  "resources": [
    {
      "name": "test-resource",
      "path": [ "myfile1.csv", "myfile2.csv" ]
    }
  ]
}

and the following folder structure:

./
β”œβ”€β”€ datapackage.json
β”œβ”€β”€ myfile1.csv
└── myfile2.csv

The CLI tool provides the following:

$ data info datapackage.json
> Error! path_.replace is not a function

I get the same error even if path is an array with a single item (but not if a single string path is provided). This suggests the problem is with the array.

To be fair, I am very new to the frictionless spec, but I think that the above resource descriptor is valid... But the datahub.io tool doesn't seem to like it.

I think it is fine if datahub.io only handles a single (or, specifically non-array) url-or-path but it would be useful if this was documented somewhere (or otherwise gracefully handled).
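A graceful way to handle this is to normalise `path` to a list up front, rejecting anything else with a clear message; a sketch (not data-cli code):

```python
def normalize_paths(path):
    """Accept either a string path or a list of string paths (per the spec)."""
    if isinstance(path, str):
        return [path]
    if isinstance(path, list) and all(isinstance(p, str) for p in path):
        return list(path)
    raise TypeError("resource.path must be a string or a list of strings")
```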

[Epic] Pushing files and datasets to datahub.io

This EPIC contains all the issues related to the data push command.

Critical

  • push is not working with different separators #59 and #35

Major

  • Push large files #62
  • Push with other encoding than utf-8 #77
  • not able to push with double quotes #66
  • Suggest help and init commands if dp.json does not exist #63

Minor

  • pushing excel files with non-existing sheets #50
  • pushing empty file #60 and #24
  • push dataset by path to dp.json #64
  • Able to push other encoding than utf-8 #11
  • validation is not clear #23
  • support path arrays #40

Trivial

  • push failing with files that have modified extensions #57
  • empty excel file make CLI hang #58
  • not allow pushing remote package #61
  • push speed #71

[Push] Pushing an Excel file with non-existing sheets should fail

Currently, if you try to push an Excel file and specify a non-existing sheet, it does not fail; it pushes data from the first sheet of the Excel file.
Another problem is that even if you select an existing sheet, it pushes the first one, which means we're not able to push any sheet other than the first.

Reproduce

Try to push an Excel file with two sheets named sheet1 and sheet2. Each of the following commands will push the data from the first sheet:

  • run data push myexcel.xlsx --sheets=3
  • run data push myexcel.xlsx --sheets=2
  • run data push myexcel.xlsx --sheets=1,2
  • run data push myexcel.xlsx --sheets=all

Expected behaviour

  • If the sheet does not exist, the process should terminate and tell me about it
  • If the sheet exists, it should push that one

Dependencies

  • issue in data-cli
  • issue in datahub-client etc

Does data need to have a datapackage.json? If yes, infer one.

Description

I'm a first time user of the current CLI. After login, I want to just push something.

mkdir test
cd test
touch test.txt
data push
> Error! ENOENT: no such file or directory, open '/Users/pwalsh/test/datapackage.json'

I'm confused because the CLI tells me the following about data push:

[path] Push data at path to the DataHub

I could be a user who does not know what datapackage.json is.

As someone involved in the development of our Frictionless Data software, I know we have an infer method for Data Packages, so I wonder why we are not using it.

Scheduling the updates for datasets

At the moment, we support scheduling in the following way: every 90s, every 5m, every 2d… number is always an int, selector is s/m/h/d/w (second -> week) and you can’t schedule for less than 60 seconds.

From @zelima :

I see we do not support monthly and annual schedules. Also, I was thinking that some datasets need an annual update: if I want them to be updated every first of January, I should push them on that exact date. Maybe we should support a "starting date" at some point - it would be useful for that case. E.g., for ones that are updated monthly, you need them to be updated exactly on the first of each month.
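The interval grammar described above (integer plus a s/m/h/d/w selector, minimum 60 seconds) is easy to pin down; a sketch of a parser for it (the function is illustrative, not the datahub implementation):

```python
import re

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}

def parse_schedule(text):
    """Turn 'every 90s' / 'every 5m' / 'every 2d' into seconds."""
    m = re.fullmatch(r"every (\d+)([smhdw])", text)
    if not m:
        raise ValueError("bad schedule: %r" % text)
    seconds = int(m.group(1)) * UNITS[m.group(2)]
    if seconds < 60:
        raise ValueError("minimum schedule interval is 60 seconds")
    return seconds
```

Monthly or annual schedules don't fit this grammar, which is exactly the limitation @zelima points out.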

I am still logged in after revoking access on GitHub

Description

Relates to #26

As I can't logout of datahub.io, I tried to force it by revoking access to my GitHub account.

After revoking access, I am still logged in, and worse, I can still push data from the CLI.

There are no browser cookies that I can clear to forcibly flush my session either.

I am not sure if there is a potential security issue here somewhere, or if it just breaks an implicit contract of trust by not letting me sign out.

How do I get started with push-flow?

I have remote data somewhere (a URL) and I want to create a flow.yaml so that it scrapes data from that URL and performs some processing steps (e.g., removes some rows). Although I know this functionality is provided by DataHub, I cannot find any guidelines. It would be very useful if there were a tutorial or a simple set of steps describing what I should do.

Configuration file location

May I suggest we use ~/.config and namespace under there:

~/.config/datahub.io/config.json

This is a convention, and many CLIs I use follow it (Digital Ocean, Heroku).

In general I'm -1 on ~/.datahub.json
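A sketch of the suggested lookup, honouring $XDG_CONFIG_HOME as the convention does (illustrative, not the current CLI behaviour):

```python
import os

def config_path():
    """XDG-style config location, namespaced under the service name."""
    base = os.environ.get("XDG_CONFIG_HOME") or os.path.expanduser("~/.config")
    return os.path.join(base, "datahub.io", "config.json")
```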

Error when using CLI (onboarding UX)

Steps to reproduce

npm install data-cli

data --version

data info
> Error! ENOENT: no such file or directory, open '/Users/pwalsh/datapackage.json'

Context

When following the instructions at https://datahub.io/docs/getting-started/installing-data I consciously decided to run data info instead of data info https://datahub.io/core/finance-vix.

As a user of many CLI tools, I expect to be able to run a command without arguments and get some type of context-driven help, or, at least an error for the missing argument.

The error I do receive is related to a missing configuration step, I suspect, and this is confusing to me, especially because the instructions on this page make no mention of configuration.

I then went to http://datahub.io/docs/features/data-cli which is linked from the above page, and still do not have any idea how to configure the CLI. data help works, but has no info I am looking for.

I then ran data push which tells me to log in. This was successful (according to CLI messaging), but then I ran data info again and got the same original error.

datapackage.json in zip file differs from version on github or pkgstore.datahub.io

As discussed on gitter, the version of datapackage.json included in zip files downloaded from datahub.io lack basic metadata like title, license, description.

See a diff here between https://pkgstore.datahub.io/core/registry/6/datapackage.json (left hand side) and the version included in the zip file downloaded (right hand side): http://www.mergely.com/Xo9fTlfZ/

Update 7 Feb 2018 from @AcckiyGerman

(Sorry for editing your initial message)
The issue is fixed on the assembler side: datopian/assembler#81
But now we need to redeploy all core datasets to put the metadata in place.

tasks

  • redeploy all 'core' datasets (write a bash script to do that quickly)

Observations working with the datahub.io view spec

Currently datahub.io supports two flavours: a simple spec, and Vega (v.2.x).

The simple spec is intended to cover the 80/20 use case, but with real-world data I've found it is not that useful in most cases. The data almost always needs some form of aggregation, transformation, or graph-specific tweaks like human-readable labels in place of data keys.

On the other hand, Vega is very powerful, but the spec is harder to write by hand. There are few tools available to help iterate while developing a spec document and resulting visualisation; iterating by making changes and pushing to datahub.io isn't efficient.

Middle ground support for Vega-lite would be very desirable to help bridge the gap between the too-simple 'simple' spec, and the much more powerful, but complex Vega spec.

The problem of working with Vega is further compounded by datahub.io using v2.x, rather than the more recent v3. Online documentation and tools are centered around the newer version. For example, there is an online editor provided by Vega, that will take Vega-Lite, and 'compile' it to Vega, but this transpilation isn't supported by the Vega 2 compatible version of the editor (https://vega.github.io/vega-editor/).

Horizontal bar charts

At the moment, for the "simple" view spec we support line and vertical bar charts. Although these two types are popular and cover most use cases, it would be useful to have horizontal bar charts. E.g., consider this example: https://datahub.io/core/gini-index. In the second graph, we have too many countries on the x axis, so it is not possible to show names for all of them given the limited width. In such situations, horizontal bar charts could be useful, as we're not limited in height.
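For what it's worth, in Vega-Lite a horizontal bar chart is simply a vertical one with the x and y encodings swapped; an illustrative spec (the data URL and field names are made up):

```json
{
  "$schema": "https://vega.github.io/schema/vega-lite/v2.json",
  "data": {"url": "https://example.com/gini.csv"},
  "mark": "bar",
  "encoding": {
    "y": {"field": "country", "type": "nominal"},
    "x": {"field": "gini_index", "type": "quantitative"}
  }
}
```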

Multiple accounts in config

Description

The current config file has a data structure designed around a single account on datahub. I have just started interacting with the system as a user and I already want two accounts (a pseudo org account, and my own account).

Many config files for CLIs support this elegantly. I suggest taking a good look at the gcloud CLI and the aws CLI, in terms of both the user experience when running commands as different users and the config file itself.
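A sketch of a profile-aware config lookup, gcloud/aws style (the config structure here is hypothetical, not the current datahub format):

```python
def active_account(config):
    """Resolve the active profile from a multi-account config dict."""
    name = config.get("active", "default")
    try:
        return config["profiles"][name]
    except KeyError:
        raise KeyError("no profile named %r in config" % name)
```

Switching accounts then becomes a matter of flipping the `active` key rather than rewriting credentials.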

[Epic] Installation and download

All issues about 'data-cli' installation should be listed here.

As a user, I want to see clear instructions so I can install and run 'data-cli' on different OSes.

Critical

  • ...

Major

  • ubuntu 16LTS problem #46
  • npm/yarn installable #54

Minor

  • installation size #53
  • UX problems #20

Trivial

  • ...
