iamaziz / pydataset



Instant access to many datasets in Python.

License: MIT License

Python 100.00%

python datasets data-science

pydataset's Issues

Importing Pydataset

Hello

I am trying to use Pydataset and I am having a strange error.

I am using Windows 10 with Python 3.6. I have already updated pip, and I can list all the datasets, but I cannot load any of them.

Here is a screenshot:

As you can see, it says "Not valid dataset name and no similar found", but I have tried many different names, copying and pasting them. In this case, exceptionally, I used cmd, but mostly I use IDLE or PyCharm.

Mostly I use Windows 10, but it also occurs on Linux Mint in a virtual machine.

Process for adding datasets?

In the README, there's interest in expanding the number of datasets. I'm wondering what kind of criteria new data would have to meet. Just off the top of my head:

Would it need to be useful prima facie, or would niche data also be acceptable? The kind of thing I'm considering (not seriously for inclusion, just in general) is that I'm working on scraping info about episodes of Detective Conan, such as what characters appeared in them. Would that be too niche?
Would it have to pass some vote for inclusion? If so, who gets a vote?
All the current data is csv. Would other kinds of data formats be able to be included later? Like HDF5?

Translating R to Python. Worth the effort?

The starter datasets came from R's samples, so their html documentation includes R examples on how to use the data. Would it be considered worthwhile to translate the usage information to Python3?

Unable to load datasets (Python 3.5.1 under Anaconda, Win 7)

Hi,

I'm unable to load datasets in Python 3.5.1, Win 7. I can install pydataset, import it, and view available datasets just fine. However, when I try to load datasets, I get an error saying that I have the wrong name for the dataset. For example:

In [1]: iris= data('iris')
Traceback (most recent call last):

  File "<ipython-input-3-f894fb655dca>", line 1, in <module>
    cake = data("cake", show_doc=True)

  File "C:\Users\ctaylor\AppData\Local\Continuum\Anaconda3\lib\site-packages\pydataset\__init__.py", line 36, in data
    raise Exception('Wrong dataset name! Try: data() to see available.')

Exception: Wrong dataset name! Try: data() to see available.

Display options set

Pydataset sets display options such as display.max_rows = 170 and does not restore them afterwards. In my opinion, these should be set inside an option_context context manager. A module should not permanently modify the user's environment (the change persists until the interactive interpreter is restarted).
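A minimal sketch of the suggestion, using pandas' own option_context. The value 170 is taken from the report above; how pydataset would wire this into its display code is not shown here and is left as an assumption.

```python
import pandas as pd

# Remember the user's current setting so we can compare afterwards.
before = pd.get_option("display.max_rows")

# option_context applies the option only inside the with-block and
# restores the previous value on exit, even if an exception is raised.
with pd.option_context("display.max_rows", 170):
    inside = pd.get_option("display.max_rows")

after = pd.get_option("display.max_rows")
print(before, inside, after)
```

With this pattern the user's display.max_rows is the same before and after the call, so nothing leaks into the interactive session.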

Provide namespaces and an index

This is a fantastic idea, kudos.

With the growing number of datasets your tool will support, you will quickly run out of names, and searching for a particular dataset will be hard.

I'd recommend:

to require each dataset to have a namespace, e.g. by source: "tld.domain.titanic" or by taxonomy: "history.titanic.victims.#timestamp#"
to publish a web page with an index of all data sets with their namespace and content.
to define a procedure to add a dataset to the repo, or by plugins.
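The namespace idea above can be sketched as a plain dictionary index. Everything here is hypothetical: the dotted names, the descriptions, and the helper names search and resolve are illustrative only and are not part of pydataset.

```python
# Hypothetical index: dotted namespaced names mapped to short descriptions.
INDEX = {
    "org.example.titanic": "illustrative entry from one source",
    "net.example.titanic": "illustrative entry from another source",
    "org.example.iris": "illustrative entry",
    "history.titanic.victims": "illustrative taxonomy-style entry",
}


def search(index, term):
    """Return sorted names that have `term` as one of their dotted components."""
    term = term.lower()
    return sorted(name for name in index if term in name.lower().split("."))


def resolve(index, name):
    """Accept a fully qualified name, or a bare name if it is unambiguous."""
    if name in index:
        return name
    matches = sorted(n for n in index if n.split(".")[-1] == name)
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise KeyError(f"no dataset named {name!r}")
    raise KeyError(f"{name!r} is ambiguous, candidates: {matches}")


print(search(INDEX, "titanic"))
print(resolve(INDEX, "iris"))
```

A bare name such as "titanic" is rejected here because two sources provide it, which is the collision the namespaces are meant to avoid.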

Merge code? DataPackage / datasets ...

Hello,

A lot of datasets are also available at https://github.com/datasets
They are called DataPackage.

They can be used from Python via https://github.com/datapackages/datapackage-py (work in progress) or https://github.com/trickvi/datapackage

Pinging @vitorbaptista @pwalsh @trickvi @rgrp

There is some overlap between these projects, so merging might be worth considering.

At the very least, we should all be aware that the other projects exist.

Kind regards

PS: Datasets are also available at https://github.com/vincentarelbundock/Rdatasets

Fix simple typo: smiliarity -> similarity

Issue Type

[x] Bug (Typo)

Steps to Replicate

Examine pydataset/support.py.
Search for smiliarity.

Expected Behaviour

Should read similarity.

Semi-automated issue generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md

To avoid wasting CI processing resources a branch with the fix has been
prepared but a pull request has not yet been created. A pull request fixing
the issue can be prepared from the link below, feel free to create it or
request @timgates42 create the PR.

https://github.com/timgates42/PyDataset/pull/new/bugfix_typo_similarity

Thanks.

Break same name with R

I always thought having the name "data" as a function globally is one of the weirdest things in R.
Perhaps consider changing it into load_data (comparable to load_xxxx in sklearn).
Then people can use data (however vague that term is anyway) in their scripts freely.

Regression/Classification info

Hi,

It would be nice to have a 3rd column for data() output indicating whether the dataset can be used for regression or classification problems.

Getting error in windows 10 when installing with pip3

error: Microsoft Visual C++ 10.0 is required. Get it with "Microsoft Windows SDK 7.1": www.microsoft.com/download/details.aspx?id=8279

Allow usage of pydataset with no external dependencies

PyDataset is a fantastic tool for learning Python. But requiring pandas (and hence numpy) is a big barrier to entry. What's more, you may want to load the data with another tool in order to process it.

To make your lib more flexible and more newcomer-friendly, I'd advise:

to create a toolbox that lets you define the data index and load it in a generic way. It should not rely on a particular technology for downloading or for the result format, and it should provide hooks to plug in your own;
then build adapters for your downloader and pandas;
then build an adapter for regular Python data structures.
It should default to pandas if it's installed, or to regular Python lists/dicts if it's not.

This will allow:

beginners to use it without needing to learn or install pandas;
external tools to embed it and adapt it easily;
easy adaptation for use with other data processing tools;
easy adaptation for use with other ways of downloading data (gevent, asyncio, threadpool, etc.).
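A rough sketch of the adapter idea, assuming the data is available as CSV text. The function names load_rows and load, and the prefer_pandas flag, are hypothetical and not part of pydataset; downloading is left out entirely.

```python
import csv
import io


def load_rows(text):
    """Plain-Python adapter: parse CSV text into a list of dicts (stdlib only)."""
    return list(csv.DictReader(io.StringIO(text)))


def load(text, prefer_pandas=True):
    """Return a pandas DataFrame if pandas is importable, else a list of dicts."""
    if prefer_pandas:
        try:
            import pandas as pd  # optional dependency
        except ImportError:
            pass  # fall through to the plain-Python adapter
        else:
            return pd.read_csv(io.StringIO(text))
    return load_rows(text)


SAMPLE = "name,value\na,1\nb,2\n"
rows = load(SAMPLE, prefer_pandas=False)
print(rows)
```

Note that the plain-Python adapter returns every value as a string; type conversion would be the caller's job or a further hook.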

Please make datasets non-executable

When initiating the datasets repo, all files have permissions 0755 (-rwxr-xr-x) when in fact they are not executable. Please make the initialization install datasets as 0644 (-rw-r--r--).
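As a workaround sketch on POSIX systems, the permissions can be reset after initialization. The function name make_non_executable is hypothetical, and the location of the datasets directory is not assumed here; pass whatever path the datasets were installed to.

```python
import os
import stat


def make_non_executable(root, mode=0o644):
    """Set every regular file under `root` to `mode` (rw-r--r-- by default).

    Directories are left alone, since they need the execute bit to be entered.
    Returns the list of paths whose mode was changed.
    """
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if stat.S_IMODE(os.stat(path).st_mode) != mode:
                os.chmod(path, mode)
                changed.append(path)
    return changed
```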

get_rdatasets in statsmodels

Just wanted to point you to some similar functionality we have in statsmodels that just pulls from the Rdatasets repo.

https://github.com/statsmodels/statsmodels/blob/master/statsmodels/datasets/utils.py#L246

Distinct dataset documentation

The documentation shown for the 'housing' dataset doesn't match the rows and columns that are actually imported.

How to reproduce:

>>> from pydataset import data
>>> df = data('housing')
>>> df

       id    y  time  sec
1       1  1.0     0    1
2       1  2.0     6    1
3       1  2.0    12    1
4       1  2.0    24    1
5       2  1.0     0    1
...   ...  ...   ...  ...
1444  361  NaN    24    0
1445  362  1.0     0    0
1446  362  1.0     6    0
1447  362  1.0    12    0
1448  362  1.0    24    0

[1448 rows x 4 columns]

>>> data('housing', show_doc='True')

housing

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

Frequency Table from a Copenhagen Housing Conditions Survey

Description

The housing data frame has 72 rows and 5 variables.

Usage

housing

Format

Sat

Satisfaction of householders with their present housing circumstances, (High,
Medium or Low, ordered factor).

Infl

Perceived degree of influence householders have on the management of the
property (High, Medium, Low).

Type

Type of rental accommodation, (Tower, Atrium, Apartment, Terrace).

Cont

Contact residents are afforded with other residents, (Low, High).

Freq

Frequencies: the numbers of residents in each class.

Source

Madsen, M. (1976) Statistical analysis of multiple contingency tables. Two
examples. Scand. J. Statist. 3, 97–106.

Cox, D. R. and Snell, E. J. (1984) Applied Statistics, Principles and
Examples. Chapman & Hall.

References

Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S.
Fourth edition. Springer.

Examples

options(contrasts = c("contr.treatment", "contr.poly"))
# Surrogate Poisson models
house.glm0 <- glm(Freq ~ Infl*Type*Cont + Sat, family = poisson,
                  data = housing)
summary(house.glm0, cor = FALSE)
addterm(house.glm0, ~. + Sat:(Infl+Type+Cont), test = "Chisq")
house.glm1 <- update(house.glm0, . ~ . + Sat*(Infl+Type+Cont))
summary(house.glm1, cor = FALSE)
1 - pchisq(deviance(house.glm1), house.glm1$df.residual)
dropterm(house.glm1, test = "Chisq")
addterm(house.glm1, ~. + Sat:(Infl+Type+Cont)^2, test  =  "Chisq")
hnames <- lapply(housing[, -5], levels) # omit Freq
newData <- expand.grid(hnames)
newData$Sat <- ordered(newData$Sat)
house.pm <- predict(house.glm1, newData,
                    type = "response")  # poisson means
house.pm <- matrix(house.pm, ncol = 3, byrow = TRUE,
                   dimnames = list(NULL, hnames[[1]]))
house.pr <- house.pm/drop(house.pm %*% rep(1, 3))
cbind(expand.grid(hnames[-1]), round(house.pr, 2))
# Iterative proportional scaling
loglm(Freq ~ Infl*Type*Cont + Sat*(Infl+Type+Cont), data = housing)
# multinomial model
library(nnet)
(house.mult<- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
                       data = housing))
house.mult2 <- multinom(Sat ~ Infl*Type*Cont, weights = Freq,
                        data = housing)
anova(house.mult, house.mult2)
house.pm <- predict(house.mult, expand.grid(hnames[-1]), type = "probs")
cbind(expand.grid(hnames[-1]), round(house.pm, 2))
# proportional odds model
house.cpr <- apply(house.pr, 1, cumsum)
logit <- function(x) log(x/(1-x))
house.ld <- logit(house.cpr[2, ]) - logit(house.cpr[1, ])
(ratio <- sort(drop(house.ld)))
mean(ratio)
(house.plr <- polr(Sat ~ Infl + Type + Cont,
                   data = housing, weights = Freq))
house.pr1 <- predict(house.plr, expand.grid(hnames[-1]), type = "probs")
cbind(expand.grid(hnames[-1]), round(house.pr1, 2))
Fr <- matrix(housing$Freq, ncol  =  3, byrow = TRUE)
2*sum(Fr*log(house.pr/house.pr1))
house.plr2 <- stepAIC(house.plr, ~.^2)
house.plr2$anova

I can't find any documentation for the dataset that is actually imported. I suggest updating the documentation so that it describes the correct one.
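A mismatch like this could be caught mechanically. The sketch below only parses the "has N rows and M variables" sentence quoted in the documentation above and compares it with a shape tuple; the helper names are hypothetical, and it does not call pydataset itself.

```python
import re


def documented_shape(doc_text):
    """Extract (rows, columns) from a sentence like 'has 72 rows and 5 variables'."""
    match = re.search(r"has (\d+) rows and (\d+) variables", doc_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def shape_matches(doc_text, actual_shape):
    """True or False when the doc states a shape, None when it does not."""
    expected = documented_shape(doc_text)
    if expected is None:
        return None
    return expected == tuple(actual_shape)


doc = "The housing data frame has 72 rows and 5 variables."
print(shape_matches(doc, (1448, 4)))  # shape reported for data('housing') above
```

For the values in this report the check prints False, since the documentation describes a 72 x 5 frame and the loaded frame is 1448 x 4.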