iamaziz / pydataset
Instant access to many datasets in Python.
License: MIT License
Hello
I am trying to use Pydataset and I am having a strange error.
I am using Windows 10 with Python 3.6. I have already updated pip. I can list all the datasets, but I cannot use any of them.
As you can see, it says "Not valid dataset name and no similar found", even though I have tried many different names, copying and pasting them. In this case I exceptionally used cmd, but mostly I use IDLE or PyCharm.
I mostly use Windows 10, but it also occurs on Linux Mint in a virtual machine.
In the README, there's interest in expanding the number of datasets. I'm wondering what kind of criteria new data would have to meet. Just off the top of my head:
The starter datasets came from R's samples, so their HTML documentation includes R examples on how to use the data. Would it be considered worthwhile to translate the usage information to Python 3?
Hi,
I'm unable to load datasets in Python 3.5.1, Win 7. I can install pydataset, import it, and view available datasets just fine. However, when I try to load datasets, I get an error saying that I have the wrong name for the dataset. For example:
In [1]: iris= data('iris')
Traceback (most recent call last):
File "<ipython-input-3-f894fb655dca>", line 1, in <module>
cake = data("cake", show_doc=True)
File "C:\Users\ctaylor\AppData\Local\Continuum\Anaconda3\lib\site-packages\pydataset\__init__.py", line 36, in data
raise Exception('Wrong dataset name! Try: data() to see available.')
Exception: Wrong dataset name! Try: data() to see available.
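One way to narrow down a "wrong dataset name" error like the one above is to compare the requested name against the list of available names directly. The sketch below is not pydataset's own lookup code; `resolve_name` and the sample name list are hypothetical, and it only uses the standard library's `difflib` to suggest close matches.

```python
import difflib


def resolve_name(requested, available, cutoff=0.6):
    """Return (exact_match, suggestions) for a requested dataset name.

    Hypothetical helper: strips stray whitespace, looks for an exact match,
    and otherwise proposes up to three similar names.
    """
    cleaned = requested.strip()
    if cleaned in available:
        return cleaned, []
    return None, difflib.get_close_matches(cleaned, available, n=3, cutoff=cutoff)


# Stand-in for the names that data() would list (assumption for illustration).
names = ["iris", "cake", "housing", "mtcars"]

exact, _ = resolve_name(" iris ", names)        # whitespace is ignored
missing, suggestions = resolve_name("irsi", names)  # misspelled name
```

If an exact match is found here but loading still fails, the problem is more likely in how the package reads its dataset index than in the name that was typed.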
Pydataset sets display options like display.max_rows = 170 without restoring them after whatever it does. In my opinion, these should be set in an option_context context manager. A module should not modify the user's environment permanently (until restart of the interactive interpreter).
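A minimal sketch of the suggestion, assuming pandas is installed: `pandas.option_context` applies an option only inside the `with` block and restores the previous value on exit. The helper name `show_all_rows` is hypothetical and is not part of pydataset.

```python
import pandas as pd


def show_all_rows(df, max_rows=170):
    """Render df with a temporarily raised row limit (hypothetical helper)."""
    # The option is changed only inside this block and restored afterwards.
    with pd.option_context("display.max_rows", max_rows):
        return repr(df)


before = pd.get_option("display.max_rows")
frame = pd.DataFrame({"x": range(100)})
text = show_all_rows(frame)
after = pd.get_option("display.max_rows")
```

After the call, the user's own `display.max_rows` setting is the same as before it.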
This is a fantastic idea, kudos.
With the growing number of datasets your tool will support, you will quickly run out of names, and searching for a particular dataset will be hard.
I'd recommend:
Hello,
A lot of datasets are also available at https://github.com/datasets
They are called DataPackage.
They are available using Python and
https://github.com/datapackages/datapackage-py (Work In Progress) or https://github.com/trickvi/datapackage
Pinging @vitorbaptista @pwalsh @trickvi @rgrp
There is some overlap between these projects so maybe merging might be considered.
At least we should all be aware of the existence of the other projects.
Kind regards
PS : Datasets are also available at https://github.com/vincentarelbundock/Rdatasets
[x] Bug (Typo)
"smiliarity" should read "similarity".
Semi-automated issue generated by
https://github.com/timgates42/meticulous/blob/master/docs/NOTE.md
To avoid wasting CI processing resources a branch with the fix has been
prepared but a pull request has not yet been created. A pull request fixing
the issue can be prepared from the link below, feel free to create it or
request @timgates42 create the PR.
https://github.com/timgates42/PyDataset/pull/new/bugfix_typo_similarity
Thanks.
I always thought having the name "data" as a function globally is one of the weirdest things in R.
Perhaps consider changing it into load_data (comparable to load_xxxx in sklearn).
Then people can use data (however vague that term is anyway) in their scripts freely.
Hi,
It would be nice to have a 3rd column for data() output indicating whether the dataset can be used for regression or classification problems.
error: Microsoft Visual C++ 10.0 is required. Get it with "Microsoft Windows SDK 7.1": www.microsoft.com/download/details.aspx?id=8279
PyDataset is a fantastic tool to learn Python. But requiring pandas (and hence numpy) is a big barrier to entry. What's more, you may want to be able to load the data using another tool to process it.
To make your lib more flexible and more newcomer-friendly, I'd advise:
This will allow:
When initiating the datasets repo, all files have permissions 0755 (-rwxr-xr-x) when in fact they are not executable. Please make the initialization install datasets as 0644 (-rw-r--r--).
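As a rough illustration of the requested behavior, the sketch below walks a directory and sets every regular file to 0644. It assumes a POSIX file system; `normalize_permissions` and the temporary demo directory are hypothetical and do not reflect pydataset's actual initialization code.

```python
import os
import stat
import tempfile


def normalize_permissions(root, file_mode=0o644):
    """Set every regular file under root to file_mode; return changed paths."""
    changed = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # leave symlinks alone
            current = stat.S_IMODE(os.stat(path).st_mode)
            if current != file_mode:
                os.chmod(path, file_mode)
                changed.append(path)
    return changed


# Demo on a throwaway directory with one file wrongly marked executable.
with tempfile.TemporaryDirectory() as demo_root:
    sample = os.path.join(demo_root, "iris.csv")
    with open(sample, "w") as fh:
        fh.write("a,b\n1,2\n")
    os.chmod(sample, 0o755)
    changed_paths = normalize_permissions(demo_root)
    final_mode = stat.S_IMODE(os.stat(sample).st_mode)
```

Directories are left untouched, since they need the execute bit to be traversable.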
Just wanted to point you to some similar functionality we have in statsmodels that just pulls from the Rdatasets repo.
https://github.com/statsmodels/statsmodels/blob/master/statsmodels/datasets/utils.py#L246
The documentation shown for the 'housing' dataset doesn't match the actual rows and columns imported.
How to reproduce:
>>> from pydataset import data
>>> df = data('housing')
>>> df
id y time sec
1 1 1.0 0 1
2 1 2.0 6 1
3 1 2.0 12 1
4 1 2.0 24 1
5 2 1.0 0 1
... ... ... ... ...
1444 361 NaN 24 0
1445 362 1.0 0 0
1446 362 1.0 6 0
1447 362 1.0 12 0
1448 362 1.0 24 0
[1448 rows x 4 columns]
>>> data('housing', show_doc=True)
housing
PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)
The housing data frame has 72 rows and 5 variables.
housing
Sat
Satisfaction of householders with their present housing circumstances, (High,
Medium or Low, ordered factor).
Infl
Perceived degree of influence householders have on the management of the
property (High, Medium, Low).
Type
Type of rental accommodation, (Tower, Atrium, Apartment, Terrace).
Cont
Contact residents are afforded with other residents, (Low, High).
Freq
Frequencies: the numbers of residents in each class.
Madsen, M. (1976) Statistical analysis of multiple contingency tables. Two
examples. Scand. J. Statist. 3, 97–106.
Cox, D. R. and Snell, E. J. (1984) Applied Statistics, Principles and
Examples. Chapman & Hall.
Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S.
Fourth edition. Springer.
options(contrasts = c("contr.treatment", "contr.poly"))
# Surrogate Poisson models
house.glm0 <- glm(Freq ~ Infl*Type*Cont + Sat, family = poisson,
data = housing)
summary(house.glm0, cor = FALSE)
addterm(house.glm0, ~. + Sat:(Infl+Type+Cont), test = "Chisq")
house.glm1 <- update(house.glm0, . ~ . + Sat*(Infl+Type+Cont))
summary(house.glm1, cor = FALSE)
1 - pchisq(deviance(house.glm1), house.glm1$df.residual)
dropterm(house.glm1, test = "Chisq")
addterm(house.glm1, ~. + Sat:(Infl+Type+Cont)^2, test = "Chisq")
hnames <- lapply(housing[, -5], levels) # omit Freq
newData <- expand.grid(hnames)
newData$Sat <- ordered(newData$Sat)
house.pm <- predict(house.glm1, newData,
type = "response") # poisson means
house.pm <- matrix(house.pm, ncol = 3, byrow = TRUE,
dimnames = list(NULL, hnames[[1]]))
house.pr <- house.pm/drop(house.pm %*% rep(1, 3))
cbind(expand.grid(hnames[-1]), round(house.pr, 2))
# Iterative proportional scaling
loglm(Freq ~ Infl*Type*Cont + Sat*(Infl+Type+Cont), data = housing)
# multinomial model
library(nnet)
(house.mult<- multinom(Sat ~ Infl + Type + Cont, weights = Freq,
data = housing))
house.mult2 <- multinom(Sat ~ Infl*Type*Cont, weights = Freq,
data = housing)
anova(house.mult, house.mult2)
house.pm <- predict(house.mult, expand.grid(hnames[-1]), type = "probs")
cbind(expand.grid(hnames[-1]), round(house.pm, 2))
# proportional odds model
house.cpr <- apply(house.pr, 1, cumsum)
logit <- function(x) log(x/(1-x))
house.ld <- logit(house.cpr[2, ]) - logit(house.cpr[1, ])
(ratio <- sort(drop(house.ld)))
mean(ratio)
(house.plr <- polr(Sat ~ Infl + Type + Cont,
data = housing, weights = Freq))
house.pr1 <- predict(house.plr, expand.grid(hnames[-1]), type = "probs")
cbind(expand.grid(hnames[-1]), round(house.pr1, 2))
Fr <- matrix(housing$Freq, ncol = 3, byrow = TRUE)
2*sum(Fr*log(house.pr/house.pr1))
house.plr2 <- stepAIC(house.plr, ~.^2)
house.plr2$anova
I can't find out what the dataset that is actually imported represents. I suggest updating the documentation to describe the correct dataset.
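A mismatch like this can be caught mechanically by comparing the size stated in the documentation text with the shape of the loaded frame. The sketch below uses only the figures quoted in this report; `documented_shape` is a hypothetical helper, and the regular expression assumes the "has N rows and M variables" wording seen above.

```python
import re


def documented_shape(doc_text):
    """Extract (rows, columns) from an R-style documentation sentence."""
    match = re.search(r"has\s+(\d+)\s+rows\s+and\s+(\d+)\s+variables", doc_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Values taken from the report: the doc text and the printed frame size.
doc_excerpt = "The housing data frame has 72 rows and 5 variables."
loaded_shape = (1448, 4)

doc_shape = documented_shape(doc_excerpt)
mismatch = doc_shape != loaded_shape
```

With a real frame, `loaded_shape` would come from `df.shape`.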