rasbt / python-machine-learning-book
The "Python Machine Learning (1st edition)" book code repository and info resource
License: MIT License
When I try to read back the classifier on page 254, I get the following error. I have followed the book the whole way and things have worked fine until now. Any idea what has gone wrong?
I'm using IPython 4.2.0.
AttributeError Traceback (most recent call last)
<ipython-input-4-f050da95a5cf> in <module>()
----> 1 import codecs, os;__pyfile = codecs.open('''/var/folders/yh/mm1bdmx9073_b15lw69b2qmh0000gn/T/py71220g7y''', encoding='''utf-8''');__code = __pyfile.read().encode('''utf-8''');__pyfile.close();os.remove('''/var/folders/yh/mm1bdmx9073_b15lw69b2qmh0000gn/T/py71220g7y''');exec(compile(__code, '''/Users/henke/Documents/code/python/python-ml/movieclassifier/main.py''', 'exec'));
/Users/henke/Documents/code/python/python-ml/movieclassifier/main.py in <module>()
4 from vectorizer import vect
5
----> 6 clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))
7
8 import numpy as np
AttributeError: Can't get attribute 'tokenizer' on <module '__main__'>
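A likely cause (a hedged note, not an official fix): pickle stores custom functions by name and module path only, so the unpickling session must be able to resolve __main__.tokenizer before the classifier is loaded. Binding the function in the running script first usually resolves it; a minimal sketch, assuming the book's vectorizer.py defines tokenizer:

import os
import pickle

# The pickled classifier references the tokenizer function by name,
# so bind it in this module's namespace before unpickling.
from vectorizer import tokenizer

clf = pickle.load(open(os.path.join('pkl_objects', 'classifier.pkl'), 'rb'))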
In chapter 6, the Breast Cancer Wisconsin dataset is not available now.
Maybe it is a broken link.
currently
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
should be
df = pd.read_csv('http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data', header=None)
I'm sorry if I'm wrong.
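A defensive sketch (my suggestion, not the book's code): try the mirrors in order, so the notebook keeps working if one host is down. Both URLs are taken from this issue.

import pandas as pd

# Candidate mirrors for the Breast Cancer Wisconsin (Diagnostic) dataset;
# fall through to the next URL if one is unreachable.
urls = [
    'https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data',
    'http://mlr.cs.umass.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.data',
]
df = None
for url in urls:
    try:
        df = pd.read_csv(url, header=None)
        break
    except Exception:
        continue  # host unavailable; try the next mirror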
Hi,
The link to the live example application (http://raschkas.pythonanywhere.com/) is not working.
There's a "Coming soon" message, as if the page did not exist.
I am trying to run gs_lr_tfidf.fit(X_train, y_train) and I get an AttributeError.
Running in a Jupyter notebook, Python 3.5.
https://github.com/stevekwon211/Hello-Kaggle
It is a Kaggle guide document for anyone who is new to Kaggle!
Opening the first chapter file ch01.ipynb results in the following error:
"Unreadable Notebook: /home/antonio/libro-machine-learning/ch01.ipynb NotJSONError("Notebook does not appear to be JSON: '\n\n\n\n\n\n\n<html lang...")"
Python version: 3.7 from the Anaconda distribution.
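The NotJSONError and the '<html lang...' prefix suggest the saved file is the rendered GitHub HTML page rather than the notebook itself (my reading of the error, not a confirmed diagnosis). Downloading the raw file usually fixes it; a minimal sketch, assuming the notebook still lives at this path on master:

import urllib.request

# Fetch the raw JSON notebook, not the rendered GitHub page.
url = ('https://raw.githubusercontent.com/rasbt/'
       'python-machine-learning-book/master/code/ch01/ch01.ipynb')
urllib.request.urlretrieve(url, 'ch01.ipynb')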
In this file, the code loads the names as
labels_path = os.path.join(path,
                           '%s-labels-idx1-ubyte' % kind)
images_path = os.path.join(path,
                           '%s-images-idx3-ubyte' % kind)
However, the linked .gz files have names with a period, not a hyphen. It should be
labels_path = os.path.join(path,
                           '%s-labels.idx1-ubyte' % kind)
images_path = os.path.join(path,
                           '%s-images.idx3-ubyte' % kind)
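A tolerant sketch (my workaround, not the book's code) that checks for both naming conventions, since different mirrors of the MNIST files have shipped with either separator:

import os

def resolve_mnist_path(path, kind, stem):
    # stem is e.g. 'labels-idx1-ubyte' or 'images-idx3-ubyte';
    # try the hyphen variant first, then the period variant.
    for name in ('%s-%s' % (kind, stem),
                 '%s-%s' % (kind, stem.replace('-idx', '.idx'))):
        candidate = os.path.join(path, name)
        if os.path.exists(candidate):
            return candidate
    raise FileNotFoundError('no %s file for kind=%r in %r' % (stem, kind, path))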
Hi,
I am trying to run the following code from the book (Chapter 13) in a Jupyter notebook with everything updated. However, every time, Python crashes and the kernel restarts. Everything is fine up to this point. Any thoughts?
P.S. Using 32-bit and GPU; tried dmatrix, no luck.
import numpy as np
import theano
import theano.tensor as T

x = T.dmatrix(name='x')
x_sum = T.sum(x, axis=0)
calc_sum = theano.function(inputs=[x], outputs=x_sum)
ary = [[1, 2, 3], [1, 2, 3]]
print('column sum:', calc_sum(ary))
Just bought this book and I can't find the source code for the examples. I bought it on Amazon and went to the Packtpub page as suggested in the book, but even the zip I downloaded from them is only a mirror of this repository: just images, no code for the examples in the book. It's really annoying to have to type every single example by hand.
In chapter 2 you have some code for a simple perceptron model.
On page 27, you describe the code: the net_input method "simply calculates the vector product wᵀx".
However, there is more than a simple vector product in the code:
def net_input(self, X):
    """Calculate net input"""
    return np.dot(X, self.w_[1:]) + self.w_[0]
In addition to the dot product, there is an addition. The text does not mention anything about what this + self.w_[0] term is.
Can you (or anyone) explain why it's there?
thanks,
-trevor
Hi, I am extremely new to Python, though I understand how to write basic commands.
I got the code files for the book, but I am not able to understand how to use them for learning.
All of them seem to be in text format.
How can I use them as code, making a new file that contains just the code instead of all the text?
I just wanted to see how the code runs, but I can't understand what this code is or how to extract the parts I want without having to remove all the quotation marks, \n characters, and other formatting elements.
Thanks.
Hi,
I had an issue installing Keras on a Windows 10 64-bit machine: the steps described in ch13 did not work for me. I have posted a step-by-step solution in this blog post:
install keras on windows 10 x64 bit machine
@rasbt: feel free to add it to the notes of the labs in GitHub.
Thanks.
Wonderful book; learning a ton! Question: in the first chapter, you explain the three types of learning (supervised, unsupervised, and reinforcement). Usually the third is not covered, so I searched your text for other material on RL but found none. A future chapter in the next edition? A future book? Among your other resources, are there links about RL in a scikit-learn style? I love Karpathy's blog post "Pong from Pixels".
Hello! Thank you for this amazing gift to everyone!
My issue is with Chapter 9's movie_classifier_with_update via python app.py.
I am able to enter my sample review and get the predicted class label and probability. The issue arises when I click "Correct"/"Incorrect" for the classification.
It is almost assuredly due to the issue of versions of Python (3.5 needed) and Sklearn (0.19 needed) as indicated here: https://www.pythonanywhere.com/forums/topic/11716/
It'd be nice to keep this current though and I will send a PR if I ever figure out how to update it for Python 3.6 and Sklearn 0.20!
I wanted to run your code that compares TensorFlow with scikit-learn, but it no longer works.
https://github.com/rasbt/python-machine-learning-book/blob/master/faq/tensorflow-vs-scikitlearn.md
In addition, your mlxtend package no longer has tf_classifier and consequently no TfSoftMaxRegression.
Would you have an updated resource by any chance?
In the "General Questions" section of the FAQ, under "How do Data Scientists perform model selection? Is it different from Kaggle?", the web link is broken.
Thank you for the beautiful book.
Via the sample size n of the bootstrap sample, we control the bias-variance tradeoff of the random forest. By choosing a larger value for n, we decrease the randomness and thus the forest is more likely to overfit. On the other hand, we can reduce the degree of overfitting by choosing smaller values for n at the expense of the model performance.
To me this implies that I should choose sample size n, that is smaller than N (original training set size).
In most implementations, including the RandomForestClassifier implementation in scikit-learn, the sample size of the bootstrap sample is chosen to be equal to the number of samples in the original training set, which usually provides a good bias-variance tradeoff
But the above got me confused: if we choose n = N, aren't we overfitting unless the algorithm is bootstrapping aggressively, repeating values many times over?
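A quick numeric check (my illustration, not from the book): sampling n = N with replacement still leaves roughly 36.8% of the original training set out of each bootstrap sample, which is where the randomness comes from even when n = N.

import numpy as np

rng = np.random.RandomState(0)
N = 10000
# Draw a bootstrap sample of size n = N with replacement.
sample = rng.choice(N, size=N, replace=True)
unique_frac = np.unique(sample).size / N
print('unique fraction: %.3f' % unique_frac)   # ~0.632, matching 1 - 1/e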
I'm new to the site; if you could send me in the right direction, it would be greatly appreciated.
Hi,
I was trying out one of the examples in Chapter 2, under the title "Implementing an adaptive linear neuron in Python" (link to notebook).
The problem is that when I plot the decision boundaries, the whole area is shown red.
When I change output = self.activation(X) to output = self.predict(X) inside the fit function, the problem seems to go away.
Is there an issue with the code, or is the code correct and I made some other mistake while implementing?
Thanks
Sohaib
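For reference (a sketch of my understanding, not an official answer): in the book's AdalineGD the activation is the identity function, so fit is meant to work on continuous outputs; replacing it with predict (which thresholds to -1/1) changes the learning rule into a perceptron-style update. The relevant methods, as I read the Chapter 2 code (assumes numpy imported as np and the rest of the class as in the book):

def activation(self, X):
    """Compute linear activation: the identity of the net input."""
    return self.net_input(X)

def predict(self, X):
    """Return class label after thresholding at 0.0."""
    return np.where(self.activation(X) >= 0.0, 1, -1)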
Just a note in case it's helpful to anyone else - I seemed to be getting 100% accuracy with the on-line sentiment analysis classifier (pages 246-246), but it turned out to be because the code used to shuffle the dataset before exporting it to CSV on page 235 hadn't worked.
In the version of pandas I'm using (0.23.4), it looks like df.index.values is needed in order to get the indexes of a DataFrame as a list. So, this:
df = df.reindex(np.random.permutation(df.index))
now needs to be this:
df = df.reindex(np.random.permutation(df.index.values))
Hope that helps someone!
Regarding your remark:
[2015-10-20] Good news! I just heard back from the publisher; all the typos and errors which are listed below will be fixed by next week.
I bought the ebook yesterday (O'Reilly, not PACKT) and found some errors. Up to now, they are in the errata (v2) but not yet fixed in my fresh copy. Can you say something about the current state? Are the updates for immediate PACKT customers only?
Edit: Interestingly, my copy passes the test on page viii (so I have Classifiers there), but not, for example, the one regarding the inverted 'y' variants (with and without caret) on page 22; the errors on p. 23 are also still present.
I am working on a finite element code in Python. It was originally for the diffusion equation, but I want to modify it for the wave equation and include a Ricker source term. Adding the source term produces an error. Below are the code and the error.
from IPython import display
from matplotlib.tri import Triangulation, LinearTriInterpolator
import numpy
import numpy as np
import pylab

# NPOINTS, L, H, updateMatrix, points and analytical are defined
# elsewhere in the full script.
deltat = 0.001
numIterations = 30
mass = numpy.zeros((NPOINTS, NPOINTS))
stiffness = numpy.zeros((NPOINTS, NPOINTS))
phi = numpy.zeros((NPOINTS,))
phi_old = numpy.zeros((NPOINTS,))
f0 = 5     # center frequency of the Ricker wavelet
q0 = 100   # maximum amplitude of the Ricker wavelet
t = np.arange(0, numIterations, deltat)            # time vector
tau = np.pi * f0 * (t - 1.5 / f0)
q = q0 * (1.0 - 2.0 * tau**2.0) * np.exp(-tau**2)  # Ricker wavelet
xi = np.linspace(0, L, 200)
yi = np.linspace(0, H, 200)
Xi, Yi = np.meshgrid(xi, yi)
updateMatrix(mass, stiffness, phi)
mat = mass/deltat + stiffness
triang = Triangulation(points[:, 0], points[:, 1])
for iteration in range(1, numIterations + 1):
    phi_old = phi
    rhs = numpy.dot(mass/deltat, phi_old)
    rhs = rhs + q
    phi = numpy.linalg.solve(mat, rhs)
    interpolator = LinearTriInterpolator(triang, phi)
    zi = interpolator(Xi, Yi)
fig1 = pylab.figure(1)
pylab.imshow(zi)
fig2 = pylab.figure(2)
xanal, yanal = analytical(numIterations*deltat)
pylab.plot(xanal, yanal, "-")
pylab.plot(Xi[100, :], zi[100, :])
fig2.savefig("comparison.png", format="PNG")
ValueError Traceback (most recent call last)
in
29
30 rhs = numpy.dot(mass/deltat, phi_old)
---> 31 rhs = rhs + q
32 phi = numpy.linalg.solve(mat,rhs)
33 interpolator = LinearTriInterpolator(triang, phi)
ValueError: operands could not be broadcast together with shapes (200,) (30000,)
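The shapes in the error line up with q being the full time vector (30000 samples at deltat = 0.001) while rhs has one entry per mesh node. A hedged sketch of one way to reconcile them (my guess at the intent; source_node is a hypothetical index of the node where the Ricker source acts):

# Sample the wavelet once per time step and inject it at a single node,
# so the source contribution has the same shape as rhs (NPOINTS,).
n_steps_per_iter = len(t) // numIterations
source = numpy.zeros((NPOINTS,))
source_node = 0  # hypothetical: index of the mesh node carrying the source
for iteration in range(1, numIterations + 1):
    phi_old = phi
    rhs = numpy.dot(mass / deltat, phi_old)
    source[source_node] = q[(iteration - 1) * n_steps_per_iter]
    rhs = rhs + source
    phi = numpy.linalg.solve(mat, rhs)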
I'm now learning machine learning using the Japanese translation of this book, and when I run this program, I always get stuck on the part using sklearn.svm.
When the program executes gs = gs.fit(X_train, y_train), it keeps showing the previous two graphs over and over. I don't know the reason; could you tell me what the cause might be?
My PC's specs:
Windows 10, Python 3.6.5, scikit-learn 0.19.1
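One possibility (an assumption on my part, not a confirmed diagnosis): on Windows, GridSearchCV with n_jobs=-1 uses process-based parallelism, and each worker re-imports the main script; any top-level plotting code then runs again in every worker, which looks like the same figures appearing repeatedly. Guarding the script entry point avoids the re-execution:

if __name__ == '__main__':
    # Only the parent process runs the search (and any plotting);
    # spawned workers re-import this file without executing this block.
    gs = gs.fit(X_train, y_train)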
First things first: I absolutely like how you motivate, introduce and implement the relevant concepts in your book.
I think there is a problem with the Rosenblatt perceptron learning description (evaluation) as presented in the figure on page 30 of the book. The errors counted in the variable errors are the number of updates performed in one epoch. However, this number does not represent the number of misclassifications after each epoch. For instance, if you use your standard options but train for only one iteration, there will be two updates ("2 errors" according to your terminology); however, all items will be classified as -1 (Setosa), so there are 50 misclassifications and this classifier's error rate is actually 50%.
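To count true misclassifications per epoch rather than updates, one could evaluate the model on the full training set after each pass (a minimal sketch, assuming the Perceptron class from Chapter 2 with its fit loop and predict method):

# Inside fit, after the weight updates of each epoch:
# record how many samples the current weights actually misclassify.
misclassified = int((self.predict(X) != y).sum())
self.errors_.append(misclassified)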
Hi,
First of all, thanks for your nice book, Python Machine Learning.
I began reading it just now, and I am wondering one thing about the implementation of AdalineSGD mentioned in the book:
def fit(self, X, y):
    self._initialize_weights(X.shape[1])
    self.cost_ = []
    for i in range(self.n_iter):
        if self.shuffle:
            X, y = self._shuffle(X, y)
        cost = []
        for xi, target in zip(X, y):
            cost.append(self._update_weights(xi, target))
        avg_cost = sum(cost) / len(y)
        self.cost_.append(avg_cost)
    return self

def _update_weights(self, xi, target):
    """Apply Adaline learning rule to update the weights"""
    output = self.net_input(xi)
    error = (target - output)
    self.w_[1:] += self.eta * xi.dot(error)
    self.w_[0] += self.eta * error
    cost = 0.5 * error**2
    return cost
I think the way self.w_[1:] is updated in AdalineSGD is in fact the same as in the batch AdalineGD implementation, just written differently:
output = self.activation(X)
errors = (y - output)
self.w_[1:] += self.eta * X.T.dot(errors)
IMO, self.eta * X.T.dot(errors) operates on the entire matrix X in AdalineGD, whereas AdalineSGD operates row by row via the for loop (for xi, target in zip(X, y)) over the same X. It doesn't reflect the essential difference between AdalineGD and AdalineSGD that you mention in the book.
Hello,
I think the function zero_init_weight is missing.
I searched the github site but did not find it.
Maybe this is another version of the softmax regressor, and it is missing here?
Best Regards, Thomas
In the perceptron part of the code, I see:
for xi, target in zip(X, y):
    update = self.eta * (target - self.predict(xi))
    self.w_[1:] += update * xi
    self.w_[0] += update
In the SGD part I see something similar, except that every time before the new gradient points are calculated, the data is shuffled:
X, y = self._shuffle(X, y)
for xi, target in zip(X, y):
    cost.append(self._update_weights(xi, target))

def _update_weights(self, xi, target):
    """Apply Adaline learning rule to update the weights"""
    output = self.net_input(xi)
    error = (target - output)
    self.w_[1:] += self.eta * xi.dot(error)
    self.w_[0] += self.eta * error
I do not see any difference between the two except for the shuffling part, and that one uses a binary value while the other uses a real value (SGD). Did I misunderstand how fundamentally the weights are calculated for SGD versus the simple perceptron model? Of course, if there were a mini-batch implementation, the code would look a lot more like the adaptive linear neuron. But since you are taking sample by sample, they are implemented similarly?
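For contrast (a minimal sketch, not from the book): the batch AdalineGD update uses the errors of all samples at once, once per epoch, whereas the SGD variant above updates the weights after every single sample, so the weight vector already changes within an epoch. The perceptron rule differs again in that its error is computed from the thresholded prediction rather than the continuous net input:

# Batch gradient descent: one update per epoch from all samples.
output = self.net_input(X)            # continuous outputs, shape (n_samples,)
errors = y - output
self.w_[1:] += self.eta * X.T.dot(errors)
self.w_[0] += self.eta * errors.sum()

# Perceptron rule: the error uses the thresholded label, so it is 0 for
# correctly classified samples and the update only fires on mistakes.
update = self.eta * (target - self.predict(xi))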
In chapter 2, where the iris data is plotted on a scatterplot,
# extract sepal length and petal length
X = df.iloc[0:100, [0, 2]].values

# plot data
plt.scatter(X[:50, 0], X[:50, 1],
            color='red', marker='o', label='setosa')
plt.scatter(X[50:100, 0], X[50:100, 1],
            color='blue', marker='x', label='versicolor')
it is simply assumed that the first 50 rows belong to the label setosa and the next 50 to versicolor. The scatterplot should be generated using the labels (which are in the 5th column of the dataset), as in the sketch below.
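A label-driven sketch (my suggestion, not the book's code), masking on the species column instead of relying on row order:

import matplotlib.pyplot as plt

labels = df.iloc[0:100, 4].values            # species names in the 5th column
X = df.iloc[0:100, [0, 2]].values

for species, color, marker in [('Iris-setosa', 'red', 'o'),
                               ('Iris-versicolor', 'blue', 'x')]:
    mask = labels == species                 # boolean mask, order-independent
    plt.scatter(X[mask, 0], X[mask, 1],
                color=color, marker=marker, label=species)
plt.legend(loc='upper left')
plt.show()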
Just a heads up on this -- I checked my O'Reilly account, and they did not yet have the updated version.
I'll post here once it appears.
So you have this:
X, y = make_moons(n_samples=100, random_state=123)
alphas, lambdas = rbf_kernel_pca(X, gamma=15, n_components=1)
Then you take a sample from X:
x_new = X[25]
And then find the projection for the new sample from:
x_reproj = project_x(x_new, X,
... gamma=15, alphas=alphas, lambdas=lambdas)
But x_new was already part of the alphas and lambdas created using X. In other words, X already contained x_new when rbf_kernel_pca was applied. So should I be surprised that the projected value of x_new coincides exactly in the plots? I would have thought it might have been better to exclude x_new when deriving the alpha and lambda values and then apply project_x. Thoughts?
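A hold-out variant along those lines (my illustration; rbf_kernel_pca and project_x are the functions defined in this chapter):

import numpy as np
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=100, random_state=123)
x_new = X[25]
X_rest = np.delete(X, 25, axis=0)   # fit without the point we will project

alphas, lambdas = rbf_kernel_pca(X_rest, gamma=15, n_components=1)
x_reproj = project_x(x_new, X_rest, gamma=15, alphas=alphas, lambdas=lambdas)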
Sebastian,
I've been collecting my own data and have applied the plot_decision_regions function several times to my data, but I am running into a problem with this new data. The problem occurs here:
# plot class samples
for idx, cl in enumerate(np.unique(y)):
    plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
                alpha=0.8, c=cmap(idx),
                marker=markers[idx], label=cl)
My enumerated object is: [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
So 5 classifications, hot encoded.
From what I understand, this loop passes over my X_train_pca data five times and uses the boolean comparison y == cl to plot all my data points in five different colors as it runs through the markers and colormap.
Upon running, I get the warning:
FutureWarning: in the future, boolean array-likes will be handled as a boolean array index
plt.scatter(x=X[y == cl, 0], y=X[y == cl, 1],
The really weird part is the values in the array X[y == cl, 0]. They now look like: [-0.4277726 -0.4277726 -0.44362509 ..., -0.4277726 -0.4277726 -0.4277726 ] with shape (9784,), which is the original length of my X_train_pca data. (I believe it should be closer to about a fifth, since most of my data is similar in length, and I checked np.shape after the loop ran.)
To give a visual, my data looks like this. When it should be separated into colors with a spread looking like this.
I can't really think through the problem any more, probably due to a misunderstanding of what this FutureWarning is trying to tell me. I am wondering if you have any ideas as to what might cause this behavior.
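A hedged guess at the cause: the FutureWarning fires when the mask passed to fancy indexing is not a genuine boolean array (for example, an object-dtype array of bools); older numpy then treats it as integer indices, which would explain getting back all 9784 rows instead of one class's worth. Converting the mask explicitly is a cheap check (my sketch, not a confirmed fix):

import numpy as np

mask = np.asarray(y == cl, dtype=bool)   # force a genuine boolean mask
plt.scatter(x=X[mask, 0], y=X[mask, 1],
            alpha=0.8, c=cmap(idx),
            marker=markers[idx], label=cl)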
Dear all: I am extracting features from a wav file using PLP (Python 3.6, Anaconda Spyder). After execution, I am facing an error on this line:
File "C:\ProgramData\Anaconda3\lib\site-packages\sidekit\frontend\features.py", line 399, in power_spectrum
ahan = framed[start:stop, :] * window
ValueError: operands could not be broadcast together with shapes (400,2) (400,)
#!usr/bin/python
import numpy.matlib
import scipy
import wave  # needed for wave.open below
from scipy.fftpack.realtransforms import dct
from sidekit.frontend.vad import pre_emphasis
from sidekit.frontend.io import *
from sidekit.frontend.normfeat import *
from sidekit.frontend.features import *
import scipy.io.wavfile as wav
import numpy as np

def readWavFile(wav):
    # given a path from the keyboard to read a .wav file
    # wav = raw_input('Give me the path of the .wav file you want to read: ')
    inputWav = 'C:/Speech_Processing/2-Speech_Signal_Processing_and_Classification-master/feature_extraction_techniques' + wav
    return inputWav

# read the .wav file (signal file) and extract the information we need
def initialize(inputWav):
    rate, signal = wav.read(readWavFile(inputWav))  # rate: sampling frequency
    sig = wave.open(readWavFile(inputWav))
    # signal is the numpy 2D array with the data of the .wav file
    # len(signal) is the number of samples
    sampwidth = sig.getsampwidth()
    print('The sample rate of the audio is: ', rate)
    print('Sampwidth: ', sampwidth)
    return signal, rate

def PLP():
    folder = input('Give the name of the folder that you want to read data: ')
    amount = input('Give the number of samples in the specific folder: ')
    for x in range(1, int(amount) + 1):
        wav = '/' + folder + '/' + str(x) + '.wav'
        print(wav)
        # inputWav = readWavFile(wav)
        signal, rate = initialize(wav)
        # returns PLP coefficients for every frame
        plp_features = plp(signal, rasta=True)
        meanFeatures(plp_features[0])

# compute the mean features for one .wav file
# (take the features for every frame and average them over the sample)
def meanFeatures(plp_features):
    # make a numpy array with length the number of plp features
    mean_features = np.zeros(len(plp_features[0]))
    # for one input, sum all frames for a specific feature
    # and divide by the number of frames
    for x in range(len(plp_features)):
        for y in range(len(plp_features[x])):
            mean_features[y] += plp_features[x][y]
    mean_features = (mean_features / len(plp_features))
    print(mean_features)

def main():
    PLP()

main()
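The (400,2) vs (400,) shapes suggest the wav file is stereo: sidekit's framing keeps both channels while the analysis window is one-dimensional. A hedged workaround (my sketch) is to mix down to mono before calling plp:

# If the file is stereo, signal has shape (n_samples, 2);
# average the channels (or take one) to get a mono signal.
if signal.ndim == 2:
    signal = signal.mean(axis=1)
plp_features = plp(signal, rasta=True)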
On page 500 (second edition: September 2017) there is a figure illustrating Full, Same and Valid padding and how the pixel patches map to the feature maps.
The feature map in the valid-padding example is only 2x2, although it specifies a 5x5 pixel input, a 3x3 filter, and a stride of 1. The feature map should be of size 3x3, as the calculation below shows.
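Using the standard output-size formula with input size n, filter size m, padding p, and stride s:

o = floor((n + 2p - m) / s) + 1 = floor((5 + 0 - 3) / 1) + 1 = 3

so valid padding (p = 0) on a 5x5 input with a 3x3 filter and stride 1 indeed yields a 3x3 feature map.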
I understood the concept of complete linkage; however, in the example you provided, I did not understand the values in the table with the columns 'row label 1', 'row label 2', etc.
There is an "Additional Note (1)" section which says: "If all the weights are initialized to 0, only the scale of the weight vector, not the direction."
There seems to be some meaning missing from that sentence. I was wondering if you could correct it, please. Thank you very much!
def tokenizer(text):
    return text.split()

from nltk.stem.porter import PorterStemmer
porter = PorterStemmer()

def tokenizer_porter(text):
    return [porter.stem(word) for word in text.split()]

from nltk.corpus import stopwords
stop = stopwords.words('english')

X_train = df.loc[:25000, 'review'].values
y_train = df.loc[:25000, 'sentiment'].values
X_test = df.loc[25000:, 'review'].values
y_test = df.loc[25000:, 'sentiment'].values

from distutils.version import LooseVersion as Version
from sklearn import __version__ as sklearn_version
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

if Version(sklearn_version) < '0.18':
    from sklearn.grid_search import GridSearchCV
else:
    from sklearn.model_selection import GridSearchCV

tfidf = TfidfVectorizer(strip_accents=None,
                        lowercase=False,
                        preprocessor=None)

param_grid = [{'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              {'vect__ngram_range': [(1, 1)],
               'vect__stop_words': [stop, None],
               'vect__tokenizer': [tokenizer, tokenizer_porter],
               'vect__use_idf': [False],
               'vect__norm': [None],
               'clf__penalty': ['l1', 'l2'],
               'clf__C': [1.0, 10.0, 100.0]},
              ]

lr_tfidf = Pipeline([('vect', tfidf),
                     ('clf', LogisticRegression(random_state=0))])

gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy',
                           cv=5,
                           verbose=1,
                           n_jobs=-1)

gs_lr_tfidf.fit(X_train, y_train)
Hi,
I get an error: "can't get attribute tokenizer_porter".
What do you think the problem is?
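This usually appears when GridSearchCV's worker processes try to unpickle the custom tokenizer functions and cannot find them by module path, for example when the functions live only in a notebook's __main__. Two hedged workarounds (my suggestions, not from the book): run the search single-threaded, or move the tokenizers into an importable module.

# Workaround 1: avoid multiprocessing so nothing needs to be pickled.
gs_lr_tfidf = GridSearchCV(lr_tfidf, param_grid,
                           scoring='accuracy', cv=5,
                           verbose=1, n_jobs=1)

# Workaround 2 (hypothetical module name): put tokenizer and
# tokenizer_porter into tokenizers.py and import them, so worker
# processes can resolve them by module path.
# from tokenizers import tokenizer, tokenizer_porter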
When I run this script in a Python notebook:
https://github.com/rasbt/python-machine-learning-book/blob/master/code/optional-py-scripts/ch02.py
the last line (ada.partial_fit(X_std[0, :], y[0])) gives the error:
<__main__.AdalineSGD at 0x10a89fac8>
Can the iris.data file be added back into the repo on master?
Here's the last version I believe:
https://github.com/rasbt/python-machine-learning-book/blob/194e34f245abb97f53d0e72166ab6785d01a1e94/code/datasets/iris/iris.data
Thanks again!
Dear sir,
I am trying to study machine learning through your book "Python Machine Learning", and it is a very nice book!
I can't understand how to set up param_grid.
I tried to get information from sklearn, but it just says "dict or list of dictionaries",
and even the sample just writes "param_grid=....".
So, about param_grid: how do I set it up?
I am sorry, my English is a little weak!
I hope I have managed to convey my question; thank you very much!
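For anyone with the same question, a minimal sketch (my example, not from the book): param_grid is a dict, or a list of dicts, mapping estimator parameter names to the candidate values GridSearchCV should try.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Each dict is one grid; keys are parameter names of the estimator,
# values are the lists of candidates to try.
param_grid = [
    {'kernel': ['linear'], 'C': [0.1, 1.0, 10.0]},
    {'kernel': ['rbf'],    'C': [0.1, 1.0, 10.0], 'gamma': [0.01, 0.1]},
]
gs = GridSearchCV(SVC(), param_grid, scoring='accuracy', cv=5)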
I am getting this error at the np.dot call for the Iris dataset. Can you explain the solution?
Following is the traceback:
Traceback (most recent call last):
File "Perceptron.py", line 61, in
ppn.train(x, y)
File "Perceptron.py", line 24, in train
update = self.eta * (target - self.predict(xi))
File "Perceptron.py", line 35, in predict
return np.where(self.net_input(X) >= 0.0, 1, -1)
File "Perceptron.py", line 32, in net_input
return np.dot(X, self.w_[1:]) + self.w_[0]
Hello,
I was trying to execute the code:
%matplotlib inline
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from mlxtend.evaluate import plot_decision_regions
iris = load_iris()
y, X = iris.target, iris.data[:, [0, 2]] # only use 2 features
lr = LogisticRegression(C=100.0,
                        class_weight=None,
                        dual=False,
                        fit_intercept=True,
                        intercept_scaling=1,
                        max_iter=100,
                        multi_class='multinomial',
                        n_jobs=1,
                        penalty='l2',
                        random_state=1,
                        solver='newton-cg',
                        tol=0.0001,
                        verbose=0,
                        warm_start=False)
lr.fit(X, y)
plot_decision_regions(X=X, y=y, clf=lr, legend=2)
plt.xlabel('sepal length')
plt.ylabel('petal length')
plt.show()
but it returned following error:
---------------------------------------------------------------------------
ImportError Traceback (most recent call last)
<ipython-input-2-9b78ac9a656a> in <module>()
3 from sklearn.datasets import load_iris
4 import matplotlib.pyplot as plt
----> 5 from mlxtend.evaluate import plot_decision_regions
6
7 iris = load_iris()
ImportError: cannot import name 'plot_decision_regions'
I installed the mlxtend package. What am I doing wrong? Could you help me? Thanks in advance!
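If a recent mlxtend version is installed, the likely fix (to the best of my knowledge; check your installed version's docs) is that plot_decision_regions moved out of mlxtend.evaluate into the plotting subpackage:

# In newer mlxtend releases the function lives here:
from mlxtend.plotting import plot_decision_regions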