
Introduction


An Open Machine Learning Course

Jupyter notebooks for teaching machine learning. Based on scikit-learn and Keras, with OpenML for experimenting more extensively across many datasets.

Online course book, powered by Jupyter Book

Sources

Practice-oriented materials

We use many code examples from the following excellent books. We urge you to read them for more complete coverage of machine learning in Python:

Introduction to Machine Learning with Python by Andreas Mueller and Sarah Guido. Focusing entirely on scikit-learn, and written by one of its core developers, this book offers clear guidance on how to do machine learning with Python.

Deep Learning with Python by François Chollet. Written by the author of the Keras library, this book offers a clear explanation of deep learning with practical examples.

Python Machine Learning by Sebastian Raschka. One of the classic textbooks on how to do machine learning with Python.

Python for Data Analysis by Wes McKinney. A more introductory and broader text on doing data science with Python.

Theory-oriented materials

For a deeper understanding of machine learning techniques, we can recommend the following books:

"Mathematics for Machine Learning" by Marc Deisenroth, A. Aldo Faisal and Cheng Soon Ong. This provides the basics of linear algebra, geometry, probabilities, and continuous optimization, and how they are used in several machine learning algorithms. The PDF is available for free.

"The Elements of Statistical Learning: Data Mining, Inference, and Prediction. (2nd edition)" by Trevor Hastie, Robert Tibshirani, Jerome Friedman. One of the key references of the field. Great coverage of linear models, regularization, kernel methods, model evaluation, ensembles, neural nets, unsupervised learning. The PDF is available for free.

"Deep Learning" by Ian Goodfellow, Yoshua Bengio, Aaron Courville. The current reference for deep learning. Chapters can be downloaded from the website.

"An Introduction to Statistical Learning (with Applications in R)" by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. More introductory version of the above book, with many code examples in R. The PDF is also available for free. (Note that we won't be using R in the main course materials, but the examples are still very useful).

"Gaussian Processes for Machine Learning" by Carl Edward Rasmussen and Christopher K. I. Williams. The reference for Bayesian Inference. Also see David MacKay's book for additional insights. Also see this course by Neil Lawrence for a great introduction to Gaussian Processes, all from first principles.

Open course

Made with love by Joaquin Vanschoren. Materials are released under the CC0 License. You can use them as you like.

Partly based on notebooks by Andreas Mueller (CC0 licensed), François Chollet (MIT licensed), Sebastian Raschka (MIT licensed), and Neil Lawrence (with permission).

People

Contributors

akratiiet, joaquinvanschoren, pgijsbers


Issues

Missing data.csv in 00 - Tutorial 2a - Python for Data Analysis

import pandas as pd

dfs = pd.read_csv('data.csv')
dfs
dfs.set_value(0, 'a', 10)
dfs.to_csv('data.csv', index=False)  # Don't export the row index

If this is run, you get the following error:

OSError: File b'data.csv' does not exist

I cannot find the data.csv anywhere in the repo either.
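
A minimal workaround (a sketch only; the column name 'a' is taken from the set_value call above, and the actual file shipped with the tutorial may have looked different) is to generate a small data.csv first:

import pandas as pd

# Hypothetical stand-in for the missing file: a tiny frame with a
# column 'a', as referenced by the set_value call in the snippet.
pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}).to_csv('data.csv', index=False)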

2 Linear Models

Some questions and suggestions came to mind when I read about the gradient descent method:

  • In section Gradient Descent, I find the formulation of the exponential decay of the learning rate a little bit odd. I would suggest expressing \eta_s in terms of \eta_0 instead (see the sketch after this list).

  • In section Stochastic Gradient Descent (SGD), I believe it would be better to start the index at i=0, otherwise it would make more sense to divide by n+1 when averaging the individual losses. Same goes for the other two sums in that part.

  • Furthermore, the "incremental gradient" method looks a lot like the SAG method described here, rather than the incremental aggregated gradient (IAG) method from this paper, which I found confusing. I also found the SAGA algorithm. Maybe adding some of these references would be helpful to other students.

  • Another suggestion would be to change "random i" to "if i = i_s" and to add "with i_s randomly chosen per iteration".
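
To make the first two suggestions concrete, one possible formulation (my notation; the decay rate \lambda is an assumed symbol, not taken from the course):

\eta_s = \eta_0 \, e^{-\lambda s}

and, for the averaged loss, a normalization that matches the number of summands:

\frac{1}{n} \sum_{i=1}^{n} \ell_i(w) \quad \text{or, equivalently,} \quad \frac{1}{n+1} \sum_{i=0}^{n} \ell_i(w)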

File naming

The names of the neural network slides contain |. This character is allowed on macOS and Linux, but it is invalid in Windows file names, so it prevents Windows users from cloning or updating this repository.

Pulling from upstreams...
remote: Counting objects: 177, done.
remote: Compressing objects: 100% (10/10), done.
remote: Total 177 (delta 4), reused 1 (delta 1), pack-reused 166
Receiving objects: 100% (177/177), 11.75 MiB | 582.00 KiB/s, done.
Resolving deltas: 100% (10/10), completed with 1 local object.
From https://github.com/joaquinvanschoren/ML-course
 * branch            master     -> FETCH_HEAD
   66e8297..3162f05  master     -> upstream/master
error: unable to create file N1 | Introduction.ipynb: Invalid argument
error: unable to create file N2 | Artificial Neuron.ipynb: Invalid argument
error: unable to create file N3 | Perceptron Classifier.ipynb: Invalid argument
error: unable to create file N4 | MLP.ipynb: Invalid argument
error: unable to create file N5 | MLP Classifier.ipynb: Invalid argument
error: unable to create file N6 | Optimization and Regularization.ipynb: Invalid argument
error: unable to create file N7 | Convolutional Networks.ipynb: Invalid argument
Updating 66e8297..3162f05
error: unable to create file N8 | Recurrent Networks.ipynb: Invalid argument
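
A possible fix on the repository side (a sketch only, to be run where the files already exist; the glob pattern is assumed from the log above) is to replace the | in each notebook name with a dash:

from pathlib import Path

# Rename 'N1 | Introduction.ipynb' to 'N1 - Introduction.ipynb', etc.,
# so the names are also valid on Windows.
for nb in Path('.').glob('N* | *.ipynb'):
    nb.rename(nb.with_name(nb.name.replace(' | ', ' - ')))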

Issue on page /notebooks/01 - Introduction.html

In section "Neural networks: evaluation and optimization", after "E.g. Gradient descent:", the formula is missing a partial derivative symbol and, more importantly, I guess there should be a minus in order to move towards the minimum.

Error downloading dataset

When I run this code, the process gets interrupted.

import openml as oml
import numpy as np
import matplotlib.pyplot as plt
import sklearn

# Download Streetview data. Takes a while the first time.
SVHN = oml.datasets.get_dataset(41081)
X, y, cats, attrs = SVHN.get_data(dataset_format='array',
                                  target=SVHN.default_target_attribute)

Process finished with exit code 137 (interrupted by signal 9: SIGKILL)

I have 12 GB of memory free and can't figure out why I am getting this interrupt.
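
For what it's worth, exit code 137 usually means the operating system's out-of-memory killer stopped the process, and parsing SVHN into a dense array can transiently use much more memory than the final array itself. A possible workaround (untested, assuming the DataFrame format keeps the peak memory lower):

import openml as oml

# Request a pandas DataFrame instead of a dense numpy array; this may
# reduce the peak memory used while the dataset is parsed.
SVHN = oml.datasets.get_dataset(41081)
X, y, cats, attrs = SVHN.get_data(dataset_format='dataframe',
                                  target=SVHN.default_target_attribute)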

Some lab solutions not available

Hi, the solutions to labs 2a and 7a are not available yet.
They are not listed on the home page /intro.html.

Could you put these online?

Error in tutorial 4

In Tutorial 4 there is an error:

print("SVM component: {}".format(pipe.named_steps['svm']f))

The stray f should not be there.
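
The corrected line would read:

print("SVM component: {}".format(pipe.named_steps['svm']))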

Gini index and information gain

  • I could not immediately find the idea behind the Gini impurity index on the internet. A short derivation (sketched after this list) helped me understand the intuition a little better: the index captures "how often a randomly selected element would be labeled incorrectly if the label were chosen randomly according to the actual distribution (in a leaf)".
  • In the definition of information gain, it is unclear to me what X_i is exactly. I would have expected Gain(X, i) and |X| in the denominator of the fraction. Would that make sense? Furthermore, am I correct that the l=1 to L sum loops over what some call the levels of this feature?
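
For reference, my reconstruction of that derivation, with p_k the fraction of elements of class k in the leaf:

\text{Gini} = \sum_{k=1}^{K} p_k (1 - p_k) = \sum_{k=1}^{K} p_k - \sum_{k=1}^{K} p_k^2 = 1 - \sum_{k=1}^{K} p_k^2

Each term p_k (1 - p_k) is the probability of drawing an element of class k and then assigning it a label other than k, when the label is drawn from the same distribution.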

OpenML has a dependency on Visual Studio Build Tools

When I was installing OpenML on Windows, I ran into several issues. As it turns out, one of its dependencies, netifaces, requires the Visual C++ Build Tools to be installed. It would help future students if this were documented somewhere. Even though it is a very common program to have installed, I had recently wiped my hard disk and did not have it. pip does not clearly state that this tool is required.

Not all Gabor kernels are plotted

In FDM_Challenge_Part3 it is confusing that 'Gabor family' and 'Responses' both plot 10 * 10 = 100 images. This gives the impression that len(kernel) == 100, while in reality len(kernel) == 225.

Dollar signs in formulas

You're probably already aware, but in a lot of places in the HTML view of the pages there are dollar signs around the formulas.

For example: [screenshot of a formula rendered with visible $$ delimiters]

It seems to happen when double dollar signs ($$) are used in the notebooks to typeset LaTeX on a full line. For some reason the HTML view then sometimes (not always) treats it as inline LaTeX, as if the formula were only surrounded by single dollar signs. Since the result is correct in the notebooks and PDFs, I'm not sure there is an easy way to resolve this. Luckily it doesn't affect readability too much, so this isn't a very high-priority issue.

Finally, the issue seems to happen in both Firefox and Chrome.

Lab solutions imports are wrong

In the lab solutions, the preamble is still imported, which causes an error (and is overkill).
Update the imports to include only the required packages, as in the original lab notebooks.
