chengsoonong / mclass-sky
Multiclass methods for astronomical data
License: BSD 3-Clause "New" or "Revised" License
In the experiments comparing different kernels with SVMs, the polynomial kernel of degree 3 performs very well.
This suggests we should also use kernels in the active learning experiments with logistic regression. However, because sklearn uses liblinear, we need to explicitly construct the basis functions for the degree-3 polynomial kernel.
The 'multi_class' option should be set to 'multinomial', as this gives better probability estimates than one-vs-rest.
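A minimal sketch of the idea above: expand the inputs with an explicit degree-3 polynomial basis, then fit logistic regression on the expanded features. The data shapes here are made up for illustration; note that recent scikit-learn versions default to a multinomial (softmax) loss when using the lbfgs solver with more than two classes.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the photometric features (hypothetical shapes).
rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = rng.randint(0, 3, 200)

# Explicitly construct the degree-3 polynomial basis, then fit a
# multinomial logistic regression on the expanded features.
clf = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LogisticRegression(solver='lbfgs', max_iter=1000),
)
clf.fit(X, y)
proba = clf.predict_proba(X)  # rows sum to 1 across the three classes
```

The pipeline keeps the basis expansion and the classifier together, so cross-validation sees them as one model.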
2 questions:
Consider using differences between magnitudes (i.e. colours) as features:
For logistic regression, sort data by confidence measure (e.g. max class probability over the classes). Then, group data into bins, and plot accuracy for each bin. (Goal is to understand strange behavior where logistic regression becomes less accurate for high-confidence predictions.)
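The binning described above can be sketched as follows. The helper name is hypothetical; the synthetic labels are drawn from the predicted probabilities so that, for a calibrated model, accuracy should rise with confidence.

```python
import numpy as np

def accuracy_by_confidence(proba, y_true, n_bins=10):
    """Group predictions into bins by max class probability and
    report (mean confidence, accuracy) within each bin."""
    confidence = proba.max(axis=1)
    y_pred = proba.argmax(axis=1)
    order = np.argsort(confidence)
    bins = np.array_split(order, n_bins)  # equal-sized bins, low to high confidence
    return [(confidence[b].mean(), (y_pred[b] == y_true[b]).mean()) for b in bins]

# Synthetic check: labels sampled from the predicted distribution, so
# high-confidence bins should be more accurate than low-confidence ones.
rng = np.random.RandomState(0)
proba = rng.dirichlet(np.ones(3), size=1000)
y_true = np.array([rng.choice(3, p=p) for p in proba])
curve = accuracy_by_confidence(proba, y_true)
```

If the real model shows the opposite trend (accuracy falling in the top bins), that isolates the strange behaviour to the high-confidence region.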
Growing a multi-class classifier with a reject option
Work out the multiclass equivalents of:
http://arxiv.org/pdf/1103.1790.pdf
Fit a density on each class of objects, and identify new objects as unknown if they lie in low-density regions for all classes. We can start by assuming Gaussian densities.
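A sketch of the Gaussian starting point, with made-up 2-D data and a made-up log-density threshold (both would need tuning on the real features):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_class_gaussians(X, y):
    """Fit one Gaussian per class (the simple starting point suggested above)."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        models[c] = multivariate_normal(mean=Xc.mean(axis=0),
                                        cov=np.cov(Xc, rowvar=False))
    return models

def is_unknown(models, x, log_density_threshold):
    """Flag an object as unknown if it lies in a low-density region of every class."""
    return all(m.logpdf(x) < log_density_threshold for m in models.values())

# Hypothetical example: two well-separated 2-D classes, plus a far-away point.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 5.0])
y = np.repeat([0, 1], 100)
models = fit_class_gaussians(X, y)
outlier = np.array([20.0, -20.0])
```

Full covariances may be too flexible in high dimensions; diagonal or shrunk covariances are an easy fallback.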
Currently, we only include objects which fall into one of three categories (Stars, Quasars, and Galaxies). We might want to check to see if there are any outliers in this dataset and how the classifiers behave when we add unknown data.
Package the code so that it can be installed using pip, and have all notebooks use this package.
A good starting point is didbits
pip install mclearn
doesn't work out of the box because the package is missing DESCRIPTION.rst.
Use the Sanderson and Scott approach and compare with entropy of an ensemble of classifiers.
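For the ensemble side of the comparison, one simple disagreement measure is vote entropy: the entropy of the distribution of class votes across ensemble members. A sketch (the function name and toy ensemble are illustrative):

```python
import numpy as np

def vote_entropy(predictions):
    """Entropy of the ensemble's vote distribution for each example.
    `predictions` has shape (n_members, n_examples); high entropy
    means the members disagree, which can serve as an unknown-class
    score to compare against the Sanderson and Scott approach."""
    n_members, n_examples = predictions.shape
    n_classes = predictions.max() + 1
    entropy = np.zeros(n_examples)
    for i in range(n_examples):
        counts = np.bincount(predictions[:, i], minlength=n_classes)
        p = counts / n_members
        p = p[p > 0]
        entropy[i] = -(p * np.log(p)).sum()
    return entropy

# Hypothetical 5-member ensemble on 3 examples: unanimous, 2-way split, 3-way split.
preds = np.array([[0, 1, 0],
                  [0, 2, 1],
                  [0, 1, 2],
                  [0, 2, 0],
                  [0, 1, 1]])
H = vote_entropy(preds)
```

An alternative is the entropy of the averaged predictive distributions, which uses the probabilities rather than just the hard votes.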
Readthedocs using mkdocs.
http://ericholscher.com/blog/2014/feb/27/how-i-judge-documentation-quality/
It would be good to get warm start for LIBLINEAR working in sklearn.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
https://manojbits.wordpress.com/2014/07/28/scikit-learn-logistic-regression-cv-2/
http://www.csie.ntu.edu.tw/~cjlin/papers/ws/
Correct for the different distortions caused by dust in different parts of the sky (dec, ra). Dust causes reddening, and we can download the correction vector (with a transformation) from SDSS.
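The correction itself is a per-band subtraction once the extinction values are in hand. A sketch with made-up numbers; the column names and values are assumptions standing in for whatever SDSS provides:

```python
import numpy as np

# Hypothetical sketch: SDSS supplies a per-band extinction value for each
# object; subtracting it from the observed magnitude corrects for dust
# reddening. The magnitudes and extinction values below are invented.
observed_mag = np.array([18.2, 19.1, 17.6])   # e.g. r-band magnitudes
extinction   = np.array([0.12, 0.30, 0.05])   # e.g. an extinction_r column
corrected_mag = observed_mag - extinction
```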
With QDA and/or logistic regression, try holding out one of the three existing classes. Then, see how well screening by confidence does at classifying the existing classes while detecting "outliers" (i.e. members of the held-out class.)
Justin is on record that this won't work that well with discriminative classifiers like logistic regression, but let's try to prove him wrong.
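A sketch of the experiment with synthetic 2-D data: train on two classes, hold out the third, and flag low-confidence predictions as outliers. The geometry and threshold are invented; this also illustrates Justin's objection, since a discriminative classifier only assigns low confidence near its decision boundary, not everywhere the held-out class lives.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical layout: class 2 is held out and sits between the others.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2),            # class 0
               rng.randn(100, 2) + [6, 0],   # class 1
               rng.randn(100, 2) + [3, 6]])  # class 2 (held out)
y = np.repeat([0, 1, 2], 100)

known = y < 2
clf = LogisticRegression().fit(X[known], y[known])

# Screen by confidence: flag anything the classifier is unsure about.
confidence = clf.predict_proba(X).max(axis=1)
flagged = confidence < 0.9  # threshold would need tuning
```

Held-out points near the boundary get flagged; held-out points that happen to fall deep inside one class's half-space do not, which is exactly the failure mode to measure.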
For example, the fraction of data used for training is a parameter that the user (in the Jupyter notebook) might wish to change, and they shouldn't have to go into the package code to change it.
Plot the balanced accuracy as the number of training points is increased.
Given a trained classifier, we can label all examples which are unlabelled in SDSS.
If our classifier were 100% accurate, we could use its predictions directly. Since it is not, we need a way to correct for the class proportions.
This may involve downloading the data in chunks.
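One standard way to correct predicted class proportions (an assumption on my part, not a method the project has committed to) is to estimate the confusion rates on labelled data and invert them: if C[i, j] is the probability of predicting class j for a true member of class i, the observed prediction proportions satisfy p_obs = C.T @ p_true.

```python
import numpy as np

# Confusion rates estimated from validation data (invented numbers).
C = np.array([[0.90, 0.05, 0.05],
              [0.10, 0.85, 0.05],
              [0.05, 0.10, 0.85]])

# Proportions of each class among the raw predictions (invented numbers).
p_obs = np.array([0.50, 0.30, 0.20])

# Invert p_obs = C.T @ p_true, then project back onto the simplex
# in case noise pushes an estimate below zero.
p_true = np.linalg.solve(C.T, p_obs)
p_true = np.clip(p_true, 0, None)
p_true /= p_true.sum()
```

With noisy confusion estimates, the plain inverse can be unstable; a constrained least-squares fit on the simplex is a more robust variant.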
One bug I found was that the Getting Started link from the README points to nbviewer, which points to alasdairtran's fork.
In collaboration with Christian Wolf of SkyMapper.
Last section of introduction
Using likelihood threshold, or probability threshold, display the heatmap. Interesting coordinates:
With classifiers that can do so efficiently, try training with varying amounts of data up to the full dataset of ~2.8 million points. The SVM with an RBF kernel continues to improve slowly at an approximately linear rate up to 100,000 points, which suggests that using more data could push us significantly closer to 100% accuracy.
The objects that are labelled are not uniformly distributed in the sky.
Consider the four active learning strategies from Schein and Ungar as four arms of a bandit problem. Some careful thought needs to be put into designing a good reward function.
It would be convenient if the rewards had structure that would allow kl-UCB or KL-UCB to be applied.
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos and Gilles Stoltz,
"Kullback–Leibler Upper Confidence Bounds for Optimal Sequential Allocation"
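For Bernoulli rewards, the kl-UCB index of an arm is the largest mean q whose KL divergence from the empirical mean stays within a log(t) budget, which can be found by bisection. A sketch (the log(t) exploration term is the basic choice; the paper discusses refinements):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, precision=1e-6):
    """Largest q with pulls * KL(mean, q) <= log(t), via bisection."""
    bound = math.log(t) / pulls
    lo, hi = mean, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# Example: an arm with empirical mean 0.4 after 10 pulls, at time step 100.
u = kl_ucb_index(0.4, 10, 100)
```

The catch for our setting is that active-learning rewards (e.g. accuracy gains) are not naturally Bernoulli, which is where the careful reward design comes in.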
There are two main measures to consider:
Instead of empirically evaluating the Fisher information, one can use the elemental matrices which exist for certain distributions. May be useful for logistic regression.
Elemental information matrices and optimal experimental design for generalized regression models, 2014
Anthony C. Atkinson, Valerii V. Fedorov, Agnes M. Herzberg, Rongmei Zhang
In particular, see if it can predict an unknown class, e.g., maybe when the predicted probabilities for all three classes are equally low.
Currently the datasets are hosted in my Dropbox account. Consider moving them to mldata.
Rename the column labels to something more meaningful.
Make the first column the class column.
Construct and compare features for each class (galaxies, stars, quasars).
To choose good regularisation and kernel parameters, it is best to use a double-loop (nested) cross-validation method.
https://github.com/chengsoonong/didbits/blob/master/CrossVal/expt_protocol.ipynb
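The double loop can be written directly with scikit-learn: a grid search supplies the inner loop over hyperparameters, and an outer cross-validation gives an unbiased performance estimate. A sketch on a stand-in dataset (iris here, not our photometric data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: choose C and gamma by 3-fold grid search.
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
inner = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=3)

# Outer loop: 5-fold estimate of the tuned model's accuracy.
scores = cross_val_score(inner, X, y, cv=5)
```

Reporting the inner grid search's best score instead of the outer scores would be optimistically biased, which is the point of the double loop.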
For checking that the software package is working.
This can be done with a robust optimisation approach.
See chapter 12 of:
http://www2.isye.gatech.edu/~nemirovs/FullBookDec11.pdf
Plot log scale learning curves for:
http://www.inf-cv.uni-jena.de/Forschung/paperProjects/Kernel+Null+Space+Methods+for+Novelty+Detection.html
See survey:
From Theories to Queries: Active Learning in Practice
Burr Settles
Have a look at mydataflow.pdf in 0ddd874
There are two files, sdss.h5 and vstatlas.h5, which are written to and read from separate directories.
From the docs, it looks like only five classifiers in sklearn inherently support multiclass: Naive Bayes, LDA, Decision Trees, Random Forests and Nearest Neighbours.
Support vector machines use the one-vs-one strategy for multiclass classification. Other linear models, including logistic regression, use one-vs-rest in sklearn.
There are several algorithms proposed in:
T. Sanderson and C. Scott
Class Proportion Estimation with Application to Multiclass Anomaly Rejection
AISTATS 2014
which relate to classification when there is an unknown class.
We often want to elicit expert knowledge. One way to do this is to let the experts select specific examples, or a subset of the feature space, and then annotate those examples.
http://bokeh.pydata.org/en/latest/docs/server_gallery/stocks_server.html
Start with background on Thompson sampling.
In particular, describe:
In many of the notebooks, there is the statement:
from mclearn import *
This is generally considered bad practice. Lots of discussion on stackoverflow.
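The usual fix is an explicit import, so every name's provenance is visible. Illustrated here with a standard-library module rather than mclearn (so the snippet is self-contained):

```python
# Wildcard imports (`from module import *`) pollute the namespace and
# make it unclear where names come from. Prefer an explicit import:
import math

area = math.pi * 2 ** 2  # provenance of `pi` is obvious

# versus the discouraged form, where `pi` could come from anywhere:
# from math import *
# area = pi * 2 ** 2
```

For the notebooks, `import mclearn` (or `from mclearn import <specific names>`) would serve the same purpose.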
Look at sklearn cross validation.
The current data on the nicta filestore does not have extinction values.
between Section 1.2 and 1.3
Only interact with data through the API exposed by this class.