chengsoonong / mclass-sky
Multiclass methods for astronomical data
License: BSD 3-Clause "New" or "Revised" License
In the experiments comparing different kernels with SVMs, the polynomial kernel of degree 3 performs very well.
This suggests we should also use kernels in the active learning experiments with logistic regression. However, because sklearn uses liblinear, we need to explicitly construct the basis functions for the degree-3 polynomial kernel.
The 'multi_class' option should be set to 'multinomial', as this gives better probability estimates than one-vs-rest.
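A minimal sketch of the idea above: expand the inputs with an explicit degree-3 polynomial basis, then fit logistic regression on the expanded features. The data shapes here are made up for illustration; note that recent scikit-learn versions default to a multinomial (softmax) loss when using the lbfgs solver with more than two classes.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression

# Toy data standing in for the photometric features (hypothetical shapes).
rng = np.random.RandomState(0)
X = rng.randn(200, 4)
y = rng.randint(0, 3, 200)

# Explicitly construct the degree-3 polynomial basis, then fit a
# multinomial logistic regression on the expanded features.
clf = make_pipeline(
    PolynomialFeatures(degree=3, include_bias=False),
    LogisticRegression(solver='lbfgs', max_iter=1000),
)
clf.fit(X, y)
proba = clf.predict_proba(X)  # rows sum to 1 across the three classes
```

The pipeline keeps the basis expansion and the classifier together, so cross-validation sees them as one model.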
2 questions:
Consider using differences between magnitudes (i.e. colours) as features:
For logistic regression, sort data by confidence measure (e.g. max class probability over the classes). Then, group data into bins, and plot accuracy for each bin. (Goal is to understand strange behavior where logistic regression becomes less accurate for high-confidence predictions.)
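The binning described above can be sketched as follows. The helper name is hypothetical; the synthetic labels are drawn from the predicted probabilities so that, for a calibrated model, accuracy should rise with confidence.

```python
import numpy as np

def accuracy_by_confidence(proba, y_true, n_bins=10):
    """Group predictions into bins by max class probability and
    report (mean confidence, accuracy) within each bin."""
    confidence = proba.max(axis=1)
    y_pred = proba.argmax(axis=1)
    order = np.argsort(confidence)
    bins = np.array_split(order, n_bins)  # equal-sized bins, low to high confidence
    return [(confidence[b].mean(), (y_pred[b] == y_true[b]).mean()) for b in bins]

# Synthetic check: labels sampled from the predicted distribution, so
# high-confidence bins should be more accurate than low-confidence ones.
rng = np.random.RandomState(0)
proba = rng.dirichlet(np.ones(3), size=1000)
y_true = np.array([rng.choice(3, p=p) for p in proba])
curve = accuracy_by_confidence(proba, y_true)
```

If the real model shows the opposite trend (accuracy falling in the top bins), that isolates the strange behaviour to the high-confidence region.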
Growing a multi-class classifier with a reject option
Work out the multiclass equivalents of:
http://arxiv.org/pdf/1103.1790.pdf
Fit a density on each class of objects, and identify new objects as unknown if they lie in low-density regions for all classes. We can start by assuming Gaussian densities.
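A sketch of the Gaussian starting point, with made-up 2-D data and a made-up log-density threshold (both would need tuning on the real features):

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_class_gaussians(X, y):
    """Fit one Gaussian per class (the simple starting point suggested above)."""
    models = {}
    for c in np.unique(y):
        Xc = X[y == c]
        models[c] = multivariate_normal(mean=Xc.mean(axis=0),
                                        cov=np.cov(Xc, rowvar=False))
    return models

def is_unknown(models, x, log_density_threshold):
    """Flag an object as unknown if it lies in a low-density region of every class."""
    return all(m.logpdf(x) < log_density_threshold for m in models.values())

# Hypothetical example: two well-separated 2-D classes, plus a far-away point.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 5.0])
y = np.repeat([0, 1], 100)
models = fit_class_gaussians(X, y)
outlier = np.array([20.0, -20.0])
```

Full covariances may be too flexible in high dimensions; diagonal or shrunk covariances are an easy fallback.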
Currently, we only include objects which fall into one of three categories (Stars, Quasars, and Galaxies). We might want to check to see if there are any outliers in this dataset and how the classifiers behave when we add unknown data.
Package the code so that it can be installed using pip, and have all notebooks use this package.
A good starting point is didbits
pip install mclearn
doesn't work out of the box because the package is missing DESCRIPTION.rst.
Use the Sanderson and Scott approach and compare with entropy of an ensemble of classifiers.
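For the ensemble side of the comparison, one simple disagreement measure is vote entropy: the entropy of the distribution of class votes across ensemble members. A sketch (the function name and toy ensemble are illustrative):

```python
import numpy as np

def vote_entropy(predictions):
    """Entropy of the ensemble's vote distribution for each example.
    `predictions` has shape (n_members, n_examples); high entropy
    means the members disagree, which can serve as an unknown-class
    score to compare against the Sanderson and Scott approach."""
    n_members, n_examples = predictions.shape
    n_classes = predictions.max() + 1
    entropy = np.zeros(n_examples)
    for i in range(n_examples):
        counts = np.bincount(predictions[:, i], minlength=n_classes)
        p = counts / n_members
        p = p[p > 0]
        entropy[i] = -(p * np.log(p)).sum()
    return entropy

# Hypothetical 5-member ensemble on 3 examples: unanimous, 2-way split, 3-way split.
preds = np.array([[0, 1, 0],
                  [0, 2, 1],
                  [0, 1, 2],
                  [0, 2, 0],
                  [0, 1, 1]])
H = vote_entropy(preds)
```

An alternative is the entropy of the averaged predictive distributions, which uses the probabilities rather than just the hard votes.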
Readthedocs using mkdocs.
http://ericholscher.com/blog/2014/feb/27/how-i-judge-documentation-quality/
It would be good to get warm start for LIBLINEAR working in sklearn.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
https://manojbits.wordpress.com/2014/07/28/scikit-learn-logistic-regression-cv-2/
http://www.csie.ntu.edu.tw/~cjlin/papers/ws/
Correct for the different distortions caused by dust in different parts of the sky (dec, ra). Dust causes reddening, and we can download the correction vector (with a transformation) from SDSS.
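The correction itself is a per-band subtraction once the extinction values are in hand. A sketch with made-up numbers; the column names and values are assumptions standing in for whatever SDSS provides:

```python
import numpy as np

# Hypothetical sketch: SDSS supplies a per-band extinction value for each
# object; subtracting it from the observed magnitude corrects for dust
# reddening. The magnitudes and extinction values below are invented.
observed_mag = np.array([18.2, 19.1, 17.6])   # e.g. r-band magnitudes
extinction   = np.array([0.12, 0.30, 0.05])   # e.g. an extinction_r column
corrected_mag = observed_mag - extinction
```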
With QDA and/or logistic regression, try holding out one of the three existing classes. Then, see how well screening by confidence does at classifying the existing classes while detecting "outliers" (i.e. members of the held-out class.)
Justin is on record that this won't work that well with discriminative classifiers like logistic regression, but let's try to prove him wrong.
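A sketch of the experiment with synthetic 2-D data: train on two classes, hold out the third, and flag low-confidence predictions as outliers. The geometry and threshold are invented; this also illustrates Justin's objection, since a discriminative classifier only assigns low confidence near its decision boundary, not everywhere the held-out class lives.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical layout: class 2 is held out and sits between the others.
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2),            # class 0
               rng.randn(100, 2) + [6, 0],   # class 1
               rng.randn(100, 2) + [3, 6]])  # class 2 (held out)
y = np.repeat([0, 1, 2], 100)

known = y < 2
clf = LogisticRegression().fit(X[known], y[known])

# Screen by confidence: flag anything the classifier is unsure about.
confidence = clf.predict_proba(X).max(axis=1)
flagged = confidence < 0.9  # threshold would need tuning
```

Held-out points near the boundary get flagged; held-out points that happen to fall deep inside one class's half-space do not, which is exactly the failure mode to measure.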
For example, the fraction of data used for training is a parameter that the user (in the Jupyter notebook) might wish to change, and they shouldn't have to go into the package code to change it.
Plot the balanced accuracy as the number of training points is increased.
Given a trained classifier, we can label all examples which are unlabelled in SDSS.
If our classifier were 100% accurate, we could use its predictions directly. Since it is not, we need a way to correct for the class proportions.
This may involve downloading the data in chunks.
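One standard way to correct predicted class proportions (an assumption on my part, not a method the project has committed to) is to estimate the confusion rates on labelled data and invert them: if C[i, j] is the probability of predicting class j for a true member of class i, the observed prediction proportions satisfy p_obs = C.T @ p_true.

```python
import numpy as np

# Confusion rates estimated from validation data (invented numbers).
C = np.array([[0.90, 0.05, 0.05],
              [0.10, 0.85, 0.05],
              [0.05, 0.10, 0.85]])

# Proportions of each class among the raw predictions (invented numbers).
p_obs = np.array([0.50, 0.30, 0.20])

# Invert p_obs = C.T @ p_true, then project back onto the simplex
# in case noise pushes an estimate below zero.
p_true = np.linalg.solve(C.T, p_obs)
p_true = np.clip(p_true, 0, None)
p_true /= p_true.sum()
```

With noisy confusion estimates, the plain inverse can be unstable; a constrained least-squares fit on the simplex is a more robust variant.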
One bug I found was that the Getting Started link from the README points to nbviewer, which points to alasdairtran's fork.
In collaboration with Christian Wolf of SkyMapper.
Last section of introduction
Using likelihood threshold, or probability threshold, display the heatmap. Interesting coordinates:
With classifiers that can do so efficiently, try training with varying amounts of data up to the full dataset of ~2.8 million points. The SVM with an RBF kernel continues to improve slowly at an approximately linear rate up to 100,000 points, which suggests that using more data could push us significantly closer to 100% accuracy.
The objects that are labelled are not uniformly distributed in the sky.
Consider the four active learning strategies from Schein and Ungar as four arms of a bandit problem. Some careful thought needs to be put into designing a good reward function.
It would be convenient if the rewards had structure that would allow kl-UCB or KL-UCB to be applied.
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos and Gilles Stoltz,
"Kullback–Leibler Upper Confidence Bounds for Optimal Sequential Allocation"
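For Bernoulli rewards, the kl-UCB index of an arm is the largest mean q whose KL divergence from the empirical mean stays within a log(t) budget, which can be found by bisection. A sketch (the log(t) exploration term is the basic choice; the paper discusses refinements):

```python
import math

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mean, pulls, t, precision=1e-6):
    """Largest q with pulls * KL(mean, q) <= log(t), via bisection."""
    bound = math.log(t) / pulls
    lo, hi = mean, 1.0
    while hi - lo > precision:
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= bound:
            lo = mid
        else:
            hi = mid
    return lo

# Example: an arm with empirical mean 0.4 after 10 pulls, at time step 100.
u = kl_ucb_index(0.4, 10, 100)
```

The catch for our setting is that active-learning rewards (e.g. accuracy gains) are not naturally Bernoulli, which is where the careful reward design comes in.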
There are two main measures to consider:
Instead of empirically evaluating the Fisher information, one can use the elemental matrices which exist for certain distributions. May be useful for logistic regression.
Elemental information matrices and optimal experimental design for generalized regression models, 2014
Anthony C. Atkinson, Valerii V. Fedorov, Agnes M. Herzberg, Rongmei Zhang
In particular, see if it can predict an unknown class, e.g., maybe when the predicted probabilities for all three classes are equally low.
Currently the datasets are hosted in my Dropbox account. Consider moving them to mldata.
Rename the column labels to something more meaningful.
Make the first column the class column.
Construct and compare features for each class (galaxies, stars, quasars).
To choose good regularisation and kernel parameters, it is best to use a double-loop (nested) cross-validation method.
https://github.com/chengsoonong/didbits/blob/master/CrossVal/expt_protocol.ipynb
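The double loop can be written directly with scikit-learn: a grid search supplies the inner loop over hyperparameters, and an outer cross-validation gives an unbiased performance estimate. A sketch on a stand-in dataset (iris here, not our photometric data):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Inner loop: choose C and gamma by 3-fold grid search.
param_grid = {'C': [0.1, 1, 10], 'gamma': [0.01, 0.1, 1]}
inner = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=3)

# Outer loop: 5-fold estimate of the tuned model's accuracy.
scores = cross_val_score(inner, X, y, cv=5)
```

Reporting the inner grid search's best score instead of the outer scores would be optimistically biased, which is the point of the double loop.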
For checking that the software package is working.
This can be done with a robust optimisation approach.
See chapter 12 of:
http://www2.isye.gatech.edu/~nemirovs/FullBookDec11.pdf
Plot log scale learning curves for:
http://www.inf-cv.uni-jena.de/Forschung/paperProjects/Kernel+Null+Space+Methods+for+Novelty+Detection.html
See survey:
From Theories to Queries: Active Learning in Practice
Burr Settles
Have a look at mydataflow.pdf in 0ddd874
There are two files, sdss.h5 and vstatlas.h5, which are written to and read from separate directories.
From the docs, it looks like only five classifiers in sklearn inherently support multiclass: Naive Bayes, LDA, Decision Trees, Random Forests and Nearest Neighbours.
Support vector machines use the one-vs-one strategy for multiclass classification. Other linear models, including logistic regression, use one-vs-rest in sklearn.
There are several algorithms proposed in:
T. Sanderson and C. Scott
Class Proportion Estimation with Application to Multiclass Anomaly Rejection
AISTATS 2014
which relate to classification when there is an unknown class.
We often want to elicit expert knowledge. One way to do this is to let the experts select specific examples, or a subset of the feature space, and then annotate those examples.
http://bokeh.pydata.org/en/latest/docs/server_gallery/stocks_server.html
Start with background on Thompson sampling.
In particular, describe:
In many of the notebooks, there is the statement:
from mclearn import *
This is generally considered bad practice. Lots of discussion on stackoverflow.
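The usual fix is an explicit import, so every name's provenance is visible. Illustrated here with a standard-library module rather than mclearn (so the snippet is self-contained):

```python
# Wildcard imports (`from module import *`) pollute the namespace and
# make it unclear where names come from. Prefer an explicit import:
import math

area = math.pi * 2 ** 2  # provenance of `pi` is obvious

# versus the discouraged form, where `pi` could come from anywhere:
# from math import *
# area = pi * 2 ** 2
```

For the notebooks, `import mclearn` (or `from mclearn import <specific names>`) would serve the same purpose.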
Look at sklearn cross validation.
The current data on the nicta filestore does not have extinction values.
between Section 1.2 and 1.3
Only interact with data through the API exposed by this class.