
mclass-sky's People

Contributors

alasdairtran, chengsoonong, davidjwu, nbgl, yen223

mclass-sky's Issues

kernel logistic regression

In the experiments comparing different kernels with SVM, it seems that the polynomial kernel of degree 3 is doing very well.

This means that we should also be using kernels in the active learning experiments with logistic regression. However, since sklearn's logistic regression relies on liblinear, we need to explicitly construct the basis functions for the degree-3 polynomial kernel.

The ‘multi_class’ option should be set to ‘multinomial’, since this gives better probability estimates than one-vs-rest; a minimal pipeline is sketched after the questions below.

Two questions:

  • Do the conclusions about active learning drawn from the linear features still hold?
  • Do the new colour index features still require a polynomial kernel, or do linear basis functions suffice?
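
A minimal sketch of the suggested setup, assuming a feature matrix X_train and labels y_train are already loaded (both names hypothetical). Note that the multinomial option requires a solver other than liblinear, such as lbfgs:

    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import PolynomialFeatures, StandardScaler

    # Explicit degree-3 polynomial basis functions, since liblinear has
    # no kernel support; 'multinomial' needs a non-liblinear solver.
    clf = make_pipeline(
        StandardScaler(),
        PolynomialFeatures(degree=3),
        LogisticRegression(multi_class='multinomial', solver='lbfgs'),
    )
    # clf.fit(X_train, y_train)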

Accuracy as a function of confidence bins

For logistic regression, sort the data by a confidence measure (e.g. the maximum class probability). Then group the data into bins and plot the accuracy for each bin. The goal is to understand the strange behaviour where logistic regression becomes less accurate for high-confidence predictions.
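
A sketch of the binning, assuming y_true, y_pred, and proba (the output of predict_proba) are available; all names are illustrative:

    import numpy as np

    def accuracy_by_confidence(proba, y_true, y_pred, n_bins=10):
        """Accuracy within equal-width bins of max class probability."""
        confidence = proba.max(axis=1)
        edges = np.linspace(0, 1, n_bins + 1)
        # digitize assigns each point to a bin; clip keeps 1.0 in the last bin
        idx = np.clip(np.digitize(confidence, edges) - 1, 0, n_bins - 1)
        correct = np.asarray(y_true) == np.asarray(y_pred)
        return np.array([correct[idx == b].mean() if (idx == b).any() else np.nan
                         for b in range(n_bins)])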

Derive a probabilistic approach

Fit a density to each class of objects, and flag a new object as unknown if it lies in a low-density region for all classes. We can start by assuming Gaussian densities.
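
A rough sketch under the Gaussian assumption, with a hypothetical log-density threshold:

    import numpy as np
    from scipy.stats import multivariate_normal

    def fit_class_gaussians(X, y):
        """Fit one Gaussian density per class."""
        return {c: multivariate_normal(X[y == c].mean(axis=0),
                                       np.cov(X[y == c], rowvar=False))
                for c in np.unique(y)}

    def is_unknown(models, X, log_density_threshold):
        """Flag objects lying in low-density regions for all classes."""
        log_dens = np.stack([m.logpdf(X) for m in models.values()], axis=1)
        return log_dens.max(axis=1) < log_density_threshold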

Consider adding unknown data to the SDSS dataset

Currently, we only include objects which fall into one of three categories (stars, quasars, and galaxies). We might want to check whether there are any outliers in this dataset, and how the classifiers behave when we add unknown data.

Leave one class out

With QDA and/or logistic regression, try holding out one of the three existing classes. Then see how well screening by confidence does at classifying the remaining classes while detecting "outliers" (i.e. members of the held-out class).

Justin is on record that this won't work that well with discriminative classifiers like logistic regression, but let's try to prove him wrong.
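
A possible shape for the experiment, with a hypothetical confidence threshold:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def leave_one_class_out(X, y, held_out, threshold=0.9):
        """Train on the remaining classes; flag low confidence as outliers."""
        known = (y != held_out)
        clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
        clf.fit(X[known], y[known])
        confidence = clf.predict_proba(X).max(axis=1)
        flagged = confidence < threshold
        # fraction of the held-out class correctly flagged as outliers
        outlier_recall = flagged[y == held_out].mean()
        return clf, outlier_recall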

Learning curve

Plot the balanced accuracy as the number of training points is increased.
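
sklearn's learning_curve can produce this directly; a sketch assuming clf, X, and y as in the earlier snippets (the 'balanced_accuracy' scorer name assumes sklearn >= 0.20):

    import numpy as np
    from sklearn.model_selection import learning_curve

    sizes, train_scores, test_scores = learning_curve(
        clf, X, y,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='balanced_accuracy', cv=5)
    # plot test_scores.mean(axis=1) against sizes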

Class proportions on SDSS

Given a trained classifier, we can label all examples which are unlabelled in SDSS.

If our classifier were 100% accurate, we could just use the predictions directly. Since it is not, we need to figure out a way to correct the class proportions.

  • Looking at the confusion matrix, are the class proportions of the predictions and of the labels the same?
  • What are the predicted class proportions of the whole SDSS database?

This may involve downloading the data in chunks.
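
One standard correction (a sketch, assuming the confusion matrix was estimated on held-out labelled data) inverts the row-normalised confusion matrix, since the predicted proportions q relate to the true proportions p via q = Cᵀp:

    import numpy as np

    def correct_proportions(conf_matrix, predicted_proportions):
        """Estimate true class proportions from predicted ones.

        conf_matrix[i, j] counts objects of true class i predicted as j.
        """
        C = conf_matrix / conf_matrix.sum(axis=1, keepdims=True)
        p = np.linalg.solve(C.T, predicted_proportions)
        p = np.clip(p, 0, None)      # guard against small negative solutions
        return p / p.sum()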

Plot uncertainty map

Using a likelihood threshold or a probability threshold, display the heatmap (a sketch follows the list). Interesting coordinate pairs:

  1. ra, dec
  2. two magnitudes
  3. the last two principal components.
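
A sketch for the first pair, assuming ra, dec, and proba arrays exist (names hypothetical); the same hexbin call works for any coordinate pair:

    import numpy as np
    import matplotlib.pyplot as plt

    uncertainty = 1 - proba.max(axis=1)   # low max probability = uncertain
    plt.hexbin(ra, dec, C=uncertainty, reduce_C_function=np.mean, gridsize=60)
    plt.colorbar(label='mean uncertainty')
    plt.xlabel('ra')
    plt.ylabel('dec')
    plt.show()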

Learning curves with all data

With classifiers that can do so efficiently, try training with increasing amounts of data, up to the full dataset of ~2.8 million points. The SVM with an RBF kernel continues to improve slowly, at an approximately linear rate, up to 100,000 points, which suggests that using more data could push us significantly closer to 100% accuracy.

Use bandits to choose between active learning strategies

Consider the four active learning strategies from Schein and Ungar as four arms of a bandit problem. Some careful thought needs to be put into designing a good reward function.

It would be convenient if the rewards had structure that would allow kl-UCB or KL-UCB to be applied.

Kullback–Leibler Upper Confidence Bounds for Optimal Sequential Allocation
Olivier Cappé, Aurélien Garivier, Odalric-Ambrym Maillard, Rémi Munos, and Gilles Stoltz
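
For rewards in [0, 1] treated as Bernoulli, the kl-UCB index of an arm can be computed by bisection; a sketch using the simple log t exploration rate:

    import numpy as np

    def bernoulli_kl(p, q, eps=1e-12):
        """KL divergence between Bernoulli(p) and Bernoulli(q)."""
        p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
        return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

    def klucb_index(p_hat, n_pulls, t, precision=1e-6):
        """Largest q with n_pulls * kl(p_hat, q) <= log(t), via bisection."""
        bound = np.log(t) / n_pulls
        lo, hi = p_hat, 1.0
        while hi - lo > precision:
            mid = (lo + hi) / 2
            if bernoulli_kl(p_hat, mid) <= bound:
                lo = mid
            else:
                hi = mid
        return lo
    # At each round, pull the arm with the largest klucb_index.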

Performance measures

There are two main measures to consider:

  • Plotting the accuracy rate as a function of ra and dec. The only class whose distribution changes over the coordinates is stars (due to the Milky Way band). Thus, if we see drift over the coordinates, either our calibration is not good, or some other factor we haven't taken into account causes the classes to change with the coordinates. In any case, this plot is still a good sanity check.
  • Plotting the entries of the confusion matrix as a function of ra, dec, and the features. In particular, we know that quasars and stars have a confusion region in the feature space.

Theoretical estimate of Fisher information

Instead of empirically evaluating the Fisher information, one can use the elemental matrices which exist for certain distributions. This may be useful for logistic regression, whose binary-case information matrix is sketched below.

Elemental information matrices and optimal experimental design for generalized regression models, 2014
Anthony C. Atkinson, Valerii V. Fedorov, Agnes M. Herzberg, Rongmei Zhang
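
As a reference point, the binary-case Fisher information of logistic regression has a closed form that needs no sampling of labels; a minimal sketch (the multinomial case is analogous but block-structured):

    import numpy as np

    def logistic_fisher_information(X, weights):
        """Fisher information X' diag(p(1-p)) X for binary logistic regression."""
        p = 1.0 / (1.0 + np.exp(-X @ weights))
        w = p * (1 - p)                      # per-point variance term
        return (X * w[:, None]).T @ X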

Sanity check with UCI data

For testing purposes, it is good to have some easy UCI datasets to check that everything is working. The small datasets also mean that the computations should not take too long.
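
For example, the wine dataset (a small three-class UCI dataset bundled with sklearn) exercises the same multinomial pipeline in seconds:

    from sklearn.datasets import load_wine
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_wine(return_X_y=True)
    clf = make_pipeline(StandardScaler(),
                        LogisticRegression(multi_class='multinomial',
                                           solver='lbfgs'))
    print(cross_val_score(clf, X, y, cv=5).mean())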

Play around with multinomial logit

In particular, see if it can predict an unknown class, e.g. when the predicted probabilities for all three classes are low. One simple rule is sketched below.
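
With a hypothetical 0.5 threshold (clf and X_test as in earlier sketches):

    import numpy as np

    proba = clf.predict_proba(X_test)
    # Map to -1 ('unknown') whenever no class is confident enough.
    pred = np.where(proba.max(axis=1) < 0.5, -1, proba.argmax(axis=1))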

Look at inherently multiclass classifiers

From the docs, it looks like only five classifiers in sklearn inherently support multiclass classification: Naive Bayes, LDA, decision trees, random forests, and nearest neighbours.

The support vector machine uses the one-vs-one strategy for multiclass classification, while other linear models, including logistic regression, default to one-vs-rest in sklearn.
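
sklearn also exposes these reductions as explicit meta-estimators, which makes it easy to swap strategies:

    from sklearn.linear_model import LogisticRegression
    from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
    from sklearn.svm import SVC

    ovo = OneVsOneClassifier(SVC())                  # SVM's default reduction
    ovr = OneVsRestClassifier(LogisticRegression())  # linear models' default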

Implement Sanderson and Scott

There are several algorithms proposed in:
T. Sanderson and C. Scott
Class Proportion Estimation with Application to Multiclass Anomaly Rejection
AISTATS 2014

which relate to classification when there is an unknown class.

Literature and background (Thompson sampling)

Start with background on Thompson sampling; a minimal Bernoulli sketch follows the reading list below.

In particular, describe:

  • An Empirical Evaluation of Thompson Sampling,
    Olivier Chapelle, Lihong Li
  • Thompson Sampling for Contextual Bandits with Linear Payoffs,
    Shipra Agrawal, Navin Goyal
  • Contextual Bandit for Active Learning: Active Thompson Sampling,
    Djallel Bouneffouf, Romain Laroche, Tanguy Urvoy, Raphael Féraud, Robin Allesiardo
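
As background, the Bernoulli-reward version of Thompson sampling is only a few lines; pull is a hypothetical environment callback returning 0 or 1:

    import numpy as np

    def thompson_sampling(pull, n_arms, n_rounds, seed=0):
        """Beta-Bernoulli Thompson sampling with uniform Beta(1, 1) priors."""
        rng = np.random.default_rng(seed)
        alpha = np.ones(n_arms)   # successes + 1
        beta = np.ones(n_arms)    # failures + 1
        for _ in range(n_rounds):
            arm = int(np.argmax(rng.beta(alpha, beta)))  # sample, then exploit
            reward = pull(arm)
            alpha[arm] += reward
            beta[arm] += 1 - reward
        return alpha, beta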

do not import *

In many of the notebooks, there is the statement:
    from mclearn import *

This is generally considered bad practice; there is plenty of discussion on Stack Overflow. Prefer importing the module itself, or only the specific names a notebook needs.
