ntucllab / libact
Pool-based active learning in Python
Home Page: http://libact.readthedocs.org/
License: BSD 2-Clause "Simplified" License
To reduce redundancy in batch-mode active learning, one suggestion is to incorporate diversity during training, as described in the following paper:
Incorporating diversity in active learning
The algorithm is spelled out in the paper and is not complicated to implement.
If the feature vectors are passed as an np.array, calling the format_sklearn() method returns a 3-dimensional array for the features.
The build fails on the RTD server because dependencies (numpy) won't build there. We should look for a workaround.
It seems the current interfaces could support multi-label problems without too many changes?
Possible algorithms to implement:
I think we should move the current get_dataset.py into a utility module like scikit-learn's datasets package:
http://scikit-learn.org/stable/datasets/
That way it would be easier to write examples with sphinx-gallery.
What do you think?
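For illustration, a sklearn-style loader for libact examples might look like the following; the function name and module layout are hypothetical, purely to show the convention being proposed:

```python
import numpy as np

# Hypothetical sketch of a sklearn-style dataset utility; the name
# load_toy_dataset is illustrative only, not libact's actual API.
def load_toy_dataset(n_samples=20, seed=1126):
    """Return (X, y), mirroring scikit-learn's load_* convention."""
    rng = np.random.RandomState(seed)
    X = rng.randn(n_samples, 2)
    # simple linear labeling rule, deterministic given the seed
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

X, y = load_toy_dataset()
```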
@iamyuanchung @hsuantien
https://github.com/ntucllab/libact/blob/master/libact/labelers/ideal_labeler.py#L28
This line may have to be changed to
return self.y[np.where([np.array_equal(x, feature) for x in self.X])[0][0]]
when using numpy 1.11.0b3
Maybe caused by this?
numpy/numpy#6155
Give example usage in each module's docstring.
setuptools supports more features, such as declaring dependencies; this also prepares for PyPI submission.
For QSs that rely on a user-given model, a type check should be performed, since different QSs require different capabilities (e.g. UncertaintySampling requires a ContinuousModel).
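A minimal sketch of such a check, using stand-in classes in place of libact's actual interfaces to keep the example self-contained:

```python
class Model:
    """Stand-in for libact's base Model interface."""

class ContinuousModel(Model):
    """Stand-in for libact's ContinuousModel (adds predict_real)."""

class UncertaintySampling:
    """Sketch: validate the model's capability at construction time."""
    def __init__(self, model):
        # Fail fast with a clear message instead of a confusing
        # AttributeError later, when predict_real is first needed.
        if not isinstance(model, ContinuousModel):
            raise TypeError(
                "UncertaintySampling requires a ContinuousModel, got %s"
                % type(model).__name__)
        self.model = model
```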
Since we use scikit-learn models a lot, we should define an adapter from scikit-learn models to libact models.
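A minimal sketch of what such an adapter could look like; the method names mirror libact's Model/ContinuousModel interface, but the Dataset argument is replaced by a plain (X, y) pair to keep the example self-contained:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class SklearnProbaAdapter:
    """Wrap a scikit-learn classifier behind a libact-style interface.

    Sketch only: libact's Model.train takes a Dataset; here a plain
    (X, y) tuple stands in for it.
    """
    def __init__(self, clf):
        self._clf = clf

    def train(self, dataset):
        X, y = dataset
        self._clf.fit(X, y)

    def predict(self, X):
        return self._clf.predict(X)

    def predict_real(self, X):
        # ContinuousModel-style real-valued output: class probabilities.
        return self._clf.predict_proba(X)

X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([0, 0, 1, 1])
model = SklearnProbaAdapter(LogisticRegression())
model.train((X, y))
probs = model.predict_real(X)   # shape (4, 2); each row sums to 1
```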
Python 3.5 seems to import everything before running unit tests; the _variance_reduction native extension is built and installed, but the import fails:
ImportError: Failed to import test module: libact.query_strategies
Traceback (most recent call last):
File "/opt/python/3.5.0/lib/python3.5/unittest/loader.py", line 462, in _find_test_path
package = self._get_module_from_name(name)
File "/opt/python/3.5.0/lib/python3.5/unittest/loader.py", line 369, in _get_module_from_name
__import__(name)
File "/home/travis/build/ntucllab/libact/libact/query_strategies/__init__.py", line 16, in <module>
from .variance_reduction import VarianceReduction
File "/home/travis/build/ntucllab/libact/libact/query_strategies/variance_reduction.py", line 11, in <module>
from libact.query_strategies import _variance_reduction
ImportError: cannot import name '_variance_reduction'
Build/install log of extension:
running build_ext
building 'libact.query_strategies._variance_reduction' extension
C compiler: gcc -pthread -Wno-unused-result -Wsign-compare -Wunreachable-code -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC
creating build/temp.linux-x86_64-3.5
creating build/temp.linux-x86_64-3.5/libact
creating build/temp.linux-x86_64-3.5/libact/query_strategies
compile options: '-I/home/travis/virtualenv/python3.5.0/lib/python3.5/site-packages/numpy/core/include -I/opt/python/3.5.0/include/python3.5m -c'
extra options: '-std=c11'
Warning: Can't read registry to find the necessary compiler setting
Make sure that Python modules winreg, win32api or win32con are installed.
gcc: libact/query_strategies/variance_reduction.c
gcc -pthread -shared -L/opt/python/3.5.0/lib -Wl,-rpath=/opt/python/3.5.0/lib build/temp.linux-x86_64-3.5/libact/query_strategies/variance_reduction.o -L/opt/python/3.5.0/lib -lpython3.5m -o build/lib.linux-x86_64-3.5/libact/query_strategies/_variance_reduction.cpython-35m-x86_64-linux-gnu.so -llapacke -llapack -lblas
running install_lib
creating /home/travis/virtualenv/python3.5.0/lib/python3.5/site-packages/libact
creating /home/travis/virtualenv/python3.5.0/lib/python3.5/site-packages/libact/query_strategies
copying build/lib.linux-x86_64-3.5/libact/query_strategies/_variance_reduction.cpython-35m-x86_64-linux-gnu.so -> /home/travis/virtualenv/python3.5.0/lib/python3.5/site-packages/libact/query_strategies
For now, the unit tests for active learning algorithms use results on real-world data with fixed random seeds. If a future modification to these algorithms conflicts with the current tests, it should be handled carefully.
The rigorous way to test is to design artificial datasets. We'll leave that as a future development goal.
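As an illustration of the artificial-dataset idea, the test below checks a property that holds by construction rather than a seeded numeric result; the uncertainty score here is a stand-in, not libact's actual implementation:

```python
import numpy as np

def most_uncertain(X, center0, center1):
    """Stand-in uncertainty score: a point is maximally ambiguous when
    it is equidistant from the two class centers."""
    d = np.abs(np.linalg.norm(X - center0, axis=1)
               - np.linalg.norm(X - center1, axis=1))
    return int(np.argmin(d))

# Dataset built so the answer is known by construction, not by seed:
# index 3 sits exactly halfway between the two centers.
X = np.array([[0., 0.], [0.2, 0.], [5., 0.], [2.5, 0.]])
idx = most_uncertain(X, np.array([0., 0.]), np.array([5., 0.]))
```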
Hello, thank you for providing this project.
After installing the dependencies, I run
python setup.py install
but I get some errors:
Platform Detection: Linux. Link to liblapacke...
running install
running build
running build_py
running build_ext
building 'libact.query_strategies._variance_reduction' extension
C compiler: x86_64-linux-gnu-gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC
compile options: '-I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/include/lapacke -I/usr/include/python2.7 -c'
extra options: '-std=c11'
x86_64-linux-gnu-gcc: libact/query_strategies/src/variance_reduction/variance_reduction.c
libact/query_strategies/src/variance_reduction/variance_reduction.c:26:15: error: variable ‘moduledef’ has initializer but incomplete type
static struct PyModuleDef moduledef = {
^
libact/query_strategies/src/variance_reduction/variance_reduction.c:27:5: error: ‘PyModuleDef_HEAD_INIT’ undeclared here (not in a function)
PyModuleDef_HEAD_INIT,
^
...
I wondered if I needed to specify the Python version, so I tried
python3 setup.py install
Still, I cannot install successfully, but the error changes:
File "setup.py", line 13, in
from Cython.Build import cythonize
ImportError: No module named 'Cython'
However, I have already installed Cython using "pip install Cython".
It would be very kind of you to tell me the required versions of the dependencies,
or how to modify the "-I/usr/include/lapacke -I/usr/include/python2.7" compile options.
Many thanks
Hi,
I am trying to use the HintSVM query strategy with the vehicle dataset from mldata.
However, I don't understand why I get the following error:
File "testing.py", line 60, in run
ask_id = qs.make_query()
File "/usr/local/lib/python3.5/site-packages/libact-0.1.2-py3.5-macosx-10.12-x86_64.egg/libact/query_strategies/hintsvm.py", line 151, in make_query
np.array([x.tolist() for x in unlabeled_pool]), self.svm_params)
File "libact/query_strategies/_hintsvm.pyx", line 16, in libact.query_strategies._hintsvm.hintsvm_query (libact/query_strategies/_hintsvm.c:1836)
ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'long'
I don't have this error when I use other strategies (UncertaintySampling, QUIRE).
def split_scale_train_test(name_dataset, test_size):
    # choose a dataset with unbalanced class instances
    # data = sklearn.datasets.fetch_mldata('segment')
    data = sklearn.datasets.fetch_mldata(name_dataset)
    X = StandardScaler().fit_transform(data['data'])
    target = np.unique(data['target'])
    # map the targets to 0 .. n_classes-1
    y = np.array([np.where(target == i)[0][0] for i in data['target']])
    X_trn, X_tst, y_trn, y_tst = \
        train_test_split(X, y, test_size=test_size, stratify=y)
    # make sure each class appears once initially
    init_y_ind = np.array(
        [np.where(y_trn == i)[0][0] for i in range(len(target))])
    y_ind = np.array([i for i in range(len(X_trn)) if i not in init_y_ind])
    trn_ds = Dataset(
        np.vstack((X_trn[init_y_ind], X_trn[y_ind])),
        np.concatenate((y_trn[init_y_ind], [None] * len(y_ind))))
    tst_ds = Dataset(X_tst, y_tst)
    fully_labeled_trn_ds = Dataset(
        np.vstack((X_trn[init_y_ind], X_trn[y_ind])),
        np.concatenate((y_trn[init_y_ind], y_trn[y_ind])))
    cost_matrix = 2000. * np.random.rand(len(target), len(target))
    np.fill_diagonal(cost_matrix, 0)
    return trn_ds, tst_ds, y_trn, y_tst, fully_labeled_trn_ds, cost_matrix
def run(trn_ds, tst_ds, lbr, model, qs, quota):
    E_in, E_out = [], []
    score_train = []
    score_test = []
    for _ in range(quota):
        ask_id = qs.make_query()
        X, _ = zip(*trn_ds.data)
        lb = lbr.label(X[ask_id])
        trn_ds.update(ask_id, lb)
        model.train(trn_ds)
        E_in = np.append(E_in, 1 - model.score(trn_ds))
        E_out = np.append(E_out, 1 - model.score(tst_ds))
        score_train = np.append(score_train, model.score(trn_ds) * 100)
        score_test = np.append(score_test, model.score(tst_ds) * 100)
    return E_in, E_out, score_train, score_test
qs5 = HintSVM(trn_ds5, cl=1.0, ch=1.0, p=0.5)
model = SVM(kernel='rbf', C=n_C, gamma=n_gamma, decision_function_shape='ovr')
E_in_5, E_out_5, score_train_5, score_test_5 = run(
    trn_ds5, tst_ds, idealLabels, model, qs5, quota_to_query)
results_out.append(E_out_5.tolist())
results_score.append(score_test_5.tolist())
Do you have any insights about this error?
thank you
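One plausible explanation (untested against this exact dataset): the vehicle features arrive as integers, while the _hintsvm Cython routine expects float64 buffers. Casting before building the Dataset may avoid the mismatch:

```python
import numpy as np

# Integer-valued features trigger "Buffer dtype mismatch, expected
# 'float64_t' but got 'long'" in Cython code declaring float64 buffers;
# casting them up front sidesteps the problem.
X_int = np.array([[1, 2], [3, 4]])        # dtype is a platform integer
X = np.asarray(X_int, dtype=np.float64)   # what _hintsvm expects
```

The float64 array would then be passed to Dataset in place of the raw integer features.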
Hello. The make_query() method fails at the following line, with q undefined:
ask_idx = np.random.choice(
np.arange(len(self.unlabeled_invert_id_idx)), size=1, p=q
)[0]
Could you please fix it?
Thanks!
Separate the changes out from the quire branch.
Hi,
Instead of unlabeled data that come as a stream, I would like to know if there is a way with libact to perform batch-mode active learning, meaning that users can select multiple images at once (positive and negative)?
Thank you in advance
Hello,
I am trying to install libact on my university's HPC facilities. However, I get the following error every time I try to install it:
error: Command "gcc -pthread -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/rmegret/irodriguez/anaconda3/envs/bee/lib/python3.6/site-packages/numpy/core/include -I/usr/include/lapacke -I/home/rmegret/irodriguez/anaconda3/envs/bee/include/python3.6m -c libact/query_strategies/src/variance_reduction/variance_reduction.c -o build/temp.linux-x86_64-3.6/libact/query_strategies/src/variance_reduction/variance_reduction.o -std=c11" failed with exit status 1
I have tried pip, and also cloning the repo and then using setup.py.
Just in case, here are the specifications of the HPC: https://www.hpcf.upr.edu/documentation/boqueron/#ffs-tabbed-15
Hello,
I have found that your library is not compatible with the Python packages plotly and cufflinks. I tested this on a fresh install of Ubuntu 16.04 with Anaconda installed.
Everything was fine until the installation of plotly and cufflinks:
pip install plotly --upgrade
pip install cufflinks --upgrade
Then running python setup.py test ends with this:
======================================================================
ERROR: query_strategies (unittest.loader._FailedTest)
----------------------------------------------------------------------
ImportError: Failed to import test module: query_strategies
Traceback (most recent call last):
File "/path/anaconda3/lib/python3.5/unittest/loader.py", line 153, in loadTestsFromName
module = __import__(module_name)
File "/path/libact/libact/query_strategies/__init__.py", line 20, in <module>
from ._variance_reduction import estVar
ImportError: /usr/lib/liblapacke.so.3: undefined symbol: dpotrf2_
The basic infrastructure for documentation generation has been established.
Please read the spec on how to write documentation in your code; we are currently using numpydoc:
https://github.com/numpy/numpy/blob/master/doc/HOWTO_DOCUMENT.rst.txt
There are also a lot of bugs in the Sphinx build waiting to be fixed:
https://readthedocs.org/projects/striatum/builds/4370706/
Some applications need the selection score for further processing.
In certain applications, you might want to know the top N unlabelled entities so that a human can go through and do batch labeling offline. Right now I have a particularly hacky way of getting multiple results out (just assuming the majority class in the update), but it would be great to tweak the make_query function to return an arbitrary number of ordered results for batch label processing.
for i in range(20):
    item_to_investigate = qs.make_query()
    libact_ds.update(item_to_investigate, 0)
    print(item_to_investigate)
Happy to contribute code to try to help this happen!
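A sketch of what an N-query extension could look like, assuming the strategy already computes a score per unlabeled entry; make_n_queries is a hypothetical name, not part of libact's API:

```python
import numpy as np

def make_n_queries(scores, unlabeled_ids, n=5):
    """Return the n entry ids with the highest scores, best first.

    Hypothetical API: libact's make_query returns a single id; this
    sketch assumes the strategy exposes one score per unlabeled entry.
    """
    order = np.argsort(scores)[::-1][:n]   # indices sorted by score, descending
    return [unlabeled_ids[i] for i in order]

ids = [10, 11, 12, 13]
scores = np.array([0.1, 0.9, 0.4, 0.7])
top = make_n_queries(scores, ids, n=2)   # → [11, 13]
```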
Perhaps write a faster implementation in C.
On Ubuntu, when installing libact with the command "pip install git+https://github.com/ntucllab/libact.git", I get the error:
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-Q9a2LI-build/
The current IdealLabeler seems to return a list of labels instead of a single label.
self.y[np.where([...])[0]]
should be
self.y[np.where([...])[0]][0]
or
self.y[np.where([...])[0][0]]
I would like to ask which classifiers are considered probabilistic, so that they can be combined with query strategies like Uncertainty Sampling?
Thanks in advance.
scikit-learn internally relabels the given labels to 0 .. n_labels-1; if I understand correctly, it does so in the order the data are passed to the fit method.
So if updating an unlabeled data point changes the order of the data passed to fit, the values from our model's predict_real method might end up in the wrong order.
One proposal for solving this problem is to manage the relabeling ourselves in the model classes.
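One way to see the issue and a possible fix: scikit-learn exposes its internal label order via the classes_ attribute after fit, so a wrapper can reorder predict_proba columns into a fixed label order regardless of the order data entered fit. A self-contained sketch:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def predict_real_ordered(clf, X, label_order):
    """Reorder predict_proba columns to a fixed label order, using the
    classes_ mapping scikit-learn stores after fit."""
    proba = clf.predict_proba(X)
    cols = [list(clf.classes_).index(lbl) for lbl in label_order]
    return proba[:, cols]

X = np.array([[0.], [1.], [5.], [6.]])
y = np.array([2, 2, 7, 7])               # arbitrary, non-contiguous labels
clf = LogisticRegression().fit(X, y)
p = predict_real_ordered(clf, X, label_order=[7, 2])
# column 0 now always corresponds to label 7, column 1 to label 2
```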
Is there a Jupyter notebook for learning how to use this library?
https://github.com/ntucllab/libact/blob/master/libact/query_strategies/quire.py#L66
and the gamma parameter should be part of the kernel.
Hi all,
I have installed the libact package on my Ubuntu OS, but for some reason I can't run the alce_plot.py and multilabel_plot.py examples. I keep getting a ModuleNotFoundError for the module 'libact.query_strategies.multilabel'.
Please help!
Regards
I tried to install libact using sudo pip install libact
and got the following error message:
libact/query_strategies/variance_reduction.c:26:15: error: variable ‘moduledef’ has initializer but incomplete type
You can see the full error message here.
I also tried installing with the setup.py script, which actually worked just fine; the python3 installation via pip also worked on the same machine.
I did some googling and the error looked similar to this one, but I can't look into it further because setup.py worked.
Just wanted to let you guys know.
The labeled pool may contain only a subset of all possible labels.
Currently Model.predict_real is connected to predict_proba in scikit-learn, which returns an array of n_classes floats standing for the probabilities of the corresponding labels. But decision_function is another candidate, whose return shape varies from model to model, for example (in our case n_samples = 1):
We have to decide what we want in order to well-define the interface. @hsuantien, can you give us some advice on this?
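To make the shape difference concrete, here is a small scikit-learn comparison for a binary problem (using n_samples = 1 as in our case):

```python
import numpy as np
from sklearn.svm import SVC

X = np.arange(20, dtype=float).reshape(-1, 1)
y = (X.ravel() >= 10).astype(int)        # binary problem, 10 per class

clf = SVC(probability=True).fit(X, y)
# predict_proba always returns (n_samples, n_classes);
# decision_function returns (n_samples,) for a binary SVC.
proba_shape = clf.predict_proba(X[:1]).shape      # (1, 2)
dec_shape = clf.decision_function(X[:1]).shape    # (1,)
```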