modal-python / modal
A modular active learning framework for Python
Home Page: https://modAL-python.github.io/
License: MIT License
Compared with random sampling, active learning does not seem to give significantly different results. I would have expected the active learning curve to be much higher. Are the defaults perhaps not well suited?
learner = ActiveLearner(
    estimator=RandomForestClassifier(random_state=1234),
    X_training=start_X,
    y_training=start_y
)
AttributeError: 'Committee' object has no attribute 'score' in the query-by-committee example.
Hi Tivadar,
This library is very interesting and easy to use. I wanted to know whether there is any way I can use active learning (especially modAL) to tag a set of different texts (organization names) as belonging to the same name. Think of it as clustering similar companies, except that there is no supervision with final values and no mapping of companies to a master one. We have huge amounts of data, and I'm thinking active learning could help remove the manual tagging process.
Could you help me on this?
I'm trying to use modAL in combination with tslearn to classify timeseries of different lengths.
tslearn supports variable-length time series by padding the shorter series with NaNs, but modAL calls
check_X_y(X, y, accept_sparse=True, ensure_2d=False, allow_nd=True, multi_output=True)
without setting force_all_finite='allow-nan'.
Is there a reason for not allowing NaNs, or did this use case just not come up before?
Thanks a lot!
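A minimal sketch of the difference the flag makes, calling scikit-learn's check_X_y directly rather than going through modAL. The toy array is an assumption for illustration. Recent scikit-learn releases renamed the keyword to ensure_all_finite, so the sketch picks whichever name the installed version exposes.

```python
import inspect

import numpy as np
from sklearn.utils import check_X_y

# Two series padded to length 3; the second is shorter, so it ends in NaN.
X = np.array([[[1.0], [2.0], [3.0]],
              [[4.0], [5.0], [np.nan]]])
y = np.array([0, 1])

# The keyword name depends on the installed scikit-learn version.
params = inspect.signature(check_X_y).parameters
finite_kw = "ensure_all_finite" if "ensure_all_finite" in params else "force_all_finite"

# The arguments quoted in the issue above.
common = dict(accept_sparse=True, ensure_2d=False, allow_nd=True, multi_output=True)


def validates(**extra):
    """Return True if check_X_y accepts X and y with the given extra options."""
    try:
        check_X_y(X, y, **common, **extra)
        return True
    except ValueError:
        return False


strict_ok = validates()                             # NaN is rejected by default
lenient_ok = validates(**{finite_kw: "allow-nan"})  # NaN is tolerated
print(strict_ok, lenient_ok)
```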
Hello,
I initialize my active learner with a saved, trained random forest classifier (loaded with pickle) together with its training samples, as you can see in the code below.
Does this affect the performance of the active learner?
The results I obtained are very bad, and I get better results with random selection using the same number of samples.
I would appreciate any feedback or advice!
Thank you in advance,
Model=pickle.load(open(OldModel, 'rb'))[0]
TrainDset0=pd.read_csv(OldTrainFile,sep=",")
X_train0=np.array(TrainDset0.loc[:,TrainDset0.loc[:,'band_0':'band_129'].columns.tolist()])
y_train0=np.array(TrainDset0.loc[:,str(ClassLabel)])
TrainDset2=pd.read_csv(NewTrainFile,sep=",")
X_train2=np.array(TrainDset2.loc[:,TrainDset2.loc[:,'band_0':'band_129'].columns.tolist()])
y_train2=np.array(TrainDset2.loc[:,str(ClassLabel)])
ValidationDset=pd.read_csv(NewValidationFile,sep=",")
X_validation=np.array(ValidationDset.loc[:,ValidationDset.loc[:,'band_0':'band_129'].columns.tolist()])
y_validation=np.array(ValidationDset.loc[:,str(ClassLabel)])
AdditionalSamples=10
MaxScore=0.9
estimator=deepcopy(Model)
Learner=ActiveLearner(estimator=estimator,query_strategy=entropy_sampling,X_training=X_train0,y_training=y_train0)
while Learner.score(X_validation,y_validation) < MaxScore:
    query_idx, query_inst = Learner.query(X_train2,n_instances=AdditionalSamples)
    Learner.teach(X=query_inst,y=y_train2[query_idx],only_new=False)
    X_train2=np.delete(X_train2,query_idx,axis=0)
    y_train2=np.delete(y_train2,query_idx)
with AL samples [0.13, 0.0, 0.7, 0.66, 0.60, 0.49, 0.56, 0.81,................... 0.56, 0.71]
with Random samples [0.13, 0.0, 0.60, 0.70, 0.72, 0.71, 0.85, 0.84,................... 0.87, 0.88]
This simple example:
from sklearn.linear_model import LogisticRegression
from modAL.models import ActiveLearner
X = pd.DataFrame([[1],[2],[3]])
y = pd.Series([True, False, False])
my_learner = ActiveLearner(estimator=LogisticRegression(), X_training=X, y_training=y)
df = pd.concat([X]*2000)
query_idx, _ = my_learner.query(df, n_instances=100)
yields:
KeyError: "None of [Int64Index([1665, 1662, 5412, 3399, 1758, 4866, 1755, 3402, 1752, 5415, 3405,\n 1749, 1746, 3408, 1743, 5418, 4863, 1740, 3411, 1737, 3414, 1734,\n 5421, 1731, 3417, 1728, 4860, 3420, 1725, 5424, 1722, 3423, 1719,\n 3426, 1716, 5427, 1713, 4857, 3429, 1710, 3432, 1707, 5430, 1704,\n 3435, 1701, 4854, 1698, 5433, 3438, 1695, 3441, 1692, 1689, 5436,\n 3444, 1686, 4851, 1683, 3447, 1680, 5439, 3450, 1677, 1674, 3453,\n 1671, 5442, 4848, 1668, 3456, 1764, 3459, 5469, 1587, 3492, 1608,\n 5463, 3495, 1605, 1602, 3498, 1599, 5466, 4833, 1596, 3501, 1593,\n 3504, 1590, 4836, 1575, 3513, 3519, 4827, 1569, 5475, 1572, 3516,\n 1614],\n dtype='int64')] are in the [columns]"
at:
/databricks/python/lib/python3.7/site-packages/modAL/uncertainty.py in uncertainty_sampling(classifier, X, n_instances, random_tie_break, **uncertainty_measure_kwargs)
157 query_idx = shuffled_argmax(uncertainty, n_instances=n_instances)
158
--> 159 return query_idx, X[query_idx]
It works fine with a smaller input, like:
...
query_idx, _ = my_learner.query(X, n_instances=1)
It seems like query_idx is an array for smaller input, but a different index representation, Int64Index, when the number of instances or input is large. And then that can't be used for indexing rows in X.
Is it possible that this needs to be X.iloc[query_idx]? I don't really know enough pandas to know for sure. Thanks!
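A minimal pandas-only sketch of the two indexing styles, with no modAL involved. The three positions are taken from the error message above. Indexing a DataFrame with an array of integers selects columns by label, which would explain the KeyError. With a single-column frame whose only column is labelled 0, X[[0]] happens to succeed because it selects that column, not row 0.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame([[1], [2], [3]])
pool = pd.concat([X] * 2000)              # 6000 rows; index labels repeat 0, 1, 2
query_idx = np.array([1665, 5412, 3399])  # positions a query strategy might return

try:
    pool[query_idx]                       # looks up *columns* labelled 1665, 5412, ...
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

rows = pool.iloc[query_idx]               # positional row selection
print(label_lookup_failed, rows.shape)
```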
My python env is 2.7
ImportError Traceback (most recent call last)
in ()
----> 1 from modAL.models import ActiveLearner
2 from sklearn.neighbors import KNeighborsClassifier
3
4 # initializing the active learner
5 learner = ActiveLearner(
/usr/local/lib/python2.7/dist-packages/modAL/__init__.py in ()
----> 1 from .models import ActiveLearner, Committee, CommitteeRegressor
2 from .uncertainty import classifier_uncertainty, classifier_margin, classifier_entropy,
3 uncertainty_sampling, margin_sampling, entropy_sampling
4 from .disagreement import vote_entropy, consensus_entropy, KL_max_disagreement,
5 vote_entropy_sampling, consensus_entropy_sampling, max_disagreement_sampling, max_std_sampling
/usr/local/lib/python2.7/dist-packages/modAL/models.py in ()
4
5 import numpy as np
----> 6 from abc import ABC, abstractmethod
7 from sklearn.utils import check_array
8 from sklearn.base import BaseEstimator
ImportError: cannot import name ABC
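The traceback shows models.py importing ABC from abc, a name that only exists on Python 3.4 and later, so the import cannot succeed on Python 2.7. Running modAL under Python 3 is the straightforward route. Purely as an illustration of the incompatibility, here is a sketch of how a base class can be declared so that it works on both interpreters. The Learner class is hypothetical and is not modAL code.

```python
import abc

# `abc.ABC` is missing on Python 2.7; build an equivalent base class from ABCMeta there.
ABC = getattr(abc, "ABC", None) or abc.ABCMeta("ABC", (object,), {})


class Learner(ABC):  # hypothetical class, for illustration only
    @abc.abstractmethod
    def query(self, X):
        raise NotImplementedError


try:
    Learner()                 # abstract classes cannot be instantiated
    instantiated = True
except TypeError:
    instantiated = False
print(instantiated)
```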
I am using Keras/TensorFlow models with this framework and the ActiveLearner class.
As soon as I try to change the query strategy, various errors occur.
learner = ActiveLearner(
    estimator=classifier,
    query_strategy=expected_error_reduction,
    X_training=x_initial_training,
    y_training=y_initial_training,
)
prescore = learner.score(x_test, y_test)
n_queries = 50
postscore = np.zeros(shape=(n_queries, 1))
for idx in range(n_queries):
    print('Query no. %d' % (idx + 1))
    query_idx, query_instance = learner.query(x_pool)
    learner.teach(
        X=x_pool[query_idx],
        y=y_pool[query_idx],
        only_new=True,
        epochs=10,
        validation_data=(x_val, y_val),
    )
    # remove queried instances from pool
    x_pool = np.delete(x_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)
    postscore[idx, 0] = learner.score(x_test, y_test)
What do I have to change to use the different strategies? The training input has a 3D shape.
So far I have tried all the uncertainty methods, of which only the default selection worked. Now I am trying the expected_error_reduction strategy, but errors occur there as well.
I am afraid the 3D shape of the training data breaks all the other algorithms, but an LSTM requires this kind of shape.
When I use modAL with Keras multi_gpu_model, querying fails with the following error:
Query no. 1
Traceback (most recent call last):
File "/home/es712/Documents/MingHan/pycode/test/ALtest.py", line 77, in
query_idx, query_instance = learner.query(X_pool, n_instances=100, verbose=0)
File "/home/es712/pythonenvs/tensorflow1.13.1/tensorflow1.13.1/lib/python3.6/site-packages/modAL/models/base.py", line 203, in query
query_result = self.query_strategy(self, *query_args, **query_kwargs)
File "/home/es712/pythonenvs/tensorflow1.13.1/tensorflow1.13.1/lib/python3.6/site-packages/modAL/uncertainty.py", line 152, in uncertainty_sampling
uncertainty = classifier_uncertainty(classifier, X, **uncertainty_measure_kwargs)
File "/home/es712/pythonenvs/tensorflow1.13.1/tensorflow1.13.1/lib/python3.6/site-packages/modAL/uncertainty.py", line 77, in classifier_uncertainty
classwise_uncertainty = classifier.predict_proba(X, **predict_proba_kwargs)
File "/home/es712/pythonenvs/tensorflow1.13.1/tensorflow1.13.1/lib/python3.6/site-packages/modAL/models/base.py", line 186, in predict_proba
return self.estimator.predict_proba(X, **predict_proba_kwargs)
File "/home/es712/pythonenvs/tensorflow1.13.1/tensorflow1.13.1/lib/python3.6/site-packages/tensorflow/python/keras/wrappers/scikit_learn.py", line 265, in predict_proba
probs = self.model.predict_proba(x, **kwargs)
AttributeError: 'Model' object has no attribute 'predict_proba'
import numpy as np
import os
import glob
from skimage import io,transform
import matplotlib.pyplot as plt
from copy import deepcopy
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from modAL.models import ActiveLearner,Committee
# dataset path
path = 'E:/data/datasets/flower_photos/'
# model save path
model_path = 'E:/data/model/model.ckpt'
# resize all images to 100*100
w = 100
h = 100
c = 3
# read the images
def read_img(path):
    cate = [path+x for x in os.listdir(path) if os.path.isdir(path+x)]
    imgs = []
    labels = []
    for idx,folder in enumerate(cate):
        for im in glob.glob(folder+'/*.jpg'):
            print('reading the images:%s'%(im))
            img = io.imread(im)
            img = transform.resize(img,(w,h))
            imgs.append(img)
            labels.append(idx)
    return np.asarray(imgs,np.float32),np.asarray(labels,np.int32)
data,label = read_img(path)
shape = np.shape(data)
# create the pool
X_pool = deepcopy(data)
y_pool = deepcopy(label)
# initialize the committee
n_members = 3
learner_list = list()
for member_idx in range(n_members):
    # initialize the training set
    n_initial = 100
    train_idx = np.random.choice(range(X_pool.shape[0]),size=n_initial,replace=False)
    X_train = X_pool[train_idx]
    y_train = y_pool[train_idx]
    # remove the training samples from the pool
    X_pool = np.delete(X_pool,train_idx,axis=0)
    y_pool = np.delete(y_pool,train_idx)
    # initialize the learner
    learner = ActiveLearner(estimator=RandomForestClassifier(),X_training=X_train,y_training=y_train)
    learner_list.append(learner)
Traceback (most recent call last):
File "E:/query_by_committee/query_by_committee.py", line 56, in
learner = ActiveLearner(estimator=RandomForestClassifier(),X_training=X_train,y_training=y_train)
File "D:\Anaconda3\lib\site-packages\modAL\models.py", line 104, in __init__
self.X_training = check_array(X_training)
File "D:\Anaconda3\lib\site-packages\sklearn\utils\validation.py", line 451, in check_array
% (array.ndim, estimator_name))
ValueError: Found array with dim 4. Estimator expected <= 2.
I am a novice, so the question may be a basic one; please excuse me.
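The traceback shows check_array rejecting a 4-dimensional array of shape (n_images, 100, 100, 3), and RandomForestClassifier itself expects 2D input. One common workaround, sketched here on a small placeholder array rather than the flower photos, is to flatten each image into a single feature vector before building the learner.

```python
import numpy as np

# Placeholder for the output of read_img: 5 images of 100 x 100 pixels, 3 channels.
images = np.zeros((5, 100, 100, 3), dtype=np.float32)

# Keep the sample axis and merge height, width and channels into one feature axis.
flat = images.reshape(len(images), -1)
print(flat.shape)  # (5, 30000)
```

Flattening discards the spatial layout, so this is only a way to get a 2D estimator running, not a recommendation for image models.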
I call model.summary() in the create_keras_model() function and set n_queries=10. I see the model summary printed at each iteration. Why does this happen?
Hello,
first of all I would like to thank you for sharing this great code.
I was wondering if it is possible to integrate an object detection method (SSD, YOLO, etc.) with modAL to label specific objects in images?
thanks a lot again.
Regards
Hello,
I noticed a bug in your expected_error.py. In line 70, the loss variable which is supposed to be log/binary is being replaced with a local variable. As a result after the first iteration the if statement does not execute and the loss remains constant. Just change the variable to nloss or something else and it should work.
Thank you,
For convenience, it would be nice if the Committee class also offered a score() method. This would make it easier to compare the performance of a single learner with a committee of learners.
Does modAL support 3D CNNs?
Dear modAL team,
I am trying to use the strategies of modAL outside of the ActiveLearner, and I would like to use the confidence scores, but they are not returned by the query sampling functions. For example, uncertainty_sampling returns the indices of the samples and the samples themselves, but not the scores associated with each of them.
Do you think that a kwarg such as return_scores=False (similar to return_proba for predictors in some estimators) that adds the scores to the returned tuple could be a good idea?
Thanks for your feedback.
I would like to use the ActiveLearner Class in combination with a keras model.
I followed the example code at the documentation and everything worked out fine.
However the model performance is really poor.
One major drawback I noticed is that there seems to be no way to change the number of epochs for each training step in the ActiveLearner loop.
Since you would normally set the number of epochs in the model.fit call, you can't do that in the current configuration.
I would be happy if you could give me a hint on how to accomplish that or you may take that issue as an inspiration for the next update.
Hi,
I am experiencing an issue when using modAL.disagreement.max_std_sampling in a custom query strategy.
When n_instances equals the full number of instances in X, the function doesn't return the indices and samples sorted; it returns them in their initial order. It looks like it works only when n_instances < X.shape[0].
sample_idx, sample_x = max_std_sampling(regressor, X, n_instances=X.shape[0])
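Until the behaviour is clarified, a full ranking can be computed explicitly inside the custom strategy. The sketch below uses plain NumPy on made-up standard deviations and makes no claim about how max_std_sampling is implemented internally.

```python
import numpy as np

# Made-up predictive standard deviations for four pool samples.
std = np.array([0.2, 0.9, 0.1, 0.5])

# Indices ordered from the largest to the smallest standard deviation.
order = np.argsort(-std)
print(order)  # [1 3 0 2]
```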
Hi,
I encountered an exception in the Committee class when trying to capture the score after teaching the committee with unseen data containing new classes.
The problem seems to be that the teach function does the following:
Step 2 should not depend on the estimators or happen after Step 3.
I am working on a PR to fix this.
hello,
I'm using too much memory when reading images, so I want to train with fit_generator() in Keras. Is it possible to use modAL with the fit_generator() method, or is there another way to yield batches for training?
Thank you!
Excellent work on this library. Love it. 😄
https://github.com/modAL-python/modAL/blob/master/modAL/models/learners.py#L363 <-- we are only allowed to use accuracy for score(). sklearn offers a variety of scoring functions for model evaluation (e.g., F1) that are more robust than accuracy: https://scikit-learn.org/stable/modules/model_evaluation.html. Maybe this can be implemented in the next release of modAL?
np.sum(generator) throws a DeprecationWarning; it should be replaced with np.sum(np.fromiter(generator, dtype=float)). Note that the function is np.fromiter, and that it requires an explicit dtype.
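A minimal sketch of the replacement on a toy generator. The float dtype is an assumption; the appropriate dtype depends on the values the generator yields.

```python
import numpy as np

values = (x * x for x in range(4))  # a generator yielding 0, 1, 4, 9

# np.fromiter consumes the generator into a 1-D array; the dtype must be given explicitly.
total = np.sum(np.fromiter(values, dtype=float))
print(total)  # 14.0
```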
Suppose the original dataset contains 100 labelled samples (pre-training data), and we try to train a model using a pool of 1000 unlabelled samples. Active learning picks 10 samples in each iteration.
Question: pre-training with the 100 samples gives model A. The AL strategy then selects 10 new samples. With the 100+10 samples, does modAL a) retrain model A on the 110 samples, or b) initialize a new model B and train it on the 110 samples?
Which is right?
In my opinion, a) is right; according to the code, that is what modAL does.
Could you please explain the differences between a) and b)? Which one is better?
Thanks!
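For scikit-learn estimators with default settings (no warm_start), calling fit again re-estimates the model from scratch, so a) and b) end up with the same parameters. The sketch below checks this with LogisticRegression on synthetic data; the data and the choice of estimator are assumptions for illustration. Models whose fit continues from the current weights, as Keras models do, would behave differently.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(110, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# a) "model A": fit on the first 100 samples, then fit the same object on all 110.
model_a = LogisticRegression()
model_a.fit(X[:100], y[:100])
model_a.fit(X, y)

# b) "model B": a brand-new estimator fitted once on all 110 samples.
model_b = LogisticRegression()
model_b.fit(X, y)

same_coefficients = np.allclose(model_a.coef_, model_b.coef_)
print(same_coefficients)
```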
Implement expected error and variance reduction from the Roy and McCallum paper.
HI
I opened this issue to discuss the implementation of the acquisition functions that you said in #48 you would like to add as a feature. I am interested in contributing. Where should they be implemented? In uncertainty.py?
Thanks for sharing the great code!
Lightgbm is a popular package, which supports numpy, pd.df as train/test data.
It would be great for modAL to support pd.df as train/pool/test data.
Thanks!
Does modAL provide a way to save and load a trained model?
Hi, I am new to programming. Can anyone tell me whether I can use this package for multi-dimensional problems as well,
for example the multi-dimensional Rosenbrock function?
If I can, could somebody tell me how to set the kernels for multiple dimensions?
Please.
After installing modAL by pip install modAL on Ubuntu 16.04 with a virtualenv python3.5, I tried to import modAL, but got the error message as titled. How can I solve this issue?
Active learning not only works in pool-based or stream-based setting, it can generate examples which can be queried for labels. This is called query synthesis. (See this paper for further details.) This should be implemented in modAL.
This code breaks:
def build_model():
    grey = Input(shape=(34,34,1), name="input_grey")
    # ...
    red = Input(shape=(34,34,1), name="input_red")
    # ...
    merged = concatenate([maxpoolg2, maxpoolr2])
    # ...
    softmax1 = Activation('softmax')(dense2)
    model = Model(inputs=[grey, red], outputs=[softmax1])
    model.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy'])
    return model
Xg_train = np.array(Xg_train).astype("float32") / 255.0
Xr_train = np.array(Xr_train).astype("float32") / 255.0
Xg_test = np.array(Xg_test).astype("float32") / 255.0
Xr_test = np.array(Xr_test).astype("float32") / 255.0
Xg_train = np.reshape(Xg_train, (len(Xg_train), 34, 34, 1))
Xr_train = np.reshape(Xr_train, (len(Xr_train), 34, 34, 1))
Xg_test = np.reshape(Xg_test, (len(Xg_test), 34, 34, 1))
Xr_test = np.reshape(Xr_test, (len(Xr_test), 34, 34, 1))
Y_train = to_categorical(Y_train)
Y_test = to_categorical(Y_test)
classifier = KerasClassifier(build_model())
n_initial = 1000
initial_idx = np.random.choice(range(len(Xg_train)), size=n_initial, replace=False)
Xg_train = Xg_train[initial_idx]
Xr_train = Xr_train[initial_idx]
Y_train = Y_train[initial_idx]
Xg_pool = np.delete(Xg_train, initial_idx, axis=0)
Xr_pool = np.delete(Xr_train, initial_idx, axis=0)
Y_pool = np.delete(Y_train, initial_idx, axis=0)
from modAL.models import ActiveLearner
learner = ActiveLearner(
    estimator=classifier,
    X_training=[Xg_train, Xr_train],
    y_training=Y_train,
    verbose=1
)
n_queries = 10
for idx in range(n_queries):
    query_idx, query_instance = learner.query([Xg_pool, Xr_pool], n_instances=200, verbose=1)
    learner.teach(X=[Xg_pool[query_idx], Xr_pool[query_idx]], y=Y_pool[query_idx],
        verbose=1
    )
    Xg_pool = np.delete(Xg_pool, query_idx, axis=0)
    Xr_pool = np.delete(Xr_pool, query_idx, axis=0)
    Y_pool = np.delete(Y_pool, query_idx, axis=0)
In the last line with this error:
File "/usr/local/lib/python3.5/dist-packages/modAL/models.py", line 43, in __init__
self._fit_to_known(bootstrap=bootstrap_init, **fit_kwargs)
File "/usr/local/lib/python3.5/dist-packages/modAL/models.py", line 93, in _fit_to_known
self.estimator.fit(self.X_training, self.y_training, **fit_kwargs)
File "/usr/local/lib/python3.5/dist-packages/keras/wrappers/scikit_learn.py", line 209, in fit
return super(KerasClassifier, self).fit(x, y, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/keras/wrappers/scikit_learn.py", line 138, in fit
**self.filter_sk_params(self.build_fn.__call__))
TypeError: __call__() missing 1 required positional argument: 'inputs'
I think this happens because of the multiple inputs but I'm not sure.
What else could be wrong?
The following code is called for computing the pairwise distances for every sample within the batch. This slows down the program significantly for larger batch sizes.
Lines 93 to 96 in 4029dfd
We can compute the pairwise distances once per batch within ranked_batch (outside the for loop), pass only the minimum-distance array to select_instance, and assign it directly to
Line 96 in 4029dfd
There is a significant reduction in running time with this change.
@cosmic-cortex - can I contribute this code change to this repo?
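The idea can be sketched independently of modAL: once a record is added to the batch, only the distances to that one record are new, so the minimum-distance array can be updated in place of recomputing all pairwise distances. The function names below are hypothetical and Euclidean distance is assumed; this is not the code from the referenced lines.

```python
import numpy as np


def min_dist_full(X_pool, X_labeled):
    """Minimum distance from each pool point to the labeled set, computed from scratch."""
    diffs = X_pool[:, None, :] - X_labeled[None, :, :]
    return np.linalg.norm(diffs, axis=2).min(axis=1)


def min_dist_update(min_dist, X_pool, new_point):
    """Update the minimum distances after a single point joins the labeled set."""
    d_new = np.linalg.norm(X_pool - new_point, axis=1)
    return np.minimum(min_dist, d_new)


rng = np.random.default_rng(0)
X_labeled = rng.random((5, 3))
X_pool = rng.random((20, 3))

min_dist = min_dist_full(X_pool, X_labeled)  # computed once per batch
picked = X_pool[3]                           # pretend this record was just selected
min_dist = min_dist_update(min_dist, X_pool, picked)
```

The full computation costs one distance per (pool, labeled) pair, while each update costs one distance per pool point.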
I'm using the entropy sampling strategy to select samples for a RandomForest classification with 7 classes.
However, when I run my query with entropy sampling (I also tried uncertainty sampling), I get a different result every time.
The selected samples are never the same, even though I have not changed my input data.
Thank you in advance for your help.
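One likely source of variation is the random forest itself: without a fixed random_state, each fit builds different trees and therefore produces different class probabilities. The sketch below uses scikit-learn only, with synthetic data and a hand-written entropy ranking standing in for the query strategy; both are assumptions for illustration. The uncertainty_sampling signature shown in a traceback earlier on this page also lists a random_tie_break argument, which may be worth checking.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=8, n_informative=5,
                           n_classes=3, random_state=0)
X_train, y_train, X_pool = X[:50], y[:50], X[50:]


def top_entropy(seed, n=5):
    """Indices of the n pool samples with the highest predictive entropy."""
    forest = RandomForestClassifier(n_estimators=20, random_state=seed)
    forest.fit(X_train, y_train)
    proba = forest.predict_proba(X_pool)
    scores = -(proba * np.log(proba + 1e-12)).sum(axis=1)
    return np.argsort(-scores, kind="stable")[:n]


# With the same seed, two independent runs select the same samples.
first_run = top_entropy(seed=0)
second_run = top_entropy(seed=0)
print(np.array_equal(first_run, second_run))
```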
All of my relevant code:
#!/usr/bin/env python3.5
from data_generator import data_generator as dg
# standard imports
from keras.models import load_model
from keras.utils import to_categorical
from keras.wrappers.scikit_learn import KerasClassifier
from os import listdir
import pandas as pd
import numpy as np
from modAL.models import ActiveLearner
######## NEW STUFF ########
# get filenames and folder names
data_location = './sensor_preprocessed_dataset/flow_rates_pressures/'
subfolders = ['true','false']
###########################
classifier = KerasClassifier(load_model('./0.7917.h5'))
(X_train, y_train), (X_test, y_test) = dg.load_data_for_model(data_location, subfolders)
WINDOW_SIZE = X_train[0].shape[0]
CHANNELS = X_train[0].shape[1]
# reshape and retype the data for the classifier
X_train = X_train.reshape(X_train.shape[0], WINDOW_SIZE, CHANNELS, 1)
X_test = X_test.reshape(X_test.shape[0], WINDOW_SIZE, CHANNELS, 1)
y_train = to_categorical(y_train)
y_test = to_categorical(y_test)
# assemble initial data
n_initial = 30
initial_idx = np.random.choice(range(len(X_train)), size=n_initial, replace=False)
X_initial = X_train[initial_idx]
y_initial = y_train[initial_idx]
learner = ActiveLearner(
    estimator=classifier,
    X_training=X_train,
    y_training=y_train,
    verbose=1
)
X_pool = X_test
y_pool = y_test
n_queries = 10
for idx in range(n_queries):
    print('Query no. %d' % (idx + 1))
    query_idx, query_instance = learner.query(X_pool, n_instances=100, verbose=0)
    learner.teach(
        X=X_pool[query_idx], y=y_pool[query_idx], only_new=True,
        verbose=1
    )
    X_pool = np.delete(X_pool, query_idx, axis=0)
    y_pool = np.delete(y_pool, query_idx, axis=0)
Messages, Warnings, and Errors:
Using TensorFlow backend.
WARNING:tensorflow:From /home/jazz/.local/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
2020-01-24 10:03:54.427147: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-01-24 10:03:54.447927: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2712000000 Hz
2020-01-24 10:03:54.448529: I tensorflow/compiler/xla/service/service.cc:150] XLA service 0x4c01c00 executing computations on platform Host. Devices:
2020-01-24 10:03:54.448599: I tensorflow/compiler/xla/service/service.cc:158] StreamExecutor device (0): <undefined>, <undefined>
WARNING:tensorflow:From /home/jazz/.local/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
WARNING:tensorflow:From /home/jazz/.local/lib/python3.5/site-packages/tensorflow/python/ops/math_grad.py:102: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
Traceback (most recent call last):
File "./classifier.py", line 45, in <module>
y_training=y_train
File "/home/jazz/.local/lib/python3.5/site-packages/modAL/models/learners.py", line 79, in __init__
X_training, y_training, bootstrap_init, **fit_kwargs)
File "/home/jazz/.local/lib/python3.5/site-packages/modAL/models/base.py", line 63, in __init__
self._fit_to_known(bootstrap=bootstrap_init, **fit_kwargs)
File "/home/jazz/.local/lib/python3.5/site-packages/modAL/models/base.py", line 106, in _fit_to_known
self.estimator.fit(self.X_training, self.y_training, **fit_kwargs)
File "/home/jazz/.local/lib/python3.5/site-packages/keras/wrappers/scikit_learn.py", line 210, in fit
return super(KerasClassifier, self).fit(x, y, **kwargs)
File "/home/jazz/.local/lib/python3.5/site-packages/keras/wrappers/scikit_learn.py", line 139, in fit
**self.filter_sk_params(self.build_fn.__call__))
TypeError: __call__() missing 1 required positional argument: 'inputs'
I honestly don't even know where to begin to solve this. My code is based on your example here: https://modal-python.readthedocs.io/en/latest/content/examples/Keras_integration.html
And I've read the docs here: https://modal-python.readthedocs.io/en/latest/content/apireference/models.html
Any input is appreciated.
Implement the hierarchical sampling algorithm from Dasgupta and Hsu paper.
I see ActiveLearner expects a 2D array for X_training. What if I want to train a Keras CNN (ResNet, Inception, etc.) on CIFAR-10 with inputs of shape (32, 32, 3), for example? Is that possible?
Hello!
I am trying to use modAL with a sklearn pipeline described here.
So, the X_training shape is (n_samples,) rather than (n_samples, n_features).
Learner creation works well, but after a successful query I could not pass query_inst to learner.teach(), because it internally calls np.vstack((X_seed, query_inst)).
Why not use np.concatenate((X_seed, query_inst)) here, in the same way as it is used for the labels?
Also, I expected that only_new=True would solve this, but it does not.
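A small NumPy-only sketch of why the two calls differ for 1D inputs of shape (n_samples,). The text samples are made up. np.vstack first promotes each 1D array to a single row, so arrays of different lengths cannot be stacked, while np.concatenate joins them along the sample axis.

```python
import numpy as np

X_seed = np.array(["first text", "second text"], dtype=object)  # shape (2,)
query_inst = np.array(["third text"], dtype=object)             # shape (1,)

try:
    np.vstack((X_seed, query_inst))  # rows of shape (1, 2) and (1, 1) do not match
    vstack_ok = True
except ValueError:
    vstack_ok = False

stacked = np.concatenate((X_seed, query_inst))  # shape (3,)
print(vstack_ok, stacked.shape)
```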
I have a sklearn pipeline that accepts a custom data type as input, but when I use that pipeline and teach the learner, I get the following error:
TypeError: float() argument must be a string or a number, not 'MyClass'
I traced the problem back to the check_X_y function used in BaseLearner. I added dtype=None so that it preserves the input type instead of trying to convert it to a numeric type; with that change it no longer throws any errors and works as expected.
I think that should be the default behaviour, instead of modAL trying to convert our data types for us.
Thank you for your great work!
I ran the script examples/pytorch_integration.py but got the following error: TypeError: <class 'torch.Tensor'> datatype is not supported.
According to the traceback, this error is raised by the line learner.teach(X_pool[query_idx], y_pool[query_idx], only_new=True). I tried PyTorch 1.1.0 and 0.4.1, but the same error occurred.
I'm using Windows 10 and Python 3.6. Could you please give me some advice on how to figure it out? Thanks in advance.
Hi!
The behavior of cold start handling in ranked batch sampling seems different from that described in "Ranked batch-mode active learning" by Cardoso et al.
Lines 133 to 139 in 452898f
In modAL's implementation, in the case of cold start, the instance selected by select_cold_start_instance is not added to the instance list instance_index_ranking.
While in "Ranked batch-mode active learning", the instance selected by select_cold_start_instance seems to be the first item in instance_index_ranking.
Line 46 in 452898f
If my understanding of the algorithm proposed in the paper and of modAL's implementation is correct, we can change the return of select_cold_start_instance to
return best_coldstart_instance_index, X[best_coldstart_instance_index].reshape(1, -1),
store best_coldstart_instance_index in instance_index_ranking, and revise ranked_batch correspondingly.
Hi Tivadar,
Thank you for this solid project.
I am running pool_based_sampling.py on my own dataset. The dataset has image features and their respective labels in a numpy array.
However, I am getting an error from this line of code:
learner.teach(
    X=new_test[query_idx].reshape(1, -1),
    y=dummy_test[query_idx].reshape(1,)
)
Here is the error:
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
File "c:\users\nainasaid\appdata\local\programs\python\python36\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 668, in runfile
execfile(filename, namespace)
File "c:\users\nainasaid\appdata\local\programs\python\python36\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "C:/Users/NainaSaid/Downloads/Active Learning Codes/modAL-master/examples/pool-based_sampling.py", line 118, in
y=dummy_test[query_idx].reshape(1,)
File "c:\users\nainasaid\appdata\local\programs\python\python36\lib\site-packages\modAL\models\learners.py", line 95, in teach
self._add_training_data(X, y)
File "c:\users\nainasaid\appdata\local\programs\python\python36\lib\site-packages\modAL\models\base.py", line 81, in _add_training_data
self.X_training = data_vstack((self.X_training, X))
File "c:\users\nainasaid\appdata\local\programs\python\python36\lib\site-packages\modAL\utils\data.py", line 28, in data_vstack
raise TypeError('%s datatype is not supported' % type(blocks[0]))
TypeError: <class 'pandas.core.frame.DataFrame'> datatype is not supported
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
I don't know why I am getting this error, even though the "dummy_test" variable is a numpy array and not a pandas DataFrame. Can somebody please help?
The current implementation of the information_density function uses 1/(1+d) to convert a distance (d) into a measure of similarity. However, at least for cosine, this is not how similarity and distance are related.
Is this because you're treating similarity as an ordinal value? That is, as long as the ranking of instances doesn't change, it doesn't matter how we convert distance to similarity, and we will always get the same argmax (in Settles, eq. 5.1) when choosing which point to query?
Hi there
I'm trying to "replicate" the example you have for Active regression with Gaussian processes for 2d input data.
This is the code so far (based on what you provided in the example):
import numpy as np
import matplotlib.pyplot as plt
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import WhiteKernel, RBF
from modAL.models import ActiveLearner
# query strategy for regression
def GP_regression_std(regressor, X):
    _, std = regressor.predict(X, return_std=True)
    query_idx = np.argmax(std)
    return query_idx, X[query_idx]
# generating the data
num_dim, num_data = 2, 1000
data = np.random.rand(num_data, num_dim)
x_data = data[:,0]
y_data = data[:,1]
label = np.sin(np.sqrt(x_data ** 2 + y_data **2))
# assembling initial training set
n_initial = 50
initial_idx = np.random.choice(range(len(data)), size=n_initial, replace=False)
X_initial, y_initial = data[initial_idx], label[initial_idx]
# defining the kernel for the Gaussian process
kernel = RBF(length_scale=1.0, length_scale_bounds=(1e-2, 1e3)) \
         + WhiteKernel(noise_level=1, noise_level_bounds=(1e-10, 1e+1))
# initializing the active learner
regressor = ActiveLearner(
    estimator=GaussianProcessRegressor(kernel=kernel),
    query_strategy=GP_regression_std,
    X_training=X_initial, y_training=y_initial
)
# active learning
n_queries = 100
for idx in range(n_queries):
    query_idx, query_instance = regressor.query(data)
    regressor.teach(data[query_idx], label[query_idx].reshape(1, -1))
And this is the error I get:
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
~/Documents/tmc/python/active_learning/gpr2d.py in <module>()
42 for idx in range(n_queries):
43 query_idx, query_instance = regressor.query(data)
---> 44 regressor.teach(data[query_idx], label[query_idx].reshape(1, -1))
/usr/local/lib/python3.7/site-packages/modAL/models.py in teach(self, X, y, bootstrap, only_new, **fit_kwargs)
352 Keyword arguments to be passed to the fit method of the predictor.
353 """
--> 354 self._add_training_data(X, y)
355 if not only_new:
356 self._fit_to_known(bootstrap=bootstrap, **fit_kwargs)
/usr/local/lib/python3.7/site-packages/modAL/models.py in _add_training_data(self, X, y)
63 classifier has seen.
64 """
---> 65 assert len(X) == len(y), 'the number of new data points and number of labels must match'
66
67 if type(self.X_training) != type(None):
AssertionError: the number of new data points and number of labels must match
I understand the issue, but am not sure how to pass multidimensional data. I suspect the solution will be simple.
Thanks in advance for your time and attention!
Cheers
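The assertion compares len(X) with len(y). Because np.argmax returns a scalar, data[query_idx] is a 1D array of shape (2,) whose length is the number of features, not the number of samples. A NumPy-only sketch of the shapes involved follows; the index 7 is a stand-in for the value the query strategy would return, and the reshapes in the second half are one possible fix, not a confirmed one.

```python
import numpy as np

data = np.random.rand(1000, 2)
label = np.sin(np.sqrt(data[:, 0] ** 2 + data[:, 1] ** 2))
query_idx = 7  # stand-in for np.argmax(std), which is a scalar

# As in the issue: X has length 2 (the two features), y has length 1.
bad_X = data[query_idx]                   # shape (2,)
bad_y = label[query_idx].reshape(1, -1)   # shape (1, 1)

# Treat the queried point as one sample with two features, and y as one label.
good_X = data[query_idx].reshape(1, -1)   # shape (1, 2)
good_y = label[query_idx].reshape(1,)     # shape (1,)

print(len(bad_X), len(bad_y), len(good_X), len(good_y))  # 2 1 1 1
```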
Hi,
I've run into a bit of a use case that I'm not sure is quite supported by modAL, nor by the broader libraries for active learning, but that would be relatively simple to implement. After reviewing modAL's internals a bit, I don't think it officially supports active learning with batch-mode queries.
The sampling strategies (for example, uncertainty sampling) do support the n_instances parameter, but from what I can tell, uncertainty sampling may return redundant/sub-optimal queries if we return more than one instance from the unlabeled set. This is a bit prohibitive in settings where we'd like to ask an active learner to return multiple (if not all) examples from the unlabeled set/pool, and the computational cost for re-training an active learning model goes without saying.
I found requests for batch-mode support in the popular libact library (issues #57 and #89) but, to the best of my knowledge, I'm not sure they were addressed in any of their PRs.
In that case, does it make sense to implement something like Ranked batch-mode active learning by Cardoso et al.? I took a crack at it this weekend for a better personal understanding, but if it's worth integrating and supporting in modAL, I'm happy to polish it and talk it through in a PR.
Thanks!
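To make the idea concrete, here is a simplified greedy sketch of batch selection that trades off uncertainty against distance to points already labeled or already chosen. It is written in the spirit of ranked batch-mode sampling but is not a faithful implementation of Cardoso et al.; the function name `ranked_batch`, the weighting `alpha`, and the similarity `1 / (1 + distance)` are all assumptions made for illustration.

```python
import math


def ranked_batch(pool, labeled, uncertainty, n_instances):
    """Greedily pick pool indices for one batch (hypothetical helper).

    pool, labeled: lists of feature tuples.
    uncertainty: one score in [0, 1] per pool point.
    """
    reference = list(labeled)            # points the batch should differ from
    remaining = list(range(len(pool)))
    chosen = []
    for _ in range(min(n_instances, len(pool))):
        # Assumed weighting: favor diversity while few points are labeled.
        alpha = len(remaining) / (len(remaining) + len(reference))

        def score(i):
            if reference:
                nearest = min(math.dist(pool[i], r) for r in reference)
                similarity = 1.0 / (1.0 + nearest)
            else:
                similarity = 0.0
            return alpha * (1.0 - similarity) + (1.0 - alpha) * uncertainty[i]

        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
        reference.append(pool[best])     # later picks must differ from this one
    return chosen


# Two near-duplicate, highly uncertain points and one distant, less uncertain one.
pool = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
labeled = [(0.0, 0.1)]
uncertainty = [0.9, 0.9, 0.5]
batch = ranked_batch(pool, labeled, uncertainty, n_instances=2)
print(batch)
```

Plain top-2 uncertainty sampling would return the two near-duplicates here, whereas this sketch picks the distant point first and only one of the duplicates.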
It seems that each time we call learner.teach, the model is refit from scratch on the initial data plus the new data, just like an untrained model. Can the model instead learn only the new data, starting from the weights already trained on the initial data?
Hi Tivadar,
First: this is a really solid project — thank you for your contributions!
I noticed that the examples that accompany this repository are functionally sufficient, but difficult for someone to skim/follow along without running locally. Do you think it'd be worthwhile to add examples that are in a Jupyter notebook format? If so, would you mind if I took a crack at porting over one of the current scripts into a notebook (with inline plots, comments, etc.) this weekend? Thanks!
Danny
Hi
I am a research assistant and I have been working on deep Bayesian active learning for the past few weeks. I have been using PyTorch and custom active learning classes so far. I just found out about modAL and it seems very cool, which is why I was wondering whether it could be extended to PyTorch models. I would be glad to contribute.
More specifically, I am using dropout-based Bayesian neural networks and Monte Carlo sampling to compute the predictive variance. I am quite new to active learning, but I believe deep Bayesian active learning is very close to query by committee, in the sense that for every x in the unlabeled pool there are N feedforward passes of x through a committee of N networks sampled from the posterior distribution over the weights of the Bayesian network.
I also experimented with some query strategies for classifiers mentioned in the active learning survey by Burr Settles that I think are not implemented in modAL yet. I would be glad to contribute on this side too. I am thinking about the Gini index of the votes, the Gini index of the consensus, least confident vote, and least confident consensus. (In my experiments they perform as well as vote entropy and consensus entropy.)
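As a rough illustration of the vote-based measures named above, the sketch below computes vote entropy, a Gini index of the votes, and a least-confident-vote score from one instance's committee votes. The definitions used (Gini as `1 - sum(p**2)`, least confident as `1 - max(p)`) are one plausible reading of those names, not formulas taken from modAL or confirmed by the poster, and the function names are hypothetical.

```python
import math
from collections import Counter


def vote_distribution(votes):
    """Fraction of committee members voting for each class."""
    counts = Counter(votes)
    return [c / len(votes) for c in counts.values()]


def vote_entropy(votes):
    """Shannon entropy of the vote distribution (natural log)."""
    return -sum(p * math.log(p) for p in vote_distribution(votes))


def vote_gini(votes):
    """Assumed definition: Gini impurity of the vote distribution."""
    return 1.0 - sum(p * p for p in vote_distribution(votes))


def least_confident_vote(votes):
    """Assumed definition: one minus the share of the most-voted class."""
    return 1.0 - max(vote_distribution(votes))


# Votes of a four-member committee on two hypothetical instances.
unanimous = ["a", "a", "a", "a"]
split = ["a", "a", "b", "b"]

for name, votes in [("unanimous", unanimous), ("split", split)]:
    print(name, vote_entropy(votes), vote_gini(votes), least_confident_vote(votes))
```

All three scores are zero when the committee agrees and grow as the votes spread out, so any of them could rank pool instances by disagreement in the same way.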
There are a lot of examples of using active learning for classification, but for regression there is only one example, and it uses a single predictor variable. Can anyone guide me on working with multiple predictors?