awslabs / datawig
Imputation of missing values in tables.
License: Apache License 2.0
It appears that the logic for image-based imputation is not as well tested or polished as text-based imputation. Customers may therefore wrongly assume that both are of the same quality (e.g. that using text + image is better than using text alone).
I'd like to move the code from master to a feature branch, where development on image-based imputation will continue until it is of comparable quality to text-based imputation.
Any thoughts?
When I ran simpleimputer_intro.py from the examples, the following error occurred:
Traceback (most recent call last):
File "/Users/chen/PycharmProjects/test2/datamissing/examples/simpleimputer_intro.py", line 41, in <module>
predictions = imputer.predict(df_test)
File "/usr/local/lib/python3.7/site-packages/datawig/simple_imputer.py", line 420, in predict
score_suffix, inplace=inplace)
File "/usr/local/lib/python3.7/site-packages/datawig/imputer.py", line 822, in predict
if data_frame.columns.contains(imputation_col):
AttributeError: 'Index' object has no attribute 'contains'
It could be a data processing error in the predict function.
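The failing call is `data_frame.columns.contains(imputation_col)`. `Index.contains` was deprecated in pandas 0.25 and removed in pandas 1.0, so this is most likely a pandas version mismatch rather than a data problem. Below is a minimal sketch of a membership check that does not depend on the pandas version; the helper name `has_column` is hypothetical and not part of datawig.

```python
import pandas as pd


def has_column(data_frame: pd.DataFrame, column_name: str) -> bool:
    """Return True if `column_name` is one of the frame's columns.

    Uses the `in` operator, which pandas.Index supports in every release,
    instead of the removed `Index.contains` method.
    """
    return column_name in data_frame.columns


df = pd.DataFrame({"title": ["a", "b"], "label": ["x", "y"]})
print(has_column(df, "label"))          # True
print(has_column(df, "label_imputed"))  # False
```

Installing the pandas version pinned in datawig's requirements should avoid the error as well.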
At the moment, all metrics we evaluate automatically are measured with respect to the test data (except for the log likelihood). To analyse model performance, in particular bias and variance, training accuracy is crucial.
We should add this.
The whole metrics computation, however, is cluttered and complicated, and I wonder whether we should revisit it more fundamentally.
Hi, I am using Jupyter Notebook (through Anaconda) and SimpleImputer.complete. I am running Anaconda as an administrator.
The error arises at shutil.rmtree(output_col), and the stack trace eventually calls os.unlink(fullname) in _rmtree_unsafe.
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'SOME_PATH\\imputer.log'
I cannot delete imputer.log manually until I restart the notebook; after the restart, deleting it works.
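On Windows, a file cannot be deleted while a process still holds an open handle to it. One plausible cause (an assumption, not confirmed in this report) is a `logging.FileHandler` that still writes to `imputer.log` when `shutil.rmtree` runs. The sketch below closes and detaches such handlers before the directory is removed. The helper `release_log_handlers` and the logger name `demo_imputer` are hypothetical; in practice you would pass whichever logger actually owns the handler.

```python
import logging
import os
import shutil
import tempfile


def release_log_handlers(logger: logging.Logger, directory: str) -> int:
    """Close and detach every FileHandler of `logger` that writes inside `directory`."""
    prefix = os.path.abspath(directory) + os.sep
    released = 0
    for handler in list(logger.handlers):
        if isinstance(handler, logging.FileHandler):
            if os.path.abspath(handler.baseFilename).startswith(prefix):
                handler.close()
                logger.removeHandler(handler)
                released += 1
    return released


# Demonstration with a throwaway directory and logger.
model_dir = tempfile.mkdtemp()
demo_logger = logging.getLogger("demo_imputer")
demo_logger.addHandler(logging.FileHandler(os.path.join(model_dir, "imputer.log")))
demo_logger.warning("training finished")

released_count = release_log_handlers(demo_logger, model_dir)
shutil.rmtree(model_dir)  # no handler keeps the log file open any more
```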
Unable to run datawig code in google colab
I confirmed that I have the required dependency versions.
scikit-learn[alldeps]==0.22.1
typing==3.6.6
pandas==0.25.3
mxnet==1.4.0
I successfully installed Datawig and the quickstart example works just fine.
`import pandas as pd
import datawig
import numpy as np
df1 = datawig.utils.generate_df_numeric()
df1_with_missing = df1.mask(np.random.rand(*df1.shape) > .9)
df1_with_missing_imputed = datawig.SimpleImputer.complete(df1_with_missing)`
However, when I try to apply it to my dataset, I get the following error:
`df = pd.read_csv(abundance_file_path)
df_with_missing = df
df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)
/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/sklearn/utils/extmath.py:765: RuntimeWarning: invalid value encountered in true_divide
updated_mean = (last_sum + new_sum) / updated_sample_count
/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/sklearn/utils/extmath.py:706: RuntimeWarning: Degrees of freedom <= 0 for slice.
result = op(x, *args, **kwargs)
Traceback (most recent call last):
File "", line 2, in
df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)
File "/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/datawig/simple_imputer.py", line 527, in complete
calibrate=False)
File "/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/datawig/simple_imputer.py", line 390, in fit
calibrate=calibrate)
File "/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/datawig/imputer.py", line 263, in fit
iter_train, iter_test = self.__build_iterators(train_df, test_df, test_split)
File "/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/datawig/imputer.py", line 604, in __build_iterators
batch_size=self.batch_size
File "/Applications/anaconda3/envs/Python_3-7/lib/python3.7/site-packages/datawig/iterators.py", line 231, in init
self.start_padding_idx = int(data_frame.index.max() + 1)
ValueError: cannot convert float NaN to integer
`
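The failing line computes `int(data_frame.index.max() + 1)`, so the index maximum must have been NaN. One situation that could produce this (an assumption; the report does not confirm it) is a training frame without any rows, for example when a column has no observed values at all. The preceding "Degrees of freedom <= 0 for slice" warnings would be consistent with that. The sketch below drops columns that contain only missing values and resets the index before imputation; the helper name and the toy column names are hypothetical.

```python
import numpy as np
import pandas as pd


def drop_unobserved_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy without all-missing columns and with a fresh 0..n-1 index."""
    observed = df.columns[df.notna().any(axis=0)]
    return df.loc[:, observed].reset_index(drop=True)


raw = pd.DataFrame(
    {
        "abundance_a": [1.0, np.nan, 3.0],
        "abundance_b": [np.nan, np.nan, np.nan],  # nothing to learn from
        "abundance_c": [0.5, 0.7, np.nan],
    },
    index=[10, 11, 12],
)
cleaned = drop_unobserved_columns(raw)
print(list(cleaned.columns))  # ['abundance_a', 'abundance_c']
```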
shutil.rmtree(output_col) in SimpleImputer.complete fails because the directories are generated within the output_path. The code should have been shutil.rmtree(os.path.join(output_path, output_col)).
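The proposed fix can be illustrated without datawig. The sketch assumes, as the report describes, that one subdirectory per output column is created under `output_path`; the function name and the column names are hypothetical.

```python
import os
import shutil
import tempfile


def remove_column_models(output_path: str, output_columns: list) -> None:
    """Delete the per-column model directories that live under `output_path`."""
    for output_col in output_columns:
        # Join with output_path; a bare shutil.rmtree(output_col) would look
        # in the current working directory instead.
        shutil.rmtree(os.path.join(output_path, output_col), ignore_errors=True)


output_path = tempfile.mkdtemp()
columns = ["price", "color"]
for col in columns:
    os.makedirs(os.path.join(output_path, col))

remove_column_models(output_path, columns)
remaining = os.listdir(output_path)
print(remaining)  # []
```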
Hi, thanks for the amazing library.
I am having an issue while running the complete() method.
import datawig
import pandas as pd
df = pd.read_csv('SpendCubeCleanNAN.csv', low_memory=False)
df_imp= datawig.SimpleImputer.complete(df)
df_imp.to_csv('SpendCubeCleanImpNN.csv')
Level 0: Cat Group is a column name in my dataset.
This is the output with the error traceback:
`C:\Users\Shadow\miniconda3\lib\site-packages\sklearn\utils\extmath.py:765: RuntimeWarning: invalid value encountered in true_divide
updated_mean = (last_sum + new_sum) / updated_sample_count
C:\Users\Shadow\miniconda3\lib\site-packages\sklearn\utils\extmath.py:706: RuntimeWarning: Degrees of freedom <= 0 for slice.
result = op(x, *args, **kwargs)
[10:30:26] c:\jenkins\workspace\mxnet-tag\mxnet\src\operator\../common/utils.h:450:
Storage type fallback detected:
operator = Concat
input storage types = [csr, default, ]
output storage types = [default, ]
params = {"num_args" : 2, "dim" : 1, }
context.dev_mask = cpu
The operator with default storage type will be dispatched for execution. You're seeing this warning message because the operator above is unable to process the given ndarrays with specified storage types, context and parameter. Temporary dense ndarrays are generated in order to execute the operator. This does not affect the correctness of the programme. You can set environment variable MXNET_STORAGE_FALLBACK_LOG_VERBOSE to 0 to suppress this warning.
Traceback (most recent call last):
File "C:/Users/Shadow/PycharmProjects/Datawig/SimpleImpSpendCube.py", line 8, in <module>
df_imp= datawig.SimpleImputer.complete(df)
File "C:\Users\Shadow\miniconda3\lib\site-packages\datawig\simple_imputer.py", line 527, in complete
calibrate=False)
File "C:\Users\Shadow\miniconda3\lib\site-packages\datawig\simple_imputer.py", line 382, in fit
output_path=self.output_path)
File "C:\Users\Shadow\miniconda3\lib\site-packages\datawig\imputer.py", line 150, in __init__
os.makedirs(self.output_path)
File "C:\Users\Shadow\miniconda3\lib\os.py", line 221, in makedirs
mkdir(name, mode)
NotADirectoryError: [WinError 267] Der Verzeichnisname ist ungültig: '.\\Level 0: Cat Group'
Can you please help me with the issue?
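The traceback shows `os.makedirs` being called with the column name as a directory name ('.\\Level 0: Cat Group'). WinError 267 means "The directory name is invalid": Windows does not allow `:` in file or directory names. One possible workaround is sketched below, on the assumption that the directory name is derived from the column name as the traceback suggests: rename the columns to filesystem-safe names before imputing, then restore the original names afterwards. The helper `safe_column_names` is hypothetical.

```python
import re

import pandas as pd

# Characters that Windows rejects in file and directory names.
_UNSAFE = re.compile(r'[<>:"/\\|?*]')


def safe_column_names(df: pd.DataFrame) -> dict:
    """Map each original column name to a filesystem-safe, unique replacement."""
    mapping = {}
    used = set()
    for position, name in enumerate(df.columns):
        safe = _UNSAFE.sub("_", str(name)).strip() or "column"
        if safe in used:
            safe = "{}_{}".format(safe, position)
        used.add(safe)
        mapping[name] = safe
    return mapping


df = pd.DataFrame({"Level 0: Cat Group": ["a", None], "Spend": [1.0, 2.0]})
to_safe = safe_column_names(df)
df_safe = df.rename(columns=to_safe)
# ... run the imputation on df_safe here ...
restored = df_safe.rename(columns={safe: name for name, safe in to_safe.items()})
print(list(df_safe.columns))  # ['Level 0_ Cat Group', 'Spend']
```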
I trained an Imputer model with a mix of categorical, numerical and bow encoders (and associated featurizers), but when I run the explain method on it I get this error:
~/opt/anaconda3/envs/extract/lib/python3.6/site-packages/datawig/imputer.py in explain(self, label, k, label_column)
390 # for each data encoder extract (token_idx, token_idx_correlation_with_label), extract and apply idx2token map.
391 feature_dict = dict(explained_label = label)
--> 392 for encoder, pattern in self.__class_patterns:
393 # extract idx2token mappings
394 if isinstance(encoder, CategoricalEncoder):
TypeError: 'NoneType' object is not iterable
I tried with a dummy setup and explain works, so I would like to know if you have any clue about what exactly is causing this in my more complex model.
We need to host the project's documentation on readthedocs, including example use cases and ideally similar to the user guides in scikit-learn.
There has been a discussion about serialisation behaviour. In particular, it can lead to crashes because of permission issues. A second problem is disk space when training many models, e.g. for HPO.
From my perspective it would be nice to make serialisation optional, but I understand there has already been a consensus not to(?)
Thanks for your nice package.
I have one question.
I am imputing a large matrix (90,000 by 7,000).
This matrix contains a lot of NA values (over 80%).
It also includes numerical values and zero-or-one categorical values.
Below is my code (run after loading the whole dataframe to impute):
` import datawig
with tf.device(d):
df = datawig.SimpleImputer.complete(df, inplace=True, num_epochs=max_epoch, verbose=1, output_path=result_dir+ str(num_seed)+'seed_imputer_model')
with open(result_dir+str(num_seed)+"seed_Imputed_merged_cid.pickle", 'wb') as handle:
pickle.dump(merged_cid, handle, protocol=pickle.HIGHEST_PROTOCOL)
pd.DataFrame(df).to_csv(result_dir+ str(num_seed) + 'seed_Imputed_merged_cid.csv', index=None)`
I use datawig.SimpleImputer.complete for simplicity,
but is there any way to get the neural network weights that were used for imputation?
Also, how does datawig.SimpleImputer.complete handle training and validation?
I am asking because the accuracy never changes (it stays at 0):
2020-10-27 11:14:22,355 [INFO] Epoch[49] Batch [0-34] Speed: 1651.71 samples/sec cross-entropy=0.515578 C0040436-accuracy=0.000000
2020-10-27 11:14:22,675 [INFO] Epoch[49] Train-cross-entropy=0.667427
2020-10-27 11:14:22,675 [INFO] Epoch[49] Train-C0040436-accuracy=0.000000
2020-10-27 11:14:22,675 [INFO] Epoch[49] Time cost=0.657
2020-10-27 11:14:22,688 [INFO] Saved checkpoint to "result/dtip/impute/datawig/1000seed_imputer_model/C0040436/model-0049.params"
2020-10-27 11:14:22,723 [INFO] Epoch[49] Validation-cross-entropy=0.492388
2020-10-27 11:14:22,723 [INFO] Epoch[49] Validation-C0040436-accuracy=0.000000
Thanks
Hyojin
Build 1.4 today
Ubuntu 18.04
gcc-6, g++-6
Cuda 10
TensorRT
Tensorflow 1.13
Make option:
USE_OPENCV=1
USE_BLAS=openblas
USE_CUDA=1
USE_CUDA_PATH=/usr/local/cuda
USE_CUDNN=1
USE_NCCL=1
Compiled with only a warning about lapack
Copied the example into a python script and ran with python3
Error reported:
Traceback (most recent call last):
File "test.py", line 8, in
mx.random.seed(1234)
AttributeError: module 'mxnet' has no attribute 'random'
I tried to run SimpleImputer in PyCharm, but it reported an error; surprisingly, the same code worked fine on the command line.
Traceback (most recent call last):
File "F:/pyfile/missing_data/datawig.py", line 9, in <module>
import datawig.utils
File "F:\pyfile\missing_data\datawig.py", line 9, in <module>
import datawig.utils
ModuleNotFoundError: No module named 'datawig.utils'; 'datawig' is not a package
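The traceback shows that the script itself is named `datawig.py`, so `import datawig` resolves to the script rather than to the installed package; this is what the message "'datawig' is not a package" indicates. A small diagnostic sketch for checking which file an import would load (the helper name `module_origin` is hypothetical):

```python
import importlib.util


def module_origin(name: str):
    """Return the file a top-level import of `name` would load, or None if not found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# If this prints the path of your own script instead of a path inside
# site-packages, the script shadows the installed package.
print(module_origin("datawig"))
```

Renaming the script (and deleting any stale `datawig.pyc` or `__pycache__` entry next to it) should make the import resolve to the installed package again.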
import datawig
I just get errors.
AttributeError Traceback (most recent call last)
in
----> 1 import datawig
2
3 df = datawig.utils.generate_df_string(num_samples=200, data_column_name='sentences', label_column_name='label')
4 df_train, df_test = datawig.utils.random_split(df)
5
~/Development/repos/AWS/MBA/datawig/datawig/init.py in
1 # makes the column encoders available as e.g. from datawig import CategoricalEncoder
----> 2 from .column_encoders import CategoricalEncoder, BowEncoder, NumericalEncoder, SequentialEncoder
3 from .mxnet_input_symbols import BowFeaturizer, LSTMFeaturizer, NumericalFeaturizer, EmbeddingFeaturizer
4 from .simple_imputer import SimpleImputer
5 from .imputer import Imputer
~/Development/repos/AWS/MBA/datawig/datawig/column_encoders.py in
30 from sklearn.preprocessing import StandardScaler
31
---> 32 from .utils import logger
33
34 random.seed(0)
~/Development/repos/AWS/MBA/datawig/datawig/utils.py in
32 import pandas as pd
33
---> 34 mx.random.seed(1)
35 random.seed(1)
36 np.random.seed(42)
Should I set up a conda env for a specific version of mxnet?
Hello,
I am trying to install datawig, however, I can only install later versions of mxnet. Is it possible to use newer versions of mxnet?
This is the error I am getting while installing from pip:
ERROR: Could not find a version that satisfies the requirement mxnet==1.4.0 (from versions: 1.6.0, 1.7.0.post1)
ERROR: No matching distribution found for mxnet==1.4.0
To introspect the learned parameters of the imputer:
def explain('class')
def explain_instance(sample)
For now, a simple, univariate measure of feature-label covariances should suffice (link).
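A univariate feature-label covariance of the kind mentioned here could look like the following sketch. It assumes the features are available as a dense numeric matrix (for instance bag-of-words counts) with known feature names. The function name `top_covariates` and the toy data are hypothetical and do not describe datawig's actual `explain` implementation.

```python
import numpy as np


def top_covariates(features, labels, target_label, feature_names, k=2):
    """Rank features by their covariance with the indicator `labels == target_label`."""
    features = np.asarray(features, dtype=float)
    indicator = (np.asarray(labels) == target_label).astype(float)
    centered_features = features - features.mean(axis=0)
    centered_indicator = indicator - indicator.mean()
    # Covariance of every feature column with the label indicator.
    covariances = centered_features.T @ centered_indicator / len(indicator)
    order = np.argsort(-covariances)[:k]
    return [(feature_names[i], float(covariances[i])) for i in order]


# Toy bag-of-words counts for four documents and three tokens.
counts = [[2, 0, 1], [1, 0, 1], [0, 2, 1], [0, 1, 1]]
labels = ["shoes", "shoes", "shirt", "shirt"]
ranking = top_covariates(counts, labels, "shoes", ["sneaker", "cotton", "the"])
print(ranking[0][0])  # sneaker
```

Each feature is scored independently, so the measure is cheap to compute but ignores interactions between features.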
numpy==1.18.0
scikit-learn[alldeps]==0.22.1
typing==3.6.6
pandas==0.25.0
mxnet==1.4.0
These are the requirements; however, mxnet 1.4.0 depends on numpy 1.14.6,
and the latest version of mxnet, 1.6.0, depends on numpy 1.16.6.
As a result, I am unable to install this package.
Hi,
I am trying to impute numeric values from one specific column (it's called 'Comercializadora_encoded', and it is now a numeric column because I previously encoded the original object-type column with LabelEncoder() from sklearn).
These are the column types I would like to use as input:
--> Provincia 166203 non-null float64
--> Consumo 166203 non-null float64
--> Potencia max 166203 non-null float64
And this one the column to impute:
--> Comercializadora_encoded 163937 non-null object
This is my code:
df_train, df_test = datawig.utils.random_split(df_copy)
imputer = datawig.SimpleImputer(
input_columns=['Provincia', 'Consumo', 'Potencia max'],
output_column= 'Comercializadora_encoded',
output_path = 'imputer_model'
)
imputer.fit(train_df=df_train, num_epochs=50)
imputed = imputer.predict(df_test)
And this is the error message I am getting:
2020-11-30 09:57:37,860 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 47 occurrences of value 16.0
2020-11-30 09:57:37,860 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 40 occurrences of value 7.0
2020-11-30 09:57:37,860 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 27 occurrences of value 44.0
2020-11-30 09:57:37,865 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 23 occurrences of value 66.0
2020-11-30 09:57:37,866 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 19 occurrences of value 29.0
2020-11-30 09:57:37,868 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 18 occurrences of value 28.0
2020-11-30 09:57:37,869 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 17 occurrences of value 56.0
2020-11-30 09:57:37,870 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 17 occurrences of value 21.0
2020-11-30 09:57:37,871 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 16 occurrences of value 81.0
2020-11-30 09:57:37,872 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 16 occurrences of value 34.0
2020-11-30 09:57:37,873 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 16 occurrences of value 74.0
2020-11-30 09:57:37,874 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 13 occurrences of value 43.0
2020-11-30 09:57:37,875 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 12 occurrences of value 1.0
2020-11-30 09:57:37,876 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 9 occurrences of value 52.0
2020-11-30 09:57:37,877 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 9 occurrences of value 38.0
2020-11-30 09:57:37,878 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 9 occurrences of value 9.0
2020-11-30 09:57:37,880 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 8 occurrences of value 12.0
2020-11-30 09:57:37,881 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 8 occurrences of value 25.0
2020-11-30 09:57:37,882 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 7 occurrences of value 69.0
2020-11-30 09:57:37,884 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 7 occurrences of value 79.0
2020-11-30 09:57:37,885 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 7 occurrences of value 63.0
2020-11-30 09:57:37,886 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 7 occurrences of value 6.0
2020-11-30 09:57:37,887 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 7 occurrences of value 76.0
2020-11-30 09:57:37,888 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 6 occurrences of value 67.0
2020-11-30 09:57:37,888 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 6 occurrences of value 54.0
2020-11-30 09:57:37,889 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 5 occurrences of value 26.0
2020-11-30 09:57:37,890 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 5 occurrences of value 20.0
2020-11-30 09:57:37,890 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 5 occurrences of value 48.0
2020-11-30 09:57:37,891 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 5 occurrences of value 49.0
2020-11-30 09:57:37,892 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 5 occurrences of value 10.0
2020-11-30 09:57:37,893 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 4 occurrences of value 23.0
2020-11-30 09:57:37,894 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 4 occurrences of value 53.0
2020-11-30 09:57:37,896 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 4 occurrences of value 5.0
2020-11-30 09:57:37,897 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 4 occurrences of value 36.0
2020-11-30 09:57:37,899 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 3 occurrences of value 57.0
2020-11-30 09:57:37,900 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 3 occurrences of value 27.0
2020-11-30 09:57:37,902 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 3 occurrences of value 0.0
2020-11-30 09:57:37,903 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 3 occurrences of value 17.0
2020-11-30 09:57:37,904 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 3 occurrences of value 2.0
2020-11-30 09:57:37,906 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 2 occurrences of value 45.0
2020-11-30 09:57:37,907 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 2 occurrences of value 71.0
2020-11-30 09:57:37,908 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 2 occurrences of value 46.0
2020-11-30 09:57:37,909 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 2 occurrences of value 4.0
2020-11-30 09:57:37,910 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 2 occurrences of value 50.0
2020-11-30 09:57:37,911 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 2 occurrences of value 14.0
2020-11-30 09:57:37,912 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 2 occurrences of value 68.0
2020-11-30 09:57:37,913 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 2 occurrences of value 22.0
2020-11-30 09:57:37,914 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 59.0
2020-11-30 09:57:37,916 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 65.0
2020-11-30 09:57:37,917 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 42.0
2020-11-30 09:57:37,919 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 72.0
2020-11-30 09:57:37,920 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 77.0
2020-11-30 09:57:37,921 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 60.0
2020-11-30 09:57:37,922 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 8.0
2020-11-30 09:57:37,923 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 3.0
2020-11-30 09:57:37,924 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 82.0
2020-11-30 09:57:37,925 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 13.0
2020-11-30 09:57:37,926 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 33.0
2020-11-30 09:57:37,927 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 15.0
2020-11-30 09:57:37,928 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 37.0
2020-11-30 09:57:37,930 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 62.0
2020-11-30 09:57:37,931 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 75.0
2020-11-30 09:57:37,932 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 40.0
2020-11-30 09:57:37,933 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 41.0
2020-11-30 09:57:37,934 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 30.0
2020-11-30 09:57:37,935 [INFO] CategoricalEncoder for column Comercializadora_encoded found only 1 occurrences of value 39.0
C:\Users\rcruz\Anaconda3\lib\site-packages\pandas\core\frame.py:3509: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
self[k1] = value[k2]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-55-55b90ff782c9> in <module>
10
11 ## Fit an imputer model on the train data
---> 12 imputer.fit(train_df=df_train, num_epochs=50)
13
14 ## Impute missing values and return original dataframe with predictions
~\AppData\Roaming\Python\Python38\site-packages\datawig\simple_imputer.py in fit(self, train_df, test_df, ctx, learning_rate, num_epochs, patience, test_split, weight_decay, batch_size, final_fc_hidden_units, calibrate, class_weights, instance_weights)
384 self.output_path = self.imputer.output_path
385
--> 386 self.imputer = self.imputer.fit(train_df, test_df, ctx, learning_rate, num_epochs, patience,
387 test_split,
388 weight_decay, batch_size,
~\AppData\Roaming\Python\Python38\site-packages\datawig\imputer.py in fit(self, train_df, test_df, ctx, learning_rate, num_epochs, patience, test_split, weight_decay, batch_size, final_fc_hidden_units, calibrate)
261 train_df, test_df = random_split(train_df, [1.0 - test_split, test_split])
262
--> 263 iter_train, iter_test = self.__build_iterators(train_df, test_df, test_split)
264
265 self.__check_data(test_df)
~\AppData\Roaming\Python\Python38\site-packages\datawig\imputer.py in __build_iterators(self, train_df, test_df, test_split)
590
591 logger.debug("Building Train Iterator with {} elements".format(len(train_df)))
--> 592 iter_train = ImputerIterDf(
593 data_frame=train_df,
594 data_columns=self.data_encoders,
~\AppData\Roaming\Python\Python38\site-packages\datawig\iterators.py in __init__(self, data_frame, data_columns, label_columns, batch_size)
221 numerical_columns = [c for c in data_frame.columns if is_numeric_dtype(data_frame[c])]
222 string_columns = list(set(data_frame.columns) - set(numerical_columns))
--> 223 data_frame = data_frame.fillna(value={x: "" for x in string_columns})
224 data_frame = data_frame.fillna(value={x: np.nan for x in numerical_columns})
225
~\Anaconda3\lib\site-packages\pandas\core\frame.py in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
4250 **kwargs
4251 ):
-> 4252 return super().fillna(
4253 value=value,
4254 method=method,
~\Anaconda3\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
6272 continue
6273 obj = result[k]
-> 6274 obj.fillna(v, limit=limit, inplace=True, downcast=downcast)
6275 return result if not inplace else None
6276
~\Anaconda3\lib\site-packages\pandas\core\series.py in fillna(self, value, method, axis, inplace, limit, downcast, **kwargs)
4339 **kwargs
4340 ):
-> 4341 return super().fillna(
4342 value=value,
4343 method=method,
~\Anaconda3\lib\site-packages\pandas\core\generic.py in fillna(self, value, method, axis, inplace, limit, downcast)
6255 )
6256
-> 6257 new_data = self._data.fillna(
6258 value=value, limit=limit, inplace=inplace, downcast=downcast
6259 )
~\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in fillna(self, **kwargs)
573
574 def fillna(self, **kwargs):
--> 575 return self.apply("fillna", **kwargs)
576
577 def downcast(self, **kwargs):
~\Anaconda3\lib\site-packages\pandas\core\internals\managers.py in apply(self, f, axes, filter, do_integrity_check, consolidate, **kwargs)
436 kwargs[k] = obj.reindex(b_items, axis=axis, copy=align_copy)
437
--> 438 applied = getattr(b, f)(**kwargs)
439 result_blocks = _extend_blocks(applied, result_blocks)
440
~\Anaconda3\lib\site-packages\pandas\core\internals\blocks.py in fillna(self, value, limit, inplace, downcast)
1950 def fillna(self, value, limit=None, inplace=False, downcast=None):
1951 values = self.values if inplace else self.values.copy()
-> 1952 values = values.fillna(value=value, limit=limit)
1953 return [
1954 self.make_block_same_class(
~\Anaconda3\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
206 else:
207 kwargs[new_arg_name] = new_arg_value
--> 208 return func(*args, **kwargs)
209
210 return wrapper
~\Anaconda3\lib\site-packages\pandas\core\arrays\categorical.py in fillna(self, value, method, limit)
1871 elif is_hashable(value):
1872 if not isna(value) and value not in self.categories:
-> 1873 raise ValueError("fill value must be in categories")
1874
1875 mask = codes == -1
ValueError: fill value must be in categories
I've also tried to use categorical columns as input columns, and to convert the output column into a category.
Am I missing something?
Thank you very much.
Regards,
Rubén.
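The last frames of the traceback show `fillna` with an empty string being applied to a column of pandas `category` dtype. That raises because the empty string is not one of the column's categories. A possible workaround, sketched on the assumption that the output column was converted to `category` before fitting, is to cast categorical columns back to plain `object` dtype first. The helper name `decategorize` is hypothetical.

```python
import numpy as np
import pandas as pd


def decategorize(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy in which every `category` column is cast to `object` dtype."""
    result = df.copy()
    for column in result.select_dtypes(include="category").columns:
        result[column] = result[column].astype(object)
    return result


demo = pd.DataFrame(
    {
        "Consumo": [1.5, 2.0, 3.1],
        "Comercializadora_encoded": pd.Categorical([16.0, np.nan, 7.0]),
    }
)
plain = decategorize(demo)
# The same kind of fill that failed on the categorical column now succeeds.
filled = plain.fillna(value={"Comercializadora_encoded": ""})
```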
As the title says: when calling imputer.predict(df), columns are appended to df such that a second call of the command throws an error.
This makes it unnecessarily complicated to apply different imputers to the same dataset, e.g. for comparing predictions.
Changing this may break backwards compatibility though.
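Until the behaviour changes, a caller-side workaround is to hand `predict` a copy of the frame. The sketch below assumes only that the imputer exposes `predict(DataFrame)` and mutates its argument as described. `AppendingImputer` is a hypothetical stand-in that mimics the reported behaviour so the example runs without datawig.

```python
import pandas as pd


def predict_on_copy(imputer, df: pd.DataFrame) -> pd.DataFrame:
    """Call `imputer.predict` on a copy so the caller's frame keeps its original columns."""
    return imputer.predict(df.copy())


class AppendingImputer:
    """Hypothetical stand-in: adds a prediction column to the frame it receives."""

    def __init__(self, output_column: str):
        self.output_column = output_column

    def predict(self, df: pd.DataFrame) -> pd.DataFrame:
        df[self.output_column + "_imputed"] = "placeholder"
        return df


df = pd.DataFrame({"text": ["red shoe", "blue shirt"], "label": ["shoes", None]})
first = predict_on_copy(AppendingImputer("label"), df)
second = predict_on_copy(AppendingImputer("label"), df)
print(list(df.columns))  # ['text', 'label']
```

This keeps the existing `predict` contract intact, so backwards compatibility is not affected.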
Hi,
I installed datawig today using pip3 in an extra virtualenv and ran into some problems that I'd like to point out here.
During the installation, I encountered the following warning:
mxnet 1.3.0b20180820 has requirement numpy<1.15.0,>=1.8.2, but you'll have numpy 1.15.0 which is incompatible.
After importing numpy in jupyter I got this warning:
/home/.../datawig/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
Next I tried the simple imputer example, but the fit_hpo method took ages to run on the small training dataset with 5k records; I stopped the HPO after 20 minutes or so.
When I tried to evaluate the predictions of datawig using the proposed code
f1 = f1_score(predictions['finish'], predictions['finish_imputed'])
I received the following error message:
ValueError: Target is multiclass but average='binary'. Please choose another average setting.
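`f1_score` defaults to `average='binary'`, which applies only to two-class targets. For a multiclass column such as `finish`, an averaging strategy has to be chosen explicitly. A minimal sketch with made-up label values (the values are hypothetical, not taken from the example dataset):

```python
from sklearn.metrics import f1_score

true_values = ["matte", "glossy", "satin", "matte", "glossy", "satin"]
imputed_values = ["matte", "glossy", "matte", "matte", "satin", "satin"]

# 'micro' pools all decisions, 'macro' averages the per-class F1 scores equally,
# 'weighted' averages the per-class F1 scores by class frequency.
micro_f1 = f1_score(true_values, imputed_values, average="micro")
macro_f1 = f1_score(true_values, imputed_values, average="macro")
weighted_f1 = f1_score(true_values, imputed_values, average="weighted")
```

Which average is appropriate depends on whether rare classes should count as much as frequent ones.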
Why do we get cross-entropy and accuracy logs when the column being imputed is a numeric variable?
The imputed values I get are continuous, so I am wondering why these metrics are reported.
2021-02-19 19:20:06,854 [INFO] NumExpr defaulting to 8 threads.
2021-02-19 19:24:50,957 [INFO]
========== start: fit model
2021-02-19 19:24:50,957 [WARNING] Already bound, ignoring bind()
2021-02-19 19:25:18,406 [INFO] Epoch[0] Batch [0-6639] Speed: 3870.89 samples/sec cross-entropy=2.836561 total_votes-accuracy=0.000000
2021-02-19 19:25:45,889 [INFO] Epoch[0] Train-cross-entropy=2.407569
2021-02-19 19:25:45,890 [INFO] Epoch[0] Train-total_votes-accuracy=0.000000
2021-02-19 19:25:45,891 [INFO] Epoch[0] Time cost=54.932
2021-02-19 19:25:45,893 [INFO] Saved checkpoint to "imputer_model/model-0000.params"
2021-02-19 19:25:50,622 [INFO] Epoch[0] Validation-cross-entropy=2.897565
2021-02-19 19:25:50,623 [INFO] Epoch[0] Validation-total_votes-accuracy=0.000000
2021-02-19 19:26:19,453 [INFO] Epoch[1] Batch [0-6639] Speed: 3685.39 samples/sec cross-entropy=2.033162 total_votes-accuracy=0.000000
2021-02-19 19:26:46,887 [INFO] Epoch[1] Train-cross-entropy=1.915317
2021-02-19 19:26:46,888 [INFO] Epoch[1] Train-total_votes-accuracy=0.000000
2021-02-19 19:26:46,888 [INFO] Epoch[1] Time cost=56.265
2021-02-19 19:26:46,890 [INFO] Saved checkpoint to "imputer_model/model-0001.params"
2021-02-19 19:26:51,626 [INFO] Epoch[1] Validation-cross-entropy=2.187781
2021-02-19 19:26:51,626 [INFO] Epoch[1] Validation-total_votes-accuracy=0.000000
2021-02-19 19:27:19,355 [INFO] Epoch[2] Batch [0-6639] Speed: 3831.58 samples/sec cross-entropy=1.926377 total_votes-accuracy=0.000000
2021-02-19 19:27:46,809 [INFO] Epoch[2] Train-cross-entropy=1.839549
2021-02-19 19:27:46,810 [INFO] Epoch[2] Train-total_votes-accuracy=0.000000
2021-02-19 19:27:46,810 [INFO] Epoch[2] Time cost=55.183
2021-02-19 19:27:46,813 [INFO] Saved checkpoint to "imputer_model/model-0002.params"
2021-02-19 19:27:51,539 [INFO] Epoch[2] Validation-cross-entropy=2.001703
2021-02-19 19:27:51,540 [INFO] Epoch[2] Validation-total_votes-accuracy=0.000000
2021-02-19 19:28:19,027 [INFO] Epoch[3] Batch [0-6639] Speed: 3865.17 samples/sec cross-entropy=1.891239 total_votes-accuracy=0.000000
2021-02-19 19:28:46,633 [INFO] Epoch[3] Train-cross-entropy=1.813997
2021-02-19 19:28:46,634 [INFO] Epoch[3] Train-total_votes-accuracy=0.000000
2021-02-19 19:28:46,634 [INFO] Epoch[3] Time cost=55.094
2021-02-19 19:28:46,637 [INFO] Saved checkpoint to "imputer_model/model-0003.params"
2021-02-19 19:28:51,367 [INFO] Epoch[3] Validation-cross-entropy=1.956922
2021-02-19 19:28:51,368 [INFO] Epoch[3] Validation-total_votes-accuracy=0.000000
2021-02-19 19:29:18,846 [INFO] Epoch[4] Batch [0-6639] Speed: 3866.56 samples/sec cross-entropy=1.873258 total_votes-accuracy=0.000000
2021-02-19 19:29:46,276 [INFO] Epoch[4] Train-cross-entropy=1.806516
2021-02-19 19:29:46,277 [INFO] Epoch[4] Train-total_votes-accuracy=0.000000
2021-02-19 19:29:46,277 [INFO] Epoch[4] Time cost=54.909
2021-02-19 19:29:46,279 [INFO] Saved checkpoint to "imputer_model/model-0004.params"
2021-02-19 19:29:51,008 [INFO] Epoch[4] Validation-cross-entropy=1.971730
2021-02-19 19:29:51,009 [INFO] Epoch[4] Validation-total_votes-accuracy=0.000000
2021-02-19 19:30:18,524 [INFO] Epoch[5] Batch [0-6639] Speed: 3861.23 samples/sec cross-entropy=1.869224 total_votes-accuracy=0.000000
2021-02-19 19:30:46,013 [INFO] Epoch[5] Train-cross-entropy=1.799461
2021-02-19 19:30:46,014 [INFO] Epoch[5] Train-total_votes-accuracy=0.000000
2021-02-19 19:30:46,014 [INFO] Epoch[5] Time cost=55.005
2021-02-19 19:30:46,016 [INFO] Saved checkpoint to "imputer_model/model-0005.params"
2021-02-19 19:30:50,742 [INFO] Epoch[5] Validation-cross-entropy=1.914169
2021-02-19 19:30:50,743 [INFO] Epoch[5] Validation-total_votes-accuracy=0.000000
2021-02-19 19:31:18,287 [INFO] Epoch[6] Batch [0-6639] Speed: 3857.19 samples/sec cross-entropy=1.848747 total_votes-accuracy=0.000000
2021-02-19 19:31:45,981 [INFO] Epoch[6] Train-cross-entropy=1.785161
2021-02-19 19:31:45,981 [INFO] Epoch[6] Train-total_votes-accuracy=0.000000
2021-02-19 19:31:45,982 [INFO] Epoch[6] Time cost=55.238
2021-02-19 19:31:45,984 [INFO] Saved checkpoint to "imputer_model/model-0006.params"
2021-02-19 19:31:50,726 [INFO] Epoch[6] Validation-cross-entropy=1.864254
2021-02-19 19:31:50,727 [INFO] Epoch[6] Validation-total_votes-accuracy=0.000000
2021-02-19 19:32:18,360 [INFO] Epoch[7] Batch [0-6639] Speed: 3844.81 samples/sec cross-entropy=1.842331 total_votes-accuracy=0.000000
2021-02-19 19:32:45,956 [INFO] Epoch[7] Train-cross-entropy=1.781625
2021-02-19 19:32:45,957 [INFO] Epoch[7] Train-total_votes-accuracy=0.000000
2021-02-19 19:32:45,957 [INFO] Epoch[7] Time cost=55.230
2021-02-19 19:32:45,961 [INFO] Saved checkpoint to "imputer_model/model-0007.params"
2021-02-19 19:32:50,694 [INFO] Epoch[7] Validation-cross-entropy=1.862272
2021-02-19 19:32:50,695 [INFO] Epoch[7] Validation-total_votes-accuracy=0.000000
2021-02-19 19:33:18,318 [INFO] Epoch[8] Batch [0-6639] Speed: 3846.10 samples/sec cross-entropy=1.836069 total_votes-accuracy=0.000000
2021-02-19 19:33:45,916 [INFO] Epoch[8] Train-cross-entropy=1.777847
2021-02-19 19:33:45,917 [INFO] Epoch[8] Train-total_votes-accuracy=0.000000
2021-02-19 19:33:45,917 [INFO] Epoch[8] Time cost=55.222
2021-02-19 19:33:45,919 [INFO] Saved checkpoint to "imputer_model/model-0008.params"
2021-02-19 19:33:50,644 [INFO] Epoch[8] Validation-cross-entropy=1.833026
2021-02-19 19:33:50,645 [INFO] Epoch[8] Validation-total_votes-accuracy=0.000000
2021-02-19 19:34:18,208 [INFO] Epoch[9] Batch [0-6639] Speed: 3854.56 samples/sec cross-entropy=1.833520 total_votes-accuracy=0.000000
2021-02-19 19:34:45,896 [INFO] Epoch[9] Train-cross-entropy=1.776226
2021-02-19 19:34:45,897 [INFO] Epoch[9] Train-total_votes-accuracy=0.000000
2021-02-19 19:34:45,897 [INFO] Epoch[9] Time cost=55.252
2021-02-19 19:34:45,900 [INFO] Saved checkpoint to "imputer_model/model-0009.params"
2021-02-19 19:34:50,627 [INFO] Epoch[9] Validation-cross-entropy=1.813570
2021-02-19 19:34:50,628 [INFO] Epoch[9] Validation-total_votes-accuracy=0.000000
2021-02-19 19:35:18,287 [INFO] Epoch[10] Batch [0-6639] Speed: 3841.24 samples/sec cross-entropy=1.830642 total_votes-accuracy=0.000000
2021-02-19 19:35:45,914 [INFO] Epoch[10] Train-cross-entropy=1.778353
2021-02-19 19:35:45,915 [INFO] Epoch[10] Train-total_votes-accuracy=0.000000
2021-02-19 19:35:45,915 [INFO] Epoch[10] Time cost=55.287
2021-02-19 19:35:45,917 [INFO] Saved checkpoint to "imputer_model/model-0010.params"
2021-02-19 19:35:50,653 [INFO] Epoch[10] Validation-cross-entropy=1.804272
2021-02-19 19:35:50,654 [INFO] Epoch[10] Validation-total_votes-accuracy=0.000000
2021-02-19 19:36:18,289 [INFO] Epoch[11] Batch [0-6639] Speed: 3844.46 samples/sec cross-entropy=1.830434 total_votes-accuracy=0.000000
2021-02-19 19:36:45,936 [INFO] Epoch[11] Train-cross-entropy=1.775856
2021-02-19 19:36:45,937 [INFO] Epoch[11] Train-total_votes-accuracy=0.000000
2021-02-19 19:36:45,937 [INFO] Epoch[11] Time cost=55.283
2021-02-19 19:36:45,940 [INFO] Saved checkpoint to "imputer_model/model-0011.params"
2021-02-19 19:36:50,669 [INFO] Epoch[11] Validation-cross-entropy=1.835253
2021-02-19 19:36:50,670 [INFO] Epoch[11] Validation-total_votes-accuracy=0.000000
2021-02-19 19:37:18,301 [INFO] Epoch[12] Batch [0-6639] Speed: 3845.06 samples/sec cross-entropy=1.821207 total_votes-accuracy=0.000000
2021-02-19 19:37:47,084 [INFO] Epoch[12] Train-cross-entropy=1.769923
2021-02-19 19:37:47,085 [INFO] Epoch[12] Train-total_votes-accuracy=0.000000
2021-02-19 19:37:47,085 [INFO] Epoch[12] Time cost=56.415
2021-02-19 19:37:47,087 [INFO] Saved checkpoint to "imputer_model/model-0012.params"
2021-02-19 19:37:51,819 [INFO] No improvement detected for 3 epochs compared to 1.8135704350806672 last error obtained: 1.8220642169074315, stopping here
2021-02-19 19:37:51,820 [INFO]
========== done (780.864077091217 s) fit model
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/datawig/calibration.py:92: RuntimeWarning: invalid value encountered in log
return np.log(probas)
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/datawig/calibration.py:59: RuntimeWarning: invalid value encountered in greater_equal
bin_mask = (top_probas >= bin_lower) & (top_probas < bin_upper)
/home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/datawig/calibration.py:59: RuntimeWarning: invalid value encountered in less
bin_mask = (top_probas >= bin_lower) & (top_probas < bin_upper)```
Hello,
I get this error that I can't solve, using google colaboratory.
I'm not sure if it's due to a wrong install or conflicting versions, my apologies if it is.
/usr/local/lib/python3.6/dist-packages/datawig/iterators.py in __init__(self, data_frame, data_columns, label_columns, batch_size)
229 # custom padding for having to discard the last batch in mxnet for sparse data
230 padding_n_rows = self._n_rows_padding(data_frame)
--> 231 self.start_padding_idx = int(data_frame.index.max() + 1)
232 for idx in range(self.start_padding_idx, self.start_padding_idx + padding_n_rows):
233 data_frame.loc[idx, :] = data_frame.loc[self.start_padding_idx - 1, :]
ValueError: cannot convert float NaN to integer
My code :
!pip install datawig
import datawig, numpy
import pandas as pd
import sys
from io import StringIO
data="""epiU epiPV dsU dsPG ifrU ifrPG
874 1125 40 57
815 1081 48 95
712 937 39 53
606 773 45 80
576 721 38 52
401 547 28 44 1040 1202
362 479 31 46 986 1139
295 361 29 42 909 1043
253 314 30 57 757 892
292 364 92 150 844 1018
253 311 18 43 765 921
214 263 14 24 681 808
198 248 16 26 645 752
161 199 10 24 562 654
"""
df = pd.read_csv(StringIO(data), sep="\t")
df = df[['epiU', 'dsU', 'ifrU']]
print(df.dtypes)
print(df)
df_imputed = datawig.SimpleImputer.complete(df)
EDIT:
An important note: the basic example works fine.
# generate some data with simple nonlinear dependency
df = datawig.utils.generate_df_numeric()
# mask 10% of the values
df_with_missing = df.mask(numpy.random.rand(*df.shape) > .9)
# impute missing values
df_with_missing_imputed = datawig.SimpleImputer.complete(df_with_missing)
EDIT2
I think the issue is datawig requires pandas 0.25.3
!pip install datawig
ERROR: google-colab 1.0.0 has requirement pandas~=1.0.0; python_version >= "3.0", but you'll have pandas 0.25.3 which is incompatible.
I have imputed a column using datawig, but the output stored in the column is an object, <datawig.simple_imputer.SimpleImputer object at 0x7f5e05b40630>, rather than the imputed values. How can we process this further and determine the value of this categorical data point? The model is also not saved to a pickle file, so everything has to be run again whenever the kernel is restarted.
Hi There,
I could be wrong but it appears that the GPU is mainly used when training. When I train my model, I see the GPU speed up but when I'm doing predictions it uses a single CPU core. For my large dataset, I'm noticing it's spending more time here than the training.
Is there anything I can do to leverage the GPU for predictions as well? In the tutorial, you can reproduce this by using a large dataset (1M rows x 200 columns) and running imputed = imputer.predict(df_test)
To implement a whitebox imputer that is able to explain instances and classes, we need a non-hashing text encoder.
Hello,
I am working on a synthetic categorical dataset that contains information about people and their location, with a format similar to
name, city, zip
john, paris, 1234
frank, rome, 718
In this situation the zip codes are integers, but I want to treat them as categorical data because I do not want to infer information based on their numerical properties.
In my code, I implemented the imputer both as
data_encoder_cols = [datawig.BowEncoder('zip')]
label_encoder_cols = [datawig.CategoricalEncoder('city')]
data_featurizer_cols = [datawig.BowFeaturizer('zip', max_tokens=df_train['zip'].nunique())]
and as
data_encoder_cols = [datawig.CategoricalEncoder('zip')]
label_encoder_cols = [datawig.CategoricalEncoder('city')]
data_featurizer_cols = [datawig.EmbeddingFeaturizer('zip', max_tokens=df_train['zip'].nunique())]
In the first case, imputer.fit(df_train) failed because (I'm assuming) the zip column was automatically cast back to integer (even though I had previously set it to object). The exception I'm getting is AttributeError: 'float' object has no attribute 'lower'
In the second case, the training failed in a weird way I can't explain, with the exception IndexError: index 716 is out of bounds for axis 0 with size 715
I tried df_dirty['zip'] = df_dirty['zip'].apply(lambda x: 'i' + str(x)) as a workaround, so that the zip codes are forced to be seen as strings. In this case, the code with CategoricalEncoder still failed with a similar error (719 instead of 716), while the BowEncoder version runs (slowly).
What's the correct way of treating numerical columns as categorical ones?
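One way to rule out dtype surprises before any encoder sees the column is to convert the codes to explicit string tokens in pandas while keeping missing entries missing. The sketch below is pandas-only; the helper name `as_categorical_tokens`, the `zip_` prefix, and the sample rows are illustrative, and whether datawig then trains without the IndexError reported above is an assumption that has not been verified here.

```python
import numpy as np
import pandas as pd


def as_categorical_tokens(series: pd.Series, prefix: str = "zip_") -> pd.Series:
    """Turn numeric codes into string tokens so they are not read as numbers.

    Missing values stay missing (NaN) rather than becoming the string 'nan'.
    """
    def to_token(value):
        if pd.isna(value):
            return np.nan
        # int(...) removes the trailing '.0' that appears when a column with
        # missing values has been stored as float.
        return prefix + str(int(value))

    return series.map(to_token).astype(object)


# Illustrative rows in the name/city/zip layout from the question.
df = pd.DataFrame({
    "name": ["john", "frank", "anna"],
    "city": ["paris", "rome", "paris"],
    "zip": [1234.0, 718.0, np.nan],
})
df["zip"] = as_categorical_tokens(df["zip"])
print(df["zip"].tolist())
```

Rows whose zip is still missing are floats (NaN) inside an object column, which is one plausible source of the `'float' object has no attribute 'lower'` error; they may need to be dropped or filled with a placeholder string before a text encoder processes them.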
Howdy,
I'm playing around with datawig and it's really neat. I am using it for a fairly large file and it completed after almost 8 hours last night - it would be awesome if there was some indicator of progress.
Right now I'm checking my computer here and there to see if the CPU is maxed out and if the job is running. Tensorflow has a great progress bar that might be nice to help people estimate timing for large jobs.
There appear to be two places in the code where precision filtering for categorical predictions is done:
in imputer.predict, where below-threshold values are replaced by empty strings; here the resulting data frame has the same number of rows as the data frame that was the argument to predict
in imputer.__filter_predictions, where the below-threshold values are discarded; the result list can now have fewer rows, and there will be an error in imputer.predict
We should make sure filtering is done consistently and preferably without changing the size of the input data frame.
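A row-preserving filter could look roughly like the sketch below: predictions whose confidence falls under the threshold are blanked instead of dropped, so the output always has as many rows as the input. The function name, its arguments, and the single global threshold are assumptions made for illustration; this is not datawig's implementation.

```python
import pandas as pd


def blank_low_confidence(predictions: pd.Series,
                         probabilities: pd.Series,
                         threshold: float) -> pd.Series:
    """Replace predictions below `threshold` with '' without dropping rows."""
    keep = probabilities >= threshold
    # Series.where keeps values where the mask is True and substitutes
    # the second argument elsewhere, so index and length are unchanged.
    return predictions.where(keep, "")


preds = pd.Series(["paris", "rome", "paris"])
probas = pd.Series([0.95, 0.40, 0.80])
filtered = blank_low_confidence(preds, probas, threshold=0.5)
print(filtered.tolist())
```

Because the result keeps the input index, it can be assigned straight back to a column of the original data frame.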
I might be missing something but I couldn't find any missing values on the dataset you're doing the example on.
The symbolic API for the imputer makes the code difficult to understand.
The Gluon API is much cleaner and allows for easier extensions.
Given the refactoring discussions I feel we would profit a lot from moving to Gluon.
Why has the failing build not been fixed? Is the project still alive? It hasn't seen any updates lately. Or is it abandoned?
I propose writing the learning curve as part of the metrics output, i.e. log likelihood, test accuracy, and train accuracy, all as a function of the epoch. This facilitates convergence diagnostics, and the values are computed already.
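Until such an output exists, per-epoch curves can be recovered from the training log itself. The sketch below parses lines in the `Epoch[n] Train-...=` / `Epoch[n] Validation-...=` format that appears in logs posted on this page; the function name `learning_curve` and the returned dictionary layout are assumptions, not part of datawig.

```python
import re
from collections import defaultdict

# Matches lines such as
# "2021-02-19 19:29:46,276 [INFO] Epoch[4] Train-cross-entropy=1.806516"
METRIC_LINE = re.compile(
    r"Epoch\[(?P<epoch>\d+)\]\s+"
    r"(?P<split>Train|Validation)-(?P<metric>[\w-]+)=(?P<value>[-+.\deE]+|nan)"
)


def learning_curve(log_lines):
    """Return {(split, metric): [(epoch, value), ...]} from training log lines."""
    curves = defaultdict(list)
    for line in log_lines:
        match = METRIC_LINE.search(line)
        if match:
            key = (match["split"], match["metric"])
            curves[key].append((int(match["epoch"]), float(match["value"])))
    return dict(curves)


sample = [
    "2021-02-19 19:29:46,276 [INFO] Epoch[4] Train-cross-entropy=1.806516",
    "2021-02-19 19:29:46,277 [INFO] Epoch[4] Time cost=54.909",
    "2021-02-19 19:29:51,008 [INFO] Epoch[4] Validation-cross-entropy=1.971730",
    "2021-02-19 19:30:46,013 [INFO] Epoch[5] Train-cross-entropy=1.799461",
]
curves = learning_curve(sample)
print(curves[("Train", "cross-entropy")])
```

Lines that carry no train or validation metric, such as the per-batch speed and time-cost lines, are simply skipped.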
When predicting missing values, I found that datawig can't impute multiple columns at once. For example,
data_encoder_cols = [NumericalEncoder('a'), NumericalEncoder('c'),
NumericalEncoder('e'),NumericalEncoder('g'),NumericalEncoder('h')]
label_encoder_cols = [NumericalEncoder('b'),NumericalEncoder('d'),NumericalEncoder('f')]
data_featurizer_cols = [NumericalFeaturizer('a'), NumericalFeaturizer('c'), NumericalFeaturizer('e'),
NumericalFeaturizer('g'), NumericalFeaturizer('h')]
imputer = Imputer(
data_featurizers=data_featurizer_cols,
label_encoders=label_encoder_cols,
data_encoders=data_encoder_cols,
output_path='imputer_model1'
)
This is my code. I want to impute 'b', 'd', and 'f', but I get an error:
Traceback (most recent call last):
File "<ipython-input-42-15a8b8acfb65>", line 1, in <module>
runfile('E:/Python/datawig-master/1.py', wdir='E:/Python/datawig-master')
File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 704, in runfile
execfile(filename, namespace)
File "D:\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 108, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "E:/Python/datawig-master/1.py", line 32, in <module>
imputer.fit(train_df=df_train,num_epochs=10)
File "E:\Python\datawig-master\datawig\imputer.py", line 257, in fit
iter_train, iter_test = self.__build_iterators(train_df, test_df, test_split)
File "E:\Python\datawig-master\datawig\imputer.py", line 564, in __build_iterators
train_df = self.__drop_missing_labels(train_df, how='all')
File "E:\Python\datawig-master\datawig\imputer.py", line 935, in __drop_missing_labels
if missing_idx == -1:
File "D:\Anaconda3\lib\site-packages\pandas\core\generic.py", line 1469, in __nonzero__
.format(self.__class__.__name__))
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
I don't know how to solve it. I would appreciate some help.
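The last frame of the traceback compares `missing_idx` with `-1` inside an `if`, and the error message says the value being tested is a pandas Series rather than a scalar. The pandas-only sketch below reproduces that failure mode and shows one possible shape for a workaround, handling one label column at a time. `fit_one` is a hypothetical callback standing in for whatever builds and fits a single-label imputer; it is not a datawig function.

```python
import pandas as pd


def is_ambiguous(value) -> bool:
    """Return True if `value == -1` cannot be used directly in an `if`."""
    try:
        if value == -1:
            pass
        return False
    except ValueError:
        # pandas raises "The truth value of a Series is ambiguous..."
        return True


def fit_per_label(df, label_columns, fit_one):
    """Call `fit_one(rows, label)` once per label column.

    Only rows where that particular label is present are passed on.
    """
    return {label: fit_one(df[df[label].notna()], label) for label in label_columns}


print(is_ambiguous(-1))                  # scalar comparison
print(is_ambiguous(pd.Series([-1, 3])))  # element-wise comparison

# Illustrative frame; the callback here only counts usable rows per label.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [1.0, None, 3.0], "d": [None, 5.0, 6.0]})
rows_used = fit_per_label(df, ["b", "d"], lambda rows, label: len(rows))
print(rows_used)
```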
I have checked the full code to find the error. During training, a log file is created and attached to the Python logging framework through a handler, but the handler is never closed. As a result, the imputer folder that was created cannot be removed until the kernel is restarted and the file handle is released.
To solve it, I added a new function that retrieves all open handlers and closes them. With this simple change, any file generated during training can be removed once training of the datawig model has finished.
Below are the only changes I made to make it work, all in the "imputer.py" file:
`    def __close_filehandlers(self) -> None:
        """Function to close connection with log file.
        author: Carlos Moral Rubio."""
        handlers = logger.handlers[:]
        for handler in handlers:
            handler.close()
            logger.removeHandler(handler)
`
`    def fit(self,
            train_df: pd.DataFrame,
            test_df: pd.DataFrame = None,
            ctx: mx.context = get_context(),
            learning_rate: float = 1e-3,
            num_epochs: int = 100,
            patience: int = 3,
            test_split: float = .1,
            weight_decay: float = 0.,
            batch_size: int = 16,
            final_fc_hidden_units: List[int] = None,
            calibrate: bool = True):
        """
        Trains and stores imputer model
        :param train_df: training data as dataframe
        :param test_df: test data as dataframe; if not provided, [test_split] % of the training
                        data are used as test data
        :param ctx: List of mxnet contexts (if no gpu's available, defaults to [mx.cpu()])
                    User can also pass in a list gpus to be used, ex. [mx.gpu(0), mx.gpu(2), mx.gpu(4)]
        :param learning_rate: learning rate for stochastic gradient descent (default 1e-3)
        :param num_epochs: maximal number of training epochs (default 100)
        :param patience: used for early stopping; after [patience] epochs with no improvement,
                         training is stopped. (default 3)
        :param test_split: if no test_df is provided this is the ratio of test data to be held
                           separate for determining model convergence
        :param weight_decay: regularizer (default 0)
        :param batch_size: default 16
        :param final_fc_hidden_units: list of dimensions for the final fully connected layer.
        :param calibrate: whether to calibrate predictions
        :return: trained imputer model
        """
        if final_fc_hidden_units is None:
            final_fc_hidden_units = []
        # make sure the output directory is writable
        assert os.access(self.output_path, os.W_OK), "Cannot write to directory {}".format(
            self.output_path)
        self.batch_size = batch_size
        self.final_fc_hidden_units = final_fc_hidden_units
        self.ctx = ctx
        logger.debug('Using [{}] as the context for training'.format(ctx))
        if (train_df is None) or (not isinstance(train_df, pd.core.frame.DataFrame)):
            raise ValueError("Need a non-empty DataFrame for fitting Imputer model")
        if test_df is None:
            train_df, test_df = random_split(train_df, [1.0 - test_split, test_split])
        iter_train, iter_test = self.__build_iterators(train_df, test_df, test_split)
        self.__check_data(test_df)
        # to make consecutive calls to .fit() continue where the previous call finished
        if self.module is None:
            self.module = self.__build_module(iter_train)
        self.__fit_module(iter_train, iter_test, learning_rate, num_epochs, patience, weight_decay)
        # Check whether calibration is needed, if so compute and set internal parameter
        # for temperature scaling that is supplied to self.__predict_mxnet_iter()
        if calibrate is True:
            self.calibrate(iter_test)
        _, metrics = self.__transform_and_compute_metrics_mxnet_iter(iter_test,
                                                                     metrics_path=self.metrics_path)
        for att, att_metric in metrics.items():
            if isinstance(att_metric, dict) and ('precision_recall_curves' in att_metric):
                self.precision_recall_curves[att] = att_metric['precision_recall_curves']
        self.__prune_models()
        self.save()
        if self.is_explainable:
            self.__persist_class_prototypes(iter_train, train_df)
        self.__close_filehandlers()
        return self`
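The effect of closing the handlers can be checked in isolation with the standard library alone. The sketch below uses a throwaway logger and a temporary directory as stand-ins for datawig's logger and the imputer output folder; the names `close_file_handlers`, `demo_logger`, and `output_dir` are illustrative and not taken from datawig.

```python
import logging
import os
import shutil
import tempfile


def close_file_handlers(logger: logging.Logger) -> None:
    """Close and detach every handler so log files are no longer held open."""
    for handler in logger.handlers[:]:  # iterate over a copy while removing
        handler.close()
        logger.removeHandler(handler)


# Stand-in for the imputer's output directory and its imputer.log file.
output_dir = tempfile.mkdtemp()
demo_logger = logging.getLogger("filehandler_demo")
demo_logger.setLevel(logging.INFO)
demo_logger.addHandler(logging.FileHandler(os.path.join(output_dir, "imputer.log")))
demo_logger.info("training finished")

close_file_handlers(demo_logger)
shutil.rmtree(output_dir)  # on Windows this fails while the handler is still open
print(os.path.exists(output_dir))
```

On POSIX systems the directory can usually be removed even with the file still open, which is why the problem tends to show up only on Windows.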
Hi All,
I installed the latest (0.1.12) version of Datawig module. I've considered all the package requirements:
But when I run the command "import datawig", I get the following error:
Traceback (most recent call last):
File "C:/Users/PC/Desktop/python_ummd/venv/Lib/imputation.py", line 7, in <module>
import datawig
File "C:\Users\PC\Desktop\untitled1\venv\bin\lib\site-packages\datawig\__init__.py", line 2, in <module>
from .column_encoders import CategoricalEncoder, BowEncoder, NumericalEncoder, SequentialEncoder
File "C:\Users\PC\Desktop\untitled1\venv\bin\lib\site-packages\datawig\column_encoders.py", line 26, in <module>
import mxnet as mx
File "C:\Users\PC\Desktop\untitled1\venv\bin\lib\site-packages\mxnet\__init__.py", line 24, in <module>
from .context import Context, current_context, cpu, gpu, cpu_pinned
File "C:\Users\PC\Desktop\untitled1\venv\bin\lib\site-packages\mxnet\context.py", line 24, in <module>
from .base import classproperty, with_metaclass, _MXClassPropertyMetaClass
File "C:\Users\PC\Desktop\untitled1\venv\bin\lib\site-packages\mxnet\base.py", line 213, in <module>
_LIB = _load_lib()
File "C:\Users\PC\Desktop\untitled1\venv\bin\lib\site-packages\mxnet\base.py", line 204, in _load_lib
lib = ctypes.CDLL(lib_path[0], ctypes.RTLD_LOCAL)
File "C:\Users\PC\AppData\Local\Programs\Python\Python37\lib\ctypes\__init__.py", line 364, in __init__
self._handle = _dlopen(self._name, mode)
OSError: [WinError 126] The specified module could not be found
I want to use SimpleImputer. I use Windows [Version 10.0.18363.836], Python 3.7, Pycharm 2020.1.1.
Could someone give me a hint on this issue?
Best regards,
Anastasiia
It would be nice to have a logging level that is less verbose than info but still provides some basic information. Info prints a line at every training batch. A useful logger could generate a statement at every epoch, or every n epochs, and include:
This used to work in previous versions but has no effect now
from datawig.utils import logger
logger.setLevel("ERROR")
Certain tests share state by using the same pseudo random number generator. Changing the order of these tests, adding new tests, or removing existing tests breaks other tests.
It would make imputation simpler to have a complete functionality that takes a dataframe and imputes all missing values without having to specify input and output columns, as in fancyimpute
Hello,
Thank you for this code.
I am currently using datawig to impute a numerical dataset.
However, I get an error when calling the SimpleImputer.complete function:
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process: 'SDNN\imputer.log'
SDNN is one of the column names in my array.
This is my code:
import scipy.io
mat = scipy.io.loadmat('A_only_HRV.mat')
training_array_A = mat['A_only_HRV']
mat = scipy.io.loadmat('dataset_all_A_fixed_missing.mat')
test_array_A = mat['dataset_all_A_fixed_missing']
import pandas as pd
c = pd.DataFrame(data=test_array_A,columns=['AVNN','SDNN','RMSSD','pNN','SEM','BETA','HF_NORM','HF_Peak','HF_Power','LF_Norm','LF_Peak','LF_Power','LF/HF','Total_Power','VLF_Norm','VLF_Power','SD1','SD2','Alpha1','Alpha2','SE','PIP','IALS','PSS','PAS'])
c2 = c.astype('str')
c_train = pd.DataFrame(data=training_array_A,columns=['AVNN','SDNN','RMSSD','pNN','SEM','BETA','HF_NORM','HF_Peak','HF_Power','LF_Norm','LF_Peak','LF_Power','LF/HF','Total_Power','VLF_Norm','VLF_Power','SD1','SD2','Alpha1','Alpha2','SE','PIP','IALS','PSS','PAS'])
c_train2 = c_train.astype('str')
import datawig, numpy
from datawig import SimpleImputer
df_with_missing_imputed = datawig.SimpleImputer.complete(c)
I am attaching the needed files, I obtain them as .mat files.
Could you please let me know why I am facing this error?
I am working on some missing values problem with datawig (I am new to it), where from a total of 19 features in a pandas dataframe with missing data, only 4 of them are not fully imputed.
I do:
import datawig
# impute missing values
dataframe = datawig.SimpleImputer.complete(dataframe)
and I get the following error message:
/home/user/.local/lib/python3.8/site-packages/sklearn/metrics/_classification.py:1272: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, msg_start, len(result))
What's happening and how could I impute the rest of the features?
If I want to add attention mechanisms to optimize the model, where should I make the changes?
Write a markdown doc detailing how to contribute.
Hi there, I am trying to run the example to apply datawig to both categorical and numerical data. The categorical data has integer values, while the numerical data is a positive real number. I read the documentation, and it seems that datawig takes multiple columns as input and imputes a specific column instead of imputing all missing values across all columns. Am I correct? I have a dataset with 4 columns: A, B, C, and Y. Y is the conclusion (label), while A, B, C are predictors. All columns contain missing values. Here is what I am trying to do with datawig, if I understand it correctly
This seems quite tedious if there are 1000 columns. How do I manage to run all those iterations?
My second question is about handling categorical data. The quick example section shows how to handle text, but what happens if the categorical data is a number (integer) in a given range while some other data are real numbers? How could I specify the data type? I am trying the following example
df = pd.DataFrame([[np.nan, 2, np.nan, 0],
[0, 3, np.nan, 1],
[np.nan, np.nan, np.nan, 1],
[1, 4, 3, 0],
[3, 1, 0, np.nan],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[2, 2, 1, 1],
[0, 3, 3, 1],
[1, 3, 0, np.nan],
[np.nan, np.nan, 0, np.nan],
],
columns = list('ABCY'))
df_train, df_test = datawig.utils.random_split(df)
categorial_encoder_cols = [CategoricalEncoder('A')]
label_encoder_cols = 'Y'
print(df)
imputer = datawig.SimpleImputer(
label_encoders=label_encoder_cols,
data_encoders=categorial_encoder_cols,
output_path = 'imputer_model' # stores model data and metrics
)
dout = imputer.fit(train_df=df_train)
but it fails with the error "TypeError: __init__() got an unexpected keyword argument 'label_encoders'"
Hi There,
I'm wondering if it's possible to have datawig replace an existing column instead of creating a new column?
For example, when I run df = imputer.predict(df), it takes the column I wanted to predict and adds copies with the suffixes _imputed and _imputed_proba.
How can I avoid that and just replace the df's column with the new value?
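One option is to post-process the returned frame in pandas. The sketch below assumes the suffixes described above (`_imputed` and `_imputed_proba`); the helper name `fill_from_imputed` and the sample data are illustrative, and this is a post-processing step rather than a datawig option.

```python
import numpy as np
import pandas as pd


def fill_from_imputed(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Fill gaps in `column` from `<column>_imputed`, then drop the helper columns."""
    result = df.copy()
    imputed = column + "_imputed"
    # Keep observed values; use the imputed value only where the original is missing.
    result[column] = result[column].fillna(result[imputed])
    helper_columns = [c for c in (imputed, column + "_imputed_proba") if c in result.columns]
    return result.drop(columns=helper_columns)


# Illustrative stand-in for the frame returned by imputer.predict(df).
predicted = pd.DataFrame({
    "city": ["paris", np.nan, "rome"],
    "city_imputed": ["paris", "rome", "rome"],
    "city_imputed_proba": [0.9, 0.7, 0.8],
})
cleaned = fill_from_imputed(predicted, "city")
print(cleaned)
```

To overwrite every value rather than only the missing ones, the `fillna` line could be replaced by a plain assignment from the `_imputed` column.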
The user guide at https://datawig.readthedocs.io/en/latest/source/userguide.html#step-by-step-examples contains a reference to a dropbox link that does not exist anymore.