embedder's Introduction

Overview

embedder is a small utility that simplifies the pre-processing, training and extraction of entity embeddings learned with neural networks.

Installation

It is recommended to create a virtual environment first.
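For example, using Python's built-in venv module:

python -m venv .venv
source .venv/bin/activate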

To install the package, clone the repository and run:

python setup.py install

Example – training

import pandas as pd
from embedder.preprocessing import categorize, pick_emb_dim, encode_categorical
from embedder.regression import Embedder

rossman = pd.read_csv('rossman.csv')
y = rossman['Sales']
X = rossman.drop('Sales', axis=1)
cat_vars = categorize(rossman)
embedding_dict = pick_emb_dim(cat_vars, max_dim=50)
X_encoded, encoders = encode_categorical(X)
embedder = Embedder(embedding_dict, model_json=None)
embedder.fit(X_encoded, y)

Let’s examine what embedder is trying to do here. 

  1. It determines the categorical variables in the data by examining data types of the columns in the pandas DataFrame and the number of unique categories in each variable. 
  2. Then, it prepares a dictionary of the variables to be embedded and the dimensionality of their embeddings. Recall that an embedding is a fixed-length vector representation of a category. Here, embedder determines embedding sizes using a rule of thumb: it simply takes the minimum of half the number of unique categories and the maximum dimensionality allowed, which is passed as an argument. These defaults have worked very well in my experience. However, nothing prevents a user from passing a different dictionary — it only has to be in the same format.
  3. The categorical variables are encoded using integer encoding, as this is the data type that Keras, and any other major deep learning framework, would expect. The encoders that map categories to integers are also returned — this may become useful to later assess learnt embeddings, e.g. by labelling them. Note that these pre-processing steps are only meant to simplify the process of preparing the data prior to training a neural network, but are not mandatory.
  4. Finally, the main class is instantiated and a neural network is fit on the pre-processed data. Two things to point out. First, by default embedder trains a feedforward network with two hidden layers, which is a sensible default but may not be optimal for all possible applications; the desired architecture can be passed as a json at class instantiation (see the sketch after this list). Second, by default a mean squared error loss function is used on regression tasks (and a cross-entropy loss on classification tasks) — again, a sensible default for the vanilla applications that embedder aims to simplify.
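
If you want a different architecture, one option is to build a Keras model yourself and pass its JSON serialisation as model_json. The sketch below is only illustrative: it assumes the dictionary maps each column to a (number of categories, embedding dimension) pair, mirrors the default layout (one Embedding input per categorical variable, concatenated and followed by dense layers), and ignores continuous columns for brevity.

from keras.layers import Input, Embedding, Flatten, Dense, concatenate
from keras.models import Model

# one Embedding input per categorical variable (dictionary format assumed)
inputs, flat = [], []
for col, (n_categories, emb_dim) in embedding_dict.items():
    inp = Input(shape=(1,))
    emb = Embedding(input_dim=n_categories, output_dim=emb_dim)(inp)
    inputs.append(inp)
    flat.append(Flatten()(emb))

x = concatenate(flat)
x = Dense(64, activation='relu')(x)
out = Dense(1)(x)

custom_json = Model(inputs=inputs, outputs=out).to_json()
embedder = Embedder(embedding_dict, model_json=custom_json)
embedder.fit(X_encoded, y)

Once fitted, the embedded representation can be extracted with embedder.transform(X_encoded), or in one step with fit_transform, as the issues below illustrate.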

Contribution

Any contributions are welcome, and you can email me for troubleshooting.

embedder's People

Contributors

dkn22, lukyanenkomax

embedder's Issues

Setting as_df=True in embedder.transform causes an error

I tried to set as_df=True in fit_transform but it causes an error:
Traceback (most recent call last):
File "emb_test.py", line 157, in
emb_data=embedder.fit_transform(data_encoded,y,as_df=True)
File "/opt/conda/lib/python3.6/site-packages/embedder/classification.py", line 73, in fit_transform
return self.transform(X, as_df=as_df)
File "/opt/conda/lib/python3.6/site-packages/embedder/base.py", line 136, in transform
names = [var + '_{}'.format(x) for x in range(emb_dim)
NameError: name 'emb_dim' is not defined
It seems like it is some kind of issue with list comprehensions.
This issue can be fixed by replacing lines 136 and 137 with
names = [var + '_{}'.format(x) for var, emb_dim in sizes for x in range(emb_dim)]
The next error after applying this fix is:
Traceback (most recent call last):
File "emb_test.py", line 157, in
emb_data=embedder.fit_transform(data_encoded,y,as_df=True)
File "/opt/conda/lib/python3.6/site-packages/embedder/classification.py", line 73, in fit_transform
return self.transform(X, as_df=as_df)
File "/opt/conda/lib/python3.6/site-packages/embedder/base.py", line 140, in transform
embedded = pd.DataFrame(embedded, columns=names)
NameError: name 'pd' is not defined
It can be fixed by adding import pandas as pd at the top of base.py.
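Putting the two fixes together, the corrected block in base.py would look roughly like this (a sketch, assuming sizes is a list of (variable, embedding dimension) pairs and embedded is the stacked embedding matrix, as the quoted code suggests):

import pandas as pd  # added at the top of base.py

names = [var + '_{}'.format(x)
         for var, emb_dim in sizes
         for x in range(emb_dim)]
embedded = pd.DataFrame(embedded, columns=names)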

train, test split

How would you do a train/test split with the pipeline? I get an error when I run the xgboost regression after encoding, passing X_train and X_test like this:

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=2001)
xgb_train = xgboost.DMatrix(X_train, label=y_train)
xgb_test = xgboost.DMatrix(X_test, label=y_test)

eval_set = [(X_train, y_train), (X_test, y_test)]
xgb_model.fit(X_train, y_train, model__eval_set=eval_set, model__verbose=True,
              model__early_stopping_rounds=50);

The error tells me that the variable names 'f0', etc. (which correspond to the embedded categories) can't be found in the list of variables that weren't encoded.
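
One possible workaround (a hedged sketch, not a fix from the maintainer): transform the encoded data into its embedded representation first, then split, so that xgboost only ever sees plain numeric columns:

from sklearn.model_selection import train_test_split
import xgboost

# embed first, then split
X_emb = embedder.transform(X_encoded)
X_train, X_test, y_train, y_test = train_test_split(
    X_emb, y, test_size=0.2, random_state=2001)

xgb_train = xgboost.DMatrix(X_train, label=y_train)
xgb_test = xgboost.DMatrix(X_test, label=y_test)
booster = xgboost.train({'objective': 'reg:squarederror'}, xgb_train,
                        evals=[(xgb_test, 'test')],
                        early_stopping_rounds=50)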

Returned embedded values

Hey, thanks for the wrapper.
Are the returned values after embedding sorted?
For example, the DayOfWeek column in the Rossmann dataset can take values from 1 to 7. After the embedding is done, I get a matrix containing the embedded values. Is it safe to assume that the first embedded value corresponds to DayOfWeek=1, the second to DayOfWeek=2, and so on?

Thanks

MXNetError while fitting Embedder

When trying to run the following flow:

from embedder.preprocessing import categorize, pick_emb_dim, encode_categorical
from embedder.classification import Embedder

categorical_variable_count = categorize(X)
dict_embedding_size = pick_emb_dim(categorical_variable_count, max_dim=50)
X_encoded, encoders = encode_categorical(X)
embedder = Embedder(dict_embedding_size, model_json=None)
embedder.fit(X_encoded, y)

I get the following exception:

---------------------------------------------------------------------------
MXNetError                                Traceback (most recent call last)
<ipython-input-11-0ccee6a2b185> in <module>
      5 from embedder.classification import Embedder
      6 embedder = Embedder(dict_embedding_size, model_json=None)
----> 7 embedder.fit(X_encoded, y)
      8 
      9 print('PreProcessing time: '+ str(datetime.now() - start_time))

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/embedder/classification.py in fit(self, X, y, batch_size, epochs, checkpoint, early_stop)
     31         '''
     32 
---> 33         nnet = self._create_model(X, model_json=self.model_json)
     34 
     35         nnet.compile(loss='binary_crossentropy',

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/embedder/base.py in _create_model(self, X, model_json)
     85         if model_json is None:
     86             if hasattr(self, '_default_nnet'):
---> 87                 nnet = self._default_nnet(X)
     88             else:
     89                 raise ValueError('No model architecture provided.')

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/embedder/classification.py in _default_nnet(self, X)
     97         flatten = concatenate(flatten_layers, axis=-1)
     98 
---> 99         fc1 = Dense(1000, kernel_initializer='normal')(flatten)
    100         fc1 = Activation('relu')(fc1)
    101         # fc1 = BatchNormalization(fc1)

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/engine/base_layer.py in __call__(self, inputs, **kwargs)
    468             # Actually call the layer,
    469             # collecting output(s), mask(s), and shape(s).
--> 470             output = self.call(inputs, **kwargs)
    471             output_mask = self.compute_mask(inputs, previous_mask)
    472 

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/layers/core.py in call(self, inputs)
    891         output = K.dot(inputs, self.kernel)
    892         if self.use_bias:
--> 893             output = K.bias_add(output, self.bias, data_format='channels_last')
    894         if self.activation is not None:
    895             output = self.activation(output)

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/backend/mxnet_backend.py in func_wrapper(*args, **kwargs)
     92                 # Create Train Symbol
     93                 set_learning_phase(1)
---> 94                 train_symbol = func(*args, **kwargs)
     95                 # Create Test Symbol
     96                 set_learning_phase(0)

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/backend/mxnet_backend.py in bias_add(x, bias, data_format)
   3980         raise ValueError('MXNet Backend: Unknown data_format ' + str(data_format))
   3981     bias_shape = int_shape(bias)
-> 3982     x_dim = ndim(x)
   3983     if len(bias_shape) != 1 and len(bias_shape) != x_dim - 1:
   3984         raise ValueError('MXNet Backend: Unexpected bias dimensions %d, expect to be 1 or %d dimensions'

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/backend/mxnet_backend.py in ndim(x)
    533     ```
    534     """
--> 535     shape = x.shape
    536     if shape is not None:
    537         return len(shape)

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/backend/mxnet_backend.py in shape(self)
   4393     @property
   4394     def shape(self):
-> 4395         return self._get_shape()
   4396 
   4397     def eval(self):

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/backend/mxnet_backend.py in _get_shape(self)
   4402             return self._keras_shape
   4403         else:
-> 4404             _, out_shape, _ = self.symbol.infer_shape_partial()
   4405             return out_shape[0]
   4406 

~/anaconda3/envs/mxnet_latest_p37/cpu/lib/python3.7/site-packages/mxnet/symbol/symbol.py in infer_shape_partial(self, *args, **kwargs)
   1175             The order is same as the order of list_auxiliary_states().
   1176         """
-> 1177         return self._infer_shape_impl(True, *args, **kwargs)
   1178 
   1179     def _infer_shape_impl(self, partial, *args, **kwargs):

~/anaconda3/envs/mxnet_latest_p37/cpu/lib/python3.7/site-packages/mxnet/symbol/symbol.py in _infer_shape_impl(self, partial, *args, **kwargs)
   1263                 ctypes.byref(aux_shape_ndim),
   1264                 ctypes.byref(aux_shape_data),
-> 1265                 ctypes.byref(complete)))
   1266         if complete.value != 0:
   1267             arg_shapes = [tuple(arg_shape_data[i][:arg_shape_ndim[i]])

~/anaconda3/envs/mxnet_latest_p37/cpu/lib/python3.7/site-packages/mxnet/base.py in check_call(ret)
    244     """
    245     if ret != 0:
--> 246         raise get_last_ffi_error()
    247 
    248 

MXNetError: MXNetError: Error in operator dot0: [19:57:48] src/operator/tensor/./dot-inl.h:1241: Check failed: L[!Ta].Size() == R[Tb].Size() (76 vs. 292) : dot shape error: [-1,76] X [292,1000]

Have you seen this before?
