embedder's Introduction

Overview

embedder is a small utility that simplifies the pre-processing, training and extraction of entity embeddings learned with neural networks.

Installation

It is recommended to create a virtual environment first.
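For example, using Python's built-in venv module:

python -m venv .venv
source .venv/bin/activate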

To install the package, clone the repository and run:

python setup.py install

Example – training

import pandas as pd
from embedder.preprocessing import categorize, pick_emb_dim, encode_categorical
from embedder.regression import Embedder

rossman = pd.read_csv('rossman.csv')
y = rossman['Sales']
X = rossman.drop('Sales', axis=1)
cat_vars = categorize(rossman)
embedding_dict = pick_emb_dim(cat_vars, max_dim=50)
X_encoded, encoders = encode_categorical(X)
embedder = Embedder(embedding_dict, model_json=None)
embedder.fit(X_encoded, y)

Let’s examine what embedder is trying to do here. 

  1. It determines the categorical variables in the data by examining data types of the columns in the pandas DataFrame and the number of unique categories in each variable. 
  2. Then, it prepares a dictionary of the variables to be embedded and the dimensionality of their embeddings. Recall that an embedding is a fixed-length vector representation of a category. Here, embedder determines embedding sizes using a rule of thumb: it simply takes the minimum of half the number of unique categories and the maximum dimensionality allowed, which is passed as an argument. These defaults have worked very well in my experience. However, nothing prevents a user from passing a different dictionary — it only has to be in the same format.
  3. The categorical variables are encoded using integer encoding, as this is the data type that Keras, and any other major deep learning framework, would expect. The encoders that map categories to integers are also returned — this may become useful to later assess learnt embeddings, e.g. by labelling them. Note that these pre-processing steps are only meant to simplify the process of preparing the data prior to training a neural network, but are not mandatory.
  4. Finally, the main class is instantiated and a neural network is fit on the pre-processed data. Two things to point out. First, by default embedder trains a feedforward network with two hidden layers, which is a sensible default but may not be optimal for all possible applications; the desired architecture can be passed as a json at class instantiation (see the sketch after this list). Second, by default a mean squared error loss function is used on regression tasks (and a cross-entropy loss on classification tasks) — again, a sensible default for the vanilla applications that embedder aims to simplify.
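
If you want a different architecture, one option is to build a Keras model yourself and pass its JSON serialisation as model_json. The sketch below is only illustrative: it assumes the dictionary maps each column to a (number of categories, embedding dimension) pair, mirrors the default layout (one Embedding input per categorical variable, concatenated and followed by dense layers), and ignores continuous columns for brevity.

from keras.layers import Input, Embedding, Flatten, Dense, concatenate
from keras.models import Model

# one Embedding input per categorical variable (dictionary format assumed)
inputs, flat = [], []
for col, (n_categories, emb_dim) in embedding_dict.items():
    inp = Input(shape=(1,))
    emb = Embedding(input_dim=n_categories, output_dim=emb_dim)(inp)
    inputs.append(inp)
    flat.append(Flatten()(emb))

x = concatenate(flat)
x = Dense(64, activation='relu')(x)
out = Dense(1)(x)

custom_json = Model(inputs=inputs, outputs=out).to_json()
embedder = Embedder(embedding_dict, model_json=custom_json)
embedder.fit(X_encoded, y)

Once fitted, the embedded representation can be extracted with embedder.transform(X_encoded), or in one step with fit_transform, as the issues below illustrate.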

Contribution

Any contributions are welcome, and you can email me for troubleshooting.

embedder's People

Contributors

dkn22, lukyanenkomax

embedder's Issues

Setting as_df=True in embedder.transform causes an error

I tried to set as_df=True in fit_transform but it causes an error:
Traceback (most recent call last):
File "emb_test.py", line 157, in
emb_data=embedder.fit_transform(data_encoded,y,as_df=True)
File "/opt/conda/lib/python3.6/site-packages/embedder/classification.py", line 73, in fit_transform
return self.transform(X, as_df=as_df)
File "/opt/conda/lib/python3.6/site-packages/embedder/base.py", line 136, in transform
names = [var + '_{}'.format(x) for x in range(emb_dim)
NameError: name 'emb_dim' is not defined
It seems like it is some kind of issue with list comprehensions.
This issue can be fixed by replacing lines 136 and 137 with
names = [var + '_{}'.format(x) for var, emb_dim in sizes for x in range(emb_dim)]
The next error after applying this fix is:
Traceback (most recent call last):
File "emb_test.py", line 157, in
emb_data=embedder.fit_transform(data_encoded,y,as_df=True)
File "/opt/conda/lib/python3.6/site-packages/embedder/classification.py", line 73, in fit_transform
return self.transform(X, as_df=as_df)
File "/opt/conda/lib/python3.6/site-packages/embedder/base.py", line 140, in transform
embedded = pd.DataFrame(embedded, columns=names)
NameError: name 'pd' is not defined
It can be fixed by adding import pandas as pd at the top of base.py.
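Putting the two fixes together, the corrected block in base.py would look roughly like this (a sketch, assuming sizes is a list of (variable, embedding dimension) pairs and embedded is the stacked embedding matrix, as the quoted code suggests):

import pandas as pd  # added at the top of base.py

names = [var + '_{}'.format(x)
         for var, emb_dim in sizes
         for x in range(emb_dim)]
embedded = pd.DataFrame(embedded, columns=names)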

train, test split

How would you do a train/test split with the pipeline? I get an error when I run the xgboost regression after encoding, passing X_train and X_test like this:

X_train, X_test, y_train, y_test = train_test_split(X_encoded, y, test_size=0.2, random_state=2001)
xgb_train = xgboost.DMatrix(X_train, label=y_train)
xgb_test = xgboost.DMatrix(X_test, label=y_test)

eval_set = [(X_train, y_train), (X_test, y_test)]
xgb_model.fit(X_train, y_train, model__eval_set=eval_set, model__verbose=True,
              model__early_stopping_rounds=50);

The error tells me that the variable names 'f0', etc. (which correspond to the embedded categories) can't be found in the list of variables that weren't encoded.
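
One possible workaround (a hedged sketch, not a fix from the maintainer): transform the encoded data into its embedded representation first, then split, so that xgboost only ever sees plain numeric columns:

from sklearn.model_selection import train_test_split
import xgboost

# embed first, then split
X_emb = embedder.transform(X_encoded)
X_train, X_test, y_train, y_test = train_test_split(
    X_emb, y, test_size=0.2, random_state=2001)

xgb_train = xgboost.DMatrix(X_train, label=y_train)
xgb_test = xgboost.DMatrix(X_test, label=y_test)
booster = xgboost.train({'objective': 'reg:squarederror'}, xgb_train,
                        evals=[(xgb_test, 'test')],
                        early_stopping_rounds=50)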

Returned embedded values

Hey, thanks for the wrapper.
Are the returned values after embedding sorted?
For example, the DayOfWeek column in the Rossmann dataset can take values from 1 to 7. After the embedding is done, I get a matrix containing the embedded values. Is it safe to assume that the first embedded value corresponds to DayOfWeek=1, the second to DayOfWeek=2, and so on?

Thanks

MXNetError while fitting Embedder

When trying to run the following flow:

from embedder.preprocessing import categorize, pick_emb_dim, encode_categorical
from embedder.classification import Embedder

categorical_variable_count = categorize(X)
dict_embedding_size = pick_emb_dim(categorical_variable_count, max_dim=50)
X_encoded, encoders = encode_categorical(X)
embedder = Embedder(dict_embedding_size, model_json=None)
embedder.fit(X_encoded, y)

I get the following exception:

---------------------------------------------------------------------------
MXNetError                                Traceback (most recent call last)
<ipython-input-11-0ccee6a2b185> in <module>
      5 from embedder.classification import Embedder
      6 embedder = Embedder(dict_embedding_size, model_json=None)
----> 7 embedder.fit(X_encoded, y)
      8 
      9 print('PreProcessing time: '+ str(datetime.now() - start_time))

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/embedder/classification.py in fit(self, X, y, batch_size, epochs, checkpoint, early_stop)
     31         '''
     32 
---> 33         nnet = self._create_model(X, model_json=self.model_json)
     34 
     35         nnet.compile(loss='binary_crossentropy',

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/embedder/base.py in _create_model(self, X, model_json)
     85         if model_json is None:
     86             if hasattr(self, '_default_nnet'):
---> 87                 nnet = self._default_nnet(X)
     88             else:
     89                 raise ValueError('No model architecture provided.')

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/embedder/classification.py in _default_nnet(self, X)
     97         flatten = concatenate(flatten_layers, axis=-1)
     98 
---> 99         fc1 = Dense(1000, kernel_initializer='normal')(flatten)
    100         fc1 = Activation('relu')(fc1)
    101         # fc1 = BatchNormalization(fc1)

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/engine/base_layer.py in __call__(self, inputs, **kwargs)
    468             # Actually call the layer,
    469             # collecting output(s), mask(s), and shape(s).
--> 470             output = self.call(inputs, **kwargs)
    471             output_mask = self.compute_mask(inputs, previous_mask)
    472 

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/layers/core.py in call(self, inputs)
    891         output = K.dot(inputs, self.kernel)
    892         if self.use_bias:
--> 893             output = K.bias_add(output, self.bias, data_format='channels_last')
    894         if self.activation is not None:
    895             output = self.activation(output)

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/backend/mxnet_backend.py in func_wrapper(*args, **kwargs)
     92                 # Create Train Symbol
     93                 set_learning_phase(1)
---> 94                 train_symbol = func(*args, **kwargs)
     95                 # Create Test Symbol
     96                 set_learning_phase(0)

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/backend/mxnet_backend.py in bias_add(x, bias, data_format)
   3980         raise ValueError('MXNet Backend: Unknown data_format ' + str(data_format))
   3981     bias_shape = int_shape(bias)
-> 3982     x_dim = ndim(x)
   3983     if len(bias_shape) != 1 and len(bias_shape) != x_dim - 1:
   3984         raise ValueError('MXNet Backend: Unexpected bias dimensions %d, expect to be 1 or %d dimensions'

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/backend/mxnet_backend.py in ndim(x)
    533     ```
    534     """
--> 535     shape = x.shape
    536     if shape is not None:
    537         return len(shape)

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/backend/mxnet_backend.py in shape(self)
   4393     @property
   4394     def shape(self):
-> 4395         return self._get_shape()
   4396 
   4397     def eval(self):

~/anaconda3/envs/mxnet_latest_p37/lib/python3.7/site-packages/keras/backend/mxnet_backend.py in _get_shape(self)
   4402             return self._keras_shape
   4403         else:
-> 4404             _, out_shape, _ = self.symbol.infer_shape_partial()
   4405             return out_shape[0]
   4406 

~/anaconda3/envs/mxnet_latest_p37/cpu/lib/python3.7/site-packages/mxnet/symbol/symbol.py in infer_shape_partial(self, *args, **kwargs)
   1175             The order is same as the order of list_auxiliary_states().
   1176         """
-> 1177         return self._infer_shape_impl(True, *args, **kwargs)
   1178 
   1179     def _infer_shape_impl(self, partial, *args, **kwargs):

~/anaconda3/envs/mxnet_latest_p37/cpu/lib/python3.7/site-packages/mxnet/symbol/symbol.py in _infer_shape_impl(self, partial, *args, **kwargs)
   1263                 ctypes.byref(aux_shape_ndim),
   1264                 ctypes.byref(aux_shape_data),
-> 1265                 ctypes.byref(complete)))
   1266         if complete.value != 0:
   1267             arg_shapes = [tuple(arg_shape_data[i][:arg_shape_ndim[i]])

~/anaconda3/envs/mxnet_latest_p37/cpu/lib/python3.7/site-packages/mxnet/base.py in check_call(ret)
    244     """
    245     if ret != 0:
--> 246         raise get_last_ffi_error()
    247 
    248 

MXNetError: MXNetError: Error in operator dot0: [19:57:48] src/operator/tensor/./dot-inl.h:1241: Check failed: L[!Ta].Size() == R[Tb].Size() (76 vs. 292) : dot shape error: [-1,76] X [292,1000]

Have you seen this before?
