Git Product home page Git Product logo

sentiment-analysis's Introduction

Sentiment Analysis results

Data: 102582 train sentiments, 34194 test sentiments, target: int in [1, 5]. Scoring: categorical_accuracy.

Preprocessing

  1. Delete nltk.corpus.stopwords.

  2. Filter word frequences: delete words with frequence in test and train less than 2.

  3. Delete all non alpha-num words.

  4. Coding all test and train sentiments with keras.preprocessing.text.Tokenizer

  5. Pad the lest side of encoded sentiments.

sklearn.ensemble.RandomForestClassifier

train_size test_size n_estimators score on test training time
51272 51272 50 0.33 < 1 min
51272 51272 400 0.35 ~ 10-20 min

sklearn.svm.SVC

train_size test_size kernel score on test training time
51272 51272 rbf ? > 3 h
51272 51272 linear 0.24 < 1 min

Pretrained word embedding + Dense NN

Pretrained glove http://nlp.stanford.edu/projects/glove/ dictionary: 6B tokens; dim=100; 400k different words. Neural network architecture:

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding_layer
Flatten()
Dense(300, activation='relu')
Dense(128, activation='relu')
out = Dense(5, activation='softmax')

model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
train_size test_size batch_size nb_epoch score training time
51272 51272 128 2 0.41 ~ 20 min

Pretrained word embedding + LSTM

Pretrained glove dictionary: 6B tokens; dim=100; 400k different words. Neural network architecture:

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding_layer
LSTM(50)
out = Dense(5, activation='softmax')
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
train_size test_size batch_size nb_epoch score training time
51272 51272 128 2 0.47 ~ 60 min

Pretrained word embedding + double LSTM

Pretrained glove dictionary: 6B tokens; dim=100; 400k different words. Neural network architecture:

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding_layer
LSTM(50, return_sequences=True)
LSTM(50, W_regularizer='l2')
out = Dense(5, activation='softmax')
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['acc'])
train_size test_size batch_size nb_epoch score training time
51272 51272 128 2 0.42 ~ 2h 30min

Pretrained word embedding + LSTM

Pretrained glove dictionary: 840B tokens; dim=300; 2.2m different words. Neural network architecture:

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding_layer
LSTM(150, W_regularizer='l2')
Dropout(0.25)
Dense(30, activation='relu', W_regularizer='l2')
out = Dense(5, activation='softmax', W_regularizer='l2
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

Total params: 275285

train_size test_size batch_size nb_epoch score training + test time
51272 51272 128 1 0.4499 ~ 90 min
51272 51272 128 2 0.5035 ~ 90 min
51272 51272 128 3 0.5170 ~ 90 min

Pretrained word embedding + LSTM

Pretrained glove dictionary: 840B tokens; dim=300; 2.2m different words. Neural network architecture:

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding_layer
LSTM(50, W_regularizer='l2')
Dropout(0.25)
Dense(25, activation='relu', W_regularizer='l2')
out = Dense(5, activation='softmax', W_regularizer='l2)
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

Total params: 71605

train_size test_size batch_size nb_epoch score training + test time
51272 51272 128 1 0.4927 ~ 25 min
51272 51272 128 2 0.4929 ~ 25 min
51272 51272 128 3 0.5261 ~ 25 min

Pretrained word embedding + LSTM

pretrained glove dictionary: 840B tokens; dim=300; 2.2m different words. Neural network architecture:

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding_layer
LSTM(25, W_regularizer='l2')
Dropout(0.25)
Dense(30, activation='relu', W_regularizer='l2')
out = Dense(5, activation='softmax', W_regularizer='l2')
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

Total params: 33535

train_size | batch_size | nb_epoch |public leaderboard score | training time ------------ | ------------- | ----------| ---------- | ------------- | ---------- 102582 | 128 | 7 | 0.54056 | ~ 3 h

Mixture: (Pretrained word embedding + LSTM) and sklearn.ensemble.RandomForestClassifier

Grid mixture coefficient with 51272 train and 51272 test examples. After that train on all train data RF with 400 trees, NN with batch_size 128, nb_epoch = 15. Neural network architecture:

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding_layer
LSTM(25, W_regularizer='l2')
Dropout(0.25)
Dense(40, activation='relu', W_regularizer='l2')
out = Dense(5, activation='softmax', W_regularizer='l2')
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

Best mixture is 0.959 * NN + (1 - 0.959) * RF.

train_size public leaderboard score private leaderboard score training time
102582 0.55132 0.55513 ~ 7 h

Final model: pretrained word embedding + LSTM

Neural network architecture:

sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')
embedding_layer
LSTM(25, W_regularizer='l2')
Dropout(0.25)
Dense(40, activation='relu', W_regularizer='l2')
out = Dense(5, activation='softmax', W_regularizer='l2')
model = Model(sequence_input, out)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['acc'])

Total params: 33845

train_size batch_size nb_epoch public leaderboard score private leaderboard score training time
102582 128 7 0.55472 0.55559 ~ 7 h

sentiment-analysis's People

Stargazers

 avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.