ml_sentiment_analysis's People

Contributors

jonbeibeibei, woshibiantai

ml_sentiment_analysis's Issues

Estimating transition parameters

Write a function that estimates the transition parameters from the training set using MLE (maximum likelihood estimation):

q(y(i) | y(i−1)) = Count(y(i−1), y(i)) / Count(y(i−1))

Please make sure the following special cases are also considered: q(STOP|y(n)) and q(y(1)|START).
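A minimal sketch of this estimate, assuming each training sentence is already available as a list of tags. The function name `estimate_transitions` and the literal strings "START" and "STOP" are illustrative choices, not taken from the repository. Padding every sequence with START and STOP lets the two special cases fall out of the same counting loop.

```python
from collections import Counter

START, STOP = "START", "STOP"  # assumed names for the boundary states

def estimate_transitions(tag_sequences):
    """Return {(prev_tag, tag): q(tag | prev_tag)} estimated by MLE."""
    pair_counts = Counter()
    prev_counts = Counter()
    for tags in tag_sequences:
        # Padding covers q(y(1) | START) and q(STOP | y(n)).
        path = [START] + list(tags) + [STOP]
        for prev, cur in zip(path, path[1:]):
            pair_counts[(prev, cur)] += 1
            prev_counts[prev] += 1
    return {pair: count / prev_counts[pair[0]]
            for pair, count in pair_counts.items()}

q = estimate_transitions([["O", "B-positive", "O"], ["O", "O"]])
```

Pairs that never occur in training are simply absent from the dictionary, so a caller would treat a missing key as probability 0.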

Adding #UNK# to training set

One problem with estimating the emission parameters is that some words that appear in the test set do not appear in the training set. One simple way to handle this is as follows. First, before training, replace every word that appears fewer than k times in the training set with a special token #UNK#. This gives a “modified training set”, which we then use to train the model.

During the testing phase, if the word does not appear in the “modified training set”, we replace that word with #UNK# as well.

Set k to 3 and implement this fix in your function for computing the emission parameters.
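The replacement step could look like the sketch below, assuming the training data is held as a list of sentences, each a list of (word, tag) pairs. The helper names `replace_rare_words` and `map_test_word` are hypothetical.

```python
from collections import Counter

UNK = "#UNK#"

def replace_rare_words(sentences, k=3):
    """Replace words seen fewer than k times with #UNK#.

    Returns the modified sentences and the set of words that were kept.
    """
    counts = Counter(word for sent in sentences for word, _ in sent)
    vocab = {word for word, count in counts.items() if count >= k}
    modified = [[(word if word in vocab else UNK, tag) for word, tag in sent]
                for sent in sentences]
    return modified, vocab

def map_test_word(word, vocab):
    """At test time, any word outside the kept vocabulary becomes #UNK#."""
    return word if word in vocab else UNK
```

Returning the kept vocabulary alongside the modified data means the test phase can apply the same mapping without recounting.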

Parser

Read training data into usable class
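The issue does not describe the file layout, so the sketch below assumes a common one: one token per line with the word and its tag separated by whitespace, and a blank line between sentences. Both the layout and the name `parse_training_lines` are assumptions to adjust against the real data.

```python
def parse_training_lines(lines):
    """Group 'word tag' lines into sentences of (word, tag) pairs."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        # Split on the last whitespace run so the tag is always the final field.
        word, tag = line.rsplit(maxsplit=1)
        current.append((word, tag))
    if current:
        sentences.append(current)
    return sentences
```

Because the function accepts any iterable of strings, it works on an open file handle as well as on a list of lines.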

Function for emission parameters

Point 1 of part 2:

Write a function that estimates the emission parameters from the training set using MLE
(maximum likelihood estimation):
e(x|y) = Count(y → x) / Count(y)
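A minimal sketch of this count-and-divide estimate, again assuming sentences are lists of (word, tag) pairs. The name `estimate_emissions` is illustrative.

```python
from collections import Counter

def estimate_emissions(sentences):
    """Return {(tag, word): e(word | tag)} estimated by MLE."""
    emit_counts = Counter()
    tag_counts = Counter()
    for sent in sentences:
        for word, tag in sent:
            emit_counts[(tag, word)] += 1  # Count(y -> x)
            tag_counts[tag] += 1           # Count(y)
    return {(tag, word): count / tag_counts[tag]
            for (tag, word), count in emit_counts.items()}
```

For each tag the returned probabilities sum to 1 over the words observed with that tag.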

Sentiment analysis

Implement a simple sentiment analysis system that produces the tag
y* = argmax_y e(x|y)
for each word x in the sequence.
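One way to sketch this baseline, assuming the emission parameters are stored as a dictionary keyed by (tag, word). The function names and the fallback to "O" when no #UNK# entry exists are assumptions.

```python
def best_tag_per_word(emissions):
    """For each word x, pick the tag y that maximises e(x | y)."""
    best = {}
    for (tag, word), prob in emissions.items():
        if word not in best or prob > best[word][1]:
            best[word] = (tag, prob)
    return {word: tag for word, (tag, _) in best.items()}

def tag_sentence(words, best_tags, unk="#UNK#"):
    """Tag each word independently; unseen words use the #UNK# entry."""
    fallback = best_tags.get(unk, "O")  # assumed default if #UNK# is missing
    return [best_tags.get(word, fallback) for word in words]
```

Precomputing the best tag per word keeps tagging a single dictionary lookup per token.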

For each of the four datasets EN, FR, CN, and SG, learn these parameters from train and evaluate your system on the development set dev.in. Write your output to dev.p2.out for each dataset. Compare your outputs with the gold-standard outputs in dev.out, and report the precision, recall and F scores of this baseline system for each dataset.

The precision score is defined as follows:
Precision = Total number of correctly predicted entities / Total number of predicted entities

The recall score is defined as follows:
Recall = Total number of correctly predicted entities / Total number of gold entities

where a gold entity is a true entity that is annotated in the reference output file, and a predicted entity is regarded as correct if and only if it matches exactly the gold entity (i.e., both their boundaries and sentiment are exactly the same).

Finally, the F score is defined as follows:
F = 2 / (1/Precision + 1/Recall)

You can use the evaluation script shared with you to calculate these scores. However, you are strongly encouraged to understand how the scores are calculated.

Note: in some cases, you might have an output sequence that consists of a transition from O to I-negative (rather than B-negative). For example, “O I-negative I-negative O”. In this case, the second and third words should be regarded as one entity with negative sentiment.
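The scoring definitions above can be sketched as follows. This is not the shared evaluation script, only an illustration under stated assumptions: tags look like "O", "B-negative" or "I-negative", an I- tag with no open entity starts a new one (as in the note above), and an I- tag whose sentiment differs from the open entity also starts a new one. The names `extract_entities` and `score` are hypothetical.

```python
def extract_entities(tags):
    """Return a set of (start, end, sentiment) spans; end is exclusive."""
    entities = set()
    start, sentiment = None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last entity
        prefix, _, label = tag.partition("-")
        if prefix == "I" and start is not None and label == sentiment:
            continue  # the open entity keeps going
        if start is not None:
            entities.add((start, i, sentiment))
            start, sentiment = None, None
        if prefix in ("B", "I"):
            start, sentiment = i, label  # an I- after O also opens an entity
    return entities

def score(gold_sequences, predicted_sequences):
    """Return (precision, recall, F) over exactly matching entities."""
    correct = n_pred = n_gold = 0
    for gold, pred in zip(gold_sequences, predicted_sequences):
        g, p = extract_entities(gold), extract_entities(pred)
        correct += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0  # same as 2 / (1/P + 1/R)
    return precision, recall, f
```

Storing entities as (start, end, sentiment) tuples makes "boundaries and sentiment exactly the same" a plain set intersection.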

Implement the Viterbi algorithm

Using the estimated transition and emission parameters, implement the Viterbi algorithm to
compute the following (for a sentence with n words):

y(1)*, ..., y(n)* = argmax_{y(1), ..., y(n)} p(x(1), ..., x(n), y(1), ..., y(n))

For all datasets, learn the model parameters from train. Run the Viterbi algorithm on the development set dev.in using the learned models, and write your output to dev.p3.out for each of the four datasets. Report the precision, recall and F scores of all systems.

Note: if you encounter numerical underflow, think of a way to address it in your implementation.
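One common way to avoid underflow is to add log probabilities instead of multiplying probabilities. The sketch below takes that approach; it is one possible implementation, not the required one. It assumes `q` maps (prev_tag, tag) to a probability with "START" and "STOP" as boundary states, `e` maps (tag, word) to a probability, and missing keys mean probability 0. All names are illustrative.

```python
import math

def viterbi(words, tags, q, e, unk="#UNK#"):
    """Return the most probable tag sequence for words, scored in log space."""
    if not words:
        return []

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # Words never seen in training are mapped to #UNK#.
    known = {word for _, word in e}
    words = [w if w in known else unk for w in words]

    # pi[i][t]: best log score of any tag path ending in tag t at position i.
    pi = [{t: log(q.get(("START", t), 0)) + log(e.get((t, words[0]), 0))
           for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        pi.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda u: pi[i - 1][u] + log(q.get((u, t), 0)))
            pi[i][t] = (pi[i - 1][prev] + log(q.get((prev, t), 0))
                        + log(e.get((t, words[i]), 0)))
            back[i][t] = prev

    # Final step: include the transition into STOP, then follow back-pointers.
    last = max(tags, key=lambda u: pi[-1][u] + log(q.get((u, "STOP"), 0)))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]
```

Zero probabilities become negative infinity, so impossible paths lose every comparison without raising a math domain error.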
