jonbeibeibei / ml_sentiment_analysis
Developing automated systems for analyzing sentiment information associated with social media data
Write a function that estimates the transition parameters from the training set using MLE (maximum likelihood estimation):
q(y(i) | y(i−1)) = Count(y(i−1), y(i)) / Count(y(i−1))
Please make sure the following special cases are also considered: q(STOP|y(n)) and q(y(1)|START).
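A minimal sketch of the transition estimate, including the two special cases. It assumes each training sentence is already available as a list of tags; the function name `estimate_transitions` and the literal `START` / `STOP` labels are illustrative choices, not names taken from the project code.

```python
from collections import Counter


def estimate_transitions(tag_sequences):
    """Return MLE estimates q(next | prev) = Count(prev, next) / Count(prev)."""
    pair_counts = Counter()
    prev_counts = Counter()
    for tags in tag_sequences:
        # Pad every sentence so q(y(1)|START) and q(STOP|y(n)) are counted too.
        path = ["START"] + list(tags) + ["STOP"]
        for prev, nxt in zip(path, path[1:]):
            pair_counts[(prev, nxt)] += 1
            prev_counts[prev] += 1
    return {
        (prev, nxt): count / prev_counts[prev]
        for (prev, nxt), count in pair_counts.items()
    }


# Toy data (made up for illustration): two tagged sentences.
q = estimate_transitions([["O", "B-positive", "O"], ["O", "O"]])
```

Transitions that never occur in training are simply absent from the dictionary, so a caller would treat a missing key as probability 0.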
One problem with estimating the emission parameters is that some words that appear in the test set do not appear in the training set. One simple way to handle this is as follows. First, replace every word that appears fewer than k times in the training set with a special token #UNK# before training. This yields a “modified training set”, which we then use to train our model.
During the testing phase, if the word does not appear in the “modified training set”, we replace that word with #UNK# as well.
Set k to 3 and implement this fix in your function for computing the emission parameters.
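The replacement step could be sketched as follows. The sketch assumes each sentence is a list of (word, tag) pairs; the helper names `replace_rare_words` and `map_test_words` are hypothetical and only serve the illustration.

```python
from collections import Counter

UNK = "#UNK#"


def replace_rare_words(sentences, k=3):
    """Replace words seen fewer than k times with #UNK#.

    Returns the modified training set and the set of words that were kept.
    """
    counts = Counter(word for sent in sentences for word, _ in sent)
    vocab = {word for word, count in counts.items() if count >= k}
    modified = [
        [(word if word in vocab else UNK, tag) for word, tag in sent]
        for sent in sentences
    ]
    return modified, vocab


def map_test_words(words, vocab):
    """At test time, map any word missing from the modified training set to #UNK#."""
    return [word if word in vocab else UNK for word in words]


# Toy data (made up): "good" occurs 3 times, the other words once each.
train = [
    [("good", "B-positive"), ("food", "O")],
    [("good", "B-positive")],
    [("good", "B-positive"), ("bad", "B-negative")],
]
modified, vocab = replace_rare_words(train, k=3)
test_words = map_test_words(["good", "pizza"], vocab)
```

The emission counts are then collected from `modified` rather than from the raw training data, so that e(#UNK#|y) receives a non-zero estimate.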
Read training data into usable class
Point 1 of part 2:
Write a function that estimates the emission parameters from the training set using MLE (maximum likelihood estimation):
e(x|y) = Count(y → x) / Count(y)
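One way to compute this estimate, sketched under the assumption that the training data is a list of sentences, each a list of (word, tag) pairs. The name `estimate_emissions` is illustrative.

```python
from collections import Counter


def estimate_emissions(sentences):
    """Return MLE estimates e(word | tag) = Count(tag -> word) / Count(tag)."""
    pair_counts = Counter()
    tag_counts = Counter()
    for sent in sentences:
        for word, tag in sent:
            pair_counts[(tag, word)] += 1
            tag_counts[tag] += 1
    return {
        (tag, word): count / tag_counts[tag]
        for (tag, word), count in pair_counts.items()
    }


# Toy data (made up): the tag O emits "the" twice and "service" once.
e = estimate_emissions([
    [("the", "O"), ("food", "B-positive")],
    [("the", "O"), ("service", "O")],
])
```

For a fixed tag, the estimates over all emitted words sum to 1.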
Implement a simple sentiment analysis system that produces the tag
y* = argmax_y e(x|y)
for each word x in the sequence.
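The baseline tagger can be sketched as below. It assumes the emission estimates are stored in a dictionary keyed by (tag, word) and that `vocab` holds the words kept in the modified training set; both names, and the toy probabilities, are made up for illustration.

```python
def baseline_tag(words, emissions, vocab, unk="#UNK#"):
    """Tag each word independently with argmax_y e(x|y)."""
    tags = sorted({tag for tag, _ in emissions})
    predicted = []
    for word in words:
        # Unseen words fall back to the #UNK# token before lookup.
        x = word if word in vocab else unk
        predicted.append(max(tags, key=lambda y: emissions.get((y, x), 0.0)))
    return predicted


# Toy emission table (values invented for the example).
emissions = {
    ("O", "the"): 0.5,
    ("O", "#UNK#"): 0.5,
    ("B-positive", "good"): 0.8,
    ("B-positive", "#UNK#"): 0.2,
}
baseline_tags = baseline_tag(["the", "good", "pizza"], emissions, {"the", "good"})
```

Because each word is tagged in isolation, this baseline ignores the transition parameters entirely.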
For all four datasets (EN, FR, CN, and SG), learn these parameters with train, and evaluate your system on the development set dev.in for each dataset. Write your output to dev.p2.out for each of the four datasets. Compare your outputs with the gold-standard outputs in dev.out and report the precision, recall and F scores of this baseline system for each dataset.
The precision score is defined as follows:
Precision = Total number of correctly predicted entities / Total number of predicted entities
The recall score is defined as follows:
Recall = Total number of correctly predicted entities / Total number of gold entities
where a gold entity is a true entity that is annotated in the reference output file, and a predicted entity is regarded as correct if and only if it matches exactly the gold entity (i.e., both their boundaries and sentiment are exactly the same).
Finally the F score is defined as follows:
F = 2 / (1/Precision + 1/Recall)
You can use the evaluation script shared with you to calculate these scores. However, you are strongly encouraged to understand how the scores are calculated.
Note: in some cases, you might have an output sequence that consists of a transition from O to I-negative (rather than B-negative). For example, “O I-negative I-negative O”. In this case, the second and third words should be regarded as one entity with negative sentiment.
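The scoring rules above can be sketched as follows. This is not the shared evaluation script: the function names are hypothetical, and the choice to start a new entity when an I- tag changes sentiment is an assumption that the official script may handle differently.

```python
def extract_entities(tags):
    """Return a set of (start, end, sentiment) spans, with end exclusive."""
    entities = set()
    start, label = None, None
    # A trailing "O" sentinel closes any entity still open at the end.
    for i, tag in enumerate(list(tags) + ["O"]):
        prefix, _, sentiment = tag.partition("-")
        continues = prefix == "I" and label is not None and sentiment == label
        if not continues:
            if label is not None:
                entities.add((start, i, label))
            if prefix in ("B", "I"):
                # An I- tag right after O also opens an entity (see the note above).
                start, label = i, sentiment
            else:
                start, label = None, None
    return entities


def score(gold_sequences, predicted_sequences):
    """Return entity-level (precision, recall, F)."""
    correct = n_pred = n_gold = 0
    for gold, pred in zip(gold_sequences, predicted_sequences):
        g, p = extract_entities(gold), extract_entities(pred)
        correct += len(g & p)
        n_pred += len(p)
        n_gold += len(g)
    precision = correct / n_pred if n_pred else 0.0
    recall = correct / n_gold if n_gold else 0.0
    # 2PR / (P + R) is the same quantity as 2 / (1/P + 1/R).
    f = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f


gold = [["O", "B-negative", "I-negative", "O"], ["B-positive", "O"]]
pred = [["O", "I-negative", "I-negative", "O"], ["B-negative", "O"]]
precision, recall, f = score(gold, pred)
```

In this toy example the first predicted entity matches the gold entity exactly, while the second has the right boundary but the wrong sentiment, so it does not count as correct.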
Using the estimated transition and emission parameters, implement the Viterbi algorithm to compute the following (for a sentence with n words):
y(1)*, ..., y(n)* = argmax_{y(1),...,y(n)} p(x(1),...,x(n), y(1),...,y(n))
For all datasets, learn the model parameters with train. Run the Viterbi algorithm on the development set dev.in using the learned models, and write your output to dev.p3.out for each of the four datasets. Report the precision, recall and F scores of all systems.
Note: if you encounter numerical underflow, think of a way to address it in your implementation.
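One common way to avoid underflow is to sum log probabilities instead of multiplying probabilities. The sketch below takes that approach; it is an illustration rather than the project's implementation. It assumes a non-empty sentence, parameters stored in dictionaries keyed by (previous tag, next tag) and (tag, word), and words already mapped to #UNK# where needed. All names and toy values are made up.

```python
import math


def viterbi(words, tags, q, e):
    """Return the most likely tag sequence, computed in log space."""

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[i][y]: best log score of any tag sequence for words[:i + 1] ending in y.
    best = [{y: log(q.get(("START", y), 0)) + log(e.get((y, words[0]), 0)) for y in tags}]
    back = []  # back[i][y]: best previous tag for tag y at position i + 1
    for x in words[1:]:
        scores, pointers = {}, {}
        for y in tags:
            prev = max(tags, key=lambda u: best[-1][u] + log(q.get((u, y), 0)))
            scores[y] = best[-1][prev] + log(q.get((prev, y), 0)) + log(e.get((y, x), 0))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Include the transition into STOP when choosing the final tag.
    last = max(tags, key=lambda y: best[-1][y] + log(q.get((y, "STOP"), 0)))
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy parameters (values invented for the example).
tags = ["O", "B-positive"]
q = {
    ("START", "O"): 0.8, ("START", "B-positive"): 0.2,
    ("O", "O"): 0.5, ("O", "B-positive"): 0.3, ("O", "STOP"): 0.2,
    ("B-positive", "O"): 0.7, ("B-positive", "B-positive"): 0.1,
    ("B-positive", "STOP"): 0.2,
}
e = {
    ("O", "the"): 0.6, ("O", "food"): 0.1, ("O", "was"): 0.3,
    ("B-positive", "food"): 0.9, ("B-positive", "the"): 0.1,
}
best_path = viterbi(["the", "food", "was"], tags, q, e)
```

With plain probabilities, the score of a sentence with a few thousand words would round to 0.0 in floating point and every path would tie; the log scores stay finite and comparable.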