len(token)
: The length of the token.
token
: The string of the token.
pre[0]
: The preceding token; "None" if the current token is the first word.
post[0]
: The succeeding token; "None" if the current token is the last word.
pos
: The POS tag of the current token.
bio
: The BIO tag of the current token.
token.islower()
: Whether all characters in the token are lowercase.
token.find("-")
: The index of "-" in the token; -1 if not found.
prev_tag
: The name tag of the previous token; "@@" if the current token is the first word.
token_lower
: The lowercase string of the token.
pre[0].lower()
: The lowercase string of the preceding token.
upper_char_count
: The number of uppercase characters in the token.
upper_char_count / len(token)
: The fraction of uppercase characters in the token.
token_lower.find("bach")
: The index of the substring "bach" in the token; -1 if not found.
pre[1]
: The POS tag of the preceding token; "start" if there is no preceding token.
post[1]
: The POS tag of the succeeding token; "end" if there is no succeeding token.
pre[2]
: The BIO tag of the preceding token; 0 if there is no preceding token.
post[2]
: The BIO tag of the succeeding token; 0 if there is no succeeding token.
count_non_alpha / len(token)
: The fraction of characters in the token that are not alphabetic.
count_non_alphanum / len(token)
: The fraction of characters in the token that are not alphanumeric.
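As a sketch, a few of the token-level features above can be computed as follows (token_features is a hypothetical helper for illustration, not the author's code):

```python
def token_features(token):
    """Compute a handful of the token-level features listed above."""
    token_lower = token.lower()
    upper_char_count = sum(1 for c in token if c.isupper())
    return {
        "len": len(token),
        "token": token,
        "islower": token.islower(),
        "hyphen_index": token.find("-"),         # -1 if "-" does not occur
        "upper_char_count": upper_char_count,
        "upper_frac": upper_char_count / len(token),
        "bach_index": token_lower.find("bach"),  # -1 if "bach" does not occur
    }
```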
To speed up the program, the code for some features that were not chosen (e.g., count_non_alpha, count_non_alphanum) is commented out.
len(token)
token
pre[0]
post[0]
pos
bio
token.islower()
token.find("-")
prev_tag
token_lower
pre[0].lower()
upper_char_count
upper_char_count / len(token)
token_lower.find("bach")
49984 out of 51578 tags correct
accuracy: 96.91
5917 groups in key
5777 groups in response
4849 correct groups
precision: 83.94
recall: 81.95
F1: 82.93
I created a class FeatureBuilder
to read a tagged or untagged file with POS and BIO tags. It has a variable named train_mode
, which can be set through the constructor or directly. When train_mode
is true, the builder creates a feature file with name tags;
otherwise it creates the feature file without tags.
The run
function is the main part of a builder.
I use a Python list named all_feature
to store all the features I have tried, and a mask vector named enable_list
to activate and deactivate features. Each item of enable_list
is a binary integer: when an item is 1
, the corresponding feature is activated; when it is 0
, the feature is disabled.
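The masking mechanism can be sketched like this (the pairing of all_feature entries with callables is an assumption for illustration; only all_feature and enable_list are the author's names):

```python
# all_feature holds (name, function) pairs; enable_list is a parallel 0/1
# mask vector, and only features whose mask entry is 1 are emitted.
all_feature = [
    ("len", lambda tok: len(tok)),
    ("islower", lambda tok: tok.islower()),
    ("hyphen", lambda tok: tok.find("-")),
]
enable_list = [1, 0, 1]  # the second feature is deactivated

def active_features(token):
    return {
        name: fn(token)
        for (name, fn), on in zip(all_feature, enable_list)
        if on == 1
    }
```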
Given an input file whose path is of the form [filepath].[suffix]
, create a FeatureBuilder
object builder
and call
builder.run()
. It will generate a file whose path is [filepath].feature
.
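The output-path rule above amounts to replacing the input file's suffix with ".feature"; a minimal sketch (feature_path is a hypothetical helper, not part of the author's class):

```python
import os

def feature_path(input_path):
    """Map [filepath].[suffix] to [filepath].feature."""
    base, _suffix = os.path.splitext(input_path)
    return base + ".feature"
```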
Put the training file, the development file, and the JAR file in the same folder as FeatureBuilder.py,
then run "python3 FeatureBuilder.py"
.
The script will automatically build the feature file for the training file, then compile and run MEtrain
to train on it.
It will then build the feature files for the development and test files, run MEtag to tag them, and finally
run score.name.py
to evaluate the result.
The class FeatureBuilder
takes 2 arguments, input_path
and train_mode
. The default file path is "./CONLL_train.pos-chunk-name"
, and the default value of train_mode
is true
.
If you want to use this class to build features for a given .pos-chunk
file, just create a FeatureBuilder
object builder
with its path, with train_mode
set to true for a training file and false otherwise
.
Then call builder.run()
, and the feature file will be generated.
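A minimal skeleton of the interface described above (the method body is a placeholder, not the author's implementation; only the class name, argument names, and defaults come from the text):

```python
class FeatureBuilder:
    def __init__(self, input_path="./CONLL_train.pos-chunk-name", train_mode=True):
        self.input_path = input_path
        self.train_mode = train_mode

    def run(self):
        # Read self.input_path and emit the feature file; include the
        # name-tag column only when self.train_mode is True.
        ...

# Example: build features for a development file (untagged mode).
builder = FeatureBuilder("./CONLL_dev.pos-chunk", train_mode=False)
builder.run()
```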