len(token)
: The length of the token.
token
: The string of the token.
pre[0]
: The preceding token; "None" if the current token is the first word.
post[0]
: The succeeding token; "None" if the current token is the last word.
pos
: The POS tag of the current token.
bio
: The BIO tag of the current token.
token.islower()
: Whether all characters in the token are lowercase.
token.find("-")
: The index of "-" in the token; -1 if not found.
prev_tag
: The name tag of the previous token; "@@" if the current token is the first word.
token_lower
: The lowercase string of the token.
pre[0].lower()
: The lowercase string of the preceding token.
upper_char_count
: The number of uppercase characters in the token.
upper_char_count / len(token)
: The fraction of uppercase characters in the token.
token_lower.find("bach")
: The index of the substring "bach" in the token; -1 if not found.
pre[1]
: The POS tag of the preceding token; "start" if there is no preceding token.
post[1]
: The POS tag of the succeeding token; "end" if there is no succeeding token.
pre[2]
: The BIO tag of the preceding token; 0 if there is no preceding token.
post[2]
: The BIO tag of the succeeding token; 0 if there is no succeeding token.
count_non_alpha / len(token)
: The fraction of characters in the token that are not alphabetic.
count_non_alphanum / len(token)
: The fraction of characters in the token that are not alphanumeric.
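As a sketch, a few of the token-level features above can be computed as follows (token_features is a hypothetical helper for illustration, not the author's code):

```python
def token_features(token):
    """Compute a handful of the token-level features listed above."""
    token_lower = token.lower()
    upper_char_count = sum(1 for c in token if c.isupper())
    return {
        "len": len(token),
        "token": token,
        "islower": token.islower(),
        "hyphen_index": token.find("-"),         # -1 if "-" does not occur
        "upper_char_count": upper_char_count,
        "upper_frac": upper_char_count / len(token),
        "bach_index": token_lower.find("bach"),  # -1 if "bach" does not occur
    }
```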
To speed up the program, the code for some features that were not chosen (e.g., count_non_alpha, count_non_alphanum) is commented out.
len(token)
token
pre[0]
post[0]
pos
bio
token.islower()
token.find("-")
prev_tag
token_lower
pre[0].lower()
upper_char_count
upper_char_count / len(token)
token_lower.find("bach")
49984 out of 51578 tags correct
accuracy: 96.91
5917 groups in key
5777 groups in response
4849 correct groups
precision: 83.94
recall: 81.95
F1: 82.93
I created a class FeatureBuilder
to read a tagged or untagged file with POS and BIO tags. It has a variable named train_mode
, which can be set through the constructor or directly. When train_mode
is true, the builder creates a feature file with name tags;
otherwise it creates the feature file without tags.
The run
function is the main part of a builder.
I use a Python list named all_feature
to store all the features I have tried, and a mask vector named enable_list
to activate and deactivate features. Each item of enable_list
is a binary integer: when an item is 1
, the corresponding feature is activated; when it is 0
, the feature is disabled.
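The masking mechanism can be sketched like this (the pairing of all_feature entries with callables is an assumption for illustration; only all_feature and enable_list are the author's names):

```python
# all_feature holds (name, function) pairs; enable_list is a parallel 0/1
# mask vector, and only features whose mask entry is 1 are emitted.
all_feature = [
    ("len", lambda tok: len(tok)),
    ("islower", lambda tok: tok.islower()),
    ("hyphen", lambda tok: tok.find("-")),
]
enable_list = [1, 0, 1]  # the second feature is deactivated

def active_features(token):
    return {
        name: fn(token)
        for (name, fn), on in zip(all_feature, enable_list)
        if on == 1
    }
```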
Given an input file whose path is of the form [filepath].[suffix]
, create a FeatureBuilder
object builder
and call
builder.run()
. It will generate a file whose path is [filepath].feature
.
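The output-path rule above amounts to replacing the input file's suffix with ".feature"; a minimal sketch (feature_path is a hypothetical helper, not part of the author's class):

```python
import os

def feature_path(input_path):
    """Map [filepath].[suffix] to [filepath].feature."""
    base, _suffix = os.path.splitext(input_path)
    return base + ".feature"
```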
Put the training file, the development file, and the JAR file in the same folder as FeatureBuilder.py,
then run "python3 FeatureBuilder.py"
.
The script will automatically build the feature file for the training file, then compile and run MEtrain
to train on it.
It will then build the feature files for the development and test files, run MEtag to tag them, and finally
run score.name.py
to evaluate the result.
The class FeatureBuilder
takes 2 arguments, input_path
and train_mode
. The default file path is "./CONLL_train.pos-chunk-name"
, and the default value of train_mode
is true
.
If you want to use this class to build features for a given .pos-chunk
file, just create a FeatureBuilder
object builder
with its path, with train_mode
set to true for a training file and false otherwise
.
Then call builder.run()
, and the feature file will be generated.
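A minimal skeleton of the interface described above (the method body is a placeholder, not the author's implementation; only the class name, argument names, and defaults come from the text):

```python
class FeatureBuilder:
    def __init__(self, input_path="./CONLL_train.pos-chunk-name", train_mode=True):
        self.input_path = input_path
        self.train_mode = train_mode

    def run(self):
        # Read self.input_path and emit the feature file; include the
        # name-tag column only when self.train_mode is True.
        ...

# Example: build features for a development file (untagged mode).
builder = FeatureBuilder("./CONLL_dev.pos-chunk", train_mode=False)
builder.run()
```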