jiesutd / NCRFpp

NCRF++, a Neural Sequence Labeling Toolkit. Easy use to any sequence labeling tasks (e.g. NER, POS, Segmentation). It includes character LSTM/CNN, word LSTM/CNN and softmax/CRF components.

License: Apache License 2.0

Language: Python (100%)
Topics: pytorch, ner, sequence-labeling, crf, lstm-crf, char-rnn, char-cnn, named-entity-recognition, part-of-speech-tagger, chunking

ncrfpp's People

Contributors

abbottlane-zz, frostming, jiesutd, ljch2018, tagucci, twhughes, victor0118


ncrfpp's Issues

about char pretrained embedding

Thank you for this excellent open source code.
I have one question about the pre-trained embeddings for characters. In the class "Data" we load pre-trained character embeddings, but I cannot find where they are used. Maybe a parameter called "pretrain_char_embedding" has to be added and passed into the char encoder class (CharBiLSTM, for example), with code modified like the following:

    if pretrain_char_embedding is not None:
        self.char_embeddings.weight.data.copy_(torch.from_numpy(pretrain_char_embedding))
    else:
        self.char_embeddings.weight.data.copy_(torch.from_numpy(self.random_embedding(alphabet_size, embedding_dim)))

Data information & other tasks' performances

There are two questions I want to ask you:
1. Which sentence counts does the data used in your code have: 14987, 3466, 3684, or 14041, 3250, 3453 (train, dev, test respectively)? Can you tell me?
2. Can your code obtain performance comparable to Lample et al. (NAACL 2016) on the CoNLL-03 German NER data, and to Ma et al. (ACL 2016) on the WSJ data? In other words, how does your code perform on other tasks compared to existing related work? Did you try?

How to train the model for more epochs?

Hi, I ran the demo code successfully and it worked pretty well.
I would therefore like to use it in my application, but I have run into a problem:
even if I set iteration=30, the training process stops after the first epoch.
What should I do in this case?

A strange bug that appears once the sentence length exceeds 256

# model/crf.py
#  def _viterbi_decode(self, feats, mask):
length_mask = torch.sum(mask, dim = 1).view(batch_size,1).long()
mask == [1, 1, ..., 1]  # size: 1 * 256
torch.sum(mask, dim=1)   # when mask is a ByteTensor, the result is 0

Painful: this bug cost me three days.
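
The likely cause is that summing a ByteTensor used to be carried out in uint8 arithmetic, so a mask of 256 ones wraps around to 0. A minimal sketch of the symptom and of one possible fix (casting to long before summing; the exact behaviour depends on the PyTorch version):

    import torch

    mask = torch.ones(1, 256, dtype=torch.uint8)     # mask for a 256-token sentence
    print(torch.sum(mask, dim=1))                    # prints 0 in older PyTorch: the uint8 sum overflows at 256
    length_mask = torch.sum(mask.long(), dim=1)      # cast to long first, then sum
    print(length_mask)                               # tensor([256])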

problems in reproducibility

I have tried running the configuration mentioned in the README on a GPU, with 10 different seeds.
I am still not able to reach an F-score of 90+ (for the non-LSTM-based results) or 91+ for the LSTM-based results.

word embeddings

Which word embeddings did you use to obtain the results shown?

Integration

Hello, I want to integrate your code with my system, for academic purposes.
I want to enable the system to take a stream of tokens as input and output their respective POS tags.
How can I do that?

thanks

RuntimeError in crf.py line 247

Hi,

I am trying to run the training demo using Python 3.5 and PyTorch 0.4.0 (with CUDA on an NVIDIA GTX 1050). I get the following error:
THCudaCheck FAIL file=/pytorch/aten/src/THC/generic/THCTensorCopy.c line=70 error=59 : device-side assert triggered
/pytorch/aten/src/THC/THCTensorScatterGather.cu:97: void THCudaTensor_gatherKernel(TensorInfo<Real, IndexType>, TensorInfo<Real, IndexType>, TensorInfo<long, IndexType>, int, IndexType) [with IndexType = unsigned int, Real = long, Dims = 2]: block: [0,0,0], thread: [0,0,0] Assertion `indexValue >= 0 && indexValue < src.sizes[dim]` failed.
(the same assertion failure is repeated for threads [1,0,0] through [5,0,0])

Traceback (most recent call last):
  File "/home/spike/Software/PyCharm/helpers/pydev/pydevd.py", line 1668, in <module>
    main()
  File "/home/spike/Software/PyCharm/helpers/pydev/pydevd.py", line 1662, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/home/spike/Software/PyCharm/helpers/pydev/pydevd.py", line 1072, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/spike/Software/PyCharm/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/spike/Projects/NCRFpp/main.py", line 434, in <module>
    train(data)
  File "/home/spike/Projects/NCRFpp/main.py", line 326, in train
    loss, tag_seq = model.neg_log_likelihood_loss(batch_word, batch_features, batch_wordlen, batch_char, batch_charlen, batch_charrecover, batch_label, mask)
  File "/home/spike/Projects/NCRFpp/model/seqmodel.py", line 43, in neg_log_likelihood_loss
    total_loss = self.crf.neg_log_likelihood_loss(outs, mask, batch_label)
  File "/home/spike/Projects/NCRFpp/model/crf.py", line 262, in neg_log_likelihood_loss
    gold_score = self._score_sentence(scores, mask, tags)
  File "/home/spike/Projects/NCRFpp/model/crf.py", line 247, in _score_sentence
    tg_energy = tg_energy.masked_select(mask.transpose(1,0))
RuntimeError: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generated/../THCReduceAll.cuh:339

Does anyone know what might be causing this error?

Some problems with the dataset.

Excuse me, I can't find the CoNLL data with BIOES tags. Where can I get the data needed to reproduce your score?
The score with BIO tags is somewhat worse than with BIOES.

Can you give me a URL to download the data?

thanks a lot for your help.

Documentation for main_parse

I'm trying to get a pre-trained model to run via the command line using main_parse but am having issues.

It would be helpful to have some documentation on the command line arguments for this function.

Thanks

Problem

Hello,
in this experiment, if the development set is very small, will it have a large impact on the experimental results? Thanks.

Python 3 Support

Thanks for the nice work. A minor issue, you have implemented support for Python 3 by catching the ModuleNotFoundError exception, which is fine for Python version 3.6 but will cause an error in versions <=3.5.

A quick solution would be to use ImportError instead of ModuleNotFoundError, at lines 24 and 14 in main.py and utils/data.py, respectively.
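
A minimal sketch of the suggested change (the module being imported here is illustrative; the relevant lines are the ones mentioned above):

    try:
        import cPickle as pickle       # Python 2
    except ImportError:                # also catches ModuleNotFoundError, which is a subclass on Python 3.6+
        import pickle                  # Python 3

Since ImportError exists on every Python version, this works on 2.7, on 3.5 and earlier, and on 3.6+.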

didn't match because some of the arguments have invalid types: (list)

Hi, I'm trying to run the demo training with
python main.py --config demo.train.config, and got:

 ++++++++++++++++++++++++++++++++++++++++
 Hyperparameters:
     Hyper              lr: 0.015
     Hyper        lr_decay: 0.05
     Hyper         HP_clip: None
     Hyper        momentum: 0.0
     Hyper              l2: 1e-08
     Hyper      hidden_dim: 200
     Hyper         dropout: 0.5
     Hyper      lstm_layer: 1
     Hyper          bilstm: True
     Hyper             GPU: True
DATA SUMMARY END.
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
build network...
use_char:  True
char feature extractor:  CNN
word feature extractor:  LSTM
use crf:  True
build word sequence feature extractor: LSTM...
build word representation...
build char sequence feature extractor: CNN ...
build CRF...
Epoch: 0/1
 Learning rate is setted as: 0.015
Traceback (most recent call last):
  File "main.py", line 436, in <module>
    train(data)
  File "main.py", line 326, in train
    batch_word, batch_features, batch_wordlen, batch_wordrecover, batch_char, batch_charlen, batch_charrecover, batch_label, mask  = batchify_with_label(instance, data.HP_gpu)
  File "main.py", line 234, in batchify_with_label
    mask[idx, :seqlen] = torch.Tensor([1]*seqlen)
TypeError: mul() received an invalid combination of arguments - got (list), but expected one of:
 * (Tensor other)
      didn't match because some of the arguments have invalid types: (list)
 * (float other)
      didn't match because some of the arguments have invalid types: (list)

Any suggestion would be appreciated. Thanks!
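
One plausible cause (an assumption, not a confirmed diagnosis): in PyTorch 0.4+, indexing a LongTensor of sequence lengths returns a 0-dim tensor rather than a Python int, so [1]*seqlen dispatches to Tensor.mul with a list argument and fails. A hedged sketch of a workaround inside batchify_with_label, not necessarily the maintainer's fix:

    seqlen = int(seqlen)                           # seqlen may be a 0-dim tensor in newer PyTorch
    mask[idx, :seqlen] = torch.Tensor([1]*seqlen)  # [1]*seqlen is now an ordinary Python list repetition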

Question about alphabet

Hello, I have a question about the alphabet.
In alphabet.py, the function size returns len(self.instances) + 1; I think this is because of the padding token. But in seqmodel.py, why do we have to add two more labels for the downstream LSTM? Although the original label size is used for the CRF, the CRF model still adds "start" and "end" to the transition matrix. This confuses me.
Also, if I do not use the CRF, _, tag_seq = torch.max(outs, 1) may lead to a wrong index.
Thank you~

Can I use the CPU to run this program?

Hi @jiesutd,

Thanks for sharing your work, it is an incredible job!
My GPU is too old to run this program; I was wondering whether I can run it on the CPU instead.
I set the parameter to gpu=0 in the demo.train.config file, but it does not work.

Regards,
Thanks!

ask for advice

Hello,
can I change raw.bmes to test.bmes during decoding? Also,
I would like to know what role raw.bmes plays.

Predicted output sample count does not equal input count

I found that the number of samples in the decode output is smaller than in the decode input.
Do you just skip sentences for which no tag is predicted?
Can I get all of the predicted results?

build word sequence feature extractor: LSTM...
build word representation...
build CRF...
Decode raw data, nbest: 1 ...
gold_num =  48612  pred_num =  44576  right_num =  32770
raw: time:150.02s, speed:149.76st/s; acc: 0.9298, p: 0.7351, r: 0.6741, f: 0.7033

My input count is 1514142, and output count is 1510350.

Unfrozen word vector

Hi,
Can you please give some guidelines on how to unfreeze the word vectors?

I have the embedding file and each word gets converted to its embedding, but I want the embeddings to be tweaked during training (hence "unfreezing"). Can you help?

Thanks
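
A minimal PyTorch sketch of the general mechanism (the sizes and the variable holding the loaded vectors are illustrative, not NCRF++'s actual attribute names):

    import numpy as np
    import torch
    import torch.nn as nn

    vocab_size, emb_dim = 10000, 100                    # illustrative sizes
    pretrained = np.random.rand(vocab_size, emb_dim)    # stand-in for the vectors loaded from the embedding file

    embedding = nn.Embedding(vocab_size, emb_dim)
    embedding.weight.data.copy_(torch.from_numpy(pretrained))
    embedding.weight.requires_grad = True               # True = the vectors receive gradients and are fine-tuned
    # setting requires_grad = False (or optimizing only the other parameters) would freeze them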

Optimize with sgd

Hi,
I am using ncrfpp on my own dataset.
Adam can converge normally in fewer than 20 epochs.

However, optimizing with SGD is extremely hard. I got gradient explosion or non-convergence most of the time.
Removing dropout and L2 regularization and using a very small learning rate makes training converge, but extremely slowly.

Could you share your parameters used for training with SGD?
Many thanks!
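
For what it's worth, a hedged sketch of SGD with per-epoch learning-rate decay and gradient clipping, using the lr=0.015 and lr_decay=0.05 values shown elsewhere on this page (the training-loop details and the loss call are placeholders, not NCRF++'s exact code):

    import torch

    def train_sgd(model, batches, epochs=100, lr0=0.015, lr_decay=0.05, clip=5.0):
        optimizer = torch.optim.SGD(model.parameters(), lr=lr0, momentum=0, weight_decay=1e-8)
        for epoch in range(epochs):
            lr = lr0 / (1 + lr_decay * epoch)          # simple per-epoch decay
            for group in optimizer.param_groups:
                group['lr'] = lr
            for batch in batches:
                optimizer.zero_grad()
                loss, _ = model.neg_log_likelihood_loss(*batch)              # NCRF++-style loss call, arguments elided
                loss.backward()
                torch.nn.utils.clip_grad_norm_(model.parameters(), clip)     # clip to tame exploding gradients
                optimizer.step()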

Data format

Hi, for NER what's the format of the file that the script expects:

AL-AIN NNP I-NP I-LOC
, , O O
United NNP I-NP I-LOC
Arab NNP I-NP I-LOC
Emirates NNPS I-NP I-LOC
1996-12-06 CD I-NP O
....

like the above? I saw that you need some .bmes files.

Mask without CRF

When the CRF is disabled, you do not use the mask when calculating the loss. Is there a reason for that?
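
For comparison, a minimal sketch of applying the padding mask to a token-level cross-entropy loss (the shapes are assumptions: outs is (batch*seq_len, num_tags), batch_label and mask are (batch, seq_len)):

    import torch
    import torch.nn.functional as F

    def masked_token_loss(outs, batch_label, mask):
        loss = F.cross_entropy(outs, batch_label.view(-1), reduction='none')   # per-token losses
        loss = loss * mask.view(-1).float()                                    # zero out padded positions
        return loss.sum() / mask.float().sum()                                 # average over real tokens only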

Probability of an Output Sequence

Hi,
I want to get the probability of each output sequence (not the n-best score) when decoding.
How can I get this?
(How do I get the partition function Z when decoding?)
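
For reference, in a linear-chain CRF the probability of a decoded sequence y with unnormalized score s is exp(s - log Z), where log Z comes from the forward algorithm. A minimal sketch with illustrative shapes (not NCRF++'s internal API):

    import torch

    def crf_log_partition(emissions, transitions):
        # emissions: (seq_len, num_tags); transitions[i, j] = score of moving from tag i to tag j
        alpha = emissions[0]
        for t in range(1, emissions.size(0)):
            alpha = torch.logsumexp(alpha.unsqueeze(1) + transitions, dim=0) + emissions[t]
        return torch.logsumexp(alpha, dim=0)   # log Z

    # probability of a decoded path with unnormalized score s:
    # prob = torch.exp(s - crf_log_partition(emissions, transitions))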

f score is -1

In the file demo.train.config I changed the iterations to 100 and batch_size to 32; the dev and test scores are almost always -1. (Note this is on the sample_dataset you provided, with the embeddings you provided.)

Problem using GloVe 100 on Windows

If you use GloVe 100 on Windows, you will probably get errors about gbk encoding.
The solution is to change the code in functions.py, line 128, from:
    with open(embedding_path, 'r') as file:
to add encoding="utf-8", like the following:
    with open(embedding_path, 'r', encoding="utf-8") as file:

Deterministic Training Behaviour

Hi,
What am I doing wrong if training with the same hyperparameters always gives the same performance? Shouldn't there be some variance because of the random initialization, or am I wrong?

Config looks like this:

train_dir=.../conll2003/en/ner/train.txt
dev_dir=.../conll2003/en/ner/valid.txt
test_dir=.../conll2003/en/ner/test.txt
model_dir=test.model
word_emb_dir=.../sample_data/sample.word.emb
norm_word_emb=False
norm_char_emb=False
number_normalized=True
seg=True
word_emb_dim=50
char_emb_dim=30
use_crf=True
use_char=True
word_seq_feature=LSTM
char_seq_feature=CNN
nbest=1
status=train
optimizer=Adam
iteration=1
batch_size=8
ave_batch_loss=False
cnn_layer=4
char_hidden_dim=30
hidden_dim=200
dropout=0.5
lstm_layer=2
bilstm=True
learning_rate=0.002
lr_decay=0.05
momentum=0
l2=1e-08
gpu=True
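
If NCRF++ fixes a random seed internally (an assumption worth checking in main.py), identical configs will indeed give identical runs. A hedged sketch of the usual way to introduce variance between runs by changing the seed yourself:

    import random
    import numpy as np
    import torch

    def set_seed(seed):
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)   # no-op without a GPU

    set_seed(42)   # change this value between runs to get different initializations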

Unable to replicate the reported numbers on CoNLL dataset

After 100 epochs on the train/dev/test splits of the CoNLL 2003 dataset, with LSTM and CNN character features, I get the following results at the best dev F-score:

  • LSTM
    Dev: time: 5.59s, speed: 627.24st/s; acc: 0.9891, p: 0.9460, r: 0.9465, f: 0.9463
    Exceed previous best f score: 0.945258548088
    Test: time: 5.57s, speed: 712.46st/s; acc: 0.9808, p: 0.9102, r: 0.9107, f: 0.9104
  • CNN
    Dev: time: 5.32s, speed: 660.84st/s; acc: 0.9891, p: 0.9458, r: 0.9460, f: 0.9459
    Exceed previous best f score: 0.945809491754
    Test: time: 4.88s, speed: 788.98st/s; acc: 0.9804, p: 0.9081, r: 0.9068, f: 0.9074

I'm trying to understand what it takes to reproduce the reported numbers, and also to use them as a baseline for my experiments. Let me know which other parameters I need to change.

Also, thanks for open-sourcing the code!

a problem

Traceback (most recent call last):
  File "main.py", line 436, in <module>
    train(data)
  File "main.py", line 326, in train
    batch_word, batch_features, batch_wordlen, batch_wordrecover, batch_char, batch_charlen, batch_charrecover, batch_label, mask = batchify_with_label(instance, data.HP_gpu)
  File "main.py", line 234, in batchify_with_label
    mask[idx, :seqlen] = torch.Tensor([1]*seqlen)
TypeError: mul() received an invalid combination of arguments - got (list), but expected one of:
 * (Tensor other)
      didn't match because some of the arguments have invalid types: (list)
 * (float other)
      didn't match because some of the arguments have invalid types: (list)

Can you help me ?

Thanks for sharing this good work; can you help me?
I set the iteration to 15000 and find that training is too slow, and the F1 score is always around 0.7 percent. Can you help me with these training problems?

By the way, I see that your F1 depends on precision and recall, but I have seen other work that calculates it from accuracy and recall; maybe I am wrong. How can I reproduce your score? And after how many epochs did you get the best score, so that I can set the iteration parameter?

I would appreciate your help!

doubt in metric.py

I think get_ner_BIO() in metric.py is wrong.

Consider the example where label_list = [I-MISC, I-MISC, O, I-PER, I-PER, O, O, O, O, O, I-ORG, O]. According to the current function, the following happens:

since there is no tag starting with B-, whole_tag and tag_index always stay [], and hence the output of the function is [], which seems wrong.
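
For comparison, a sketch of a more permissive BIO chunker that also opens a chunk on a bare I- tag (an illustration, not the toolkit's get_ner_BIO):

    def get_chunks_bio(labels):
        chunks, start, ctype = [], None, None
        for i, lab in enumerate(labels + ['O']):        # trailing sentinel flushes the last open chunk
            if lab.startswith('B-') or (lab.startswith('I-') and lab[2:] != ctype):
                if start is not None:
                    chunks.append((ctype, start, i - 1))
                start, ctype = i, lab[2:]
            elif lab == 'O':
                if start is not None:
                    chunks.append((ctype, start, i - 1))
                start, ctype = None, None
        return chunks

    # get_chunks_bio(['I-MISC', 'I-MISC', 'O', 'I-PER', 'I-PER', 'O', 'O', 'O', 'O', 'O', 'I-ORG', 'O'])
    # -> [('MISC', 0, 1), ('PER', 3, 4), ('ORG', 10, 10)]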

problem in config reader

Why is tagScheme initialised with noSeg?

It does not get updated when reading the config file; in my experiments so far, tagScheme always defaults to BIO and never BMES. How do you set the tagScheme? Through the config file? (I don't think code has even been written for that.)

f1 score is -1, pred_num = 0

The same issue as #22. We used our own dataset to train the NER model. The tag scheme is BIOES (the only difference is that we used "M-" instead of "I-"). These data have been tested on your Lattice LSTM model, where they get accurate p/r/F1 values, so I am confused: why is our F1 score -1 and pred_num = 0 with this model?

Some questions about the CNN_BILSTM_CRF model

First of all, my respects:
I want to apply this model to a Chinese sequence labeling problem that includes POS tags. Does this conflict with the CNN character features? In your project, can manually annotated features and CNN character features coexist? Also, looking at the data preprocessing format, "Friday [Cap]1 [POS]NNP O": since I only use the POS feature, should the data be written as "Friday [POS]NNP O"? Is the [POS] prefix required, or is it just a marker you chose?
Please advise.

Question about the F1 score in Section 2.

Hello. I want to know whether the performance on CoNLL 2003 English NER reported in Section 2 is the average performance or the maximum performance. Did you do a significance test? I think 91.20/91.26 is good, but a significance-test result is needed for CoNLL 2003 English NER.

CRF PZ calculation

In log_sum_exp, why take argmax and then gather instead of just taking max? Are there any gradient-flow issues?
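
For reference, a standard numerically stable log-sum-exp that just uses max; max is subdifferentiable, so gradients flow through it the same way they would through gather on the argmax indices (in recent PyTorch this is simply torch.logsumexp):

    import torch

    def log_sum_exp(vec, dim):
        max_score, _ = torch.max(vec, dim, keepdim=True)   # subtract the max for numerical stability
        return max_score.squeeze(dim) + torch.log(torch.sum(torch.exp(vec - max_score), dim))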

Python 3+

By any chance, do you know if this is compatible with Python3+? I noticed you mentioned 2.7 in your requirements.

Bug in IOBES converter?

It appears from your sample that there may be a bug in your IOBES converter, which I'm assuming would affect your paper's findings slightly?

An example would be at line 6845 of train.bmes.txt:

English S-MISC
County S-MISC
Championship S-MISC
cricket O
matches O
on O
Thursday O
: O

The original file in IOB1 has this:

English NNP I-NP I-MISC
County NNP I-NP B-MISC
Championship NNP I-NP I-MISC
cricket NN I-NP O
matches NNS I-NP O
on IN I-PP O
Thursday NNP I-NP O
: : O O
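
For reference, a minimal BIO2-to-BIOES conversion sketch (an illustration, not the toolkit's converter; note that IOB1 input like the original file above first has to be converted to BIO2, since IOB1 only uses B- to separate adjacent chunks of the same type):

    def bio2_to_bioes(labels):
        out = []
        for i, lab in enumerate(labels):
            nxt = labels[i + 1] if i + 1 < len(labels) else 'O'
            if lab == 'O':
                out.append(lab)
            elif lab.startswith('B-'):
                out.append(lab if nxt == 'I-' + lab[2:] else 'S-' + lab[2:])
            else:  # I- tag: keep it unless the chunk ends here
                out.append(lab if nxt == 'I-' + lab[2:] else 'E-' + lab[2:])
        return out

    # bio2_to_bioes(['B-MISC', 'I-MISC', 'I-MISC', 'O']) -> ['B-MISC', 'I-MISC', 'E-MISC', 'O']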

Design of character level feature extractor

Hi,
In the CNN feature extractor, is it the case that within a batch you assume all words have the same length? If so, the shorter words must be padded; will that not disturb the character-level features?
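
For context, a typical char-CNN encoder pads every word in the batch to the longest word and then max-pools over the character dimension, so the padded positions mostly wash out of the final feature vector. A minimal sketch with illustrative shapes and sizes (not NCRF++'s exact module):

    import torch
    import torch.nn as nn

    class CharCNN(nn.Module):
        def __init__(self, alphabet_size, char_dim=30, hidden_dim=50, kernel=3):
            super(CharCNN, self).__init__()
            self.embed = nn.Embedding(alphabet_size, char_dim, padding_idx=0)   # index 0 = padding, embeds to zeros
            self.conv = nn.Conv1d(char_dim, hidden_dim, kernel, padding=1)

        def forward(self, char_ids):
            # char_ids: (num_words, max_word_len), padded with 0
            x = self.embed(char_ids).transpose(1, 2)    # (num_words, char_dim, max_word_len)
            x = self.conv(x)                            # (num_words, hidden_dim, max_word_len)
            return torch.max(x, dim=2)[0]               # max over characters; padding rarely dominates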

RuntimeError: invalid argument 0: Tensors must have same number of dimensions: got 3 and 2

Hi, thank you for this excellent code; I have learned a lot from it. But when I run it, I get the following problem:

Traceback (most recent call last):
  File "main.py", line 438, in <module>
    train(data, save_model_dir, seg)
  File "main.py", line 265, in train
    batch_charrecover, batch_label, mask)
  File "/Users/fengxiachong/Desktop/PyTorchSeqLabel-master/model/bilstmcrf.py", line 33, in neg_log_likelihood_loss
    scores, tag_seq = self.crf._viterbi_decode(outs, mask)
  File "/Users/fengxiachong/Desktop/PyTorchSeqLabel-master/model/crf.py", line 162, in _viterbi_decode
    partition_history = torch.cat(partition_history).view(seq_len, batch_size, -1).transpose(1,
RuntimeError: invalid argument 0: Tensors must have same number of dimensions: got 3 and 2 at /Users/soumith/minicondabuild3/conda-bld/pytorch_1518371252923/work/torch/lib/TH/generic/THTensorMath.c:2888

Could you please help me?
