sinalab / arabicner

Arabic nested named entity recognition

License: MIT License

Python 100.00%

arabicner's People

fadybaly asabbah44 eng-aomar aqhali tymaahammouda khaled3rd homenshum ahmadhakami issifuabdulmajeed omarnagy91 tunytrinh ramycodes elbarbary01 mariamashaheen annakholkina mrmdrx

arabicner's Issues

How to load the model uploaded to HF using transformers?

Hi,

Thanks for your efforts in developing the model.
I am trying to load the NER model, but I am getting strange results as output. Do you have any idea why that is the case?

Code:

from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification

NER_model_name = "SinaLab/ArabicWojood-FlatNER"
tokenizer = AutoTokenizer.from_pretrained(NER_model_name)
model = AutoModelForTokenClassification.from_pretrained(NER_model_name)

ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

def get_ne(text):
    output = ner_pipeline(text)
    return {"text": text, "entities": output}

get_ne("أنا اسمي محمد")

Output:

{'text': 'أنا اسمي محمد',
 'entities': [{'entity': 'B-FAC',
   'score': 0.111766666,
   'index': 1,
   'word': 'انا',
   'start': 0,
   'end': 3},
  {'entity': 'B-FAC',
   'score': 0.08029653,
   'index': 2,
   'word': 'اسمي',
   'start': 4,
   'end': 8},
  {'entity': 'B-FAC',
   'score': 0.072174296,
   'index': 3,
   'word': 'محمد',
   'start': 9,
   'end': 13}]}
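
A check that may help narrow this down (a sketch, not a confirmed diagnosis): uniformly low scores with one repeated label can be a symptom of a classification head that was not found in the checkpoint and was therefore freshly initialised. `from_pretrained` accepts `output_loading_info=True` and then also returns a dict that includes a `missing_keys` list. The helper names below and the `classifier` prefix are assumptions for illustration; the actual head name depends on the architecture stored in the checkpoint.

```python
def untrained_head_keys(loading_info, head_prefix="classifier"):
    """Return parameter names of the head that were absent from the checkpoint.

    `head_prefix` is an assumption: BERT-style token classifiers in
    transformers name the head `classifier`, other architectures may differ.
    """
    missing = loading_info.get("missing_keys", [])
    return [key for key in missing if key.startswith(head_prefix)]


def load_with_report(model_name):
    """Load a token-classification model and report head weights that were not loaded."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import AutoModelForTokenClassification

    model, info = AutoModelForTokenClassification.from_pretrained(
        model_name, output_loading_info=True
    )
    return model, untrained_head_keys(info)


# Usage (downloads the checkpoint, so it is left commented out here):
# model, missing_head = load_with_report("SinaLab/ArabicWojood-FlatNER")
# print(missing_head)  # a non-empty list would mean the head was randomly initialised
```

If the list comes back non-empty, the checkpoint is probably not stored in a layout that `AutoModelForTokenClassification` can map onto its own head, and the repository's own inference script may be the intended entry point.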

How to calculate cohen's kappa in NER setting without the usage of "O"?

Thanks for your paper. I have a question about how kappa is calculated when the "O" label is excluded from agreement, i.e. $\kappa_{\sim o}$. Here is an example that illustrates my concern.

In Table 6 of the paper, the "ORG" entity tag has 1713 TP, 33 FN and 30 FP, and, as you mention, the total number of annotated tokens is 24K (first paragraph of Sec. 3.4), so TN (true negatives) is $24000-1713-33-30=22224$.

If "O" is included in the agreement between the two annotations, then $TN=22224, TP=1713, FN=33, FP=30, totalnum=24000$. So $p_o$ in $\kappa_o$ is $(1713+22224)/24000=0.997375$, and $p_e=((1713+30)\times(1713+33)+(22224+30)\times(22224+33))/(24000^2)\approx 0.86519$, so $\kappa_o=(p_o-p_e)/(1-p_e)\approx 0.980528 \approx 0.981$, which is consistent with your result in Table 6.

However, when "O" is excluded, TN should be 0, and simply using $TN=0, TP=1713, FN=33, FP=30, totalnum=1776$ to compute $\kappa_{\sim o}$ gives a negative result ($-0.018015$), because setting TN to 0 greatly increases the chance-agreement baseline in kappa, i.e. $p_e$. This problem is common when computing Cohen's kappa for NER, since the "negative samples" are hard to define: counting every "O" may give a biased result because the data is imbalanced, while excluding "O" unexpectedly gives a negative result.

So my question is: how do you compute $\kappa_{\sim o}$ here, given that Table 6 reports 0.974 for the "ORG" tag, and so on? How do you define $p_o$ and $p_e$ when "O" is excluded? Could you give a concrete defining equation?
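
The arithmetic in this question can be reproduced with a few lines of Python. This is only a sketch of the one-vs-rest binary kappa the question describes, using the counts quoted from Table 6; the function name is made up for illustration, and it is not the paper's definition of $\kappa_{\sim o}$, which is exactly what the question asks for.

```python
def cohen_kappa_binary(tp, fn, fp, tn):
    """Cohen's kappa for one tag treated as a binary (tag vs. not-tag) decision."""
    n = tp + fn + fp + tn
    # Observed agreement: both annotators say "tag" or both say "not tag".
    p_o = (tp + tn) / n
    # Chance agreement from each annotator's marginal totals.
    p_e = ((tp + fp) * (tp + fn) + (tn + fp) * (tn + fn)) / n**2
    return (p_o - p_e) / (1 - p_e)


# Counts for "ORG" as quoted in the question.
with_o = cohen_kappa_binary(tp=1713, fn=33, fp=30, tn=22224)
without_o = cohen_kappa_binary(tp=1713, fn=33, fp=30, tn=0)

print(round(with_o, 3), round(without_o, 3))  # 0.981 -0.018
```

With TN set to 0 the chance term $p_e$ (about 0.9652) ends up slightly above the observed agreement $p_o$ (about 0.9645), which is why the naive computation goes negative.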

Error when non-UTF-8 encoding char occurs in a text during inference

Hi,

I got the following error while running inference on some text samples. After some investigation, it seems that the error occurs whenever the input text contains a non-UTF-8 character. In that case, the sizes of the pred and segment arrays in "arabiner/trainers/BertNestedTrainer.py", lines 187-188, differ by more than 1 because of the non-UTF-8 character(s) in the input text. (To be confirmed.)

Traceback (most recent call last):
  File "ner_tester.py", line 35, in <module>
    run_inference_for_file(file_path)
  File "ner_tester.py", line 23, in run_inference_for_file
    batch_list = inference_model.predict(ner_inputs, lang)
  File "/app/model/ner_inference.py", line 148, in predict
    segments = self.tagger.infer(dataloader)
  File "/app/arabiner/trainers/BertNestedTrainer.py", line 174, in infer
    segments = self.to_segments(segments, preds, valid_lens, dataloader.dataset.vocab)
  File "/app/arabiner/trainers/BertNestedTrainer.py", line 193, in to_segments
    for tag_id, vocab in zip(pred[i, :].int().tolist(), vocab.tags[1:])]
IndexError: index 146 is out of bounds for dimension 0 with size 146

You may want to run the inference code using the following text sample to reproduce the error:

text_sampel = "يبدو أن فكر التنظيم الداعشيّ -الذي ينتشر بصورةٍ واسعة عبر وسائل التواصل الاجتماعي، ومقاطع فيديو دعائية بارعة- قد نجح في إلهام موجةٍ من العنف على مدار ما يزيد عن عامٍ: تتضمن إطلاق النار في سان بيرناردينيو بكاليفورنيا، وقتل العديد من رواد مقهى للمثليين بأورلاندو في شهر ‏ ‏‏يونيو/‏‏حزيران، والهجمة القاتلة في أول شهر ‏يوليو/‏تموز 2016 على مقهى آخر ببنغلاديش. يُضاف إليها الهجمات التي يُرجح أن واضعي خططها هم أكبر مُهندسي العمليات في الدولة الإسلامية، مثل هجمات باريس في نوفمبر /‏تشرين الثاني 2015، وتفجيرات بروكسل في مارس/‏آذار ‏2016. ‏"
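
A possible workaround, sketched under an assumption: the sample above appears to contain invisible directional marks (such as U+200F RIGHT-TO-LEFT MARK), which are valid UTF-8 but belong to the Unicode "format" category (`Cf`). If those characters are the trigger, removing them before inference may avoid the length mismatch. The function name is hypothetical, and this has not been checked against the repository's tokenisation code.

```python
import unicodedata


def strip_format_chars(text: str) -> str:
    """Remove Unicode format characters (category Cf) and tidy the whitespace."""
    # Cf covers invisible marks such as U+200E/U+200F and zero-width joiners.
    kept = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    # Marks that stood alone between spaces leave empty tokens; collapse them.
    return " ".join(kept.split())


# Fragment modelled on the sample above, with the marks written as escapes.
sample = "شهر \u200f \u200f\u200fيونيو/\u200fتموز"
print(strip_format_chars(sample))
```

Note that category `Cf` also includes the zero-width non-joiner (U+200C), which carries meaning in some scripts such as Persian, so a narrower filter may be preferable for mixed-language input.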

Question about future work (tag sub-entity of the same class)

Thank you for your article.
I am working on a nested NER task for clinical notes, using a similar architecture with some additions such as contrastive learning.
I have the same problem with nested entities of the same type. What is your idea for a solution?

Any suggestions would be helpful.

No module named 'arabiner'

Hello
I am trying to run the model but got the following error:

Traceback (most recent call last):
  File "/notebooks/arabiner/bin/train.py", line 8, in <module>
    from arabiner.utils.data import get_dataloaders, parse_conll_files
ModuleNotFoundError: No module named 'arabiner'

This is my command:
python arabiner/bin/train.py --output_path data/output/dir --train_path data/train.txt --val_path data/val.txt --test_path data/test.txt --batch_size 8 --data_config '{"fn":"arabiner.data.datasets.DefaultDataset","kwargs":{"max_seq_len":512}}' --trainer_config '{"fn":"arabiner.trainers.BertTrainer","kwargs":{"max_epochs":50}}' --network_config '{"fn":"arabiner.nn.BertSeqTagger","kwargs":{"dropout":0.1,"bert_model":"aubmindlab/bert-base-arabertv2"}}' --optimizer '{"fn":"torch.optim.AdamW","kwargs":{"lr":0.0001}}'

Can you help please?
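
Some background that may explain the error: when Python runs a script by path, it places the script's own directory (here `arabiner/bin`) at the front of `sys.path`, not the current working directory. The top-level `arabiner` package is therefore not importable unless its parent directory is added to the path or the package is installed. The sketch below derives that parent directory from the path in the traceback; `package_parent` is a hypothetical helper, not part of the repository.

```python
import sys
from pathlib import PurePosixPath


def package_parent(script_path: str, package: str = "arabiner") -> str:
    """Return the directory that must be on sys.path for `import <package>` to work."""
    parts = PurePosixPath(script_path).parts
    if package not in parts:
        raise ValueError(f"{package!r} does not appear in {script_path!r}")
    # Everything before the package directory is its parent.
    return str(PurePosixPath(*parts[: parts.index(package)]))


# Path taken from the traceback above.
root = package_parent("/notebooks/arabiner/bin/train.py")
if root not in sys.path:
    sys.path.insert(0, root)  # makes `import arabiner` resolvable from this process
```

Setting the `PYTHONPATH` environment variable to the same directory before launching `train.py` has the same effect without touching the code.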

Only O and <pad> in inference mode

Hi! Your work looks great! I tried to train my own model for Russian. I prepared train/val/test files like yours and swapped the pretrained BERT for another one. These are my args:

python arabiner/bin/train.py \
    --output_path ./ArabicNER/output \
    --train_path ./ArabicNER/data/train.txt \
    --val_path ./ArabicNER/data/val.txt \
    --test_path ./ArabicNER/data/test.txt \
    --batch_size 8 \
    --data_config '{"fn":"arabiner.data.datasets.NestedTagsDataset","kwargs":{"max_seq_len":512}}' \
    --trainer_config '{"fn":"arabiner.trainers.BertNestedTrainer","kwargs":{"max_epochs":50}}' \
    --network_config '{"fn":"arabiner.nn.BertNestedTagger","kwargs":{"dropout":0.1,"bert_model":"DeepPavlov/rubert-base-cased-conversational"}}' \
    --optimizer '{"fn":"torch.optim.AdamW","kwargs":{"lr":0.0001}}'

The model trained well with these args. Metrics on the test set:

But when I run inference on text, the output contains only 'O' or <pad>, even for an example from train.txt:

In this example, the second word is B-PER. The model did not predict an entity in any other example either.
Code used to run inference:

python -u ./ArabicNER/arabiner/bin/infer.py \
    --model_path ./ArabicNER/output \
    --text "привет андрей"

Can you help me with this problem?

PKL file loading error in inference code

Hello,

I am encountering an error when trying to run the inference to perform tagging.

Command
!python /content/ArabicNER-Wojood/arabiner/bin/infer.py \
    --model_path /content/ArabicNER-Wojood \
    --text "وثائق نفوس شخصية من الفترة العثمانية للسيد نعمان عقل"

Code
Colab

Error message:
Traceback (most recent call last):
  File "/content/ArabicNER-Wojood/arabiner/bin/infer.py", line 73, in <module>
    main(parse_args())
  File "/content/ArabicNER-Wojood/arabiner/bin/infer.py", line 61, in main
    segments = tagger.infer(dataloader)
  File "/content/ArabicNER-Wojood/arabiner/trainers/BertNestedTrainer.py", line 167, in infer
    for _, gold_tags, tokens, valid_len, logits in self.tag(
  File "/content/ArabicNER-Wojood/arabiner/trainers/BertNestedTrainer.py", line 125, in tag
    for subwords, gold_tags, tokens, mask, valid_len in dataloader:
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/ArabicNER-Wojood/arabiner/data/datasets.py", line 121, in __getitem__
    subwords, tags, tokens, masks, valid_len = self.transform(self.examples[item])
  File "/content/ArabicNER-Wojood/arabiner/data/transforms.py", line 84, in __call__
    vocab_tags = "|".join([t for t in vocab.get_itos() if "-" in t])
  File "/usr/local/lib/python3.10/dist-packages/torchtext/vocab/vocab.py", line 158, in get_itos
    return self.vocab.get_itos()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1265, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Vocab' object has no attribute 'vocab'

I'm using the provided versions:

Python 3.10.10
torch==1.13.0
torchtext==0.14.0
transformers==4.24.0

Can you help me resolve this issue?

Thanks & Regards,
Haneen Abdulrhman
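
One thing that may be worth ruling out (an assumption, not a confirmed cause): an `AttributeError` on an unpickled object can occur when the library version that reads the pickle differs from the one that wrote it, or when the runtime imports a different version than the one that was pinned. The sketch below compares the versions listed above with what the interpreter actually sees; the helper names are hypothetical.

```python
from importlib import metadata

# Versions listed in the report above.
PINNED = {"torch": "1.13.0", "torchtext": "0.14.0", "transformers": "4.24.0"}


def installed_versions(names):
    """Look up the installed version of each distribution (None if absent)."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None
    return found


def version_mismatches(pinned, installed):
    """Return {name: (wanted, found)} for every package that differs from its pin."""
    bad = {}
    for name, want in pinned.items():
        have = installed.get(name)
        # Ignore local build suffixes such as "+cu116" when comparing.
        base = have.split("+")[0] if have else None
        if base != want:
            bad[name] = (want, have)
    return bad


if __name__ == "__main__":
    print(version_mismatches(PINNED, installed_versions(PINNED)))
```

An empty dict means the environment matches the pins, in which case the version used to create the pickled vocabulary would be the next thing to check.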
