arabicner's People

Contributors

annakholkina, mohammedkhalilia, mustafajarrar, naghamghanim, tymaahammouda


arabicner's Issues

How to load the model uploaded to HF using transformers?

Hi,

Thanks for your efforts developing the model.
I am trying to load the NER model, but the output looks wrong: every token is tagged B-FAC with a very low score. Do you have any idea why that is the case?

  • Code:
from transformers import pipeline
from transformers import AutoTokenizer, AutoModelForTokenClassification

NER_model_name = "SinaLab/ArabicWojood-FlatNER"
tokenizer = AutoTokenizer.from_pretrained(NER_model_name)
model = AutoModelForTokenClassification.from_pretrained(NER_model_name)

ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer)

def get_ne(text):
    output = ner_pipeline(text)
    return {"text": text, "entities": output}

get_ne("أنا اسمي محمد")
  • Output:
{'text': 'أنا اسمي محمد',
 'entities': [{'entity': 'B-FAC',
   'score': 0.111766666,
   'index': 1,
   'word': 'انا',
   'start': 0,
   'end': 3},
  {'entity': 'B-FAC',
   'score': 0.08029653,
   'index': 2,
   'word': 'اسمي',
   'start': 4,
   'end': 8},
  {'entity': 'B-FAC',
   'score': 0.072174296,
   'index': 3,
   'word': 'محمد',
   'start': 9,
   'end': 13}]}
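
A quick way to see how uncertain these predictions are is to filter the pipeline output by score. The sketch below hard-codes the three entities from the output above (scores rounded) and uses a hypothetical helper, `low_confidence`, with an arbitrary 0.5 threshold; neither is part of `transformers`. Uniformly low scores like these can be a sign that the token-classification head was not restored from the checkpoint, but that is only an assumption about this model.

```python
from typing import Dict, List


def low_confidence(entities: List[Dict], threshold: float = 0.5) -> List[Dict]:
    """Return the predictions whose score falls below `threshold`."""
    return [e for e in entities if e["score"] < threshold]


# Entities copied from the output in this issue (scores rounded to 4 places).
entities = [
    {"entity": "B-FAC", "score": 0.1118, "word": "انا"},
    {"entity": "B-FAC", "score": 0.0803, "word": "اسمي"},
    {"entity": "B-FAC", "score": 0.0722, "word": "محمد"},
]

suspicious = low_confidence(entities)
print(f"{len(suspicious)} of {len(entities)} predictions are below the threshold")
```

If every prediction ends up in `suspicious`, checking the warnings printed by `from_pretrained` for newly initialized weights would be a reasonable next step.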

How to calculate Cohen's kappa in an NER setting without using "O"?

Thanks for your paper. I have a question about how kappa is calculated when the "O" label is excluded from the agreement, i.e. $\kappa_{\sim o}$. Let me give an example to explain my concern.

In Table 6 of the paper, the "ORG" entity tag has 1713 TP, 33 FN and 30 FP. As you mention, the total number of annotated tokens is 24K (first paragraph of Sec 3.4), so TN (true negatives) is $24000-1713-33-30=22224$.

If you include "O" in the agreement between the two annotations, then $TN=22224, TP=1713, FN=33, FP=30$ and the total is $24000$. So $p_o$ in $\kappa_o$ is $(1713+22224)/24000=0.997375$, and $p_e=((1713+30)\times(1713+33)+(22224+30)\times(22224+33))/(24000^2)\approx 0.86519$, so $\kappa_o=(p_o-p_e)/(1-p_e)\approx 0.980528 \approx 0.981$, which is consistent with your result in Table 6.

However, when we exclude "O", TN should be 0. Simply using $TN=0, TP=1713, FN=33, FP=30$ and a total of $1776$ to compute $\kappa_{\sim o}$ gives a negative result ($-0.018015$), since changing TN to 0 greatly increases the chance-agreement baseline in kappa, i.e. $p_e$. This problem is common when computing Cohen's kappa in an NER setting, because the "negative samples" are hard to define: using all "O" tokens may give a biased result since the data is imbalanced, while excluding "O" unexpectedly gives a negative result.

So my question is: how do you compute $\kappa_{\sim o}$ here, given that Table 6 reports $\kappa_{\sim o}=0.974$ for the "ORG" tag, and so on? How do you define $p_o$ and $p_e$ when excluding "O"? Can you give a concrete defining equation?
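
The arithmetic in this issue can be reproduced with a few lines of Python. This is a minimal sketch of the standard two-category Cohen's kappa applied to the ORG counts quoted above; the function name `cohen_kappa` is illustrative, and this is not the paper's definition of $\kappa_{\sim o}$, which is exactly what the issue asks about.

```python
def cohen_kappa(tp: int, fn: int, fp: int, tn: int) -> float:
    """Cohen's kappa for a 2x2 agreement table between two annotators."""
    n = tp + fn + fp + tn
    # Observed agreement: both annotators assign the same label.
    p_o = (tp + tn) / n
    # Chance agreement: product of the marginals for each label, summed.
    p_e = ((tp + fp) * (tp + fn) + (tn + fp) * (tn + fn)) / n**2
    return (p_o - p_e) / (1 - p_e)


# ORG counts from Table 6, with "O" tokens counted as true negatives.
kappa_with_o = cohen_kappa(tp=1713, fn=33, fp=30, tn=22224)
# Same counts with "O" excluded naively by setting TN to 0.
kappa_without_o = cohen_kappa(tp=1713, fn=33, fp=30, tn=0)

print(round(kappa_with_o, 3))     # 0.981
print(round(kappa_without_o, 3))  # -0.018
```

With TN set to 0, $p_e$ rises to about 0.9652 and slightly exceeds $p_o \approx 0.9645$, which is why the naive computation turns negative.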

Error when a non-UTF-8 character occurs in the text during inference

Hi,

I got the following error while running inference on some text samples. After some investigation, it seems that the error occurs whenever the input text contains a non-UTF-8 character. In that case, the sizes of the pred and segment arrays in "arabiner/trainers/BertNestedTrainer.py", lines 187-188, differ by more than 1 because of the non-UTF-8 character(s) in the input text. (To be confirmed?)

Traceback (most recent call last):
  File "ner_tester.py", line 35, in <module>
    run_inference_for_file(file_path)
  File "ner_tester.py", line 23, in run_inference_for_file
    batch_list = inference_model.predict(ner_inputs, lang)
  File "/app/model/ner_inference.py", line 148, in predict
    segments = self.tagger.infer(dataloader)
  File "/app/arabiner/trainers/BertNestedTrainer.py", line 174, in infer
    segments = self.to_segments(segments, preds, valid_lens, dataloader.dataset.vocab)
  File "/app/arabiner/trainers/BertNestedTrainer.py", line 193, in to_segments
    for tag_id, vocab in zip(pred[i, :].int().tolist(), vocab.tags[1:])]
IndexError: index 146 is out of bounds for dimension 0 with size 146

You may want to run the inference code using the following text sample to reproduce the error:

text_sampel = "يبدو أن فكر التنظيم الداعشيّ -الذي ينتشر بصورةٍ واسعة عبر وسائل التواصل الاجتماعي، ومقاطع فيديو دعائية بارعة- قد نجح في إلهام موجةٍ من العنف على مدار ما يزيد عن عامٍ: تتضمن إطلاق النار في سان بيرناردينيو بكاليفورنيا، وقتل العديد من رواد مقهى للمثليين بأورلاندو في شهر ‏ ‏‏يونيو/‏‏حزيران، والهجمة القاتلة في أول شهر ‏يوليو/‏تموز 2016 على مقهى آخر ببنغلاديش. يُضاف إليها الهجمات التي يُرجح أن واضعي خططها هم أكبر مُهندسي العمليات في الدولة الإسلامية، مثل هجمات باريس في نوفمبر /‏تشرين الثاني 2015، وتفجيرات بروكسل في مارس/‏آذار ‏2016. ‏"
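
To narrow down which characters are involved, one option is to scan the input for invisible Unicode format characters before tagging. The sketch below assumes the offending characters are format marks such as U+200F RIGHT-TO-LEFT MARK (Unicode category `Cf`), which are valid UTF-8 but invisible; the issue does not confirm this, and the helper names `find_format_chars` and `strip_format_chars` are hypothetical.

```python
import unicodedata
from typing import List, Tuple


def find_format_chars(text: str) -> List[Tuple[int, str, str]]:
    """List (index, code point, name) for every Unicode format character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_format_chars(text: str) -> str:
    """Drop all characters in Unicode category Cf (format)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


# Assumed fragment of the sample, with the invisible mark written explicitly.
sample = "يونيو/\u200fحزيران"
print(find_format_chars(sample))
print(len(sample), len(strip_format_chars(sample)))
```

Stripping every `Cf` character is a blunt instrument: it also removes U+200C ZERO WIDTH NON-JOINER, which some scripts use legitimately, so a narrower filter may be preferable.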

Question about future work (tag sub-entity of the same class)

Thank you for your article.
I am working on a nested NER task for clinical notes, using a similar architecture with some additions such as contrastive learning.
So I have the same problem with nested entities of the same type. What’s your idea for a solution?

Any suggestions would be helpful for me.

No module named 'arabiner'

Hello
I am trying to run the model but got the following error:

Traceback (most recent call last):
  File "/notebooks/arabiner/bin/train.py", line 8, in <module>
    from arabiner.utils.data import get_dataloaders, parse_conll_files
ModuleNotFoundError: No module named 'arabiner'

This is my command:
python arabiner/bin/train.py --output_path data/output/dir --train_path data/train.txt --val_path data/val.txt --test_path data/test.txt --batch_size 8 --data_config '{"fn":"arabiner.data.datasets.DefaultDataset","kwargs":{"max_seq_len":512}}' --trainer_config '{"fn":"arabiner.trainers.BertTrainer","kwargs":{"max_epochs":50}}' --network_config '{"fn":"arabiner.nn.BertSeqTagger","kwargs":{"dropout":0.1,"bert_model":"aubmindlab/bert-base-arabertv2"}}' --optimizer '{"fn":"torch.optim.AdamW","kwargs":{"lr":0.0001}}'

Can you help please?
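
When Python runs a script, the first entry on `sys.path` is the script's own directory (here `arabiner/bin`), not the directory the command was launched from, so `import arabiner` only works if the repository root is on the path or the package is installed. A minimal sketch of one workaround follows, assuming the layout `<repo>/arabiner/bin/train.py` shown in the traceback; `repo_root_for` is a hypothetical helper, not part of the project.

```python
import sys
from pathlib import PurePosixPath


def repo_root_for(script_path: str) -> str:
    """Return the directory two levels above the script's folder,
    i.e. the directory that contains the `arabiner/` package."""
    return str(PurePosixPath(script_path).parents[2])


# Script path taken from the traceback in this issue.
root = repo_root_for("/notebooks/arabiner/bin/train.py")

# Put the repository root first so `import arabiner` can resolve.
if root not in sys.path:
    sys.path.insert(0, root)

print(root)  # /notebooks
```

Setting the `PYTHONPATH` environment variable to the repository root before launching the script has the same effect without touching the code.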

Only O and <pad> in inference mode

Hi! Your work looks great! I tried to train my own model for Russian. I prepared train/val/test files in the same format as yours and swapped the pretrained BERT for a different one. These are my args:

python arabiner/bin/train.py \
    --output_path ./ArabicNER/output \
    --train_path ./ArabicNER/data/train.txt \
    --val_path ./ArabicNER/data/val.txt \
    --test_path ./ArabicNER/data/test.txt \
    --batch_size 8 \
    --data_config '{"fn":"arabiner.data.datasets.NestedTagsDataset","kwargs":{"max_seq_len":512}}' \
    --trainer_config '{"fn":"arabiner.trainers.BertNestedTrainer","kwargs":{"max_epochs":50}}' \
    --network_config '{"fn":"arabiner.nn.BertNestedTagger","kwargs":{"dropout":0.1,"bert_model":"DeepPavlov/rubert-base-cased-conversational"}}' \
    --optimizer '{"fn":"torch.optim.AdamW","kwargs":{"lr":0.0001}}'

The model trained well with these args. Metrics on the test set:
[image: test-set metrics]
But when I run inference on text, I get only 'O' or <pad> in the output, even for an example from train.txt:
[image: inference output]
In this example the second word is B-PER. The model did not predict an entity in any other example either.
Code to run inference:

python -u ./ArabicNER/arabiner/bin/infer.py \
    --model_path ./ArabicNER/output \
    --text "привет андрей"

Can you help me with this problem?

PKL file loading error in inference code

Hello,

I am encountering an error when trying to run the inference to perform tagging.

Command
!python /content/ArabicNER-Wojood/arabiner/bin/infer.py \
    --model_path /content/ArabicNER-Wojood \
    --text "وثائق نفوس شخصية من الفترة العثمانية للسيد نعمان عقل"

Code
Colab

Error message:
Traceback (most recent call last):
  File "/content/ArabicNER-Wojood/arabiner/bin/infer.py", line 73, in <module>
    main(parse_args())
  File "/content/ArabicNER-Wojood/arabiner/bin/infer.py", line 61, in main
    segments = tagger.infer(dataloader)
  File "/content/ArabicNER-Wojood/arabiner/trainers/BertNestedTrainer.py", line 167, in infer
    for _, gold_tags, tokens, valid_len, logits in self.tag(
  File "/content/ArabicNER-Wojood/arabiner/trainers/BertNestedTrainer.py", line 125, in tag
    for subwords, gold_tags, tokens, mask, valid_len in dataloader:
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 628, in __next__
    data = self._next_data()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/dataloader.py", line 671, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/data/_utils/fetch.py", line 58, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/content/ArabicNER-Wojood/arabiner/data/datasets.py", line 121, in __getitem__
    subwords, tags, tokens, masks, valid_len = self.transform(self.examples[item])
  File "/content/ArabicNER-Wojood/arabiner/data/transforms.py", line 84, in __call__
    vocab_tags = "|".join([t for t in vocab.get_itos() if "-" in t])
  File "/usr/local/lib/python3.10/dist-packages/torchtext/vocab/vocab.py", line 158, in get_itos
    return self.vocab.get_itos()
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1265, in __getattr__
    raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Vocab' object has no attribute 'vocab'

I'm using the provided versions:

Python 3.10.10
torch==1.13.0
torchtext==0.14.0
transformers==4.24.0

Can you help me resolve this issue?

Thanks & Regards,
Haneen Abdulrhman
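
One possible cause (an assumption, not confirmed in this issue) is that the pickled torchtext `Vocab` is being loaded under a different torchtext version than the one that saved it. Since a hosted notebook runtime may ship its own preinstalled packages, a sketch like the following compares what is actually installed against the pinned versions listed above. `PINNED`, `installed_version`, and `version_mismatches` are illustrative names.

```python
from importlib import metadata
from typing import Callable, Dict, Tuple

# Versions the issue says are expected.
PINNED = {"torch": "1.13.0", "torchtext": "0.14.0", "transformers": "4.24.0"}


def installed_version(package: str) -> str:
    """Return the installed version of `package`, or a marker if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


def version_mismatches(
    pinned: Dict[str, str],
    lookup: Callable[[str], str] = installed_version,
) -> Dict[str, Tuple[str, str]]:
    """Map each package whose installed version differs to (wanted, found)."""
    mismatches = {}
    for package, wanted in pinned.items():
        # Ignore local build suffixes such as "+cu117" on CUDA wheels.
        found = lookup(package).split("+")[0]
        if found != wanted:
            mismatches[package] = (wanted, found)
    return mismatches


print(version_mismatches(PINNED))
```

An empty dictionary means the runtime matches the pinned versions, in which case the mismatch would have to lie elsewhere, for example in the environment that produced the pickle.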
