
arabic-ner's Introduction

ARABIC NER TRAINING:

onto_to_spacy_json.py reads in a directory of OntoNotes 5 annotations and creates training data in spaCy's JSON training format. The program currently extracts only the NER tags, ignoring POS (and dependencies, which are not natively in OntoNotes).

Use it like this (if you installed spaCy via pip):

python onto_to_spacy_json.py -i "ontonotes-release-5.0/data/arabic/annotations/nw/ann/00" -t "ar_train.json" -e "ar_eval.json" -v 0.1
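For reference, the converter's output follows spaCy 2.x's JSON training format, roughly the shape sketched below (a hand-written example with placeholder text, keeping only the NER-relevant fields; compare the token dicts shown in the issues further down):

[
  {
    "id": 0,
    "paragraphs": [
      {
        "raw": "...",
        "sentences": [
          {
            "tokens": [
              {"id": 0, "orth": "...", "tag": "", "head": 0, "dep": "", "ner": "U-GPE"},
              {"id": 1, "orth": "...", "tag": "", "head": 0, "dep": "", "ner": "O"}
            ]
          }
        ]
      }
    ]
  }
]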

Use it like this to train the Arabic NER model:

 python -m spacy train ar ar_test_output_all ar_train_all.json ar_eval_all.json --no-tagger --no-parser

To load the model and use it, take a look at the file test_spacy_model.ipynb.
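For a quick sanity check outside the notebook, a minimal sketch (the model path and the test sentence are placeholders):

import spacy

nlp = spacy.load("ar_test_output_all/model-final")  # any of the trained model directories
doc = nlp("...")  # replace with an Arabic test sentence
for ent in doc.ents:
    print(ent.text, ent.label_)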

Use it like this if you built spaCy yourself (v2.0.9 in my case); the difference is that you need to give the directory of the output model:

python -m spacy train ar /Users/yanliang/arabicNer/data/ar_output_all /Users/yanliang/arabicNer/data/ar_train_all.json /Users/yanliang/arabicNer/data/ar_eval_all.json --no-tagger --no-parser

Rehearsing OntoNotes to prevent forgetting

rehearsal.py is a script that generates a new Prodigy dataset containing both NER labeled examples from a given dataset, as well as a number of OntoNotes examples per annotation. Mixing in old gold standard annotations prevents catastrophic forgetting.
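Conceptually, the mixing step looks like the sketch below (a simplification of my own; rehearsal.py itself reads the Prodigy DB and the hard-coded OntoNotes directory, and the names here are made up):

import random

def mix_rehearsal(prodigy_examples, onto_examples, n_per_annotation):
    # Add n_per_annotation OntoNotes examples for every Prodigy annotation,
    # so the old gold-standard labels keep appearing during training.
    needed = n_per_annotation * len(prodigy_examples)
    mixed = list(prodigy_examples)
    mixed.extend(random.sample(onto_examples, min(needed, len(onto_examples))))
    random.shuffle(mixed)
    return mixed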

For example, the following will augment the annotations in the loc_ner_db dataset with OntoNotes annotations:

python rehearsal.py "loc_ner_db" 5

The augmented data is written to a dataset called augmented_for_training, which should be treated as temporary because the script overwrites it each time. NER training can then be performed as usual:

prodigy ner.batch-train augmented_for_training en_core_web_sm --eval-split 0.2 

Steps for mixing OntoNotes data into the Prodigy data and training with Prodigy.

First of all, if you don't have Prodigy installed locally, you need to install it and create a dataset (backed by SQLite by default) into which to import your Prodigy data.

Create the SQLite DB through Prodigy:

python3 -m prodigy dataset arabicner "train arabic ner"

Import the JSONL data that you exported from the Prodigy app:

python3 -m prodigy db-in arabicner single_arabic_ner.jsonl 

Rehearse your dataset with OntoNotes. The directory for the OntoNotes data is hard-coded in rehearsal.py, so you need to edit it there. (The 5 here means that 5 times as many OntoNotes records as Prodigy records will be mixed into the Prodigy data.)

python3 rehearsal.py "arabicner" 5

Finally, train on your data with the following command:

python3 -m prodigy ner.batch-train augmented_for_training /home/yan/arabicner/Arabic-NER/testmodel/model8 --eval-split 0.2

Remark:

If you want to explore the SQLite DB for Prodigy, go to your home directory and run sqlite3 .prodigy/prodigy.db. It has "dataset", "example", and "link" tables, and your data will be under the example table.
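The same poking around can be done from Python with the standard-library sqlite3 module (a sketch; the default DB path and the table names are the ones mentioned above):

import sqlite3
from pathlib import Path

conn = sqlite3.connect(str(Path.home() / ".prodigy" / "prodigy.db"))
# Should list the "dataset", "example" and "link" tables
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
# Your annotations are stored as rows of the example table
print(conn.execute("SELECT COUNT(*) FROM example").fetchone())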

The pretrained vector model can be too big for the training process to handle, so we prune the huge vector table first and then use it:

python3 generatePruningVectorModel.py -l ar -v 0.0.0
(The version is needed for later training, since spaCy will look at that field.) The directory of the .vec file is hard-coded in the script.
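Under the hood, the pruning boils down to spaCy 2.x's Vocab.prune_vectors; here is a sketch of what the script roughly does (the paths, row count, and version string are placeholders, not the actual generatePruningVectorModel.py):

import spacy

# Start from a model that already carries the full vectors, e.g. one created with
# `spacy init-model ar /tmp/ar_vectors_wiki_lg --vectors-loc ...`
nlp = spacy.load("/tmp/ar_vectors_wiki_lg")
nlp.vocab.prune_vectors(20000)     # keep only the 20k most frequent vector rows
nlp.meta["version"] = "0.0.0"      # spacy train later reads this field
nlp.to_disk("ar_pruned_vectors_model")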

You will then get a language model with pruned pretrained vectors; use this model to train your NER model with this command:

python3 -m spacy train ar /home/yan/arabicNER/Arabic-NER/experiments/exp2/ar_output_all /home/yan/arabicNER/Arabic-NER/data/combined.json /home/yan/arabicNER/Arabic-NER/data/ar_eval_all.json --no-tagger --no-parser --vectors "/the_model_you_just_got"

After that you will get several NER models as output, say model0. If you run the training now with this command,

python3 -m prodigy ner.batch-train augmented_for_training /nermodel --eval-split 0.2

spaCy will throw an error: it does not like the /vocab defined in this NER model and raises an exception. What I did instead: the NER model contains a ner directory; copy this ner directory into the pruned-vectors language model, then update that model's meta.json, and point prodigy ner.batch-train at the language model (with the ner directory added and meta.json updated). It runs successfully this way.
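A sketch of that workaround (the directory names reuse the ones above; treating "pipeline" as the meta.json field to update is my assumption):

import json
import shutil
from pathlib import Path

ner_model = Path("/nermodel/model0")            # a model produced by spacy train
vec_model = Path("ar_pruned_vectors_model")     # the pruned-vectors language model

# Copy the trained ner component into the language model
shutil.copytree(ner_model / "ner", vec_model / "ner")

# Register the ner pipe in the language model's meta.json
meta_path = vec_model / "meta.json"
meta = json.loads(meta_path.read_text(encoding="utf8"))
meta.setdefault("pipeline", []).append("ner")
meta_path.write_text(json.dumps(meta, indent=2, ensure_ascii=False), encoding="utf8")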

arabic-ner's People

Contributors: ahalterman, yanliang1102


arabic-ner's Issues

Issue when using a customized pretrained vector.

yan@hanover:~/ou-spacy/spaCy$ python3 -m spacy init-model ar /tmp/ar_vectors_wiki_lg --vectors-loc ../cc.ar.300.bin.gz
Reading vectors from ../cc.ar.300.bin.gz
Open loc
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/yan/ou-spacy/spaCy/spacy/main.py", line 31, in
plac.call(commands[command], sys.argv[1:])
File "/home/yan/.local/lib/python3.5/site-packages/plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "/home/yan/.local/lib/python3.5/site-packages/plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/yan/ou-spacy/spaCy/spacy/cli/init_model.py", line 49, in init_model
vectors_data, vector_keys = read_vectors(vectors_loc) if vectors_loc else (None, None)
File "/home/yan/ou-spacy/spaCy/spacy/cli/init_model.py", line 111, in read_vectors
shape = tuple(int(size) for size in next(f).split())
File "/home/yan/ou-spacy/spaCy/spacy/cli/init_model.py", line 64, in
return (line.decode('utf8') for line in gzip.open(str(loc), 'r'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

spaCy training performance with different configurations and setups

spaCy training output

Columns: 'dep_loss', 'tag_loss', 'uas', 'tags_acc', 'token_acc', 'ents_p', 'ents_r', 'ents_f', 'cpu_wps', 'gpu_wps' (the leading 29 in each row is presumably the iteration number).

eval_data = eval_data:
29, 0.000, 11.698, 0.000, 58.589, 49.871, 53.880, 91.894, 85.899, 15363.7, 0.0

eval_data = training_data (shows the model does learn):
29, 0.000, 13.261, 0.000, 81.946, 73.552, 77.523, 91.815, 85.866, 13092.9, 0.0

Based on these numbers I think the model does work; we just don't have enough data. We only have 401 tagged documents in the OntoNotes data, and all the entity tags come from those 401 documents.

Exception that spaCy throws during training; I made the code swallow the exception. (A sketch for spotting these malformed sequences up front follows the tag dump below.)

[E067] Invalid BILUO tag sequence: Got a tag starting with 'I' (inside an entity) without a preceding 'B' (beginning of an entity). Tag sequence:
['O', 'U-GPE', 'O', 'B-EVENT', 'I-EVENT', 'L-EVENT', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'L-EVENT', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'L-EVENT', 'B-ORG', 'L-ORG', 'U-GPE" S_OFF="1', 'U-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'B-EVENT', 'I-EVENT', 'L-EVENT', 'U-ORDINAL', 'B-DATE', 'I-DATE', 'L-DATE', 'O', 'O', 'B-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'B-FAC', 'L-FAC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'B-PERSON', 'L-PERSON', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-NORP', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'U-DATE', 'O', 'U-GPE', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'O', 'O', 'O', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'L-EVENT', 'O', 'B-GPE', 'I-GPE', 'L-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'U-GPE', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-FAC', 'L-FAC', 'O', 'U-GPE', 'O', 'O', 'B-TIME', 'I-TIME', 'L-TIME', 'U-DATE', 'O', 'B-FAC', 'I-FAC', 'I-FAC', 'L-FAC', 'O', 'O', 'O', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'L-EVENT', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'U-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-ORDINAL', 'O', 'O', 'O', 'O', 'U-EVENT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-ORDINAL" S_OFF="1', 'O', 'B-ORG', 'L-ORG', 'O', 'B-ORG', 'L-ORG', 'O', 'O', 'U-TIME" S_OFF="1', 'U-DATE', 'U-CARDINAL', 'B-FAC', 'I-FAC', 'I-FAC', 'I-FAC', 'I-FAC', 'L-FAC', 'U-GPE', 'O', 'B-ORG', 'L-ORG', 'U-CARDINAL', 'O', 'U-CARDINAL', 'O', 'B-CARDINAL', 'I-CARDINAL', 'L-CARDINAL', 'O', 'O', 'O', 'B-CARDINAL', 'L-CARDINAL', 'O', 'O', 'B-CARDINAL', 'L-CARDINAL', 'O', 'O', 'B-CARDINAL', 'L-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'U-DATE', 'U-CARDINAL', 'O', 'O', 'B-TIME', 'L-TIME', 'U-ORG', 'O', 'U-ORG', 'B-FAC', 'I-FAC', 'I-FAC', 'L-FAC', 'O', 'U-TIME', 'U-ORG', 'O', 'U-ORG', 'B-FAC', 'I-FAC', 'I-FAC', 'L-FAC', 'O', 'U-TIME', 'U-ORG', 'O', 'U-FAC', 'O', 'O', 'O', 'O', 'U-TIME', 'U-ORG', 'O', 'B-ORG', 'L-ORG', 'O', 'U-FAC', 'O', 'O', 'U-TIME', 'U-ORG', 'O', 'U-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'B-ORG', 'L-ORG', 'O', 'U-GPE" S_OFF="1', 'O', 'O', 'O', 'O', 'O', 'B-TIME', 'I-TIME', 'I-TIME', 'L-TIME', 'B-DATE', 'I-DATE', 'L-DATE', 'O', 'O', 'B-FAC', 'L-FAC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'O', 'O', 'U-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'U-ORDINAL', 'O', 'U-DATE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'O', 'B-DATE', 'L-DATE', 'O', 
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'U-DATE']
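As an alternative to patching spaCy, sentences with malformed sequences could be filtered out before training. A minimal sketch (my own illustration, not the change made inside spaCy) that catches the specific I-without-B problem E067 complains about:

def biluo_is_valid(tags):
    # Reject sequences where an I-/L- tag continues an entity that was never opened.
    open_label = None
    for tag in tags:
        prefix, _, label = tag.partition("-")
        if prefix in ("I", "L") and open_label != label:
            return False
        if prefix == "B":
            open_label = label
        elif prefix == "L" or prefix in ("U", "O"):
            open_label = None
    return True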

Error when using spaCy to train NER on the merged NER-tagged data

Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/yan/spacyou/spaCy/spacy/main.py", line 31, in
plac.call(commands[command], sys.argv[1:])
File "/home/yan/.local/lib/python3.5/site-packages/plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "/home/yan/.local/lib/python3.5/site-packages/plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/yan/spacyou/spaCy/spacy/cli/train.py", line 118, in train
optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
File "/home/yan/spacyou/spaCy/spacy/language.py", line 479, in begin_training
**cfg)
File "nn_parser.pyx", line 834, in spacy.syntax.nn_parser.Parser.begin_training
File "ner.pyx", line 88, in spacy.syntax.ner.BiluoPushDown.get_actions
ValueError: need more than 1 value to unpack

Training Result

@khaledJabr @ahalterman

1. With the pretrained pruned vectors and the spaCy-trained NER model, then updating the model only with the Prodigy-labeled data (about 800 tokens), we get this (no merged NER classes yet):
[image]

2. With no pretrained model, everything else the same as case 1, we got this (no merged NER classes yet; so yes, the pretrained model does help):
[image]

3. Trained with only the LDC data in Prodigy, with the pretrained spaCy NER model, otherwise like case 1:
[image]

4. Prodigy data plus 23 times the Prodigy-sized rehearsal data, otherwise like case 3.
Since we have 18670 OntoNotes tokens and 801 Prodigy-labeled tokens, we use 23 as the multiplier in order to use all of the data (18670/801 is about 23).
Of the 18670 training samples, 4122 with empty spans were removed.
[image]

5. With merged NER classes, other conditions like case 4:
[image]

6. With Khaled's cleaned data, other conditions like case 5:
[image]

Merge NER tags in ANERCorp and LDC OntoNotes

@ahalterman @cegme
Andy and Dr. Grant, any ideas? There are weird tags like 'B-DATE" E_OFF="5':
'head': 0,
'id': 18,
'ner': 'B-DATE" E_OFF="5',
'orth': '2001',
'tag': ''},
{'dep': '',
'head': 0,
'id': 19,
'ner': 'I-DATE" E_OFF="5',
'orth': '-',
'tag': ''},
{'dep': '',
'head': 0,
'id': 20,
'ner': 'L-DATE" E_OFF="5',
'orth': '2002 ',
'tag': ''},

I have no idea what this "5" means; the data only has 3 elements in the range, so why is it 5? (My guess is that these are the S_OFF/E_OFF character-offset attributes from the OntoNotes .name annotation leaking into the tag string when an entity boundary falls inside a token; a sketch for stripping these suffixes follows the ANERCorp tag list below.)
Here are all the NER tags in LDC OntoNotes, with counts:

{'-': 4,
 'B-CARDINAL': 149,
 'B-CARDINAL" E_OFF="1': 3,
 'B-CARDINAL" E_OFF="3': 1,
 'B-DATE': 1186,
 'B-DATE" E_OFF="1': 38,
 'B-DATE" E_OFF="5': 3,
 'B-DATE" S_OFF="1': 1,
 'B-EVENT': 422,
 'B-FAC': 470,
 'B-FAC" S_OFF="1': 4,
 'B-GPE': 575,
 'B-GPE" E_OFF="4': 1,
 'B-GPE" S_OFF="1': 3,
 'B-LANGUAGE': 6,
 'B-LAW': 127,
 'B-LOC': 293,
 'B-LOC" S_OFF="1': 1,
 'B-MONEY': 233,
 'B-NORP': 61,
 'B-ORDINAL': 34,
 'B-ORG': 2848,
 'B-ORG" E_OFF="1': 1,
 'B-ORG" S_OFF="1': 13,
 'B-PERCENT': 129,
 'B-PERSON': 3869,
 'B-PERSON" E_OFF="2': 1,
 'B-PERSON" S_OFF="1': 46,
 'B-PRODUCT': 55,
 'B-QUANTITY': 222,
 'B-QUANTITY" E_OFF="2': 6,
 'B-QUANTITY" E_OFF="3': 1,
 'B-QUANTITY" S_OFF="2': 2,
 'B-TIME': 340,
 'B-TIME" E_OFF="1': 7,
 'B-WORK_OF_ART': 159,
 'I-CARDINAL': 136,
 'I-CARDINAL" E_OFF="1': 9,
 'I-CARDINAL" E_OFF="3': 1,
 'I-DATE': 860,
 'I-DATE" E_OFF="1': 14,
 'I-DATE" E_OFF="5': 3,
 'I-DATE" S_OFF="1': 1,
 'I-EVENT': 973,
 'I-FAC': 494,
 'I-GPE': 98,
 'I-LAW': 233,
 'I-LOC': 97,
 'I-MONEY': 177,
 'I-NORP': 8,
 'I-ORDINAL': 4,
 'I-ORG': 3290,
 'I-ORG" E_OFF="1': 1,
 'I-ORG" S_OFF="1': 3,
 'I-PERCENT': 156,
 'I-PERSON': 917,
 'I-PERSON" S_OFF="1': 1,
 'I-PRODUCT': 96,
 'I-QUANTITY': 73,
 'I-QUANTITY" E_OFF="2': 6,
 'I-QUANTITY" E_OFF="3': 1,
 'I-QUANTITY" S_OFF="2': 2,
 'I-TIME': 260,
 'I-WORK_OF_ART': 579,
 'L-CARDINAL': 148,
 'L-CARDINAL" E_OFF="1': 3,
 'L-CARDINAL" E_OFF="3': 1,
 'L-DATE': 1187,
 'L-DATE" E_OFF="1': 38,
 'L-DATE" E_OFF="5': 3,
 'L-DATE" S_OFF="1': 1,
 'L-EVENT': 420,
 'L-FAC': 462,
 'L-FAC" S_OFF="1': 4,
 'L-GPE': 573,
 'L-GPE" E_OFF="4': 1,
 'L-GPE" S_OFF="1': 3,
 'L-LANGUAGE': 6,
 'L-LAW': 127,
 'L-LOC': 289,
 'L-LOC" S_OFF="1': 1,
 'L-MONEY': 233,
 'L-NORP': 61,
 'L-ORDINAL': 34,
 'L-ORG': 2827,
 'L-ORG" E_OFF="1': 1,
 'L-ORG" S_OFF="1': 13,
 'L-PERCENT': 129,
 'L-PERSON': 3837,
 'L-PERSON" E_OFF="2': 1,
 'L-PERSON" S_OFF="1': 46,
 'L-PRODUCT': 55,
 'L-QUANTITY': 222,
 'L-QUANTITY" E_OFF="2': 6,
 'L-QUANTITY" E_OFF="3': 1,
 'L-QUANTITY" S_OFF="2': 2,
 'L-TIME': 339,
 'L-TIME" E_OFF="1': 7,
 'L-WORK_OF_ART': 159,
 'O': 225156,
 'U-CARDINAL': 670,
 'U-DATE': 1149,
 'U-DATE" E_OFF="1': 12,
 'U-DATE" S_OFF="1': 1,
 'U-EVENT': 33,
 'U-FAC': 42,
 'U-FAC" E_OFF="2': 1,
 'U-FAC" S_OFF="1': 1,
 'U-GPE': 3228,
 'U-GPE" E_OFF="1': 1,
 'U-GPE" E_OFF="2': 1,
 'U-GPE" S_OFF="1': 76,
 'U-GPE" S_OFF="1" E_OFF="1': 7,
 'U-LANGUAGE': 41,
 'U-LAW': 26,
 'U-LOC': 71,
 'U-LOC" S_OFF="1': 1,
 'U-MONEY': 4,
 'U-NORP': 3386,
 'U-NORP" E_OFF="2': 1,
 'U-NORP" S_OFF="1': 2,
 'U-NORP" S_OFF="2': 1,
 'U-ORDINAL': 843,
 'U-ORDINAL" E_OFF="1': 2,
 'U-ORDINAL" E_OFF="3': 1,
 'U-ORG': 1552,
 'U-ORG" E_OFF="1': 7,
 'U-ORG" E_OFF="2': 3,
 'U-ORG" S_OFF="1': 36,
 'U-ORG" S_OFF="1" E_OFF="1': 5,
 'U-PERCENT': 8,
 'U-PERSON': 1554,
 'U-PERSON" S_OFF="1': 85,
 'U-PERSON" S_OFF="1" E_OFF="1': 1,
 'U-PERSON" S_OFF="1" E_OFF="2': 1,
 'U-PRODUCT': 21,
 'U-PRODUCT" E_OFF="1': 1,
 'U-PRODUCT" S_OFF="1': 1,
 'U-QUANTITY': 121,
 'U-QUANTITY" E_OFF="1': 5,
 'U-TIME': 100,
 'U-TIME" E_OFF="1': 3,
 'U-WORK_OF_ART': 50}

And here are the NER tag classes in ANERCorp:

{'B-LOC': 542,
 'B-MISC': 336,
 'B-ORG': 1050,
 'B-PERS': 2098,
 'I-LOC': 55,
 'I-MISC': 220,
 'I-ORG': 336,
 'I-PERS': 747,
 'L-LOC': 542,
 'L-MISC': 336,
 'L-ORG': 1050,
 'L-PERS': 2098,
 'O': 133705,
 'U-LOC': 3894,
 'U-MISC': 782,
 'U-ORG': 973,
 'U-PERS': 1508,
 'U-ts': 1}
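As mentioned above, the offset suffixes can be stripped before training with something like this sketch (my own normalization, assuming the junk always starts at the first double quote):

def clean_biluo_tag(tag):
    # 'U-GPE" S_OFF="1'  ->  'U-GPE'; plain tags like 'O' or 'B-DATE' pass through
    return tag.split('"', 1)[0].rstrip()

assert clean_biluo_tag('B-DATE" E_OFF="5') == 'B-DATE'
assert clean_biluo_tag('U-GPE" S_OFF="1" E_OFF="1') == 'U-GPE'
assert clean_biluo_tag('O') == 'O'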

Remove dependency parse files from spaCy formatter

The old spaCy OntoNotes converter has code to import dependency parse info from OntoNotes 5.0, but my copy doesn't have any, nor does the documentation mention it. The .parse.dep files it uses might be the output of another tool.

Take out all the dependency parse code in the formatter so it just uses the data we have available.

Prodigy training cannot handle big pretrained vectors, so we need to prune the vectors

Two things to try:
1. Prune the vectors before using spaCy to train, and use the output vectors (the pruning outputs a language model; copy the vectors into the model). We get an error when using this model to train on the mixed-in data. --failed
2. Prune the vectors, then use spaCy to train with these vectors to get a model, and use this model to train on the mixed-in data again. --let's see.

Need to make an Arabic language model and Arabic NER work under spaCy

Things that need to be done:

Train Arabic Language Model

  1. we need stopwords
  2. infixes, prefixes, and suffixes
  3. maybe include the lemmatizer that Khaled has so far. --Khaled
  4. collect Arabic wiki articles and, together with our LexisNexis Arabic data, use gensim to train the word vectors needed to train the parser and tagger with spaCy (see the sketch after this list). --Yan
  5. implement the necessary classes and get the language model trained!
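For item 4, a minimal gensim sketch (the corpus path, vector size, and output name are placeholders; the size keyword assumes gensim 3.x):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One tokenized Arabic sentence per line (wiki + LexisNexis articles combined)
sentences = LineSentence("arabic_corpus_tokenized.txt")
model = Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)
# word2vec text format, which `spacy init-model --vectors-loc` can read
model.wv.save_word2vec_format("ar_vectors.vec")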

Train Arabic NER Model

Using OntoNotes together with the Prodigy data we have, we should be able to get about 66K records of training data. We need to write a customized NER model for Arabic in spaCy and get it trained.

@ahalterman @khaledJabr

Arabic data issue and potential fixes

I had the chance to look at the training data we are using for this, and there are two main issues with the data:

  1. The training data includes diacritics. Diacritics are extra short vowels added to Arabic words to help with pronunciation and with differentiating the meanings of two or more otherwise matching words, and usually this is only needed at the lemma level. Diacritics are not used in modern Arabic writing. This includes our news sources and the data we collected from the coders using the Prodigy interface. I suspect that this might be one of the things hurting the NER model. One really important thing to look at here is whether the word embeddings we are using were trained on data with diacritics or not. I don't have a clear answer for how this has or could have affected our training, but my main intuition is that normalizing/standardizing our data as much as we can is always a good thing.

  2. Aside from the diacritics, I have noticed that most (if not all) of the tokens (the actual tokens, the ones stored as orth) have an extra space at the end, and a lot of them have weird extra characters. Here are some examples:

'orth': 'ال{ِسْتِسْلامُ '
'orth': '-مُعالَجَةِ '
'orth': '-{ِعْتِباراتِ- '

Although many of these have an NER label of O, I still think they are worth fixing. Here is how I would go about fixing both issues (there are other ways, but this is the first thing that comes to mind):

import re
import pyarabic.araby as araby

text = '-آمِلَةً '
no_diacritics = araby.strip_tashkeel(text)  # removes all diacritics
just_arabic_text = re.sub(r'\W+', '', no_diacritics)  # removes everything else but the word; assumes only one word in orth
just_arabic_text


output : 
آملة
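Applied to the whole training set, the same cleaning could be run over every token's orth field; a sketch assuming the spaCy JSON layout used above (note that punctuation-only tokens would end up empty and may need filtering):

import json
import re
import pyarabic.araby as araby

def clean_orth(text):
    return re.sub(r'\W+', '', araby.strip_tashkeel(text))

with open("combined.json", encoding="utf8") as f:
    docs = json.load(f)

for doc in docs:
    for para in doc["paragraphs"]:
        for sent in para["sentences"]:
            for token in sent["tokens"]:
                token["orth"] = clean_orth(token["orth"])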

One last thing: do we have a key or a table somewhere that lists the labels we are using in our big NER dataset (the combined one)?

ValueError: Can't read file: ar_test1/model0/accuracy.json

python -m spacy train ar ./ar_test1 ./data/train.json ./data/dev.json
Training pipeline: ['tagger', 'parser', 'ner']
Starting with blank model 'ar'
Counting training words (limit=0)

Itn Dep Loss NER Loss UAS NER P NER R NER F Tag % Token % CPU WPS GPU WPS


✔ Saved model to output directory
ar_test1/model-final

Traceback (most recent call last):
File "/home/zakaria/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 281, in train
scorer = nlp_loaded.evaluate(dev_docs, debug)
File "/home/zakaria/.local/lib/python3.6/site-packages/spacy/language.py", line 631, in evaluate
docs, golds = zip(*docs_golds)
ValueError: not enough values to unpack (expected 2, got 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/zakaria/anaconda3/envs/py/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/zakaria/anaconda3/envs/py/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/zakaria/.local/lib/python3.6/site-packages/spacy/main.py", line 35, in
plac.call(commands[command], sys.argv[1:])
File "/home/zakaria/.local/lib/python3.6/site-packages/plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "/home/zakaria/.local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/zakaria/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 368, in train
best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
File "/home/zakaria/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 425, in _collate_best_model
bests[component] = _find_best(output_path, component)
File "/home/zakaria/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 444, in _find_best
accs = srsly.read_json(epoch_model / "accuracy.json")
File "/home/zakaria/.local/lib/python3.6/site-packages/srsly/_json_api.py", line 50, in read_json
file_path = force_path(location)
File "/home/zakaria/.local/lib/python3.6/site-packages/srsly/util.py", line 21, in force_path
raise ValueError("Can't read file: {}".format(location))
ValueError: Can't read file: ar_test1/model0/accuracy.json

Cleaning code

@YanLiang1102, can you post the code that produces combined_cleaned_removed (from exp 5)? Then @khaledJabr can take a look and we can make sure all the data's in the right/same format.

Add code for converting from OntoNotes to Prodigy-style

In the second phase, when we're using our Prodigy annotations, we'll need to mix in old OntoNotes annotations to keep spaCy from forgetting them. We should have code that can convert OntoNotes into the Prodigy span format so we can intermingle them.
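A sketch of what that conversion could look like, going from a BILUO-tagged token list to Prodigy's {"text", "spans"} task format (character offsets; the helper name is made up):

def biluo_to_prodigy(tokens, tags):
    # tokens: list of strings; tags: aligned BILUO tags such as ["U-GPE", "O", ...]
    text, spans, start, offset = "", [], None, 0
    for tok, tag in zip(tokens, tags):
        tok_start, tok_end = offset, offset + len(tok)
        prefix, _, label = tag.partition("-")
        if prefix == "U":
            spans.append({"start": tok_start, "end": tok_end, "label": label})
        elif prefix == "B":
            start = tok_start
        elif prefix == "L" and start is not None:
            spans.append({"start": start, "end": tok_end, "label": label})
            start = None
        text += tok + " "
        offset = tok_end + 1
    return {"text": text.rstrip(), "spans": spans}

# e.g. biluo_to_prodigy(["Obama", "visited", "Paris"], ["U-PERSON", "O", "U-GPE"])
# -> {"text": "Obama visited Paris",
#     "spans": [{"start": 0, "end": 5, "label": "PERSON"},
#               {"start": 14, "end": 19, "label": "GPE"}]}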

Question

Let's say I have a CSV file with tags; how can I reproduce the model (just the steps)?
