
arabic-ner's Introduction

ARABIC NER TRAINING:

onto_to_spacy_json.py reads in a directory of OntoNotes 5 annotations and creates training data in spaCy's JSON training format. The program currently extracts only the NER tags, ignoring POS (and dependencies, which are not natively in OntoNotes).

Use it like this (if you installed spaCy via pip):

python onto_to_spacy_json.py -i "ontonotes-release-5.0/data/arabic/annotations/nw/ann/00" -t "ar_train.json" -e "ar_eval.json" -v 0.1
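For reference, the converter's output follows spaCy 2.x's JSON training format, roughly the shape sketched below (a hand-written example with placeholder text, keeping only the NER-relevant fields; compare the token dicts shown in the issues further down):

[
  {
    "id": 0,
    "paragraphs": [
      {
        "raw": "...",
        "sentences": [
          {
            "tokens": [
              {"id": 0, "orth": "...", "tag": "", "head": 0, "dep": "", "ner": "U-GPE"},
              {"id": 1, "orth": "...", "tag": "", "head": 0, "dep": "", "ner": "O"}
            ]
          }
        ]
      }
    ]
  }
]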

Use it like this to train the Arabic NER model:

 python -m spacy train ar ar_test_output_all ar_train_all.json ar_eval_all.json --no-tagger --no-parser

To load the model and use it, take a look at the file test_spacy_model.ipynb.
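For a quick sanity check outside the notebook, a minimal sketch (the model path and the test sentence are placeholders):

import spacy

nlp = spacy.load("ar_test_output_all/model-final")  # any of the trained model directories
doc = nlp("...")  # replace with an Arabic test sentence
for ent in doc.ents:
    print(ent.text, ent.label_)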

Use it like this if you built spaCy yourself (v2.0.9 in my case); the difference is that you need to give the directory of the output model:

python -m spacy train ar /Users/yanliang/arabicNer/data/ar_output_all /Users/yanliang/arabicNer/data/ar_train_all.json /Users/yanliang/arabicNer/data/ar_eval_all.json --no-tagger --no-parser

Rehearsing OntoNotes to prevent forgetting

rehearsal.py is a script that generates a new Prodigy dataset containing both NER labeled examples from a given dataset, as well as a number of OntoNotes examples per annotation. Mixing in old gold standard annotations prevents catastrophic forgetting.
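Conceptually, the mixing step looks like the sketch below (a simplification of my own; rehearsal.py itself reads the Prodigy DB and the hard-coded OntoNotes directory, and the names here are made up):

import random

def mix_rehearsal(prodigy_examples, onto_examples, n_per_annotation):
    # Add n_per_annotation OntoNotes examples for every Prodigy annotation,
    # so the old gold-standard labels keep appearing during training.
    needed = n_per_annotation * len(prodigy_examples)
    mixed = list(prodigy_examples)
    mixed.extend(random.sample(onto_examples, min(needed, len(onto_examples))))
    random.shuffle(mixed)
    return mixed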

For example, the following will augment the annotations in the loc_ner_db dataset with OntoNotes annotations:

python rehearsal.py "loc_ner_db" 5

The augmented data is written to a dataset called augmented_for_training, which should be treated as temporary because the script overwrites it each time. NER training can then be performed as usual:

prodigy ner.batch-train augmented_for_training en_core_web_sm --eval-split 0.2 

Steps for mixing OntoNotes data into the Prodigy data and training with Prodigy.

First of all, if you don't have Prodigy installed locally, you need to install it and create a dataset (backed by SQLite by default) into which to import your Prodigy data.

Create the SQLite DB through Prodigy:

python3 -m prodigy dataset arabicner "train arabic ner"

Import the JSONL data that you exported from the Prodigy app:

python3 -m prodigy db-in arabicner single_arabic_ner.jsonl 

Rehearse your dataset with OntoNotes. The directory for the OntoNotes data is hard-coded in rehearsal.py, so you need to edit it there. (The 5 here means that 5 times as many OntoNotes records as Prodigy records will be mixed into the Prodigy data.)

python3 rehearsal.py "arabicner" 5

Finally, train on your data with the following command:

python3 -m prodigy ner.batch-train augmented_for_training /home/yan/arabicner/Arabic-NER/testmodel/model8 --eval-split 0.2

Remark:

If you want to explore the SQLite DB for Prodigy, go to your home directory and run sqlite3 .prodigy/prodigy.db. It has "dataset", "example", and "link" tables, and your data will be under the example table.
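The same poking around can be done from Python with the standard-library sqlite3 module (a sketch; the default DB path and the table names are the ones mentioned above):

import sqlite3
from pathlib import Path

conn = sqlite3.connect(str(Path.home() / ".prodigy" / "prodigy.db"))
# Should list the "dataset", "example" and "link" tables
print(conn.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall())
# Your annotations are stored as rows of the example table
print(conn.execute("SELECT COUNT(*) FROM example").fetchone())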

The pretrained vector model can be too big for the training process to handle, so we prune the huge vector table first and then use it:

python3 generatePruningVectorModel.py -l ar -v 0.0.0
(The version is needed for later training, since spaCy will look at that field.) The directory of the .vec file is hard-coded in the script.
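Under the hood, the pruning boils down to spaCy 2.x's Vocab.prune_vectors; here is a sketch of what the script roughly does (the paths, row count, and version string are placeholders, not the actual generatePruningVectorModel.py):

import spacy

# Start from a model that already carries the full vectors, e.g. one created with
# `spacy init-model ar /tmp/ar_vectors_wiki_lg --vectors-loc ...`
nlp = spacy.load("/tmp/ar_vectors_wiki_lg")
nlp.vocab.prune_vectors(20000)     # keep only the 20k most frequent vector rows
nlp.meta["version"] = "0.0.0"      # spacy train later reads this field
nlp.to_disk("ar_pruned_vectors_model")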

You will then get a language model with pruned pretrained vectors; use this model to train your NER model with this command:

python3 -m spacy train ar /home/yan/arabicNER/Arabic-NER/experiments/exp2/ar_output_all /home/yan/arabicNER/Arabic-NER/data/combined.json /home/yan/arabicNER/Arabic-NER/data/ar_eval_all.json --no-tagger --no-parser --vectors "/the_model_you_just_got"

After that you will get several NER models as output, say model0. If you run the training now with this command,

python3 -m prodigy ner.batch-train augmented_for_training /nermodel --eval-split 0.2

spaCy will throw an error: it does not like the /vocab defined in this NER model and raises an exception. What I did instead: the NER model contains a ner directory; copy this ner directory into the pruned-vectors language model, then update that model's meta.json, and point prodigy ner.batch-train at the language model (with the ner directory added and meta.json updated). It runs successfully this way.
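A sketch of that workaround (the directory names reuse the ones above; treating "pipeline" as the meta.json field to update is my assumption):

import json
import shutil
from pathlib import Path

ner_model = Path("/nermodel/model0")            # a model produced by spacy train
vec_model = Path("ar_pruned_vectors_model")     # the pruned-vectors language model

# Copy the trained ner component into the language model
shutil.copytree(ner_model / "ner", vec_model / "ner")

# Register the ner pipe in the language model's meta.json
meta_path = vec_model / "meta.json"
meta = json.loads(meta_path.read_text(encoding="utf8"))
meta.setdefault("pipeline", []).append("ner")
meta_path.write_text(json.dumps(meta, indent=2, ensure_ascii=False), encoding="utf8")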

arabic-ner's People

Contributors: ahalterman, yanliang1102


arabic-ner's Issues

Issue when using a customized pretrained vector.

yan@hanover:~/ou-spacy/spaCy$ python3 -m spacy init-model ar /tmp/ar_vectors_wiki_lg --vectors-loc ../cc.ar.300.bin.gz
Reading vectors from ../cc.ar.300.bin.gz
Open loc
Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/yan/ou-spacy/spaCy/spacy/main.py", line 31, in
plac.call(commands[command], sys.argv[1:])
File "/home/yan/.local/lib/python3.5/site-packages/plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "/home/yan/.local/lib/python3.5/site-packages/plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/yan/ou-spacy/spaCy/spacy/cli/init_model.py", line 49, in init_model
vectors_data, vector_keys = read_vectors(vectors_loc) if vectors_loc else (None, None)
File "/home/yan/ou-spacy/spaCy/spacy/cli/init_model.py", line 111, in read_vectors
shape = tuple(int(size) for size in next(f).split())
File "/home/yan/ou-spacy/spaCy/spacy/cli/init_model.py", line 64, in
return (line.decode('utf8') for line in gzip.open(str(loc), 'r'))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xba in position 0: invalid start byte

spaCy training performance with different configurations and setups

spaCy training output

Columns: 'dep_loss', 'tag_loss', 'uas', 'tags_acc', 'token_acc', 'ents_p', 'ents_r', 'ents_f', 'cpu_wps', 'gpu_wps' (the leading 29 in each row is presumably the iteration number).

eval_data = eval_data:
29, 0.000, 11.698, 0.000, 58.589, 49.871, 53.880, 91.894, 85.899, 15363.7, 0.0

eval_data = training_data (shows the model does learn):
29, 0.000, 13.261, 0.000, 81.946, 73.552, 77.523, 91.815, 85.866, 13092.9, 0.0

Based on these numbers I think the model does work; we just don't have enough data. We only have 401 tagged documents in the OntoNotes data, and all the entity tags come from those 401 documents.

Exception that spaCy throws during training; I made the code swallow the exception. (A sketch for spotting these malformed sequences up front follows the tag dump below.)

[E067] Invalid BILUO tag sequence: Got a tag starting with 'I' (inside an entity) without a preceding 'B' (beginning of an entity). Tag sequence:
['O', 'U-GPE', 'O', 'B-EVENT', 'I-EVENT', 'L-EVENT', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'L-EVENT', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'L-EVENT', 'B-ORG', 'L-ORG', 'U-GPE" S_OFF="1', 'U-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'B-EVENT', 'I-EVENT', 'L-EVENT', 'U-ORDINAL', 'B-DATE', 'I-DATE', 'L-DATE', 'O', 'O', 'B-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'B-FAC', 'L-FAC', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'B-PERSON', 'L-PERSON', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-NORP', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'U-DATE', 'O', 'U-GPE', 'O', 'U-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'O', 'O', 'O', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'L-EVENT', 'O', 'B-GPE', 'I-GPE', 'L-GPE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'U-GPE', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-FAC', 'L-FAC', 'O', 'U-GPE', 'O', 'O', 'B-TIME', 'I-TIME', 'L-TIME', 'U-DATE', 'O', 'B-FAC', 'I-FAC', 'I-FAC', 'L-FAC', 'O', 'O', 'O', 'B-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'I-EVENT', 'L-EVENT', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'U-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-ORDINAL', 'O', 'O', 'O', 'O', 'U-EVENT', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-ORDINAL" S_OFF="1', 'O', 'B-ORG', 'L-ORG', 'O', 'B-ORG', 'L-ORG', 'O', 'O', 'U-TIME" S_OFF="1', 'U-DATE', 'U-CARDINAL', 'B-FAC', 'I-FAC', 'I-FAC', 'I-FAC', 'I-FAC', 'L-FAC', 'U-GPE', 'O', 'B-ORG', 'L-ORG', 'U-CARDINAL', 'O', 'U-CARDINAL', 'O', 'B-CARDINAL', 'I-CARDINAL', 'L-CARDINAL', 'O', 'O', 'O', 'B-CARDINAL', 'L-CARDINAL', 'O', 'O', 'B-CARDINAL', 'L-CARDINAL', 'O', 'O', 'B-CARDINAL', 'L-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'B-PERSON', 'I-PERSON', 'L-PERSON', 'O', 'O', 'O', 'B-PERSON', 'L-PERSON', 'O', 'O', 'O', 'U-DATE', 'U-CARDINAL', 'O', 'O', 'B-TIME', 'L-TIME', 'U-ORG', 'O', 'U-ORG', 'B-FAC', 'I-FAC', 'I-FAC', 'L-FAC', 'O', 'U-TIME', 'U-ORG', 'O', 'U-ORG', 'B-FAC', 'I-FAC', 'I-FAC', 'L-FAC', 'O', 'U-TIME', 'U-ORG', 'O', 'U-FAC', 'O', 'O', 'O', 'O', 'U-TIME', 'U-ORG', 'O', 'B-ORG', 'L-ORG', 'O', 'U-FAC', 'O', 'O', 'U-TIME', 'U-ORG', 'O', 'U-ORG', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'B-ORG', 'L-ORG', 'O', 'U-GPE" S_OFF="1', 'O', 'O', 'O', 'O', 'O', 'B-TIME', 'I-TIME', 'I-TIME', 'L-TIME', 'B-DATE', 'I-DATE', 'L-DATE', 'O', 'O', 'B-FAC', 'L-FAC', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'O', 'O', 'U-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'U-ORDINAL', 'O', 'U-DATE', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'L-ORG', 'O', 'O', 'B-DATE', 'L-DATE', 'O', 
'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'U-CARDINAL', 'O', 'O', 'O', 'O', 'O', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'I-ORG', 'L-ORG', 'O', 'U-DATE']
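As an alternative to patching spaCy, sentences with malformed sequences could be filtered out before training. A minimal sketch (my own illustration, not the change made inside spaCy) that catches the specific I-without-B problem E067 complains about:

def biluo_is_valid(tags):
    # Reject sequences where an I-/L- tag continues an entity that was never opened.
    open_label = None
    for tag in tags:
        prefix, _, label = tag.partition("-")
        if prefix in ("I", "L") and open_label != label:
            return False
        if prefix == "B":
            open_label = label
        elif prefix == "L" or prefix in ("U", "O"):
            open_label = None
    return True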

Error when using spaCy to train NER on the merged NER-tagged data

Traceback (most recent call last):
File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"main", mod_spec)
File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/yan/spacyou/spaCy/spacy/main.py", line 31, in
plac.call(commands[command], sys.argv[1:])
File "/home/yan/.local/lib/python3.5/site-packages/plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "/home/yan/.local/lib/python3.5/site-packages/plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/yan/spacyou/spaCy/spacy/cli/train.py", line 118, in train
optimizer = nlp.begin_training(lambda: corpus.train_tuples, device=use_gpu)
File "/home/yan/spacyou/spaCy/spacy/language.py", line 479, in begin_training
**cfg)
File "nn_parser.pyx", line 834, in spacy.syntax.nn_parser.Parser.begin_training
File "ner.pyx", line 88, in spacy.syntax.ner.BiluoPushDown.get_actions
ValueError: need more than 1 value to unpack

Training Result

@khaledJabr @ahalterman

1. With the pretrained pruned vectors and the spaCy-trained NER model, then updating the model only with the Prodigy-labeled data (about 800 tokens), we get this (no merged NER classes yet):
[image]

2. With no pretrained model, everything else the same as case 1, we got this (no merged NER classes yet; so yes, the pretrained model does help):
[image]

3. Trained with only the LDC data in Prodigy, with the pretrained spaCy NER model, otherwise like case 1:
[image]

4. Prodigy data plus 23 times the Prodigy-sized rehearsal data, otherwise like case 3.
Since we have 18670 OntoNotes tokens and 801 Prodigy-labeled tokens, we use 23 as the multiplier in order to use all of the data (18670/801 is about 23).
Of the 18670 training samples, 4122 with empty spans were removed.
[image]

5. With merged NER classes, other conditions like case 4:
[image]

6. With Khaled's cleaned data, other conditions like case 5:
[image]

Merge NER tags in ANERCorp and LDC OntoNotes

@ahalterman @cegme
Andy and Dr. Grant, any ideas? There are weird tags like 'B-DATE" E_OFF="5':
'head': 0,
'id': 18,
'ner': 'B-DATE" E_OFF="5',
'orth': '2001',
'tag': ''},
{'dep': '',
'head': 0,
'id': 19,
'ner': 'I-DATE" E_OFF="5',
'orth': '-',
'tag': ''},
{'dep': '',
'head': 0,
'id': 20,
'ner': 'L-DATE" E_OFF="5',
'orth': '2002 ',
'tag': ''},

I have no idea what this "5" means; the data only has 3 elements in the range, so why is it 5? (My guess is that these are the S_OFF/E_OFF character-offset attributes from the OntoNotes .name annotation leaking into the tag string when an entity boundary falls inside a token; a sketch for stripping these suffixes follows the ANERCorp tag list below.)
Here are all the NER tags in LDC OntoNotes, with counts:

{'-': 4,
 'B-CARDINAL': 149,
 'B-CARDINAL" E_OFF="1': 3,
 'B-CARDINAL" E_OFF="3': 1,
 'B-DATE': 1186,
 'B-DATE" E_OFF="1': 38,
 'B-DATE" E_OFF="5': 3,
 'B-DATE" S_OFF="1': 1,
 'B-EVENT': 422,
 'B-FAC': 470,
 'B-FAC" S_OFF="1': 4,
 'B-GPE': 575,
 'B-GPE" E_OFF="4': 1,
 'B-GPE" S_OFF="1': 3,
 'B-LANGUAGE': 6,
 'B-LAW': 127,
 'B-LOC': 293,
 'B-LOC" S_OFF="1': 1,
 'B-MONEY': 233,
 'B-NORP': 61,
 'B-ORDINAL': 34,
 'B-ORG': 2848,
 'B-ORG" E_OFF="1': 1,
 'B-ORG" S_OFF="1': 13,
 'B-PERCENT': 129,
 'B-PERSON': 3869,
 'B-PERSON" E_OFF="2': 1,
 'B-PERSON" S_OFF="1': 46,
 'B-PRODUCT': 55,
 'B-QUANTITY': 222,
 'B-QUANTITY" E_OFF="2': 6,
 'B-QUANTITY" E_OFF="3': 1,
 'B-QUANTITY" S_OFF="2': 2,
 'B-TIME': 340,
 'B-TIME" E_OFF="1': 7,
 'B-WORK_OF_ART': 159,
 'I-CARDINAL': 136,
 'I-CARDINAL" E_OFF="1': 9,
 'I-CARDINAL" E_OFF="3': 1,
 'I-DATE': 860,
 'I-DATE" E_OFF="1': 14,
 'I-DATE" E_OFF="5': 3,
 'I-DATE" S_OFF="1': 1,
 'I-EVENT': 973,
 'I-FAC': 494,
 'I-GPE': 98,
 'I-LAW': 233,
 'I-LOC': 97,
 'I-MONEY': 177,
 'I-NORP': 8,
 'I-ORDINAL': 4,
 'I-ORG': 3290,
 'I-ORG" E_OFF="1': 1,
 'I-ORG" S_OFF="1': 3,
 'I-PERCENT': 156,
 'I-PERSON': 917,
 'I-PERSON" S_OFF="1': 1,
 'I-PRODUCT': 96,
 'I-QUANTITY': 73,
 'I-QUANTITY" E_OFF="2': 6,
 'I-QUANTITY" E_OFF="3': 1,
 'I-QUANTITY" S_OFF="2': 2,
 'I-TIME': 260,
 'I-WORK_OF_ART': 579,
 'L-CARDINAL': 148,
 'L-CARDINAL" E_OFF="1': 3,
 'L-CARDINAL" E_OFF="3': 1,
 'L-DATE': 1187,
 'L-DATE" E_OFF="1': 38,
 'L-DATE" E_OFF="5': 3,
 'L-DATE" S_OFF="1': 1,
 'L-EVENT': 420,
 'L-FAC': 462,
 'L-FAC" S_OFF="1': 4,
 'L-GPE': 573,
 'L-GPE" E_OFF="4': 1,
 'L-GPE" S_OFF="1': 3,
 'L-LANGUAGE': 6,
 'L-LAW': 127,
 'L-LOC': 289,
 'L-LOC" S_OFF="1': 1,
 'L-MONEY': 233,
 'L-NORP': 61,
 'L-ORDINAL': 34,
 'L-ORG': 2827,
 'L-ORG" E_OFF="1': 1,
 'L-ORG" S_OFF="1': 13,
 'L-PERCENT': 129,
 'L-PERSON': 3837,
 'L-PERSON" E_OFF="2': 1,
 'L-PERSON" S_OFF="1': 46,
 'L-PRODUCT': 55,
 'L-QUANTITY': 222,
 'L-QUANTITY" E_OFF="2': 6,
 'L-QUANTITY" E_OFF="3': 1,
 'L-QUANTITY" S_OFF="2': 2,
 'L-TIME': 339,
 'L-TIME" E_OFF="1': 7,
 'L-WORK_OF_ART': 159,
 'O': 225156,
 'U-CARDINAL': 670,
 'U-DATE': 1149,
 'U-DATE" E_OFF="1': 12,
 'U-DATE" S_OFF="1': 1,
 'U-EVENT': 33,
 'U-FAC': 42,
 'U-FAC" E_OFF="2': 1,
 'U-FAC" S_OFF="1': 1,
 'U-GPE': 3228,
 'U-GPE" E_OFF="1': 1,
 'U-GPE" E_OFF="2': 1,
 'U-GPE" S_OFF="1': 76,
 'U-GPE" S_OFF="1" E_OFF="1': 7,
 'U-LANGUAGE': 41,
 'U-LAW': 26,
 'U-LOC': 71,
 'U-LOC" S_OFF="1': 1,
 'U-MONEY': 4,
 'U-NORP': 3386,
 'U-NORP" E_OFF="2': 1,
 'U-NORP" S_OFF="1': 2,
 'U-NORP" S_OFF="2': 1,
 'U-ORDINAL': 843,
 'U-ORDINAL" E_OFF="1': 2,
 'U-ORDINAL" E_OFF="3': 1,
 'U-ORG': 1552,
 'U-ORG" E_OFF="1': 7,
 'U-ORG" E_OFF="2': 3,
 'U-ORG" S_OFF="1': 36,
 'U-ORG" S_OFF="1" E_OFF="1': 5,
 'U-PERCENT': 8,
 'U-PERSON': 1554,
 'U-PERSON" S_OFF="1': 85,
 'U-PERSON" S_OFF="1" E_OFF="1': 1,
 'U-PERSON" S_OFF="1" E_OFF="2': 1,
 'U-PRODUCT': 21,
 'U-PRODUCT" E_OFF="1': 1,
 'U-PRODUCT" S_OFF="1': 1,
 'U-QUANTITY': 121,
 'U-QUANTITY" E_OFF="1': 5,
 'U-TIME': 100,
 'U-TIME" E_OFF="1': 3,
 'U-WORK_OF_ART': 50}

And here are the NER tag classes in ANERCorp:

{'B-LOC': 542,
 'B-MISC': 336,
 'B-ORG': 1050,
 'B-PERS': 2098,
 'I-LOC': 55,
 'I-MISC': 220,
 'I-ORG': 336,
 'I-PERS': 747,
 'L-LOC': 542,
 'L-MISC': 336,
 'L-ORG': 1050,
 'L-PERS': 2098,
 'O': 133705,
 'U-LOC': 3894,
 'U-MISC': 782,
 'U-ORG': 973,
 'U-PERS': 1508,
 'U-ts': 1}
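As mentioned above, the offset suffixes can be stripped before training with something like this sketch (my own normalization, assuming the junk always starts at the first double quote):

def clean_biluo_tag(tag):
    # 'U-GPE" S_OFF="1'  ->  'U-GPE'; plain tags like 'O' or 'B-DATE' pass through
    return tag.split('"', 1)[0].rstrip()

assert clean_biluo_tag('B-DATE" E_OFF="5') == 'B-DATE'
assert clean_biluo_tag('U-GPE" S_OFF="1" E_OFF="1') == 'U-GPE'
assert clean_biluo_tag('O') == 'O'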

Remove dependency parse files from spaCy formatter

The old spaCy OntoNotes converter has code to import dependency parse info from OntoNotes 5.0, but my copy doesn't have any, nor does the documentation mention it. The .parse.dep files it uses might be the output of another tool.

Take out all the dependency parse code in the formatter so it just uses the data we have available.

Prodigy training cannot handle big pretrained vectors, so we need to prune the vectors

Two things to try:
1. Prune the vectors before using spaCy to train, and use the output vectors (the pruning outputs a language model; copy the vectors into the model). We get an error when using this model to train on the mixed-in data. --failed
2. Prune the vectors, then use spaCy to train with these vectors to get a model, and use this model to train on the mixed-in data again. --let's see.

Need to make an Arabic language model and Arabic NER work under spaCy

Things that need to be done:

Train Arabic Language Model

  1. we need stopwords
  2. infixes, prefixes, and suffixes
  3. maybe include the lemmatizer that Khaled has so far. --Khaled
  4. collect Arabic wiki articles and, together with our LexisNexis Arabic data, use gensim to train the word vectors needed to train the parser and tagger with spaCy (see the sketch after this list). --Yan
  5. implement the necessary classes and get the language model trained!
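For item 4, a minimal gensim sketch (the corpus path, vector size, and output name are placeholders; the size keyword assumes gensim 3.x):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

# One tokenized Arabic sentence per line (wiki + LexisNexis articles combined)
sentences = LineSentence("arabic_corpus_tokenized.txt")
model = Word2Vec(sentences, size=300, window=5, min_count=5, workers=4)
# word2vec text format, which `spacy init-model --vectors-loc` can read
model.wv.save_word2vec_format("ar_vectors.vec")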

Train Arabic NER Model

Using OntoNotes together with the Prodigy data we have, we should be able to get about 66K records of training data. We need to write a customized NER model for Arabic in spaCy and get it trained.

@ahalterman @khaledJabr

Arabic data issue and potential fixes

I had the chance to look at the training data we are using for this, and there are two main issues with the data:

  1. The training data includes diacritics. Diacritics are extra short vowels added to Arabic words to help with pronunciation and with differentiating the meanings of two or more otherwise matching words, and usually this is only needed at the lemma level. Diacritics are not used in modern Arabic writing. This includes our news sources and the data we collected from the coders using the Prodigy interface. I suspect that this might be one of the things hurting the NER model. One really important thing to look at here is whether the word embeddings we are using were trained on data with diacritics or not. I don't have a clear answer for how this has or could have affected our training, but my main intuition is that normalizing/standardizing our data as much as we can is always a good thing.

  2. Aside from the diacritics, I have noticed that most (if not all) of the tokens (the actual tokens, the ones stored as orth) have an extra space at the end, and a lot of them have weird extra characters. Here are some examples:

'orth': 'ال{ِسْتِسْلامُ '
'orth': '-مُعالَجَةِ '
'orth': '-{ِعْتِباراتِ- '

Although many of these have an NER label of O, I still think they are worth fixing. Here is how I would go about fixing both issues (there are other ways, but this is the first thing that comes to mind):

import re
import pyarabic.araby as araby

text = '-آمِلَةً '
no_diacritics = araby.strip_tashkeel(text)  # removes all diacritics
just_arabic_text = re.sub(r'\W+', '', no_diacritics)  # removes everything else but the word; assumes only one word in orth
just_arabic_text


output : 
آملة
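Applied to the whole training set, the same cleaning could be run over every token's orth field; a sketch assuming the spaCy JSON layout used above (note that punctuation-only tokens would end up empty and may need filtering):

import json
import re
import pyarabic.araby as araby

def clean_orth(text):
    return re.sub(r'\W+', '', araby.strip_tashkeel(text))

with open("combined.json", encoding="utf8") as f:
    docs = json.load(f)

for doc in docs:
    for para in doc["paragraphs"]:
        for sent in para["sentences"]:
            for token in sent["tokens"]:
                token["orth"] = clean_orth(token["orth"])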

One last thing: do we have a key or a table somewhere that lists the labels we are using in our big NER dataset (the combined one)?

ValueError: Can't read file: ar_test1/model0/accuracy.json

python -m spacy train ar ./ar_test1 ./data/train.json ./data/dev.json
Training pipeline: ['tagger', 'parser', 'ner']
Starting with blank model 'ar'
Counting training words (limit=0)

Itn Dep Loss NER Loss UAS NER P NER R NER F Tag % Token % CPU WPS GPU WPS


✔ Saved model to output directory
ar_test1/model-final

Traceback (most recent call last):
File "/home/zakaria/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 281, in train
scorer = nlp_loaded.evaluate(dev_docs, debug)
File "/home/zakaria/.local/lib/python3.6/site-packages/spacy/language.py", line 631, in evaluate
docs, golds = zip(*docs_golds)
ValueError: not enough values to unpack (expected 2, got 0)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/zakaria/anaconda3/envs/py/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/home/zakaria/anaconda3/envs/py/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/zakaria/.local/lib/python3.6/site-packages/spacy/main.py", line 35, in
plac.call(commands[command], sys.argv[1:])
File "/home/zakaria/.local/lib/python3.6/site-packages/plac_core.py", line 328, in call
cmd, result = parser.consume(arglist)
File "/home/zakaria/.local/lib/python3.6/site-packages/plac_core.py", line 207, in consume
return cmd, self.func(*(args + varargs + extraopts), **kwargs)
File "/home/zakaria/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 368, in train
best_model_path = _collate_best_model(meta, output_path, nlp.pipe_names)
File "/home/zakaria/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 425, in _collate_best_model
bests[component] = _find_best(output_path, component)
File "/home/zakaria/.local/lib/python3.6/site-packages/spacy/cli/train.py", line 444, in _find_best
accs = srsly.read_json(epoch_model / "accuracy.json")
File "/home/zakaria/.local/lib/python3.6/site-packages/srsly/_json_api.py", line 50, in read_json
file_path = force_path(location)
File "/home/zakaria/.local/lib/python3.6/site-packages/srsly/util.py", line 21, in force_path
raise ValueError("Can't read file: {}".format(location))
ValueError: Can't read file: ar_test1/model0/accuracy.json

Cleaning code

@YanLiang1102, can you post the code that produces combined_cleaned_removed (from exp 5)? Then @khaledJabr can take a look and we can make sure all the data's in the right/same format.

Add code for converting from OntoNotes to Prodigy-style

In the second phase, when we're using our Prodigy annotations, we'll need to mix in old OntoNotes annotations to keep spaCy from forgetting them. We should have code that can convert OntoNotes into the Prodigy span format so we can intermingle them.
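A sketch of what that conversion could look like, going from a BILUO-tagged token list to Prodigy's {"text", "spans"} task format (character offsets; the helper name is made up):

def biluo_to_prodigy(tokens, tags):
    # tokens: list of strings; tags: aligned BILUO tags such as ["U-GPE", "O", ...]
    text, spans, start, offset = "", [], None, 0
    for tok, tag in zip(tokens, tags):
        tok_start, tok_end = offset, offset + len(tok)
        prefix, _, label = tag.partition("-")
        if prefix == "U":
            spans.append({"start": tok_start, "end": tok_end, "label": label})
        elif prefix == "B":
            start = tok_start
        elif prefix == "L" and start is not None:
            spans.append({"start": start, "end": tok_end, "label": label})
            start = None
        text += tok + " "
        offset = tok_end + 1
    return {"text": text.rstrip(), "spans": spans}

# e.g. biluo_to_prodigy(["Obama", "visited", "Paris"], ["U-PERSON", "O", "U-GPE"])
# -> {"text": "Obama visited Paris",
#     "spans": [{"start": 0, "end": 5, "label": "PERSON"},
#               {"start": 14, "end": 19, "label": "GPE"}]}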

Question

Let's say I have a CSV file with tags; how can I reproduce the model (just the steps)?
