bakrianoo / aravec

AraVec is a pre-trained distributed word representation (word embedding) open source project which aims to provide the Arabic NLP research community with free to use and powerful word embedding models.

Python 15.14% Jupyter Notebook 84.86%

embedded-models nlp gensim arabic text-mining word2vec

aravec's Issues

Hello ,

Hello,
Nice effort, first of all.
How did you train your models on unigrams and n-grams? I am trying a similar approach for my native language (Urdu) and would appreciate it if you could share your code so I can produce similar results for Urdu.
Regards

Download links are not working

None of the download links work. Each one keeps loading for a while, then responds with "Gateway Timeout".

Gensim version

Could you please state which version of Gensim is required? Every time I run the code I get the following error:

AttributeError: 'module' object has no attribute 'call_on_class_only'
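When reporting an error like this, it helps to include the installed library version so it can be compared with whatever the maintainers used. The following is a minimal sketch using only the Python standard library (Python 3.8+). The helper name `installed_version` is illustrative, and the snippet does not say which Gensim version AraVec needs, since the thread does not state it.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Prints e.g. "gensim: <version>" or "gensim: None" if it is not installed.
    print("gensim:", installed_version("gensim"))
```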

Arabic Sentence Embedding (AraSIF)

Hi,
In case someone is interested in computing sentence embeddings for Arabic text: we have a recent WANLP paper titled "Incremental Domain Adaptation for Neural Machine Translation in Low-Resource Settings", where we leveraged AraVec (i.e. a pre-trained model from this repo) and the SIF (Smooth Inverse Frequency) approach to compute 300-dimensional sentence embeddings.

The code for the whole procedure is publicly available on GitHub @ DFKI-Interactive-Machine-Learning/AraSIF.

What is the difference between the numpy arrays?

I downloaded "tweets_cbow_300" and found that it contains two numpy arrays. Both are in float format and have the same shape, but they don't hold the same values. What is the difference between them?

how to convert your embeddings to be in .bin format

First, thank you for your effort.
I am trying to use your word embedding file with machine learning algorithms.
However, I got the error I mentioned in issue #6, and when I tried to convert the file to .txt I got the error I mentioned in issue #5.
I think converting your word embedding file from .mdl to .bin would solve the problem.
I am writing a paper about the effect of the existing Arabic word embeddings on different algorithms.
I hope to include your word embeddings among my case studies.
Thanks
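One possible way to do the conversion asked about here, sketched under assumptions: it presumes the .mdl file was written by Gensim's native `save()` (so that `Word2Vec.load` can read it) and that any companion .npy files sit in the same directory. The function names `export_word2vec_binary` and `convert_mdl_to_bin` are illustrative, not part of AraVec.

```python
def export_word2vec_binary(model, out_path: str) -> None:
    """Write the model's word vectors in the word2vec binary (.bin) format.

    `model` is assumed to be a gensim Word2Vec object exposing `model.wv`.
    """
    model.wv.save_word2vec_format(out_path, binary=True)


def convert_mdl_to_bin(mdl_path: str, bin_path: str) -> None:
    """Load a gensim-native model file and export its vectors as .bin."""
    # Imported lazily so the helper above can be used without gensim installed.
    from gensim.models import Word2Vec

    model = Word2Vec.load(mdl_path)
    export_word2vec_binary(model, bin_path)
```

The resulting .bin file should then be readable with `KeyedVectors.load_word2vec_format(bin_path, binary=True)`, which is the call that code written for other .bin embeddings typically uses.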

Errors: 'utf-8' codec can't decode

Hello,
I am writing a paper on the effect of word embeddings on machine learning algorithms.
My code is almost the same as this, with small differences: https://github.com/iamaziz/ar-embeddings/blob/master/asa.py
I have already used two sets of word embeddings from other authors, both with the .bin extension.

However, when I use "full_grams_sg_300_twitter.mdl" I get this error:

UnicodeDecodeError Traceback (most recent call last)
in ()
5 dataset_path = "Sport_TrainingSet_1000_01.csv"
6 # run
----> 7 ArSentiment(embeddings_path, dataset_path, plot_roc=True)
8

in __init__(self, embeddings_file, dataset_file, plot_roc, split, detailed)
12 self.split = split
13
---> 14 self.embeddings, self.dimension = self.load_vectors(embeddings_file)
15
16 # read dataset

in load_vectors(model_name, binary)
66 """load the pre-trained embedding model"""
67 if binary:
---> 68 w2v_model = KeyedVectors.load_word2vec_format(model_name, binary=True)
69 else:
70 w2v_model = gensim.models.Word2Vec.load(model_name)

/anaconda3/lib/python3.7/site-packages/gensim/models/keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
1436 return _load_word2vec_format(
1437 cls, fname, fvocab=fvocab, binary=binary, encoding=encoding, unicode_errors=unicode_errors,
-> 1438 limit=limit, datatype=datatype)
1439
1440 def get_keras_embedding(self, train_embeddings=False):

/anaconda3/lib/python3.7/site-packages/gensim/models/utils_any2vec.py in _load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
170 logger.info("loading projection weights from %s", fname)
171 with utils.smart_open(fname) as fin:
--> 172 header = utils.to_unicode(fin.readline(), encoding=encoding)
173 vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format
174 if limit:

/anaconda3/lib/python3.7/site-packages/gensim/utils.py in any2unicode(text, encoding, errors)
353 if isinstance(text, unicode):
354 return text
--> 355 return unicode(text, encoding, errors=errors)
356
357

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
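A note on the traceback above: byte 0x80 at position 0 is how a Python pickle stream (protocol 2 or later) begins, whereas `load_word2vec_format` expects a first line of the form `<vocab_size> <vector_size>`. That suggests, though the thread does not confirm it, that the .mdl file is a Gensim-native save and belongs in the `Word2Vec.load` branch (`binary=False` in the code shown) rather than the `load_word2vec_format` branch. The sketch below checks a file's first bytes to tell the two apart; both helper names are illustrative and the checks are only heuristics.

```python
def looks_like_pickle(path: str) -> bool:
    """Heuristic: pickle protocol 2+ streams start with the byte 0x80."""
    with open(path, "rb") as fh:
        return fh.read(1) == b"\x80"


def looks_like_word2vec_header(path: str) -> bool:
    """Heuristic: word2vec-format files start with two integers on one line."""
    with open(path, "rb") as fh:
        first_line = fh.readline(64)
    parts = first_line.split()
    return len(parts) == 2 and all(p.isdigit() for p in parts)
```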

Translate augmentation is not working, giving 404 error

I am trying the translation augmentation, but for some reason it's not working:

src = "ar"
to = "en" 
from textaugment import Translate
t = Translate(src="ar", to="en")
t.augment('من اول يوم كلت اسحبوا الثقة من اهل العمايم و الدين لانها عميلة لنفسها لانهم يريدون الناس خدام و عبيد الهم')

I also tried different languages, as in the example provided in the README file, and it gives the following error:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-13-899c5e83912f> in <module>()
      3 from textaugment import Translate
      4 t = Translate(src="ar", to="en")
----> 5 t.augment('من اول يوم كلت اسحبوا الثقة من اهل العمايم و الدين لانها عميلة لنفسها لانهم يريدون الناس خدام و عبيد الهم')

9 frames
/usr/lib/python3.7/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 404: Not Found

Download links broken

I get the following error:

<Error>
<Code>UserSuspended</Code>
<BucketName>bakrianoo</BucketName>
<RequestId>tx000000000000057c105c6-005f16fefe-95f8c6-sfo2a</RequestId>
<HostId>95f8c6-sfo2a-sfo</HostId>
</Error>

I couldn't use tweets_sg_300

I get this error:
ValueError: cannot reshape array of size 14155744 into shape (331679,300)
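The numbers in this error can be checked directly: a (331679, 300) array needs 331679 × 300 values, far more than the 14155744 that were read. The sketch below just does that arithmetic; the function name `reshape_report` is illustrative. One plausible cause is an incomplete download or a mismatched companion .npy file, but that is an assumption the thread does not confirm.

```python
def reshape_report(actual_elements: int, rows: int, dims: int) -> dict:
    """Compare the number of values read with what the target shape requires."""
    expected = rows * dims
    return {
        "expected_elements": expected,
        "actual_elements": actual_elements,
        "whole_rows_present": actual_elements // dims,
        "leftover_elements": actual_elements % dims,
        "fraction_present": actual_elements / expected,
    }


# Values taken from the error message above.
report = reshape_report(14155744, rows=331679, dims=300)
for key, value in report.items():
    print(f"{key}: {value}")
```

The leftover count is non-zero, so the data read does not even end on a row boundary, which is consistent with a file that was cut short.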

License

I cannot find a license file for this project.
Could you please include one in the project root?

Usage: python -m spacy [OPTIONS] COMMAND [ARGS]... Try 'python -m spacy --help' for help. Error: No such command 'init-model'.

When I run: !python -m spacy init-model ar spacy.aravec.model --vectors-loc ./spacyModel/aravec.txt.gz
I get the following error:
Usage: python -m spacy [OPTIONS] COMMAND [ARGS]...
Try 'python -m spacy --help' for help.

Error: No such command 'init-model'.
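A likely explanation, assuming the installed spaCy is version 3 or later: the `init-model` command from spaCy 2 was replaced there by `init vectors`, which takes the language, the vectors file, and the output directory as positional arguments. The sketch below builds the command line for either major version; the helper name `spacy_vectors_command` is illustrative, and the paths are the ones from the report above.

```python
from typing import List


def spacy_vectors_command(
    spacy_major: int, lang: str, vectors_loc: str, output_dir: str
) -> List[str]:
    """Build the CLI invocation that converts word vectors into a spaCy model."""
    if spacy_major >= 3:
        # spaCy 3 form: python -m spacy init vectors <lang> <vectors_loc> <output_dir>
        return ["python", "-m", "spacy", "init", "vectors", lang, vectors_loc, output_dir]
    # spaCy 2 form, as used in the original command.
    return [
        "python", "-m", "spacy", "init-model", lang, output_dir,
        "--vectors-loc", vectors_loc,
    ]


if __name__ == "__main__":
    cmd = spacy_vectors_command(3, "ar", "./spacyModel/aravec.txt.gz", "spacy.aravec.model")
    print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run`, or joined into a single line for a notebook cell.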

Request for the datasets

Download links are not working

Question
