
aravec's Introduction

AraVec 3.0

Advancements in neural networks have led to developments in fields like computer vision, speech recognition and natural language processing (NLP). One of the most influential recent developments in NLP is the use of word embeddings, where words are represented as vectors in a continuous space, capturing many syntactic and semantic relations among them.

AraVec is a pre-trained distributed word representation (word embedding) open-source project which aims to provide the Arabic NLP research community with free-to-use and powerful word embedding models. The first version of AraVec provided six different word embedding models built on top of three different Arabic content domains: Tweets, World Wide Web pages, and Wikipedia Arabic articles. The paper cited below describes the resources used for building the models, the data cleaning techniques employed, the preprocessing steps carried out, and the details of the word embedding creation techniques.

The third version of AraVec provides 16 different word embedding models built on top of two different Arabic content domains: Tweets and Wikipedia Arabic articles. The major difference between this version and the previous ones is that we produced two different types of models: unigram and n-gram models. We utilized a set of statistical techniques to generate the most commonly used n-grams of each data domain (a rough sketch of one such technique is shown below the list). The two domains are:

  1. Twitter tweets
  2. Wikipedia Arabic articles

Together these comprise a total of more than 1,169,075,128 tokens.
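The exact statistical technique is not detailed in this README. Purely as an illustration, a minimal sketch of one common approach, counting frequent bigrams with nltk (which the code samples below already import) and joining them with "_" so they can be fed to word2vec as single tokens, could look like this (the toy sentences are made up):

# Minimal sketch (NOT necessarily the authors' exact pipeline):
# count the most frequent bigrams and rewrite them as "_"-joined tokens.
from collections import Counter
from nltk import ngrams

# toy tokenized corpus; in practice this would be the cleaned Tweets / Wikipedia text
sentences = [['ابو', 'تريكه', 'لاعب'], ['ابو', 'تريكه', 'نجم'], ['عمرو', 'دياب', 'مطرب']]

counts = Counter()
for sent in sentences:
    counts.update(ngrams(sent, 2))

for (w1, w2), freq in counts.most_common(3):
    print("_".join((w1, w2)), freq)   # e.g. ابو_تريكه 2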

Take a look at how the n-grams models are represented:

(figure: examples of how n-gram tokens are represented in the models)


Please view the results page for more queries.

Citation

Abu Bakr Soliman, Kareem Eisa, and Samhaa R. El-Beltagy, “AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP”, in proceedings of the 3rd International Conference on Arabic Computational Linguistics (ACLing 2017), Dubai, UAE, 2017.



How To Use

These models were built using the gensim Python library. Here is a simple example of loading and using one of the models; follow these steps:

  1. Install gensim >= 3.4 and nltk >= 3.2 using either pip or conda

pip install gensim nltk

conda install gensim nltk

  2. Extract the compressed model files to a directory [ e.g. Twitter-CBOW ]
  3. Keep the .npy files; you will load the file with no extension, as shown in the following code
  4. Run the Python code below to load and use the model


How to integrate AraVec with Spacy.io

NoteBook Codes
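The linked notebook walks through the full integration. As a rough, minimal sketch of the general idea only (not the notebook's exact code, and assuming spaCy is installed alongside gensim), the gensim vectors can be copied into a blank Arabic spaCy pipeline:

# Minimal sketch (assumption: spaCy installed, model already extracted).
# Copy every AraVec vector into a blank Arabic spaCy pipeline.
import gensim
import spacy

t_model = gensim.models.Word2Vec.load('models/full_uni_cbow_100_twitter.mdl')

nlp = spacy.blank('ar')
for word in t_model.wv.index2word:          # this loop can be slow for large vocabularies
    nlp.vocab.set_vector(word, t_model.wv[word])

nlp.to_disk('aravec_spacy_model')           # a reusable pipeline carrying the AraVec vectors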

Code Samples

# -*- coding: utf8 -*-
import gensim
import re
import numpy as np
from nltk import ngrams

from utilities import *  # clean_str() and calc_vec() used below come from utilities.py

# ============================   
# ====== N-Grams Models ======

t_model = gensim.models.Word2Vec.load('models/full_grams_cbow_100_twitter.mdl')

# python 3.X
token = clean_str(u'ابو تريكه').replace(" ", "_")
# python 2.7
# token = clean_str(u'ابو تريكه'.decode('utf8', errors='ignore')).replace(" ", "_")

if token in t_model.wv:
    most_similar = t_model.wv.most_similar(token, topn=10)
    for term, score in most_similar:
        term = clean_str(term).replace(" ", "_")
        if term != token:
            print(term, score)

# تريكه 0.752911388874054
# حسام_غالي 0.7516342401504517
# وائل_جمعه 0.7244222164154053
# وليد_سليمان 0.7177559733390808
# ...

# =========================================
# == Get the most similar tokens to a compound query
# most similar to 
# عمرو دياب + الخليج - مصر

pos_tokens = [clean_str(t.strip()).replace(" ", "_") for t in ['عمرو دياب', 'الخليج'] if t.strip() != ""]
neg_tokens = [clean_str(t.strip()).replace(" ", "_") for t in ['مصر'] if t.strip() != ""]

vec = calc_vec(pos_tokens=pos_tokens, neg_tokens=neg_tokens, n_model=t_model, dim=t_model.vector_size)

most_sims = t_model.wv.similar_by_vector(vec, topn=10)
for term, score in most_sims:
    if term not in pos_tokens+neg_tokens:
        print(term, score)

# راشد_الماجد 0.7094649076461792
# ماجد_المهندس 0.6979793906211853
# عبدالله_رويشد 0.6942606568336487
# ...

# ====================
# ====================




# ============================== 
# ====== Uni-Grams Models ======

t_model = gensim.models.Word2Vec.load('models/full_uni_cbow_100_twitter.mdl')

# python 3.X
token = clean_str(u'تونس')
# python 2.7
# token = clean_str('تونس'.decode('utf8', errors='ignore'))

most_similar = t_model.wv.most_similar(token, topn=10)
for term, score in most_similar:
    print(term, score)

# ليبيا 0.8864325284957886
# الجزائر 0.8783721327781677
# السودان 0.8573237061500549
# مصر 0.8277812600135803
# ...



# get a word vector
word_vector = t_model.wv[token]

Download

N-Grams Models

To see what can be retrieved from the n-grams models using some most-similar queries, please view the results page.

Model              | No. of Documents | Vocabulary Size | Vector Size | Download
Twitter-CBOW       | 66,900,000       | 1,476,715       | 300         | Download
Twitter-CBOW       | 66,900,000       | 1,476,715       | 100         | Download
Twitter-SkipGram   | 66,900,000       | 1,476,715       | 300         | Download
Twitter-SkipGram   | 66,900,000       | 1,476,715       | 100         | Download
Wikipedia-CBOW     | 1,800,000        | 662,109         | 300         | Download
Wikipedia-CBOW     | 1,800,000        | 662,109         | 100         | Download
Wikipedia-SkipGram | 1,800,000        | 662,109         | 300         | Download
Wikipedia-SkipGram | 1,800,000        | 662,109         | 100         | Download


Unigrams Models

Model              | No. of Documents | Vocabulary Size | Vector Size | Download
Twitter-CBOW       | 66,900,000       | 1,259,756       | 300         | Download
Twitter-CBOW       | 66,900,000       | 1,259,756       | 100         | Download
Twitter-SkipGram   | 66,900,000       | 1,259,756       | 300         | Download
Twitter-SkipGram   | 66,900,000       | 1,259,756       | 100         | Download
Wikipedia-CBOW     | 1,800,000        | 320,636         | 300         | Download
Wikipedia-CBOW     | 1,800,000        | 320,636         | 100         | Download
Wikipedia-SkipGram | 1,800,000        | 320,636         | 300         | Download
Wikipedia-SkipGram | 1,800,000        | 320,636         | 100         | Download



aravec's People

Contributors

aliabdelaal, bakrianoo


aravec's Issues

Error: 'utf-8' codec can't decode

Hello,
I am writing a paper on the effect of word embeddings on machine learning algorithms. My code is almost the same as this, with small differences: https://github.com/iamaziz/ar-embeddings/blob/master/asa.py
I have already used two types of word embeddings from other authors with the .bin extension.

However, when I use "full_grams_sg_300_twitter.mdl" I get this error:

UnicodeDecodeError Traceback (most recent call last)
in ()
5 dataset_path = "Sport_TrainingSet_1000_01.csv"
6 # run
----> 7 ArSentiment(embeddings_path, dataset_path, plot_roc=True)
8

in init(self, embeddings_file, dataset_file, plot_roc, split, detailed)
12 self.split = split
13
---> 14 self.embeddings, self.dimension = self.load_vectors(embeddings_file)
15
16 # read dataset

in load_vectors(model_name, binary)
66 """load the pre-trained embedding model"""
67 if binary:
---> 68 w2v_model = KeyedVectors.load_word2vec_format(model_name, binary=True)
69 else:
70 w2v_model = gensim.models.Word2Vec.load(model_name)

/anaconda3/lib/python3.7/site-packages/gensim/models/keyedvectors.py in load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
1436 return _load_word2vec_format(
1437 cls, fname, fvocab=fvocab, binary=binary, encoding=encoding, unicode_errors=unicode_errors,
-> 1438 limit=limit, datatype=datatype)
1439
1440 def get_keras_embedding(self, train_embeddings=False):

/anaconda3/lib/python3.7/site-packages/gensim/models/utils_any2vec.py in _load_word2vec_format(cls, fname, fvocab, binary, encoding, unicode_errors, limit, datatype)
170 logger.info("loading projection weights from %s", fname)
171 with utils.smart_open(fname) as fin:
--> 172 header = utils.to_unicode(fin.readline(), encoding=encoding)
173 vocab_size, vector_size = (int(x) for x in header.split()) # throws for invalid file format
174 if limit:

/anaconda3/lib/python3.7/site-packages/gensim/utils.py in any2unicode(text, encoding, errors)
353 if isinstance(text, unicode):
354 return text
--> 355 return unicode(text, encoding, errors=errors)
356
357

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 0: invalid start byte
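The error above typically means the .mdl file was passed to KeyedVectors.load_word2vec_format, which expects the word2vec text/binary vector format; the AraVec .mdl files are full gensim-pickled models (with their .npy arrays alongside) and should be loaded with Word2Vec.load instead. A minimal sketch, assuming gensim >= 3.4:

# Minimal sketch: load the .mdl as a full gensim model, then use its KeyedVectors.
import gensim

w2v_model = gensim.models.Word2Vec.load("full_grams_sg_300_twitter.mdl")
embeddings = w2v_model.wv          # KeyedVectors holding the word vectors
dimension = w2v_model.vector_size  # 300 for this model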

Request for the datasets

Hello bakrianoo,
Is the dataset you've collected available for academic purposes? I'm more interested in the Wikipedia articles.
Thanks 😁

Hello ,

Hello,
Nice effort, first of all. How did you train your unigram and n-gram models? I am trying a similar approach for my native language (Urdu), and I would appreciate it if you could share your code for producing similar results for Urdu.
Regards

Question

First of all, thank you so much for your effort. I need to ask how to convert your embeddings into a text file format like the GloVe embeddings available for English.

Thanks,
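One possible way to do this with gensim (a minimal sketch, assuming gensim >= 3.4; the output file names are only examples) is to export the model's KeyedVectors with save_word2vec_format. binary=False gives a plain-text file in the word2vec layout (GloVe-style lines plus a one-line "vocab_size dimension" header), while binary=True gives a .bin file:

# Minimal sketch: export the vectors to text (GloVe-like, plus a one-line
# "vocab_size dimension" header) or to a binary .bin file.
import gensim

t_model = gensim.models.Word2Vec.load("models/full_uni_cbow_100_twitter.mdl")
t_model.wv.save_word2vec_format("aravec_uni_100_twitter.txt", binary=False)
t_model.wv.save_word2vec_format("aravec_uni_100_twitter.bin", binary=True)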

What is the difference between the two numpy arrays?

I downloaded "tweets_cbow_300" and found that there are two numpy arrays; both are in float format and have the same shape, but they do not have the same values. What is the difference between them?

Translate augmentation is not working, giving 404 error

I am trying the translation augmentation, but for some reason it's not working:

src = "ar"
to = "en" 
from textaugment import Translate
t = Translate(src="ar", to="en")
t.augment('من اول يوم كلت اسحبوا الثقة من اهل العمايم و الدين لانها عميلة لنفسها لانهم يريدون الناس خدام و عبيد الهم')

I also tried different languages, as in the example provided in the README file; it gives the following error:

---------------------------------------------------------------------------
HTTPError                                 Traceback (most recent call last)
<ipython-input-13-899c5e83912f> in <module>()
      3 from textaugment import Translate
      4 t = Translate(src="ar", to="en")
----> 5 t.augment('من اول يوم كلت اسحبوا الثقة من اهل العمايم و الدين لانها عميلة لنفسها لانهم يريدون الناس خدام و عبيد الهم')

9 frames
/usr/lib/python3.7/urllib/request.py in http_error_default(self, req, fp, code, msg, hdrs)
    647 class HTTPDefaultErrorHandler(BaseHandler):
    648     def http_error_default(self, req, fp, code, msg, hdrs):
--> 649         raise HTTPError(req.full_url, code, msg, hdrs, fp)
    650 
    651 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 404: Not Found

Arabic Sentence Embedding (AraSIF)

Hi,
In case someone is interested in computing sentence embeddings for Arabic text, we have a recent WANLP paper titled "Incremental Domain Adaptation for Neural Machine Translation in Low-Resource Settings", where we leveraged AraVec (i.e., a pre-trained model from this repo) and the SIF (Smooth Inverse Frequency) approach for computing 300-dimensional sentence embeddings.

The code for the whole procedure is publicly available on GitHub @ DFKI-Interactive-Machine-Learning/AraSIF.
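For orientation only, here is a minimal sketch of the general SIF idea applied on top of AraVec unigram vectors. This is NOT the AraSIF repository's code; the model path, toy sentences, and smoothing constant are assumptions. Each word vector is weighted by a / (a + p(w)), the weighted vectors are averaged per sentence, and the projection on the first principal component is removed:

# Minimal sketch of SIF sentence embeddings over AraVec (NOT the AraSIF code).
import numpy as np
import gensim
from sklearn.decomposition import TruncatedSVD

t_model = gensim.models.Word2Vec.load('models/full_uni_cbow_100_twitter.mdl')
a = 1e-3  # SIF smoothing constant (typical value from the SIF paper)
total_count = sum(v.count for v in t_model.wv.vocab.values())

def sif_vector(tokens):
    # weighted average of the word vectors, weight = a / (a + p(w))
    vecs = []
    for w in tokens:
        if w in t_model.wv:
            p_w = t_model.wv.vocab[w].count / total_count
            vecs.append((a / (a + p_w)) * t_model.wv[w])
    return np.mean(vecs, axis=0) if vecs else np.zeros(t_model.vector_size)

sentences = ['الجو جميل اليوم', 'عمرو دياب مطرب مشهور']
X = np.vstack([sif_vector(s.split()) for s in sentences])

# remove the projection on the first principal component
pc = TruncatedSVD(n_components=1).fit(X).components_[0]
X_sif = X - X @ np.outer(pc, pc)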

How to convert your embeddings to .bin format

First, thank you for your effort. I am trying to use your word embedding file with machine learning algorithms. However, I got an error, as I mentioned in issue #6, and when I tried to convert it to .txt I also got an error, as I mentioned in issue #5. I think converting your word embedding file from .mdl to .bin would solve the problem. I am writing a paper about the effect of existing Arabic word embeddings on different algorithms, and I hope to include your word embeddings in my study cases.
Thanks

License

I cannot find a license file for this project. Could you please include one in the project root?

Download links broken

I get the following error:

<Error>
<Code>UserSuspended</Code>
<BucketName>bakrianoo</BucketName>
<RequestId>tx000000000000057c105c6-005f16fefe-95f8c6-sfo2a</RequestId>
<HostId>95f8c6-sfo2a-sfo</HostId>
</Error>

Gensim version

Could you please state the required version of gensim? Every time I run the code I get the following error:

AttributeError: 'module' object has no attribute 'call_on_class_only'
