AraVec 2.0

Advancements in neural networks have led to developments in fields like computer vision, speech recognition and natural language processing (NLP). One of the most influential recent developments in NLP is the use of word embeddings, where words are represented as vectors in a continuous space, capturing many syntactic and semantic relations among them.
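As a rough illustration of what "vectors in a continuous space" means in practice, related words tend to have vectors that point in similar directions, which is commonly measured with cosine similarity. The sketch below uses three hypothetical 3-dimensional vectors invented for this example; they are not taken from any AraVec model.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy vectors, invented for illustration only.
toy_vectors = {
    "cairo": [0.9, 0.1, 0.3],
    "alexandria": [0.8, 0.2, 0.35],
    "banana": [0.1, 0.9, 0.0],
}

sim_cities = cosine_similarity(toy_vectors["cairo"], toy_vectors["alexandria"])
sim_unrelated = cosine_similarity(toy_vectors["cairo"], toy_vectors["banana"])

# The two city vectors point in nearly the same direction, so their
# similarity is higher than that of the unrelated pair.
print(sim_cities > sim_unrelated)
```

Real embedding models work the same way, only with 100 or 300 dimensions learned from text rather than hand-picked values.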

AraVec is a pre-trained distributed word representation (word embedding) open-source project that aims to provide the Arabic NLP research community with free-to-use, powerful word embedding models. The first version of AraVec provides six different word embedding models built on top of three different Arabic content domains: Tweets, World Wide Web pages and Wikipedia Arabic articles. The total number of tokens used to build the models amounts to more than 3,300,000,000. The accompanying paper describes the resources used for building the models, the data cleaning techniques employed, the preprocessing step carried out, and the details of the word embedding creation techniques.

The second version of AraVec provides twelve different word embedding models built on top of the same three Arabic content domains: Tweets, World Wide Web pages and Wikipedia Arabic articles. The difference between this version and the first is that the hyper-parameter for minimum count was reduced to 50 instead of 500 for the Tweets dataset, 200 for the World Wide Web pages dataset and 5 for the Wikipedia articles dataset. This resulted in models that have more coverage in terms of vocabulary. The other change is that we produced a set of six embedding models that have a dimension of 100.

The three content domains are:

  1. Twitter tweets
  2. World Wide Web pages
  3. Wikipedia Arabic articles

Together they contain more than 3,300,000,000 tokens.
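To see why lowering the minimum-count hyper-parameter increases vocabulary coverage, here is a minimal sketch. It assumes the usual meaning of a minimum count: words seen fewer times than the threshold are dropped. The function name `build_vocabulary` and the tiny corpus are hypothetical and are not part of AraVec.

```python
from collections import Counter


def build_vocabulary(tokenized_docs, min_count):
    """Keep only the words that occur at least `min_count` times in the corpus."""
    counts = Counter(token for doc in tokenized_docs for token in doc)
    return {word for word, freq in counts.items() if freq >= min_count}


# Tiny invented corpus, for illustration only.
corpus = [
    ["cairo", "is", "a", "city"],
    ["alexandria", "is", "a", "city"],
    ["cairo", "is", "large"],
]

strict_vocab = build_vocabulary(corpus, min_count=3)  # high threshold
loose_vocab = build_vocabulary(corpus, min_count=1)   # low threshold

# A lower threshold keeps rarer words, so the vocabulary grows.
print(len(strict_vocab), len(loose_vocab))
```

The trade-off is that rare words have fewer training examples, so their vectors may be less reliable than those of frequent words.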

Citation

Abu Bakr Soliman, Kareem Eisa, and Samhaa R. El-Beltagy, “AraVec: A set of Arabic Word Embedding Models for use in Arabic NLP”, in proceedings of the 3rd International Conference on Arabic Computational Linguistics (ACLing 2017), Dubai, UAE, 2017.

Download

| Model | No. of documents | Vocabulary size | Dimensions | Download | Mirror-1 |
|---|---|---|---|---|---|
| Twitter-CBOW | 66,900,000 | 331,679 | 300 | Download | Download |
| Twitter-Skipgram | 66,900,000 | 331,679 | 300 | Download | Download |
| Twitter-CBOW | 66,900,000 | 331,679 | 100 | Download | Download |
| Twitter-Skipgram | 66,900,000 | 331,679 | 100 | Download | Download |
| Wikipedia-CBOW | 1,800,000 | 162,516 | 300 | Download | Download |
| Wikipedia-Skipgram | 1,800,000 | 162,516 | 300 | Download | Download |
| Wikipedia-CBOW | 1,800,000 | 162,516 | 100 | Download | Download |
| Wikipedia-Skipgram | 1,800,000 | 162,516 | 100 | Download | Download |
| Web-CBOW | 132,750,000 | 234,961 | 300 | Download | Download |
| Web-Skipgram | 132,750,000 | 234,961 | 300 | Download | Download |
| Web-CBOW | 132,750,000 | 234,961 | 100 | Download | Download |
| Web-Skipgram | 132,750,000 | 234,961 | 100 | Download | Download |
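
When choosing between the 300- and 100-dimensional variants, a back-of-the-envelope memory estimate may help. The sketch below multiplies vocabulary size by dimension and assumes 4 bytes per value (32-bit floats). That storage size is an assumption, and the estimate covers only the word-vector matrix; the saved model files may hold additional arrays. The function name `vector_matrix_bytes` is hypothetical.

```python
def vector_matrix_bytes(vocab_size, dimensions, bytes_per_value=4):
    """Approximate size of a vocab_size x dimensions matrix of word vectors."""
    return vocab_size * dimensions * bytes_per_value


# Vocabulary sizes copied from the table above.
vocab_sizes = {"Twitter": 331679, "Wikipedia": 162516, "Web": 234961}

for name, vocab in vocab_sizes.items():
    for dim in (300, 100):
        mib = vector_matrix_bytes(vocab, dim) / (1024 ** 2)
        print(f"{name:<10} dim={dim:<4} ~{mib:.0f} MiB")
```

Under these assumptions a 100-dimensional model needs about a third of the memory of its 300-dimensional counterpart.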

How to use

These models were built using the gensim Python library. The following steps show how to load and use one of the models:

  1. Install gensim using either pip or conda

pip install gensim

conda install gensim

  2. Extract the compressed model files to a directory [ e.g. Twittert-CBOW ]
  3. Keep the .npy files in that directory. You will load the file that has no extension, as shown in the following code.
  4. Run this Python code to load and use the model

# -*- coding: utf8 -*-
import gensim
import re

# load the model
model = gensim.models.Word2Vec.load('Twittert-CBOW/tweets_cbow_300')

# Clean/Normalize Arabic Text
def clean_str(text):
    search = ["أ","إ","آ","ة","_","-","/",".","،"," و "," يا ",'"',"ـ","'","ى","\\",'\n', '\t','"','?','؟','!']
    replace = ["ا","ا","ا","ه"," "," ","","",""," و"," يا","","","","ي","",' ', ' ',' ',' ? ',' ؟ ',' ! ']
    
    # remove tashkeel (diacritics)
    p_tashkeel = re.compile(r'[\u0617-\u061A\u064B-\u0652]')
    text = re.sub(p_tashkeel,"", text)

    # remove elongation: collapse any run of a repeated character to two occurrences
    p_longation = re.compile(r'(.)\1+')
    subst = r"\1\1"
    text = re.sub(p_longation, subst, text)
    
    text = text.replace('وو', 'و')
    text = text.replace('يي', 'ي')
    text = text.replace('اا', 'ا')
    
    for i in range(0, len(search)):
        text = text.replace(search[i], replace[i])
    
    # trim leading and trailing whitespace
    text = text.strip()

    return text

# python 3.X
word = clean_str(u'القاهرة')
# python 2.7
# word = clean_str('القاهرة'.decode('utf8', errors='ignore'))

# find and print the most similar terms to a word
most_similar = model.wv.most_similar( word )
for term, score in most_similar:
    print(term, score)

# get a word vector
word_vector = model.wv[ word ]
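
The lookup `model.wv[ word ]` above raises a `KeyError` in gensim when the word is not in the model's vocabulary, which can happen with rare words or with text that was not normalized by `clean_str` first. A minimal sketch of a guarded lookup follows. The helper name `lookup_vector` is hypothetical, and a plain dict stands in for `model.wv` so the example runs without a downloaded model; it assumes the vectors object supports the `in` operator, as a dict does.

```python
def lookup_vector(vectors, word):
    """Return the vector for `word`, or None when the word is not in the vocabulary."""
    if word in vectors:
        return vectors[word]
    return None


# Stand-in for model.wv: a dict with one invented 3-dimensional vector.
toy_vectors = {"القاهرة": [0.1, 0.2, 0.3]}

known = lookup_vector(toy_vectors, "القاهرة")
unknown = lookup_vector(toy_vectors, "كلمة_غير_موجودة")

print(known)
print(unknown)
```

With a loaded model, the call would be `lookup_vector(model.wv, word)`, and the caller can decide how to handle a `None` result, for example by skipping the word.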
