dutchembeddings

Repository for the word embeddings described in Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource, presented at LREC 2016.

All embeddings are released under the CC-BY-SA-4.0 license.

The software is released under the GNU GPL 2.0.

Embeddings

To download the embeddings, please click any of the links in the following table. In almost all cases, the 320-dimensional embeddings outperform the 160-dimensional embeddings.

Corpus      160 dimensions   320 dimensions
Roularta    link             link
Wikipedia   link             link
Sonar500    link             link
Combined    link             link
COW         -                small, big

The embeddings are currently provided as .txt files containing vectors in word2vec format, which is structured as follows:

The first line contains the vocabulary size and the size of the vectors, separated by a space.

Ex: 50000 320

Each line thereafter contains the vector data for a single word, presented as a string delimited by spaces. The first item on each line is the word itself; the n items that follow are numbers representing the vector of length n. Because every item is read as a string, the numbers should be converted to floating point values.

Ex: hond 0.2 -0.542 0.253 etc.
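As an illustration of the format described above, here is a minimal sketch of a reader written in plain Python. The function name `read_word2vec_text` and the two-word sample are made up for this example and are not part of the repository. The sketch does not depend on the order of the two header numbers: it only checks that they match the number of words and the vector length it actually found.

```python
import io


def read_word2vec_text(handle):
    """Read word2vec text format from an open text handle into {word: [floats]}."""
    # The first line holds two integers: vocabulary size and vector size.
    header = [int(item) for item in handle.readline().split()]
    if len(header) != 2:
        raise ValueError("expected two integers on the first line")

    vectors = {}
    for line in handle:
        items = line.split()
        if not items:
            continue  # skip blank lines
        word, numbers = items[0], items[1:]
        # Items are read as strings, so convert them to floats.
        vectors[word] = [float(number) for number in numbers]

    lengths = {len(vector) for vector in vectors.values()}
    if len(lengths) != 1:
        raise ValueError("vectors do not all have the same length")

    # Compare the header with what was read, ignoring the order of the two numbers.
    if sorted(header) != sorted([len(vectors), lengths.pop()]):
        raise ValueError("header does not match the file contents")
    return vectors


# Hypothetical two-word sample standing in for a real embeddings file.
sample = io.StringIO("2 3\nhond 0.2 -0.542 0.253\nkat 0.1 0.3 -0.25\n")
vectors = read_word2vec_text(sample)
print(vectors["hond"])
```

For a real file, the same function could be called on `open("path/to/vector", encoding="utf-8")`; the encoding is an assumption and may need adjusting. For the full embeddings, one of the libraries mentioned in this README will be considerably faster and more memory-efficient.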

If you use Python, these files can be loaded with gensim or reach, as follows.

# Gensim
# Recent gensim versions expose load_word2vec_format on KeyedVectors.
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format("path/to/vector", binary=False)
katvec = model['kat']
model.most_similar('kat')

# Reach
from reach import Reach

r = Reach("path/to/vector", header=True)
katvec = r['kat']
r.most_similar('kat')
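To give an idea of what a `most_similar` query computes, the sketch below ranks words by cosine similarity in plain Python. It is an illustration only, not the implementation used by gensim or reach, and the two-dimensional toy vectors are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # avoid dividing by zero for all-zero vectors
    return dot / (norm_u * norm_v)


def most_similar(vectors, word, topn=3):
    """Return up to topn (word, similarity) pairs, most similar first."""
    query = vectors[word]
    scores = [
        (other, cosine(query, vector))
        for other, vector in vectors.items()
        if other != word
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented toy vectors; real embeddings have 160 or 320 dimensions.
toy = {
    "kat": [1.0, 0.0],
    "hond": [0.8, 0.6],
    "fiets": [0.0, 1.0],
}
neighbours = most_similar(toy, "kat")
print(neighbours)
```

In this toy example "hond" ranks above "fiets" because its vector points in a direction closer to that of "kat".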

Relationship dataset

If you want to test the quality of your embeddings, you can use the relation.py script. This script takes a .txt file of predicates and creates a dataset that is used for evaluation.

This currently only works with the gensim word2vec models or the SPPMI model, as defined above.

Example:

from gensim.models import KeyedVectors
from relation import Relation

# Load the predicates.
rel = Relation("data/question-words.txt")

# Load a word2vec model (recent gensim versions use KeyedVectors for this).
model = KeyedVectors.load_word2vec_format("path/to/model")

# Test the model.
rel.test_model(model)

Citing

If you use any of the resources from this paper, please cite our paper, as follows:

@InProceedings{tulkens2016evaluating,
  author = {Stephan Tulkens and Chris Emmery and Walter Daelemans},
  title = {Evaluating Unsupervised Dutch Word Embeddings as a Linguistic Resource},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  language = {english}
 }

Please also consider citing the corpora of the embeddings you use. Without the people who made the corpora, the embeddings could never have been created.
