Embeddings - subword2vec

subword2vec is the code repository for training word embeddings enriched with sub-word knowledge such as character n-grams, lemmas, morphological tags, and phonemes. This library was used in our paper Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations (EMNLP 2018). subword2vec is based on the fastText and prop2vec libraries.

Requirements

  • gcc-4.6.3 or newer (for compiling)

Input format

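Each word is represented together with its sub-word information, as a sequence of prefixed fields joined by ~. For instance, the Hindi word नदी (river) is represented as w:नदी~l:नद~m:N+Fem+Dir+Sg~ipa:nədiː, where w: is the word form, l: the lemma, m: the morphological tags, and ipa: the phonemic transcription. An example input file is provided in the example/ directory.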
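As a rough illustration of the token format above, the following Python sketch splits one token into its fields and joins them back. The helper names parse_token and format_token are hypothetical and are not part of subword2vec. Splitting the m: field on + is an assumption based on the example token, not something the library documents here.

```python
def parse_token(token):
    """Split 'w:...~l:...~m:...~ipa:...' into a dict keyed by field prefix."""
    fields = {}
    for part in token.split("~"):
        # Split on the first ':' only, so the value is kept intact.
        key, _, value = part.partition(":")
        fields[key] = value
    return fields


def format_token(fields):
    """Join a dict of fields back into the '~'-separated token format."""
    return "~".join(f"{key}:{value}" for key, value in fields.items())


example = "w:नदी~l:नद~m:N+Fem+Dir+Sg~ipa:nədiː"
parsed = parse_token(example)

# Assumption: individual morphological tags are separated by '+'.
morph_tags = parsed["m"].split("+")

print(parsed)
print(morph_tags)
```

A helper like format_token could be used when generating an input file from an existing morphological analyzer's output, one token per word.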

Command

cd Embeddings
make
./fasttext skipgram -input example/sample_input.txt -output example/sample_output -lr 0.025 -dim 100 -t 1e-3 -props w+l+m -minCount 2 -ws 3 -bucket 2000000 -lemmaoutput example/sample_lemma_output -morphoutput example/sample_morph_output

This command trains embeddings by averaging the sub-word units: character n-grams (w:), lemma (l:), and morphological tags (m:). The -props field accepts different combinations depending on your requirements. For example, to combine character n-grams and morphological tags, pass -props w+m. The arguments -lemmaoutput and -morphoutput write out the embeddings of each lemma and morphological sub-word unit, respectively.
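To make the effect of -props concrete, here is a deliberately simplified Python sketch of selecting fields and averaging their vectors. It is not the library's implementation: the function names and the toy 2-dimensional vectors are hypothetical, and it assumes one vector per field, whereas the real model works with many character n-gram and tag vectors.

```python
def select_units(token, props):
    """Return the fields of `token` that are named in a '+'-joined props string."""
    fields = dict(part.split(":", 1) for part in token.split("~"))
    return {name: fields[name] for name in props.split("+") if name in fields}


def average(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


token = "w:नदी~l:नद~m:N+Fem+Dir+Sg~ipa:nədiː"

# With props "w+m", the lemma and ipa fields are ignored.
units = select_units(token, "w+m")

# Toy vectors, invented for illustration only (one per field).
toy_vectors = {"w": [1.0, 0.0], "m": [0.0, 1.0]}
word_vector = average([toy_vectors[name] for name in units])

print(list(units))
print(word_vector)
```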

Using pretrained sub-word embeddings

To use pretrained embeddings for initializing vectors for the sub-words, the pretrained embeddings should have the following format:

102345 100
Punc 1.4163 1.697 -0.95646 0.4587 1.2924 0 ...

where 102345 denotes the number of unique sub-words and 100 denotes the embedding size. Then run the following:

./fasttext skipgram -input example/sample_input.txt -output example/sample_output_pretrained_with_morph -ws 3 -t 1e-3 -minCount 2 -lr 0.025 -bucket 2000000 -props w+l+m -pretrainedVectors example/morph_output.vec
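Before passing a file to -pretrainedVectors, it can help to check that it matches the header-plus-rows format described above. The following Python sketch does this under stated assumptions: read_vec is a hypothetical helper, fields are assumed to be whitespace-separated, and the tiny in-memory sample (2 sub-words, 3 dimensions) is made up for illustration.

```python
import io


def read_vec(handle):
    """Read 'count dim' followed by 'name v1 ... v_dim' rows; validate both."""
    header = handle.readline().split()
    count, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # tolerate blank lines
        if len(parts) != dim + 1:
            raise ValueError(f"expected {dim} values for {parts[0]!r}")
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    if len(vectors) != count:
        raise ValueError(f"header says {count} rows, found {len(vectors)}")
    return dim, vectors


# Made-up sample: 2 sub-words, embedding size 3.
sample = io.StringIO("2 3\nPunc 1.4163 1.697 -0.95646\nN 0.4587 1.2924 0\n")
dim, vectors = read_vec(sample)

print(dim, sorted(vectors))
```

For a real file, the same function could be called on a handle opened with open(path, encoding="utf-8").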

Best embeddings (in reference to the paper)

The embeddings that gave the best performance on the NER task in our paper, Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations (EMNLP 2018), are available in the embeddings_released/ folder.

References

If you use this software for research purposes, we would appreciate it if you cited the following:

@InProceedings{D18-1366,
  author = 	"Chaudhary, Aditi
		and Zhou, Chunting
		and Levin, Lori
		and Neubig, Graham
		and Mortensen, David R.
		and Carbonell, Jaime",
  title = 	"Adapting Word Embeddings to New Languages with Morphological and Phonological Subword Representations",
  booktitle = 	"Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing",
  year = 	"2018",
  publisher = 	"Association for Computational Linguistics",
  pages = 	"3285--3295",
  location = 	"Brussels, Belgium",
  url = 	"http://aclweb.org/anthology/D18-1366"
}

Contact

For any issues, please feel free to reach out to [email protected].
