Word embeddings, trained on large unlabeled corpora, are useful for many natural language processing tasks. In contrast to the Word2vec model, FastText (Bojanowski et al., 2016) accounts for sub-word information by also embedding character n-grams: a FastText word representation is the word's embedding vector plus the sum of the vectors of the n-grams contained in it. Word2vec vector norms have been shown to correlate with word significance (Schakel & Wilson, 2015). This blog post visualizes the vector norms of FastText embeddings and evaluates the FastText word vector norm, multiplied by the number of the word's n-grams, as a signal for detecting non-English OOV words.
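To make the proposed score concrete, here is a minimal sketch of computing it with gensim. The model path is hypothetical, and the n-gram enumeration mirrors FastText's scheme (the word wrapped in `<` and `>`, with all character n-grams of length `min_n` to `max_n` extracted); it is an illustration of the idea rather than the post's exact code.

```python
import numpy as np
from gensim.models.fasttext import load_facebook_model


def char_ngrams(word, min_n, max_n):
    # Enumerate character n-grams the way FastText does:
    # pad the word with boundary markers, then slide windows
    # of every length from min_n to max_n over it.
    padded = f"<{word}>"
    return [padded[i:i + n]
            for n in range(min_n, max_n + 1)
            for i in range(len(padded) - n + 1)]


# Hypothetical model file; any pretrained FastText .bin works.
model = load_facebook_model("cc.en.300.bin")
wv = model.wv


def oov_score(word):
    # FastText returns a vector even for OOV words, assembled
    # from the word's n-gram vectors, so the norm is defined
    # for any string. The score is norm * n-gram count.
    norm = np.linalg.norm(wv[word])
    return norm * len(char_ngrams(word, wv.min_n, wv.max_n))


# A familiar English word vs. a gibberish string: the score is
# expected to separate the two.
print(oov_score("hello"), oov_score("xqzpwt"))
```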