text2vec's Introduction

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP).

Goals we aimed to achieve with the development of text2vec:

  • Concise - expose as few functions as possible
  • Consistent - expose unified interfaces; no need to learn a new interface for each task
  • Flexible - make it easy to solve complex tasks
  • Fast - maximize efficiency per single thread, and transparently scale to multiple threads on multicore machines
  • Memory efficient - use streams and iterators; avoid keeping all data in RAM where possible

See the API section for details.
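The flavor of the API can be sketched with a minimal pipeline. The toy corpus below is hypothetical; the function names (itoken, create_vocabulary, vocab_vectorizer, create_dtm) are from text2vec's public API, though exact argument names may differ between versions:

```r
library(text2vec)

# hypothetical toy corpus
docs <- c("the quick brown fox", "jumps over the lazy dog")

# stream tokens through an iterator rather than materializing everything
it <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer)

# one pass builds the vocabulary; a vectorizer maps tokens to columns
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)

# a second pass over the data produces the sparse document-term matrix
it2 <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer)
dtm <- create_dtm(it2, vectorizer)
dim(dtm)  # 2 documents x vocabulary size
```

Note that the iterator is consumed by the vocabulary pass, so a fresh iterator is created for the dtm pass.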

Performance

[Screenshot: htop showing CPU utilization]

This package is efficient because it is carefully written in C++, which also means that text2vec is memory friendly. Some parts are fully parallelized using OpenMP.

Other embarrassingly parallel tasks (such as vectorization) can use any fork-based parallel backend on UNIX-like machines and achieve near-linear scalability with the number of available cores.

Finally, a streaming API means that users do not have to load all the data into RAM.

Contributing

The package has an issue tracker on GitHub, where I file feature requests and notes for future work. Any ideas are appreciated.

Contributors are welcome.

License

GPL (>= 2)

text2vec's People

Contributors

chrisss93, dfalbel, dselivanov, kbenoit, lmullen, manuelbickel, michaelchirico, michaelpaulhirsch, mtoto, pshashk, tylerlittlefield, y-he2


text2vec's Issues

performance comparison with quanteda

Hi - Interesting package, and I agree wholeheartedly with your API and performance qualms with tm(). That's why we started https://github.com/kbenoit/quanteda. My replication of your performance comparisons is very similar to what you reported. Here is quanteda:

> quantedaDtm <- quanteda::dfm(dt[['review']])
Creating a dfm from a character vector ...
   ... lowercasing
   ... tokenizing
   ... indexing documents: 25,000 documents
   ... indexing features: 100,605 feature types
   ... created a 25000 x 100605 sparse dfm
   ... complete. 
Elapsed time: 5.701 seconds.
> print(object.size(quantedaDtm), quote = FALSE, units = "Mb")
47.7 Mb

Note that this does everything in one pass, including lowercasing and tokenisation. There are methods defined for corpus management etc and dfm() methods for those objects as well, but this is the quickest way to go from the input text into a matrix representation.

Add wrapper for parallel dtm construction

It will be simple to construct the dtm, tcm, and vocabulary using foreach and a corresponding reduce function:

  1. + for tcm
  2. rbind for dtm
  3. data.table join or merge for vocabulary
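A sketch of what such a wrapper might look like for the dtm case. The `chunks` list and the shared `vectorizer` are assumptions here, and this is an illustration of the foreach-plus-reduce idea, not the package's actual implementation:

```r
library(foreach)
library(doParallel)

# assumed inputs: `chunks` is a list of character vectors (document chunks)
# and `vectorizer` was built once from a shared vocabulary
registerDoParallel(cores = 4)

# each worker builds a partial dtm; rbind is the reduce step for dtm
dtm <- foreach(chunk = chunks, .combine = rbind) %dopar% {
  it <- text2vec::itoken(chunk, preprocessor = tolower,
                         tokenizer = text2vec::word_tokenizer)
  text2vec::create_dtm(it, vectorizer)
}
```

For the tcm the reduce step would be `+` instead of `rbind`, since partial co-occurrence counts are simply summed.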

Long vectors (> 2^31) and large matrices

While creating a tcm in VocabCorpus on an English Wikipedia dump with a top-400K vocabulary, I got the following:

Error in create_vocab_corpus(iterator = it2, vocabulary = vocab, grow_dtm = F, : long vectors not supported yet: ../../src/include/Rinlinedfuns.h:137

At the moment I can't figure out what is wrong; this probably needs more investigation into Rcpp Modules.

Add introduction vignette

  • Creating vocabulary corpus
  • Creating hash corpus
  • Term co-occurrence matrix, GloVe factorization, analogies on a Wikipedia dump

Improve tokenizers

Switch to stringi or stringr for tokenization and preprocessing. These libraries are 2-3 times faster than R's default family of regular-expression functions.
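A sketch of what the switch could look like, using stringi's word-boundary splitter (the sample documents are hypothetical; `stri_split_boundaries` and `stri_trans_tolower` are real stringi functions):

```r
library(stringi)

docs <- c("The quick brown fox.", "Jumps over the lazy dog!")

# base-R approach the issue proposes replacing
base_tokens <- strsplit(tolower(docs), "\\s+")

# stringi equivalent: locale-aware word boundaries, typically much faster,
# and it drops punctuation-only tokens via skip_word_none
stri_tokens <- stri_split_boundaries(stri_trans_tolower(docs),
                                     type = "word",
                                     skip_word_none = TRUE)
```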

Installing on OS X

Hi, I tried installing on a Mac and got the following messages.
Any ideas?

Thanks.

Downloading github repo dselivanov/text2vec@master
Installing text2vec
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save --no-restore CMD INSTALL
'/private/var/folders/6l/r_hnn2w93z185znmtg8x__2r0000gn/T/RtmpLra2gU/devtools56c7644c81a/dselivanov-text2vec-eae208a'
--library='/Library/Frameworks/R.framework/Versions/3.2/Resources/library' --install-tests

  • installing source package ‘text2vec’ ...
    ** libs
    clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -DPLATFORM_PKGTYPE='"mac.binary.mavericks"' -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/digest/include" -fPIC -Wall -mtune=core2 -g -O2 -c Corpus.cpp -o Corpus.o
    clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -DPLATFORM_PKGTYPE='"mac.binary.mavericks"' -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/digest/include" -fPIC -Wall -mtune=core2 -g -O2 -c Glove.cpp -o Glove.o
    clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -DPLATFORM_PKGTYPE='"mac.binary.mavericks"' -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/digest/include" -fPIC -Wall -mtune=core2 -g -O2 -c RcppExports.cpp -o RcppExports.o
    clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -DPLATFORM_PKGTYPE='"mac.binary.mavericks"' -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/digest/include" -fPIC -Wall -mtune=core2 -g -O2 -c hash.cpp -o hash.o
    clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -DPLATFORM_PKGTYPE='"mac.binary.mavericks"' -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/digest/include" -fPIC -Wall -mtune=core2 -g -O2 -c ngram.cpp -o ngram.o
    clang++ -std=c++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/opt/X11/lib -L/usr/local/lib /usr/local/lib/libcairo.a /usr/local/lib/libpixman-1.a /usr/local/lib/libfreetype.a /usr/local/lib/libfontconfig.a -lxml2 /usr/local/lib/libreadline.a -o text2vec.so Corpus.o Glove.o RcppExports.o hash.o ngram.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
    clang: error: no such file or directory: '/usr/local/lib/libcairo.a'
    clang: error: no such file or directory: '/usr/local/lib/libpixman-1.a'
    clang: error: no such file or directory: '/usr/local/lib/libfreetype.a'
    clang: error: no such file or directory: '/usr/local/lib/libfontconfig.a'
    clang: error: no such file or directory: '/usr/local/lib/libreadline.a'
    make: *** [text2vec.so] Error 1
    ERROR: compilation failed for package ‘text2vec’
  • removing ‘/Library/Frameworks/R.framework/Versions/3.2/Resources/library/text2vec’
    Error: Command failed (1)

Add tests

Need to write tests for C++ and R modules.

get_dtm() sometimes fails, returning a matrix one row shorter than number of docs in corpus

Hi,
first of all, thanks a lot for your great work. text2vec has in my experience been easily the fastest and most memory-efficient way in R to work with text in vector space. It's been working great for me so far, but now I seem to have come across a bug.

The issue comes up regularly when I try to apply a previously trained model on some new text data. In this case I create a corpus from the new text using the same vocabulary as was used for the original text. However, when I then try to get the DTM (with get_dtm), the returned matrix is sometimes one row shorter than the corpus :(

Here is the weird thing. I can create a corpus and then a DTM from a data frame containing more than 200,000 documents. But if I take, say, 100 random samples of only 10 documents from this data frame, then sometimes a DTM will have only 9 rows, although the corresponding corpus has in fact 10 documents, as it should.

In short, get_dtm doesn't always correctly return a DTM for a corpus. And so far, when it fails, the DTM is always exactly one row short. Any ideas why this may happen?

Thanks,
Thomas

P.S. Example console output:

> corp$corpus$get_doc_count()
[1] 10
> corp$corpus$get_dtm()
9 x 5388 sparse Matrix of class "dgTMatrix"

Compilation error on CentOS 6.6

When installing text2vec on CentOS 6.6 I get the following errors:

> install.packages("text2vec")
Installing package into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/src/contrib/text2vec_0.2.0.tar.gz'
Content type 'application/x-gzip' length 2149487 bytes (2.0 MB)
==================================================
downloaded 2.0 MB

* installing *source* package ‘text2vec’ ...
** package ‘text2vec’ successfully unpacked and MD5 sums checked
** libs
g++ -m64 -std=c++0x -I/usr/include/R -DNDEBUG  -I/usr/local/include -I"/usr/lib64/R/library/Rcpp/include" -I"/usr/lib64/R/library/RcppParallel/include" -I"/usr/lib64/R/library/digest/include"   -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -c GloveFitter.cpp -o GloveFitter.o
In file included from GloveFitter.cpp:1:
GloveFit.h: In member function ‘double GloveFit::partial_fit(size_t, size_t, const RcppParallel::RVector<int>&, const RcppParallel::RVector<int>&, const RcppParallel::RVector<double>&, const RcppParallel::RVector<int>&)’:
GloveFit.h:111: warning: comparison between signed and unsigned integer expressions
g++ -m64 -std=c++0x -I/usr/include/R -DNDEBUG  -I/usr/local/include -I"/usr/lib64/R/library/Rcpp/include" -I"/usr/lib64/R/library/RcppParallel/include" -I"/usr/lib64/R/library/digest/include"   -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -c HashCorpus.cpp -o HashCorpus.o
In file included from Corpus.h:1,
                 from HashCorpus.h:1,
                 from HashCorpus.cpp:1:
SparseTripletMatrix.h: In member function ‘SEXPREC* SparseTripletMatrix<T>::get_sparse_triplet_matrix(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)’:
SparseTripletMatrix.h:74: error: expected initializer before ‘:’ token
SparseTripletMatrix.h:83: error: expected ‘)’ before ‘;’ token
In file included from HashCorpus.cpp:1:
HashCorpus.h: In member function ‘void HashCorpus::insert_terms(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, int)’:
HashCorpus.h:58: error: expected initializer before ‘:’ token
HashCorpus.h:95: error: expected primary-expression at end of input
HashCorpus.h:95: error: expected ‘;’ at end of input
HashCorpus.h:95: error: expected primary-expression at end of input
HashCorpus.h:95: error: expected ‘)’ at end of input
HashCorpus.h:95: error: expected statement at end of input
HashCorpus.h:52: warning: unused variable ‘term_index’
HashCorpus.h:52: warning: unused variable ‘context_term_index’
HashCorpus.h:54: warning: unused variable ‘K’
HashCorpus.h:55: warning: unused variable ‘i’
HashCorpus.h:56: warning: unused variable ‘increment’
HashCorpus.h:95: error: expected ‘}’ at end of input
HashCorpus.h: In member function ‘void HashCorpus::insert_document_batch(Rcpp::ListOf<const Rcpp::Vector<16, Rcpp::PreserveStorage> >, int)’:
HashCorpus.h:103: error: expected initializer before ‘:’ token
HashCorpus.h:105: error: expected primary-expression before ‘}’ token
HashCorpus.h:105: error: expected ‘;’ before ‘}’ token
HashCorpus.h:105: error: expected primary-expression before ‘}’ token
HashCorpus.h:105: error: expected ‘)’ before ‘}’ token
HashCorpus.h:105: error: expected primary-expression before ‘}’ token
HashCorpus.h:105: error: expected ‘;’ before ‘}’ token
In file included from Corpus.h:1,
                 from HashCorpus.h:1,
                 from HashCorpus.cpp:1:
SparseTripletMatrix.h: In member function ‘SEXPREC* SparseTripletMatrix<T>::get_sparse_triplet_matrix(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) [with T = float]’:
HashCorpus.h:109:   instantiated from here
SparseTripletMatrix.h:73: warning: unused variable ‘n’
SparseTripletMatrix.h: In member function ‘SEXPREC* SparseTripletMatrix<T>::get_sparse_triplet_matrix(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) [with T = unsigned int]’:
HashCorpus.h:114:   instantiated from here
SparseTripletMatrix.h:73: warning: unused variable ‘n’
make: *** [HashCorpus.o] Error 1
ERROR: compilation failed for package ‘text2vec’
* removing ‘/usr/lib64/R/library/text2vec’

The downloaded source packages are in
    ‘/tmp/RtmptqzYpf/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Warning message:
In install.packages("text2vec") :
  installation of package ‘text2vec’ had non-zero exit status

Add wrapper for parallel tcm construction

It will be simple to construct the dtm, tcm, and vocabulary using foreach and a corresponding reduce function:

  1. + for tcm
  2. rbind for dtm
  3. data.table join or merge for vocabulary

predict methods

I think all the transformers also need predict methods, which can be used to apply the exact same transformation to new data.
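One way such a fit/predict pair could look, using a TF-IDF-style transformer as the example. This is purely illustrative (the function and class names are hypothetical, not the package's actual API at the time of this issue); the key point is that weights learned at fit time are reused on new data:

```r
# illustrative sketch: a transformer that learns idf weights on training
# data and applies the *same* weights to new data
fit_tfidf <- function(dtm) {
  # smoothed inverse document frequency per column
  idf <- log(nrow(dtm) / (colSums(dtm > 0) + 1))
  structure(list(idf = idf), class = "tfidf_model")
}

predict.tfidf_model <- function(object, newdata, ...) {
  # reuse the idf learned at fit time -- the point of the issue:
  # identical transformation for training and new data
  newdata %*% Matrix::Diagonal(x = object$idf)
}
```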

Term co-occurrence matrix should be symmetrical

Definitely a bug. It was introduced here, during matrix construction. We need to remove this ugly function and return a triangular matrix to the R side; on the R side we then simply do tcm <- tcm_triangular + t(tcm_triangular). Thanks to zachmayer, see #34.
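The symmetrization step can be sketched directly. The triangular matrix below is a hypothetical example, assuming a zero diagonal (each pair stored once in the upper triangle):

```r
library(Matrix)

# hypothetical upper-triangular co-occurrence counts, zero diagonal
tcm_triangular <- sparseMatrix(i = c(1, 1, 2), j = c(2, 3, 3),
                               x = c(5, 2, 7), dims = c(3, 3))

# reflect across the diagonal to get the full symmetric tcm
tcm <- tcm_triangular + t(tcm_triangular)
isSymmetric(as.matrix(tcm))  # TRUE
```

If the diagonal were nonzero, it would be doubled by this sum and would need to be halved afterwards.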

Windows installation failed - Glove.cpp:

Hello,
I tried to get and compile the package for R on Windows.
Unfortunately it did not work.
Log below.
Hope this helps.
Thank you

devtools::install_github("dselivanov/text2vec")
Downloading GitHub repo dselivanov/text2vec@master
Installing text2vec
"C:/dev/RRO/R-32~1.2/bin/x64/R" --no-site-file --no-environ --no-save --no-restore CMD INSTALL
"C:/Users/heiling/AppData/Local/Temp/RtmpSAsW21/devtools23806cbe639b/dselivanov-text2vec-9497722"
--library="C:/dev/RRO/R-3.2.2/library" --install-tests

Multithreaded BLAS/LAPACK libraries detected. Using 4 cores for math algorithms.

  • installing source package 'text2vec' ...
    ** libs
    g++ -m64 -std=c++0x -I"C:/dev/RRO/R-321.2/include" -DNDEBUG -I"C:/dev/RRO/R-3.2.2/library/Rcpp/include" -I"C:/dev/RRO/R-3.2.2/library/digest/include" -I"c:/applications/extsoft/include" -O2 -Wall -mtune=core2 -c Corpus.cpp -o Corpus.o
    g++ -m64 -std=c++0x -I"C:/dev/RRO/R-32~1.2/include" -DNDEBUG -I"C:/dev/RRO/R-3.2.2/library/Rcpp/include" -I"C:/dev/RRO/R-3.2.2/library/digest/include" -I"c:/applications/extsoft/include" -O2 -Wall -mtune=core2 -c Glove.cpp -o Glove.o
    Glove.cpp:150:26: sorry, unimplemented: non-static data member initializers
    Glove.cpp:150:26: error: ISO C++ forbids in-class initialization of non-const static member 'tokens_number'
    make: *** [Glove.o] Error 1
    Warning: running command 'make -f "Makevars.win" -f "C:/dev/RRO/R-321.2/etc/x64/Makeconf" -f "C:/dev/RRO/R-321.2/share/make/winshlib.mk" CXX='$(CXX1X) $(CXX1XSTD)' CXXFLAGS='$(CXX1XFLAGS)' CXXPICFLAGS='$(CXX1XPICFLAGS)' SHLIB_LDFLAGS='$(SHLIB_CXX1XLDFLAGS)' SHLIB_LD='$(SHLIB_CXX1XLD)' SHLIB="text2vec.dll" WIN=64 TCLBIN=64 OBJECTS="Corpus.o Glove.o RcppExports.o hash.o ngram.o"' had status 2
    ERROR: compilation failed for package 'text2vec'
  • removing 'C:/dev/RRO/R-3.2.2/library/text2vec'
    Error: Command failed (1)

Systematic crash

When I try something basic on Windows 8 64 bits like :

> library("tmlite", lib.loc="~/R/win-library/3.2")
Loading required package: Matrix
> a = rep("pipo", 100)
> b = create_dict_corpus(a)
  |===============================================================================================================            |  90%

It never finishes and it crashes.

Any reason?

support document ids for dtm

  • Iterators should optionally return ids for documents. If the list iterator produced by itoken contains names, the dtm should store them as rownames.
  • to_lda_c should set the input matrix's rownames as the names of the resulting list.
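A sketch of the proposed behavior, using the `ids` argument that later text2vec releases expose on itoken (treat the exact signature as an assumption; the documents here are hypothetical):

```r
library(text2vec)

docs <- c(doc_a = "first document text", doc_b = "second document text")

# pass the names along as document ids; the resulting dtm
# should carry them as rownames
it  <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
              ids = names(docs))
dtm <- create_dtm(it, hash_vectorizer())
rownames(dtm)
```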

Warnings related to refclasses

Sometimes, R throws the following strange warnings:

Warning messages:
1: In installClassMethod(value, self, field, selfEnv, thisClass) :
method .objectPackage from class Rcpp_VocabCorpus was not processed into a class method until being installed. Possible corruption of the methods in the class.
2: In installClassMethod(value, self, field, selfEnv, thisClass) :
method .objectParent from class Rcpp_VocabCorpus was not processed into a class method until being installed. Possible corruption of the methods in the class.

I didn't see any issues, errors, or crashes, but I suppose this needs more investigation.

paragraph2vec

See this useful comment from zachmayer:

  1. Start with GloVe. In this case you have a square matrix: word X word, where the rows are the words and the columns are the contexts they occur in. This matrix is symmetrical, and in GloVe you factor this matrix to get vectors for your words.
  2. Now consider the dtm matrix you already have defined in your package. In this case you have a rectangular matrix: document X word, where the rows are documents and the columns are the words (or contexts) that occur in those documents. If you do SVD on this matrix, you get pretty simple (but effective) vectors for your documents. I've actually used this approach in a lot of cases: bag-of-words + SVD gives pretty decent document vectors pretty quickly.

Now, it's time to get fancy! =D Both of our matrices have the same number of columns: each column is a word from our bag-of-words. So we can row bind these matrices together into a giant matrix where the rows are words / documents and the columns are contexts (either for the word, or that occur in the document). Now we factor this matrix and find words vectors AND document vectors at the same time. This simultaneous word/document vector strategy seems to give better results than doing step 1 & 2 on their own.
So in your package, you could start with create_hash_corpus with ngrams = 0 and skips = 0. Then extract the dtm from this matrix. Next, create_hash_corpus with skips = 1:8 and some fancy skip-gram weighting function. Extract the cooc matrix from this matrix.
Now run final_matrix = rbind(dtm, cooc) and then model = irlba(final_matrix, nu=100) and then doc_vecs = model$u[1:nrow(dtm),], and you have really good document vectors!
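The steps above can be sketched end-to-end. The `dtm` (documents x words) and `cooc` (words x words) inputs are assumed to already exist with identical column vocabularies, as the comment describes:

```r
library(Matrix)
library(irlba)

# assumed inputs: dtm (documents x words), cooc (words x words),
# built over the same vocabulary so the columns line up
stopifnot(ncol(dtm) == ncol(cooc))

# stack documents and words as rows of one matrix over shared contexts
final_matrix <- rbind(dtm, cooc)

# truncated SVD; left singular vectors give row embeddings
model     <- irlba(final_matrix, nv = 100)
doc_vecs  <- model$u[seq_len(nrow(dtm)), ]   # document vectors
word_vecs <- model$u[-seq_len(nrow(dtm)), ]  # word vectors
```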

add dictionary option to Corpus construction procedure

Currently we can apply a dictionary post-filter while creating the dtm. But in the HashDict case it is impossible to keep the dictionary (as a workaround we can hash the dictionary terms and remove those columns, but this will not be exact and we will potentially lose some information).
