text2vec's Introduction

text2vec is an R package which provides an efficient framework with a concise API for text analysis and natural language processing (NLP).

Goals we aimed to achieve with the development of text2vec:

  • Concise - expose as few functions as possible
  • Consistent - expose unified interfaces; no need to learn a new interface for each task
  • Flexible - make it easy to solve complex tasks
  • Fast - maximize efficiency per single thread, and transparently scale to multiple threads on multicore machines
  • Memory efficient - use streams and iterators; avoid keeping all data in RAM where possible

See the API section for details.
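The flavor of the API can be sketched with a minimal pipeline. The toy corpus below is hypothetical; the function names (itoken, create_vocabulary, vocab_vectorizer, create_dtm) are from text2vec's public API, though exact argument names may differ between versions:

```r
library(text2vec)

# hypothetical toy corpus
docs <- c("the quick brown fox", "jumps over the lazy dog")

# stream tokens through an iterator rather than materializing everything
it <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer)

# one pass builds the vocabulary; a vectorizer maps tokens to columns
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)

# a second pass over the data produces the sparse document-term matrix
it2 <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer)
dtm <- create_dtm(it2, vectorizer)
dim(dtm)  # 2 documents x vocabulary size
```

Note that the iterator is consumed by the vocabulary pass, so a fresh iterator is created for the dtm pass.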

Performance

[Screenshot: htop showing CPU utilization]

This package is efficient because it is carefully written in C++, which also means that text2vec is memory friendly. Some parts are fully parallelized using OpenMP.

Other embarrassingly parallel tasks (such as vectorization) can use any fork-based parallel backend on UNIX-like machines and achieve near-linear scalability with the number of available cores.

Finally, a streaming API means that users do not have to load all the data into RAM.

Contributing

The package has an issue tracker on GitHub, where I file feature requests and notes for future work. Any ideas are appreciated.

Contributors are welcome.

License

GPL (>= 2)

text2vec's People

Contributors

chrisss93, dfalbel, dselivanov, kbenoit, lmullen, manuelbickel, michaelchirico, michaelpaulhirsch, mtoto, pshashk, tylerlittlefield, y-he2


text2vec's Issues

performance comparison with quanteda

Hi - Interesting package, and I agree wholeheartedly with your API and performance qualms with tm(). That's why we started https://github.com/kbenoit/quanteda. My replication of your performance comparisons is very similar to what you reported. Here is quanteda:

> quantedaDtm <- quanteda::dfm(dt[['review']])
Creating a dfm from a character vector ...
   ... lowercasing
   ... tokenizing
   ... indexing documents: 25,000 documents
   ... indexing features: 100,605 feature types
   ... created a 25000 x 100605 sparse dfm
   ... complete. 
Elapsed time: 5.701 seconds.
> print(object.size(quantedaDtm), quote = FALSE, units = "Mb")
47.7 Mb

Note that this does everything in one pass, including lowercasing and tokenisation. There are methods defined for corpus management etc and dfm() methods for those objects as well, but this is the quickest way to go from the input text into a matrix representation.

Add wrapper for parallel dtm construction

It will be simple to construct the dtm, tcm, and vocabulary using foreach and a corresponding reduce function:

  1. + for tcm
  2. rbind for dtm
  3. data.table join or merge for vocabulary
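A sketch of what such a wrapper might look like for the dtm case. The `chunks` list and the shared `vectorizer` are assumptions here, and this is an illustration of the foreach-plus-reduce idea, not the package's actual implementation:

```r
library(foreach)
library(doParallel)

# assumed inputs: `chunks` is a list of character vectors (document chunks)
# and `vectorizer` was built once from a shared vocabulary
registerDoParallel(cores = 4)

# each worker builds a partial dtm; rbind is the reduce step for dtm
dtm <- foreach(chunk = chunks, .combine = rbind) %dopar% {
  it <- text2vec::itoken(chunk, preprocessor = tolower,
                         tokenizer = text2vec::word_tokenizer)
  text2vec::create_dtm(it, vectorizer)
}
```

For the tcm the reduce step would be `+` instead of `rbind`, since partial co-occurrence counts are simply summed.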

Long vectors (> 2^31) and large matrices

While creating a tcm in VocabCorpus on an English Wikipedia dump with a top-400K vocabulary, I got the following:

Error in create_vocab_corpus(iterator = it2, vocabulary = vocab, grow_dtm = F, : long vectors not supported yet: ../../src/include/Rinlinedfuns.h:137

At the moment I can't figure out what is wrong; this probably needs more investigation into Rcpp Modules.

Add introduction vignette

  • Creating vocabulary corpus
  • Creating hash corpus
  • Term co-occurrence matrix, GloVe factorization, analogies on a Wikipedia dump

Improve tokenizers

Switch to stringi or stringr for tokenization and preprocessing. These libraries are 2-3 times faster than R's default family of regular-expression functions.
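A sketch of what the switch could look like, using stringi's word-boundary splitter (the sample documents are hypothetical; `stri_split_boundaries` and `stri_trans_tolower` are real stringi functions):

```r
library(stringi)

docs <- c("The quick brown fox.", "Jumps over the lazy dog!")

# base-R approach the issue proposes replacing
base_tokens <- strsplit(tolower(docs), "\\s+")

# stringi equivalent: locale-aware word boundaries, typically much faster,
# and it drops punctuation-only tokens via skip_word_none
stri_tokens <- stri_split_boundaries(stri_trans_tolower(docs),
                                     type = "word",
                                     skip_word_none = TRUE)
```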

Installing on OS X

Hi, I tried installing on a Mac and got the following messages.
Any ideas?

Thanks.

Downloading github repo dselivanov/text2vec@master
Installing text2vec
'/Library/Frameworks/R.framework/Resources/bin/R' --no-site-file --no-environ --no-save --no-restore CMD INSTALL
'/private/var/folders/6l/r_hnn2w93z185znmtg8x__2r0000gn/T/RtmpLra2gU/devtools56c7644c81a/dselivanov-text2vec-eae208a'
--library='/Library/Frameworks/R.framework/Versions/3.2/Resources/library' --install-tests

  • installing source package ‘text2vec’ ...
    ** libs
    clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -DPLATFORM_PKGTYPE='"mac.binary.mavericks"' -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/digest/include" -fPIC -Wall -mtune=core2 -g -O2 -c Corpus.cpp -o Corpus.o
    clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -DPLATFORM_PKGTYPE='"mac.binary.mavericks"' -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/digest/include" -fPIC -Wall -mtune=core2 -g -O2 -c Glove.cpp -o Glove.o
    clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -DPLATFORM_PKGTYPE='"mac.binary.mavericks"' -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/digest/include" -fPIC -Wall -mtune=core2 -g -O2 -c RcppExports.cpp -o RcppExports.o
    clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -DPLATFORM_PKGTYPE='"mac.binary.mavericks"' -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/digest/include" -fPIC -Wall -mtune=core2 -g -O2 -c hash.cpp -o hash.o
    clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -DPLATFORM_PKGTYPE='"mac.binary.mavericks"' -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/Rcpp/include" -I"/Library/Frameworks/R.framework/Versions/3.2/Resources/library/digest/include" -fPIC -Wall -mtune=core2 -g -O2 -c ngram.cpp -o ngram.o
    clang++ -std=c++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/opt/X11/lib -L/usr/local/lib /usr/local/lib/libcairo.a /usr/local/lib/libpixman-1.a /usr/local/lib/libfreetype.a /usr/local/lib/libfontconfig.a -lxml2 /usr/local/lib/libreadline.a -o text2vec.so Corpus.o Glove.o RcppExports.o hash.o ngram.o -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
    clang: error: no such file or directory: '/usr/local/lib/libcairo.a'
    clang: error: no such file or directory: '/usr/local/lib/libpixman-1.a'
    clang: error: no such file or directory: '/usr/local/lib/libfreetype.a'
    clang: error: no such file or directory: '/usr/local/lib/libfontconfig.a'
    clang: error: no such file or directory: '/usr/local/lib/libreadline.a'
    make: *** [text2vec.so] Error 1
    ERROR: compilation failed for package ‘text2vec’
  • removing ‘/Library/Frameworks/R.framework/Versions/3.2/Resources/library/text2vec’
    Error: Command failed (1)

Add tests

Need to write tests for C++ and R modules.

get_dtm() sometimes fails, returning a matrix one row shorter than number of docs in corpus

Hi,
first of all, thanks a lot for your great work. text2vec has in my experience been easily the fastest and most memory-efficient way in R to work with text in vector space. It's been working great for me so far, but now I seem to have come across a bug.

The issue comes up regularly when I try to apply a previously trained model on some new text data. In this case I create a corpus from the new text using the same vocabulary as was used for the original text. However, when I then try to get the DTM (with get_dtm), the returned matrix is sometimes one row shorter than the corpus :(

Here is the weird thing. I can create a corpus and then a DTM from a data frame containing more than 200,000 documents. But if I take, say, 100 random samples of only 10 documents from this data frame, then sometimes a DTM will have only 9 rows, although the corresponding corpus has in fact 10 documents, as it should.

In short, get_dtm doesn't always correctly return a DTM for a corpus. And so far, when it fails, the DTM is always exactly one row short. Any ideas why this may happen?

Thanks,
Thomas

P.S. Example console output:

> corp$corpus$get_doc_count()
[1] 10
> corp$corpus$get_dtm()
9 x 5388 sparse Matrix of class "dgTMatrix"

Compilation error on CentOS 6.6

When installing text2vec on CentOS 6.6 I get the following errors:

> install.packages("text2vec")
Installing package into ‘/usr/lib64/R/library’
(as ‘lib’ is unspecified)
trying URL 'http://cran.rstudio.com/src/contrib/text2vec_0.2.0.tar.gz'
Content type 'application/x-gzip' length 2149487 bytes (2.0 MB)
==================================================
downloaded 2.0 MB

* installing *source* package ‘text2vec’ ...
** package ‘text2vec’ successfully unpacked and MD5 sums checked
** libs
g++ -m64 -std=c++0x -I/usr/include/R -DNDEBUG  -I/usr/local/include -I"/usr/lib64/R/library/Rcpp/include" -I"/usr/lib64/R/library/RcppParallel/include" -I"/usr/lib64/R/library/digest/include"   -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -c GloveFitter.cpp -o GloveFitter.o
In file included from GloveFitter.cpp:1:
GloveFit.h: In member function ‘double GloveFit::partial_fit(size_t, size_t, const RcppParallel::RVector<int>&, const RcppParallel::RVector<int>&, const RcppParallel::RVector<double>&, const RcppParallel::RVector<int>&)’:
GloveFit.h:111: warning: comparison between signed and unsigned integer expressions
g++ -m64 -std=c++0x -I/usr/include/R -DNDEBUG  -I/usr/local/include -I"/usr/lib64/R/library/Rcpp/include" -I"/usr/lib64/R/library/RcppParallel/include" -I"/usr/lib64/R/library/digest/include"   -fpic  -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m64 -mtune=generic -c HashCorpus.cpp -o HashCorpus.o
In file included from Corpus.h:1,
                 from HashCorpus.h:1,
                 from HashCorpus.cpp:1:
SparseTripletMatrix.h: In member function ‘SEXPREC* SparseTripletMatrix<T>::get_sparse_triplet_matrix(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&)’:
SparseTripletMatrix.h:74: error: expected initializer before ‘:’ token
SparseTripletMatrix.h:83: error: expected ‘)’ before ‘;’ token
In file included from HashCorpus.cpp:1:
HashCorpus.h: In member function ‘void HashCorpus::insert_terms(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, int)’:
HashCorpus.h:58: error: expected initializer before ‘:’ token
HashCorpus.h:95: error: expected primary-expression at end of input
HashCorpus.h:95: error: expected ‘;’ at end of input
HashCorpus.h:95: error: expected primary-expression at end of input
HashCorpus.h:95: error: expected ‘)’ at end of input
HashCorpus.h:95: error: expected statement at end of input
HashCorpus.h:52: warning: unused variable ‘term_index’
HashCorpus.h:52: warning: unused variable ‘context_term_index’
HashCorpus.h:54: warning: unused variable ‘K’
HashCorpus.h:55: warning: unused variable ‘i’
HashCorpus.h:56: warning: unused variable ‘increment’
HashCorpus.h:95: error: expected ‘}’ at end of input
HashCorpus.h: In member function ‘void HashCorpus::insert_document_batch(Rcpp::ListOf<const Rcpp::Vector<16, Rcpp::PreserveStorage> >, int)’:
HashCorpus.h:103: error: expected initializer before ‘:’ token
HashCorpus.h:105: error: expected primary-expression before ‘}’ token
HashCorpus.h:105: error: expected ‘;’ before ‘}’ token
HashCorpus.h:105: error: expected primary-expression before ‘}’ token
HashCorpus.h:105: error: expected ‘)’ before ‘}’ token
HashCorpus.h:105: error: expected primary-expression before ‘}’ token
HashCorpus.h:105: error: expected ‘;’ before ‘}’ token
In file included from Corpus.h:1,
                 from HashCorpus.h:1,
                 from HashCorpus.cpp:1:
SparseTripletMatrix.h: In member function ‘SEXPREC* SparseTripletMatrix<T>::get_sparse_triplet_matrix(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) [with T = float]’:
HashCorpus.h:109:   instantiated from here
SparseTripletMatrix.h:73: warning: unused variable ‘n’
SparseTripletMatrix.h: In member function ‘SEXPREC* SparseTripletMatrix<T>::get_sparse_triplet_matrix(std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&, std::vector<std::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::basic_string<char, std::char_traits<char>, std::allocator<char> > > >&) [with T = unsigned int]’:
HashCorpus.h:114:   instantiated from here
SparseTripletMatrix.h:73: warning: unused variable ‘n’
make: *** [HashCorpus.o] Error 1
ERROR: compilation failed for package ‘text2vec’
* removing ‘/usr/lib64/R/library/text2vec’

The downloaded source packages are in
    ‘/tmp/RtmptqzYpf/downloaded_packages’
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Warning message:
In install.packages("text2vec") :
  installation of package ‘text2vec’ had non-zero exit status

Add wrapper for parallel tcm construction

It will be simple to construct the dtm, tcm, and vocabulary using foreach and a corresponding reduce function:

  1. + for tcm
  2. rbind for dtm
  3. data.table join or merge for vocabulary

predict methods

I think all the transformers also need predict methods, which can be used to apply the exact same transformation to new data.
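One way such a fit/predict pair could look, using a TF-IDF-style transformer as the example. This is purely illustrative (the function and class names are hypothetical, not the package's actual API at the time of this issue); the key point is that weights learned at fit time are reused on new data:

```r
# illustrative sketch: a transformer that learns idf weights on training
# data and applies the *same* weights to new data
fit_tfidf <- function(dtm) {
  # smoothed inverse document frequency per column
  idf <- log(nrow(dtm) / (colSums(dtm > 0) + 1))
  structure(list(idf = idf), class = "tfidf_model")
}

predict.tfidf_model <- function(object, newdata, ...) {
  # reuse the idf learned at fit time -- the point of the issue:
  # identical transformation for training and new data
  newdata %*% Matrix::Diagonal(x = object$idf)
}
```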

Term co-occurrence matrix should be symmetrical

Definitely a bug. It was introduced here, during matrix construction. We need to remove this ugly function and return a triangular matrix to the R side; on the R side we then simply do tcm <- tcm_triangular + t(tcm_triangular). Thanks to zachmayer, see #34.
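The symmetrization step can be sketched directly. The triangular matrix below is a hypothetical example, assuming a zero diagonal (each pair stored once in the upper triangle):

```r
library(Matrix)

# hypothetical upper-triangular co-occurrence counts, zero diagonal
tcm_triangular <- sparseMatrix(i = c(1, 1, 2), j = c(2, 3, 3),
                               x = c(5, 2, 7), dims = c(3, 3))

# reflect across the diagonal to get the full symmetric tcm
tcm <- tcm_triangular + t(tcm_triangular)
isSymmetric(as.matrix(tcm))  # TRUE
```

If the diagonal were nonzero, it would be doubled by this sum and would need to be halved afterwards.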

Windows installation failed - Glove.cpp:

Hello,
I tried to get and compile the package for R on Windows.
Unfortunately it did not work.
Log below.
Hope this helps.
Thank you

devtools::install_github("dselivanov/text2vec")
Downloading GitHub repo dselivanov/text2vec@master
Installing text2vec
"C:/dev/RRO/R-32~1.2/bin/x64/R" --no-site-file --no-environ --no-save --no-restore CMD INSTALL
"C:/Users/heiling/AppData/Local/Temp/RtmpSAsW21/devtools23806cbe639b/dselivanov-text2vec-9497722"
--library="C:/dev/RRO/R-3.2.2/library" --install-tests

Multithreaded BLAS/LAPACK libraries detected. Using 4 cores for math algorithms.

  • installing source package 'text2vec' ...
    ** libs
    g++ -m64 -std=c++0x -I"C:/dev/RRO/R-321.2/include" -DNDEBUG -I"C:/dev/RRO/R-3.2.2/library/Rcpp/include" -I"C:/dev/RRO/R-3.2.2/library/digest/include" -I"c:/applications/extsoft/include" -O2 -Wall -mtune=core2 -c Corpus.cpp -o Corpus.o
    g++ -m64 -std=c++0x -I"C:/dev/RRO/R-32~1.2/include" -DNDEBUG -I"C:/dev/RRO/R-3.2.2/library/Rcpp/include" -I"C:/dev/RRO/R-3.2.2/library/digest/include" -I"c:/applications/extsoft/include" -O2 -Wall -mtune=core2 -c Glove.cpp -o Glove.o
    Glove.cpp:150:26: sorry, unimplemented: non-static data member initializers
    Glove.cpp:150:26: error: ISO C++ forbids in-class initialization of non-const static member 'tokens_number'
    make: *** [Glove.o] Error 1
    Warning: running command 'make -f "Makevars.win" -f "C:/dev/RRO/R-321.2/etc/x64/Makeconf" -f "C:/dev/RRO/R-321.2/share/make/winshlib.mk" CXX='$(CXX1X) $(CXX1XSTD)' CXXFLAGS='$(CXX1XFLAGS)' CXXPICFLAGS='$(CXX1XPICFLAGS)' SHLIB_LDFLAGS='$(SHLIB_CXX1XLDFLAGS)' SHLIB_LD='$(SHLIB_CXX1XLD)' SHLIB="text2vec.dll" WIN=64 TCLBIN=64 OBJECTS="Corpus.o Glove.o RcppExports.o hash.o ngram.o"' had status 2
    ERROR: compilation failed for package 'text2vec'
  • removing 'C:/dev/RRO/R-3.2.2/library/text2vec'
    Error: Command failed (1)

Systematic crash

When I try something basic on Windows 8 64 bits like :

> library("tmlite", lib.loc="~/R/win-library/3.2")
Loading required package: Matrix
> a = rep("pipo", 100)
> b = create_dict_corpus(a)
  |===============================================================================================================            |  90%

It never finishes and it crashes.

Any reason?

support document ids for dtm

  • Iterators should optionally return ids for documents. If the list iterator produced by itoken contains names, the dtm should store them as rownames.
  • to_lda_c should set the input matrix's rownames as the names of the resulting list.
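A sketch of the proposed behavior, using the `ids` argument that later text2vec releases expose on itoken (treat the exact signature as an assumption; the documents here are hypothetical):

```r
library(text2vec)

docs <- c(doc_a = "first document text", doc_b = "second document text")

# pass the names along as document ids; the resulting dtm
# should carry them as rownames
it  <- itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
              ids = names(docs))
dtm <- create_dtm(it, hash_vectorizer())
rownames(dtm)
```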

Warnings related to refclasses

Sometimes, R throws the following strange warnings:

Warning messages:
1: In installClassMethod(value, self, field, selfEnv, thisClass) :
method .objectPackage from class Rcpp_VocabCorpus was not processed into a class method until being installed. Possible corruption of the methods in the class.
2: In installClassMethod(value, self, field, selfEnv, thisClass) :
method .objectParent from class Rcpp_VocabCorpus was not processed into a class method until being installed. Possible corruption of the methods in the class.

I didn't see any issues, errors, or crashes, but I suppose this needs more investigation.

paragraph2vec

See this useful comment from zachmayer:

  1. Start with GloVe. In this case you have a square matrix: word X word, where the rows are the words and the columns are the contexts they occur in. This matrix is symmetrical, and in GloVe you factor this matrix to get vectors for your words.
  2. Now consider the dtm matrix you already have defined in your package. In this case you have a rectangular matrix: document X word, where the rows are documents and the columns are the words (or contexts) that occur in those documents. If you do SVD on this matrix, you get pretty simple (but effective) vectors for your documents. I've actually used this approach in a lot of cases: bag-of-words + SVD gives pretty decent document vectors pretty quickly.

Now, it's time to get fancy! =D Both of our matrices have the same number of columns: each column is a word from our bag-of-words. So we can row bind these matrices together into a giant matrix where the rows are words / documents and the columns are contexts (either for the word, or that occur in the document). Now we factor this matrix and find words vectors AND document vectors at the same time. This simultaneous word/document vector strategy seems to give better results than doing step 1 & 2 on their own.
So in your package, you could start with create_hash_corpus with ngrams = 0 and skips = 0. Then extract the dtm from this matrix. Next, create_hash_corpus with skips = 1:8 and some fancy skip-gram weighting function. Extract the cooc matrix from this matrix.
Now run final_matrix = rbind(dtm, cooc) and then model = irlba(final_matrix, nu=100) and then doc_vecs = model$u[1:nrow(dtm),], and you have really good document vectors!
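The steps above can be sketched end-to-end. The `dtm` (documents x words) and `cooc` (words x words) inputs are assumed to already exist with identical column vocabularies, as the comment describes:

```r
library(Matrix)
library(irlba)

# assumed inputs: dtm (documents x words), cooc (words x words),
# built over the same vocabulary so the columns line up
stopifnot(ncol(dtm) == ncol(cooc))

# stack documents and words as rows of one matrix over shared contexts
final_matrix <- rbind(dtm, cooc)

# truncated SVD; left singular vectors give row embeddings
model     <- irlba(final_matrix, nv = 100)
doc_vecs  <- model$u[seq_len(nrow(dtm)), ]   # document vectors
word_vecs <- model$u[-seq_len(nrow(dtm)), ]  # word vectors
```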

add dictionary option to Corpus construction procedure

Currently we can apply a dictionary post-filter while creating the dtm. But in the HashDict case it is impossible to keep the dictionary (as a workaround we can hash the dictionary terms and remove those columns, but this will not be exact and we will potentially lose some information).
