kwartler / text_mining
This repo contains data from Ted Kwartler's "Text Mining in Practice With R" book.
Ted, in your book you apply intercept = F within your glmnet models. You discuss all the other parameters, but you do not provide a justification for this one. I'm curious why you chose it in the headline clickbait case study.
Problem on page 176:
fit.glove <- glove(tcm = tcm, word_vectors_size = 50, x_max = 10, learning_rate = 0.2, num_iters = 15)
RStudio says:
Error in .subset2(public_bind_env, "initialize")(...) :
unused argument (grain_size = 100000)
In addition: Warning message:
'glove' is deprecated.
Use 'GloVe' instead.
See help("Deprecated")
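The warning points to a newer interface. The following is a minimal sketch of the same settings, assuming text2vec 0.6 or later, where the model is the R6 class GlobalVectors, rank takes the place of word_vectors_size, and the iteration count is passed to fit_transform() as n_iter. Argument names changed between text2vec releases, so treat this as an assumption and check ?GlobalVectors for the installed version. The toy sentences are invented and are not the book's data.

```r
library(text2vec)

# Invented toy text standing in for the book's corpus.
txt <- c("the quick brown fox jumps over the lazy dog",
         "the lazy dog sleeps while the quick fox runs")

# Build a small term co-occurrence matrix to fit on.
it <- itoken(txt, tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- create_vocabulary(it)
tcm <- create_tcm(it, vocab_vectorizer(vocab), skip_grams_window = 5L)

# Same settings as the call on page 176, expressed with the R6 interface
# (assumed mapping: word_vectors_size -> rank, num_iters -> n_iter).
glove.model <- GlobalVectors$new(rank = 50, x_max = 10, learning_rate = 0.2)
word.vectors <- glove.model$fit_transform(tcm, n_iter = 15)

# One row per vocabulary term, one column per vector dimension.
print(dim(word.vectors))
```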
On page 173,
> vectorizer <- vocab_vectorizer(vocab,
+   grow_dtm = FALSE,
+   skip_grams_window = 5)
I got the following error message:
Error in vocab_vectorizer(vocab, grow_dtm = FALSE, skip_grams_window = 5) :
unused arguments (grow_dtm = FALSE, skip_grams_window = 5)
Thanks in advance,
Chang-Kyo Suh
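In later text2vec releases vocab_vectorizer() no longer takes these two arguments, and the skip-gram window is passed to create_tcm() instead. A minimal sketch, assuming text2vec 0.5 or later; the toy sentences are invented for illustration.

```r
library(text2vec)

# Invented toy text; replace with your own documents.
txt <- c("the quick brown fox jumps over the lazy dog",
         "the lazy dog sleeps while the quick fox runs")

# Token iterator and vocabulary.
it <- itoken(txt, tokenizer = word_tokenizer, progressbar = FALSE)
vocab <- create_vocabulary(it)

# The vectorizer is built from the vocabulary alone...
vectorizer <- vocab_vectorizer(vocab)

# ...and the window size is an argument of create_tcm().
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

# Square matrix with one row and one column per vocabulary term.
print(dim(tcm))
```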
Dear Ted,
Question: Can a tf-idf document-term matrix be used as input to Latent Dirichlet Allocation (LDA)? If yes, how?
It does not work in my case: the LDA function requires a term-frequency document-term matrix.
Thank you.
(I have kept the question as concise as possible; if you need more details, I can add them.)
##########################################################################
TF-IDF Document matrix construction
##########################################################################
> DTM_tfidf <- DocumentTermMatrix(corpora, control = list(weighting =
+   function(x) weightTfIdf(x, normalize = FALSE)))
> str(DTM_tfidf)
List of 6
$ i : int [1:4466] 1 1 1 1 1 1 1 1 1 1 ...
$ j : int [1:4466] 6 10 22 26 28 36 39 41 47 48 ...
$ v : num [1:4466] 6 2.09 1.05 3.19 2.19 ...
$ nrow : int 64
$ ncol : int 297
$ dimnames:List of 2
..$ Docs : chr [1:64] "1" "2" "3" "4" ...
..$ Terms: chr [1:297] "accommod" "account" "achiev" "act" ...
- attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
- attr(*, "weighting")= chr [1:2] "term frequency - inverse document frequency" "tf-idf"
##########################################################################
LDA section
##########################################################################
> LDA_results <-LDA(DTM_tfidf,k, method="Gibbs", control=list(nstart=nstart,
+ seed = seed, best=best,
+ burnin = burnin, iter = iter, thin=thin))
##########################################################################
Error messages
##########################################################################
Error in LDA(DTM_tfidf, k, method = "Gibbs", control = list(nstart =
nstart, :
The DocumentTermMatrix needs to have a term frequency weighting
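The error says that LDA() needs a matrix with term-frequency weighting; LDA models word counts, and tf-idf values are not counts. A minimal sketch, assuming the tm and topicmodels packages are installed: build the matrix with tm's default weighting and pass that to LDA(). The four documents and the Gibbs settings are invented for illustration and are not tuned values.

```r
library(tm)
library(topicmodels)

# Invented toy corpus; replace with your own `corpora` object.
docs <- c("stocks fell as markets reacted to rates",
          "the team won the match after extra time",
          "investors watch rates and bond markets",
          "the coach praised the team after the match")
corpora <- VCorpus(VectorSource(docs))

# With no weighting in `control`, tm uses raw term frequency (weightTf).
DTM_tf <- DocumentTermMatrix(corpora)
print(attr(DTM_tf, "weighting"))

# Illustrative Gibbs settings only.
LDA_results <- LDA(DTM_tf, k = 2, method = "Gibbs",
                   control = list(seed = 123, burnin = 10, iter = 50, thin = 10))

# Top three terms per topic.
print(terms(LDA_results, 3))
```

A tf-idf matrix can still be useful alongside this, for example to decide which terms to drop before rebuilding the term-frequency matrix, but the matrix handed to LDA() itself has to carry counts.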
When I try to run the following code from p. 255:
all.ner<-Map(function(tex,fea,id) cbind(fea,
entity=substring(tex, fea$start,fea$end),file=id),
all.emails,all.ner,temp)
I get this error message:
Error in substring(tex, fea$start, fea$end) : invalid substring arguments
I can't figure out what the problem is. Any idea?
thanks,
Marton
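One situation that produces this exact message in base R is a zero-length start or end, for example when an annotation table has no rows or lacks those columns. Whether that is what happens with the book's data is an assumption, not something the report confirms. The sketch below uses invented stand-ins for all.emails, all.ner and temp, and a hypothetical helper add.entity that skips substring() when there is nothing to extract.

```r
# Invented stand-ins for all.emails, all.ner and temp from the code on p. 255.
all.emails <- list("Ted met Ann in Boston", "no entities here")
all.ner <- list(
  data.frame(start = c(1, 9, 16), end = c(3, 11, 21)),
  data.frame(start = integer(0), end = integer(0))   # nothing was annotated
)
temp <- c("email1.txt", "email2.txt")

# Hypothetical helper: return an empty result instead of calling substring()
# with zero-length positions.
add.entity <- function(tex, fea, id) {
  if (nrow(fea) == 0) {
    return(cbind(fea, entity = character(0), file = character(0)))
  }
  cbind(fea, entity = substring(tex, fea$start, fea$end), file = id)
}

all.ner <- Map(add.entity, all.emails, all.ner, temp)
print(all.ner[[1]])
```

Checking sapply(all.ner, nrow) before the Map() call is a quick way to see whether any element is empty.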
In section 6.2.3, page 196, the last line of the code is not readable (the yellow highlighted section in the image).
For those who are struggling with this, the full code is as follows:
clean.test <- headline.clean(test.headlines$headline)
test.dtm <- match.matrix(clean.test, weighting = tm::weightTfIdf, original.matrix = train.dtm)
On page 188,
clean.train<-headline.cleanv(train.headlines$headline)
must be
clean.train<-headline.clean(train.headlines$headline)
To answer how often agents refer to phone numbers, which have the pattern xxx-xxx-xxxx in the USA, the code on page 45 is
sum(grepl('[0-9]{3})|[0-9]{4}', text.df$text))/
nrow(text.df)
There is a ) after {3}. What is this closing bracket for? Without it, the relative frequency of a phone number being referenced is 0.1445171.
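For comparison, here is a minimal sketch that counts mentions with an explicit xxx-xxx-xxxx pattern rather than the book's expression. The four messages are invented for illustration and are not the book's data.

```r
# Toy stand-in for text.df; these messages are invented.
text.df <- data.frame(
  text = c("Call us at 555-123-4567 for help",
           "Thanks for reaching out!",
           "Our line is 800-555-0199",
           "Order 1234 has shipped"),
  stringsAsFactors = FALSE
)

# Three digits, a hyphen, three digits, a hyphen, four digits.
phone.pattern <- "[0-9]{3}-[0-9]{3}-[0-9]{4}"

# grepl() returns TRUE/FALSE per row; dividing the count of TRUE values
# by the number of rows gives the share of rows that mention a number.
has.phone <- grepl(phone.pattern, text.df$text)
phone.share <- sum(has.phone) / nrow(text.df)
print(phone.share)
```

This pattern is stricter than an alternation of three-digit and four-digit runs: it does not match a bare four-digit order number, and it also does not match numbers written as (555) 123-4567.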
Thank you for your advice.
I bought the printed version of this interesting book. Apparently the author offers the data online but not the R code for the different chapters. Despite some effort, I couldn't find the source code. Yes, one can copy the code out of the book, but why this imposition?
p. 188 (last sentence).
The textbook says "You do not need to specify the third parameter, original.matrix, since train.dtm is the original matrix." But when I try the following code:
train.dtm <- match.matrix(clean.train, weighting = tm::weightTfIdf)
I got the following error message:
Error in match.matrix(clean.train, weighting = tm::weightTfIdf) : object 'original.martix' not found
In addition: Warning message:
In weighting(x) : empty document(s): 409
Would you please try the code and advise me how to fix it? I attached the code for match.matrix as follows:
#------------ End of an issue
Thanks in advance,
Chang-Kyo Suh
In the match.matrix function, there are three closing curly brackets "}" but only two opening curly brackets "{", so the function does not work.
Since there is an if-statement that does not have an opening {, I tried putting the missing bracket there, to no avail:
if (attr(original.matrix, "weighting")[2] == "tfidf") {
...
matrix <- fixed
}
Where does the missing bracket have to go?