abuchmueller / twitmo
Collect Twitter data and create topic models with R
License: Other
> z <- parse_stream("fromUK.json")
opening file input connection.
Found 2918 records...closing file input connection.
Imported 22015 records. Simplifying...
Warning message:
In .Internal(gc(verbose, reset, full)) :
closing unused connection 3 (/var/folders/s0/sjz6vgxj7fj2x6dqx1pcyys00000gn/T//RtmpYrobz9/filed5c61b7051d)
parse_stream() leaves the connection open after parsing, so R issues a connection warning. This is not a useful warning and should be fixed by closing the connection properly after reading in the data.
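A minimal sketch of the fix (the wrapper name and `parser` argument are hypothetical): register the close in `on.exit()` so the connection is released even when parsing fails, which prevents R's "closing unused connection" warning.

```r
# Hypothetical sketch: wrap the read so the connection is always closed,
# even if the parser errors halfway through the file.
parse_stream_safe <- function(path, parser = readLines) {
  con <- file(path, open = "r")
  on.exit(close(con), add = TRUE)  # guaranteed cleanup on exit or error
  parser(con)
}
```

After this, no stale connection is left behind for the garbage collector to complain about.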
From issue #1
Currently, usernames (@USER) and hashtags (#Hashtag) are not removed when pooling with pool_tweets().
Usernames, however, should be dropped from the text completely, as they are not relevant for meaningful topics. The handling of hashtags should be up to the user: either hashtags are dropped from the tweets completely after pooling, OR the # sign is removed and the hashtag itself is kept.
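A sketch of the proposed behaviour (the function and `keep_hashtags` argument are hypothetical names): always strip @usernames, and let the user choose whether hashtags are kept as plain words or dropped entirely.

```r
# Hypothetical sketch of the two hashtag-handling modes described above.
clean_tweet <- function(text, keep_hashtags = TRUE) {
  text <- gsub("@\\w+", "", text)          # usernames are never topic-relevant
  if (keep_hashtags) {
    text <- gsub("#(\\w+)", "\\1", text)   # drop only the '#' sign
  } else {
    text <- gsub("#\\w+", "", text)        # drop the whole hashtag
  }
  trimws(gsub("\\s+", " ", text))          # collapse leftover whitespace
}

clean_tweet("Good morning @alice #mood", keep_hashtags = TRUE)   # "Good morning mood"
clean_tweet("Good morning @alice #mood", keep_hashtags = FALSE)  # "Good morning"
```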
> stm.covariates <- "retweet_count,followers_count,reply_count,quote_count,favorite_count"
> stm.model <- fit_stm(mytweets, n_topics = 7, xcov = stm.covariates)
Building corpus...
Converting to Lower Case...
Removing punctuation...
Removing stopwords...
Error in tm::stopwords(language) : no stopwords available for 'FALSE'
This also happens if no argument is supplied, since FALSE is the default argument for stopwords, i.e.
stm.model <- fit_stm(mytweets, n_topics = 7, xcov = stm.covariates)
will lead to the same result. FALSE is passed to the language argument of stm::textProcessor(), which only supports character strings. The language argument should be NA and removestopwords set to FALSE.
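A sketch of the suggested fix, assuming fit_stm() delegates preprocessing to stm::textProcessor() (here `mytweets` is the data frame from the example above):

```r
# Proposed call inside fit_stm(): do not forward FALSE as the language;
# pass NA instead and toggle stopword removal via its own argument.
processed <- stm::textProcessor(
  documents       = mytweets$text,
  metadata        = mytweets,
  language        = NA,     # the issue's suggestion instead of FALSE
  removestopwords = FALSE   # stopword removal controlled separately
)
```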
As of quanteda 3.0, textstat_simil() is no longer part of quanteda but lives in quanteda.textstats. Using quanteda.textstats::textstat_simil() is now required for the similarity calculation between pooled and unpooled tweets.
STMs need their covariates extracted and added as metadata to the dfm object.
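A sketch of the required change, with toy dfms standing in for the pooled and unpooled tweets:

```r
# Under quanteda >= 3.0 the similarity function lives in quanteda.textstats.
library(quanteda)
library(quanteda.textstats)

pooled   <- dfm(tokens(c(p1 = "new york breakfast")))
unpooled <- dfm(tokens(c(t1 = "breakfast in new york", t2 = "game day")))

# cosine similarity between each pooled and each unpooled document
sim <- textstat_simil(pooled, unpooled, margin = "documents",
                      method = "cosine")
```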
Currently
library(TweetLocViz)
dat <- parse_stream("inst/extdata/tweets 20191027-141233.json")
#> opening file input connection.
#> Found 167 records... Found 193 records... Imported 193 records. Simplifying...
#> closing file input connection.
pool <- pool_tweets(dat)
#>
#> 193 Tweets found
#> Pooling 35 Tweets with Hashtags
#> 36 Unique Hashtags found
#> Begin pooling ...Done
unique(lapply(dat$hashtags, unique))
#> [[1]]
#> [1] "mood"
#>
#> [[2]]
#> [1] NA
#>
#> [[3]]
#> [1] "motivate"
#>
#> [[4]]
#> [1] "Healthcare"
#>
#> [[5]]
#> [1] "mrrbnsnathome" "newyork" "breakfast"
#>
#> [[6]]
#> [1] "ThisIsMyPlace" "P4L"
#>
#> [[7]]
#> [1] "chinup" "Sundayfunday" "saintsgameday" "instapuppy"
#> [5] "woof" "tailswagging"
#>
#> [[8]]
#> [1] "TickFire"
#>
#> [[9]]
#> [1] "MSIclassic"
#>
#> [[10]]
#> [1] "NYC" "About" "JoetheCrane"
#>
#> [[11]]
#> [1] "SundayMorning" "lawofattraction" "collaborate"
#>
#> [[12]]
#> [1] "WOW" "KeepOnGoin"
#>
#> [[13]]
#> [1] "Government"
#>
#> [[14]]
#> [1] "ladystrut19" "ladystrutaccessories"
#>
#> [[15]]
#> [1] "SmartNews"
#>
#> [[16]]
#> [1] "SundayThoughts"
#>
#> [[17]]
#> [1] "SF100"
#>
#> [[18]]
#> [1] "spayneuter"
#>
#> [[19]]
#> [1] "openhouse" "springtx"
#>
#> [[20]]
#> [1] "Labor" "Norfolk"
#>
#> [[21]]
#> [1] "inkAndIdeas"
#>
#> [[22]]
#> [1] "oprylandhotel"
#>
#> [[23]]
#> [1] "Pharmaceutical"
#>
#> [[24]]
#> [1] "EastHanover" "Sales"
#>
#> [[25]]
#> [1] "Scryingartist" "BeautifulSkyz"
#>
#> [[26]]
#> [1] "knoxvilletn" "downtownknoxville"
#>
#> [[27]]
#> [1] "heartofservice" "youthmagnet" "youthmentor"
#>
#> [[28]]
#> [1] "Bonjour"
#>
#> [[29]]
#> [1] "Trump2020"
#>
#> [[30]]
#> [1] "spiritchat"
#>
#> [[31]]
#> [1] "FreèJulianAssange"
#>
#> [[32]]
#> [1] "Columbia"
#>
#> [[33]]
#> [1] "NewCastle"
#>
#> [[34]]
#> [1] "Oncology"
#>
#> [[35]]
#> [1] "NBATwitter"
#>
#> [[36]]
#> [1] "Detroit"
This is clearly more than 36 unique hashtags. Also, "NA" is not a valid hashtag.
Filtering by certain keywords can lead to more cohesive topics.
By default, if any keyword is contained in the tweet text, the tweet is preserved and all other tweets are dropped (inclusive filtering). Add an option to perform an exclusive filter: if any keyword is contained, the tweet is excluded (exclusive filtering).
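A sketch of both filter modes (the function name and `exclusive` flag are hypothetical): inclusive keeps tweets matching any keyword, exclusive drops them instead.

```r
# Hypothetical sketch of inclusive vs. exclusive keyword filtering.
filter_tweets <- function(text, keywords, exclusive = FALSE) {
  hit <- grepl(paste(keywords, collapse = "|"), text, ignore.case = TRUE)
  if (exclusive) text[!hit] else text[hit]   # drop or keep the matches
}

tweets <- c("vote now", "breakfast in newyork", "game day")
filter_tweets(tweets, c("vote", "game"))                    # keeps tweets 1 and 3
filter_tweets(tweets, c("vote", "game"), exclusive = TRUE)  # keeps tweet 2
```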
stream_tweets() and search_tweets() use different schemas for geo-located tweets: stream_tweets() needs a vector of bounding-box coordinates, while search_tweets() uses lat/lng coordinates and a radius in miles. For ease of use, it would be better if both worked in get_tweets() via conversion. Since the conversion is approximate, a warning could be issued.
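An approximate conversion from a stream_tweets() bounding box to the lat/lng + radius (in miles) that search_tweets() expects could look like this (the helper name is hypothetical; one degree of latitude is roughly 69 miles):

```r
# Hypothetical sketch: convert a bounding box to centre point + radius.
bbox_to_point_radius <- function(bbox) {
  # bbox = c(sw_lng, sw_lat, ne_lng, ne_lat), as stream_tweets() expects
  lng <- mean(bbox[c(1, 3)])
  lat <- mean(bbox[c(2, 4)])
  dy  <- (bbox[4] - bbox[2]) / 2 * 69                        # half-height, miles
  dx  <- (bbox[3] - bbox[1]) / 2 * 69 * cos(lat * pi / 180)  # half-width, miles
  c(lat = lat, lng = lng, radius_mi = sqrt(dx^2 + dy^2))
}
```

Because the circle only approximates the box, this is exactly the place where get_tweets() could issue the suggested warning.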
rtweet does not support v2 endpoints yet. The academic research track of Twitter's API is a special case of a v2 endpoint and has much higher rate limits (150,000 tweets/15 min) than a regular developer account.
Rtools (Windows) and the Rcpp compiler tools (macOS) need to be installed before TweetLocViz can be built from GitHub.
The README should provide links to both.
quanteda issued a major update
dfm(): As of version 3, only tokens objects are supported as inputs to dfm(). Calling dfm() for character or corpus objects is still functional, but issues a warning. Convenience passing of arguments to tokens() via ... for dfm() is also deprecated, but undocumented, and functions only with a warning. Users should now create a tokens object (using tokens()) from character or corpus inputs before calling dfm().
Since pool_tweets() still uses corpus and dfm objects directly to construct its tokens, TweetLocViz needs to be updated.
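A sketch of the quanteda >= 3.0 pipeline pool_tweets() should move to, using toy documents: build an explicit tokens object before calling dfm().

```r
# quanteda >= 3.0 style: tokenize explicitly, then build the dfm.
library(quanteda)

corp <- corpus(c(t1 = "Pooling tweets with hashtags",
                 t2 = "Begin pooling, done"))
toks <- tokens(corp, remove_punct = TRUE)  # explicit tokenization step
mat  <- dfm(toks)                          # dfm() now expects tokens input
```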
> z <- parse_stream("fromUK.json")
opening file input connection.
Found 2918 records...closing file input connection.
Imported 22015 records. Simplifying...
> z <- jsonlite::stream_in(file("fromUK.json"))
opening file input connection.
Found 2918 records...Error: parse error: premature EOF
{"created_at":"Thu Mar 11 15:20
(right here) ------^
closing file input connection.
The parsing error occurs at line 2918. However, there is no indication of what exactly is happening, since parse_stream() only issues a connection warning.
Unfortunately it seems there are problems related to the authentication.
After running the get_tweets function I get the following error:
"Requesting token on behalf of user...
Error in twitter_init_oauth1.0(self$endpoint, self$app, permission = self$params$permission, :
Unauthorized (HTTP 401)."
In the tutorial you write "Make sure you have a regular Twitter Account before start to sample your tweets." But the script nowhere indicates how to authenticate.
Some other tutorials suggest inserting the Twitter consumer key and secret using the setup_twitter_oauth function, but that seems to belong to a package which is no longer available.
Can you please provide guidance on how to solve the issue?
Here below the simple script I am trying to run. After that I get the error.
Thanks!
if (!requireNamespace("devtools", quietly = TRUE)) {
install.packages("devtools")
}
devtools::install_version("rtweet", version = "0.7.0", repos = "http://cran.us.r-project.org")
if (!requireNamespace("remotes", quietly = TRUE)) {
install.packages("remotes")
}
remotes::install_github("abuchmueller/Twitmo")
library(Twitmo)
get_tweets(method = 'stream',
location = "GBR",
timeout = 30,
file_name = "uk_tweets.json")
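For rtweet 0.7.0 (the version installed above), authentication is typically set up once with create_token() before streaming; all app and key values below are placeholders for the credentials from your own Twitter developer app.

```r
# Hedged sketch: create and cache an OAuth token with your app credentials
# (every value below is a placeholder, not a real key).
library(rtweet)

token <- create_token(
  app             = "my_app_name",
  consumer_key    = "YOUR_API_KEY",
  consumer_secret = "YOUR_API_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)
```

Once the token is cached, subsequent get_tweets() calls in the same session should no longer hit the HTTP 401.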
Moved from #1 due to complexity.
There are two possible scenarios for the implementation of external tokenizers:
rtweet:::citycoords has 700+ cities and works out of the box with rtweet::lookup_coords(), but does not include regions/countries, e.g. the EU, China, or Germany.
A list of 200 countries/regions with their bounding boxes can be found here.