
twitmo's People

Contributors

abuchmueller


twitmo's Issues

unused connection warning when using parse_stream()

> z <- parse_stream("fromUK.json")
opening fileoldClasskRp.connection input connection.
 Found 2918 records...closing fileoldClasskRp.connection input connection.
 Imported 22015 records. Simplifying...
Warning message:
In .Internal(gc(verbose, reset, full)) :
  closing unused connection 3 (/var/folders/s0/sjz6vgxj7fj2x6dqx1pcyys00000gn/T//RtmpYrobz9/filed5c61b7051d)

parse_stream() leaves the connection open after parsing, and R issues a connection warning. This is not a useful warning and should be fixed by closing the connection properly after reading in the data.
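A minimal sketch of the fix, assuming parse_stream() wraps jsonlite::stream_in() (the name parse_stream_fixed is only illustrative):

# open the connection explicitly and guarantee it is closed again on exit,
# even on error, so gc() never finds a stray open connection
parse_stream_fixed <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)
  jsonlite::stream_in(con, verbose = TRUE)
}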

Handle social media tags

From issue #1

Currently, usernames (@user) and hashtags (#hashtag) are not removed when pooling with pool_tweets().

Usernames, however, should be dropped from the text entirely, as they are not relevant for meaningful topics. The handling of hashtags should be left to the user: either hashtags are dropped from the tweets completely after pooling, or the # sign is removed and the hashtag text is kept.
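A hypothetical helper (not part of the package API) illustrating the proposed behavior:

# always strip usernames; keep or drop hashtags depending on the user's choice
clean_tags <- function(x, keep_hashtags = TRUE) {
  x <- gsub("@\\w+", "", x)             # drop @usernames entirely
  if (keep_hashtags) {
    x <- gsub("#(\\w+)", "\\1", x)      # strip the '#', keep the word
  } else {
    x <- gsub("#\\w+", "", x)           # drop hashtags entirely
  }
  trimws(gsub("\\s+", " ", x))
}
clean_tags("Nice day @user1 #mood", keep_hashtags = FALSE)
#> [1] "Nice day"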

fit_stm() throws error if stopwords argument is missing

Example

> stm.covariates <- "retweet_count,followers_count,reply_count,quote_count,favorite_count"
> stm.model <- fit_stm(mytweets, n_topics = 7, xcov =  stm.covariates)
Building corpus... 
Converting to Lower Case... 
Removing punctuation... 
Removing stopwords... 
 Error in tm::stopwords(language) : no stopwords available for 'FALSE' 

This also happens if no argument is supplied, since FALSE is the default value for stopwords, i.e.
stm.model <- fit_stm(mytweets, n_topics = 7, xcov = stm.covariates)

will lead to the same result.

Reason

FALSE is passed to the language argument of stm::textProcessor(), which only supports character strings.

Fix

The language argument should be NA and removestopwords should be set to FALSE.
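A minimal sketch of the fix, assuming fit_stm() prepares its corpus via stm::textProcessor() (docs and meta stand in for the internal objects):

if (isFALSE(stopwords)) {
  # no stopword removal requested: pass NA as language and skip removal
  processed <- stm::textProcessor(docs, metadata = meta,
                                  language = NA,
                                  removestopwords = FALSE)
} else {
  # stopwords supplied as a language string, e.g. "english"
  processed <- stm::textProcessor(docs, metadata = meta,
                                  language = stopwords,
                                  removestopwords = TRUE)
}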

fix pool_tweets() after quanteda 3.0 update

As of quanteda 3.0, textstat_simil() is no longer part of quanteda but lives in the quanteda.textstats package.

Using quanteda.textstats::textstat_simil() is therefore required from now on for the similarity calculation between pooled and unpooled tweets.
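For example (dfm_pooled and dfm_unpooled are placeholders for the pooled and unpooled document-feature matrices):

# before quanteda 3.0:
# sim <- quanteda::textstat_simil(dfm_pooled, dfm_unpooled, method = "cosine")

# from quanteda 3.0 on:
library(quanteda.textstats)
sim <- textstat_simil(dfm_pooled, dfm_unpooled, method = "cosine")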

Add STM support

STMs need covariates extracted, e.g.

  • favourite count
  • retweet count
  • emojis
  • hashtags

and added as metadata to the dfm object.
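A rough sketch of attaching such covariates, assuming one pooled document per row of a covariate data frame covs (all names here are illustrative):

# attach prevalence covariates as docvars on the dfm
quanteda::docvars(pooled_dfm, "retweet_count")  <- covs$retweet_count
quanteda::docvars(pooled_dfm, "favorite_count") <- covs$favorite_count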

pool_tweets() returns wrong number of unique hashtags

Example

Currently

library(TweetLocViz)
dat <- parse_stream("inst/extdata/tweets 20191027-141233.json")
#> opening file input connection.
#>  Found 167 records... Found 193 records... Imported 193 records. Simplifying...
#> closing file input connection.
pool <- pool_tweets(dat)
#> 
#> 193 Tweets found
#> Pooling 35 Tweets with Hashtags
#> 36 Unique Hashtags found
#> Begin pooling ...Done
unique(lapply(dat$hashtags, unique))
#> [[1]]
#> [1] "mood"
#> 
#> [[2]]
#> [1] NA
#> 
#> [[3]]
#> [1] "motivate"
#> 
#> [[4]]
#> [1] "Healthcare"
#> 
#> [[5]]
#> [1] "mrrbnsnathome" "newyork"       "breakfast"    
#> 
#> [[6]]
#> [1] "ThisIsMyPlace" "P4L"          
#> 
#> [[7]]
#> [1] "chinup"        "Sundayfunday"  "saintsgameday" "instapuppy"   
#> [5] "woof"          "tailswagging" 
#> 
#> [[8]]
#> [1] "TickFire"
#> 
#> [[9]]
#> [1] "MSIclassic"
#> 
#> [[10]]
#> [1] "NYC"         "About"       "JoetheCrane"
#> 
#> [[11]]
#> [1] "SundayMorning"   "lawofattraction" "collaborate"    
#> 
#> [[12]]
#> [1] "WOW"        "KeepOnGoin"
#> 
#> [[13]]
#> [1] "Government"
#> 
#> [[14]]
#> [1] "ladystrut19"          "ladystrutaccessories"
#> 
#> [[15]]
#> [1] "SmartNews"
#> 
#> [[16]]
#> [1] "SundayThoughts"
#> 
#> [[17]]
#> [1] "SF100"
#> 
#> [[18]]
#> [1] "spayneuter"
#> 
#> [[19]]
#> [1] "openhouse" "springtx" 
#> 
#> [[20]]
#> [1] "Labor"   "Norfolk"
#> 
#> [[21]]
#> [1] "inkAndIdeas"
#> 
#> [[22]]
#> [1] "oprylandhotel"
#> 
#> [[23]]
#> [1] "Pharmaceutical"
#> 
#> [[24]]
#> [1] "EastHanover" "Sales"      
#> 
#> [[25]]
#> [1] "Scryingartist" "BeautifulSkyz"
#> 
#> [[26]]
#> [1] "knoxvilletn"       "downtownknoxville"
#> 
#> [[27]]
#> [1] "heartofservice" "youthmagnet"    "youthmentor"   
#> 
#> [[28]]
#> [1] "Bonjour"
#> 
#> [[29]]
#> [1] "Trump2020"
#> 
#> [[30]]
#> [1] "spiritchat"
#> 
#> [[31]]
#> [1] "FreèJulianAssange"
#> 
#> [[32]]
#> [1] "Columbia"
#> 
#> [[33]]
#> [1] "NewCastle"
#> 
#> [[34]]
#> [1] "Oncology"
#> 
#> [[35]]
#> [1] "NBATwitter"
#> 
#> [[36]]
#> [1] "Detroit"

This is clearly more than 36 unique hashtags. Also, NA is not a valid hashtag and should not be counted.

add support for filtering keywords

Add support for filtering tweets based on a keyword list.

Filtering certain keywords can lead to more cohesive topics.
By default, if any keyword is contained in the tweet text, the tweet is kept; all other tweets are dropped (inclusive filtering). Add an option to perform an exclusive filter, where a tweet is excluded if it contains any keyword (exclusive filtering). See the sketch after the list below.

  • inclusive filtering
  • exclusive filtering
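A hypothetical helper (names are placeholders, not package API) illustrating both modes:

# keep tweets matching any keyword (inclusive) or drop them (exclusive)
filter_keywords <- function(tweets, keywords, exclude = FALSE) {
  pattern <- paste(keywords, collapse = "|")
  hit <- grepl(pattern, tweets$text, ignore.case = TRUE)
  if (exclude) tweets[!hit, ] else tweets[hit, ]
}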

Better Tokenization

  • add support for additional tokenizers (moved to #23)
  • add support for custom stopwords
  • handle usernames and hashtags automatically
  • handle emojis
  • n-gram support
  • handle social media tags (moved to #24)

convert bbox coords into lat/lng in get_tweets() and vice versa

stream_tweets() and search_tweets() use different schemas for geo-located tweets.

stream_tweets() needs a vector of bounding box coordinates, while search_tweets() uses lat/lng coordinates and a radius in miles. For ease of use, both schemas should work in get_tweets() via conversion. Since the conversion is approximate, a warning could be issued.
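A hedged sketch of the bbox-to-point/radius direction, assuming the bbox is c(xmin, ymin, xmax, ymax) in degrees as used by stream_tweets():

bbox_to_point_radius <- function(bbox) {
  lng <- mean(bbox[c(1, 3)])
  lat <- mean(bbox[c(2, 4)])
  # half the diagonal in miles; one degree of latitude is roughly 69 miles,
  # and longitude degrees shrink with latitude, hence the cos() correction
  dx <- (bbox[3] - bbox[1]) / 2 * 69 * cos(lat * pi / 180)
  dy <- (bbox[4] - bbox[2]) / 2 * 69
  warning("bbox to point/radius conversion is approximate")
  c(lat = lat, lng = lng, radius_mi = sqrt(dx^2 + dy^2))
}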

Add academic research track support

rtweet does not support v2 endpoints yet. The Academic Research track of Twitter's API is a special case of a v2 endpoint and has much higher rate limits (150,000 tweets/15 min) than a regular developer account.

quanteda 3.0 update

quanteda issued a major update

dfm(): As of version 3, only tokens objects are supported as inputs to dfm(). Calling dfm() for character or corpus objects is still functional, but issues a warning. Convenience passing of arguments to tokens() via ... for dfm() is also deprecated, but undocumented, and functions only with a warning. Users should now create a tokens object (using tokens()) from character or corpus inputs before calling dfm().

Since pool_tweets() still calls dfm() directly on corpus objects to construct its tokens, TweetLocViz needs to be updated.
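The required change, sketched (my_corpus is a placeholder):

# before (deprecated in quanteda 3.0):
# my_dfm <- quanteda::dfm(my_corpus, remove_punct = TRUE)

# after: tokenize explicitly, then build the dfm from the tokens object
toks   <- quanteda::tokens(my_corpus, remove_punct = TRUE)
my_dfm <- quanteda::dfm(toks)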

parse_stream() has no indication of parsing error and instead issues cryptic warning message

> z <- parse_stream("fromUK.json")
opening fileoldClasskRp.connection input connection.
 Found 2918 records...closing fileoldClasskRp.connection input connection.
 Imported 22015 records. Simplifying...
> z <- jsonlite::stream_in(file("fromUK.json"))
opening fileoldClasskRp.connection input connection.
 Found 2918 records...Error: parse error: premature EOF
                                       {"created_at":"Thu Mar 11 15:20
                     (right here) ------^
closing fileoldClasskRp.connection input connection.

The parsing error occurs at line 2918. However, there is no indication of what exactly is happening, since parse_stream() only issues a connection warning.
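A minimal sketch of a friendlier failure mode, assuming parse_stream() wraps jsonlite::stream_in():

# let the underlying parse error surface instead of a connection warning
z <- tryCatch(
  jsonlite::stream_in(file("fromUK.json")),
  error = function(e) {
    stop("parse_stream() could not parse the input file: ",
         conditionMessage(e), call. = FALSE)
  }
)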

Twitmo: Unauthorized (HTTP 401)

Unfortunately, it seems there are problems related to authentication.
After running the get_tweets function, I get the following error:

"Requesting token on behalf of user...
Error in twitter_init_oauth1.0(self$endpoint, self$app, permission = self$params$permission, :
Unauthorized (HTTP 401)."

In the tutorial you write "Make sure you have a regular Twitter Account before start to sample your tweets.", but the script nowhere indicates how to authenticate.
Other tutorials suggest inserting the Twitter consumer key and secret using the setup_twitter_oauth() function, but that function belongs to a package which is no longer available.

Can you please provide guidance on how to solve the issue?
Below is the simple script I am trying to run; after running it I get the error.
Thanks!

# install devtools if it's not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

devtools::install_version("rtweet", version = "0.7.0", repos = "http://cran.us.r-project.org")

# install remotes if it's not already installed
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# install the dev version of Twitmo from GitHub
remotes::install_github("abuchmueller/Twitmo")
library(Twitmo)

get_tweets(method = 'stream',
           location = "GBR",
           timeout = 30,
           file_name = "uk_tweets.json")
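For reference, with rtweet 0.7.0 authentication is usually set up once via rtweet::create_token() with credentials from the Twitter developer portal (all values below are placeholders):

rtweet::create_token(
  app             = "my_app_name",
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)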

Add external tokenizer support

Moved from #1 due to complexity.

There are two possible scenarios for the implementation of external tokenizers:

  1. a copy of pool_tweets() where, instead of passing arguments to the quanteda tokenizer, an external tokenizer function is passed as an argument (sketched below). This will be difficult and tedious to test since there are many tokenizers.
  2. a copy of pool_tweets() where, instead of a finished document-term matrix, only the text/full corpus of the final document pool is returned. This is easier to implement and test but requires more effort from the user to work with.
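A hypothetical sketch of scenario 1 (all names are placeholders, not package API); quanteda::tokens() can coerce the list of character vectors an external tokenizer returns:

pool_tweets_custom <- function(texts, tokenizer = tokenizers::tokenize_words) {
  # the external tokenizer drops in here
  toks <- quanteda::tokens(tokenizer(texts))
  quanteda::dfm(toks)
}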

supply get_tweets() with more locations

rtweet:::citycoords has 700+ cities and works out of the box with rtweet::lookup_coords(), but it does not include regions/countries, e.g. the EU, China, or Germany.

A list of 200 countries/regions with their bounding box can be found here

add more plotting methods

Includes, but is not limited to:

  • wordclouds
  • interactive leaflet map of topics and hashtags
  • temporal plots (e.g. topic prevalence over time)
  • ...
