
twitmo's People

Contributors

abuchmueller


twitmo's Issues

unused connection warning when using parse_stream()

> z <- parse_stream("fromUK.json")
opening fileoldClasskRp.connection input connection.
 Found 2918 records...closing fileoldClasskRp.connection input connection.
 Imported 22015 records. Simplifying...
Warning message:
In .Internal(gc(verbose, reset, full)) :
  closing unused connection 3 (/var/folders/s0/sjz6vgxj7fj2x6dqx1pcyys00000gn/T//RtmpYrobz9/filed5c61b7051d)

parse_stream() leaves the connection open after parsing, and R issues a connection warning. This is not a useful warning and should be fixed by closing the connection properly after reading in the data.
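A minimal sketch of the fix, assuming parse_stream() wraps jsonlite::stream_in() (the name parse_stream_fixed is only illustrative):

# open the connection explicitly and guarantee it is closed again on exit,
# even on error, so gc() never finds a stray open connection
parse_stream_fixed <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)
  jsonlite::stream_in(con, verbose = TRUE)
}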

Handle social media tags

From issue #1

Currently, usernames (@user) and hashtags (#hashtag) are not removed when pooling with pool_tweets().

Usernames, however, should be dropped from the text entirely, as they are not relevant for meaningful topics. The handling of hashtags should be left to the user: either hashtags are dropped from the tweets completely after pooling, or the # sign is removed and the hashtag text is kept.
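A hypothetical helper (not part of the package API) illustrating the proposed behavior:

# always strip usernames; keep or drop hashtags depending on the user's choice
clean_tags <- function(x, keep_hashtags = TRUE) {
  x <- gsub("@\\w+", "", x)             # drop @usernames entirely
  if (keep_hashtags) {
    x <- gsub("#(\\w+)", "\\1", x)      # strip the '#', keep the word
  } else {
    x <- gsub("#\\w+", "", x)           # drop hashtags entirely
  }
  trimws(gsub("\\s+", " ", x))
}
clean_tags("Nice day @user1 #mood", keep_hashtags = FALSE)
#> [1] "Nice day"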

fit_stm() throws error if stopwords argument is missing

Example

> stm.covariates <- "retweet_count,followers_count,reply_count,quote_count,favorite_count"
> stm.model <- fit_stm(mytweets, n_topics = 7, xcov =  stm.covariates)
Building corpus... 
Converting to Lower Case... 
Removing punctuation... 
Removing stopwords... 
 Error in tm::stopwords(language) : no stopwords available for 'FALSE' 

This also happens if no argument is supplied, since FALSE is the default value for stopwords, i.e.
stm.model <- fit_stm(mytweets, n_topics = 7, xcov = stm.covariates)

will lead to the same result.

Reason

FALSE is passed to the language argument of stm::textProcessor(), which only supports character strings.

Fix

The language argument should be NA and removestopwords should be set to FALSE.
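A minimal sketch of the fix, assuming fit_stm() prepares its corpus via stm::textProcessor() (docs and meta stand in for the internal objects):

if (isFALSE(stopwords)) {
  # no stopword removal requested: pass NA as language and skip removal
  processed <- stm::textProcessor(docs, metadata = meta,
                                  language = NA,
                                  removestopwords = FALSE)
} else {
  # stopwords supplied as a language string, e.g. "english"
  processed <- stm::textProcessor(docs, metadata = meta,
                                  language = stopwords,
                                  removestopwords = TRUE)
}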

fix pool_tweets() after quanteda 3.0 update

As of quanteda 3.0, textstat_simil() is no longer part of quanteda but lives in the quanteda.textstats package.

Using quanteda.textstats::textstat_simil() is therefore required from now on for the similarity calculation between pooled and unpooled tweets.
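For example (dfm_pooled and dfm_unpooled are placeholders for the pooled and unpooled document-feature matrices):

# before quanteda 3.0:
# sim <- quanteda::textstat_simil(dfm_pooled, dfm_unpooled, method = "cosine")

# from quanteda 3.0 on:
library(quanteda.textstats)
sim <- textstat_simil(dfm_pooled, dfm_unpooled, method = "cosine")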

Add STM support

STMs need covariates extracted, e.g.

  • favourite count
  • retweet count
  • emojis
  • hashtags

and added as metadata to the dfm object.
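A rough sketch of attaching such covariates, assuming one pooled document per row of a covariate data frame covs (all names here are illustrative):

# attach prevalence covariates as docvars on the dfm
quanteda::docvars(pooled_dfm, "retweet_count")  <- covs$retweet_count
quanteda::docvars(pooled_dfm, "favorite_count") <- covs$favorite_count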

pool_tweets() returns wrong number of unique hashtags

Example

Currently

library(TweetLocViz)
dat <- parse_stream("inst/extdata/tweets 20191027-141233.json")
#> opening file input connection.
#>  Found 167 records... Found 193 records... Imported 193 records. Simplifying...
#> closing file input connection.
pool <- pool_tweets(dat)
#> 
#> 193 Tweets found
#> Pooling 35 Tweets with Hashtags
#> 36 Unique Hashtags found
#> Begin pooling ...Done
unique(lapply(dat$hashtags, unique))
#> [[1]]
#> [1] "mood"
#> 
#> [[2]]
#> [1] NA
#> 
#> [[3]]
#> [1] "motivate"
#> 
#> [[4]]
#> [1] "Healthcare"
#> 
#> [[5]]
#> [1] "mrrbnsnathome" "newyork"       "breakfast"    
#> 
#> [[6]]
#> [1] "ThisIsMyPlace" "P4L"          
#> 
#> [[7]]
#> [1] "chinup"        "Sundayfunday"  "saintsgameday" "instapuppy"   
#> [5] "woof"          "tailswagging" 
#> 
#> [[8]]
#> [1] "TickFire"
#> 
#> [[9]]
#> [1] "MSIclassic"
#> 
#> [[10]]
#> [1] "NYC"         "About"       "JoetheCrane"
#> 
#> [[11]]
#> [1] "SundayMorning"   "lawofattraction" "collaborate"    
#> 
#> [[12]]
#> [1] "WOW"        "KeepOnGoin"
#> 
#> [[13]]
#> [1] "Government"
#> 
#> [[14]]
#> [1] "ladystrut19"          "ladystrutaccessories"
#> 
#> [[15]]
#> [1] "SmartNews"
#> 
#> [[16]]
#> [1] "SundayThoughts"
#> 
#> [[17]]
#> [1] "SF100"
#> 
#> [[18]]
#> [1] "spayneuter"
#> 
#> [[19]]
#> [1] "openhouse" "springtx" 
#> 
#> [[20]]
#> [1] "Labor"   "Norfolk"
#> 
#> [[21]]
#> [1] "inkAndIdeas"
#> 
#> [[22]]
#> [1] "oprylandhotel"
#> 
#> [[23]]
#> [1] "Pharmaceutical"
#> 
#> [[24]]
#> [1] "EastHanover" "Sales"      
#> 
#> [[25]]
#> [1] "Scryingartist" "BeautifulSkyz"
#> 
#> [[26]]
#> [1] "knoxvilletn"       "downtownknoxville"
#> 
#> [[27]]
#> [1] "heartofservice" "youthmagnet"    "youthmentor"   
#> 
#> [[28]]
#> [1] "Bonjour"
#> 
#> [[29]]
#> [1] "Trump2020"
#> 
#> [[30]]
#> [1] "spiritchat"
#> 
#> [[31]]
#> [1] "FreèJulianAssange"
#> 
#> [[32]]
#> [1] "Columbia"
#> 
#> [[33]]
#> [1] "NewCastle"
#> 
#> [[34]]
#> [1] "Oncology"
#> 
#> [[35]]
#> [1] "NBATwitter"
#> 
#> [[36]]
#> [1] "Detroit"

This is clearly more than 36 unique hashtags. Also, NA is not a valid hashtag and should not be counted.

add support for filtering keywords

Add support for filtering tweets based on a keyword list.

Filtering certain keywords can lead to more cohesive topics.
By default, if any keyword is contained in the tweet text, the tweet is kept; all other tweets are dropped (inclusive filtering). Add an option to perform an exclusive filter, where a tweet is excluded if it contains any keyword (exclusive filtering). See the sketch after the list below.

  • inclusive filtering
  • exclusive filtering
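A hypothetical helper (names are placeholders, not package API) illustrating both modes:

# keep tweets matching any keyword (inclusive) or drop them (exclusive)
filter_keywords <- function(tweets, keywords, exclude = FALSE) {
  pattern <- paste(keywords, collapse = "|")
  hit <- grepl(pattern, tweets$text, ignore.case = TRUE)
  if (exclude) tweets[!hit, ] else tweets[hit, ]
}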

Better Tokenization

  • add support for additional tokenizers (moved to #23)
  • add support for custom stopwords
  • handle usernames and hashtags automatically
  • handle emojis
  • n-gram support
  • handle social media tags (moved to #24)

convert bbox coords into lat/lng in get_tweets() and vice versa

stream_tweets() and search_tweets() use different schemas for geo-located tweets.

stream_tweets() needs a vector of bounding box coordinates, while search_tweets() uses lat/lng coordinates and a radius in miles. For ease of use, both schemas should work in get_tweets() via conversion. Since the conversion is approximate, a warning could be issued.
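A hedged sketch of the bbox-to-point/radius direction, assuming the bbox is c(xmin, ymin, xmax, ymax) in degrees as used by stream_tweets():

bbox_to_point_radius <- function(bbox) {
  lng <- mean(bbox[c(1, 3)])
  lat <- mean(bbox[c(2, 4)])
  # half the diagonal in miles; one degree of latitude is roughly 69 miles,
  # and longitude degrees shrink with latitude, hence the cos() correction
  dx <- (bbox[3] - bbox[1]) / 2 * 69 * cos(lat * pi / 180)
  dy <- (bbox[4] - bbox[2]) / 2 * 69
  warning("bbox to point/radius conversion is approximate")
  c(lat = lat, lng = lng, radius_mi = sqrt(dx^2 + dy^2))
}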

Add academic research track support

rtweet does not support v2 endpoints yet. The Academic Research track of Twitter's API is a special case of a v2 endpoint and has much higher rate limits (150,000 tweets/15 min) than a regular developer account.

quanteda 3.0 update

quanteda issued a major update

dfm(): As of version 3, only tokens objects are supported as inputs to dfm(). Calling dfm() for character or corpus objects is still functional, but issues a warning. Convenience passing of arguments to tokens() via ... for dfm() is also deprecated, but undocumented, and functions only with a warning. Users should now create a tokens object (using tokens()) from character or corpus inputs before calling dfm().

Since pool_tweets() still calls dfm() directly on corpus objects to construct its tokens, TweetLocViz needs to be updated.
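The required change, sketched (my_corpus is a placeholder):

# before (deprecated in quanteda 3.0):
# my_dfm <- quanteda::dfm(my_corpus, remove_punct = TRUE)

# after: tokenize explicitly, then build the dfm from the tokens object
toks   <- quanteda::tokens(my_corpus, remove_punct = TRUE)
my_dfm <- quanteda::dfm(toks)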

parse_stream() has no indication of parsing error and instead issues cryptic warning message

> z <- parse_stream("fromUK.json")
opening fileoldClasskRp.connection input connection.
 Found 2918 records...closing fileoldClasskRp.connection input connection.
 Imported 22015 records. Simplifying...
> z <- jsonlite::stream_in(file("fromUK.json"))
opening fileoldClasskRp.connection input connection.
 Found 2918 records...Error: parse error: premature EOF
                                       {"created_at":"Thu Mar 11 15:20
                     (right here) ------^
closing fileoldClasskRp.connection input connection.

The parsing error occurs at line 2918. However, there is no indication of what exactly is happening, since parse_stream() only issues a connection warning.
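A minimal sketch of a friendlier failure mode, assuming parse_stream() wraps jsonlite::stream_in():

# let the underlying parse error surface instead of a connection warning
z <- tryCatch(
  jsonlite::stream_in(file("fromUK.json")),
  error = function(e) {
    stop("parse_stream() could not parse the input file: ",
         conditionMessage(e), call. = FALSE)
  }
)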

Twitmo: Unauthorized (HTTP 401)

Unfortunately, it seems there are problems related to authentication.
After running the get_tweets function, I get the following error:

"Requesting token on behalf of user...
Error in twitter_init_oauth1.0(self$endpoint, self$app, permission = self$params$permission, :
Unauthorized (HTTP 401)."

In the tutorial you write "Make sure you have a regular Twitter Account before start to sample your tweets.", but the script nowhere indicates how to authenticate.
Other tutorials suggest inserting the Twitter consumer key and secret using the setup_twitter_oauth() function, but that function belongs to a package which is no longer available.

Can you please provide guidance on how to solve the issue?
Below is the simple script I am trying to run; after running it I get the error.
Thanks!

# install devtools if it's not already installed
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

devtools::install_version("rtweet", version = "0.7.0", repos = "http://cran.us.r-project.org")

# install remotes if it's not already installed
if (!requireNamespace("remotes", quietly = TRUE)) {
  install.packages("remotes")
}

# install the dev version of Twitmo from GitHub
remotes::install_github("abuchmueller/Twitmo")
library(Twitmo)

get_tweets(method = 'stream',
           location = "GBR",
           timeout = 30,
           file_name = "uk_tweets.json")
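For reference, with rtweet 0.7.0 authentication is usually set up once via rtweet::create_token() with credentials from the Twitter developer portal (all values below are placeholders):

rtweet::create_token(
  app             = "my_app_name",
  consumer_key    = "YOUR_CONSUMER_KEY",
  consumer_secret = "YOUR_CONSUMER_SECRET",
  access_token    = "YOUR_ACCESS_TOKEN",
  access_secret   = "YOUR_ACCESS_SECRET"
)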

Add external tokenizer support

Moved from #1 due to complexity.

There are two possible scenarios for the implementation of external tokenizers:

  1. a copy of pool_tweets() where, instead of passing arguments to the quanteda tokenizer, an external tokenizer function is passed as an argument (sketched below). This will be difficult and tedious to test since there are many tokenizers.
  2. a copy of pool_tweets() where, instead of a finished document-term matrix, only the text/full corpus of the final document pool is returned. This is easier to implement and test but requires more effort from the user to work with.
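A hypothetical sketch of scenario 1 (all names are placeholders, not package API); quanteda::tokens() can coerce the list of character vectors an external tokenizer returns:

pool_tweets_custom <- function(texts, tokenizer = tokenizers::tokenize_words) {
  # the external tokenizer drops in here
  toks <- quanteda::tokens(tokenizer(texts))
  quanteda::dfm(toks)
}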

supply get_tweets() with more locations

rtweet:::citycoords has 700+ cities and works out of the box with rtweet::lookup_coords(), but it does not include regions/countries, e.g. the EU, China, or Germany.

A list of 200 countries/regions with their bounding box can be found here

add more plotting methods

Includes, but is not limited to:

  • wordclouds
  • interactive leaflet map of topics and hashtags
  • temporal plots (e.g. topic prevalence over time)
  • ...
