
lexicon's Introduction

lexicon

Project Status: Active - The project has reached a stable, usable state and is being actively developed.


Description

lexicon is a collection of lexical hash tables, dictionaries, and word lists. The data prefixes help to categorize the data types:

Prefix Meaning
key_ A data.frame with a lookup and return value
hash_ A keyed data.table hash table
freq_ A data.table of terms with frequencies
profanity_ A profane words vector
pos_ A part of speech vector
pos_df_ A part of speech data.frame
sw_ A stopword vector
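
For orientation, here is a minimal sketch (assuming the package is installed) showing how a few of the prefixed objects differ in class:

library(lexicon)

class(key_contractions)  # key_:  a data.frame with a lookup and return column
class(hash_lemmas)       # hash_: a keyed data.table (token -> lemma)
class(freq_first_names)  # freq_: a data.table of terms with frequencies
head(sw_fry_100)         # sw_:   a plain character vector of stopwords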

Data

Data Description
cliches Common Cliches
common_names First Names (U.S.)
constraining_loughran_mcdonald Loughran-McDonald Constraining Words
emojis_sentiment Emoji Sentiment Data
freq_first_names Frequent U.S. First Names
freq_last_names Frequent U.S. Last Names
function_words Function Words
grady_augmented Augmented List of Grady Ward’s English Words and Mark Kantrowitz’s Names List
hash_emojis Emoji Description Lookup Table
hash_emojis_identifier Emoji Identifier Lookup Table
hash_emoticons Emoticons
hash_grady_pos Grady Ward’s Moby Parts of Speech
hash_internet_slang List of Internet Slang and Corresponding Meanings
hash_lemmas Lemmatization List
hash_nrc_emotions NRC Emotion Table
hash_sentiment_emojis Emoji Sentiment Polarity Lookup Table
hash_sentiment_huliu Hu Liu Polarity Lookup Table
hash_sentiment_jockers Jockers Sentiment Polarity Table
hash_sentiment_jockers_rinker Combined Jockers & Rinker Polarity Lookup Table
hash_sentiment_loughran_mcdonald Loughran-McDonald Polarity Table
hash_sentiment_nrc NRC Sentiment Polarity Table
hash_sentiment_senticnet Augmented SenticNet Polarity Table
hash_sentiment_sentiword Augmented Sentiword Polarity Table
hash_sentiment_slangsd SlangSD Sentiment Polarity Table
hash_sentiment_socal_google SO-CAL Google Polarity Table
hash_valence_shifters Valence Shifters
key_contractions Contraction Conversions
key_corporate_social_responsibility Nadra Pencle and Irina Malaescu’s Corporate Social Responsibility Dictionary
key_grade Grades Data Set
key_rating Ratings Data Set
key_regressive_imagery Colin Martindale’s English Regressive Imagery Dictionary
key_sentiment_jockers Jockers Sentiment Data Set
modal_loughran_mcdonald Loughran-McDonald Modal List
nrc_emotions NRC Emotions
pos_action_verb Action Word List
pos_df_irregular_nouns Irregular Nouns Word Dataframe
pos_df_pronouns Pronouns
pos_interjections Interjections
pos_preposition Preposition Words
profanity_alvarez Alejandro U. Alvarez’s List of Profane Words
profanity_arr_bad Stackoverflow user2592414’s List of Profane Words
profanity_banned bannedwordlist.com’s List of Profane Words
profanity_racist Titus Wormer’s List of Racist Words
profanity_zac_anger Zac Anger’s List of Profane Words
sw_dolch Leveled Dolch List of 220 Common Words
sw_fry_100 Fry’s 100 Most Commonly Used English Words
sw_fry_1000 Fry’s 1000 Most Commonly Used English Words
sw_fry_200 Fry’s 200 Most Commonly Used English Words
sw_fry_25 Fry’s 25 Most Commonly Used English Words
sw_jockers Matthew Jocker’s Expanded Topic Modeling Stopword List
sw_loughran_mcdonald_long Loughran-McDonald Long Stopword List
sw_loughran_mcdonald_short Loughran-McDonald Short Stopword List
sw_lucene Lucene Stopword List
sw_mallet MALLET Stopword List
sw_python Python Stopword List

Installation

To download the development version of lexicon:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/lexicon")
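
Once installed, the bundled data sets can be listed and accessed directly; a quick sketch:

library(lexicon)
available_data()                       # data.frame listing the available data sets
head(hash_sentiment_jockers_rinker)    # e.g., the combined polarity lookup table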

Contact

You are welcome to submit suggestions and bug reports, or to send a pull request, via the project's GitHub repository.

lexicon's People

Contributors

trinker


lexicon's Issues

Preprocessing for Sentiwordnet

Hi,

I want to perform a sentiment analysis with SentiWordNet on tweets. The analysis should be at the document level, i.e. I want to classify each tweet as positive, negative, or neutral. However, the quality (accuracy) of the results depends on the preprocessing. While there are a number of papers dealing with the influence of preprocessing for machine-learning approaches (SVM, random forest, etc.), I found no paper that investigates the influence of preprocessing when using SentiWordNet. Is there a guideline on which methods are essential to get good results?
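
There is no canonical recipe in this package; the following is only a minimal sketch of common steps (lowercasing, stripping URLs and handles, lemmatizing via hash_lemmas) before looking terms up in hash_sentiment_sentiword. The example tweet and variable names are made up for illustration:

library(data.table)
library(lexicon)

tweet  <- tolower("Loving the new update!! http://t.co/xyz @someone")
tweet  <- gsub("http\\S+|@\\w+", "", tweet)   # strip URLs and handles
tokens <- unlist(strsplit(tweet, "[^a-z']+"))
tokens <- tokens[nzchar(tokens)]

## map tokens to lemmas, keeping the token itself when no lemma is found
lemmas <- hash_lemmas[data.table(token = tokens), on = "token"]
lemmas[is.na(lemma), lemma := token]

## polarity lookup in the augmented SentiWordNet-based table
hash_sentiment_sentiword[data.table(x = lemmas$lemma), on = "x"]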

f-word in valence shifters table?

Currently:

> lexicon::hash_valence_shifters['fucking']
         x    y
1: fucking <NA>
> lexicon::hash_sentiment_jockers_rinker['fuckin']
        x  y
1: fuckin NA
> lexicon::hash_sentiment_jockers_rinker['fucking']
         x  y
1: fucking -1

No placement for fuckin, and is fucking in the wrong table? Test empirically.
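
One hedged way to test this empirically (assuming sentimentr is installed) would be to score minimal pairs with the current tables and compare:

library(sentimentr)

sentiment(
    c("This is fucking great", "This is fuckin great"),
    polarity_dt         = lexicon::hash_sentiment_jockers_rinker,
    valence_shifters_dt = lexicon::hash_valence_shifters
)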

Add regex arg to available_data()

#' Get Available \pkg{lexicon} Data
#'
#' See available \pkg{lexicon} data as a data.frame.
#'
#' @param regex A regex to search for within the data columns.
#' @param \ldots Other arguments passed to \code{grep}.
#' @return Returns a data.frame
#' @export
#' @examples
#' available_data()
#' available_data('hash_')
#' available_data('hash_sentiment')
#' available_data('python')
#' available_data('prof')
#' available_data('English')
#' available_data('Stopword')
available_data <- function(regex = NULL, ...){


    results <- utils::data(package = 'lexicon')[["results"]]
    dat <- stats::setNames(data.frame(results[, 3:4, drop = FALSE],
        stringsAsFactors = FALSE), c("Data", "Description"))

    ns <- getNamespaceExports(loadNamespace('lexicon'))
    ns <- ns[!ns %in% c("available_data", 'grady_pos_feature')]
    dat <- rbind.data.frame(
        dat,
        data.frame(
            Data = ns,
            Description = c('Jockers Sentiment Polarity Table', 'Jockers Sentiment Data Set'),
            stringsAsFactors = FALSE
        )
    )
    dat <- dat[order(dat[['Data']]),]
    row.names(dat) <- NULL

    if (!is.null(regex)){
        locs <- sort(unique(unlist( lapply(dat, function(x){ grep(regex, x, ...) }) )))

        if (length(locs) > 0) {
            dat <- dat[locs,]
        } else {
            warning('`regex` not found, returning all available data')
        }
    }

    dat 

}

profanity_google and profanity_von_ahn data files removed in latest commit

It appears as if profanity_google and profanity_von_ahn were removed from the DESCRIPTION file in the latest commit.

Was this intentional?

If people stumble upon this, installing the previous commit brings the data back in:

devtools::install_github("trinker/lexicon", ref = "bd0e3901de45d5986d32d6e45670ac39b622a8d1")

Error message: Failed to install/load: trinker/lexicon

Hello,

I am relatively new to coding/programming, and I encountered this issue, which I hope you could assist me with:
I was trying to install and load the trinker/lexicon package using the code provided:
"if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/lexicon") "

However, whenever I tried this, I always received the warning message:
"In pacman::p_load_gh("trinker/lexicon") :
Failed to install/load:
trinker/lexicon"

Thank you!
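
Without the full console output it is hard to say what failed. One hedged next step is to retry with remotes, which usually surfaces the underlying download or dependency error:

if (!require("remotes")) install.packages("remotes")
remotes::install_github("trinker/lexicon")
library(lexicon)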

probability words as a probability table

if (!require("pacman")) install.packages("pacman")
pacman::p_load(readr, dplyr, textshape)


browseURL('https://github.com/zonination/perceptions/blob/master/probly.csv')

'https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv' %>%
    read_csv() %>%
    tidy_list('phrase', 'probability') %>%
    mutate(probability = probability/100) %>%
    tbl_df()

## # A tibble: 782 x 2
##    phrase           probability
##    <chr>                  <dbl>
##  1 Almost Certainly       0.950
##  2 Almost Certainly       0.950
##  3 Almost Certainly       0.950
##  4 Almost Certainly       0.950
##  5 Almost Certainly       0.980
##  6 Almost Certainly       0.950
##  7 Almost Certainly       0.850
##  8 Almost Certainly       0.970
##  9 Almost Certainly       0.950
## 10 Almost Certainly       0.900
## # ... with 772 more rows
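
If the tidy tibble above is stored as probs, a key_-style table could be derived by collapsing the raw responses, e.g. to the median per phrase (a sketch only; probs and key_probability are hypothetical names, not package objects):

key_probability <- probs %>%
    group_by(phrase) %>%
    summarise(probability = median(probability)) %>%
    arrange(desc(probability))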

NA within the dataset

Hi. Not sure whether this was intentional, but I found one NA within the dataset. Try running as.vector(freq_first_names$Name[3300:3305]), which shows the location of the NA. This refers to version '1.3.0'. This might be important, as it requires removing such values prior to creating other objects, e.g. dictionaries.
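
Until the NA is fixed upstream, a minimal workaround sketch is to drop incomplete rows before building downstream objects:

## works whether freq_first_names is a data.frame or a data.table
freq_first_names_clean <- lexicon::freq_first_names[!is.na(lexicon::freq_first_names$Name), ]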

indonesian nrc

As far as I know, the NRC EmoLex is available in 40+ languages, including Indonesian. Is the Indonesian version of the NRC lexicon also available in this package?

Add terms

sentimentr:::update_polarity_table(hash_sentiment,
    data.frame(
        words = c('spot on', 'on time'),
        polarity = c(1, 1),
        stringsAsFactors = FALSE
    )
)

Import custom syuzhet dictionary

If Matthew Jockers exposes the custom syuzhet dictionary, it will likely be as a function whose output will need to be imported, converted to a key, and then re-exported. I may export the data set twice, in its original form and as a sentimentr-based key.

If this happens it can look like a data set but can't be called with the data() function: http://stackoverflow.com/q/42546514/1000343

Also, the readme will need to address this since it uses data() to list the data sets.

Maybe an available_data() or datasets() etc. function is in order that shows everything.

trouble with exports - 'hash_sentiment'

I can't run the sentiment* functions.

pacman::p_load_current_gh("trinker/lexicon", "trinker/sentimentr")
sentiment_by(sam_i_am, by = NULL)
Error: 'hash_sentiment' is not an exported object from 'namespace:lexicon'

Dupe profanity list words results in error

dats <- c( 
    "crowdflower_deflategate", 
    "crowdflower_products", 
    "course_evaluations", 
    "crowdflower_self_driving_cars", 
    "crowdflower_weather", 
    "hotel_reviews", 
    "kaggle_movie_reviews", 
    "cannon_reviews", 
    "kotzias_reviews_amazon_cells"
) 


cdat <- combine_data(dats[1])


sdat <- get_sentences(cdat)
swears <- profanity(sdat, profanity_list = c( 'shit', 'shit'))

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 187093 rows; more than 187044 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the ..

A check that makes the list unique is needed.
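
Until such a check is added, a hedged workaround is to dedupe the list before passing it in:

swears <- profanity(sdat, profanity_list = unique(c('shit', 'shit')))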

Pronoun list

Given Pennebaker's book on the hidden life of pronouns, it may be useful to add a pronoun list:

c("he", "her", "hers", "herself", "him", "himself", "his", "i", 
"it", "its", "me", "mine", "my", "myself", "our", "ours", "ourselves", 
"she", "thee", "their", "them", "themselves", "they", "thou", 
"thy", "thyself", "us", "we", "ye", "you", "your", "yours", "yourself",
"we"
)
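
The vector above repeats "we"; a small tidy-up sketch before shipping it as a data set (assuming the vector is assigned to pronouns):

pronouns <- sort(unique(pronouns))   # drops the duplicated "we"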

can't find hash_sentiment_inquirer in the package

Hi Tyler,

I notice that the description mentions the hash_sentiment_inquirer table as the Polarity Lookup Table using Inquirer (I presume that it's referring to Harvard's General Inquirer). However, I can't seem to find it in the package.

> lexicon::hash_sentiment_inquirer
Error: 'hash_sentiment_inquirer' is not an exported object from 'namespace:lexicon'

I also tried lexicon::available_data() but can't seem to find it either.

Am I missing something?

Thank you for your help,

Ken.
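
A quick way to confirm whether anything Inquirer-based ships with the installed version (a sketch using the existing available_data() listing):

grep('inquirer', lexicon::available_data()$Data, ignore.case = TRUE, value = TRUE)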

Add `combine_lexicon`

A function that takes (named) lexicons and combines them, removing dupes (for lookups, dupes are replaced by the first value). A rough skeleton:

combine_lexicon <- function(lexicons, ask.dupes = FALSE, ...){

    ## check that all lexicons are of the same type (grouped together)
    ## manual mapping?  or by naming convention?

    ## get internal data
    dats <- lapply(lexicons, function(x) eval(parse(text = paste0('lexicon::', x))))

    ## handling for atomic vs tabular
    if (all(vapply(dats, is.atomic, logical(1)))) {

        ## for atomic (vector) lexicons: concatenate and drop dupes
        unique(unlist(dats, use.names = FALSE))

    } else {

        ## for tabular lexicons


        if (ask.dupes){
            ## interactively ask which dupe to keep
        } else {
            ## keep first dupe
        }

        ## handling for unequal number of columns
        ## handling for data.table keyed lookups
    }

}
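
A hypothetical call (combine_lexicon is only proposed here, not part of the package) might look like:

## combine two stopword vectors into one deduplicated vector
combine_lexicon(c('sw_fry_100', 'sw_dolch'))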
