
lexicon's Introduction

lexicon

Project Status: Active - The project has reached a stable, usable state and is being actively developed.


Description

lexicon is a collection of lexical hash tables, dictionaries, and word lists. The data prefixes help to categorize the data types:

Prefix Meaning
key_ A data.frame with a lookup and return value
hash_ A keyed data.table hash table
freq_ A data.table of terms with frequencies
profanity_ A profane words vector
pos_ A part of speech vector
pos_df_ A part of speech data.frame
sw_ A stopword vector
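
For orientation, here is a minimal sketch (assuming the package is installed) showing how a few of the prefixed objects differ in class:

library(lexicon)

class(key_contractions)  # key_:  a data.frame with a lookup and return column
class(hash_lemmas)       # hash_: a keyed data.table (token -> lemma)
class(freq_first_names)  # freq_: a data.table of terms with frequencies
head(sw_fry_100)         # sw_:   a plain character vector of stopwords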

Data

Data Description
cliches Common Cliches
common_names First Names (U.S.)
constraining_loughran_mcdonald Loughran-McDonald Constraining Words
emojis_sentiment Emoji Sentiment Data
freq_first_names Frequent U.S. First Names
freq_last_names Frequent U.S. Last Names
function_words Function Words
grady_augmented Augmented List of Grady Ward’s English Words and Mark Kantrowitz’s Names List
hash_emojis Emoji Description Lookup Table
hash_emojis_identifier Emoji Identifier Lookup Table
hash_emoticons Emoticons
hash_grady_pos Grady Ward’s Moby Parts of Speech
hash_internet_slang List of Internet Slang and Corresponding Meanings
hash_lemmas Lemmatization List
hash_nrc_emotions NRC Emotion Table
hash_sentiment_emojis Emoji Sentiment Polarity Lookup Table
hash_sentiment_huliu Hu Liu Polarity Lookup Table
hash_sentiment_jockers Jockers Sentiment Polarity Table
hash_sentiment_jockers_rinker Combined Jockers & Rinker Polarity Lookup Table
hash_sentiment_loughran_mcdonald Loughran-McDonald Polarity Table
hash_sentiment_nrc NRC Sentiment Polarity Table
hash_sentiment_senticnet Augmented SenticNet Polarity Table
hash_sentiment_sentiword Augmented Sentiword Polarity Table
hash_sentiment_slangsd SlangSD Sentiment Polarity Table
hash_sentiment_socal_google SO-CAL Google Polarity Table
hash_valence_shifters Valence Shifters
key_contractions Contraction Conversions
key_corporate_social_responsibility Nadra Pencle and Irina Malaescu’s Corporate Social Responsibility Dictionary
key_grade Grades Data Set
key_rating Ratings Data Set
key_regressive_imagery Colin Martindale’s English Regressive Imagery Dictionary
key_sentiment_jockers Jockers Sentiment Data Set
modal_loughran_mcdonald Loughran-McDonald Modal List
nrc_emotions NRC Emotions
pos_action_verb Action Word List
pos_df_irregular_nouns Irregular Nouns Word Dataframe
pos_df_pronouns Pronouns
pos_interjections Interjections
pos_preposition Preposition Words
profanity_alvarez Alejandro U. Alvarez’s List of Profane Words
profanity_arr_bad Stackoverflow user2592414’s List of Profane Words
profanity_banned bannedwordlist.com’s List of Profane Words
profanity_racist Titus Wormer’s List of Racist Words
profanity_zac_anger Zac Anger’s List of Profane Words
sw_dolch Leveled Dolch List of 220 Common Words
sw_fry_100 Fry’s 100 Most Commonly Used English Words
sw_fry_1000 Fry’s 1000 Most Commonly Used English Words
sw_fry_200 Fry’s 200 Most Commonly Used English Words
sw_fry_25 Fry’s 25 Most Commonly Used English Words
sw_jockers Matthew Jocker’s Expanded Topic Modeling Stopword List
sw_loughran_mcdonald_long Loughran-McDonald Long Stopword List
sw_loughran_mcdonald_short Loughran-McDonald Short Stopword List
sw_lucene Lucene Stopword List
sw_mallet MALLET Stopword List
sw_python Python Stopword List

Installation

To download the development version of lexicon:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/lexicon")
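
Once installed, the bundled data sets can be listed and accessed directly; a quick sketch:

library(lexicon)
available_data()                       # data.frame listing the available data sets
head(hash_sentiment_jockers_rinker)    # e.g., the combined polarity lookup table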

Contact

You are welcome to submit suggestions and bug reports, or to send a pull request, via the project's GitHub repository.

lexicon's People

Contributors

trinker


lexicon's Issues

Preprocessing for Sentiwordnet

Hi,

I want to perform a sentiment analysis with SentiWordNet on tweets. The analysis should be at the document level, i.e. I want to classify each tweet as positive, negative, or neutral. However, the quality (accuracy) of the results depends on the preprocessing. While there are a number of papers dealing with the influence of preprocessing for machine-learning approaches (SVM, random forest, etc.), I found no paper that investigates the influence of preprocessing when using SentiWordNet. Is there a guideline on which methods are essential to get good results?
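
There is no canonical recipe in this package; the following is only a minimal sketch of common steps (lowercasing, stripping URLs and handles, lemmatizing via hash_lemmas) before looking terms up in hash_sentiment_sentiword. The example tweet and variable names are made up for illustration:

library(data.table)
library(lexicon)

tweet  <- tolower("Loving the new update!! http://t.co/xyz @someone")
tweet  <- gsub("http\\S+|@\\w+", "", tweet)   # strip URLs and handles
tokens <- unlist(strsplit(tweet, "[^a-z']+"))
tokens <- tokens[nzchar(tokens)]

## map tokens to lemmas, keeping the token itself when no lemma is found
lemmas <- hash_lemmas[data.table(token = tokens), on = "token"]
lemmas[is.na(lemma), lemma := token]

## polarity lookup in the augmented SentiWordNet-based table
hash_sentiment_sentiword[data.table(x = lemmas$lemma), on = "x"]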

f-word in valence shifters table?

Currently:

> lexicon::hash_valence_shifters['fucking']
         x    y
1: fucking <NA>
> lexicon::hash_sentiment_jockers_rinker['fuckin']
        x  y
1: fuckin NA
> lexicon::hash_sentiment_jockers_rinker['fucking']
         x  y
1: fucking -1

No placement for fuckin, and is fucking in the wrong table? Test empirically.
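
One hedged way to test this empirically (assuming sentimentr is installed) would be to score minimal pairs with the current tables and compare:

library(sentimentr)

sentiment(
    c("This is fucking great", "This is fuckin great"),
    polarity_dt         = lexicon::hash_sentiment_jockers_rinker,
    valence_shifters_dt = lexicon::hash_valence_shifters
)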

Add regex arg to available_data()

#' Get Available \pkg{lexicon} Data
#'
#' See available \pkg{lexicon} data as a data.frame.
#'
#' @param regex A regex to search for within the data columns.
#' @param \ldots Other arguments passed to \code{grep}.
#' @return Returns a data.frame
#' @export
#' @examples
#' available_data()
#' available_data('hash_')
#' available_data('hash_sentiment')
#' available_data('python')
#' available_data('prof')
#' available_data('English')
#' available_data('Stopword')
available_data <- function(regex = NULL, ...){


    results <- utils::data(package = 'lexicon')[["results"]]
    dat <- stats::setNames(data.frame(results[, 3:4, drop = FALSE],
        stringsAsFactors = FALSE), c("Data", "Description"))

    ns <- getNamespaceExports(loadNamespace('lexicon'))
    ns <- ns[!ns %in% c("available_data", 'grady_pos_feature')]
    dat <- rbind.data.frame(
        dat,
        data.frame(
            Data = ns,
            Description = c('Jockers Sentiment Polarity Table', 'Jockers Sentiment Data Set'),
            stringsAsFactors = FALSE
        )
    )
    dat <- dat[order(dat[['Data']]),]
    row.names(dat) <- NULL

    if (!is.null(regex)){
        locs <- sort(unique(unlist( lapply(dat, function(x){ grep(regex, x, ...) }) )))

        if (length(locs) > 0) {
            dat <- dat[locs,]
        } else {
            warning('`regex` not found, returning all available data')
        }
    }

    dat 

}

profanity_google and profanity_von_ahn data files removed in latest commit

It appears as if profanity_google and profanity_von_ahn were removed from the DESCRIPTION file in the latest commit.

Was this intentional?

If people stumble upon this, installing the previous commit brings the data back in:

devtools::install_github("trinker/lexicon", ref = "bd0e3901de45d5986d32d6e45670ac39b622a8d1")

Error message: Failed to install/load: trinker/lexicon

Hello,

I am relatively new to coding/programming, and I encountered this issue, which I hope you could assist me with:
I was trying to install and load the trinker/lexicon package using the code provided:
"if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/lexicon") "

However, whenever I tried this, I always received the warning message:
"In pacman::p_load_gh("trinker/lexicon") :
Failed to install/load:
trinker/lexicon"

Thank you!
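
Without the full console output it is hard to say what failed. One hedged next step is to retry with remotes, which usually surfaces the underlying download or dependency error:

if (!require("remotes")) install.packages("remotes")
remotes::install_github("trinker/lexicon")
library(lexicon)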

probability words as a probability table

if (!require("pacman")) install.packages("pacman")
pacman::p_load(readr, dplyr, textshape)


browseURL('https://github.com/zonination/perceptions/blob/master/probly.csv')

'https://raw.githubusercontent.com/zonination/perceptions/master/probly.csv' %>%
    read_csv() %>%
    tidy_list('phrase', 'probability') %>%
    mutate(probability = probability/100) %>%
    tbl_df()

## # A tibble: 782 x 2
##    phrase           probability
##    <chr>                  <dbl>
##  1 Almost Certainly       0.950
##  2 Almost Certainly       0.950
##  3 Almost Certainly       0.950
##  4 Almost Certainly       0.950
##  5 Almost Certainly       0.980
##  6 Almost Certainly       0.950
##  7 Almost Certainly       0.850
##  8 Almost Certainly       0.970
##  9 Almost Certainly       0.950
## 10 Almost Certainly       0.900
## # ... with 772 more rows
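
If the tidy tibble above is stored as probs, a key_-style table could be derived by collapsing the raw responses, e.g. to the median per phrase (a sketch only; probs and key_probability are hypothetical names, not package objects):

key_probability <- probs %>%
    group_by(phrase) %>%
    summarise(probability = median(probability)) %>%
    arrange(desc(probability))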

NA within the dataset

Hi. Not sure whether this was intentional, but I found one NA within the dataset. Try running as.vector(freq_first_names$Name[3300:3305]), which shows the location of the NA. This refers to version '1.3.0'. This might be important, as it requires removing such values prior to creating other objects, e.g. dictionaries.
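
Until the NA is fixed upstream, a minimal workaround sketch is to drop incomplete rows before building downstream objects:

## works whether freq_first_names is a data.frame or a data.table
freq_first_names_clean <- lexicon::freq_first_names[!is.na(lexicon::freq_first_names$Name), ]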

indonesian nrc

As far as I know, the NRC EmoLex is available in 40+ languages, including Indonesian. Is the Indonesian version of the NRC lexicon also available in this package?

Add terms

sentimentr:::update_polarity_table(hash_sentiment,
    data.frame(
        words = c('spot on', 'on time'),
        polarity = c(1, 1),
        stringsAsFactors = FALSE
    )
)

Import custom syuzhet dictionary

If Matthew Jockers exposes the custom syuzhet dictionary, it will likely be as a function whose output will need to be imported, converted to a key, and then re-exported. I may export the data set twice, in its original form and as a sentimentr-based key.

If this happens it can look like a data set but can't be called with the data() function: http://stackoverflow.com/q/42546514/1000343

Also, the readme will need to address this since it uses data() to list the data sets.

Maybe an available_data() or datasets() etc. function is in order that shows everything.

trouble with exports - 'hash_sentiment'

I can't run the sentiment* functions.

pacman::p_load_current_gh("trinker/lexicon", "trinker/sentimentr")
sentiment_by(sam_i_am, by = NULL)
Error: 'hash_sentiment' is not an exported object from 'namespace:lexicon'

Dupe profanity list words results in error

dats <- c( 
    "crowdflower_deflategate", 
    "crowdflower_products", 
    "course_evaluations", 
    "crowdflower_self_driving_cars", 
    "crowdflower_weather", 
    "hotel_reviews", 
    "kaggle_movie_reviews", 
    "cannon_reviews", 
    "kotzias_reviews_amazon_cells"
) 


cdat <- combine_data(dats[1])


sdat <- get_sentences(cdat)
swears <- profanity(sdat, profanity_list = c( 'shit', 'shit'))

Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, :
Join results in 187093 rows; more than 187044 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the ..

A check that makes the list unique is needed.
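
Until such a check is added, a hedged workaround is to dedupe the list before passing it in:

swears <- profanity(sdat, profanity_list = unique(c('shit', 'shit')))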

Pronoun list

Given Pennebaker's book on the hidden life of pronouns, it may be useful to add a pronoun list:

c("he", "her", "hers", "herself", "him", "himself", "his", "i", 
"it", "its", "me", "mine", "my", "myself", "our", "ours", "ourselves", 
"she", "thee", "their", "them", "themselves", "they", "thou", 
"thy", "thyself", "us", "we", "ye", "you", "your", "yours", "yourself",
"we"
)
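
The vector above repeats "we"; a small tidy-up sketch before shipping it as a data set (assuming the vector is assigned to pronouns):

pronouns <- sort(unique(pronouns))   # drops the duplicated "we"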

can't find hash_sentiment_inquirer in the package

Hi Tyler,

I notice that the description mentions the hash_sentiment_inquirer table as the Polarity Lookup Table using Inquirer (I presume that it's referring to Harvard's General Inquirer). However, I can't seem to find it in the package.

> lexicon::hash_sentiment_inquirer
Error: 'hash_sentiment_inquirer' is not an exported object from 'namespace:lexicon'

I also tried lexicon::available_data() but can't seem to find it either.

Am I missing something?

Thank you for your help,

Ken.
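
A quick way to confirm whether anything Inquirer-based ships with the installed version (a sketch using the existing available_data() listing):

grep('inquirer', lexicon::available_data()$Data, ignore.case = TRUE, value = TRUE)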

Add `combine_lexicon`

A function that takes (named) lexicons and combines them, removing dupes (for lookups, dupes are replaced by the first value). A rough skeleton:

combine_lexicon <- function(lexicons, ask.dupes = FALSE, ...){

    ## check that all lexicons are of the same type (grouped together)
    ## manual mapping?  or by naming convention?

    ## get internal data
    dats <- lapply(lexicons, function(x) eval(parse(text = paste0('lexicon::', x))))

    ## handling for atomic vs tabular
    if (all(vapply(dats, is.atomic, logical(1)))) {

        ## for atomic (vector) lexicons: concatenate and drop dupes
        unique(unlist(dats, use.names = FALSE))

    } else {

        ## for tabular lexicons


        if (ask.dupes){
            ## interactively ask which dupe to keep
        } else {
            ## keep first dupe
        }

        ## handling for unequal number of columns
        ## handling for data.table keyed lookups
    }

}
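
A hypothetical call (combine_lexicon is only proposed here, not part of the package) might look like:

## combine two stopword vectors into one deduplicated vector
combine_lexicon(c('sw_fry_100', 'sw_dolch'))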
