SIFtR


Purpose

This project has two goals. The first is an R implementation of the Smooth Inverse Frequency (SIF) algorithm for sentence embeddings, filling a gap among sentence embedding techniques in R that don't run Python in the background. SIF is a relatively lightweight and remarkably accurate embedding approach that, on some tasks, performs comparably to neural-network-based embedding models. The second is an application of this algorithm (via Shiny) to identify and sift out undesirable text data, based on user input fed into a random forest classifier. The app assumes that desirability depends on some semantic aspect of the data, and uses user-provided examples of good and bad data to try to label the full dataset. The user can then give feedback on the model's predictions to refine it, the intent being that the user only needs to label a handful of data points to get a decent split between useful and non-useful data in an unlabeled dataset.

As an aside, the overlap between the SIF algorithm, the idea of "sifting" data, and the convention of throwing an "r" on the end of R packages was a very happy accident.


Dataset

The data used in this project were pulled from the following sources:

  • Word frequencies for weighting word embeddings

  • Pretrained word embeddings trained on Wikipedia articles, using 100 dimensions rather than the standard 300 to keep the implementation a bit smaller

  • stringr for the default text data loaded with the Shiny app, specifically the fruit and sentences datasets

Implementation

The present SIF implementation is based on the original algorithm as well as this notebook, which provides a slightly simplified approach. The implementation from the original authors included principal component removal after sentence embedding calculation, which I forgo in this project for simplicity.
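
For readers who want to add the omitted step back, the following is a minimal sketch of first principal component removal in base R. It is not part of this project's code: the function name remove_first_pc and the layout of emb_mat (one sentence embedding per row) are assumptions made for illustration.

```r
# Hypothetical helper, not part of siftr: removes the projection of every
# sentence embedding onto the first right singular vector of the matrix.
remove_first_pc <- function(emb_mat) {
    # emb_mat: numeric matrix, one sentence embedding per row (assumed layout)
    sv <- svd(emb_mat, nu = 0, nv = 1)
    u <- sv$v[, 1, drop = FALSE]        # d x 1 direction of the first component
    emb_mat - emb_mat %*% u %*% t(u)    # subtract each row's projection onto u
}

# Toy input: 5 random "sentence embeddings" with 3 dimensions each
set.seed(1)
emb_mat <- matrix(rnorm(5 * 3), nrow = 5)
cleaned <- remove_first_pc(emb_mat)
```

After this step, every row of cleaned is orthogonal to the removed direction, while the matrix keeps its original shape.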

The following functions constitute the core of the SIF-weighted sentence embedding calculation. In brief, a sentence embedding is the average of the sentence's constituent word embeddings, with each word embedding multiplied by a smoothed inverse frequency weight for that word: a / (a + p(w)), where a is weight_param and p(w) is the word's frequency.

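library(stringr)   # str_replace_all(), str_split()
library(magrittr)  # %>%

# ef_list is expected to exist in the calling environment: a named list with
# one entry per vocabulary word, each holding $emb (embedding) and $freq.
word_sif <- function(word, weight_param = 1e-3) {
    # Out-of-vocabulary words fall back to the unknown-word entry
    if (!(word %in% names(ef_list))) {
        word <- "_UNK_"
    }
    word_emb <- unlist(ef_list[[word]]$emb[[1]])
    word_freq <- ef_list[[word]]$freq
    # Smoothed inverse frequency weight: a / (a + p(w))
    word_weight <- weight_param / (weight_param + word_freq)
    out <- word_weight * word_emb
    return(out)
}

sent_sif <- function(sentence) {
    # Lowercase, strip punctuation, and split on single spaces
    sent <- sentence %>%
        tolower(.) %>%
        str_replace_all(., "[[:punct:]]", "") %>%
        str_split(., " ") %>%
        unlist(.)
    # One column per word, one row per embedding dimension
    sent_mat <- sapply(sent, word_sif)
    sent_sum <- rowSums(sent_mat)
    # Average over all tokens, including any that mapped to "_UNK_"
    out <- sent_sum / length(sent)
    return(out)
}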

Words not contained in the model vocabulary are mapped to the "_UNK_" entry, so the above code assigns them a vector of 0s. Before embedding, input text is lowercased, stripped of all punctuation, and tokenized on spaces.
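
As a rough illustration of the calculation, the sketch below runs a base-R restatement of word_sif() and sent_sif() on a toy ef_list. The three-dimensional embeddings and the frequencies are invented for the example, and the real data shipped with the package may be structured differently.

```r
# Toy stand-in for the embedding/frequency data (all values are invented)
ef_list <- list(
    the     = list(emb = list(c(1, 0, 0)), freq = 0.05),
    cat     = list(emb = list(c(0, 1, 0)), freq = 0.001),
    "_UNK_" = list(emb = list(c(0, 0, 0)), freq = 0)
)

# Base-R restatement of word_sif() from above
word_sif <- function(word, weight_param = 1e-3) {
    if (!(word %in% names(ef_list))) word <- "_UNK_"
    entry <- ef_list[[word]]
    weight_param / (weight_param + entry$freq) * unlist(entry$emb[[1]])
}

# Base-R restatement of sent_sif() from above
sent_sif <- function(sentence) {
    tokens <- strsplit(gsub("[[:punct:]]", "", tolower(sentence)), " ")[[1]]
    rowSums(sapply(tokens, word_sif)) / length(tokens)
}

# Tokens: "the" (frequent, weight ~0.02), "cat" (rare, weight 0.5),
# "purrs" (out of vocabulary, contributes zeros)
emb <- sent_sif("The cat, purrs!")
```

The rare word "cat" dominates the result, while the frequent word "the" is heavily down-weighted. The out-of-vocabulary word adds nothing to the sum but still counts toward the sentence length used for averaging.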


Outputs

  • SIFtR Shiny app. The vocabulary has been limited to 300,000 of the full 1.5 million words, due to memory constraints on unpaid Shiny projects. If you need a larger vocabulary, the full 1.5 million word embeddings load without problems on a machine with 8 GB of memory.
  • siftr R package, which includes the SIF implementation and the associated embedding and frequency data. Install it with devtools::install_github("ryancahildebrandt/siftr")
