
lexrankr's Introduction

lexRankr: Extractive Text Summarization in R


Installation

# install from CRAN
install.packages("lexRankr")

# install from this GitHub repo
devtools::install_github("AdamSpannbauer/lexRankr")

Overview

lexRankr is an R implementation of the LexRank algorithm described by Güneş Erkan & Dragomir R. Radev in LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. LexRank is designed to summarize a cluster of documents by identifying which sentences subsume the most information in that particular set of documents; it may not perform well on an unclustered/unrelated set of documents. As the paper's title suggests, the sentences are ranked based on their centrality in a graph. The graph is built from the pairwise similarities of the sentences (where similarity is measured with a modified idf cosine similarity function). The paper describes multiple ways to calculate centrality, and these options are available in the R package: sentences can be ranked according to their degree of centrality or by using the PageRank algorithm (both of these methods require setting a minimum similarity threshold for a sentence pair to be included in the graph). A third variation, Continuous LexRank, does not require a minimum similarity threshold; instead it runs PageRank on a weighted graph of sentences.
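
For orientation, the three variants correspond roughly to lexRank()'s arguments as below (a minimal sketch; check ?lexRank for exact defaults):

library(lexRankr)

text <- c("Testing the system. Second sentence for you.",
          "System testing the tidy documents df.",
          "Documents will be parsed and lexranked.")

# degree centrality on a thresholded graph
lexRank(text, usePageRank = FALSE, threshold = 0.2)

# PageRank on a thresholded graph
lexRank(text, usePageRank = TRUE, threshold = 0.2)

# continuous LexRank: PageRank on the full weighted graph, no threshold
lexRank(text, usePageRank = TRUE, continuous = TRUE)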

note: the LexRank algorithm is designed to work on a cluster of documents; it is built on the idea that a cluster of docs will focus on similar topics

note: pairwise sentence similarity is calculated for the entire set of documents passed to the function. This can be a computationally intensive process (especially with a large set of documents)

Basic Usage

library(lexRankr)
library(dplyr)

df <- tibble(doc_id = 1:3,
             text = c("Testing the system. Second sentence for you.",
                      "System testing the tidy documents df.",
                      "Documents will be parsed and lexranked."))

df %>%
    unnest_sentences(sents, text) %>%                    # split each document into sentences
    bind_lexrank(sents, doc_id, level = 'sentences') %>% # score each sentence with lexrank
    arrange(desc(lexrank))                               # most central sentences first


lexrankr's People

Contributors

adamspannbauer · tbwhite2


lexrankr's Issues

rm dplyr

convert data.frame operations away from dplyr (to either base R or data.table)

reason: speed and stability
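
A minimal sketch of the kind of swap involved (the scores data frame is a hypothetical stand-in for the package's internals):

# hypothetical stand-in data
scores <- data.frame(docId = c(1, 1, 2, 2), value = c(0.2, 0.5, 0.3, 0.4))

# the dplyr-style aggregation to be replaced:
# scores %>% group_by(docId) %>% summarise(top = max(value))

# dependency-free base R equivalent
aggregate(value ~ docId, data = scores, FUN = max)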

throw more informative error if only 1 document passed

put a stopifnot(length(text) > 1) (or similar check) in the best place

currently inputting 1 doc throws the exception below, which isn't too helpful for uncovering the actual issue

Only one sentence had nonzero tfidf scores.  Similarities would return as NaN
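
A minimal sketch of a friendlier check (checkDocInput is a hypothetical helper; where exactly it should live in the package is the open question here):

checkDocInput <- function(text) {
  # hypothetical helper: fail fast with a message naming the real problem
  if (length(text) <= 1) {
    stop("lexRank requires more than one document; ",
         "supply a cluster of related documents.")
  }
  invisible(text)
}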

`damping` arg ignored in `lexRankFromSimil`

Just had a look at this package because I was recently creating a similar package called textrank (https://cran.r-project.org/web/packages/textrank/index.html). That package seems to follow the same approach, although textrank starts with something that looks like the output of udpipe, which already contains sentences and all the words tokenised.
While skimming the code, I noticed that you are not using the damping argument in lexRankFromSimil; maybe that is something to fix.
I'm also interested to hear if you have found a way to reduce the computational burden of doing many sentence-to-sentence similarity calculations.
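
If the PageRank step is delegated to igraph (which lexRankr imports), the fix is presumably just forwarding the argument; a hedged sketch with hypothetical data:

# hypothetical similarity edges between sentence ids
similDf <- data.frame(s1 = c("1_1", "1_1", "1_2"),
                      s2 = c("1_2", "1_3", "1_3"),
                      weight = c(0.4, 0.2, 0.3))
damping <- 0.85

g  <- igraph::graph_from_data_frame(similDf, directed = FALSE)
# forward the user-supplied damping instead of silently using the default
pr <- igraph::page_rank(g, damping = damping, weights = igraph::E(g)$weight)
pr$vector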

Error in sentenceSimil

I apologize if you have covered this before. I am trying to understand what causes lexRank to fail on some texts; I am encountering this with a lot of text material.

Here is a reproducible example with the CNN dataset:

library(dplyr)
library(purrr)
library(readr)
library(lexRankr)

data_path <- "cnn/stories"

files <- list.files(data_path, pattern = "story$")

# files_sample <- sample(files, 30)

file_two <- files[2]

# read one story file into a one-column data frame
data <- file_two %>%
  map(~ read_lines(file.path(data_path, .))) %>%
  data_frame()

data <- data %>%
  rename(articles = ".")

data <- data %>%
  mutate(doc_id = 1:length(articles))

sent <- data %>%
  pull(articles) %>%
  as.character()

lexRank(sent)

I am trying to determine the cause of the error. I am encountering this in a high number of articles from different sources, and I am trying to work out whether it's a pre-processing problem.

proxyDB error: IDFcosine already in registry

if sentence similarity processing starts and is halted before completion, the "idfcosine" function is not removed from the proxy registry. There is currently no logic to check whether the function is already in the registry before attempting to push idfcosine in; this produces the error.

Add logic to check whether idfcosine is in the proxy DB before adding, or create a workaround, perhaps including on.exit logic.
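
A minimal sketch of the guard-plus-cleanup idea against the proxy registry (safeSimil and idfCosineSimil are illustrative names, not the package's internals):

safeSimil <- function(sentDtm) {
  idfCosineSimil <- function(x, y) sum(x * y)  # hypothetical stand-in
  # clear any stale entry left behind by an earlier interrupted run
  if (proxy::pr_DB$entry_exists("idfCosine")) {
    proxy::pr_DB$delete_entry("idfCosine")
  }
  proxy::pr_DB$set_entry(FUN = idfCosineSimil, names = "idfCosine")
  # guarantee removal even if the similarity computation is halted
  on.exit(proxy::pr_DB$delete_entry("idfCosine"), add = TRUE)
  proxy::simil(sentDtm, method = "idfCosine")
}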

eval data.table

rewrite the base transformations with data.table; benchmark and evaluate the speed gain vs. introducing a dependency
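
A quick way to put numbers on that trade-off (a sketch; assumes the microbenchmark package and hypothetical stand-in data):

library(data.table)

scores <- data.frame(docId = rep(1:200, each = 25), value = runif(5000))
dt     <- as.data.table(scores)

microbenchmark::microbenchmark(
  base       = aggregate(value ~ docId, data = scores, FUN = max),
  data.table = dt[, .(top = max(value)), by = docId],
  times      = 50
)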

Test failure on i386

Hi,
for the past few days (possibly due to a new upload of igraph version 1.3.5), the CI test of the Debian package has been failing for the i386 architecture with:

== Failed tests ================================================================
-- Failure ('test-lexRank.R:39'): object out value -----------------------------
`testResult` not equal to `expectedResult`.
Component "docId": Mean relative difference: 1
Component "sentenceId": 2 string mismatches
Component "sentence": 2 string mismatches

[ FAIL 1 | WARN 0 | SKIP 0 | PASS 142 ]

You can see this in the full test log.

I have added some debug code in this patch to visualise the issue:

> test_check("lexRankr")
[1] "DEBUG: expectedResult: c(2, 1, 3)"                                                                                    
[2] "DEBUG: expectedResult: c(\"2_1\", \"1_1\", \"3_1\")"                                                                  
[3] "DEBUG: expectedResult: c(\"Is everything working as expected in my test?\", \"Testing 1, 2, 3.\", \"Is it working?\")"
[4] "DEBUG: expectedResult: c(0.48649, 0.25676, 0.25676)"                                                                  
[1] "DEBUG: testResult: c(2, 3, 1)"                                                                                    
[2] "DEBUG: testResult: c(\"2_1\", \"3_1\", \"1_1\")"                                                                  
[3] "DEBUG: testResult: c(\"Is everything working as expected in my test?\", \"Is it working?\", \"Testing 1, 2, 3.\")"
[4] "DEBUG: testResult: c(0.48649, 0.25676, 0.25676)"                                                                  
[ FAIL 1 | WARN 0 | SKIP 0 | PASS 142 ]

As you can see, the order of elements 2 and 3 is swapped, so the comparison fails. The i386 architecture seems to be the only one affected.
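
For what it's worth, the two lower-ranked sentences carry identical scores (0.25676), so their relative order after sorting is not guaranteed across platforms; one possible fix is to make the test order-insensitive (a sketch with toy stand-ins for the test's objects):

# toy stand-ins; tied values make row order unstable across platforms
expectedResult <- data.frame(sentenceId = c("2_1", "1_1", "3_1"),
                             value      = c(0.48649, 0.25676, 0.25676))
testResult     <- expectedResult[c(1, 3, 2), ]  # same rows, tie order flipped

# sort both on a deterministic key before comparing
byKey <- function(d) d[order(-d$value, d$sentenceId), , drop = FALSE]
all.equal(byKey(testResult), byKey(expectedResult), check.attributes = FALSE)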

Kind regards, Andreas.

no sentences above threshold needs verbose error

lexRankFromSimil has a threshold option which sets the minimal similarity score for edges to be built into the graph on which PageRank is performed. Currently there is no message to the user if no edges are above the threshold, and lexRankFromSimil itself does not produce an error; because of this, the error only surfaces in lexRank at the final output step.

Need to include a verbose error describing the issue and suggesting explanations/workarounds.
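
A minimal sketch of where such a check could sit (object names are illustrative, not the package's internals):

# hypothetical objects: similDf holds the pairwise similarity values
similDf   <- data.frame(s1 = "1_1", s2 = "1_2", similVal = 0.05)
threshold <- 0.2

edgeDf <- similDf[similDf$similVal > threshold, ]
if (nrow(edgeDf) == 0) {
  stop("No sentence pairs have a similarity above threshold = ", threshold,
       ". Lower the threshold or use continuous LexRank instead.")
}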

Not able to get single top sentence

Pretty sure this is user error. My data frame contains a large block of text (from a SQL database) in the column contentraw. When I try to pass back the top sentence, I get a mangled mess instead. The desired output is the single top sentence in the document.

What am I doing wrong?

Code:

library(data.table)

df <- data.table(dbxSelect(dbxcon, selectarticles))

cleancopy <- function(x, urls = TRUE, hashtags = TRUE) {
  ## remove obvious crap
  if (urls) {
    x <- gsub("\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)", "", x)
  }
  if (hashtags) {
    x <- gsub("#\\S+", "", x)
  }
  ## split sentences to new lines
  x <- gsub("\\. ", "\\. \n", x)
  ## return
  x
}

## clean up the column
df$contentraw <- cleancopy(df$contentraw)

## run rank and assign to key
df$keysent <- df[, lexRankr::lexRank(
  contentraw,
  docId = url,
  n = 1,
  continuous = TRUE,
  returnTies = FALSE
),
by = url]
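
For what it's worth, lexRank() returns a data frame (docId, sentenceId, sentence, value), so assigning its whole result into keysent will look mangled. A hedged sketch of extracting just the sentence text and joining it back (continuing with the df above; sentencesAsDocs = TRUE lets a lone article be ranked on its own):

## keep only the top sentence per url, then join it back on
top <- df[, .(keysent = lexRankr::lexRank(contentraw, docId = url, n = 1,
                                          continuous = TRUE, returnTies = FALSE,
                                          sentencesAsDocs = TRUE)$sentence),
          by = url]
df <- merge(df, top, by = "url")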

Vector memory exhausted on plain text processing

When attempting to digest plain text, I hit:

Parsing text into sentences and tokens...DONE
Calculating pairwise sentence similarities...
Error: vector memory exhausted (limit reached?)

Example is a data frame with a URL as the first column and the page's text (HTML stripped) as the second column.

I am attempting to return only the highest LexRanked sentence from the block of text, one per article.

Code:

df$summary = lexRankr::lexRank(df$contentraw,
                          docId = df$url,
                          n = 1,
                          continuous = FALSE)

Am I doing this wrong?
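
Pairwise similarities are computed across every sentence in the whole vector, so memory grows roughly with the square of the total sentence count. A hedged sketch of ranking one article at a time instead, which keeps each similarity matrix small (continuing with the df described above; sentencesAsDocs = TRUE lets a single document be ranked on its own):

# one lexRank call per article: the similarity matrix only spans
# the sentences of that article, not the whole corpus
df$summary <- vapply(seq_len(nrow(df)), function(i) {
  res <- lexRankr::lexRank(df$contentraw[i], docId = df$url[i], n = 1,
                           continuous = FALSE, sentencesAsDocs = TRUE)
  res$sentence[1]
}, character(1))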

purrr

Is there a way to use map() in a pipe with lexRank? Let's say I want to extract a summary sentence from documents collected in a data frame, one article per row.

I guess you would have to unnest_sentences for each row, then create a new table to store the top-ranking sentences?
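
A hedged sketch of one way to do it without map(): unnest every row at once, rank, and keep the top sentence per document (assumes the data frame has doc_id and text columns, and dplyr >= 1.0 for slice_max):

library(dplyr)
library(lexRankr)

df %>%
  unnest_sentences(sents, text) %>%
  bind_lexrank(sents, doc_id, level = "sentences") %>%
  group_by(doc_id) %>%
  slice_max(lexrank, n = 1, with_ties = FALSE) %>%
  ungroup()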
