
lexrankr's Introduction

lexRankr: Extractive Text Summarization in R


Installation

# install from CRAN
install.packages("lexRankr")

# install from this GitHub repo
devtools::install_github("AdamSpannbauer/lexRankr")

Overview

lexRankr is an R implementation of the LexRank algorithm described by Güneş Erkan & Dragomir R. Radev in LexRank: Graph-based Lexical Centrality as Salience in Text Summarization. LexRank is designed to summarize a cluster of documents by identifying which sentences subsume the most information in that particular set of documents; it may not perform well on an unclustered/unrelated set of documents. As the paper's title suggests, the sentences are ranked based on their centrality in a graph. The graph is built from the pairwise similarities of the sentences (where similarity is measured with a modified idf cosine similarity function). The paper describes multiple ways to calculate centrality, and these options are available in the R package: sentences can be ranked according to their degree of centrality or by using the PageRank algorithm (both of these methods require setting a minimum similarity threshold for a sentence pair to be included in the graph). A third variation, Continuous LexRank, does not require a minimum similarity threshold; instead it runs PageRank on a weighted graph of sentences.
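
For orientation, the three variants correspond roughly to lexRank()'s arguments as below (a minimal sketch; check ?lexRank for exact defaults):

library(lexRankr)

text <- c("Testing the system. Second sentence for you.",
          "System testing the tidy documents df.",
          "Documents will be parsed and lexranked.")

# degree centrality on a thresholded graph
lexRank(text, usePageRank = FALSE, threshold = 0.2)

# PageRank on a thresholded graph
lexRank(text, usePageRank = TRUE, threshold = 0.2)

# continuous LexRank: PageRank on the full weighted graph, no threshold
lexRank(text, usePageRank = TRUE, continuous = TRUE)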

note: the LexRank algorithm is designed to work on a cluster of documents; it is built on the idea that a cluster of docs will focus on similar topics

note: pairwise sentence similarity is calculated for the entire set of documents passed to the function. This can be a computationally intensive process (especially with a large set of documents)

Basic Usage

library(lexRankr)
library(dplyr)

df <- tibble(doc_id = 1:3,
             text = c("Testing the system. Second sentence for you.",
                      "System testing the tidy documents df.",
                      "Documents will be parsed and lexranked."))

df %>%
    unnest_sentences(sents, text) %>%                    # split each document into sentences
    bind_lexrank(sents, doc_id, level = 'sentences') %>% # score each sentence with lexrank
    arrange(desc(lexrank))                               # most central sentences first


lexrankr's People

Contributors

adamspannbauer · tbwhite2


lexrankr's Issues

rm dplyr

convert data.frame operations away from dplyr (to either base R or data.table)

reason: speed and stability
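
A minimal sketch of the kind of swap involved (the scores data frame is a hypothetical stand-in for the package's internals):

# hypothetical stand-in data
scores <- data.frame(docId = c(1, 1, 2, 2), value = c(0.2, 0.5, 0.3, 0.4))

# the dplyr-style aggregation to be replaced:
# scores %>% group_by(docId) %>% summarise(top = max(value))

# dependency-free base R equivalent
aggregate(value ~ docId, data = scores, FUN = max)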

throw more informative error if only 1 document passed

put a stopifnot(length(text) > 1) (or similar check) in the best place

currently inputting 1 doc throws the exception below, which isn't too helpful for uncovering the actual issue

Only one sentence had nonzero tfidf scores.  Similarities would return as NaN
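
A minimal sketch of a friendlier check (checkDocInput is a hypothetical helper; where exactly it should live in the package is the open question here):

checkDocInput <- function(text) {
  # hypothetical helper: fail fast with a message naming the real problem
  if (length(text) <= 1) {
    stop("lexRank requires more than one document; ",
         "supply a cluster of related documents.")
  }
  invisible(text)
}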

`damping` arg ignored in `lexRankFromSimil`

Just had a look at this package because I was recently creating a similar package called textrank (https://cran.r-project.org/web/packages/textrank/index.html). That package seems to follow the same approach, although textrank starts with something that looks like the output of udpipe, which already contains sentences and all the words tokenised.
While skimming the code, I noticed that you are not using the damping argument in lexRankFromSimil; maybe that is something to fix.
I'm also interested to hear if you have found a way to reduce the computational burden of doing many sentence-to-sentence similarity calculations.
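
If the PageRank step is delegated to igraph (which lexRankr imports), the fix is presumably just forwarding the argument; a hedged sketch with hypothetical data:

# hypothetical similarity edges between sentence ids
similDf <- data.frame(s1 = c("1_1", "1_1", "1_2"),
                      s2 = c("1_2", "1_3", "1_3"),
                      weight = c(0.4, 0.2, 0.3))
damping <- 0.85

g  <- igraph::graph_from_data_frame(similDf, directed = FALSE)
# forward the user-supplied damping instead of silently using the default
pr <- igraph::page_rank(g, damping = damping, weights = igraph::E(g)$weight)
pr$vector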

Error in sentenceSimil

I apologize if you have covered this before. I am trying to understand what causes lexRank to fail on some texts; I am encountering this with a lot of text material.

Here is a reproducible example with the CNN dataset:

library(dplyr)
library(purrr)
library(readr)
library(lexRankr)

data_path <- "cnn/stories"

files <- list.files(data_path, pattern = "story$")

# files_sample <- sample(files, 30)

file_two <- files[2]

# read one story file into a one-column data frame
data <- file_two %>%
  map(~ read_lines(file.path(data_path, .))) %>%
  data_frame()

data <- data %>%
  rename(articles = ".")

data <- data %>%
  mutate(doc_id = 1:length(articles))

sent <- data %>%
  pull(articles) %>%
  as.character()

lexRank(sent)

I am trying to determine the cause of the error. I am encountering this in a high number of articles from different sources, and I am trying to work out whether it's a pre-processing problem.

proxyDB error: IDFcosine already in registry

if sentence similarity processing starts and is halted before completion, the "idfcosine" function is not removed from the proxy registry. There is currently no logic to check whether the function is already in the registry before attempting to push idfcosine in; this produces the error.

Add logic to check whether idfcosine is in the proxy DB before adding, or create a workaround, perhaps including on.exit logic.
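
A minimal sketch of the guard-plus-cleanup idea against the proxy registry (safeSimil and idfCosineSimil are illustrative names, not the package's internals):

safeSimil <- function(sentDtm) {
  idfCosineSimil <- function(x, y) sum(x * y)  # hypothetical stand-in
  # clear any stale entry left behind by an earlier interrupted run
  if (proxy::pr_DB$entry_exists("idfCosine")) {
    proxy::pr_DB$delete_entry("idfCosine")
  }
  proxy::pr_DB$set_entry(FUN = idfCosineSimil, names = "idfCosine")
  # guarantee removal even if the similarity computation is halted
  on.exit(proxy::pr_DB$delete_entry("idfCosine"), add = TRUE)
  proxy::simil(sentDtm, method = "idfCosine")
}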

eval data.table

rewrite the base transformations with data.table; benchmark and evaluate the speed gain vs. introducing a dependency
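
A quick way to put numbers on that trade-off (a sketch; assumes the microbenchmark package and hypothetical stand-in data):

library(data.table)

scores <- data.frame(docId = rep(1:200, each = 25), value = runif(5000))
dt     <- as.data.table(scores)

microbenchmark::microbenchmark(
  base       = aggregate(value ~ docId, data = scores, FUN = max),
  data.table = dt[, .(top = max(value)), by = docId],
  times      = 50
)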

Test failure on i386

Hi,
for the past few days (possibly due to a new upload of igraph version 1.3.5), the CI test of the Debian package has been failing for the i386 architecture with:

== Failed tests ================================================================
-- Failure ('test-lexRank.R:39'): object out value -----------------------------
`testResult` not equal to `expectedResult`.
Component "docId": Mean relative difference: 1
Component "sentenceId": 2 string mismatches
Component "sentence": 2 string mismatches

[ FAIL 1 | WARN 0 | SKIP 0 | PASS 142 ]

You can see this in the full test log.

I have added some debug code in this patch to visualise the issue:

> test_check("lexRankr")
[1] "DEBUG: expectedResult: c(2, 1, 3)"                                                                                    
[2] "DEBUG: expectedResult: c(\"2_1\", \"1_1\", \"3_1\")"                                                                  
[3] "DEBUG: expectedResult: c(\"Is everything working as expected in my test?\", \"Testing 1, 2, 3.\", \"Is it working?\")"
[4] "DEBUG: expectedResult: c(0.48649, 0.25676, 0.25676)"                                                                  
[1] "DEBUG: testResult: c(2, 3, 1)"                                                                                    
[2] "DEBUG: testResult: c(\"2_1\", \"3_1\", \"1_1\")"                                                                  
[3] "DEBUG: testResult: c(\"Is everything working as expected in my test?\", \"Is it working?\", \"Testing 1, 2, 3.\")"
[4] "DEBUG: testResult: c(0.48649, 0.25676, 0.25676)"                                                                  
[ FAIL 1 | WARN 0 | SKIP 0 | PASS 142 ]

As you can see, the order of elements 2 and 3 is swapped, so the comparison fails. The i386 architecture seems to be the only one affected.
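
For what it's worth, the two lower-ranked sentences carry identical scores (0.25676), so their relative order after sorting is not guaranteed across platforms; one possible fix is to make the test order-insensitive (a sketch with toy stand-ins for the test's objects):

# toy stand-ins; tied values make row order unstable across platforms
expectedResult <- data.frame(sentenceId = c("2_1", "1_1", "3_1"),
                             value      = c(0.48649, 0.25676, 0.25676))
testResult     <- expectedResult[c(1, 3, 2), ]  # same rows, tie order flipped

# sort both on a deterministic key before comparing
byKey <- function(d) d[order(-d$value, d$sentenceId), , drop = FALSE]
all.equal(byKey(testResult), byKey(expectedResult), check.attributes = FALSE)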

Kind regards, Andreas.

no sentences above threshold needs verbose error

lexRankFromSimil has a threshold option which sets the minimal similarity score for edges to be built into the graph on which PageRank is performed. Currently there is no message to the user if no edges are above the threshold, and lexRankFromSimil itself does not produce an error; because of this, the error only surfaces in lexRank at the final output step.

Need to include a verbose error describing the issue and suggesting explanations/workarounds.
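
A minimal sketch of where such a check could sit (object names are illustrative, not the package's internals):

# hypothetical objects: similDf holds the pairwise similarity values
similDf   <- data.frame(s1 = "1_1", s2 = "1_2", similVal = 0.05)
threshold <- 0.2

edgeDf <- similDf[similDf$similVal > threshold, ]
if (nrow(edgeDf) == 0) {
  stop("No sentence pairs have a similarity above threshold = ", threshold,
       ". Lower the threshold or use continuous LexRank instead.")
}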

Not able to get single top sentence

Pretty sure this is user error. My data frame contains a large block of text (from a SQL database) in the column contentraw. When I try to pass back the top sentence, I get a mangled mess instead. The desired output is the single top sentence in the document.

What am I doing wrong?

Code:

library(data.table)

df <- data.table(dbxSelect(dbxcon, selectarticles))

cleancopy <- function(x, urls = TRUE, hashtags = TRUE) {
  ## remove obvious crap
  if (urls) {
    x <- gsub("\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)", "", x)
  }
  if (hashtags) {
    x <- gsub("#\\S+", "", x)
  }
  ## split sentences to new lines
  x <- gsub("\\. ", "\\. \n", x)
  ## return
  x
}

## clean up the column
df$contentraw <- cleancopy(df$contentraw)

## run rank and assign to key
df$keysent <- df[, lexRankr::lexRank(
  contentraw,
  docId = url,
  n = 1,
  continuous = TRUE,
  returnTies = FALSE
),
by = url]
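
For what it's worth, lexRank() returns a data frame (docId, sentenceId, sentence, value), so assigning its whole result into keysent will look mangled. A hedged sketch of extracting just the sentence text and joining it back (continuing with the df above; sentencesAsDocs = TRUE lets a lone article be ranked on its own):

## keep only the top sentence per url, then join it back on
top <- df[, .(keysent = lexRankr::lexRank(contentraw, docId = url, n = 1,
                                          continuous = TRUE, returnTies = FALSE,
                                          sentencesAsDocs = TRUE)$sentence),
          by = url]
df <- merge(df, top, by = "url")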

Vector memory exhausted on plain text processing

When attempting to digest plain text, I hit:

Parsing text into sentences and tokens...DONE
Calculating pairwise sentence similarities...
Error: vector memory exhausted (limit reached?)

Example is a data frame with a URL as the first column and the page's text (HTML stripped) as the second column.

I am attempting to return only the highest LexRanked sentence from the block of text, one per article.

Code:

df$summary = lexRankr::lexRank(df$contentraw,
                          docId = df$url,
                          n = 1,
                          continuous = FALSE)

Am I doing this wrong?
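
Pairwise similarities are computed across every sentence in the whole vector, so memory grows roughly with the square of the total sentence count. A hedged sketch of ranking one article at a time instead, which keeps each similarity matrix small (continuing with the df described above; sentencesAsDocs = TRUE lets a single document be ranked on its own):

# one lexRank call per article: the similarity matrix only spans
# the sentences of that article, not the whole corpus
df$summary <- vapply(seq_len(nrow(df)), function(i) {
  res <- lexRankr::lexRank(df$contentraw[i], docId = df$url[i], n = 1,
                           continuous = FALSE, sentencesAsDocs = TRUE)
  res$sentence[1]
}, character(1))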

purrr

Is there a way to use map() in a pipe with lexRank? Let's say I want to extract a summary sentence from documents collected in a data frame, one article per row.

I guess you would have to unnest_sentences for each row, then create a new table to store the top-ranking sentences?
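
A hedged sketch of one way to do it without map(): unnest every row at once, rank, and keep the top sentence per document (assumes the data frame has doc_id and text columns, and dplyr >= 1.0 for slice_max):

library(dplyr)
library(lexRankr)

df %>%
  unnest_sentences(sents, text) %>%
  bind_lexrank(sents, doc_id, level = "sentences") %>%
  group_by(doc_id) %>%
  slice_max(lexrank, n = 1, with_ties = FALSE) %>%
  ungroup()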
