adamspannbauer / lexrankr

Extractive Text Summarization with lexRankr (an R package implementing the LexRank algorithm)

License: Other

R 73.85% C++ 3.84% C 1.44% HTML 20.87%

r r-package lexrank rstat nlp lexrank-algorithm

lexrankr's Issues

Not able to get single top sentence

Pretty sure this is user error. My dataframe contains a large block of text (from a SQL database) as the column contentraw. When I try to pass back the top sentence, I get a mangled mess instead. The desired output is the single top sentence in the document.

What am I doing wrong?

Code:

df <- data.table(dbxSelect(dbxcon, selectarticles))

cleancopy <- function(x, urls = TRUE, hashtags = TRUE)
{
  ## remove obvious crap
  if (urls) {
    x = gsub("\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)", "", x)
  }
  if (hashtags) {
    x = gsub("#\\S+", "", x)
  }
  ## split sentences to new lines
  x = gsub("\\. ", "\\. \n", x)
  #return
  x
  
}

## clean up the column
df$contentraw <- cleancopy(df$contentraw)

## run rank and assign to key
df$keysent <- df[, lexRankr::lexRank(
  contentraw,
  docId = url,
  n = 1,
  continuous = TRUE,
  returnTies = FALSE
),
by = url]

no sentences above threshold needs verbose error

lexRankFromSimil has a threshold option that sets the minimum similarity score for an edge to be built into the graph on which PageRank is performed. Currently there is no message to the user if no edges are above the threshold, and lexRankFromSimil itself does not produce an error. Instead, the error surfaces later in lexRank, in the final output step.

Need to include a verbose error describing the issue and suggesting explanations/workarounds.
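A minimal sketch of what such a check could look like, in base R. The helper name `check_edges_above_threshold` is hypothetical and not part of lexRankr, and the sketch assumes an edge is kept when its similarity is greater than or equal to the threshold; the package may use a strict comparison instead.

```r
# Hypothetical helper (not part of lexRankr): fail early with an explanatory
# message when the threshold would remove every edge from the graph.
# Assumption: an edge is kept when similVal >= threshold.
check_edges_above_threshold <- function(similVal, threshold) {
  stopifnot(is.numeric(similVal), is.numeric(threshold), length(threshold) == 1)
  n_kept <- sum(similVal >= threshold, na.rm = TRUE)
  if (n_kept == 0) {
    # Report the largest observed similarity so the user can pick a new threshold
    max_simil <- if (all(is.na(similVal))) NA_real_ else max(similVal, na.rm = TRUE)
    stop(
      "No sentence pairs have a similarity at or above threshold = ", threshold,
      " (largest observed similarity: ", format(max_simil, digits = 3), "). ",
      "Try lowering `threshold` or setting `continuous = TRUE`.",
      call. = FALSE
    )
  }
  invisible(n_kept)
}
```

Calling this before the graph is built would turn the late failure in the output step into an immediate, descriptive one.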

rm dplyr

convert data.frame operations away from dplyr (either base or data.table)

reason: speed and stability

proxyDB error: IDFcosine already in registry

If sentence similarity processing starts and is halted before completion, the "idfcosine" function is not removed from the proxy registry. There is currently no logic to check whether the function is already in the registry before attempting to add idfcosine, so the next run produces an error.

Add logic to check whether idfcosine is in the proxy registry before adding it, or create a workaround, perhaps using on.exit logic.
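One possible shape for that logic, as a sketch only. It relies on the registry methods `entry_exists`, `set_entry`, and `delete_entry` of `proxy::pr_DB`; the wrapper names `register_idfcosine` and `with_idfcosine` and the `simil_fun` argument are hypothetical stand-ins for whatever the package uses internally.

```r
# Sketch, assuming the proxy package is installed.
# `simil_fun` stands in for the package's idf-cosine similarity function.
register_idfcosine <- function(simil_fun) {
  # Remove a stale entry left behind by an interrupted run
  if (proxy::pr_DB$entry_exists("idfcosine")) {
    proxy::pr_DB$delete_entry("idfcosine")
  }
  proxy::pr_DB$set_entry(FUN = simil_fun, names = "idfcosine")
  invisible(TRUE)
}

# Run `code` with the entry registered; the entry is removed on exit,
# whether `code` finishes normally or is halted by an error.
with_idfcosine <- function(simil_fun, code) {
  register_idfcosine(simil_fun)
  on.exit(
    if (proxy::pr_DB$entry_exists("idfcosine")) proxy::pr_DB$delete_entry("idfcosine"),
    add = TRUE
  )
  force(code)
}
```

Because `code` is evaluated lazily inside the wrapper, the registration happens first and the cleanup is guaranteed by `on.exit`.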

eval data.table

Rewrite the base transformations with data.table. Benchmark and evaluate the speed gain against the cost of introducing a dependency.

create doc id arg for unnest sentence function

The unnest sentences function needs to accept a column of doc ids for the case where the input text column is not a column of documents; see #11.

Vector memory exhausted on plain text processing

When attempting to digest plain text, I hit:

Parsing text into sentences and tokens...DONE
Calculating pairwise sentence similarities...
Error: vector memory exhausted (limit reached?)

Example is a dataframe with a URL as the first column and the page's text (HTML stripped) as the second column.

I am attempting to return only the highest LexRanked sentence from the block of text, one per article.

Code:

df$summary = lexRankr::lexRank(df$contentraw,
                          docId = df$url,
                          n = 1,
                          continuous = FALSE)

Am I doing this wrong?

add helper for multiple doc lexranking (within doc)

see use case presented in #8

`damping` arg ignored in `lexRankFromSimil`

Just had a look at this package because I recently created a similar package called textrank (https://cran.r-project.org/web/packages/textrank/index.html). This package seems to follow the same approach, although the textrank package starts with something that looks like the output of udpipe, which already contains the sentences and all the words tokenised.
While skimming the code, I noticed that you are not using the damping argument in lexRankFromSimil; maybe that is something to fix.
I'm also interested to hear if you have found a way to reduce the computational burden of doing many sentence to sentence similarity calculations?
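For illustration, a sketch of how a damping value could be passed through to PageRank with igraph. The function `rank_sentences` and the edge column names `sent1`/`sent2` are assumptions made for this example, not the package's internals.

```r
# Sketch, assuming the igraph package is installed.
# `edges` is a data.frame whose first two columns name the sentences joined
# by an edge; `damping` is forwarded to PageRank instead of being ignored.
rank_sentences <- function(edges, damping = 0.85) {
  g <- igraph::graph_from_data_frame(edges, directed = FALSE)
  igraph::page_rank(g, directed = FALSE, damping = damping)$vector
}

# Toy graph: sentence "b" is connected to both "a" and "c"
edges <- data.frame(sent1 = c("a", "b"), sent2 = c("b", "c"),
                    stringsAsFactors = FALSE)
scores_default <- rank_sentences(edges)
scores_low <- rank_sentences(edges, damping = 0.5)
```

With the argument forwarded, the two calls return different scores; a lower damping value pulls the scores toward a uniform distribution.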

`smart_stopwords` object not available unless package loaded with `library`

The smart_stopwords object used for removing stopwords during parsing is only available if the user fully loads the package with library. Calling functions as lexRankr::function runs into an error.

Error in sentenceSimil

I apologize if you have covered this before. I am trying to understand what is causing lexRank to fail in some texts. I am encountering this with a lot of text material.

Here is a reproducible example with the CNN dataset:

data_path <- "cnn/stories"

files <- list.files(data_path, pattern = "story$") 

# files_sample <- sample(files, 30)

file_two <- files[2]

data <- file_two %>%
  map(~ read_lines(file.path(data_path, .))) %>% 
  data_frame()

data <- data %>%
  rename(articles = ".") 

data <- data %>% 
  mutate(doc_id = 1:length(articles))

sent <- data %>%
  pull(articles) %>% 
  as.character()

lexRank(sent)

I am trying to determine the cause of the error. I am encountering it in a large number of articles from different sources, and I would like to know if it's a pre-processing problem.

throw more informative error if only 1 document passed

put a stopifnot(length(text) > 1) in the best place

Currently, inputting 1 doc throws the exception below, which isn't very helpful for uncovering the actual issue:

Only one sentence had nonzero tfidf scores.  Similarities would return as NaN
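A sketch of a more descriptive check, in base R. The helper name `check_multiple_docs` is hypothetical; the hint about `sentencesAsDocs = TRUE` refers to the existing lexRank argument that treats sentences as documents when computing tf-idf, which is assumed here to be the relevant workaround.

```r
# Hypothetical helper (not part of lexRankr): explain the single-document
# case up front instead of failing later with the tf-idf message.
check_multiple_docs <- function(text, docId = seq_along(text)) {
  stopifnot(length(text) == length(docId))
  n_docs <- length(unique(docId))
  if (n_docs < 2) {
    stop(
      "Only ", n_docs, " document was supplied. Document-level tf-idf needs ",
      "at least 2 documents. For single-document summarization, consider ",
      "`sentencesAsDocs = TRUE`.",
      call. = FALSE
    )
  }
  invisible(n_docs)
}
```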

Purr

Is there a way to use map() in a pipe with lexRank? Let's say I want to extract a summary sentence from documents collected in a data frame, one article per row.

I guess you would have to unnest_sentences for each row, then create a new table to store the top ranking sentences?
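One way this could be approached, sketched in base R so it runs without purrr. Both function names are hypothetical. `rank_one_doc` assumes that calling lexRankr::lexRank on a single article with `sentencesAsDocs = TRUE` works for the texts at hand, which may not hold for very short articles; the grouping helper takes the ranking function as an argument so it can be tried with any stand-in.

```r
# Assumed wrapper around lexRankr::lexRank for one article (requires lexRankr).
rank_one_doc <- function(text) {
  res <- lexRankr::lexRank(text,
                           docId = rep(1, length(text)),
                           n = 1,
                           continuous = TRUE,
                           returnTies = FALSE,
                           sentencesAsDocs = TRUE,
                           Verbose = FALSE)
  res$sentence[1]
}

# Split the text by document id, rank each document separately, and
# return one row per document.
top_sentence_per_doc <- function(text, doc_id, rank_fun = rank_one_doc) {
  by_doc <- split(text, doc_id)
  top <- vapply(by_doc, rank_fun, character(1))
  data.frame(doc_id = names(top), top_sentence = unname(top),
             stringsAsFactors = FALSE)
}
```

The `vapply` call is the base R counterpart of `purrr::map_chr(by_doc, rank_fun)`, so the same function drops into a pipe if purrr is preferred.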

Test failure on i386

Hi,
For a couple of days now (possibly due to the upload of igraph version 1.3.5), the CI test of the Debian package has been failing on the i386 architecture with:

== Failed tests ================================================================
-- Failure ('test-lexRank.R:39'): object out value -----------------------------
`testResult` not equal to `expectedResult`.
Component "docId": Mean relative difference: 1
Component "sentenceId": 2 string mismatches
Component "sentence": 2 string mismatches

[ FAIL 1 | WARN 0 | SKIP 0 | PASS 142 ]

You can see this in the full test log.

I have added some debug code in this patch to visualise the issue:

> test_check("lexRankr")
[1] "DEBUG: expectedResult: c(2, 1, 3)"                                                                                    
[2] "DEBUG: expectedResult: c(\"2_1\", \"1_1\", \"3_1\")"                                                                  
[3] "DEBUG: expectedResult: c(\"Is everything working as expected in my test?\", \"Testing 1, 2, 3.\", \"Is it working?\")"
[4] "DEBUG: expectedResult: c(0.48649, 0.25676, 0.25676)"                                                                  
[1] "DEBUG: testResult: c(2, 3, 1)"                                                                                    
[2] "DEBUG: testResult: c(\"2_1\", \"3_1\", \"1_1\")"                                                                  
[3] "DEBUG: testResult: c(\"Is everything working as expected in my test?\", \"Is it working?\", \"Testing 1, 2, 3.\")"
[4] "DEBUG: testResult: c(0.48649, 0.25676, 0.25676)"                                                                  
[ FAIL 1 | WARN 0 | SKIP 0 | PASS 142 ]

As you can see, the order of elements 2 and 3 is swapped, so the comparison fails. The i386 architecture seems to be the only one affected.

Kind regards, Andreas.
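Since the two swapped rows share the same score (0.25676), one possible way to make the comparison independent of tie order is to sort both results deterministically before comparing. The helper `normalize_ties` below is hypothetical, and the data frames are rebuilt from the debug output above.

```r
# Hypothetical test helper: order rows by descending (rounded) score and
# break ties by sentenceId, so tied sentences always appear in the same order.
normalize_ties <- function(res, digits = 5) {
  res <- res[order(-round(res$value, digits), res$sentenceId), , drop = FALSE]
  rownames(res) <- NULL
  res
}

expectedResult <- data.frame(
  docId = c(2, 1, 3),
  sentenceId = c("2_1", "1_1", "3_1"),
  sentence = c("Is everything working as expected in my test?",
               "Testing 1, 2, 3.", "Is it working?"),
  value = c(0.48649, 0.25676, 0.25676),
  stringsAsFactors = FALSE
)
# Same rows, with the two tied sentences in the order seen on i386
testResult <- expectedResult[c(1, 3, 2), ]

same_after_sort <- isTRUE(all.equal(normalize_ties(testResult),
                                    normalize_ties(expectedResult)))
```

Applying the same normalization to both sides keeps the test strict about scores and content while ignoring the platform-dependent order of ties.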
