adamspannbauer / lexrankr
Extractive Text Summarization with lexRankr (an R package implementing the LexRank algorithm)
License: Other
Pretty sure this is user error. My dataframe contains a large block of text (from a SQL database) as the column contentraw. When I try to pass back the top sentence, I get a mangled mess instead. The desired output is the single top sentence in the document.
What am I doing wrong?
Code:
df <- data.table(dbxSelect(dbxcon, selectarticles))
cleancopy <- function(x, urls = TRUE, hashtags = TRUE)
{
  ## remove obvious crap
  if (urls) {
    x = gsub("\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)", "", x)
  }
  if (hashtags) {
    x = gsub("#\\S+", "", x)
  }
  ## split sentences to new lines
  x = gsub("\\. ", "\\. \n", x)
  #return
  x
}
## clean up the column
df$contentraw <- cleancopy(df$contentraw)
## run rank and assign to key
df$keysent <- df[, lexRankr::lexRank(
  contentraw,
  docId = url,
  n = 1,
  continuous = TRUE,
  returnTies = FALSE
),
by = url]
lexRankFromSimil has a threshold option that sets the minimum similarity score for an edge to be built into the graph on which PageRank is performed. Currently the user gets no message if no edges are above the threshold, and lexRankFromSimil itself does not produce an error. Instead, an error is produced in lexRank, in the final output step, because of this issue.
Need to include a verbose error describing the issue and suggesting explanations/workarounds.
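A minimal sketch of the kind of check described above, written in base R. The helper name check_threshold_edges and the similVal column name are assumptions for illustration, not lexRankr internals, and the suggested workarounds in the message reflect the threshold and continuous options mentioned elsewhere on this page.

```r
# Hypothetical helper, not part of lexRankr. Assumes `simil` is a data.frame of
# pairwise sentence similarities with a numeric `similVal` column.
check_threshold_edges <- function(simil, threshold = 0.2) {
  stopifnot(is.data.frame(simil), "similVal" %in% names(simil))
  vals <- simil$similVal
  keep <- !is.na(vals) & vals > threshold
  if (!any(keep)) {
    # Report the largest observed similarity so the user can pick a usable threshold.
    max_val <- if (all(is.na(vals))) NA_real_ else max(vals, na.rm = TRUE)
    stop(
      "No sentence pairs have similarity above threshold = ", threshold,
      " (largest observed similarity: ", format(max_val, digits = 3), "). ",
      "Try lowering `threshold` or using `continuous = TRUE`.",
      call. = FALSE
    )
  }
  # Return only the edges that would be built into the graph.
  simil[keep, , drop = FALSE]
}

simil <- data.frame(
  sent1 = c("1_1", "1_1"),
  sent2 = c("1_2", "2_1"),
  similVal = c(0.5, 0.1),
  stringsAsFactors = FALSE
)
edges <- check_threshold_edges(simil, threshold = 0.2)
```

Failing early here means the user sees the threshold problem by name rather than a downstream error from the output step.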
Convert data.frame operations away from dplyr (to either base R or data.table).
Reason: speed and stability.
If sentence similarity processing starts and is halted before completion, the "idfcosine" function is not removed from the proxy registry. There is currently no logic to check whether the function is already in the registry before attempting to push idfcosine in, so the next run produces an error.
Add logic to check whether idfcosine is in the proxy database before adding it, or create a workaround, perhaps with on.exit logic.
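The check-then-register idea with on.exit cleanup could look roughly like the sketch below. It assumes the proxy package is installed and uses its registry object proxy::pr_DB; the wrapper with_registered_simil and the toy cosine function are hypothetical and are not how lexRankr itself is written.

```r
# Sketch only: `with_registered_simil` is a hypothetical helper, not lexRankr code.
# Requires the proxy package; "idfcosine" is the entry name taken from the issue text.
with_registered_simil <- function(fun, callback, name = "idfcosine") {
  db <- proxy::pr_DB
  # A previous interrupted run may have left a stale entry behind; drop it first.
  if (db$entry_exists(name)) db$delete_entry(name)
  db$set_entry(FUN = fun, names = name)
  # Clean up even if `callback` errors or the user interrupts the computation.
  on.exit(if (db$entry_exists(name)) db$delete_entry(name), add = TRUE)
  callback()
}

# Toy similarity function used only to have something to register.
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

seen_during <- with_registered_simil(
  cosine,
  function() proxy::pr_DB$entry_exists("idfcosine")
)
seen_after <- proxy::pr_DB$entry_exists("idfcosine")
```

In a real call the callback would run the similarity computation against the registered entry. Because the cleanup is attached with on.exit, a halted run should no longer leave the entry in the registry.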
Rewrite the base transformations with data.table. Benchmark and evaluate the speed gain against the cost of introducing a dependency.
unnest_sentences needs to be able to accept a column of doc ids for the case where the input text column is not a column of documents; see #11.
When attempting to digest plain text, I hit:
Parsing text into sentences and tokens...DONE
Calculating pairwise sentence similarities...
Error: vector memory exhausted (limit reached?)
Example is a dataframe with a URL as the first column and the page's text (HTML stripped) as the second column.
I am attempting to return only the highest LexRanked sentence from the block of text, one per article.
Code:
df$summary = lexRankr::lexRank(df$contentraw,
                               docId = df$url,
                               n = 1,
                               continuous = FALSE)
Am I doing this wrong?
see use case presented in #8
I just had a look at this package because I recently created a similar package called textrank (https://cran.r-project.org/web/packages/textrank/index.html). This package seems to follow the same approach, although the textrank package starts from something that looks like the output of udpipe, which already contains the sentences and all the words tokenised.
While skimming the code, I noticed that you are not using the damping argument in lexRankFromSimil; maybe that is something to fix.
I'm also interested to hear if you have found a way to reduce the computational burden of doing many sentence to sentence similarity calculations?
The smart_stopwords object used for removing stopwords during parsing is only available if the user fully loads the package. An error occurs if the user attempts to call functions with the lexRankr::function syntax instead.
I apologize if you have covered this before. I am trying to understand what is causing lexRank to fail in some texts. I am encountering this with a lot of text material.
Here is a reproducible example with the CNN dataset:
data_path <- "cnn/stories"
files <- list.files(data_path, pattern = "story$")
# files_sample <- sample(files, 30)
file_two <- files[2]
data <- file_two %>%
  map(~ read_lines(file.path(data_path, .))) %>%
  data_frame()
data <- data %>%
  rename(articles = ".")
data <- data %>%
  mutate(doc_id = 1:length(articles))
sent <- data %>%
  pull(articles) %>%
  as.character()
lexRank(sent)
I am trying to determine the cause of the error. I am encountering it in a high number of articles from different sources, and I would like to know whether it's a pre-processing problem.
Put a stopifnot(length(text) > 1) in the best place.
Currently, inputting 1 doc throws the exception below, which isn't too helpful for uncovering the actual issue:
Only one sentence had nonzero tfidf scores. Similarities would return as NaN
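A hedged sketch of what such a guard might look like, with a more descriptive message than a bare stopifnot. The function check_doc_count is hypothetical, and the suggestion in the message assumes lexRank's sentencesAsDocs option behaves as its name implies.

```r
# Hypothetical pre-check, not the package's actual implementation.
check_doc_count <- function(text, sentencesAsDocs = FALSE) {
  # Count documents that are neither NA nor blank.
  n_docs <- sum(!is.na(text) & nzchar(trimws(text)))
  if (n_docs < 2 && !sentencesAsDocs) {
    stop(
      "lexRank received ", n_docs, " non-empty document(s), but needs more ",
      "than one to compare documents. Pass a vector with several documents, ",
      "or set `sentencesAsDocs = TRUE` to treat each sentence as a document.",
      call. = FALSE
    )
  }
  invisible(TRUE)
}

ok_many <- check_doc_count(c("First document.", "Second document."))
ok_single <- check_doc_count("Only one document.", sentencesAsDocs = TRUE)
```

Counting only non-empty entries also catches the case where a vector has several elements but all except one are blank or NA.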
Is there a way to use map() in a pipe with lexRank? Let's say I want to extract a summary sentence from documents collected in a data frame, one article per row.
I guess you would have to call unnest_sentences for each row, then create a new table to store the top-ranking sentences?
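One way to sketch the row-by-row approach in base R is to rank each article separately and bind the winners into a new table. Everything named here (top_sentence_by_doc, rank_fn, longest_sentence) is hypothetical; the stand-in ranker simply scores sentences by word count so the example runs without lexRankr.

```r
# `rank_fn` takes one document's text and returns a data.frame with
# `sentence` and `value` columns (the same column names lexRank output uses).
top_sentence_by_doc <- function(df, text_col, id_col, rank_fn) {
  rows <- lapply(seq_len(nrow(df)), function(i) {
    ranked <- rank_fn(df[[text_col]][i])
    # Keep the single highest-scoring sentence for this document.
    best <- ranked[which.max(ranked$value), , drop = FALSE]
    data.frame(doc_id = df[[id_col]][i], summary = best$sentence,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Stand-in ranker for illustration only: scores each sentence by its word count.
longest_sentence <- function(txt) {
  sents <- trimws(regmatches(txt, gregexpr("[^.!?]+[.!?]*", txt))[[1]])
  sents <- sents[nzchar(sents)]
  data.frame(sentence = sents,
             value = lengths(strsplit(sents, "\\s+")),
             stringsAsFactors = FALSE)
}

articles <- data.frame(
  url = c("a", "b"),
  contentraw = c("Short one. This is a much longer sentence here.",
                 "One two three four. Five."),
  stringsAsFactors = FALSE
)
summaries <- top_sentence_by_doc(articles, "contentraw", "url", longest_sentence)
```

To use lexRankr instead, rank_fn could be something like function(txt) lexRankr::lexRank(txt, n = 1, sentencesAsDocs = TRUE, returnTies = FALSE), assuming those arguments behave as documented. The result has one row per article and can be merged back onto the original data frame by doc_id.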
Hi,
for the past couple of days (possibly due to a new upload of igraph version 1.3.5) the CI test of the Debian package has been failing on the i386 architecture with:
== Failed tests ================================================================
-- Failure ('test-lexRank.R:39'): object out value -----------------------------
`testResult` not equal to `expectedResult`.
Component "docId": Mean relative difference: 1
Component "sentenceId": 2 string mismatches
Component "sentence": 2 string mismatches
[ FAIL 1 | WARN 0 | SKIP 0 | PASS 142 ]
You can see this in the full test log.
I have added some debug code in this patch to visualise the issue:
> test_check("lexRankr")
[1] "DEBUG: expectedResult: c(2, 1, 3)"
[2] "DEBUG: expectedResult: c(\"2_1\", \"1_1\", \"3_1\")"
[3] "DEBUG: expectedResult: c(\"Is everything working as expected in my test?\", \"Testing 1, 2, 3.\", \"Is it working?\")"
[4] "DEBUG: expectedResult: c(0.48649, 0.25676, 0.25676)"
[1] "DEBUG: testResult: c(2, 3, 1)"
[2] "DEBUG: testResult: c(\"2_1\", \"3_1\", \"1_1\")"
[3] "DEBUG: testResult: c(\"Is everything working as expected in my test?\", \"Is it working?\", \"Testing 1, 2, 3.\")"
[4] "DEBUG: testResult: c(0.48649, 0.25676, 0.25676)"
[ FAIL 1 | WARN 0 | SKIP 0 | PASS 142 ]
As you can see, the order of elements 2 and 3 is swapped, so the comparison fails. The two swapped sentences have the same score (0.25676). The i386 architecture seems to be the only one affected.
Kind regards, Andreas.
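Since the two swapped rows carry identical scores, one possible way to make such a comparison independent of tie order is to sort both results by rounded score and then by sentenceId before comparing. This is a sketch under that assumption, not the fix adopted by the package; order_ranked is a hypothetical helper, and the data below are copied from the debug output above.

```r
# Hypothetical test helper: sort by score (descending, rounded so that values
# differing only by floating-point noise count as ties), then by sentenceId.
order_ranked <- function(res, digits = 5) {
  res <- res[order(-round(res$value, digits), res$sentenceId), , drop = FALSE]
  rownames(res) <- NULL
  res
}

expected <- data.frame(
  docId = c(2, 1, 3),
  sentenceId = c("2_1", "1_1", "3_1"),
  sentence = c("Is everything working as expected in my test?",
               "Testing 1, 2, 3.",
               "Is it working?"),
  value = c(0.48649, 0.25676, 0.25676),
  stringsAsFactors = FALSE
)
# Same rows as reported on i386, with the tied rows swapped.
observed <- expected[c(1, 3, 2), ]

same_after_sorting <- isTRUE(all.equal(order_ranked(expected), order_ranked(observed)))
```

With a deterministic tie-break applied to both sides, the test no longer depends on which of the equally ranked sentences a given platform happens to return first.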