lexrankr's Issues

Not able to get single top sentence

Pretty sure this is user error. My dataframe contains a large block of text (from a SQL database) as the column contentraw. When I try to pass back the top sentence, I get a mangled mess instead. The desired output is the single top sentence in the document.

What am I doing wrong?

Code:

df <- data.table(dbxSelect(dbxcon, selectarticles))

cleancopy <- function(x, urls = TRUE, hashtags = TRUE)
{
  ## remove obvious crap
  if (urls) {
    x = gsub("\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)", "", x)
  }
  if (hashtags) {
    x = gsub("#\\S+", "", x)
  }
  ## split sentences to new lines
  x = gsub("\\. ", "\\. \n", x)
  #return
  x
  
}

## clean up the column
df$contentraw <- cleancopy(df$contentraw)

## run rank and assign to key
df$keysent <- df[, lexRankr::lexRank(
  contentraw,
  docId = url,
  n = 1,
  continuous = TRUE,
  returnTies = FALSE
),
by = url]

no sentences above threshold needs verbose error

lexRankFromSimil has a threshold option that sets the minimum similarity score an edge needs in order to be included in the graph on which PageRank is run. Currently the user gets no message if no edges are above the threshold, and lexRankFromSimil itself does not raise an error. The error only surfaces later, in the final output step of lexRank.

Need to add a verbose error that describes the issue and suggests an explanation and workarounds.
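
A minimal sketch of the kind of check described here, in base R. The helper name check_edges_above_threshold, the column names s1, s2 and weight, and the strict "greater than" comparison are assumptions for illustration, not lexRankr's actual internals.

```r
# Hypothetical helper: fail early, with an actionable message, when no
# sentence pair is similar enough to form an edge in the graph.
check_edges_above_threshold <- function(simil_df, threshold = 0.2) {
  stopifnot(is.data.frame(simil_df), "weight" %in% names(simil_df))
  n_edges <- sum(simil_df$weight > threshold, na.rm = TRUE)
  if (n_edges == 0) {
    stop(
      "No sentence pairs have similarity above threshold = ", threshold,
      " (max similarity found: ",
      format(max(simil_df$weight, na.rm = TRUE), digits = 3), "). ",
      "Try lowering `threshold` or setting `continuous = TRUE`.",
      call. = FALSE
    )
  }
  invisible(n_edges)
}

# Toy similarity table (made-up values).
simil <- data.frame(s1 = c("1_1", "1_1", "1_2"),
                    s2 = c("1_2", "1_3", "1_3"),
                    weight = c(0.05, 0.10, 0.15),
                    stringsAsFactors = FALSE)

# One pair clears a threshold of 0.1; none clears 0.2, so the second call
# stops and we capture its message.
ok  <- check_edges_above_threshold(simil, threshold = 0.1)
msg <- tryCatch(check_edges_above_threshold(simil, threshold = 0.2),
                error = function(e) conditionMessage(e))
```

Reporting the largest similarity that was found gives the user a concrete value to lower the threshold towards.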

rm dplyr

Convert data.frame operations away from dplyr (to either base R or data.table).

reason: speed and stability

proxyDB error: IDFcosine already in registry

If sentence similarity processing starts and is halted before completion, the "idfcosine" function is not removed from the proxy registry. There is currently no logic to check whether the function is already in the registry before attempting to add idfcosine again, which produces an error.

Add logic to check whether idfcosine is already in the proxy registry before adding it. Create a workaround, perhaps with on.exit logic.
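
One way this could look, sketched with the proxy package's registry methods (entry_exists, set_entry, delete_entry). The wrapper name with_registered_fun and the toy similarity function are hypothetical, and the entry is registered with proxy's default settings rather than whatever lexRankr uses for idfcosine.

```r
library(proxy)

# Hypothetical wrapper: clear any stale entry, register the function, and
# guarantee removal on exit even if `code` errors or is interrupted.
with_registered_fun <- function(name, fun, code) {
  if (pr_DB$entry_exists(name)) {
    pr_DB$delete_entry(name)  # left behind by a previously halted run
  }
  pr_DB$set_entry(FUN = fun, names = name)
  on.exit(
    if (pr_DB$entry_exists(name)) pr_DB$delete_entry(name),
    add = TRUE
  )
  code(name)
}

# Toy cosine function and data, for illustration only.
toy_cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))
m <- matrix(c(1, 0, 1, 1), nrow = 2, byrow = TRUE)

# Simulate a halted run that left the entry in the registry.
pr_DB$set_entry(FUN = toy_cosine, names = "toycosine")

res <- with_registered_fun("toycosine", toy_cosine,
                           function(nm) proxy::dist(m, method = nm))
still_registered <- pr_DB$entry_exists("toycosine")
```

Deleting a stale entry up front handles the "already in registry" case, and on.exit covers the halted-run case that causes it.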

eval data.table

Rewrite the base transformations with data.table. Benchmark and evaluate the speed gain against the cost of introducing a dependency.

Vector memory exhausted on plain text processing

When attempting to digest plain text, I hit:

Parsing text into sentences and tokens...DONE
Calculating pairwise sentence similarities...
Error: vector memory exhausted (limit reached?)

Example is a dataframe with a URL as the first column and the page's text (HTML stripped) as the second column.

I am attempting to return only the highest LexRanked sentence from the block of text, one per article.

Code:

df$summary = lexRankr::lexRank(df$contentraw,
                          docId = df$url,
                          n = 1,
                          continuous = FALSE)

Am I doing this wrong?
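
A rough back-of-the-envelope sketch of why the pairwise similarity step can run out of memory. It assumes one 8-byte double is stored for every unordered pair of sentences across the whole input, which is an assumption about the implementation, and the sentence counts are made up.

```r
# Number of unordered sentence pairs, and the memory needed to hold one
# double-precision similarity per pair.
pairwise_cost <- function(n_sentences) {
  n_pairs <- n_sentences * (n_sentences - 1) / 2
  data.frame(
    n_sentences = n_sentences,
    n_pairs     = n_pairs,
    gib         = n_pairs * 8 / 1024^3
  )
}

cost <- pairwise_cost(c(1000, 10000, 100000))
print(cost)
```

The pair count grows quadratically, so a corpus with ten times as many sentences needs roughly a hundred times as much memory. If this is the bottleneck, ranking each article on its own keeps the sentence count per call small, though whether that suits the idf weighting is a separate question.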

`damping` arg ignored in `lexRankFromSimil`

I just had a look at this package because I recently created a similar package called textrank (https://cran.r-project.org/web/packages/textrank/index.html). This package seems to follow the same approach, although the textrank package starts from something that looks like the output of udpipe, which already contains the sentences and all the tokenised words.
While skimming the code, I noticed that you are not using the damping argument in lexRankFromSimil; maybe that is something to fix.
I'm also interested to hear whether you have found a way to reduce the computational burden of doing many sentence-to-sentence similarity calculations.
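
For context on what a damping argument controls, here is a small base R sketch of damped PageRank by power iteration. The function name page_rank_power and the toy star graph are made up for illustration; this is not the code that lexRankr or igraph actually runs.

```r
# Damped PageRank by power iteration on a small undirected graph.
page_rank_power <- function(adj, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(adj)
  # Column-stochastic transition matrix: column j spreads node j's score
  # evenly over its neighbours (assumes every node has at least one edge).
  trans <- sweep(adj, 2, colSums(adj), "/")
  rank <- rep(1 / n, n)
  for (i in seq_len(max_iter)) {
    new_rank <- (1 - damping) / n + damping * as.vector(trans %*% rank)
    converged <- sum(abs(new_rank - rank)) < tol
    rank <- new_rank
    if (converged) break
  }
  rank
}

# Star graph: node 1 is linked to nodes 2, 3 and 4.
adj <- matrix(0, 4, 4)
adj[1, 2:4] <- 1
adj[2:4, 1] <- 1

r85 <- page_rank_power(adj, damping = 0.85)
r50 <- page_rank_power(adj, damping = 0.50)
```

A lower damping value pulls every score towards the uniform 1/n, so the central node's lead shrinks. If the argument is silently ignored, results stay the same whatever value the caller passes.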

Error in sentenceSimil

I apologize if you have covered this before. I am trying to understand what is causing lexRank to fail in some texts. I am encountering this with a lot of text material.

Here is a reproducible example with the CNN dataset:

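# packages the snippet relies on
library(lexRankr)
library(dplyr)
library(purrr)
library(readr)
library(tibble)  # data_frame()

data_path <- "cnn/stories"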

files <- list.files(data_path, pattern = "story$") 

# files_sample <- sample(files, 30)

file_two <- files[2]

data <- file_two %>%
  map(~ read_lines(file.path(data_path, .))) %>% 
  data_frame()

data <- data %>%
  rename(articles = ".") 

data <- data %>% 
  mutate(doc_id = 1:length(articles))

sent <- data %>%
  pull(articles) %>% 
  as.character()

lexRank(sent)

I am trying to determine the cause of the error. I am encountering it in a large number of articles from different sources, and I would like to know whether it's a pre-processing problem.
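
One pre-processing detail worth checking, offered as a guess rather than a confirmed diagnosis: map() returns a list, so the articles column is a list column, and as.character() on a list deparses each element instead of joining its lines. The base R sketch below uses made-up text to show the effect and one way around it.

```r
# Two "files", each read as a character vector of lines (made-up text).
articles <- list(c("First line.", "Second line."),
                 c("Only line."))

# as.character() on a list deparses multi-line elements into R syntax.
deparsed <- as.character(articles)

# Collapsing each element explicitly keeps one clean string per article.
collapsed <- vapply(articles, paste, character(1), collapse = " ")

print(deparsed)
print(collapsed)
```

In the reproduction only one file is read, so sent would be a single string that begins with c(" and lexRank would see just one document.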

throw more informative error if only 1 document passed

put a stopifnot(length(text) > 1) in the best place

Currently, inputting 1 doc throws the exception below, which isn't very helpful for uncovering the actual issue:

Only one sentence had nonzero tfidf scores.  Similarities would return as NaN
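
A sketch of what a more informative check might look like. The helper name check_doc_count and the wording of the message are assumptions; the suggestion to use sentencesAsDocs relies on that lexRank argument treating each sentence as a document for tf-idf purposes.

```r
# Hypothetical input check with a message that names the actual problem.
check_doc_count <- function(text, docId = NULL) {
  n_docs <- if (is.null(docId)) length(text) else length(unique(docId))
  if (n_docs < 2) {
    stop(
      "lexRank was given ", n_docs, " document(s), but more than one is ",
      "needed for tf-idf based similarities. Pass several documents, or ",
      "set `sentencesAsDocs = TRUE` to treat each sentence as a document.",
      call. = FALSE
    )
  }
  invisible(n_docs)
}

n_ok <- check_doc_count(c("First doc. More text.", "Second doc."))
msg  <- tryCatch(check_doc_count("A single doc. With two sentences."),
                 error = function(e) conditionMessage(e))
```

An explicit stop() with a message is used here instead of a bare stopifnot() so that the user is told what to change.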

Purr

Is there a way to use map() in a pipe with lexrank? Let's say I want to extract a summary sentence from documents collected in a data frame, one article per row.

I guess you would have to unnest_sentences for each row, then create a new table to store the top ranking sentences?
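
The general row-by-row pattern can be sketched in base R with vapply(), which plays the same role here as purrr's map_chr(). The function top_sentence below is a stand-in that simply returns the longest sentence; it is NOT LexRank, and the data frame is made up. A real per-document ranker would go in its place.

```r
# Stand-in scorer (NOT LexRank): split on sentence-ending punctuation and
# return the longest sentence. Swap in a real per-document ranker here.
top_sentence <- function(text) {
  sentences <- trimws(unlist(strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)))
  sentences[which.max(nchar(sentences))]
}

articles <- data.frame(
  url = c("a", "b"),
  contentraw = c("Short one. This sentence is clearly the longest here.",
                 "Only sentence."),
  stringsAsFactors = FALSE
)

# One call per row, one summary string per article.
articles$summary <- vapply(articles$contentraw, top_sentence, character(1),
                           USE.NAMES = FALSE)
```

Because the function returns exactly one string per article, the result fits straight into a new column and no separate table of top sentences is needed.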

Test failure on i386

Hi,
For a couple of days now (possibly due to a new upload of igraph version 1.3.5), the CI test of the Debian package has been failing on the i386 architecture with:

== Failed tests ================================================================
-- Failure ('test-lexRank.R:39'): object out value -----------------------------
`testResult` not equal to `expectedResult`.
Component "docId": Mean relative difference: 1
Component "sentenceId": 2 string mismatches
Component "sentence": 2 string mismatches

[ FAIL 1 | WARN 0 | SKIP 0 | PASS 142 ]

You can see this in the full test log.

I have added some debug code in this patch to visualise the issue:

> test_check("lexRankr")
[1] "DEBUG: expectedResult: c(2, 1, 3)"                                                                                    
[2] "DEBUG: expectedResult: c(\"2_1\", \"1_1\", \"3_1\")"                                                                  
[3] "DEBUG: expectedResult: c(\"Is everything working as expected in my test?\", \"Testing 1, 2, 3.\", \"Is it working?\")"
[4] "DEBUG: expectedResult: c(0.48649, 0.25676, 0.25676)"                                                                  
[1] "DEBUG: testResult: c(2, 3, 1)"                                                                                    
[2] "DEBUG: testResult: c(\"2_1\", \"3_1\", \"1_1\")"                                                                  
[3] "DEBUG: testResult: c(\"Is everything working as expected in my test?\", \"Is it working?\", \"Testing 1, 2, 3.\")"
[4] "DEBUG: testResult: c(0.48649, 0.25676, 0.25676)"                                                                  
[ FAIL 1 | WARN 0 | SKIP 0 | PASS 142 ]

As you can see, the order of elements 2 and 3 is swapped, so the comparison fails. i386 seems to be the only affected architecture.

Kind regards, Andreas.
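
Since the two swapped rows have the same printed score, one possible way to make the comparison independent of platform is to break ties explicitly before comparing. The sketch below rebuilds the result from the debug output; rounding to 5 digits and using docId as the tie-breaker are assumptions, not what the package's test currently does.

```r
# Result as printed in the debug output; sentences 3_1 and 1_1 tie on value.
result <- data.frame(
  docId      = c(2, 3, 1),
  sentenceId = c("2_1", "3_1", "1_1"),
  value      = c(0.48649, 0.25676, 0.25676),
  stringsAsFactors = FALSE
)

# Sort by descending (rounded) value, then by docId, so the row order does
# not depend on how a platform happens to order equal scores.
stable <- result[order(-round(result$value, 5), result$docId), ]
rownames(stable) <- NULL
print(stable)
```

Applying the same ordering to both testResult and expectedResult would let the test compare content rather than incidental row order.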
