I apologize if you have covered this before. I am trying to understand what is causing

Error in sentenceSimil (lexRankr issue, closed)

adamspannbauer commented on July 24, 2024

Error in sentenceSimil

from lexrankr.

Comments (4)

AdamSpannbauer commented on July 24, 2024

I'm assuming this is related to how tf-idf was being calculated before computing sentence similarity (I did not download the Google Drive file).

The inverse document frequency was being calculated as idf(d, t) = log( n / df(d, t) ), which is 0 for any term that is present in every document (and that forces the tf-idf to 0 as well). A lower bound of zero doesn't make much sense here, so the idf calculation has been changed and the change will go to CRAN with the next release. I had been meaning to change the calculation, but I forgot to make a note of it. The updated idf has a minimum of 1: idf(d, t) = log( n / df(d, t) ) + 1
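To make the difference between the two formulas concrete, here is a small base R sketch. The document count and the per-term document frequencies are made-up toy values, not taken from the reporter's data, and the code only mirrors the formulas quoted above rather than lexRankr's internals.

```r
# Toy setup (hypothetical values): 3 documents, 2 terms.
# "the" appears in every document, "lexrank" in only one.
n_docs <- 3
doc_freq <- c(the = 3, lexrank = 1)

# Previous calculation: log(n / df), which is 0 for a term found in every document
idf_old <- log(n_docs / doc_freq)

# Updated calculation: log(n / df) + 1, which can never drop below 1
idf_new <- log(n_docs / doc_freq) + 1

print(idf_old)
print(idf_new)
```

With the old formula, any tf-idf weight for "the" is multiplied by 0; with the updated one it keeps a weight of at least its term frequency.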

You can check whether this fixes your issue by installing from GitHub with devtools::install_github("AdamSpannbauer/lexRankr").

Monduiz commented on July 24, 2024

Thank you for looking into this! I installed the dev version from GitHub and I got this error:

Error in sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, :
Only one sentence had nonzero tfidf scores. Similarities would return as NaN

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.1
lexRankr_0.4.1
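For readers wondering why the error message mentions NaN: assuming the similarity is a cosine-style measure over tf-idf vectors (an assumption here, not confirmed in this thread), a sentence whose weights are all 0 has zero length, so the ratio becomes 0/0. The sketch below uses a plain cosine function and invented weights; cosine_sim is a hypothetical helper, not a lexRankr function.

```r
# Plain cosine similarity between two numeric vectors (hypothetical helper)
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

nonzero_sentence <- c(0.4, 1.2, 0.0)  # invented tf-idf weights
zero_sentence    <- c(0, 0, 0)        # every term weighted 0

# Numerator and denominator are both 0, so R returns NaN
sim <- cosine_sim(nonzero_sentence, zero_sentence)
print(sim)
```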

AdamSpannbauer commented on July 24, 2024

In your call to lexRankr::lexRank, the function assumes each element of the input character vector is a separate document, so every line of your file is being treated as its own document. That is what is creating the confusion. Here are two options.

#read text
text_lines = trimws(readLines(file_two))
#rm blank lines
text_lines = text_lines[text_lines != '']
#rm lines that == "@highlight" (shown as bad lines during manual inspection)
text_lines = text_lines[text_lines != "@highlight" ]

########################################
# OPTION 1
########################################
#assume each line end is a sentence end; add a period to help the parser
collapsed = paste0(text_lines, collapse=". ")
#fix double periods introduced by collapse
collapsed = gsub("..", ".", collapsed, fixed = TRUE)
#call lexrank
lexRankr::lexRank(collapsed)

########################################
# OPTION 2
########################################
#create df; only 1 doc, so only one doc id
dt = data.table::data.table(doc_id=1, text_lines=text_lines)
#parse sentences
dt = lexRankr::unnest_sentences(dt, sents, text_lines)
#correct sentence ids (do within doc_id if multiple docs)
#something that needs to be fixed in unnest_sentences function
dt[,sent_id := 1:.N]
#lexrank sentences
ranked = lexRankr::bind_lexrank(dt, sents, doc_id, sent_id)
#extract top 3
ranked[order(-lexrank), ][1:3,]

The 2nd option shows an issue with the unnest_sentences function. I've created #12 to fix the problem.
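The comment in option 2 notes that sentence ids should be corrected within doc_id when there are multiple documents. A minimal base R sketch of that renumbering follows; the data frame and its contents are invented for illustration, and only the column names doc_id, sents, and sent_id are borrowed from the code above.

```r
# Hypothetical parsed output: one row per sentence, two documents
sentences <- data.frame(
  doc_id = c(1, 1, 1, 2, 2),
  sents  = c("A one.", "A two.", "A three.", "B one.", "B two."),
  stringsAsFactors = FALSE
)

# Number sentences 1..k within each doc_id rather than 1..N over the whole table
sentences$sent_id <- ave(
  seq_along(sentences$doc_id),
  sentences$doc_id,
  FUN = seq_along
)

print(sentences)
```

With data.table, as used in option 2, the grouped equivalent would be dt[, sent_id := seq_len(.N), by = doc_id].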

Monduiz commented on July 24, 2024

Adam, thanks! It works with bind_lexrank. #12 will help with this!
