Comments (4)

AdamSpannbauer commented on July 24, 2024

I'm assuming this is related to the way tfidf was being calculated before sentence similarity is computed (I did not download the Google Drive file).

The inverse document frequency was being calculated as idf(d, t) = log( n / df(d, t) ), which is 0 when a term is present in every document (and that forces the tfidf to 0 as well). Zero doesn't make much sense as a lower bound here, so the idf calculation has been changed and will be updated with the next release to CRAN. I had been meaning to change it, but I forgot to make a note to do it. The updated idf calculation has a lower bound of 1: idf(d, t) = log( n / df(d, t) ) + 1
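
To make the difference between the two formulas concrete, here is a minimal base R sketch. It is not lexRankr's internal code: the toy corpus, the variable names (docs, doc_freq, idf_old, idf_new) and the use of the natural log are all assumptions made for illustration.

```r
# Hypothetical toy corpus: 3 already-tokenised documents.
# "the" appears in every document.
docs <- list(
  c("the", "cat", "sat"),
  c("the", "dog", "ran"),
  c("the", "cat", "ran")
)
n <- length(docs)

# Document frequency: how many documents contain each term.
terms <- sort(unique(unlist(docs)))
doc_freq <- vapply(
  terms,
  function(term) sum(vapply(docs, function(d) term %in% d, logical(1))),
  numeric(1)
)

# Old calculation: 0 for a term that is present in every document.
idf_old <- log(n / doc_freq)
# Updated calculation: same values shifted up, so the lower bound is 1.
idf_new <- log(n / doc_freq) + 1

print(rbind(doc_freq, idf_old, idf_new))
```

With the old formula the weight for "the" is exactly 0, so any sentence made up only of such terms ends up with an all-zero tfidf vector; with the shifted formula every term keeps a weight of at least 1.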

You can check whether this fixes your issue by installing from GitHub using devtools::install_github("AdamSpannbauer/lexRankr").

from lexrankr.

Monduiz commented on July 24, 2024

Thank you for looking into this! I installed the dev version from GitHub and I got this error:

Error in sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, :
Only one sentence had nonzero tfidf scores. Similarities would return as NaN

R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.1
lexRankr_0.4.1

AdamSpannbauer commented on July 24, 2024

In your call to lexRankr::lexRank, the function assumes that each element of the input character vector is a separate document. That assumption is what is causing the confusion. Here are some options.

#read text
text_lines = trimws(readLines(file_two))
#rm blank lines
text_lines = text_lines[text_lines != '']
#rm lines that == "@highlight" (shown as bad lines during manual inspection)
text_lines = text_lines[text_lines != "@highlight" ]

########################################
# OPTION 1
########################################
#assume each line ends a sentence; add a period to help the parser
collapsed = paste0(text_lines, collapse=". ")
#fix double periods introduced by collapse
collapsed = gsub("..", ".", collapsed, fixed = TRUE)
#call lexrank
lexRankr::lexRank(collapsed)

########################################
# OPTION 2
########################################
#create df. only 1 doc so only one doc id
dt = data.table::data.table(doc_id=1, text_lines=text_lines)
#parse sentences
dt = lexRankr::unnest_sentences(dt, sents, text_lines)
#correct sentence ids (do within doc_id if multiple docs)
#something that needs to be fixed in unnest_sentences function
dt[,sent_id := 1:.N]
#lexrank sentences
ranked = lexRankr::bind_lexrank(dt, sents, doc_id, sent_id)
#extract top 3
ranked[order(-lexrank), ][1:3,]

The 2nd option shows an issue with the unnest_sentences function. I've created #12 to fix the problem.

Monduiz commented on July 24, 2024

Adam, thanks! It works with bind_lexrank. #12 will help with this!
