Comments (4)
I'm assuming this is related to the way tfidf was being calculated before computing sentence similarity (I did not download the Google Drive file).
The inverse document frequency was being calculated as idf(d, t) = log( n / df(d, t) ); this has a value of 0 if a term is present in every document (which forces the tfidf to 0 as well). Zero as a lower bound doesn't make much sense here, so the idf calculation has been changed and will be updated with the next release to CRAN. I had been meaning to change the calculation, but I forgot to make a note of it. The updated idf calculation has a lower bound of 1: idf(d, t) = log( n / df(d, t) ) + 1
You can see if this fixes your issue by installing from GitHub using devtools::install_github("AdamSpannbauer/lexRankr").
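To illustrate the difference between the two formulas, here is a minimal base R sketch. The corpus size and document frequencies are made up for illustration, and the natural log is an assumption; this is not lexRankr's internal code.

```r
# Hypothetical corpus: 4 documents, with the number of documents
# containing each term (illustrative values only)
n  <- 4
df <- c(the = 4, graph = 2, lexrank = 1)

# Old calculation: a term present in every document gets idf = 0,
# which forces its tfidf to 0 as well
idf_old <- log(n / df)

# Updated calculation: same shape, but with a lower bound of 1
idf_new <- log(n / df) + 1

print(idf_old)
print(idf_new)
```

With the old formula the term found in all 4 documents is zeroed out; with the updated one it keeps a weight of 1, so sentences made up only of very common terms no longer end up with all-zero tfidf scores.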
from lexrankr.
Thank you for looking into this! I installed the dev version from GitHub and I got this error:
Error in sentenceSimil(sentenceId = tokenDf$sentenceId, token = tokenDf$token, :
Only one sentence had nonzero tfidf scores. Similarities would return as NaN
R version 3.4.2 (2017-09-28)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.1
lexRankr_0.4.1
from lexrankr.
In your call to lexRankr::lexRank, the function assumes that each element of the input character vector is a separate document. This is creating some confusion. Here are some options.
#read text
text_lines = trimws(readLines(file_two))
#rm blank lines
text_lines = text_lines[text_lines != '']
#rm lines that == "@highlight" (shown as bad lines during manual inspection)
text_lines = text_lines[text_lines != "@highlight" ]
########################################
# OPTION 1
########################################
#assume end of lines were end of sentences adding period to help parser
collapsed = paste0(text_lines, collapse=". ")
#fix double periods introduced by collapse
collapsed = gsub("..", ".", collapsed, fixed = TRUE)
#call lexrank
lexRankr::lexRank(collapsed)
########################################
# OPTION 2
########################################
#create df. only 1 doc so only one doc id
dt = data.table::data.table(doc_id=1, text_lines=text_lines)
#parse sentences
dt = lexRankr::unnest_sentences(dt, sents, text_lines)
#correct sentence ids (do within doc_id if multiple docs)
#something that needs to be fixed in unnest_sentences function
dt[,sent_id := 1:.N]
#lexrank sentences
ranked = lexRankr::bind_lexrank(dt, sents, doc_id, sent_id)
#extract top 3
ranked[order(-lexrank), ][1:3,]
The 2nd option shows an issue with the unnest_sentences function. I've created #12 to fix the problem.
from lexrankr.
Adam, thanks! It works with bind_lexrank. #12 will help with this!
from lexrankr.