Comments (2)
Potential issues to keep in mind:
All further steps are restricted to the documents among the top n document ids. This affects
- the weighted linear model: the co-citation analysis and bibliographic coupling scores of all other documents are None and must be excluded from the weighted linear model
- the hybrid model: the first recommender might select documents that are not available in the data of the second recommender, e.g. because the top-n documents for the citation model might not be within the top-n documents for the language model
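The hybrid model concern above can be sketched as follows. This is only an illustration, not code from readnext: the function name `rerank_candidates` and the plain-dict data layout are hypothetical, and silently dropping candidates that the second recommender has no score for is just one possible way to handle the mismatch.

```python
def rerank_candidates(
    candidate_ids: list[int], second_scores: dict[int, float]
) -> list[int]:
    """Re-rank the first recommender's candidates by the second recommender's scores.

    Assumption: candidates missing from `second_scores` (e.g. because only the
    top n are stored) are dropped rather than raising an error.
    """
    # Keep only candidates the second recommender actually has a score for.
    available = [doc_id for doc_id in candidate_ids if doc_id in second_scores]
    # Highest score first.
    return sorted(available, key=lambda doc_id: second_scores[doc_id], reverse=True)


# Document 2 was selected by the first recommender but is unknown to the second.
reranked = rerank_candidates([1, 2, 3], {1: 0.2, 3: 0.9, 4: 0.5})
```

Dropping unavailable candidates means the hybrid model may return fewer than n recommendations; whether that is acceptable is a design decision the issue leaves open.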
from readnext.
Reopened because each row of the scores dataframe currently stores the scores for every other document.
This means the stored dataframe for 10,000 documents contains 10,000 rows with 10,000 DocumentScore
objects each, which is too large and too slow to read in.
Idea: Again store only the top 100 document scores within each row.
For feature weighting, the ranks are not used directly; instead, the inverse ranks serve as score points, i.e. rank 1 gets 100 points, rank 100 gets one point, and all lower ranks get zero points (score points = 101 - rank for ranks 1 to 100).
Task: When the ranks are computed and documents are looked up by their index, the index might no longer be contained in the row of the scores dataframe (since only the top 100 instead of all documents are stored for each feature).
Modify the lookup such that the score points are set to zero in case of a KeyError
and the computation can continue.
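A minimal sketch of the lookup described above, assuming each feature's stored top 100 is available as a plain mapping from document id to rank. The names `score_points`, `lookup_points` and `weighted_points` are hypothetical and do not reflect the actual readnext API or its DocumentScore objects.

```python
def score_points(rank: int, top_n: int = 100) -> int:
    """Inverse rank as score points: rank 1 -> top_n, rank top_n -> 1, lower ranks -> 0."""
    return max(top_n + 1 - rank, 0)


def lookup_points(stored_ranks: dict[int, int], document_id: int, top_n: int = 100) -> int:
    """Return the score points of a document, or zero if it is not among the stored ranks."""
    try:
        rank = stored_ranks[document_id]
    except KeyError:
        # Document is outside the stored top n for this feature.
        return 0
    return score_points(rank, top_n)


def weighted_points(
    feature_ranks: dict[str, dict[int, int]],
    weights: dict[str, float],
    document_id: int,
) -> float:
    """Weighted sum of score points over all features for one document."""
    return sum(
        weights[feature] * lookup_points(ranks, document_id)
        for feature, ranks in feature_ranks.items()
    )


# Hypothetical example: document 7 is ranked for both features, document 8 for only one.
feature_ranks = {"co_citation": {7: 1, 8: 100}, "bibliographic_coupling": {7: 51}}
weights = {"co_citation": 0.5, "bibliographic_coupling": 0.5}
points_7 = weighted_points(feature_ranks, weights, 7)
points_8 = weighted_points(feature_ranks, weights, 8)
```

Catching the KeyError inside the lookup keeps the weighting loop free of special cases: a missing document simply contributes zero points for that feature.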