
readnext's Introduction

Welcome ☀️

I'm Joel, a professional Data Scientist with a master's degree in Applied Statistics.

  • ๐Ÿ‘จโ€๐Ÿ’ป In my day job, I provide business value from data by building Data Pipelines and deploying Machine Learning models
  • ๐Ÿ“ฆ Staying up-to-date with the community, I enjoy developing open-source packages and applications
  • โš™๏ธ Besides Data Science, I am passionate about Automation Workflows, Developer Tooling and Web Development

👇 Check out my projects and do not hesitate to reach out to me on LinkedIn


readnext's Issues

Unify Data Structure of precomputed Data Frames

Currently, some files are saved as pandas DataFrames with one column, some with two columns, and some as pandas Series.

The preferred data structure is a pandas DataFrame with a single column and an index named document_id.
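
A minimal sketch of the target layout (the column name "scores" is illustrative, not prescribed by this issue):

import pandas as pd

# Preferred layout: a single data column plus an index named "document_id"
frame = pd.DataFrame(
    {"scores": [0.8, 0.3, 0.5]},
    index=pd.Index([1001, 1002, 1003], name="document_id"),
)

# An existing pandas Series can be converted into this layout:
series = pd.Series([0.8, 0.3, 0.5], index=[1001, 1002, 1003], name="scores")
frame_from_series = series.rename_axis("document_id").to_frame()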

Implement TF-IDF and BM25 Algorithms from Scratch

  • Share functionality between them as much as possible
  • Output is a Document Term Matrix with one row per document and one column per vocabulary term; the number of columns thus equals the number of distinct words (stopwords excluded) in the learned vocabulary (see the sketch below)
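
As a starting point, a minimal from-scratch TF-IDF sketch that builds such a matrix; it assumes pre-tokenized documents with stopwords already removed, and uses one common IDF variant among several. BM25 could reuse document_term_counts() and add its own weighting:

import math
from collections import Counter

def document_term_counts(documents: list[list[str]]) -> tuple[list[str], list[Counter]]:
    # Shared preprocessing for TF-IDF and BM25: per-document term counts
    # plus the sorted vocabulary of all distinct terms
    counts = [Counter(document) for document in documents]
    vocabulary = sorted({term for document in documents for term in document})
    return vocabulary, counts

def inverse_document_frequency(term: str, counts: list[Counter]) -> float:
    document_frequency = sum(term in count for count in counts)
    return math.log(len(counts) / document_frequency)

def tfidf_matrix(documents: list[list[str]]) -> list[list[float]]:
    # One row per document, one column per vocabulary term
    vocabulary, counts = document_term_counts(documents)
    matrix = []
    for count in counts:
        document_length = sum(count.values())
        row = [
            (count[term] / document_length) * inverse_document_frequency(term, counts)
            for term in vocabulary
        ]
        matrix.append(row)
    return matrix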

Reduce duplicated code of pytest fixtures

Research ways of combining parametrized fixtures (via the params argument of pytest.fixture) with parametrized tests (via the pytest.mark.parametrize decorator) to reduce code duplication.

Currently, covering all combinations of e.g. Citation/Language and Seen/Unseen variants requires four fixtures with almost identical code, which bloats the codebase and makes maintenance more difficult and time-consuming.
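
For illustration, one parametrized fixture per axis already collapses the four near-identical fixtures into two (fixture names here are hypothetical):

import pytest

@pytest.fixture(params=["citation", "language"])
def model_kind(request: pytest.FixtureRequest) -> str:
    return request.param

@pytest.fixture(params=["seen", "unseen"])
def paper_kind(request: pytest.FixtureRequest) -> str:
    return request.param

def test_attribute_getter(model_kind: str, paper_kind: str) -> None:
    # Collected four times: citation/seen, citation/unseen,
    # language/seen, language/unseen
    assert model_kind in {"citation", "language"}
    assert paper_kind in {"seen", "unseen"}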

Speed up loading precomputed data during inference

Right now, the entire precomputed pickle files are loaded during inference, which is slow and will get even slower once the full data set is used for precomputation.

In principle, only a single row (the row for the query document) has to be loaded during inference.
Explore ideas on how to do that, e.g. by choosing a different storage format than pickle.

If there is no good way to achieve faster loading, use the UnseenPaperAttributeGetter for every paper during inference, even seen ones.
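
One candidate for a different storage format is Parquet, since its pyarrow engine supports predicate pushdown: only row groups matching the filter are read from disk. A sketch with a hypothetical file path and query id, assuming document_id is stored as a regular column:

import pandas as pd

query_row = pd.read_parquet(
    "data/cosine_similarities.parquet",
    filters=[("document_id", "==", 13756489)],
)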

Language Models

Implement Language Models to measure cosine similarity between papers.

The Models belong to three categories:

  1. Keyword Based Models
  2. Static Text Embeddings
  3. Contextual Text Embeddings

For the contextual embeddings, use the SciBERT model from the transformers library in two variations:

  • With the original BERT vocabulary, i.e. scibert-basevocab-uncased
  • With a vocabulary built from a training corpus of SemanticScholar papers, i.e. scibert-scivocab-uncased
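
A minimal loading sketch via the transformers library; on the Hugging Face hub the two variants are published as allenai/scibert_scivocab_uncased and allenai/scibert_basevocab_uncased:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

# Token-level contextual embeddings for one (illustrative) sentence
inputs = tokenizer("Attention is all you need.", return_tensors="pt")
outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state  # shape: (1, num_tokens, 768)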

Speed up embeddings computation

The compute_embeddings() method of the embedder classes is currently the bottleneck for scaling up to larger datasets.

Currently, all embedders compute embeddings document by document, adhering to a common interface with a compute_embedding_single_document() method.
This does not leverage the fast batched matrix multiplications of gensim or transformers models.

Consider adjusting the common interface to allow batch processing while keeping the connection between document id and the corresponding embedding.
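
A rough sketch of what such a batched interface could look like for the BERT-based embedders; the function name, signature and mean-pooling aggregation are assumptions, not the codebase's API. Keeping the document ids in batch order preserves the id-to-embedding mapping:

import torch
from transformers import BertModel

def compute_embeddings_batch(
    model: BertModel,
    document_ids: list[int],
    token_ids: torch.Tensor,       # shape: (batch_size, sequence_length)
    attention_mask: torch.Tensor,  # shape: (batch_size, sequence_length)
) -> dict[int, torch.Tensor]:
    # One forward pass over the whole batch instead of batch_size separate calls
    with torch.no_grad():
        outputs = model(input_ids=token_ids, attention_mask=attention_mask)
    # Mean-pool the token embeddings into one vector per document
    embeddings = outputs.last_hidden_state.mean(dim=1)
    # Reattach the document ids in batch order
    return dict(zip(document_ids, embeddings))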

Add Documentation Section `Reproducibility`

How can the existing scripts within the package be used to reproduce the results of my thesis?

Build the Dataset

  • Run scripts in data directory

Run the Models

  • Run scripts in modeling directory

Validate the Scoring

  • Run scripts in evaluation directory

Parallelize Filling Dictionary

The first attempt to parallelize filling a large dictionary with costly computations did not succeed due to massive overhead.

The code was as follows for BERTEmbedder.compute_embeddings():

from joblib import Parallel, delayed

def compute_embeddings_mapping(
    self,
    tokens_tensor_mapping: DocumentsTokensTensorMapping,
    aggregation_strategy: AggregationStrategy = AggregationStrategy.mean,
) -> DocumentEmbeddingsMapping:
    embeddings_mapping = {}

    # Define a helper function to be parallelized.
    # It returns the dictionary key and value as a tuple instead of a dictionary.
    def compute_single_document(
        document_id: int,
        tokens_tensor: DocumentsTokensTensor,
        aggregation_strategy: AggregationStrategy = AggregationStrategy.mean,
    ) -> tuple[int, DocumentEmbeddings]:
        return document_id, self.compute_embeddings_single_document(
            tokens_tensor, aggregation_strategy
        )

    with setup_progress_bar() as progress_bar:
        # Parallelize the helper function using joblib
        results: list[tuple[int, DocumentEmbeddings]] = Parallel(n_jobs=-1)(
            delayed(compute_single_document)(document_id, tokens_tensor, aggregation_strategy)
            for document_id, tokens_tensor in progress_bar.track(
                tokens_tensor_mapping.items(), total=len(tokens_tensor_mapping)
            )
        )

        # Fill keys and values into a dictionary
        for document_id, embeddings in results:
            embeddings_mapping[document_id] = embeddings

    return embeddings_mapping

Setup Project Structure

Root files

  • pyproject.toml
  • .gitignore
  • .env
  • LICENSE.md with MIT License
  • README.md
  • .pre-commit-config.yaml

Root Directories

  • readnext package directory
  • tests test directory
  • .github/workflows directory for GitHub Actions

Evaluation Metrics - AP and MAP

Implement the Average Precision (AP) metric for a single recommendation list.

The Mean Average Precision (MAP) for multiple recommendation lists is then simply the mean value of the corresponding Average Precision scores.
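
A compact from-scratch implementation under this definition, with binary relevance labels ranked best match first:

def average_precision(labels: list[int]) -> float:
    # labels: binary relevance of the ranked recommendations, best-ranked first
    num_relevant_total = sum(labels)
    if num_relevant_total == 0:
        return 0.0
    precision_sum = 0.0
    num_relevant_seen = 0
    for rank, label in enumerate(labels, start=1):
        if label:
            num_relevant_seen += 1
            # Precision at this rank, accumulated only at relevant positions
            precision_sum += num_relevant_seen / rank
    return precision_sum / num_relevant_total

def mean_average_precision(label_lists: list[list[int]]) -> float:
    # MAP: mean of the AP scores over multiple recommendation lists
    return sum(average_precision(labels) for labels in label_lists) / len(label_lists)

# Relevant items at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 = 5/6
assert abs(average_precision([1, 0, 1]) - 5 / 6) < 1e-9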

Rich Progress bars

Use the rich library for progress bars instead of tqdm because

  • Pretty Colors ✅
  • Better Progress Display ✅
  • Following modern best practices ✅
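
The switch can be as small as replacing tqdm's iterable wrapper with rich.progress.track:

import time
from rich.progress import track

# track() wraps any iterable, just like tqdm; the sleep is a stand-in for real work
for step in track(range(100), description="Processing..."):
    time.sleep(0.01)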

Add Readme

  • Introduction of the master's thesis project

  • Theoretical description of how the hybrid recommender works and what problem it solves

  • Installation instructions with pdm

  • Description of the repo structure

  • Setup to run the scripts, with instructions on how to download the pretrained word2vec and fasttext models

Store only top 100 document matches instead of all scores in a dataframe

  • Right now, e.g. the precomputed pairwise cosine similarities are stored in an N×N matrix for N documents in the training corpus. This scales quadratically in computation time and memory.
  • Instead, store only the top 50 documents with the highest score for each query document. To keep the document id as well as the score, this could be implemented as a one-column dataframe (sketched below) where
    • the index contains the document ids of the training documents
    • the single column contains lists with the top-50 document matches. Each match could be represented as a DocumentScore dataclass with the fields document_id: int and score: float, where score can be e.g. the cosine similarity, the number of common citations/references, etc.
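
A sketch of that layout with made-up ids and scores, truncated to two matches per document for readability:

from dataclasses import dataclass

import pandas as pd

@dataclass
class DocumentScore:
    document_id: int
    score: float  # e.g. cosine similarity or common-citation count

# Index: query document ids; single column: ranked top matches
top_matches = pd.DataFrame(
    {
        "scores": [
            [DocumentScore(document_id=42, score=0.93), DocumentScore(document_id=7, score=0.88)],
            [DocumentScore(document_id=13, score=0.71), DocumentScore(document_id=42, score=0.65)],
        ]
    },
    index=pd.Index([1001, 1002], name="document_id"),
)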

Add Documentation Section `Development`

  • How can users substitute or add features to the citation model?
  • How can users add a different language model? => tokenizer + embedder
  • How can users use a different metric to evaluate the results?

Remove Windows from CI Pipeline

The project uses pandas version 2.0.0.

Since this version is not yet supported on Windows, make GitHub Actions run on Ubuntu and macOS only.

Citation Models

  • Implement co-citation analysis and bibliographic coupling (a sketch of both counts follows after this list). There are two possible ways to structure the training process:

    1. Compute all pairwise counts during training and store them in a large N×N matrix, where N is the number of documents in the training data. Note that this scales quadratically in computation time and memory!
      In this case training is slow, but inference is very fast (a simple lookup).
    2. Compute the N counts for a single document during inference. In this case there is no training stage; all computation takes place during inference, which makes the inference stage slow!
  • Combine the co-citation analysis ranks and bibliographic coupling ranks with the global metadata features into a single feature matrix per input document, with a corresponding binary 0/1 label vector (irrelevant/relevant recommendation)
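
For reference, both counts reduce to set intersections once each document's citing papers and references are available as id sets (function names here are illustrative):

def co_citation_count(citing_a: set[int], citing_b: set[int]) -> int:
    # Co-citation: number of papers that cite both documents
    return len(citing_a & citing_b)

def bibliographic_coupling_count(references_a: set[int], references_b: set[int]) -> int:
    # Bibliographic coupling: number of references shared by both bibliographies
    return len(references_a & references_b)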

Explore option to pass multiple query documents as input

The output is still a single list of recommended documents. The computational costs during inference increase only marginally for additional seen query papers, but more significantly for additional unseen query papers.

The most important aspect of the implementation is that any precomputed data frame is only loaded once into memory and then used for all query documents at once!

The only meaningful way to aggregate scores for all query documents is to first build individual rankings for each query document as usual, then sum the scores for each feature separately and select the top candidate documents according to the combined ranks.

Note that the set of documents with positive scores is now likely greater than 100 for each feature, which is not an issue though.
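
A possible aggregation sketch, assuming each query document contributes a pandas Series that maps candidate document_id to a score for one feature:

import pandas as pd

def combined_ranking(per_query_scores: list[pd.Series], top_n: int = 100) -> pd.Series:
    # Summing aligns on document_id; fill_value=0 keeps candidates that only
    # receive a positive score for some of the query documents
    total = per_query_scores[0]
    for scores in per_query_scores[1:]:
        total = total.add(scores, fill_value=0)
    return total.sort_values(ascending=False).head(top_n)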

Set up an inference pipeline for a new test document

Currently, the pipeline only works for query documents that were part of the training corpus.

For a new test document, all steps must be included: adding feature columns, computing embeddings, computing pairwise scores against all documents in the training corpus, and generating and scoring recommendations.

Build Dataset

Data Import

Use the following sources:

  • The D3 dataset as a foundation with

    • document id and title
    • author ids and names
    • document publication year
    • document citation count
    • author citation counts
    • document abstracts
    • arxiv id if available
  • This Kaggle dataset with arxiv categories as labels. Merge to documents by the arxiv id.

  • Use the Semantic Scholar API to get citation information, i.e. the citing papers (papers that cite the input document / incoming links) and the referenced papers (bibliography / outgoing links). Merge to documents by the unique Semantic Scholar URL. See the example request below.
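
For illustration, the corresponding Semantic Scholar Graph API endpoints for a single paper (the paper id below is illustrative):

import requests

paper_id = "649def34f8be52c8b66281af98ae884c09aef38b"
base_url = f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}"

# Citing papers (incoming links) and referenced papers (outgoing links)
citations = requests.get(f"{base_url}/citations", timeout=10).json()
references = requests.get(f"{base_url}/references", timeout=10).json()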

Data Preprocessing and Feature Engineering

  • Clean data from all three sources, add rank columns for global document features and merge into a single dataset.
