
readnext's Introduction

Welcome ☀️

I'm Joel, a professional Data Scientist with a master's degree in Applied Statistics.

  • ๐Ÿ‘จโ€๐Ÿ’ป In my day job, I provide business value from data by building Data Pipelines and deploying Machine Learning models
  • ๐Ÿ“ฆ Staying up-to-date with the community, I enjoy developing open-source packages and applications
  • โš™๏ธ Besides Data Science, I am passionate about Automation Workflows, Developer Tooling and Web Development

👇 Check out my projects and do not hesitate to reach out to me on LinkedIn


readnext's Issues

Unify Data Structure of precomputed Data Frames

Currently, some files are saved as pandas DataFrames with one column, some with two columns, and some as pandas Series.

The preferred data structure is a pandas DataFrame with a single column and an index named document_id.
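
A minimal sketch of the target layout (the column name "scores" is illustrative, not prescribed by this issue):

import pandas as pd

# Preferred layout: a single data column plus an index named "document_id"
frame = pd.DataFrame(
    {"scores": [0.8, 0.3, 0.5]},
    index=pd.Index([1001, 1002, 1003], name="document_id"),
)

# An existing pandas Series can be converted into this layout:
series = pd.Series([0.8, 0.3, 0.5], index=[1001, 1002, 1003], name="scores")
frame_from_series = series.rename_axis("document_id").to_frame()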

Implement TF-IDF and BM25 Algorithms from Scratch

  • Share functionality between them as much as possible
  • Output is a Document Term Matrix with one row per document and one column per vocabulary term; the number of columns thus equals the number of distinct words (stopwords excluded) in the learned vocabulary (see the sketch below)
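
As a starting point, a minimal from-scratch TF-IDF sketch that builds such a matrix; it assumes pre-tokenized documents with stopwords already removed, and uses one common IDF variant among several. BM25 could reuse document_term_counts() and add its own weighting:

import math
from collections import Counter

def document_term_counts(documents: list[list[str]]) -> tuple[list[str], list[Counter]]:
    # Shared preprocessing for TF-IDF and BM25: per-document term counts
    # plus the sorted vocabulary of all distinct terms
    counts = [Counter(document) for document in documents]
    vocabulary = sorted({term for document in documents for term in document})
    return vocabulary, counts

def inverse_document_frequency(term: str, counts: list[Counter]) -> float:
    document_frequency = sum(term in count for count in counts)
    return math.log(len(counts) / document_frequency)

def tfidf_matrix(documents: list[list[str]]) -> list[list[float]]:
    # One row per document, one column per vocabulary term
    vocabulary, counts = document_term_counts(documents)
    matrix = []
    for count in counts:
        document_length = sum(count.values())
        row = [
            (count[term] / document_length) * inverse_document_frequency(term, counts)
            for term in vocabulary
        ]
        matrix.append(row)
    return matrix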

Reduce duplicated code of pytest fixtures

Research ways of combining parametrized fixtures (via the params argument of pytest.fixture) with parametrized tests (via the pytest.mark.parametrize decorator) to reduce code duplication.

Currently, covering all combinations of e.g. Citation/Language and Seen/Unseen variants requires four fixtures with almost identical code, which bloats the codebase and makes maintenance more difficult and time-consuming.
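
For illustration, one parametrized fixture per axis already collapses the four near-identical fixtures into two (fixture names here are hypothetical):

import pytest

@pytest.fixture(params=["citation", "language"])
def model_kind(request: pytest.FixtureRequest) -> str:
    return request.param

@pytest.fixture(params=["seen", "unseen"])
def paper_kind(request: pytest.FixtureRequest) -> str:
    return request.param

def test_attribute_getter(model_kind: str, paper_kind: str) -> None:
    # Collected four times: citation/seen, citation/unseen,
    # language/seen, language/unseen
    assert model_kind in {"citation", "language"}
    assert paper_kind in {"seen", "unseen"}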

Speed up loading precomputed data during inference

Right now, the entire precomputed pickle files are loaded during inference, which is slow and will get even slower once the full data set is used for precomputation.

In principle, only a single row (the row for the query document) has to be loaded during inference.
Explore ideas on how to do that, e.g. by choosing a different storage format than pickle.

If there is no good way to achieve faster loading, use the UnseenPaperAttributeGetter for every paper during inference, even seen ones.
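
One candidate for a different storage format is Parquet, since its pyarrow engine supports predicate pushdown: only row groups matching the filter are read from disk. A sketch with a hypothetical file path and query id, assuming document_id is stored as a regular column:

import pandas as pd

query_row = pd.read_parquet(
    "data/cosine_similarities.parquet",
    filters=[("document_id", "==", 13756489)],
)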

Language Models

Implement Language Models to measure cosine similarity between papers.

The Models belong to three categories:

  1. Keyword Based Models
  2. Static Text Embeddings
  3. Contextual Text Embeddings

For the contextual embeddings, use the SciBERT model from the transformers library in two variations:

  • With the original BERT vocabulary, i.e. scibert-basevocab-uncased
  • With a vocabulary built from a training corpus of SemanticScholar papers, i.e. scibert-scivocab-uncased
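
A minimal loading sketch via the transformers library; on the Hugging Face hub the two variants are published as allenai/scibert_scivocab_uncased and allenai/scibert_basevocab_uncased:

from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")

# Token-level contextual embeddings for one (illustrative) sentence
inputs = tokenizer("Attention is all you need.", return_tensors="pt")
outputs = model(**inputs)
token_embeddings = outputs.last_hidden_state  # shape: (1, num_tokens, 768)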

Speed up embeddings computation

The compute_embeddings() method of the embedder classes is currently the bottleneck for scaling up to larger datasets.

Currently, all embedders compute embeddings document by document, adhering to a common interface with a compute_embedding_single_document() method.
This does not leverage the fast batched matrix multiplications of gensim or transformers models.

Consider adjusting the common interface to allow batch processing while keeping the connection between document id and the corresponding embedding.
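
A rough sketch of what such a batched interface could look like for the BERT-based embedders; the function name, signature and mean-pooling aggregation are assumptions, not the codebase's API. Keeping the document ids in batch order preserves the id-to-embedding mapping:

import torch
from transformers import BertModel

def compute_embeddings_batch(
    model: BertModel,
    document_ids: list[int],
    token_ids: torch.Tensor,       # shape: (batch_size, sequence_length)
    attention_mask: torch.Tensor,  # shape: (batch_size, sequence_length)
) -> dict[int, torch.Tensor]:
    # One forward pass over the whole batch instead of batch_size separate calls
    with torch.no_grad():
        outputs = model(input_ids=token_ids, attention_mask=attention_mask)
    # Mean-pool the token embeddings into one vector per document
    embeddings = outputs.last_hidden_state.mean(dim=1)
    # Reattach the document ids in batch order
    return dict(zip(document_ids, embeddings))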

Add Documentation Section `Reproducibility`

How can the existing scripts within the package be used to reproduce the results of my thesis?

Build the Dataset

  • Run scripts in data directory

Run the Models

  • Run scripts in modeling directory

Validate the Scoring

  • Run scripts in evaluation directory

Parallelize Filling Dictionary

The first attempt to parallelize filling a large dictionary with costly computations did not succeed due to massive overhead.

The code was as follows for BERTEmbedder.compute_embeddings():

from joblib import Parallel, delayed

def compute_embeddings_mapping(
    self,
    tokens_tensor_mapping: DocumentsTokensTensorMapping,
    aggregation_strategy: AggregationStrategy = AggregationStrategy.mean,
) -> DocumentEmbeddingsMapping:
    embeddings_mapping = {}

    # Define a helper function to be parallelized.
    # It returns the dictionary key and value as a tuple instead of a dictionary.
    def compute_single_document(
        document_id: int,
        tokens_tensor: DocumentsTokensTensor,
        aggregation_strategy: AggregationStrategy = AggregationStrategy.mean,
    ) -> tuple[int, DocumentEmbeddings]:
        return document_id, self.compute_embeddings_single_document(
            tokens_tensor, aggregation_strategy
        )

    with setup_progress_bar() as progress_bar:
        # Parallelize the helper function using joblib
        results: list[tuple[int, DocumentEmbeddings]] = Parallel(n_jobs=-1)(
            delayed(compute_single_document)(document_id, tokens_tensor, aggregation_strategy)
            for document_id, tokens_tensor in progress_bar.track(
                tokens_tensor_mapping.items(), total=len(tokens_tensor_mapping)
            )
        )

        # Fill keys and values into a dictionary
        for document_id, embeddings in results:
            embeddings_mapping[document_id] = embeddings

    return embeddings_mapping

Setup Project Structure

Root files

  • pyproject.toml
  • .gitignore
  • .env
  • LICENSE.md with MIT License
  • README.md
  • .pre-commit-config.yaml

Root Directories

  • readnext package directory
  • tests test directory
  • .github/workflows directory for GitHub Actions

Evaluation Metrics - AP and MAP

Implement the Average Precision (AP) metric for a single recommendation list.

The Mean Average Precision (MAP) for multiple recommendation lists is then simply the mean value of the corresponding Average Precision scores.
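
A compact from-scratch implementation under this definition, with binary relevance labels ranked best match first:

def average_precision(labels: list[int]) -> float:
    # labels: binary relevance of the ranked recommendations, best-ranked first
    num_relevant_total = sum(labels)
    if num_relevant_total == 0:
        return 0.0
    precision_sum = 0.0
    num_relevant_seen = 0
    for rank, label in enumerate(labels, start=1):
        if label:
            num_relevant_seen += 1
            # Precision at this rank, accumulated only at relevant positions
            precision_sum += num_relevant_seen / rank
    return precision_sum / num_relevant_total

def mean_average_precision(label_lists: list[list[int]]) -> float:
    # MAP: mean of the AP scores over multiple recommendation lists
    return sum(average_precision(labels) for labels in label_lists) / len(label_lists)

# Relevant items at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 = 5/6
assert abs(average_precision([1, 0, 1]) - 5 / 6) < 1e-9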

Rich Progress bars

Use the rich library for progress bars instead of tqdm because

  • Pretty Colors ✅
  • Better Progress Display ✅
  • Following modern best practices ✅
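
The switch can be as small as replacing tqdm's iterable wrapper with rich.progress.track:

import time
from rich.progress import track

# track() wraps any iterable, just like tqdm; the sleep is a stand-in for real work
for step in track(range(100), description="Processing..."):
    time.sleep(0.01)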

Add Readme

  • Introduction of the master's thesis project

  • Theoretical description of how the hybrid recommender works and what problem it solves

  • Installation instructions with pdm

  • Description of the repo structure

  • Setup to run the scripts, with instructions on how to download the pretrained word2vec and fasttext models

Store only top 100 document matches instead of all scores in a dataframe

  • Right now, e.g. the precomputed pairwise cosine similarities are stored in an N×N matrix for N documents in the training corpus. This scales quadratically in computation time and memory.
  • Instead, store only the top 50 documents with the highest score for each query document. To keep the document id as well as the score, this could be implemented as a one-column dataframe (sketched below) where
    • the index contains the document ids of the training documents
    • the single column contains lists with the top-50 document matches. Each match could be represented as a DocumentScore dataclass with the fields document_id: int and score: float, where score can be e.g. the cosine similarity, the number of common citations/references, etc.
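
A sketch of that layout with made-up ids and scores, truncated to two matches per document for readability:

from dataclasses import dataclass

import pandas as pd

@dataclass
class DocumentScore:
    document_id: int
    score: float  # e.g. cosine similarity or common-citation count

# Index: query document ids; single column: ranked top matches
top_matches = pd.DataFrame(
    {
        "scores": [
            [DocumentScore(document_id=42, score=0.93), DocumentScore(document_id=7, score=0.88)],
            [DocumentScore(document_id=13, score=0.71), DocumentScore(document_id=42, score=0.65)],
        ]
    },
    index=pd.Index([1001, 1002], name="document_id"),
)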

Add Documentation Section `Development`

  • How can users substitute or add features to the citation model?
  • How can users add a different language model? => tokenizer + embedder
  • How can users use a different metric to evaluate the results?

Remove Windows from CI Pipeline

The project uses pandas version 2.0.0.

Since this version is not yet supported on Windows, make GitHub Actions run on Ubuntu and macOS only.

Citation Models

  • Implement co-citation analysis and bibliographic coupling (a sketch of both counts follows after this list). There are two possible ways to structure the training process:

    1. Compute all pairwise counts during training and store them in a large N×N matrix, where N is the number of documents in the training data. Note that this scales quadratically in computation time and memory!
      In this case training is slow, but inference is very fast (a simple lookup).
    2. Compute the N counts for a single document during inference. In this case there is no training stage; all computation takes place during inference, which makes the inference stage slow!
  • Combine the co-citation analysis ranks and bibliographic coupling ranks with the global metadata features into a single feature matrix per input document, with a corresponding binary 0/1 label vector (irrelevant/relevant recommendation)
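
For reference, both counts reduce to set intersections once each document's citing papers and references are available as id sets (function names here are illustrative):

def co_citation_count(citing_a: set[int], citing_b: set[int]) -> int:
    # Co-citation: number of papers that cite both documents
    return len(citing_a & citing_b)

def bibliographic_coupling_count(references_a: set[int], references_b: set[int]) -> int:
    # Bibliographic coupling: number of references shared by both bibliographies
    return len(references_a & references_b)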

Explore option to pass multiple query documents as input

The output is still a single list of recommended documents. The computational costs during inference increase only marginally for additional seen query papers, but more significantly for additional unseen query papers.

The most important aspect of the implementation is that any precomputed data frame is only loaded once into memory and then used for all query documents at once!

The only meaningful way to aggregate scores for all query documents is to first build individual rankings for each query document as usual, then sum the scores for each feature separately and select the top candidate documents according to the combined ranks.

Note that the set of documents with positive scores is now likely greater than 100 for each feature, which is not an issue though.
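
A possible aggregation sketch, assuming each query document contributes a pandas Series that maps candidate document_id to a score for one feature:

import pandas as pd

def combined_ranking(per_query_scores: list[pd.Series], top_n: int = 100) -> pd.Series:
    # Summing aligns on document_id; fill_value=0 keeps candidates that only
    # receive a positive score for some of the query documents
    total = per_query_scores[0]
    for scores in per_query_scores[1:]:
        total = total.add(scores, fill_value=0)
    return total.sort_values(ascending=False).head(top_n)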

Set up an inference pipeline for a new test document

Currently, the pipeline only works for query documents that were part of the training corpus.

For a new test document, all steps must be included: adding feature columns, computing embeddings, computing pairwise scores against all documents in the training corpus, and generating and scoring recommendations.

Build Dataset

Data Import

Use the following sources:

  • The D3 dataset as a foundation with

    • document id and title
    • author ids and names
    • document publication year
    • document citation count
    • author citation counts
    • document abstracts
    • arxiv id if available
  • This Kaggle dataset with arxiv categories as labels. Merge to documents by the arxiv id.

  • Use the Semantic Scholar API to get citation information, i.e. the citing papers (papers that cite the input document / incoming links) and the referenced papers (bibliography / outgoing links). Merge to documents by the unique Semantic Scholar URL. See the example request below.
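
For illustration, the corresponding Semantic Scholar Graph API endpoints for a single paper (the paper id below is illustrative):

import requests

paper_id = "649def34f8be52c8b66281af98ae884c09aef38b"
base_url = f"https://api.semanticscholar.org/graph/v1/paper/{paper_id}"

# Citing papers (incoming links) and referenced papers (outgoing links)
citations = requests.get(f"{base_url}/citations", timeout=10).json()
references = requests.get(f"{base_url}/references", timeout=10).json()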

Data Preprocessing and Feature Engineering

  • Clean data from all three sources, add rank columns for global document features and merge into a single dataset.
