medisia's Introduction


Medical Search Engine


Text Technologies for Data Science 2020-21


MediSia is an open-source search engine with a Python back-end built on Gensim and Django and a user-friendly front-end built with Angular. It indexes medical papers from Unpaywall to promote open-access research.

Why?

With the coronavirus pandemic, more and more people have used the web to expand their medical knowledge. At the same time more and more inaccurate or fake information has been produced. The aim of the project is to facilitate access to reputable sources of medical knowledge and promote open-access research.

Contributions

  • A new search engine (SE) system for retrieving medical and biological literature-based information available on the Internet across multiple sources, for users both with and without previous medical or scientific training.

  • A system capable of taking a simple (or complex) query specification, depending on the user's background, and matching it to a topic or set of topics commonly used in the medical and biological fields for faster processing and search speed.

  • A literature retrieval database powered by a topical web crawler and an indexing strategy that exploits medical and biological domain knowledge to filter out irrelevant search results.

How it works

Below is a diagram that illustrates the general search process and the different types of searches available:

Overview of search retrieval system.

Distinction across three search functionalities

For a ranked search this is the main logic:

  • Create a TF-IDF vectorizer object
  • Use the fit-transform method to represent each document in the index of interest as a weighted TF-IDF vector and save it into an object for use during query processing
  • Similarly, retrieve the TF-IDF vector representation of the query using the transform method
  • Compute the cosine similarity score for the selected query against all the document vectors
  • Compile a results array ranking documents with respect to the query by score
  • Acquire the top 300 results to allow for follow-up search, if desired, and return the top 10 results to the user by pagination
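The steps above can be sketched without any third-party dependency. This is a minimal illustration, not the project's code: the whitespace tokenizer, the smoothed IDF formula and all function names (`fit_idf`, `tfidf_vector`, `ranked_search`, `paginate`) are assumptions made for the example, and the three documents are made up.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(text, idf):
    """Represent text as an L2-normalised sparse TF-IDF vector (a dict)."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors (a dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def ranked_search(query, doc_vectors, idf, keep=300):
    """Return up to `keep` (doc_id, score) pairs with non-zero score, best first."""
    q = tfidf_vector(query, idf)
    scored = [(i, cosine(q, d)) for i, d in enumerate(doc_vectors)]
    scored = [(i, s) for i, s in scored if s > 0]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:keep]


def paginate(results, page, page_size=10):
    """Slice one page (1-indexed) out of the ranked results."""
    start = (page - 1) * page_size
    return results[start:start + page_size]


# Made-up documents, for illustration only.
docs = [
    "coronavirus vaccine trial results",
    "forest soil nutrient cycling",
    "coronavirus transmission in hospitals",
]
idf = fit_idf(docs)
doc_vectors = [tfidf_vector(d, idf) for d in docs]
results = ranked_search("coronavirus vaccine", doc_vectors, idf)
first_page = paginate(results, page=1)
```

Keeping the 300 best results in memory means a follow-up search or a later page can be served from `results` without scoring the index again.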

Screenshots

How to run

Environment Setup

We use conda as an environment manager. Simply run:

conda env create -f environment.yml

Activate the environment with:

conda activate medisia

Alternatively you can use the requirements.txt to install the required packages.

Front-End Setup

The front-end requires npm and Angular. Once installed simply

cd /ui/search-frontend/

and then run

npm install .

Dataset Download

This project requires you to download the dataset from Unpaywall and process it using unpaywall_process.py. The script also has an option to download abstracts from journals using bioRxiv, PubMed and Crossref. If nothing is found, it will try to use Selenium to download the abstract.

This data should be placed in the data folder django_backend/medical_ir/data/paper

Please make sure you read the functions thoroughly. Once you have commented and uncommented the functions you need, simply run:

python django_backend/medical_ir/index/unpaywall_process.py 

Make sure you have enough space. Unpaywall may take up 200+ GB uncompressed.

BOW and TFIDF Model

To create your custom models, simply run:

python django_backend/medical_ir/index/tfidf_vectorizer.py

This should create the models in the folder django_backend/medical_ir/data/search_utils. The documents will be vectorized and saved in the Matrix Market format used by Gensim (this might take a while).

ML query classifier

We use a machine learning model to classify the query and reduce the number of documents to search. The query is classified into one of 27 different classes (categories, topics) of medical journals (full list here: https://github.com/universvm/MediSia/blob/main/django_backend/medical_ir/data/journals/journals_categories.txt). The model learns to associate terms in the title and abstract of a paper with specific categories. To train a model run:

python django_backend/medical_ir/query_classifier/train_classifiers.py

This trains the following scikit-learn models: MultinomialNB(), GaussianNB(), BernoulliNB() and SGDClassifier(loss="log"). GaussianNB was the slowest of them all and was removed. The performance of each remaining model is shown below.

Metric          MultinomialNB   BernoulliNB   SGD
Accuracy        0.38            0.32          0.33
Top3 Accuracy   0.61            0.52          0.52
Top5 Accuracy   0.72            0.63          0.64
Precision       0.39            0.38          0.36
Recall          0.39            0.32          0.33
Time (s)        1.14            2.76          1.1

Time refers to the time taken to classify 260K articles.

The top5 accuracy indicates that 5 indices can be searched at once and still retrieve relevant results.
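For readers unfamiliar with the metric, here is a minimal sketch of how a top-k accuracy can be computed. The function name `top_k_accuracy`, the labels and the ranked predictions are all made up for illustration; they are not the project's evaluation code or data.

```python
def top_k_accuracy(true_labels, ranked_predictions, k):
    """Fraction of samples whose true label is among the k best-ranked predictions."""
    hits = sum(
        1
        for truth, ranked in zip(true_labels, ranked_predictions)
        if truth in ranked[:k]
    )
    return hits / len(true_labels)


# Hypothetical classifier output: categories sorted from most to least likely.
true_labels = ["virology", "medicine", "forestry", "botany"]
ranked_predictions = [
    ["virology", "medicine", "biology"],
    ["biology", "medicine", "virology"],
    ["botany", "ecology", "forestry"],
    ["medicine", "virology", "biology"],
]

top1 = top_k_accuracy(true_labels, ranked_predictions, k=1)  # 1 of 4 correct
top3 = top_k_accuracy(true_labels, ranked_predictions, k=3)  # 3 of 4 correct
```

A query counts as correctly classified at top-k if the right category appears anywhere in the k highest-scoring classes, which is why the score rises as more indices are searched.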

Running the search engine

The search engine uses Redis to cache results to make the search faster and Django to communicate with the front end. To run the search engine run:

cd /ui/search-frontend/
redis-server & python ../../django_backend/manage.py runserver 8000 & ng serve

Optimisation and Metrics

To optimise search, each paper was classified into 1 of 27 categories using the journal name (full list here: https://github.com/universvm/MediSia/blob/main/django_backend/medical_ir/data/journals/journals_categories.txt). These were compiled manually from Wikipedia.

Some indices, for example medicine, contained about 5M documents, which significantly slowed down searches in that category. We therefore split the index into smaller sub-indices called "shards". These take advantage of multiprocessing, which searches through several of them simultaneously. This was done for the "all" category (i.e. an index that includes all the papers), which is used for the deep search. This reduced the time to search the term "coronavirus" as a Deep Search from 67.8 to 24.2 seconds.
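The sharding idea can be sketched as follows: split the documents into shards, take the best results from each shard concurrently, and merge the partial results into one global ranking. This is an illustrative sketch only. The function names and the toy scoring function are assumptions, and a thread pool is used here to keep the example self-contained; a process-based pool, as the project's use of multiprocessing suggests, would need picklable top-level functions instead of the lambda.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def make_shards(doc_ids, shard_size):
    """Split a list of document ids into consecutive shards."""
    return [doc_ids[i:i + shard_size] for i in range(0, len(doc_ids), shard_size)]


def search_shard(shard, score_fn, top_k):
    """Score every document in one shard and keep its top_k (score, doc_id) pairs."""
    return heapq.nlargest(top_k, ((score_fn(doc_id), doc_id) for doc_id in shard))


def search_all_shards(shards, score_fn, top_k, executor):
    """Search shards concurrently, then merge partial results into a global top_k."""
    partials = executor.map(lambda shard: search_shard(shard, score_fn, top_k), shards)
    merged = [pair for partial in partials for pair in partial]
    return heapq.nlargest(top_k, merged)


def toy_score(doc_id):
    """Toy scoring function standing in for a cosine similarity lookup."""
    return (doc_id * 37) % 101


shards = make_shards(list(range(1000)), shard_size=250)
with ThreadPoolExecutor(max_workers=4) as pool:
    best = search_all_shards(shards, toy_score, top_k=5, executor=pool)
```

Because each shard returns its own top_k, the merged list is guaranteed to contain the global top_k, so sharding changes the speed of the search but not its results.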

Deep Search

Searching for the term "Coronavirus" as a Deep Search (i.e. through 13M+ papers) takes 24.2 seconds, plus an additional 14 seconds to paginate and display the results to the user in the front-end.

A cached query takes about 1 second to be retrieved.
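The caching pattern behind this speed-up can be illustrated with a cache-aside lookup. The sketch below is an assumption about the general approach, not the project's implementation: `InMemoryCache` is a stand-in exposing only `get` and `set` (the redis-py client offers methods with the same names), and the key format and `cached_search` helper are hypothetical.

```python
import json


class InMemoryCache:
    """Tiny stand-in for a key-value store, exposing only get and set."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


def cached_search(query, search_type, cache, search_fn):
    """Return (results, was_cached); run search_fn only on a cache miss."""
    key = f"search:{search_type}:{query.strip().lower()}"
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit), True
    results = search_fn(query)
    cache.set(key, json.dumps(results))
    return results, False


calls = []


def slow_search(query):
    """Hypothetical expensive search; records each call so misses can be counted."""
    calls.append(query)
    return [{"doc_id": 1, "score": 0.9}]


cache = InMemoryCache()
first, first_cached = cached_search("Coronavirus", "deep", cache, slow_search)
second, second_cached = cached_search("coronavirus ", "deep", cache, slow_search)
```

Normalising the query before building the key lets trivially different spellings of the same query share one cache entry.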

Topic Search

The speed of a topic search depends on the topic (i.e. some topics have more indexed papers than others, e.g. medicine > forestry). Searching for the term "coronavirus" in the category "virology" takes 3.847 seconds.

Magic Search

The magic search reduces the number of indices to search (compared to a deep search) by classifying the query and computing similarities against the top 3 indices returned by the classifier. Searching for "coronavirus" takes about 5.04 seconds.
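A minimal sketch of this flow, under stated assumptions: `fake_classifier` and `fake_index_search` are hypothetical stand-ins for the trained query classifier and the per-category indices, and the scores and paper ids are made up.

```python
def top_categories(category_scores, n=3):
    """Return the n categories with the highest classifier score."""
    ranked = sorted(category_scores.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, _ in ranked[:n]]


def magic_search(query, classify_fn, search_index_fn, n_indices=3, top_k=10):
    """Search only the indices the classifier considers most likely."""
    categories = top_categories(classify_fn(query), n_indices)
    hits = []
    for category in categories:
        hits.extend(search_index_fn(category, query))
    hits.sort(key=lambda hit: hit[1], reverse=True)
    return categories, hits[:top_k]


def fake_classifier(query):
    """Hypothetical classifier: maps a query to per-category scores."""
    return {"virology": 0.6, "medicine": 0.25, "biology": 0.1, "forestry": 0.05}


def fake_index_search(category, query):
    """Hypothetical per-category search returning (paper_id, similarity) pairs."""
    data = {
        "virology": [("paper-a", 0.91), ("paper-b", 0.40)],
        "medicine": [("paper-c", 0.75)],
        "biology": [("paper-d", 0.10)],
        "forestry": [("paper-e", 0.99)],
    }
    return data[category]


categories, hits = magic_search("coronavirus", fake_classifier, fake_index_search)
```

In this toy data the forestry index is never searched, even though it holds a high-scoring paper: that is the trade-off of a magic search, which gains speed by trusting the classifier to pick the right indices.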

Python Style

We used PEP8 and Black (https://black.readthedocs.io/en/stable/), as well as type hints and docstrings.

medisia's People

Contributors

aimeeredbond, dependabot[bot], universvm, walkkker

medisia's Issues

Create categories of scientific journals

Categories of scientific journals may be important if we want to:

  • search by topics (say Botany, or Psychiatry) or
  • L2R query into topics and then extract the correct index.

Tag articles with categories

All articles should be tagged with a category derived from their journal. This could be a class or a simple function.

It's crucial that this function checks whether the articles are duplicated or not in our final index.
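One possible shape for such a function is sketched below. Everything here is an assumption made for illustration: the record fields (`doi`, `journal`), the use of the DOI as the duplicate key, the `tag_articles` name, the fallback category and the sample mapping.

```python
def tag_articles(articles, journal_to_category, default="uncategorised"):
    """Attach a category to each article and drop duplicates by DOI."""
    seen = set()
    tagged = []
    for article in articles:
        key = article["doi"].strip().lower()
        if key in seen:
            continue  # Already indexed: skip the duplicate record.
        seen.add(key)
        journal = article["journal"].strip().lower()
        category = journal_to_category.get(journal, default)
        tagged.append({**article, "category": category})
    return tagged


# Hypothetical journal-to-category mapping and records.
journal_to_category = {"journal of virology": "virology"}
articles = [
    {"doi": "10.1000/abc", "journal": "Journal of Virology"},
    {"doi": "10.1000/ABC", "journal": "Journal of Virology"},
    {"doi": "10.1000/xyz", "journal": "Unknown Journal"},
]
tagged = tag_articles(articles, journal_to_category)
```

Normalising both the DOI and the journal name before comparing avoids missing duplicates or categories because of case or whitespace differences.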

Implement follow up search

I.e. search within results. Ideally this should add filters to an already existing list of results.

While the user is scrolling through the page, the backend can create some indexes.

Implement vectorization with TFIDF

As per textbook Intro To IR page 123, we should implement search by doing cosine similarity. All the documents need to be vectorized with TFIDF.

A bag-of-words approach can be used to reduce the number of terms. A sparse matrix can be used to represent the documents; instead of 0/1 entries, it can hold TF-IDF values. https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

The search could then be a cosine similarity between the query and the texts. All of this is provided for free by scikit learn:

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances

Search could just be something like:

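def tfidf_search(q):
    # tfidf_vectoriser is assumed to be a fitted TfidfVectorizer and
    # datasetTFIDF the matrix it produced for the documents.
    queryTFIDF = tfidf_vectoriser.transform([q])
    cosine_similarities = cosine_similarity(queryTFIDF, datasetTFIDF).flatten()
    return cosine_similarities.argsort()[::-1]  # document indices, best match first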

Implement topic search

This search can skip the classification of the query entirely and just search into one index

Move all cleaning functions into a class

Given that the cleaning functions should be consistent for index formation and query analysis, there should be a function or class in a separate file that standardizes it.
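A minimal sketch of what such a shared class could look like. The class name `TextCleaner`, the cleaning rules (lowercasing, punctuation stripping, stopword removal) and the stopword list are assumptions chosen for illustration, not the project's actual pipeline.

```python
import re


class TextCleaner:
    """One cleaning pipeline shared by index formation and query analysis."""

    _non_alnum = re.compile(r"[^a-z0-9\s]")

    def __init__(self, stopwords=()):
        self.stopwords = frozenset(stopwords)

    def clean(self, text):
        """Lowercase, replace punctuation with spaces, drop stopwords; return tokens."""
        text = self._non_alnum.sub(" ", text.lower())
        return [tok for tok in text.split() if tok not in self.stopwords]


cleaner = TextCleaner(stopwords={"the", "of", "in"})
doc_tokens = cleaner.clean("Transmission of the Coronavirus (COVID-19) in hospitals.")
query_tokens = cleaner.clean("coronavirus, COVID-19")
```

Because documents and queries go through the same `clean` method, a query token can only fail to match a document token when the words really differ, not because the two sides were normalised differently.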
