cord19q: COVID-19 Open Research Dataset (CORD-19) Analysis

COVID-19 Open Research Dataset (CORD-19) is a free resource of scholarly articles, aggregated by a coalition of leading research groups, about COVID-19 and the coronavirus family of viruses. The dataset can be found on Semantic Scholar and there is a research challenge on Kaggle.

This project builds an index over the CORD-19 dataset to assist with analysis and data discovery. A series of tasks were explored to identify relevant articles and help find answers to key scientific questions on a number of COVID-19 research topics.

Tasks

The following files show the top query results from this model for each task in the CORD-19 Research Challenge. Each file also includes a highlights section with the most relevant sentences from the query results.

A full overview of how to use this project can be found in this Notebook.

Installation

You can use Git to clone the repository from GitHub and install it. It is recommended to do this in a Python Virtual Environment.

git clone https://github.com/neuml/cord19q.git
cd cord19q
pip install .

Python 3.5+ is supported.

Building a model

Download all the files in the Download CORD-19 section on Semantic Scholar. Go to the directory with the files and run the following commands.

cd <download_path>

For each tar.gz file run the following, where $file is the name of the file with .tar.gz removed.

mkdir $file && tar -C $file -xvzf $file.tar.gz
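
The same extraction step can be scripted when there are several archives. The following is a minimal sketch using only the Python standard library; the function name extract_archives is hypothetical and not part of cord19q.

```python
import tarfile
from pathlib import Path


def extract_archives(download_path):
    """Extract every *.tar.gz in download_path into a directory named after the archive."""
    root = Path(download_path)
    extracted = []
    for archive in sorted(root.glob("*.tar.gz")):
        # Directory name is the archive name with ".tar.gz" removed
        target = root / archive.name[: -len(".tar.gz")]
        target.mkdir(exist_ok=True)
        with tarfile.open(archive, "r:gz") as tar:
            tar.extractall(target)
        extracted.append(target)
    return extracted
```

Calling extract_archives("<download_path>") mirrors running the mkdir and tar command above once per archive.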

Once completed, there should be a file named metadata.csv and a subdirectory for each data subset containing all of its json articles.

To build the model locally:

# Convert csv/json files to SQLite
python -m cord19q.etl <download_path>

# Can optionally use pre-trained vectors
# https://www.kaggle.com/davidmezzetti/cord19-fasttext-vectors#cord19-300d.magnitude
# Default location: ~/.cord19/vectors/cord19-300d.magnitude
python -m cord19q.vectors

# Build embeddings index
python -m cord19q.index

The model will be stored in ~/.cord19.
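
The three build steps can also be chained from a single script. This is a sketch that only wraps the commands listed above with subprocess; the helper names build_commands and build_model are hypothetical and not provided by cord19q.

```python
import subprocess
import sys


def build_commands(download_path):
    """Return the build steps listed above as argument lists, in order."""
    python = sys.executable
    return [
        [python, "-m", "cord19q.etl", download_path],  # convert csv/json files to SQLite
        [python, "-m", "cord19q.vectors"],  # build vectors (pre-trained vectors are an alternative)
        [python, "-m", "cord19q.index"],  # build embeddings index
    ]


def build_model(download_path):
    """Run each step in order and stop at the first failure."""
    for command in build_commands(download_path):
        subprocess.run(command, check=True)
```

With check=True, a failing step raises CalledProcessError, so the index is never built on top of an incomplete database.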

Building a report file

A report file is simply a markdown file created from a list of queries. An example:

python -m cord19q.report tasks/diagnostics.txt

Once complete, a file named tasks/diagnostics.md will be created.

Running queries

The fastest way to run queries is to start a cord19q shell:

cord19q

A prompt will come up. Queries can be typed directly into the console.

Tech Overview

The tech stack is built on Python and creates a sentence embeddings index with FastText + BM25. Background on this method can be found in this Medium article and in codequestion, an existing repository that uses the same method.
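
To illustrate the general idea of combining word vectors with BM25, the sketch below builds a sentence embedding as a BM25-weighted average of word vectors and ranks sentences by cosine similarity. It is not the project's implementation: the 2-d vectors are made-up stand-ins for FastText output, and the function names, parameters (k1, b) and toy corpus are assumptions chosen for illustration.

```python
import math
from collections import Counter


def bm25_weights(corpus, k1=1.2, b=0.75):
    """Return a function that scores each token of a tokenized sentence with BM25."""
    n = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n
    df = Counter(token for doc in corpus for token in set(doc))
    idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def weights(tokens):
        tf = Counter(tokens)
        norm = k1 * (1 - b + b * len(tokens) / avgdl)
        # Tokens unseen in the corpus get a weight of 0
        return {t: idf.get(t, 0.0) * f * (k1 + 1) / (f + norm) for t, f in tf.items()}

    return weights


def embed(tokens, vectors, weights):
    """BM25-weighted average of word vectors; tokens without a vector are skipped."""
    scores = weights(tokens)
    known = [t for t in scores if t in vectors]
    total = sum(scores[t] for t in known)
    dim = len(next(iter(vectors.values())))
    if total == 0:
        return [0.0] * dim
    return [sum(scores[t] * vectors[t][i] for t in known) / total for i in range(dim)]


def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy 2-d vectors standing in for FastText output (hypothetical values)
vectors = {
    "virus": [1.0, 0.0],
    "transmission": [0.9, 0.1],
    "market": [0.0, 1.0],
    "prices": [0.1, 0.9],
    "the": [0.5, 0.5],
}
corpus = [
    ["the", "virus", "transmission"],
    ["the", "market", "prices"],
    ["the", "virus"],
]

weights = bm25_weights(corpus)
index = [embed(doc, vectors, weights) for doc in corpus]
query = embed(["virus", "transmission"], vectors, weights)
best = max(range(len(corpus)), key=lambda i: cosine(query, index[i]))
print(corpus[best])
```

The weighting means common tokens such as "the" contribute little to the sentence vector, while rarer tokens dominate it.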

The model is a combination of the sentence embeddings index and a SQLite database with the articles. Each article is parsed into sentences and stored in SQLite along with the article metadata. FastText vectors are built over the full corpus. The sentence embeddings index only uses COVID-19 related articles, which helps produce more recent and relevant results.
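
As a rough picture of the storage side, the sketch below stores article metadata and one row per sentence in SQLite. The table names, columns and the naive sentence splitter are assumptions made for illustration and do not reflect the schema cord19q actually creates.

```python
import re
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE articles (id TEXT PRIMARY KEY, title TEXT, published TEXT);
    CREATE TABLE sentences (
        article TEXT REFERENCES articles(id),
        position INTEGER,
        text TEXT
    );
""")


def add_article(db, uid, title, published, body):
    """Store article metadata plus one row per sentence; return the sentence count."""
    db.execute("INSERT INTO articles VALUES (?, ?, ?)", (uid, title, published))
    # Naive splitter for illustration: break after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", body.strip()) if s]
    db.executemany(
        "INSERT INTO sentences VALUES (?, ?, ?)",
        [(uid, i, s) for i, s in enumerate(sentences)],
    )
    db.commit()
    return len(sentences)


count = add_article(
    db, "a1", "Example title", "2020-03-01",
    "First sentence. Second sentence? Third one.",
)
rows = db.execute(
    "SELECT a.title, s.text FROM sentences s "
    "JOIN articles a ON a.id = s.article WHERE s.position = 1"
).fetchall()
print(count, rows)
```

Keeping sentences in their own table lets a search result for a single sentence be joined back to its article's metadata.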

Multiple entry points exist to interact with the model.

  • cord19q.report - Builds a markdown report for a series of queries. For each query, the report shows the best articles, the top matches from those articles, and a highlights section with the most relevant sections from the embeddings search.
  • cord19q.query - Runs a single query from the terminal
  • cord19q.shell - Allows running multiple queries from the terminal
