nikolapeja6 / gi-proj-cnnscorevariants

School project for the Computational Genomics (2018/2019) course, which is a part of the Master studies at the School of Electrical Engineering, University of Belgrade.

Home Page: https://cgc.sbgenomics.com/public/apps#nikolapeja6/gatk-cnnscorevariants-commit/gatk4-cnnscorevariants/

Wrapping CNNScoreVariants in CWL with benchmarking

This school project was created for the Computational Genomics (2018/2019) course, which is part of the Master's studies at the School of Electrical Engineering, University of Belgrade. The full project statement is given below, along with an overview of the results.

After the project was finished, the CNNScoreVariants app was published on the Cancer Genomics Cloud platform in the Public Apps gallery.

Authors: Nikola Pejić and Dušan Đorić

Mentor: Vladimir Kovačević

Structure of the repository

The whole "code" of the project is located in the apps folder, and consists of the following files:

  • gi-gatk4-CNNScoreVariants.cwl - the CNNScoreVariants command with all of its parameters wrapped in CWL

  • VariantFiltration-simple.cwl - a simple version of the VariantFiltration command where only one filtering expression can be set, wrapped in CWL

  • CNNScoreVariants-with-VariantFiltration.cwl - a simple CWL workflow consisting of the two previous commands, where the output of the first is passed to the second

In order to open and edit the files, we recommend the Rabix Composer application.

The project statement and benchmarking results are located in the docs folder. It also contains the images folder where the images referenced by this README.md are stored.

Project statement

The project task consists of wrapping the GATK4 CNNScoreVariants deep-learning-based tool for variant filtering in CWL and running it on test samples. The project will be done on the Cancer Genomics Cloud platform, where anyone with an academic email address can register and receive $300 of free credit. The adaptation (wrapping) and running should be done in a similar way to that described in the quickstart and in the tutorials (pages: YOUR TOOL, TOOL WRAPPING TIPS AND TRICKS, RUN AN ANALYSIS). The Docker image for the tool is available at images.sbgenomics.com/vladimirk/gatk:4.1.0.0, and the tool can be run from it with the command:

/gatk/gatk CNNScoreVariants

All parameters of the tool need to be supported (added as inputs) so that they are adjustable in the task. This part of the project is considered successfully completed when a filtering task finishes on the CGC platform with an input VCF and BAM file from the Public reference files. (10 points)

The second part of the project assignment is benchmarking the filtering results. The tool will be run on the set of provided samples with an available truth set (HG001-HG007 + CHM11-CHM13). For all of these samples, precision, recall and f-score should be calculated using the provided VCF Benchmark workflow, truth-set VCFs and confidence BED regions for each of the samples. The precision, recall and f-score results for all samples, for filtered and non-filtered variants, should be added to the comparison table, and bar diagrams should be created. (15 points)
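For reference, the three metrics named in the statement can be computed from per-sample counts of true positives, false positives and false negatives. The sketch below is only an illustration: it assumes that "f-score" means the F1 score (the harmonic mean of precision and recall), and the `Counts` class and function names are hypothetical, not taken from the VCF Benchmark workflow.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Hypothetical container for the benchmark counts of one sample."""
    true_positives: int   # called variants that are in the truth set
    false_positives: int  # called variants that are not in the truth set
    false_negatives: int  # truth-set variants that were not called


def precision(c: Counts) -> float:
    """Fraction of called variants that are correct."""
    called = c.true_positives + c.false_positives
    return c.true_positives / called if called else 0.0


def recall(c: Counts) -> float:
    """Fraction of truth-set variants that were called."""
    truth = c.true_positives + c.false_negatives
    return c.true_positives / truth if truth else 0.0


def f_score(c: Counts) -> float:
    """F1 score: harmonic mean of precision and recall (assumed definition)."""
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0
```

Returning 0.0 when a denominator is zero is a convention chosen here to keep the sketch total; a benchmarking tool may report such cases differently.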

Create slides (Google Slides or PowerPoint presentation) summarizing the work done, and record a 5-10 minute video (uploading it to YouTube is optional) presenting it to a potential audience. (15 points)

The project will be done in cooperation with Vladimir Kovacevic, who will be kept informed of the progress of the project and will answer any relevant questions or concerns.

Results

After experimenting with the command for some time, it turned out that the CNNScoreVariants command does not filter variants, but only annotates them by adding a new INFO annotation called CNN_1D with the scores. So, the original project statement was altered to include finding a way to filter variants so that the resulting f-score was higher than the original one.

The VariantFiltration command was selected (and wrapped in CWL) to perform the filtering. The filtering expression was CNN_1D < <val>, where <val> is a float threshold that was varied during testing. The measurement results can be seen in CNNScoreVariants - Variant filtering.xlsx located in the docs folder.
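To make the effect of the expression concrete, here is a minimal sketch in plain Python that imitates it on a single VCF line: a record whose CNN_1D value is below the threshold gets a filter name in its FILTER column, and the rest are marked PASS. This is not how GATK implements VariantFiltration. The function names, the `CNN_1D_low` filter name and the handling of records without a CNN_1D annotation (they are left unfiltered) are assumptions made for illustration.

```python
from typing import Optional


def cnn_1d_score(info: str) -> Optional[float]:
    """Return the CNN_1D value from a VCF INFO string, or None if absent."""
    for entry in info.split(";"):
        key, _, value = entry.partition("=")
        if key == "CNN_1D" and value:
            return float(value)
    return None


def apply_threshold(vcf_line: str, threshold: float,
                    filter_name: str = "CNN_1D_low") -> str:
    """Mark one VCF line as filtered when its CNN_1D score is below threshold."""
    if vcf_line.startswith("#"):
        return vcf_line  # header lines are passed through unchanged
    fields = vcf_line.rstrip("\n").split("\t")
    score = cnn_1d_score(fields[7])  # column 8 of a VCF record is INFO
    if score is not None and score < threshold:
        existing = fields[6]  # column 7 of a VCF record is FILTER
        if existing in (".", "PASS"):
            fields[6] = filter_name
        else:
            fields[6] = existing + ";" + filter_name
    elif fields[6] == ".":
        # Assumption of this sketch: unscored or passing records become PASS.
        fields[6] = "PASS"
    return "\t".join(fields)
```

With a threshold of -6.0, a record carrying `CNN_1D=-7.5` would be marked `CNN_1D_low`, while one carrying `CNN_1D=1.2` would be marked `PASS`.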

In short, we found that the resulting precision grows as the threshold increases, as shown in the chart below.

Precision chart

However, that causes the recall to shrink.

Recall chart

That is why we focused on the f-score, which peaks when <val> is around -6.0.

F-score chart
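The search for the best threshold can be sketched as a simple sweep: for each candidate value, drop the calls that score below it, recompute the f-score, and keep the value that scores highest. The code below is an assumption-laden illustration, not the procedure used for the spreadsheet. It takes hypothetical `(score, is_true_variant)` pairs as input, uses F1 as the f-score, and ignores truth-set variants that were never called, which would add the same number of false negatives at every threshold.

```python
from typing import Iterable, List, Tuple

Call = Tuple[float, bool]  # (CNN_1D score, whether the call is in the truth set)


def f1_after_filtering(calls: List[Call], threshold: float) -> float:
    """F1 score of the calls that survive the filter CNN_1D < threshold."""
    # Calls scoring below the threshold are filtered out; the rest are kept.
    tp = sum(1 for score, is_true in calls if is_true and score >= threshold)
    fp = sum(1 for score, is_true in calls if not is_true and score >= threshold)
    fn = sum(1 for score, is_true in calls if is_true and score < threshold)
    denominator = 2 * tp + fp + fn
    # 2*TP / (2*TP + FP + FN) is algebraically equal to the harmonic mean
    # of precision and recall.
    return 2 * tp / denominator if denominator else 0.0


def best_threshold(calls: List[Call],
                   thresholds: Iterable[float]) -> Tuple[float, float]:
    """Return (threshold, f1) for the candidate with the highest F1."""
    scored = ((t, f1_after_filtering(calls, t)) for t in thresholds)
    return max(scored, key=lambda pair: pair[1])
```

Raising the threshold removes false positives (precision goes up) but also starts removing true variants (recall goes down), which is why the f-score has a peak somewhere in between.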

Workflow

As mentioned above, the final version of the app (on which the benchmarking was done) was a workflow consisting of the CNNScoreVariants and VariantFiltration commands, where the output of the first is passed on to the second. A graphical representation of the workflow is shown below.

Workflow diagram
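Outside of CWL, the same two-step chain can be pictured as two command lines in which the output file of the first becomes the input of the second. The sketch below only builds the argument lists and does not run anything. The file names, function names and the `CNN_1D_low` filter name are hypothetical, and the flags (`-V`, `-R`, `-O`, `--filter-name`, `--filter-expression`) are written as recalled from the GATK 4.x documentation, so they should be checked against the wrapped CWL files before use.

```python
from typing import List

GATK = "/gatk/gatk"  # launcher path quoted in the project statement


def cnn_score_variants_cmd(vcf: str, reference: str, out_vcf: str) -> List[str]:
    """Argument list for annotating a VCF with CNN_1D scores."""
    return [GATK, "CNNScoreVariants", "-V", vcf, "-R", reference, "-O", out_vcf]


def variant_filtration_cmd(vcf: str, out_vcf: str, threshold: float,
                           filter_name: str = "CNN_1D_low") -> List[str]:
    """Argument list for filtering on the expression CNN_1D < threshold."""
    return [GATK, "VariantFiltration", "-V", vcf, "-O", out_vcf,
            "--filter-name", filter_name,
            "--filter-expression", f"CNN_1D < {threshold}"]


def workflow_cmds(vcf: str, reference: str, threshold: float,
                  annotated: str = "annotated.vcf",
                  filtered: str = "filtered.vcf") -> List[List[str]]:
    """Both steps in order; the annotated VCF links the first to the second."""
    return [cnn_score_variants_cmd(vcf, reference, annotated),
            variant_filtration_cmd(annotated, filtered, threshold)]
```

Each list could be handed to `subprocess.run` inside the Docker image. In the CWL workflow this wiring is expressed by connecting the output port of the first step to the input port of the second.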

Presentation

A recording of the presentation of the project was made and is available on YouTube.

Presentation of the project on YouTube
