Genomic Language Model

This is an implementation of ULMFiT for genomic sequence data. The model architecture is based on AWD-LSTM.

The approach uses three training phases to produce a classification model:

  1. Train a language model on a large, unlabeled corpus
  2. Fine-tune the language model on the classification corpus
  3. Use the fine-tuned language model to initialize a classification model

This is useful for genetic analysis, where unlabeled data is abundant and labeled data is scarce. The approach makes it possible to train a model on a large, unlabeled genomic corpus in an unsupervised fashion. The pre-trained language model then serves as a feature extractor for parsing genomic data.

Results

Promoter Classification

E. coli promoters

The method performs well at distinguishing promoter sequences from random sections of the genome. Unsupervised pre-training and fine-tuning have a clear impact on the performance of the classification model: accuracy rises from 0.834 for the naive model to 0.973 with genomic ensemble pre-training.

| Model | Accuracy | Precision | Recall | Correlation Coefficient |
| --- | --- | --- | --- | --- |
| Naive | 0.834 | 0.847 | 0.816 | 0.670 |
| E. coli Genome Pre-Training | 0.919 | 0.941 | 0.893 | 0.839 |
| Genomic Ensemble Pre-Training | 0.973 | 0.980 | 0.966 | 0.947 |
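
For reference, the metrics reported in the table above can be computed from confusion-matrix counts. The sketch below assumes binary labels and assumes that "Correlation Coefficient" refers to the Matthews correlation coefficient (MCC). The function name and the example counts are illustrative and are not taken from the repository.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient (assumed meaning of "Correlation Coefficient").
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "mcc": mcc,
    }


# Made-up counts, for illustration only.
print(classification_metrics(tp=90, fp=10, tn=80, fn=20))
```

MCC uses all four cells of the confusion matrix, so it stays informative when the positive and negative classes are unbalanced.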

Dataset

Notebook Directory

Classification performance on human promoters is competitive with published results.

Human Promoters (short)

Results for the short promoter sequences, using data from "Recognition of Prokaryotic and Eukaryotic Promoters using Convolutional Deep Learning Neural Networks":

| Model | DNA Size | kmer/stride | Accuracy | Precision | Recall | Correlation Coefficient | Specificity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Kh et al. | -200/50 | - | - | - | 0.9 | 0.89 | 0.98 |
| Naive Model | -200/50 | 5/2 | 0.80 | 0.74 | 0.80 | 0.59 | 0.80 |
| With Pre-Training | -200/50 | 5/2 | 0.922 | 0.963 | 0.849 | 0.844 | 0.976 |
| With Pre-Training and Fine Tuning | -200/50 | 5/2 | 0.977 | 0.959 | 0.989 | 0.955 | 0.969 |
| With Pre-Training and Fine Tuning | -200/50 | 5/1 | 0.990 | 0.983 | 0.995 | 0.981 | 0.987 |
| With Pre-Training and Fine Tuning | -200/50 | 3/1 | 0.995 | 0.992 | 0.996 | 0.991 | 0.994 |
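
The kmer/stride column presumably describes how each sequence is split into tokens: windows of k bases, with the window start advancing by the stride. A minimal sketch under that assumption follows. The function name `kmer_tokenize` is hypothetical and is not the repository's tokenizer.

```python
def kmer_tokenize(sequence, k, stride):
    """Split a DNA string into k-mers, moving `stride` bases between tokens."""
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    # Trailing bases that do not fill a complete k-mer are dropped.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


print(kmer_tokenize("ATGCGTAC", k=5, stride=2))  # ['ATGCG', 'GCGTA']
print(kmer_tokenize("ATGCGTAC", k=3, stride=1))  # ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC']
```

With stride 1, consecutive tokens overlap by k-1 bases, so the same sequence yields more tokens than with stride 2.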

Data Source

Notebook Directory

Human Promoters (long)

Results for the long promoter sequences, using data from "PromID: Human Promoter Prediction by Deep Learning":

| Model | DNA Size | Models | Accuracy | Precision | Recall | Correlation Coefficient |
| --- | --- | --- | --- | --- | --- | --- |
| Umarov et al. | -1000/500 | 2 Model Ensemble | - | 0.636 | 0.802 | 0.714 |
| Umarov et al. | -200/400 | 2 Model Ensemble | - | 0.769 | 0.755 | 0.762 |
| Naive Model | -500/500 | Single Model | 0.858 | 0.877 | 0.772 | 0.708 |
| With Pre-Training | -500/500 | Single Model | 0.888 | 0.90 | 0.824 | 0.770 |
| With Pre-Training and Fine Tuning | -500/500 | Single Model | 0.892 | 0.877 | 0.865 | 0.778 |
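
The DNA Size column appears to give the window taken around each transcription start site (TSS), for example -500/500 for 500 bp upstream through 500 bp downstream. The sketch below illustrates that reading only. The function name `tss_window` is hypothetical, and the code ignores strand orientation and the exact coordinate convention of the source datasets, both of which are assumptions to check against the data.

```python
def tss_window(genome, tss_index, upstream, downstream):
    """Return the bases from `upstream` before the TSS to `downstream` after it."""
    start = tss_index - upstream
    end = tss_index + downstream
    if start < 0 or end > len(genome):
        raise ValueError("window extends past the sequence boundary")
    return genome[start:end]


# Toy sequence: 600 A's followed by 600 C's, with the TSS placed at index 600.
genome = "A" * 600 + "C" * 600
window = tss_window(genome, tss_index=600, upstream=500, downstream=500)
print(len(window))  # 1000
```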

Notebook Directory

Other Bacterial Promoters

This table shows results on data from "Recognition of Prokaryotic and Eukaryotic Promoters using Convolutional Deep Learning Neural Networks".

| Method | Organism | Training Examples | Accuracy | Precision | Recall | Correlation Coefficient | Specificity |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Kh et al. | E. coli | 2936 | - | - | 0.90 | 0.84 | 0.96 |
| ULMFiT | E. coli | 2936 | 0.956 | 0.917 | 0.880 | 0.871 | 0.977 |
| Kh et al. | B. subtilis | 1050 | - | - | 0.91 | 0.86 | 0.95 |
| ULMFiT | B. subtilis | 1050 | 0.905 | 0.857 | 0.789 | 0.759 | 0.95 |

Notebook Directory

Metagenomics Classification

ULMFiT shows improved performance on the metagenomics taxonomic dataset from "Deep learning models for bacteria taxonomic classification of metagenomic data".

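| Method | Data Source | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- |
| Fiannaca et al. | Amplicon | 0.9137 | 0.9162 | 0.9137 | 0.9126 |
| ULMFiT | Amplicon | 0.9239 | 0.9402 | 0.9332 | 0.9306 |
| Fiannaca et al. | Shotgun | 0.8550 | 0.8570 | 0.8520 | 0.8511 |
| ULMFiT | Shotgun | 0.8797 | 0.8824 | 0.8769 | 0.8758 |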

Notebook Directory

Enhancer Classification

Trained on a dataset of mammalian enhancer sequences from "Enhancer Identification using Transfer and Adversarial Deep Learning of DNA Sequences", ULMFiT outperforms Cohn et al. on AUROC for all four species.

| Model (AUROC) | Human | Mouse | Dog | Opossum |
| --- | --- | --- | --- | --- |
| Cohn et al. | 0.80 | 0.78 | 0.77 | 0.72 |
| ULMFiT 5-mer Stride 2 | 0.812 | 0.871 | 0.773 | 0.787 |
| ULMFiT 4-mer Stride 2 | 0.804 | 0.876 | 0.771 | 0.786 |
| ULMFiT 3-mer Stride 1 | 0.819 | 0.875 | 0.788 | 0.798 |
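
The values in this table are AUROC scores. As a reminder of what that number measures, the sketch below computes AUROC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name and the toy labels and scores are illustrative. This pairwise form is quadratic in the number of examples and is meant for explanation, not for large evaluation sets.

```python
def auroc(labels, scores):
    """AUROC via pairwise comparison of positive and negative scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative pairs are ranked correctly.
print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))  # 0.75
```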

Data Source

Notebook Directory

mRNA/lncRNA Classification

This table shows results for training a classification model on a dataset of coding mRNA sequences and long noncoding RNA (lncRNA) sequences. The dataset comes from "A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential" by Hill et al.

| Model | Test Set | Accuracy | Specificity | Sensitivity | Precision | MCC |
| --- | --- | --- | --- | --- | --- | --- |
| GRU Ensemble (Hill et al.) | Standard Test Set | 0.96 | 0.97 | 0.95 | 0.97 | 0.92 |
| ULMFiT (3mer stride 1) | Standard Test Set | 0.963 | 0.952 | 0.974 | 0.953 | 0.926 |
| GRU Ensemble (Hill et al.) | Challenge Test Set | 0.875 | 0.95 | 0.80 | 0.95 | 0.75 |
| ULMFiT (3mer stride 1) | Challenge Test Set | 0.90 | 0.944 | 0.871 | 0.939 | 0.817 |

Notebook Directory

Interpreting Results

In the plot below, the red line corresponds to a true transcription start site. The plot shows how sensitive the model's predictions are to changes in the sequence around that location. See the Model Interpretations directory.
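
One common way to measure this kind of sensitivity is an in-silico mutation scan: substitute each base in turn and record how far the prediction moves. The sketch below illustrates that general idea and is not necessarily the method used in the repository. `predict` stands for any function that maps a sequence to a score, and `toy_predict` is a made-up stand-in that only checks for one motif.

```python
def mutation_sensitivity(sequence, predict, alphabet="ACGT"):
    """For each position, return the largest score change over all substitutions."""
    baseline = predict(sequence)
    sensitivities = []
    for i, ref in enumerate(sequence):
        deltas = [
            abs(predict(sequence[:i] + alt + sequence[i + 1:]) - baseline)
            for alt in alphabet
            if alt != ref
        ]
        sensitivities.append(max(deltas))
    return sensitivities


def toy_predict(sequence):
    """Hypothetical scorer: 1.0 if the motif TATAAT is present, else 0.0."""
    return 1.0 if "TATAAT" in sequence else 0.0


scores = mutation_sensitivity("GGGTATAATGGG", toy_predict)
print(scores)  # nonzero only at the six motif positions (indices 3 to 8)
```

Plotting the per-position values against sequence coordinates gives a profile that can be compared with a known transcription start site.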

Long Sequence Inference

The image below shows a sample prediction of promoter locations on a 40,000 bp region of the E. coli genome. True promoter locations are shown in red. See the notebook.
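
Inference over a long region is typically done by sliding a fixed-size window along the sequence and scoring each window with the classifier. The sketch below assumes that approach. The window size, step, threshold, and the `toy_predict` scorer are all placeholders, and the repository's notebook may proceed differently.

```python
def scan_sequence(sequence, predict, window, step):
    """Score every window of length `window`, advancing `step` bases each time."""
    return [
        (start, predict(sequence[start:start + window]))
        for start in range(0, len(sequence) - window + 1, step)
    ]


def toy_predict(sequence):
    """Hypothetical scorer: 1.0 if the window is exactly TATAAT, else 0.0."""
    return 1.0 if sequence == "TATAAT" else 0.0


region = "G" * 10 + "TATAAT" + "G" * 10
results = scan_sequence(region, toy_predict, window=6, step=2)
hits = [start for start, score in results if score >= 0.5]
print(hits)  # [10]
```

A smaller step gives finer positional resolution at the cost of more model evaluations.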
