Genomic Language Model

This is an implementation of ULMFiT. The model architecture used is based on the AWD-LSTM.

The approach uses three training phases to produce a classification model:

1. Train a language model on a large, unlabeled corpus
2. Fine tune the language model on the classification corpus
3. Use the fine tuned language model to initialize a classification model

This is useful for genetic analysis, where unlabeled data is abundant and labeled data is scarce. The approach makes it possible to train a model on a large, unlabeled genomic corpus in an unsupervised fashion. The pre-trained language model then serves as a feature extractor for parsing genomic data.

Results

Promoter Classification

E. coli promoters

The method performs well at distinguishing promoter sequences from random sections of the genome. Unsupervised pre-training and fine-tuning have a clear impact on the performance of the classification model.

Model	Accuracy	Precision	Recall	Correlation Coefficient
Naive	0.834	0.847	0.816	0.670
E. coli Genome Pre-Training	0.919	0.941	0.893	0.839
Genomic Ensemble Pre-Training	0.973	0.980	0.966	0.947
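
For reference, the metrics reported in the tables of this document can all be derived from the four counts of a binary confusion matrix. The sketch below assumes that "Correlation Coefficient" refers to the Matthews correlation coefficient (the mRNA/lncRNA table labels it MCC); the function name and the example counts are illustrative and are not taken from the repository.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """Compute binary classification metrics from confusion-matrix counts.

    Assumption: "Correlation Coefficient" in the tables means the
    Matthews correlation coefficient (MCC).
    """
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("at least one example is required")

    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # also called sensitivity
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    # MCC is defined as 0 when any marginal count is zero.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "mcc": mcc,
    }


# Illustrative counts only, not results from this project.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC stays informative when the positive and negative classes are imbalanced, which is likely why it is reported alongside the other columns.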

Dataset

Notebook Directory

Classification performance on human promoters is competitive with published results.

Human Promoters (short)

For the short promoter sequences, using data from Recognition of Prokaryotic and Eukaryotic Promoters using Convolutional Deep Learning Neural Networks:

Model	DNA Size	kmer/stride	Accuracy	Precision	Recall	Correlation Coefficient	Specificity
Kh et al.	-200/50	-	-	-	0.9	0.89	0.98
Naive Model	-200/50	5/2	0.80	0.74	0.80	0.59	0.80
With Pre-Training	-200/50	5/2	0.922	0.963	0.849	0.844	0.976
With Pre-Training and Fine Tuning	-200/50	5/2	0.977	0.959	0.989	0.955	0.969
With Pre-Training and Fine Tuning	-200/50	5/1	0.990	0.983	0.995	0.981	0.987
With Pre-Training and Fine Tuning	-200/50	3/1	0.995	0.992	0.996	0.991	0.994
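
The kmer/stride column describes how a DNA sequence is split into tokens before it reaches the model: a window of k bases is read every stride bases. The following is a minimal sketch of that idea; the function name `kmer_tokenize` is hypothetical, and the repository's own tokenizer may differ, for example in how it handles trailing bases or ambiguous nucleotides.

```python
def kmer_tokenize(sequence, k=5, stride=2):
    """Split a DNA string into k-mers, starting a new k-mer every `stride` bases.

    Trailing bases that do not fill a complete k-mer are dropped
    (an assumption made for this sketch).
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive")
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


print(kmer_tokenize("ATGCGTACGT", k=5, stride=2))
# ['ATGCG', 'GCGTA', 'GTACG']
print(kmer_tokenize("ATGCGTACGT", k=3, stride=1))
# ['ATG', 'TGC', 'GCG', 'CGT', 'GTA', 'TAC', 'ACG', 'CGT']
```

A smaller k with a stride of 1 yields more tokens per sequence and a smaller vocabulary (at most 4^k distinct k-mers over A, C, G, T), so the 3/1 setting gives the model a finer-grained view of each sequence at the cost of longer inputs.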

Data Source

Notebook Directory

Human Promoters (long)

For the long promoter sequences, using data from PromID: Human Promoter Prediction by Deep Learning:

Model	DNA Size	Models	Accuracy	Precision	Recall	Correlation Coefficient
Umarov et al.	-1000/500	2 Model Ensemble	-	0.636	0.802	0.714
Umarov et al.	-200/400	2 Model Ensemble	-	0.769	0.755	0.762
Naive Model	-500/500	Single Model	0.858	0.877	0.772	0.708
With Pre-Training	-500/500	Single Model	0.888	0.90	0.824	0.770
With Pre-Training and Fine Tuning	-500/500	Single Model	0.892	0.877	0.865	0.778
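
The DNA Size column is read here as the number of bases taken upstream and downstream of the transcription start site (TSS), so -500/500 would mean 500 bases before the TSS and 500 bases after it. That reading is an assumption, and so are the function name and the 0-based forward-strand convention used in this sketch; reverse-strand handling is left out.

```python
def promoter_window(genome, tss, upstream=500, downstream=500):
    """Return the sequence from `upstream` bases before the TSS to
    `downstream` bases after it.

    `tss` is a 0-based index on the forward strand. The window covers
    genome[tss - upstream : tss + downstream], so the TSS base itself is
    the first base of the downstream part (a convention chosen for this
    sketch). Returns None if the window would run off either end.
    """
    start = tss - upstream
    end = tss + downstream
    if start < 0 or end > len(genome):
        return None
    return genome[start:end]


# Toy genome: 10 A's followed by 5 C's, with the TSS at the first C.
toy_genome = "A" * 10 + "C" * 5
print(promoter_window(toy_genome, tss=10, upstream=3, downstream=2))
# AAACC
```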

Notebook Directory

Other Bacterial Promoters

This table shows results on data from Recognition of Prokaryotic and Eukaryotic Promoters using Convolutional Deep Learning Neural Networks.

Method	Organism	Training Examples	Accuracy	Precision	Recall	Correlation Coefficient	Specificity
Kh et al.	E. coli	2936	-	-	0.90	0.84	0.96
ULMFiT	E. coli	2936	0.956	0.917	0.880	0.871	0.977
Kh et al.	B. subtilis	1050	-	-	0.91	0.86	0.95
ULMFiT	B. subtilis	1050	0.905	0.857	0.789	0.759	0.95

Notebook Directory

Metagenomics Classification

ULMFiT shows improved performance on the metagenomics taxonomic dataset from Deep learning models for bacteria taxonomic classification of metagenomic data.

Method	Data Source	Accuracy	Precision	Recall	F1
Fiannaca et al.	Amplicon	.9137	.9162	.9137	.9126
ULMFiT	Amplicon	.9239	.9402	.9332	.9306
Fiannaca et al.	Shotgun	.8550	.8570	.8520	.8511
ULMFiT	Shotgun	.8797	.8824	.8769	.8758

Notebook Directory

Enhancer Classification

Trained on a dataset of mammalian enhancer sequences from Enhancer Identification using Transfer and Adversarial Deep Learning of DNA Sequences, ULMFiT outperforms Cohn et al.

Model (AUROC)	Human	Mouse	Dog	Opossum
Cohn et al.	0.80	0.78	0.77	0.72
ULMFiT 5-mer Stride 2	0.812	0.871	0.773	0.787
ULMFiT 4-mer Stride 2	0.804	0.876	0.771	0.786
ULMFiT 3-mer Stride 1	0.819	0.875	0.788	0.798
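
AUROC, the metric in this table, is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it directly from that definition. It is meant as an illustration rather than the evaluation code used for these results, and the pairwise loop is quadratic, so it only suits small examples.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    `labels` holds 1 for positives and 0 for negatives; `scores` holds
    the model's score for each example.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy scores only: three of the four positive/negative pairs are ordered correctly.
print(auroc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
# 0.75
```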

Data Source

Notebook Directory

mRNA/lncRNA Classification

This table shows results for a classification model trained on a dataset of coding mRNA sequences and long noncoding RNA (lncRNA) sequences. The dataset comes from A deep recurrent neural network discovers complex biological rules to decipher RNA protein-coding potential by Hill et al.

Model	Test Set	Accuracy	Specificity	Sensitivity	Precision	MCC
GRU Ensemble (Hill et al.)	Standard Test Set	0.96	0.97	0.95	0.97	0.92
ULMFiT (3mer stride 1)	Standard Test Set	0.963	0.952	0.974	0.953	0.926
GRU Ensemble (Hill et al.)	Challenge Test Set	0.875	0.95	0.80	0.95	0.75
ULMFiT (3mer stride 1)	Challenge Test Set	0.90	0.944	0.871	0.939	0.817

Notebook Directory

Interpreting Results

In the plot below, the red line corresponds to a true transcription start site. The plot shows how sensitive the prediction results are to changes around that location. See the Model Interpretations directory.
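
One common way to measure this kind of sensitivity is in-silico mutagenesis: substitute each base in turn and record how much the model's score moves. The sketch below illustrates that technique under stated assumptions; it is not necessarily how the repository's plot was produced. `score_fn` stands for any function that maps a sequence to a score, and `toy_score` is a hypothetical stand-in for a trained classifier.

```python
def mutation_sensitivity(sequence, score_fn, alphabet="ACGT"):
    """For each position, return the largest absolute change in score
    caused by substituting that base with any other base in `alphabet`."""
    baseline = score_fn(sequence)
    effects = []
    for i, original in enumerate(sequence):
        deltas = [
            abs(score_fn(sequence[:i] + base + sequence[i + 1:]) - baseline)
            for base in alphabet
            if base != original
        ]
        effects.append(max(deltas))
    return effects


def toy_score(seq):
    """Hypothetical stand-in for a model: 1.0 if a TATAAT motif is present."""
    return 1.0 if "TATAAT" in seq else 0.0


effects = mutation_sensitivity("GGGTATAATGGG", toy_score)
print(effects)
# [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0]
```

Plotting the per-position effects against sequence coordinates highlights the bases the score depends on most; in the toy example these are exactly the six motif positions.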

Long Sequence Inference

The image below shows a sample prediction of promoter locations on a 40,000 bp region of the E. coli genome. True promoter locations are shown in red. Notebook
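
Applying a fixed-length classifier to a long region is typically done with a sliding window: score each window, then merge neighbouring windows that exceed a threshold into predicted regions. The following is a minimal sketch of that approach. The window size, step, threshold, and the `toy_score` stand-in are illustrative assumptions, not values or code taken from the notebook.

```python
def scan_sequence(sequence, score_fn, window=100, step=10):
    """Score every `window`-length slice, moving `step` bases at a time.

    Returns a list of (start, score) pairs.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    return [
        (start, score_fn(sequence[start:start + window]))
        for start in range(0, len(sequence) - window + 1, step)
    ]


def call_regions(scored, window, threshold=0.5):
    """Merge overlapping or touching windows scoring at least `threshold`
    into (start, end) intervals with an exclusive end."""
    regions = []
    for start, score in scored:
        if score < threshold:
            continue
        end = start + window
        if regions and start <= regions[-1][1]:
            # Extend the previous region instead of opening a new one.
            regions[-1] = (regions[-1][0], max(regions[-1][1], end))
        else:
            regions.append((start, end))
    return regions


def toy_score(seq):
    """Hypothetical stand-in for a model: 1.0 if a TATAAT motif is present."""
    return 1.0 if "TATAAT" in seq else 0.0


# Toy sequence with a single motif at positions 30-35.
toy_sequence = "G" * 30 + "TATAAT" + "G" * 30
scored = scan_sequence(toy_sequence, toy_score, window=10, step=2)
print(call_regions(scored, window=10))
# [(26, 40)]
```

A smaller step gives finer positional resolution but needs proportionally more model evaluations over a region of this size.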
