kundajelab / chromdragonn

Code for the paper "Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts"

License: MIT License

Python 27.92% Jupyter Notebook 72.08%

genomics epigenetics deep-learning chromatin-accessibiity gene-regulation

chromdragonn's Introduction

ChromDragoNN: cis-trans Deep RegulAtory Genomic Neural Network for predicting Chromatin Accessibility

This repository contains code for our paper "Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts". The models are implemented in PyTorch.

Data

All associated data from our paper can be downloaded from here or here.

Untar the dnase.chr.packbited.tar.gz file (occupies ~30 GB).
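The "packbited" in the archive name suggests that the binary labels are stored bit-packed, eight labels per byte, so a label matrix for M cell types would have ceil(M / 8) columns. That would be consistent with the label shape (398270, 16) reported for 123 cell types in one of the issues further down this page. The following is a minimal pure-Python sketch of the idea, assuming most-significant-bit-first packing (the default of numpy.packbits); the helper names pack_bits and unpack_bits are hypothetical and are not part of this repository.

```python
import math


def pack_bits(labels):
    """Pack a list of 0/1 labels into bytes, most significant bit first."""
    packed = bytearray(math.ceil(len(labels) / 8))
    for i, bit in enumerate(labels):
        if bit not in (0, 1):
            raise ValueError(f"label {i} is {bit!r}, expected 0 or 1")
        if bit:
            packed[i // 8] |= 0x80 >> (i % 8)
    return bytes(packed)


def unpack_bits(packed, count):
    """Recover the first `count` labels from bit-packed bytes."""
    return [(packed[i // 8] >> (7 - i % 8)) & 1 for i in range(count)]


# Illustrative labels for 123 cell types (made-up values, not real data).
labels = [1 if i % 3 == 0 else 0 for i in range(123)]
packed = pack_bits(labels)
print(len(packed))  # 16, i.e. ceil(123 / 8) bytes per sequence
```

Under this assumption, unpacking with the true number of cell types (123 here) drops the padding bits in the last byte and recovers the original labels.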

If you have your own data, you can preprocess it with the scripts in the preprocess/ directory.

Preparing Accessibility Data

For the accessibility matrix, prepare your data in the following format as a tab-separated gzipped file.

chr    start  end    task1  task2  ...  taskM
chr1   50     1050       0      0           0
chr1   1000   2000       1      0           1
...
chr2   100    1100       1      0           1

ChromDragoNN works on binary data, so make sure that every label is either 0 or 1.
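Before running the preprocessing script, it can help to sanity-check the matrix. The sketch below is a hypothetical helper (not part of this repository) that assumes the layout shown above: a header row, three coordinate columns, then one column per task. It checks that every label is 0 or 1 and that all intervals have the same length; the latter is included because one of the issues further down this page reports the preprocessing script failing with "Not all intervals have the same length".

```python
import gzip


def check_accessibility_file(path):
    """Validate a tab-separated gzipped accessibility matrix.

    Returns (number of tasks, number of intervals); raises ValueError on problems.
    """
    with gzip.open(path, "rt") as handle:
        header = handle.readline().rstrip("\n").split("\t")
        if len(header) < 4:
            raise ValueError("expected chr, start, end and at least one task column")
        n_tasks = len(header) - 3
        n_rows = 0
        interval_length = None
        for line_no, line in enumerate(handle, start=2):
            if not line.strip():
                continue  # ignore blank lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) != len(header):
                raise ValueError(f"line {line_no}: expected {len(header)} columns")
            length = int(fields[2]) - int(fields[1])
            if interval_length is None:
                interval_length = length
            elif length != interval_length:
                raise ValueError(
                    f"line {line_no}: interval length {length} != {interval_length}"
                )
            for value in fields[3:]:
                if value not in ("0", "1"):
                    raise ValueError(f"line {line_no}: non-binary label {value!r}")
            n_rows += 1
    return n_tasks, n_rows
```

Calling check_accessibility_file("/path/to/accessibility/file.tsv.gz") either returns the task and interval counts or raises a ValueError naming the first offending line.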

Then use the following command to process the data (this may take a few hours depending on the size of your dataset):

python ./preprocess/make_accessibility_joblib.py --input /path/to/accessibility/file.tsv.gz --output_dir /path/to/dnase/packbited --genome_fasta /path/to/genome/fasta.fa

Make sure the output directory exists!
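One way to avoid a missing output directory is to create it from a small wrapper before launching the script. This is only a sketch: the function name accessibility_command is hypothetical, the paths are the placeholders used above, and the flags are exactly those shown in the command above.

```python
import subprocess
import sys
from pathlib import Path


def accessibility_command(input_tsv, output_dir, genome_fasta):
    """Create the output directory and return the preprocessing command."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    return [
        sys.executable, "./preprocess/make_accessibility_joblib.py",
        "--input", str(input_tsv),
        "--output_dir", str(output_dir),
        "--genome_fasta", str(genome_fasta),
    ]


# Usage, run from the repository root (paths are placeholders):
# cmd = accessibility_command("/path/to/accessibility/file.tsv.gz",
#                             "/path/to/dnase/packbited",
#                             "/path/to/genome/fasta.fa")
# subprocess.run(cmd, check=True)
```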

If you wish to generate the binary matrix from peaks (e.g. narrowPeak), have a look at the seqdataloader repo.

Preparing RNA Data

For the RNA matrix, prepare your data in the following format as a tab-separated file (NOT gzipped).

gene    task1   task2  ...  taskM
MEOX1   3.5189  2.8237      3.7542
SOX8    0.0     0.0         1.9623
...
ZNF195  0.0     0.1232      0.0023

The gene expression values must already be appropriately normalised. In our paper, we use the arcsinh(TPM) values for 1630 transcription factors. Make sure that the number and order of the tasks are the same as in the accessibility data.
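As an illustration of the two requirements above, the following sketch applies the arcsinh transform to raw TPM values and compares the task columns of the two headers. Both helper names are hypothetical, and the comparison assumes that the task columns carry identical names in both files, which is an assumption of this sketch rather than something the preprocessing scripts are known to require.

```python
import math


def arcsinh_transform(tpm_by_gene):
    """Apply arcsinh to every TPM value, keeping gene and task order."""
    return {
        gene: [math.asinh(value) for value in values]
        for gene, values in tpm_by_gene.items()
    }


def check_task_order(accessibility_header, rna_header):
    """Raise if the task columns of the two headers differ in number or order."""
    acc_tasks = accessibility_header[3:]  # skip chr, start, end
    rna_tasks = rna_header[1:]            # skip gene
    if acc_tasks != rna_tasks:
        raise ValueError("task columns differ between accessibility and RNA files")


# Made-up TPM values for illustration only.
transformed = arcsinh_transform({"SOX8": [0.0, 0.0, 3.5]})
check_task_order(
    ["chr", "start", "end", "task1", "task2", "task3"],
    ["gene", "task1", "task2", "task3"],
)
```

Because arcsinh(0) = 0, genes with zero TPM stay at zero after the transform, unlike a plain logarithm, which is undefined at zero.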

Then use the following command to process the data:

python ./preprocess/make_rna_joblib.py --input /path/to/rna/file.tsv --output_prefix /path/to/rna/prefix

This will write the RNA quants file to /path/to/rna/prefix.joblib.

Model Training

Stage 1

The stage 1 models predict accessibility across all training cell types from sequence alone and do not utilise RNA-seq profiles.

The model_zoo/stage1 directory contains the Vanilla, Factorized and our ResNet models.

To start training any of these models (say, ResNet), from the model_zoo/stage1 directory:

python resnet.py -cp /path/to/stage1/checkpoint/dir --dnase /path/to/dnase/packbited --rna_quants /path/to/rna_quants_1630tf.joblib

For other inputs, such as hyperparameters, refer to:

python resnet.py --help

Stage 2

The stage 2 models predict accessibility for each (cell type, sequence) pair and use RNA-seq profiles.

The model_zoo/stage2 directory contains the stage 2 models, which may be trained with or without the mean accessibility feature as input (explained in more detail in the paper).

To start training any of these models (say, ResNet, with mean), from the model_zoo/stage2 directory:

python simple.py -cp /path/to/stage2/checkpoint/dir --dnase /path/to/dnase/packbited --rna_quants /path/to/rna_quants_1630tf.joblib --stage1_file ../stage1/resnet.py --stage1_pretrained_model_path /path/to/stage1/checkpoint/dir --with_mean 1

The model loads weights from the best model from the stage 1 checkpoint directory. You may resume training from a previous checkpoint by adding the argument -rb 1 to the above command. To predict on the test set, add the arguments -rb 1 -ev 1 to the above command. This will generate a report of performance on the test set and also produce precision-recall plots.
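The three ways of invoking the stage 2 script described above differ only in the trailing flags. A minimal sketch of that mapping follows; the helper stage2_args and the mode names are hypothetical, while the flags -rb 1 and -ev 1 are the ones given above.

```python
def stage2_args(base_args, mode="train"):
    """Append the resume/evaluate flags described above to a stage 2 command."""
    extra = {
        "train": [],                          # fresh training run
        "resume": ["-rb", "1"],               # resume from a previous checkpoint
        "evaluate": ["-rb", "1", "-ev", "1"],  # predict on the test set
    }
    if mode not in extra:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(extra)}")
    return list(base_args) + extra[mode]


# Placeholder base command; the remaining arguments are the ones shown above.
base = ["python", "simple.py", "-cp", "/path/to/stage2/checkpoint/dir"]
evaluate_cmd = stage2_args(base, "evaluate")
```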

For other inputs, such as hyperparameters, refer to:

python simple.py --help

Citation

If you use this code for your research, please cite our paper:

Surag Nair, Daniel S Kim, Jacob Perricone, Anshul Kundaje, Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts, Bioinformatics, Volume 35, Issue 14, July 2019, Pages i108–i116, https://doi.org/10.1093/bioinformatics/btz352

chromdragonn's Issues

Question about input matrix

Hi, if I want to train a model to predict binary accessibility in my own cell types of interest, can you provide some guidance for how to preprocess my data? Where do I input the binary matrix of accessibility in each cell type + bed file of peaks into your preprocess pipeline to generate the .joblib files needed to input into the model?

Thanks for your help!

color2ctype is missing from tSNE notebook

Hello, I was wondering if you could make the color2ctype labels available for the tSNE notebook plots?

thank you,

Omar

what's the reason behind these filter and padding schemes?

Hello Surag,

Thanks for the great paper and the code! I really appreciate it when researchers share their code along with their work.

I see that the code uses filter sets of either (3, 1), (7, 1) or (1, 1). If I understand it correctly, the sliding window along the 1000 bp axis is 1, and the sliding window along the one-hot encoded axis is 3, 7 or 1. Am I correct? If so, what was the reason for using a window size of only 1 along the sequence? I thought it should be the other way around: a window size of 3 or larger to scan across the 1000 bp, applied to each of the one-hot encoded bases.

Also, what is the reason to treat the one-hot encoded sequence as a 2D matrix? Is there an advantage over using Conv1D with 4 channels?

In your paper's Fig.1, the shape of the layer starts from wide-short and gradually becomes long-skinny. What is the reason for this transformation? Why not just deepen the channels and shrink the width?

Thank you again for the insight,

Andrew

GRCh38 model

Hi,

Do you have a pretrained model based on GRCh38?

Thanks
Gianfilippo

Structure of dnase.chr.packbited.tar.gz data

Dear Kundaje lab,

I'm exploring the data associated with your paper "Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts". I've downloaded the dnase.chr.packbited.tar.gz file from http://mitra.stanford.edu/kundaje/projects/seqxgene/ and loaded it using the joblib library.

Could you please explain the structure of these data? I see that the num_cell_types value is set to 123, so I would expect the labels to have shape N_seqs * N_cell_types; however, the shape of the labels is

(398270, 16)

Thanks in advance,
Veniamin

Make trained model available for prediction

I've been trying to train the resnet.py model to predict accessibility from DNA sequence (on the provided DNase-seq data). I've run into some issues (listed below). Is it possible for you to provide the trained model weights and a script to initialize the trained model and predict on new DNA?

In the training script, the model seems to expect held out cell types instead of held out chromosomes. I managed to fix this issue by changing line 31 in runner.py, which assumes the held out data is cell types, but I'm still not sure it's training correctly (see issue #2).

Train command:
python resnet.py --dnase /data/packbited --rna_quants /data/rna_quants_1630tf.joblib -bs 32 -cp /data/checkpoint-resnet -ho chromosomes -tl 18 -vl 19 > resnet-train-report.txt 2>&1

Once I get the model trained, the performance is poor on the held out chromosome (OVERALL AUPRC = 0.068, AUC = 0.498).

Test command:
python simple.py --dnase /data/packbited --rna_quants /data/rna_quants_1630tf.joblib -bs 32 -s1f ../stage1/resnet.py -s1m /data/checkpoint-resnet/ -ho chromosomes -tl 18 -vl 19 -rfn resnet-report.txt -evmode report -ev 1 --with_mean 0 -cp /data/stage2-resnet

ERROR: Not all intervals have the same length

Hi,

I am trying to test your algorithm.

I downloaded the data from the links you provide, and I am starting with the command you provide:
python ./preprocess/make_accessibility_joblib.py --input /path/to/accessibility/file.tsv.gz --output_dir /path/to/dnase/packbited --genome_fasta /path/to/genome/fasta.fa

I understand that I have to run this command on every single file that is contained in dnase.chr.packbited.tar.gz.

I just ran it on ENCSR979BPU.Fetal_Kidney.UW_Stam.DNase-seq_rep1-pr.IDR0.1.filt.narrowPeak.gz,
using hg38 first and hg19 later.

I then get the following error:
PREPROCESS::: Reading file into memory
ERROR: Not all intervals have the same length

Can you please explain what I am missing?

Thanks