jordiabante / cpelasm.jl Goto Github PK

View Code? Open in Web Editor NEW

10.0 1.0 0.0 2.56 MB

Julia module to perform haplotype allele-specific DNA methylation analysis.

Home Page: https://jordiabante.github.io/CpelAsm.jl/dev/

License: MIT License

Julia 100.00%

statistical-physics dna-methylation information-theory statistical-analysis johns-hopkins-university

cpelasm.jl's Introduction

CpelAsm

Description

CpelAsm is a julia package especifically designed for haplotype allele-specific methylation based on the method in [1]. CpelAsm draws ideas from statistical physics and information theory to detect allele-specific methylation imbalances at the haplotype level.

Testing

CpelAsm is tested against Julia 1.3.0 on the latest versions of Linux, macOS and Windows.

Getting Started

Prerequisites

julia v1.3.0
git.

Installation

CpelAsm and dependencies can be installed via the following command in julia's REPL:

(v1.3) pkg> add https://github.com/jordiabante/CpelAsm.jl.git

Run the following command to load and test the CpelAsm.jl package

(v1.3) pkg> test CpelAsm

Local installation

Place the CpelAsm folder in a directory of your choice.

Start Julia and change the current directory to the CpelAsm folder on your system. To do so, type (for example):

# Windows
julia> cd("C:\\Users\\UserName\\code\\CpelAsm.jl")
# macOS/Unix
julia> cd("/Users/UserName/code/CpelAsm.jl")

Run the following commands to install dependencies:

julia> using Pkg; Pkg.activate("."); Pkg.instantiate()

Run the following command to load and test the CpelAsm.jl package:
```
julia> using CpelAsm; Pkg.test("CpelAsm")
```
If successfully installed, you should see the following prompt:
```
    Testing CpelAsm tests passed
```

Toy Example

The package includes a small toy example for illustrative purposes. The example consists of two alleles a1 and a2. The former has a mean-methylation level (MML) of 0.8, while that of the later is 0.2. Nevertheless, given the symmetry of the problem, both alleles have the same Shannon entropy. Thus differential analysis only identifies differences in terms of MML. The output bedGraph files can be found in out_path. To run the toy example run the following commands in a julia's REPL:

using CpelAsm
dir = "/path/to/CpelAsm.jl/test/"
b1 = "$(dir)/bam/example.a1.bam"
b2 = "$(dir)/bam/example.a2.bam"
fa = "$(dir)/fasta/n-masked/example.fa"
vcf = "$(dir)/vcf/example.vcf"
out = "$(dir)/out/"
run_analysis(b1,b2,b1,vcf,fa,out;win_exp=10,n_null=1000,cov_ths=5)

Alternatively, if the installation has been done locally, this can be simply run using the following shell command in the CpelAsm.jl folder:

julia calls/ToyExample.jl

Authors

Jordi Abante

License

This project is licensed under the MIT License - see the LICENSE.md file for details.

References

[1] Abante, J., Fang, Y., Feinberg, A.P., Goutsias, J., Detection of haplotype-dependent allele-speciﬁc DNA methylation in WGBS data, Nature Communications 11, 5238 (2020), https://doi.org/10.1038/s41467-020-19077-1.

cpelasm.jl's People

Stargazers

Watchers

cpelasm.jl's Issues

Fixing "Found overlap between homozygous and heterozygous CpG sites" error

I'm getting the error "Found overlap between homozygous and heterozygous CpG sites. Check FASTA file is masked" when I try to use CpelAsm, and I'm trying to figure out which part of my input is causing it. I looked at the relevant part of the source code, but I'm new to Julia and wasn't able to glean much.

What exactly is the overlap referring to; is it a feature of the VCF file, the BAMs, or an interaction of them?

My input files are the N-masked genome fasta and three BAMs (genome1, genome2, unassigned) created by SNPsplit plus a VCF file. The VCF is derived from the file that I used to N-mask the genome and prepare the SNP list for SNPsplit, but I modified it to follow the VCF in the example (e.g., combining two 1/1 and 0/0 samples into one 1/0 sample and adding the haplotype tag).

Thank you for your time.

How to implement parallel computing?

Dear CpelAsm Team,

In the original text, you mentioned: "Note, however, that the clear superiority of the CPEL method over existing methods comes at a substantial computational cost (it took about 48 hours using 20 CPUs to process one WGBS sample in the real data), which necessitates the use of parallel processing using a computer cluster." However, when I run CpelAsm, it appears to use only one CPU. Due to my limited familiarity with the Julia language, could you please guide me on how to enable parallel processing? Thank you very much.

Julia Version

Dear CpelASM Team,

I really would like to run your software on my datasets, but unfortunately, it does not run under the latest Julia version on my device. Moreover, I can not find Julia in the version 1.3.0 that is recommended in the ReadMe. Could you give me a hint with which version of Julia I may have success or what I might overlooked?

Thanks in advance, Christiane

The calculation results seem to have issues

Dear CpelAsm Team,
I have followed your instructions to calculate the results, but they seem to lack statistical significance, and the p-values are close to 1. I cannot identify where the issue might be. Do you know what could be causing this situation? I really hope to receive your assistance.

How to create the example.a1.bam and example.a2.bam?

I'm very confused about the origin of the two bam files in the toyExample. Based on the bismark_reports, the bam came from aligning PE reads to two haplotypes separately. I assume there should be a process to divide reads into two groups (correspond to two haplotypes), but two bams have the same aligned reads (396 reads).

For example, the read "EASX:X:X:X:X:33:33_1:Y:0:NNNNNN" appear in two bam files. Is it correct? Any read should only have one origin.

Can you share how to create two bam files? thanks

No significant T-MML, T-NME and PDM-haps found in samples validated for X-skew testing

Hi Jordi,

I am testing samples that have been validated for X-skew testing to show skewed XCI using clinical X-skew testing as well as monoallelic expression using DNA & RNA sequencing.

I am now testing the same samples for allele specific methylation on the X chromosome using CPEL and although I was able to get CPEL to work smoothly with no errors, I am not seeing any significant p-values in the following files
__tmml_null.bedGraph - empty
__tnme_null.bedGraph - empty
__tpdm_null.bedGraph - empty

Here are the steps I followed for each sample

WGS --> alignment using GATK -- HET VCF --> high confidence HETS (DP >=10 and GQ>=20)

This gave me roughly ~60k het SNVs across the X chromosome.

N-masking of the X chromosome fasta sequence (build 37) using SNPsplit using the ~60k SNVs above.
Alignment to the N-masked reference genome using WGBS data and BISMARCK. (trimming was done using Trimgalore to remove adapters) and only reads with MQ>=20 were retained.
Splitting of the BISMARCK bam generated above using 60k SNVs into allele1.bam and allele2.bam using SNPsplit. A majority of the reads within the n-masked bam were unassigned as they didnt overlap the 60k SNVs, understandably so, but allele1 and allele2 had roughly ~300k reads assigned to them
Use whatshap to generate a phased VCF from the VCF generated in step1 - Roughly ~30k variants could be phased (50% of the original)
Use run_analysis(b1,b2,b1,vcf,fa,out;win_exp=10,n_null=1000,cov_ths=5) within CPEL - The VCF supplied here is the phased VCF from whatshap, the b1 and b2 are the bams generated in step3 and fasta file is generated by snpsplit in step 2. The program runs for rougly 45 minutes, but I do not get any significant hits. All of the WGBS is deep sequenced and I used high confidence HET SNVs.

Could you suggest what the issue might be ?? Do I not need to filter the HET SNVs and use all of them without applying any quality filters in step 1? - Is that the main issue here ?? Any insight from you will be very helpful. If I dont put quality filters on the HET SNVs, each sample has roughly ~70k HET SNVs which can be used towards N-masking and allele splitting. But when using whatshap phasing, they are still reduced down to ~30-40k SNVs.

Regards
Numrah