chjiao / pehaplo Goto Github PK

Python 100.00%

pehaplo's Introduction

PEHaplo

PEHaplo is a de novo assembly tool for recovering virus haplotypes from virus quasispecies sequencing data. It utilizes overlap graph and paired-end information to recover virus haplotypes.

It requires paired-end reads file as input and outputs contigs that are part of or full haplotypes.

PEHaplo does not need any reference genomes and thus can be applied for identifying new haplotyps or haplotypes that are remotely related to characterized ones.

Quick Start

To quickly test the core assembly algorithm, we have prepared the procssed data set (after error correction, removing duplicates, reads orientation adjustment, etc.) in folder processed_test_data.

Required Dependencies

Install Python 2.7.x
Install Python module: networkx 1.11

cd networkx-1.11/  
sudo python setup.py install

Apsp

cd Apsp/  
make

Copy the compiled binary file Apsp to your path

Installation

PEHaplo was developed to run in Linux. To install:
git clone https://github.com/chjiao/PEHaplo.git

Running Example

cd PEHaplo  
mkdir assembly  
cd assembly  
python ../apsp_overlap_clique.py ../processed_test_data/Plus_strand_reads.fa ../processed_test_data/pair_end_connections.txt 180 250 600 210

180: initial overlap threshold before merging cliques
250: read size
600: paired-end insert size
210: overlap threshold after merging cliques

Output:
Contigs.fa: the produced contigs
PEG_nodes_sequences.fa: the nodes sequences in the graph

The whole pipeline

Installing via conda

Noted that all the packages can be found on anaconda.cloud, which means you can easily install them by using conda. You can follow the Guaidance to install step by step.

Installing without using conda: required dependencies

Install Python 2.7.x
Install Python module: networkx 1.11
Install Karect, so that karect is in your path
Install Readjoiner, so that readjoiner is in your path
Install Apsp, so that Apsp is in your path
Install SGA, so that sga is in your path
Install Samtools, so that samtools is in your path

For contigs correction based on alignment:
Bowtie2 is required in your path

Usage

python pehaplo.py [-h] -f1 INPUT_F1 -f2 INPUT_F2 -l OVERLAP_LEN -r READ_LEN [-l1 OVERLAP_LEN1] [-F FRAGMENT_LEN] [-std FRAGMENT_STD] [-n DUP_N] [-correct CONTIG_CORRECT] [-t THREADS]

Arguments:

-f1 INPUT_F1 input .1 part of paired-end fasta file

-f2 INPUT_F2 input .2 part of paired-end fasta file

-l OVERLAP_LEN, --overlap_len OVERLAP_LEN overlap threshold between reads for overlap graph construction

-l1 OVERLAP_LEN1, --overlap_stage1 OVERLAP_LEN1 overlap cutoff to remove potentially error overlaps after merging linked cliques, default same as -l

-n DUP_N, --dup_n DUP_N keep the reads that are duplicated at least n times

-r READ_LEN, --read_len READ_LEN reads length

-F FRAGMENT_LEN, --fragment_len FRAGMENT_LEN paired-end reads insert size, default as 2.5*read_len

-std FRAGMENT_STD standard deviation of paired-end reads insert size, default as 100

-n DUP_N, --dup_n DUP_N the reads kept should be duplicated at least n times, default as keep all the duplicates removed reads

-correct CONTIG_CORRECT whether apply alignment based contigs correction(yes/no), default: no

-t THREADS, --threads THREADS threads for karect, sga, bowtie2

Test data and Running Examples

Use the test in folder raw_test_data/.
Example:

cd PEHaplo  
mkdir assembly  
python ../pehaplo.py -f1 ../raw_test_data/virus_1.fa -f2 ../raw_test_data/virus_2.fa -l 180 -l1 210 -r 250 -F 600 -std 150 -n 3 -correct yes

Output:
Contigs.fa: the raw output contigs
Contigs_clipped.fa: the contigs after error correction
PEG_nodes_sequences.fa: the nodes sequences in the graph

pehaplo's People

Contributors

Stargazers

Watchers

Forkers

broadinstitute skshashank74

pehaplo's Issues

some needed imports are not mentioned in the readme

numpy, scipy, pygraphviz . others?

Test run fails

Hi Jiao,

I am having problem getting PEHaplo to run. I think all the requirements are installed correctly, but I am not sure how to test whether that is true. I tried following the README and created an assembly folder in the root folder of the repo. Then changed directory to the assembly folder and ran python ../pehaplo.py -f1 ../raw_test_data/virus_1.fa -f2 ../raw_test_data/virus_2.fa -l 180 -l1 210 -r 250 -F 600 -std 150 -n 3 -correct yes that crashed after a while with the following error:

# gt readjoiner overlap (version 1.2)
# number of reads in filtered readset = 27066
# number of irreducible suffix-prefix matches = 27088
# average irreducible SPM/read = 1.00
# number of transitive suffix-prefix matches = 1351014
Number of total reads:  27066
Number of paired reads: 11472
Number of unpaired reads:       15594
Graph construction finished!
Traceback (most recent call last):
  File "/home/microadmin/Programs/PEHaplo/tools/Seperate_strand_single_pair.py", line 480, in <module>
    plus_reads_dict, minus_reads_dict = BFS_tranverse_graph_pair(G, read_db, des_list, starting_nodes[0], pair1_dict, pair2_dict)
  File "/home/microadmin/Programs/PEHaplo/tools/Seperate_strand_single_pair.py", line 280, in BFS_tranverse_graph_pair
    candidate_nodes.extend(G.predecessors(vertex))
AttributeError: 'dictionary-keyiterator' object has no attribute 'extend'
Traceback (most recent call last):
  File "../pehaplo.py", line 157, in <module>
    main()
  File "../pehaplo.py", line 85, in main
    subprocess.check_call('python "%s"/tools/Seperate_strand_single_pair.py samp.des samp_edges.txt kept_num.fa kept_pairs.txt' % base_path, shell=True)
  File "/opt/miniconda3/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'python "/home/microadmin/Programs/PEHaplo"/tools/Seperate_strand_single_pair.py samp.des samp_edges.txt kept_num.fa kept_pairs.txt' returned non-zero exit status 1

Is this a bug or is there something wrong with the installation? Could you maybe create docker image of a working setup that could be used as an alternative to the complicated manual installation of the requirements?

Thanks for your help in advance!

Kind regards,
Balazs

get_pairs_title_from_nondup.py error

Hi,

I installed PEHaplo with conda - this requires and installs python 3.6 (is that correct?).
I tried to install PEHaplo using conda into a py26 environment - but that causes inconsistencies.
Maybe the py version is not the problem?

kept_num.fa is produced and produce some 200k reads with fasta headers - e.g.:

>M03777:103:000000000-B655P:1:1101:15081:1669 M03777:103:000000000-B655P:1:1101:15081:1669 NumDuplicates=1
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTCTTTTCTTTTTTTCTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>M03777:103:000000000-B655P:1:1101:15081:1669 M03777:103:000000000-B655P:1:1101:15081:1669 NumDuplicates=1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
>M03777:103:000000000-B655P:1:1101:18625:1726 M03777:103:000000000-B655P:1:1101:18625:1726 NumDuplicates=1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAATAAAAAAAAATTTAATAATTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
....

the kept_pairs.txt however contain only 2 entries:
M03777:103:000000000-B655P:1:1101:15081:1669 M03777:103:000000000-B655P:1:1101:15081:1669

Traceback (most recent call last): File "/home/ngs/miniconda3/envs/pehaplo/share/pehaplo-v0.1_dev05-0/tools/get_pairs_title_from_nondup.py", line 26, in <module> assert title_dict[base].endswith('1') and title.endswith('2') AssertionError Traceback (most recent call last): File "/home/ngs/miniconda3/envs/pehaplo/bin/pehaplo.py", line 141, in <module> main() File "/home/ngs/miniconda3/envs/pehaplo/bin/pehaplo.py", line 87, in main subprocess.check_call('python "%s"/tools/get_pairs_title_from_nondup.py kept_num.fa kept_pairs.txt' % base_path, shell=True) File "/home/ngs/miniconda3/envs/pehaplo/lib/python3.6/subprocess.py", line 311, in check_call raise CalledProcessError(retcode, cmd) subprocess.CalledProcessError: Command 'python "/home/ngs/miniconda3/envs/pehaplo/share/pehaplo-v0.1_dev05-0"/tools/get_pairs_title_from_nondup.py kept_num.fa kept_pairs.txt' returned non-zero exit status 1.

Failure running PEHaplo

Hi,
I have been trying to run PEHaplo during the last two days with no success. Several failures have jumped, but step by step I could solve them. Unfortunately, the last one it is out of my control:
[ERROR REPORTED]
File "XERMADE/PEHaplo-master/tools/get_pairs_title_from_nondup.py", line 28, in
assert (title_dict[base].endswith('1') and title.endswith('2')) or (title_dict[base].endswith('2') and title.endswith('1'))
AssertionError
But I have changed my reads (they are in fasta format), the current format is: 'read1/1' 'read1/2', which I think is the right one. Am I wrong?
Besides, I have tried to execute aswell the gold benchmark (https://github.com/cbg-ethz/5-virus-mix); but the code fails:
[ERROR REPORTED]
XERMADE/PEHaplo-master/tools/Seperate_strand_single_pair.py", line 466, in
des_list, read_map, read_db = get_seq_from_fa(fa_file, des_file) # read dictionary
File "/home/bfreire/XERMADE/PEHaplo-master/tools/Seperate_strand_single_pair.py", line 14, in get_seq_from_fa
with open(des_file,'r') as f:
IOError: [Errno 2] No such file or directory: 'samp.des'
I think that samp.des is a file created by the code. If not I did not realize about how it should be.
In this case I have trimmed the reads with the same paremeters as SAVAGE did, but I have not applied PEAR. Is everything correct?
Thanks in advice :)

python 3 compatibility

I tried to make a bioconda recipe ( https://bioconda.github.io/ ) for installing PEHaplo. To install it in Python 3 environments, I tried to use the 2to3 program which converts python 2 code to python 3. The problem is that the division operator (/) works differently in Python 3: in Python 2 it's integer division if operands are integer, while in Python 3 it's floating-point division while // is integer division. If possible, could you change your code to be Python 3 compatible, by adding
from future import division
at the top of each script, and then changing / to // wherever you want integer division rather than floating-point division? Then PEHaplo could be installed in Python 3 conda environments alongside other tools.

bug?

@chjiao at https://github.com/chjiao/PEHaplo/blob/master/tools/Seperate_strand_single_pair.py#L98
did you mean to assign overlap_len not overlap?

tag a release

@chjiao Can you tag a github release? This is needed to create a bioconda recipe for installing pehaplo ( https://bioconda.github.io/ ). Thanks!

Error with get_pairs_title_from_nondup.py

Hello, I´m having an error when PEHaplo runs get_pairs_title_from_nondup.py. It exits with non zero status.

I have installed Networkx version 1.11
My reads names are CP1p_1.fa and CP1p_2.fa

Can you guide me through to the solution?

Thanks!

Read quality trimming and adapter removal

The article mentions:

There are five major components in the pipeline of PEHaplo as shown
in Figure 3. In the first pre-processing stage, reads with low-quality
or ambiguous base calls are filtered or trimmed.

however, it's not implemented in the code, as far as I can tell.
There is also no mention of any adapter removal in the methods section.
Should we do the usual adapter removal and/or quality trimming procedure before using PEHaplo, or is the algorithm somehow immune to this noise?

My primary concern is that after trimming the read lengths will be widely distributed and many of them will be shorter than the calculated min overlap parameter.

Fail in contig correction

The command is: python pehaplo.py -f1 PV194D.virus_recruit_1.fa -f2 PV194D.virus_recruit_2.fa -l 100 -l1 120 -t 30 -r 150 -F 375 -std 100 -n 3 -correct yes

The error info is: Traceback (most recent call last):
File "pehaplo.py", line 157, in
main()
File "pehaplo.py", line 115, in main
subprocess.check_call('samtools view -F 4 -S contigs_alignment.sam > contigs_mapped.sam', shell=True)
File "/home/xiuwan/miniconda2/envs/pehaplo/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'samtools view -F 4 -S contigs_alignment.sam > contigs_mapped.sam' returned non-zero exit status 1

The reason I think is:
grep -n "contig_529_0" Contigs.fa
115:>contig_529_0 contig_529_0 NumDuplicates=1
117:>contig_529_0 contig_529_0 NumDuplicates=1

Why there would be two same id in Contigs.fa?

I have completed the other 10 samples, everything is successful, just this sample meet this problem. It's so strange. Right?

Thanks in advance.

reading .des files

At https://github.com/chjiao/PEHaplo/blob/master/tools/Seperate_strand_single_pair.py#L14
the .des file is read as a text file. But at the end of this file, after the list of read names, there is binary data, which is currently read in as an (incorrect) read name. It'd be better to use the official API for reading EncSeq files, from https://github.com/genometools/genometools/blob/master/gtpython/gt/core/encseq.py .

AssertionError in get_pairs_title_from_nondup.py

Hi Jiao,

I've been trying to get PEHaplo running for a while now, but I keep moving from error to error so I was hoping you could help me out. First it complained when running "join_pair_end_fasta.py". Since the input given to the script in th main pipeline is fastq and there is also a "join_pair_end_fastq.py" script available in /tools, I replaced the first by the latter script in the pipeline. This seemed to fix the issue. But then it crashed when running "get_pairs_title_from_nondup.py" because of an AssertionError:

assert (title_dict[base].endswith('1') and title.endswith('2')) or (title_dict[base].endswith('2') and title.endswith('1'))

So I renamed my input reads to ensure that forward readnames end with /1 and reverse readnames end with /2 (my input now looks similar to the example data) but PEHaplo still crashes. If I remove the assert statement it will continue to the next step and crash when running "Seperate_strand_single_pair.py" because of a KeyError after the graph construction finishes.

Any help would be very much appreciated!

Best,
Jasmijn

subprocess.CalledProcessError: Command 'sga index Contigs.fa' returned non-zero exit status 1

Hi, I'm experiencing the following error when trying to assemble the paired-end reads:

Size of File:0
Number of Strings: K 0
Size of strings:0
Creating tree: 
User Time for Constructing tree with no sorting required: 0 ms
Segmentation fault (core dumped)
The nodes of the whole graph is: 0; the edges of the whole graph is: 0.
index: input file is empty
Traceback (most recent call last):
  File "../pehaplo.py", line 157, in <module>
    main()
  File "../pehaplo.py", line 102, in main
    subprocess.check_call('sga index Contigs.fa', shell=True)
  File "/home/thermite/anaconda3/envs/tarvir/lib/python2.7/subprocess.py", line 190, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'sga index Contigs.fa' returned non-zero exit status 1

I'm running Python 2.7 and followed the steps as stated in the Readme as well as my paired end reads are in the 'reads1/1' format.
Thanks!

division by zero

@chjiao sometimes at
https://github.com/chjiao/PEHaplo/blob/master/tools/identify_misjoin_contigs.py#L427
count_pair is zero, and the tool crashes with division by zero.

subprocess.CalledProcessError exit status -11

An issue was arised as followed:
(pehaplo) root 17:36:35 /public/home/hanzhenzhi/ngs-project/WFM/D2587/quasispecies
$ pehaplo.py -f1 tmp/clean_1.fq -f2 tmp/clean_2.fq -l 50 -r 150 -std 100 -t 36 -m 200GB -correct no
Error correction begins -------------------------------------------------
Error Correction Started
Estimated Coverage = 14.745907
Global Num Reads = 145153888
Global Num Bases = 21695686234
Global Num Kmers = 20534455130
Global Max Read Len = 150
Global Avg Read Len = 149.466794
Min Overlap = 35
Min Overlap Percentage = 0.200000
Max Overlap Error Rate = 0.250000
Truncate Factor = 0.000000
Min Straight Edge Val = 73.729537
Min Straight Edge Per = 1.000000
Min Read Weight = 0.000000
Max Mul Len Matches = 13422818
Max Kmer Slots = 100000
Use Qual = 1
Match Type = Hamming Distance
Kmer Size = 13
Max Kmer Errors = 2
Kmer With Errors Size = 13
Read Index Bits = 40
Num Threads = 36
Memory Budget = 10995116277760[10240g]
File Block Size = 10485760[10m]
Cache Block Size = 134217728[128m]
Max Reads Per Step = 1000
Max Part Res = 110
Main CorrectErrors StartFile=0
Total Reads Size = 43681680245
Building Index [0 - 43536526356]
Traceback (most recent call last):
File "/public/Software/bin/pehaplo.py", line 160, in
main()
File "/public/Software/bin/pehaplo.py", line 52, in main
subprocess.check_call("karect -correct -inputfile=%s -inputfile=%s -threads=%s -celltype=haploid -matchtype=hamming -aggressive=5.0 -numstages=2 -errorrate=0.25 -errorratesec=0.25" % (args.input_f1, args.input_f2, args.threads), shell=True)
File "/public/Software/Miniconda3/envs/pehaplo/lib/python2.7/subprocess.py", line 190, in check_call
raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command 'karect -correct -inputfile=tmp/clean_1.fq -inputfile=tmp/clean_2.fq -threads=36 -celltype=haploid -matchtype=hamming -aggressive=5.0 -numstages=2 -errorrate=0.25 -errorratesec=0.25' returned non-zero exit status -11

I performed several methods to solve this issue, such as reducing the overlap threshold as showed for other problems. But this issue could not be solve. I also change to "root" role, for the highest permission of computer. And the command "-correct no" or "-correct yes" return the same issue. Could you check this error and provide a solution?

Zhenzhi

warn that pehaplo.py deletes all existing files in the directory where it is run

@chjiao pehaplo.py deletes all existing files in the directory where it is run . Maybe warn about that?

read quality scores

@chjiao I notice that you don't use read quality scores (fastq files), although some of the tools you invoke (Karect, sga) can use them. Is that just to make things simpler, or you've found in your experiments that using quality scores makes results worse?