lpryszcz / pyscaf Goto Github PK

In reference-based mode, pyScaf uses synteny to the genome of closely related species in order to order contigs and estimate distances between adjacent contigs.

Contigs are aligned locally onto reference chromosomes, ignoring:

matches not satisfying cut-offs (--identity and --overlap)
suboptimal matches (only best match of each query to reference is kept)
and removing overlapping matches on reference.

In preliminary tests, pyScaf performed superbly on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, reconstructing correctly all chromosomes always for CANPA and nearly always for ARATH (Figures in dropbox, CANPA table, ARATH table). Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

Reduce your assembly before (fasta2homozygous.py) as any redundancy will likely break the synteny.
pyScaf works better with contigs than scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect...), which breaks synteny.
pyScaf works very well if divergence between reference genome and assembled contigs is below 20% at nucleotide level.
pyScaf deals with large rearrangements ie. deletions, insertion, inversions, translocations. Note however, this is experimental implementation!
Consider closing gaps after scaffolding.

Usage

Dependencies

Parameters

Given reference genome, the program generates pairwise genome alignment (dotplots) by default.

Genral options:

`-h, --help`	show this help message and exit
`-f FASTA, --fasta FASTA`
	assembly FASTA file
`-o OUTPUT, --output OUTPUT`
	output stream [scaffolds.fa]
`-t THREADS, --threads THREADS`
	max no. of threads to run [4]
`--log LOG`	output log to [stderr]
`--dotplot`	generate dotplot as [png]
`--version`	show program's version number and exit

Reference-based scaffolding options:

`-r REF, --ref REF, --reference REF`
	reference FastA file
`--identity IDENTITY`
	min. identity [0.33]
`--overlap OVERLAP`
	min. overlap [0.66]
`-g MAXGAP, --maxgap MAXGAP`
	max. distance between adjacent contigs [0.01 * assembly_size]
`--norearrangements`
	high identity mode (rearrangements not allowed)

Long read-based scaffolding options (EXPERIMENTAL!):

-n LONGREADS, --longreads LONGREADS

FastQ/FastA file(s) with PacBio/ONT reads

NGS-based scaffolding options (!NOT IMPLEMENTED!):

`-i FASTQ, --fastq FASTQ`
	FASTQ PE/MP files
`-j JOINS, --joins JOINS`
	min pairs to join contigs [5]
`-a LINKRATIO, --linkratio LINKRATIO`
	max link ratio between two best contig pairs [0.7]
`-l LOAD, --load LOAD`
	align subset of reads [0.2]
`-q MAPQ, --mapq MAPQ`
	min mapping quality [10]

Test run

To perform reference-based assembly, provide assembled contigs and reference genome in FastA format. Dotplots of below runs can be found in docs. If you wish to skip dotplot generation (ie. no X11 on your system), provide --dotplot '' parameter.

# scaffold homogenised assembly (reduced contigs)
./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.fa

# scaffold reduced contigs using global mode (no norearrangements allowed)
./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.global.fa --norearrangements

# scaffold heterozygous assembly (de novo assembled contigs)
./pyScaf.py -f test/contigs.fa -r test/ref.fa -o test/contigs.ref.fa

# scaffold reduced contigs using long reads
## pacbio
./pyScaf.py -f test/contigs.reduced.fa -n test/pacbio.fq.gz -o test/contigs.reduced.pacbio.fa
## nanopore
./pyScaf.py -f test/contigs.reduced.fa -n test/nanopore.fa.gz -o test/contigs.reduced.nanopore.fa

# generate dotplot
lastdb test/ref.fa
lastal -f TAB test/ref.fa test/contigs.reduced.pacbio.fa | last-dotplot - test/contigs.reduced.pacbio.fa.ref.png
lastal -f TAB test/ref.fa test/contigs.reduced.nanopore.fa | last-dotplot - test/contigs.reduced.nanopore.fa.ref.png

# clean-up
#rm test/contigs.{,reduced.}fa.* test/ref.fa.* test/*.{nanopore,pacbio,ref}* test/*.log

Proof of concept

pyScaf is under heavy development right now. Nevertheless, both the reference-based mode and long-read mode are functional and produces meaningful assemblies. pyScaf has been implemented in Redundans.

For more info, have a look in workbook.

pyscaf's People

Contributors

Stargazers

Watchers

Forkers

cz2009111109 chrishah biocko alexpersa7 wangdi2014 biogeeker gabaldonlab litswu biohlt omicsclass

pyscaf's Issues

don't generate dotplot if too many seqs in query

as in topic

mail

Less BUSCO genes after scaffolding.

Hi,

I would just like to make a return on the scaffolding of my assembly (Sanger technology) with PacBio reads (30x coverage), by using pyScaf.

pyScaf is fast and generates interesting results in the first place. I went from 2,059 scaffolds to 1,344 scaffolds, which was encouraging. Then I launched BUSCO on both assemblies and got the following results :

95.6% of complete BUSCO genes for my assembly (before pyScaf) and 78.7% of complete BUSCO genes after pyScaf. Before scaffolding, I have 37 missing genes, after pyScaf I have 284 missing genes.

I launched pyScaf with these parameters :
pyScaf.py -f Scaffolds.fasta --identity 0.80 -o Scaffolds.pyScaf.fasta -t 10 --log pyScaf_run.log --longreads all_raw_reads.Pacbio.fasta

Maybe I have to change them ? Do you have any advice to me?

No such file or directory error

Hi,

I'm trying to run the example commands and getting the following error:

./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.fa
Options: Namespace(dotplot='png', fasta='test/contigs.reduced.fa', fastq=None, identity=0.33, joins=5, linkratio=0.7, load=0.2, log=<open file '<stderr>', mode 'w' at 0x7fbd138771e0>, longreads=None, mapq=10, maxgap=0, nofilter_overlaps=False, norearrangements=False, output=<open file 'test/contigs.reduced.ref.fa', mode 'w' at 0x259b4b0>, overlap=0.66, ref='test/ref.fa', threads=4)
 maxgap cut-off of 10000 bp

##################################################
[Tue Jan  9 16:18:31 2018][    9 Mb] Aligning contigs on reference...

##################################################
[Tue Jan  9 16:18:31 2018][    9 Mb] Saving dotplots to: test/contigs.reduced.fa.png
Traceback (most recent call last):
  File "./pyScaf.py", line 1294, in <module>
    main()
  File "./pyScaf.py", line 1286, in main
    s.save(out)
  File "./pyScaf.py", line 171, in save
    self._get_scaffolds()
  File "./pyScaf.py", line 1094, in _get_scaffolds
    t2hits, t2size = self._get_hits()
  File "./pyScaf.py", line 1035, in _get_hits
    dotplot = self.save_dotplot(self.genome, readstdin=True)
  File "./pyScaf.py", line 212, in save_dotplot
    proc = subprocess.Popen(args, stdin=subprocess.PIPE, stderr=self.log)
  File "/home/peter/.pyenv/versions/2.7.6/lib/python2.7/subprocess.py", line 709, in __init__
    errread, errwrite)
  File "/home/peter/.pyenv/versions/2.7.6/lib/python2.7/subprocess.py", line 1326, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

It doesn't seem to matter whether I build the software from github or use pip/bioconda for installation.

The computer is running Ubuntu 16.04 and python 2.7.6

program seems stuck at alignment

Hi,

The program seems running, but stuck at the beginning of aligning contigs on reference, 12 hours no progress...

Any suggestion?

Thanks.

Output to log

Hi,

I'm currently scaffolding an assembly with some Nanopore data, but it is somewhat difficult to figure out the status of this job. There is no output to a logfile or the screen, and no noticeable output to files in the last 6 days. 1) Is this normal? 2) Can future versions include this?

Thanks!

RuntimeError: maximum recursion depth exceeded !

The script terminated with an error message:
...
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 279, in _populate_scaffold
links_sorted = sorted(links.iteritems(), key=lambda x: x[1][0], reverse=1)
RuntimeError: maximum recursion depth exceeded

Issue with Pyscaf Conda

Hi,

I ran the following command:

pyScaf -f assembly.fa -o pyscaf.scaffolding -t 64 --log pyscaf.log --dotplot png -n pacbio_reads.fasta

It blocks and I receive the output as in the image.

Thanks,
Marco

[WARNING] Unsupported sequence format

Hi, I do not know if it is a silly error... but I am trying to run pyScaf to do scaffolding in my Canu (Pabio) assembly with my nanopore reads.

this is my .sh

pyScaf -f /dataX/sami/genomes/p_ulei_polished.fasta -o p_ulei_scaf_nano --dotplot pdf -n /data1/sami/rawdata/nanopore/nanopore_all.fastq

the less slurm-XX.out is

Options: Namespace(dotplot='pdf', fasta='/dataX/sami/genomes/p_ulei_polished.fasta', fastq=None, identity=0.33, joins=5, linkratio=0.7, load=0.2, log=<open file '', mode 'w' at 0x7f1660bc61e0>, longreads=['/data1/sami/rawdata/nanopore/nanopore_all.fastq'], mapq=10, maxgap=0, norearrangements=False, output=<open file 'p_ulei_scaf_nano', mode 'w' at 0x7f165a43f6f0>, overlap=0.66, ref='', threads=4)
maxgap cut-off of 936958 bp

##################################################
[Tue Oct 8 17:05:21 2019][ 7 Mb] Aligning long reads on contigs...
File "/home/sami/anaconda2/bin/maf-convert", line 136
print(line, end="")
^
SyntaxError: invalid syntax
0 0

##################################################
[Tue Oct 8 17:05:21 2019][ 7 Mb] Reporting scaffolds...
93695813 bp in 230 scaffolds. Details in p_ulei_scaf_nano.tsv
Scaffolds saved to: p_ulei_scaf_nano

##################################################
[Tue Oct 8 17:05:21 2019][ 56 Mb] Saving dotplots to: p_ulei_scaf_nano.pdf
[WARNING] Unsupported sequence format p_ulei_scaf_nano in /data1/sami/rawdata/nanopore/nanopore_all.fastq

The program is not running. Can you advise me something? my nanopore reads are not compressed. We install both pyScaf and Dependencies with bioconda.

Thanks a miilion

Too large size of gaps got after long-reads scaffolding

I have get as long as ~500K bp gaps when using PacBio reads for scaffolding. However, the long reads are not longer than 80K. I wonder how such long gaps opened. There might be an error on calculating gap in the script at Line 802 (gap = dist - pos1 + pos2). As pos1 and pos2 should be the overhang of target2 and target1, the expression should be "gap = dist - (pos1 + pos2)" and "overhang = pos1 + pos2"?
Additionly, there might be some over-scaffolding that many contigs seemed with large overlap were linked directly (without any check such as whether the contigs overlapped actually). Here are head lines in file scaffolds.longreads.1.fa.tsv:

name size no. of contigs ordered contigs contig orientations (0-forward; 1-reverse) gap sizes (negative gap size = adjacent contigs are overlapping)

Here is the command:
redundans.py -f $SEQS -o nonredundant5 -l $LONGREADS -identity 0.9 --overlap 0.9 --minLength 200 -t 30 --resume -v
The '-f' is all contigs assembled by CANU without any gap and the '-l' is corrected Pacbio reads by CANU.
Before scaffolding, the ratio of completed BUSCOs is ~92%, and that is ~90% after scaffolding.

Dotplot doesn't work

Hi Leszek,

I ran pyScaf to compare my genome with other closely related species. My main goal was to generate dotplots for them. I ran the command something like this:

python /shared/software/GIF/programs/pyscaf/2016-04-08/pyScaf.py \
    --fasta $GENOME \
    --reference $TARGET \
    --norearrangements \
    --threads 16 \
    --dotplot pdf \
    --log ${GENOME%.*}_pynast.log

The STDOUT showed that the job completed sucessfully:

Options: Namespace(dotplot='pdf', fasta='MaSuRCA_redundans_final.fasta', fastq=None, identity=0.33, joins=5, linkratio=0.7, load=0.2, log=<open file 'MaSuRCA_redundans_final_pynast.log', mode 'w' at 0x2ab415a29300>, longreads=None, mapq=10, maxgap=0, norearrangements=True, overlap=0.66, ref='Cyanidioschyzon_merolae.ASM9120v1.dna.toplevel.fa', threads=16)
 maxgap cut-off of 17978594 bp

##################################################
[Thu Nov 10 15:11:39 2016][   94 Mb] Aligning contigs on reference...

##################################################
[Thu Nov 10 15:11:39 2016][   94 Mb] Saving dotplots to: MaSuRCA_redundans_final.fasta.pdf
[WARNING] dotplot generation failed!

##################################################
[Thu Nov 10 15:19:26 2016][   94 Mb] Reporting scaffolds...
 1797859444 bp in 17040 scaffolds. Details in MaSuRCA_redundans_final.fasta.scaffolds.ref.fa.tsv
Scaffolds saved to: MaSuRCA_redundans_final.fasta.scaffolds.ref.fa

##################################################
[Thu Nov 10 15:22:17 2016][  118 Mb] Saving dotplots to: MaSuRCA_redundans_final.fasta.scaffolds.ref.fa.pdf

##################################################
[Thu Nov 10 15:22:18 2016][  118 Mb] Done!

except for the [WARNING] dotplot generation failed!.
Does this mean that I was missing some packages required to plot? Or is this the data which didn't produce any dotplot?

Thanks for the help! Have a great day!

importerror: no module named fastaindex

Hello :
I had use conda install fastaindex; but I still got the eroor
My command is :
./pyScaf.py -f assembly.fasta -n long-read.fasta -o mapped@pyS

Is there any problems with this command?

`-n LONGREADS, --longreads LONGREADS`
	FastQ/FastA file(s) with PacBio/ONT reads

lpryszcz / pyscaf Goto Github PK

pyscaf's Introduction