Git Product home page Git Product logo

pyscaf's Introduction

pyScaf orders contigs from genome assemblies utilising several types of information:

This is under development... Stay tuned.

In this mode, pyScaf aligns long reads onto the contigs, identifies the reads the connects two or more contigs and join adjacent contigs.

Long reads are aligned locally onto contigs, ignoring:

  • matches not satisfying cut-offs (--identity and --overlap)
  • suboptimal matches (only best match of each query to reference is kept)
  • and removing overlapping matches on reference.

Note, this is experimental implementation.

In reference-based mode, pyScaf uses synteny to the genome of closely related species in order to order contigs and estimate distances between adjacent contigs.

Contigs are aligned locally onto reference chromosomes, ignoring:

  • matches not satisfying cut-offs (--identity and --overlap)
  • suboptimal matches (only best match of each query to reference is kept)
  • and removing overlapping matches on reference.

In preliminary tests, pyScaf performed superbly on simulated heterozygous genomes based on C. parapsilosis (13 Mb; CANPA) and A. thaliana (119 Mb; ARATH) chromosomes, reconstructing correctly all chromosomes always for CANPA and nearly always for ARATH (Figures in dropbox, CANPA table, ARATH table). Runs took ~0.5 min for CANPA on 4 CPUs and ~2 min for ARATH on 16 CPUs.

Important remarks:

  • Reduce your assembly before (fasta2homozygous.py) as any redundancy will likely break the synteny.
  • pyScaf works better with contigs than scaffolds, as scaffolds are often affected by mis-assemblies (no de novo assembler / scaffolder is perfect...), which breaks synteny.
  • pyScaf works very well if divergence between reference genome and assembled contigs is below 20% at nucleotide level.
  • pyScaf deals with large rearrangements ie. deletions, insertion, inversions, translocations. Note however, this is experimental implementation!
  • Consider closing gaps after scaffolding.

Given reference genome, the program generates pairwise genome alignment (dotplots) by default.

  • Genral options:

    -h, --help

    show this help message and exit

    -f FASTA, --fasta FASTA
     

    assembly FASTA file

    -o OUTPUT, --output OUTPUT
     

    output stream [scaffolds.fa]

    -t THREADS, --threads THREADS
     

    max no. of threads to run [4]

    --log LOG

    output log to [stderr]

    --dotplot

    generate dotplot as [png]

    --version

    show program's version number and exit

  • Reference-based scaffolding options:

    -r REF, --ref REF, --reference REF
     

    reference FastA file

    --identity IDENTITY
     

    min. identity [0.33]

    --overlap OVERLAP
     

    min. overlap [0.66]

    -g MAXGAP, --maxgap MAXGAP
     

    max. distance between adjacent contigs [0.01 * assembly_size]

    --norearrangements
     

    high identity mode (rearrangements not allowed)

  • Long read-based scaffolding options (EXPERIMENTAL!):

    -n LONGREADS, --longreads LONGREADS
     

    FastQ/FastA file(s) with PacBio/ONT reads

  • NGS-based scaffolding options (!NOT IMPLEMENTED!):

    -i FASTQ, --fastq FASTQ
     

    FASTQ PE/MP files

    -j JOINS, --joins JOINS
     

    min pairs to join contigs [5]

    -a LINKRATIO, --linkratio LINKRATIO
     

    max link ratio between two best contig pairs [0.7]

    -l LOAD, --load LOAD
     

    align subset of reads [0.2]

    -q MAPQ, --mapq MAPQ
     

    min mapping quality [10]

To perform reference-based assembly, provide assembled contigs and reference genome in FastA format. Dotplots of below runs can be found in docs. If you wish to skip dotplot generation (ie. no X11 on your system), provide --dotplot '' parameter.

# scaffold homogenised assembly (reduced contigs)
./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.fa

# scaffold reduced contigs using global mode (no norearrangements allowed)
./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.global.fa --norearrangements

# scaffold heterozygous assembly (de novo assembled contigs)
./pyScaf.py -f test/contigs.fa -r test/ref.fa -o test/contigs.ref.fa

# scaffold reduced contigs using long reads
## pacbio
./pyScaf.py -f test/contigs.reduced.fa -n test/pacbio.fq.gz -o test/contigs.reduced.pacbio.fa
## nanopore
./pyScaf.py -f test/contigs.reduced.fa -n test/nanopore.fa.gz -o test/contigs.reduced.nanopore.fa

# generate dotplot
lastdb test/ref.fa
lastal -f TAB test/ref.fa test/contigs.reduced.pacbio.fa | last-dotplot - test/contigs.reduced.pacbio.fa.ref.png
lastal -f TAB test/ref.fa test/contigs.reduced.nanopore.fa | last-dotplot - test/contigs.reduced.nanopore.fa.ref.png

# clean-up
#rm test/contigs.{,reduced.}fa.* test/ref.fa.* test/*.{nanopore,pacbio,ref}* test/*.log

pyScaf is under heavy development right now. Nevertheless, both the reference-based mode and long-read mode are functional and produces meaningful assemblies. pyScaf has been implemented in Redundans.

For more info, have a look in workbook.

pyscaf's People

Contributors

lpryszcz avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar

pyscaf's Issues

Issue with Pyscaf Conda

Hi,

I ran the following command:

pyScaf -f assembly.fa -o pyscaf.scaffolding -t 64 --log pyscaf.log --dotplot png -n pacbio_reads.fasta

It blocks and I receive the output as in the image.

Thanks,
Marco
image

Output to log

Hi,

I'm currently scaffolding an assembly with some Nanopore data, but it is somewhat difficult to figure out the status of this job. There is no output to a logfile or the screen, and no noticeable output to files in the last 6 days. 1) Is this normal? 2) Can future versions include this?

Thanks!

RuntimeError: maximum recursion depth exceeded !

The script terminated with an error message:
...
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 307, in _populate_scaffold
return self._populate_scaffold(links, end, sid, scaffold, orientations, gaps, orientation)
File "./pyScaf.py", line 279, in _populate_scaffold
links_sorted = sorted(links.iteritems(), key=lambda x: x[1][0], reverse=1)
RuntimeError: maximum recursion depth exceeded

Less BUSCO genes after scaffolding.

Hi,

I would just like to make a return on the scaffolding of my assembly (Sanger technology) with PacBio reads (30x coverage), by using pyScaf.

pyScaf is fast and generates interesting results in the first place. I went from 2,059 scaffolds to 1,344 scaffolds, which was encouraging. Then I launched BUSCO on both assemblies and got the following results :

95.6% of complete BUSCO genes for my assembly (before pyScaf) and 78.7% of complete BUSCO genes after pyScaf. Before scaffolding, I have 37 missing genes, after pyScaf I have 284 missing genes.

I launched pyScaf with these parameters :
pyScaf.py -f Scaffolds.fasta --identity 0.80 -o Scaffolds.pyScaf.fasta -t 10 --log pyScaf_run.log --longreads all_raw_reads.Pacbio.fasta

Maybe I have to change them ? Do you have any advice to me?

program seems stuck at alignment

Hi,

The program seems running, but stuck at the beginning of aligning contigs on reference, 12 hours no progress...

Any suggestion?

Thanks.

importerror: no module named fastaindex

Hello :
I had use conda install fastaindex; but I still got the eroor
My command is :
./pyScaf.py -f assembly.fasta -n long-read.fasta -o mapped@pyS

Is there any problems with this command?

No such file or directory error

Hi,

I'm trying to run the example commands and getting the following error:

./pyScaf.py -f test/contigs.reduced.fa -r test/ref.fa -o test/contigs.reduced.ref.fa
Options: Namespace(dotplot='png', fasta='test/contigs.reduced.fa', fastq=None, identity=0.33, joins=5, linkratio=0.7, load=0.2, log=<open file '<stderr>', mode 'w' at 0x7fbd138771e0>, longreads=None, mapq=10, maxgap=0, nofilter_overlaps=False, norearrangements=False, output=<open file 'test/contigs.reduced.ref.fa', mode 'w' at 0x259b4b0>, overlap=0.66, ref='test/ref.fa', threads=4)
 maxgap cut-off of 10000 bp

##################################################
[Tue Jan  9 16:18:31 2018][    9 Mb] Aligning contigs on reference...

##################################################
[Tue Jan  9 16:18:31 2018][    9 Mb] Saving dotplots to: test/contigs.reduced.fa.png
Traceback (most recent call last):
  File "./pyScaf.py", line 1294, in <module>
    main()
  File "./pyScaf.py", line 1286, in main
    s.save(out)
  File "./pyScaf.py", line 171, in save
    self._get_scaffolds()
  File "./pyScaf.py", line 1094, in _get_scaffolds
    t2hits, t2size = self._get_hits()
  File "./pyScaf.py", line 1035, in _get_hits
    dotplot = self.save_dotplot(self.genome, readstdin=True)
  File "./pyScaf.py", line 212, in save_dotplot
    proc = subprocess.Popen(args, stdin=subprocess.PIPE, stderr=self.log)
  File "/home/peter/.pyenv/versions/2.7.6/lib/python2.7/subprocess.py", line 709, in __init__
    errread, errwrite)
  File "/home/peter/.pyenv/versions/2.7.6/lib/python2.7/subprocess.py", line 1326, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

It doesn't seem to matter whether I build the software from github or use pip/bioconda for installation.

The computer is running Ubuntu 16.04 and python 2.7.6

Dotplot doesn't work

Hi Leszek,

I ran pyScaf to compare my genome with other closely related species. My main goal was to generate dotplots for them. I ran the command something like this:

python /shared/software/GIF/programs/pyscaf/2016-04-08/pyScaf.py \
    --fasta $GENOME \
    --reference $TARGET \
    --norearrangements \
    --threads 16 \
    --dotplot pdf \
    --log ${GENOME%.*}_pynast.log

The STDOUT showed that the job completed sucessfully:

Options: Namespace(dotplot='pdf', fasta='MaSuRCA_redundans_final.fasta', fastq=None, identity=0.33, joins=5, linkratio=0.7, load=0.2, log=<open file 'MaSuRCA_redundans_final_pynast.log', mode 'w' at 0x2ab415a29300>, longreads=None, mapq=10, maxgap=0, norearrangements=True, overlap=0.66, ref='Cyanidioschyzon_merolae.ASM9120v1.dna.toplevel.fa', threads=16)
 maxgap cut-off of 17978594 bp

##################################################
[Thu Nov 10 15:11:39 2016][   94 Mb] Aligning contigs on reference...

##################################################
[Thu Nov 10 15:11:39 2016][   94 Mb] Saving dotplots to: MaSuRCA_redundans_final.fasta.pdf
[WARNING] dotplot generation failed!

##################################################
[Thu Nov 10 15:19:26 2016][   94 Mb] Reporting scaffolds...
 1797859444 bp in 17040 scaffolds. Details in MaSuRCA_redundans_final.fasta.scaffolds.ref.fa.tsv
Scaffolds saved to: MaSuRCA_redundans_final.fasta.scaffolds.ref.fa

##################################################
[Thu Nov 10 15:22:17 2016][  118 Mb] Saving dotplots to: MaSuRCA_redundans_final.fasta.scaffolds.ref.fa.pdf

##################################################
[Thu Nov 10 15:22:18 2016][  118 Mb] Done!

except for the [WARNING] dotplot generation failed!.
Does this mean that I was missing some packages required to plot? Or is this the data which didn't produce any dotplot?

Thanks for the help! Have a great day!

Too large size of gaps got after long-reads scaffolding

I have get as long as ~500K bp gaps when using PacBio reads for scaffolding. However, the long reads are not longer than 80K. I wonder how such long gaps opened. There might be an error on calculating gap in the script at Line 802 (gap = dist - pos1 + pos2). As pos1 and pos2 should be the overhang of target2 and target1, the expression should be "gap = dist - (pos1 + pos2)" and "overhang = pos1 + pos2"?
Additionly, there might be some over-scaffolding that many contigs seemed with large overlap were linked directly (without any check such as whether the contigs overlapped actually). Here are head lines in file scaffolds.longreads.1.fa.tsv:

name size no. of contigs ordered contigs contig orientations (0-forward; 1-reverse) gap sizes (negative gap size = adjacent contigs are overlapping)

scaffold00001 37101958 21 tig00007288|arrow tig00000647|arrow tig00007807|arrow tig00007204|arrow tig00007202|arrow tig00007056|arrow tig00007054|arrow tig00007746|arrow tig00003676|arrow tig00007493|arrow tig00007487|arrow tig00000702|arrow tig00000363|arrow tig00007322|arrow tig00000346|arrow tig00000967|arrow tig00000110|arrow tig00001511|arrow tig00007211|arrow tig00000161|arrow tig00007841|arrow 0 1 1 1 1 1 1 0 1 0 0 1 1 0 1 0 1 0 1 1 1 -8365 -1532317 -1484 -774 -1085 -175 46 -166587 -571737 -1200624 1658 608 -102546 121 335 -571031 -1593587 -5191766 -1584315 645 0
scaffold00002 27485094 6 tig00000652|arrow tig00000682|arrow tig00006972|arrow tig00007248|arrow tig00006941|arrow tig00000050|arrow 1 1 0 0 0 0 -262 -1670 582026 -807298 -262598 0
scaffold00003 19217952 10 tig00007669|arrow tig00000631|arrow tig00007413|arrow tig00007205|arrow tig00007206|arrow tig00007557|arrow tig00000626|arrow tig00007674|arrow tig00007220|arrow tig00006956|arrow 1 1 0 0 0 1 0 1 1 0 -6695 -918253 1084 1152 -2125124 -649917 -444017 -75582 -695 0

Here is the command:
redundans.py -f $SEQS -o nonredundant5 -l $LONGREADS -identity 0.9 --overlap 0.9 --minLength 200 -t 30 --resume -v
The '-f' is all contigs assembled by CANU without any gap and the '-l' is corrected Pacbio reads by CANU.
Before scaffolding, the ratio of completed BUSCOs is ~92%, and that is ~90% after scaffolding.

[WARNING] Unsupported sequence format

Hi, I do not know if it is a silly error... but I am trying to run pyScaf to do scaffolding in my Canu (Pabio) assembly with my nanopore reads.

this is my .sh

pyScaf -f /dataX/sami/genomes/p_ulei_polished.fasta -o p_ulei_scaf_nano --dotplot pdf -n /data1/sami/rawdata/nanopore/nanopore_all.fastq

the less slurm-XX.out is

Options: Namespace(dotplot='pdf', fasta='/dataX/sami/genomes/p_ulei_polished.fasta', fastq=None, identity=0.33, joins=5, linkratio=0.7, load=0.2, log=<open file '', mode 'w' at 0x7f1660bc61e0>, longreads=['/data1/sami/rawdata/nanopore/nanopore_all.fastq'], mapq=10, maxgap=0, norearrangements=False, output=<open file 'p_ulei_scaf_nano', mode 'w' at 0x7f165a43f6f0>, overlap=0.66, ref='', threads=4)
maxgap cut-off of 936958 bp

##################################################
[Tue Oct 8 17:05:21 2019][ 7 Mb] Aligning long reads on contigs...
File "/home/sami/anaconda2/bin/maf-convert", line 136
print(line, end="")
^
SyntaxError: invalid syntax
0 0

##################################################
[Tue Oct 8 17:05:21 2019][ 7 Mb] Reporting scaffolds...
93695813 bp in 230 scaffolds. Details in p_ulei_scaf_nano.tsv
Scaffolds saved to: p_ulei_scaf_nano

##################################################
[Tue Oct 8 17:05:21 2019][ 56 Mb] Saving dotplots to: p_ulei_scaf_nano.pdf
[WARNING] Unsupported sequence format p_ulei_scaf_nano in /data1/sami/rawdata/nanopore/nanopore_all.fastq

The program is not running. Can you advise me something? my nanopore reads are not compressed. We install both pyScaf and Dependencies with bioconda.

Thanks a miilion

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.