vishnubob / ssw

Python interface for SIMD Smith-Waterman Library

License: Other

Python 34.30% C 65.70%

ssw's Introduction

SSW: A Python Wrapper for the SIMD Smith-Waterman

Overview

SSW is a fast implementation of the Smith-Waterman algorithm that uses Single-Instruction Multiple-Data (SIMD) instructions to parallelize the algorithm at the CPU level. This repository wraps the SSW library in an easy-to-install, high-level Python interface with no external library dependencies.

The SSW library was written by Mengyao Zhao and Wan-Ping Lee; this Python interface is maintained by Giles Hall.

Installation

To install the SSW Python package, use pip:

$ pip install ssw

Example Usage

import ssw
aligner = ssw.Aligner()
alignment = aligner.align(reference="ACGTGAGAATTATGGCGCTGTGATT", query="ACGTGAGAATTATGCGCTGTGATT")
print(alignment.alignment_report())
Score = 45, Matches = 24, Mismatches = 0, Insertions = 0, Deletions = 1

ref   1   ACGTGAGAATTATGGCGCTGTGATT
          ||||||||||||| |||||||||||
query 1   ACGTGAGAATTAT-GCGCTGTGATT

ssw's Issues

Speed enhancement

Hi again,

I'm currently computing a large number of alignments with your library. I profiled my code and noticed that there might be room for a speed improvement without significant rewriting. Judging from the profiling stats (attached below), the alignment parser sswobj.py:149(alignment) and the conversion of sequences to ints in convert_sequence_to_ints account for nearly all of the overhead.

My intuition tells me that at least the time spent in the alignment function can be greatly reduced, e.g. by omitting sswobj.py:153(getseq) completely. Maybe r_seq and q_seq can be made strings immediately and operated on?
Maybe some of the .upper() calls can also be skipped, or made once for an entire string? I see a huge number of calls to upper() for my instance: 309373840, for only 88409 calls to align.
I have no idea yet why convert_sequence_to_ints takes the time it does, but my intuition tells me that the time spent in it can be reduced :)

I will eventually look into this myself, but thought I should let you know. Maybe you have some quick insights?

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      2/1    0.097    0.048  491.491  491.491 runHIT:30(<module>)
        1    0.040    0.040  490.521  490.521 runHIT:74(get_alignment_information)
        1    2.234    2.234  480.330  480.330 transcript_module.py:155(get_alignments_from_SW)
    88409    0.418    0.000  285.299    0.003 sswobj.py:90(align)
    88410  168.563    0.002  284.786    0.003 sswobj.py:107(_align)
    88412   52.584    0.001  158.666    0.002 sswobj.py:149(alignment)
   176820   54.718    0.000  113.293    0.001 sswobj.py:61(convert_sequence_to_ints)
  4601464   28.244    0.000   76.336    0.000 sswobj.py:153(getseq)
156490544   45.569    0.000   58.548    0.000 sswobj.py:63(<genexpr>)
160414072   35.748    0.000   48.092    0.000 sswobj.py:154(<genexpr>)
    88409    0.499    0.000   33.352    0.000 collections.py:441(__init__)
    88409   26.037    0.000   32.853    0.000 collections.py:504(update)
309373840   25.654    0.000   25.654    0.000 {method 'upper' of 'str' objects}
155812906   12.345    0.000   12.348    0.000 {next}
  3126588    7.235    0.000   10.923    0.000 sswobj.py:131(iter_cigar)
.....
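
The idea raised in this issue (upper-casing once per string instead of once per symbol, and avoiding a Python-level lookup per symbol) can be sketched in plain Python 3. This is only an illustration and not the code in sswobj.py: the symbol map, the sentinel value, and the function names below are assumptions made for the example.

```python
# Hypothetical symbol-to-integer mapping; the real one is built by ssw's score matrix.
SYMBOL_MAP = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}
UNKNOWN = 255  # assumed sentinel marking symbols outside the alphabet


def convert_per_symbol(seq, symbol_map=SYMBOL_MAP):
    """Baseline resembling the profiled pattern: upper() and a dict lookup per symbol."""
    return [symbol_map[symbol.upper()] for symbol in seq]


def build_translation_table(symbol_map=SYMBOL_MAP):
    """Build a 256-byte table once; upper- and lower-case symbols share one code."""
    table = bytearray([UNKNOWN]) * 256
    for symbol, code in symbol_map.items():
        table[ord(symbol.upper())] = code
        table[ord(symbol.lower())] = code
    return bytes(table)


TABLE = build_translation_table()


def convert_with_table(seq, table=TABLE):
    """Convert the whole sequence in one bytes.translate pass."""
    encoded = seq.encode("ascii").translate(table)
    bad = encoded.find(bytes([UNKNOWN]))
    if bad != -1:
        # Mirror the dict-based behaviour: report the first unsupported symbol.
        raise KeyError(seq[bad])
    return encoded
```

Both functions yield the same integer codes for valid input; whether the table-based variant is actually faster in a given workload would need to be measured, for example with timeit.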

Segmentation fault 11

Hi,
The latest code gives me a segmentation fault for the following instance:

Python 2.7.9 (default, Dec  1 2015, 18:18:28) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ssw
>>> score_matrix = ssw.DNA_ScoreMatrix(match=2, mismatch=-1)
>>> aligner = ssw.Aligner(gap_open=1, gap_extend=1)
>>> query = "GGTAGTCATGAGTCGACACTAAGGGCTCGCTGACCTACCGGGTGCCAGAGAGGCTGCGGCAGGGTTTCTGTGGCGTGGGTCGGGCAGCACAGGCCTTGGTGTGTGCGAGTGCCAAGGAGGGCACCGCCTTCAGGATGGAGGCTGTGCAGGAGGGGCGGCCGGGGTGGAGAGTGAGCAGGCGGCTTTGGGGAGGAGGCGGTGCTGCTGTTGGATGACATAATGGCGGAGGTGGAGGTGGTGGCGGAGGAGGAGGGCCTCGTGGAGCGGCGGGAGGAGGCCCAGCGGGCACAGCAGGCTGTGCCTGGCCCTGGGCCCATGACCCCAGAGTCTGCACCGGAGGAGCTGCTGGCCGTTCAGGTGGAGCTGGAGCCGGTTAATGCCCAAGCCAGGAAGGCCTTTTCTCGGCAGCGGGAAAAGATGGAGCGGAGGCGCAAGCCCCACCTAGACCGCAGAGGCGCCGTCATCCAGAGCGTCCCTGGCTTCTGGGCCAATGTTATTGCAAACCACCCCAGATGTCAGCCCTGATCACTGACGAAGATGAAGACATGCTGAGCTACATGGTCAGCCTGGAGGTGGAAGAAGAGAAGCATCCTGTTCATCTCTGCAAGATCATGTTGTTCTTTCGGAGTAACCCTACTTCCAGAATAAAGTGATTACCAAGGAATATCTGGTGAACATCACAGAATACAGGGCTTCTCATTCCACTCCAATTGAGTGGTATCCGGATTATGAAGTGGAGGCCTATCGCCGCAGACACCACAACAGCAGCCTTAACTTCTTCAACTGGTTCTCTGACCACAACTTCGCAGGATCTAACAAGATTGCTGAGGCCTTTCCCGATTGAGTCCCCTGACAGATCCTATGTAAGGACCTGTGGCGCAATCCCCTGCAATACTACAAGAGGATGAAGCCACCTGAAGAGGGAACAGAGACGTCAGGGGACTCCCAGTTGTTGAGTTGACGTGTGCATAGATCGCGATGG"
>>> reference = "GGTAGGCGCTCTGTGTGCAGCAGGGCTCGCTGACCTACCGGGTGCCAGAGAGGCTGCGGCAGGGTTCTGTGGCGTGGGTCGGGCAGCACAGGCCTTGGTGTGTGCGAGTGCCAAGGAGGCACCGCTTCAGGATGGAGGCTGTGCAGGAGGGGCGGCCGGGTGGAGAGTGAGCAGGCGGCTTTGGGGAGGAGGCGGTGCTGCTGTTGGATGACATAATGGCGGAGGTGGAGGTGGTGGTGCGGAGGAGGAGGGCCTCGTGGAGCGGCGGGAGGAGGCCCAGCGGGCACAGCAGGCTGTGCCTGGCCTGGCCATGACCCAGAGTCTGCACTGGAGGAGCTGCTGGCCGTTCAGGTGGAGCTGGAGCCGGTTAATGCCCAAGCCAGGAAGGCCTTTTCTCGGCAGCGGGAAAAGATGGAGCGGAGGCGCAAGCCCCACTAGACCGCAGAGGCGCCGTCATCCAGAGCGTCCCTGGCTTCTGGCCAATGTTATTGCAAACCACCCCAGATGTCAGCCCTGATCACTGACGAAGATGAAGACATGCTGAGCTACATGGTCAGCCTGGAGGTGGAGAAGAGAAGCATCTGTTCATCTCTGCAAGATCATGTTGTTCTTTCGGAGTAACCCTACTTCCAGAATAAAGTGATTACCAAGGAATATCTGGTGAACATCACAGAATAAGATGGGCTTCATCATTCCACTCCAATTCTGAGTAGGCTCATCCTCCAAGTGACGATTATGAAGTGGAGGCCTATCGCCGCAGACACCACAACAGCAGCCTTAACTTCTTCAACTGGTTCTCTGACCACAACTTCGCAGGATCTAACAAGATTGCTGAGATCCTATGTAAGGACCTGTGGCGCAATCCCCTGCAATACTACAAGAGGATGAAGCCACCTGAAGAGGGAACAGAGACGTCAGGGGACTCCCAGTTGTTGAGTTGAAGAGTACTACATATGAGATGG"
>>> alignment = aligner.align(query, reference)
Segmentation fault: 11

Interestingly, changing e.g. gap_open to 3 in the code above makes it complete without error.

Consider uploading ssw to PyPI?

Hi,

I want to include this library as a dependency in my code. Could you consider uploading a version to PyPI every time you make an update of relatively significant importance?

It would be great, since then we could just issue pip install ssw (and it would also be easier to declare the dependency on this library in the setup.py script when pip installs my software).
Another big advantage is that one could run pip uninstall or pip install --upgrade on ssw. Currently I need to go in and remove the files manually whenever I need to install a new update of ssw (at least I don't know of any other way to remove the library before installing a new version).

Kindly,
Kristoffer

Key Error in sswobj.py

I am using a program (Bamsplit) that relies on ssw to run, and get the following SSW error:

File "[...]/bamsplit-master/bamsplit.py", line 266, in <module>
    main(parsed)
  File "[...]/bamsplit-master/bamsplit.py", line 235, in main
    run_bamsplit(ref, bam_in, vcf, vcf.header.samples[0], bams_out, bed_out, region)
  File "[...]Software/bamsplit-master/bamsplit.py", line 225, in run_bamsplit
    split_contig(contig, ref, bam_in, vcf, sample, bams_out, bed_out)
  File "[...]/Software/bamsplit-master/bamsplit.py", line 211, in split_contig
    last_read = split(phased_sites, ref, bam_iter, sample, bams_out, bed_out, last_read)
  File "[...]/Software/bamsplit-master/bamsplit.py", line 194, in split
    scores = calculate_alignment_scores(read.query_sequence, genotype)
  File "[...]/Software/bamsplit-master/bamsplit.py", line 159, in calculate_alignment_scores
    return [calculate_alignment_score(read, haplotype) for haplotype in genotype]
  File "[...]/Software/bamsplit-master/bamsplit.py", line 159, in <listcomp>
    return [calculate_alignment_score(read, haplotype) for haplotype in genotype]
  File "[...]/Software/bamsplit-master/bamsplit.py", line 155, in calculate_alignment_score
    alignment = aligner.align(reference=haplotype, query=read)
  File "[...]/.local/lib/python3.7/site-packages/ssw-0.3.1-py3.7-linux-x86_64.egg/ssw/sswobj.py", line 99, in align
    res = self._align(query, reference, flags, filter_score, filter_distance, mask_length)
  File "[...].local/lib/python3.7/site-packages/ssw-0.3.1-py3.7-linux-x86_64.egg/ssw/sswobj.py", line 109, in _align
    _reference = self.matrix.convert_sequence_to_ints(reference)
  File "[...]/.local/lib/python3.7/site-packages/ssw-0.3.1-py3.7-linux-x86_64.egg/ssw/sswobj.py", line 64, in convert_sequence_to_ints
    return _seq_type(*seq_generator)
  File "[...]/.local/lib/python3.7/site-packages/ssw-0.3.1-py3.7-linux-x86_64.egg/ssw/sswobj.py", line 63, in <genexpr>
    seq_generator = (self.symbol_map[symbol.upper()] for symbol in seq)
KeyError: '<'

I thought it might be a problem with "<" not being able to pass through symbol.upper(), so I tried manually removing all .upper() calls in the sswobj.py file, but that did not fix the problem. Any idea what the issue is?
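
In the traceback, the KeyError is raised by the dictionary lookup self.symbol_map[...], not by .upper(), which would explain why removing the .upper() calls made no difference: '<' is simply not a key in the symbol map. A pre-check along the following lines can locate the offending characters before align is called. This is a minimal sketch; the alphabet "ACGTN" and the function name are assumptions for illustration, and the symbols actually accepted depend on the score matrix in use.

```python
def find_unsupported_symbols(seq, alphabet="ACGTN"):
    """Return (index, symbol) pairs for every symbol outside the assumed alphabet."""
    allowed = set(alphabet.upper())
    return [(i, s) for i, s in enumerate(seq) if s.upper() not in allowed]


# Made-up haplotype string containing a bracketed token, for illustration only.
haplotype = "ACGT<DEL>ACGT"
problems = find_unsupported_symbols(haplotype)
if problems:
    print("unsupported symbols:", problems)
```

One possible source of a '<' character, offered only as a guess, is a haplotype string built from a symbolic VCF allele such as <DEL>; checking the sequences passed to align would confirm or rule this out.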

error LNK2001: unresolved external symbol pyinit_libssw

Hi,
When I run pip install ssw, I get an error: error LNK2001: unresolved external symbol pyinit_libssw

LINK : error LNK2001: unresolved external symbol PyInit__libssw
build\temp.win-amd64-3.5\Release\src/ssw\_libssw.cp35-win_amd64.lib : fatal error LNK1120: 1 unresolved externals
error: command 'C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC\BIN\x86_amd64\link.exe' failed with exit status 1120

My Python version is 3.6 or 3.5.

Thank you very much.

Include some more examples with other features in the README? (Enhancement)

Would it be possible to include in the README.md a set of examples showing how other features of the ssw functionality can be used through this wrapper? E.g., the library allows multiple query and target sequences to be used in a single function call: "The input files can be in FASTA or FASTQ format. Both target and query files can contain multiple sequences. Each sequence in the query file will be aligned with all sequences in the target file. If your target file has N sequences and your query file has M sequences, the results will have M*N alignments."
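
Assuming the wrapper aligns one query against one reference per call, as in the README example, the all-against-all behaviour quoted above can be approximated with a loop over every query/target pair. The sketch below takes the alignment function as a parameter so that it does not depend on any particular API; align_all_pairs and toy_score are hypothetical names used for illustration only.

```python
from itertools import product


def align_all_pairs(queries, targets, align_fn):
    """Call align_fn(query, target) for every pair: M queries x N targets = M*N results."""
    results = {}
    for (q_name, q_seq), (t_name, t_seq) in product(queries.items(), targets.items()):
        results[(q_name, t_name)] = align_fn(q_seq, t_seq)
    return results


def toy_score(query, target):
    """Stand-in for a real aligner: counts identical positions, with no gaps."""
    return sum(1 for q, t in zip(query, target) if q == t)


queries = {"q1": "ACGT", "q2": "ACGA"}
targets = {"t1": "ACGT", "t2": "TTTT", "t3": "ACGA"}
scores = align_all_pairs(queries, targets, toy_score)
print(len(scores))  # 2 queries x 3 targets
```

With the wrapper installed, a callable based on the README call, such as lambda q, t: aligner.align(reference=t, query=q), could presumably be passed as align_fn in place of toy_score.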

Also, can you demonstrate how the score penalties can be set when performing an alignment?

Inconsistent alignment to alignment score

Hi,

I'm providing an example with an inconsistent alignment score and the actual alignment below (Original post from mengyao/Complete-Striped-Smith-Waterman-Library#29):

client-104-39-79-38:workspace kxs624$ python
Python 2.7.9 (default, Dec  1 2015, 18:18:28) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import ssw
>>> ref_seq = "CCC" + "AGCT"*10
>>> query_seq = "AGGT"*10
>>> aligner = ssw.Aligner(ref_seq, gap_open=1, gap_extend=1)
>>> alignment = aligner.align(query_seq)
>>> print(alignment.alignment_report)
Score = 40, Matches = 0, Mismatches = 38, Insertions = 0, Deletions = 0

ref   4   CCCAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGC
          **************************************
query 1   AGGTAGGTAGGTAGGTAGGTAGGTAGGTAGGTAGGTAG

For clarity, below is what I expect the result to look like:

CCCAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCTAGCT
   ||*|||*|||*|||*|||*|||*|||*|||*|||*|||*|
---AGGTAGGTAGGTAGGTAGGTAGGTAGGTAGGTAGGTAGGT

I see that you have removed the match and mismatch options. What are these penalties set to, as the score is now 40?
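
One way to sanity-check a reported score against an alignment is to recompute the score from the aligned strings under assumed penalties. The sketch below does this for the aligned part of the expected result above. The function name and the penalty values (match=2, mismatch=-2, with gap_open charged for the first position of a gap run and gap_extend for each following position) are assumptions for illustration and are not taken from the wrapper.

```python
def score_aligned(ref_aln, query_aln, match=2, mismatch=-2, gap_open=1, gap_extend=1):
    """Score two equal-length aligned strings, where '-' marks a gap position."""
    if len(ref_aln) != len(query_aln):
        raise ValueError("aligned strings must have equal length")
    score = 0
    in_gap = False
    for r, q in zip(ref_aln, query_aln):
        if r == "-" or q == "-":
            # Assumed convention: the first gap position costs gap_open, later ones gap_extend.
            score -= gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if r == q else mismatch
            in_gap = False
    return score


ref_aln = "AGCT" * 10    # aligned part of the reference (leading CCC excluded)
query_aln = "AGGT" * 10  # 30 matches and 10 mismatches against ref_aln
print(score_aligned(ref_aln, query_aln))               # 30*2 - 10*2 = 40
print(score_aligned(ref_aln, query_aln, mismatch=-1))  # 30*2 - 10*1 = 50
```

Under the assumed match=2 and mismatch=-2, the expected alignment scores 40, the same number as the reported score. That coincidence does not confirm which penalties the wrapper uses internally.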

How are gaps to be represented (formatting) in the strings? ('-' creates an error)

Using '-', which is commonly used to represent a gap, creates an error. It is in my data, and it is understandable if there is no point in keeping the '-', since gaps will be introduced later by the algorithm, but I just want to know whether there is another convention.

>>> alignment = aligner.align(reference="ACGTGAGAATTATGGCGCTGTGATT", query="ACGTGGAATTATGCGCTGGATT--")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/cool/.local/lib/python2.7/site-packages/ssw/sswobj.py", line 112, in align
    res = self._align(query, reference, flags, filter_score, filter_distance, mask_length)
  File "/home/cool/.local/lib/python2.7/site-packages/ssw/sswobj.py", line 121, in _align
    _query = self.matrix.convert_sequence_to_ints(query)
  File "/home/cool/.local/lib/python2.7/site-packages/ssw/sswobj.py", line 70, in convert_sequence_to_ints
    _seq_instance[idx] = self.symbol_map[symbol]
KeyError: '-'
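
The traceback shows the KeyError coming from the symbol_map lookup, so '-' is not a symbol the score matrix knows about. Since the algorithm introduces its own gaps, one simple workaround is to strip gap characters from the input before aligning. The helper below is a hypothetical sketch and not part of ssw; the set of gap characters is an assumption.

```python
def strip_gaps(seq, gap_chars="-."):
    """Return seq with the assumed gap characters removed in one str.translate pass."""
    return seq.translate({ord(c): None for c in gap_chars})


query = strip_gaps("ACGTGGAATTATGCGCTGGATT--")
print(query)  # ACGTGGAATTATGCGCTGGATT
```

The cleaned string can then be passed as the query argument in place of the original.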