
seq's Introduction

Work on the Seq compiler is being continued in Codon, a general, extensible, high-performance Python compiler. Seq's bioinformatics libraries, features and optimizations are still available and being maintained as a plugin for Codon.


Seq

Seq — a language for bioinformatics


Introduction

A strongly-typed and statically-compiled high-performance Pythonic language!

Seq is a programming language for computational genomics and bioinformatics. With a Python-compatible syntax and a host of domain-specific features and optimizations, Seq makes writing high-performance genomics software as easy as writing Python code, and achieves performance comparable to (and in many cases better than) C/C++.

Think of Seq as a strongly-typed and statically-compiled Python: all the bells and whistles of Python, boosted with a strong type system, without any performance overhead.

Seq is able to outperform Python code by up to 160x. Seq can further beat equivalent C/C++ code by up to 2x without any manual interventions, and also natively supports parallelism out of the box. Implementation details and benchmarks are discussed in our paper.

Learn more by following the tutorial or from the cookbook.

Examples

Seq is a Python-compatible language, and many Python programs should work with few if any modifications:

def fib(n):
    a, b = 0, 1
    while a < n:
        print(a, end=' ')
        a, b = b, a+b
    print()
fib(1000)

This prime counting example showcases Seq's OpenMP support, enabled with the addition of one line. The @par annotation tells the compiler to parallelize the following for-loop, in this case using a dynamic schedule, chunk size of 100, and 16 threads.

from sys import argv

def is_prime(n):
    factors = 0
    for i in range(2, n):
        if n % i == 0:
            factors += 1
    return factors == 0

limit = int(argv[1])
total = 0

@par(schedule='dynamic', chunk_size=100, num_threads=16)
for i in range(2, limit):
    if is_prime(i):
        total += 1

print(total)

Here is an example showcasing some of Seq's bioinformatics features, which include native sequence and k-mer types.

from bio import *
s = s'ACGTACGT'     # sequence literal
print(s[2:5])       # subsequence
print(~s)           # reverse complement
kmer = Kmer[8](s)   # convert to k-mer

# iterate over length-3 subsequences
# with step 2
for sub in s.split(3, step=2):
    print(sub[-1])  # last base

    # iterate over 2-mers with step 1
    for kmer in sub.kmers(step=1, k=2):
        print(~kmer)  # '~' also works on k-mers

Install

Pre-built binaries

Pre-built binaries for Linux and macOS on x86_64 are available alongside each release. We also have a script for downloading and installing pre-built versions:

/bin/bash -c "$(curl -fsSL https://seq-lang.org/install.sh)"

Build from source

See Building from Source.

Documentation

Please check docs.seq-lang.org for in-depth documentation.

Citing Seq

If you use Seq in your research, please cite:

Ariya Shajii, Ibrahim Numanagić, Riyadh Baghdadi, Bonnie Berger, and Saman Amarasinghe. 2019. Seq: a high-performance language for bioinformatics. Proc. ACM Program. Lang. 3, OOPSLA, Article 125 (October 2019), 29 pages. DOI: https://doi.org/10.1145/3360551

BibTeX:

@article{Shajii:2019:SHL:3366395.3360551,
 author = {Shajii, Ariya and Numanagi\'{c}, Ibrahim and Baghdadi, Riyadh and Berger, Bonnie and Amarasinghe, Saman},
 title = {Seq: A High-performance Language for Bioinformatics},
 journal = {Proc. ACM Program. Lang.},
 issue_date = {October 2019},
 volume = {3},
 number = {OOPSLA},
 month = oct,
 year = {2019},
 issn = {2475-1421},
 pages = {125:1--125:29},
 articleno = {125},
 numpages = {29},
 url = {http://doi.acm.org/10.1145/3360551},
 doi = {10.1145/3360551},
 acmid = {3360551},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {Python, bioinformatics, computational biology, domain-specific language, optimization, programming language},
}

seq's People

Contributors

arshajii, ghuls, glram, inumanag, jodiew, jordanwatson1, markhend

seq's Issues

Pythonic Import

Provide proper Pythonic import syntax:

  • import x
  • from x import y
  • from x import *

Standard library TODO

Hi Jordan,

let's start with the string library:
https://docs.python.org/3/library/stdtypes.html#textseq

You can avoid string.format part for now.

Some functions (like str.split) are implemented in multiple places: for example, split on a single character is different than a split that operates on multi-character patterns.

Also, for each stdlib file, add the docs and implement a test suite as follows. For str.seq, add a test_str.seq file that tests each function, e.g.:

str.seq:

class str:
   def isspace(self: str) -> bool:
      """
      Doc
      """
      ...

test_str.seq:

def test_isspace():
    assert ' '.isspace() == True
    assert 'x'.isspace() == False
    # ... etc

Please check the function once done:

  • str.capitalize()
  • str.casefold()
  • str.center(width[, fillchar])
  • str.count(sub[, start[, end]])
  • str.endswith(suffix[, start[, end]])
  • str.expandtabs(tabsize=8)
  • str.find(sub[, start[, end]])
  • str.format(*args, **kwargs)
  • str.index(sub[, start[, end]])
  • str.isalnum()
  • str.isalpha()
  • str.isascii()
  • str.isdecimal()
  • str.isdigit()
  • str.isidentifier()
  • str.islower()
  • str.isnumeric()
  • str.isprintable()
  • str.isspace()
  • str.istitle()
  • str.isupper()
  • str.join(iterable)
  • str.ljust(width[, fillchar])
  • str.lower()
  • str.lstrip([chars])
  • str.partition(sep)
  • str.replace(old, new[, count])
  • str.rfind(sub[, start[, end]])
  • str.rindex(sub[, start[, end]])
  • str.rjust(width[, fillchar])
  • str.rpartition(sep)
  • str.rsplit(sep=None, maxsplit=-1)
  • str.rstrip([chars])
  • str.split(sep=None, maxsplit=-1)
  • str.splitlines([keepends])
  • str.startswith(prefix[, start[, end]])
  • str.strip([chars])
  • str.swapcase()
  • str.title()
  • str.translate(table)
  • str.upper()
  • str.zfill(width)

Language features

Remaining items (higher priority is at the top):

  • Lexer
    • Fix escapes ("\n")
  • Parser & AST
    • [hard] Better error reporting (i.e. no more "Menhir error")
  • Compiler
    • lambdas
    • Generics in OCaml
    • import from, import scoping
    • [hard] Extern support for Python and R
    • [hard] macros via sexps
  • Seq/LLVM
    • None for pointers
    • is/is not for pointers (e.g. if self.root is None)
    • binding & collecting (monadic >== and maybe <|> for collecting?)
    • [hard?] ADTs OR exceptions (i.e. how to indicate that an item is not found?)
    • lambdas
    • named and optional arguments
    • comprehensions
    • Tuple unpacking & assignment for full Python support
      • a, b = f(x)
      • t = (1, 2); f(*t) (star-unpacking)
      • f[1, 2]
    • More complex indexing (e.g. step as in [1:2:3], reverse [::-1])
      • numpy/MATLAB indexing [1,2], [:,3:4]
      • Generalized slice object for indexing
    • global statement
    • assert statement
    • decorators

Named arguments

Support named arguments:

  • named arguments
  • default arguments

(Work is in progress on the ocaml-generics branch.)

Better k-mer support

  • Match on k-mers (should work exactly like sequence match)
  • k-mer indexing (e.g. k[3], k[:4])
  • Building new k-mers from existing ones (e.g. swap a base)

Roadmap: document all Python standard library modules

It would be helpful to have a complete list of all standard Python modules and a guide as to which ones are supported, in progress, going to be supported, or not going to be supported ever. For example, is the threading module ever going to be supported? Will threading be handled with Seq's pipelines only? Will the multiprocessing module be supported?

Having this list will encourage people to help your efforts if it looks like most of what they need is going to eventually be part of Seq, and will also help them decide whether Seq will never be what they need.

"with" statement

with can be transformed into try...finally at the compiler level:

with <expr> as v:
    <block>

can be converted to:

# new scope so `v` is not accessible outside `with` body
v = <expr>
v.__enter__()
try:
    <block>
finally:
    v.__exit__()

withs with multiple clauses can be converted to single-clause nested withs:

with <e1> as v1, <e2> as v2, ..., <eN> as vN:
    <block>

becomes

with <e1> as v1:
    with <e2> as v2:
        ...
            with <eN> as vN:
                <block>

before applying the preceding transformation.
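
A minimal Python sketch of the two rewrites above, for illustration only. The Resource class and the log list are hypothetical names introduced here. Note that CPython additionally binds the result of __enter__() to the target and passes exception details to __exit__, which the simplified rewrite in this issue leaves out.

```python
# Illustrative sketch only: `Resource` and `log` are hypothetical names.
log = []

class Resource:
    def __init__(self, name):
        self.name = name

    def __enter__(self):
        log.append("enter " + self.name)
        return self

    def __exit__(self, *exc_info):
        log.append("exit " + self.name)
        return False  # do not swallow exceptions

def with_multi():
    # Original form: a single `with` with two clauses.
    with Resource("a") as v1, Resource("b") as v2:
        log.append("body")

def desugared():
    # Step 1: split into nested single-clause withs.
    # Step 2: rewrite each one as try/finally, as proposed above.
    v1 = Resource("a")
    v1.__enter__()
    try:
        v2 = Resource("b")
        v2.__enter__()
        try:
            log.append("body")
        finally:
            v2.__exit__(None, None, None)
    finally:
        v1.__exit__(None, None, None)

def trace(fn):
    # Run `fn` and return the sequence of enter/body/exit events it logged.
    log.clear()
    fn()
    return list(log)
```

Both forms produce the same event order: resources are entered left to right and exited in reverse.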

Multi-assignment error

The following code:

a, b = 0, 0
x, y, z = 0, 0, 0

This fails with "tuple index 2 out of bounds (len: 2)" on line 2. Commenting out either line avoids the error, so this is presumably a parsing issue.

Int / Uint / Kmer bounds error

Creating Int, UInt or Kmer types with an out-of-bounds width/length causes an uncaught C++ exception. This exception should be caught on the OCaml side. Example:

Int[-1](0)

Pattern matching

Typical pattern matching from e.g. OCaml or Rust should be fairly easy to implement.

More interesting is pattern matching for sequence types. Many sequence computations can potentially be expressed with this. For example, a simple hash function:

fun hash(s: Seq) -> Int:
  s match A t => 0 + 4*hash(t)
        | C t => 1 + 4*hash(t)
        | G t => 2 + 4*hash(t)
        | T t => 3 + 4*hash(t)
        | _   => 0
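
For comparison, the same recursion can be written in plain Python without pattern matching. This is only a sketch of what the proposed match expression computes. The names BASE_CODE and seq_hash are introduced here for illustration, and treating any unrecognized base like the `_` case is an assumption.

```python
# Illustrative sketch: BASE_CODE and seq_hash are hypothetical names.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def seq_hash(s: str) -> int:
    # Empty sequence or unrecognized leading base: the `_ => 0` case.
    if not s or s[0] not in BASE_CODE:
        return 0
    head, tail = s[0], s[1:]
    # Matches e.g. `C t => 1 + 4*hash(t)`.
    return BASE_CODE[head] + 4 * seq_hash(tail)
```

For example, seq_hash("ACGT") evaluates to 0 + 4*(1 + 4*(2 + 4*3)) = 228.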

Nested try-except

Nested try-except is currently not supported. This issue is for adding nested try-except support.

Idea: interface file for Python C libs

I ran a quick 10M {int:int} dict benchmark with Python 2.7 and Seq and was quite impressed. It's posted on your Hacker News announcement. The Python version used 1.1 GB and 8 seconds, Seq used 395 MB and 5 seconds. I didn't add any type info, just changed one line that created an empty dict. Congratulations!

I did some poking around and it looks like supporting the Python stdlib requires porting all the C code to Seq. I also saw that Seq can import regular Python code with pyimport. What I was wondering is whether it would be possible, or would make sense, to allow importing Cython C extensions by providing an interface spec file, like os.pyi for example. This spec file would document the types, classes, return values, etc. and then allow a Seq program to use the standard Python built-ins without a rewrite.

There would have to be a wrapper generated for each function called, or even multiple wrappers if different types are used. The wrapper would have to create CPython objects for each argument, take the GIL, call the C extension, and then convert any returned values from CPython objects to Seq objects. The interface spec file would have to document not only the types, but also which arguments might be modified. For example, passing a list to a C extension might cause the list to be modified (sort, for example), or the list might not be modified, e.g. len(list).

Double quotes inside multiline strings fail to parse

This works:

'''
'a'
'''

This fails:

'''
"a"
'''

with error

Uncaught exception:

  (Scanf.Scan_failure
   "scanf: bad input at char number 3: end of input not found")

Raised at file "scanf.ml" (inlined), line 444, characters 18-40
Called from file "scanf.ml", line 1164, characters 4-75
Called from file "seqaml/lexer.mll", line 52, characters 25-33
Called from file "grammar.ml", line 23209, characters 15-27
Called from file "grammar.ml", line 23236, characters 24-51
Called from file "seqaml/codegen.ml", line 17, characters 14-56
Called from file "runner.ml", line 15, characters 12-77
Called from file "runner.ml", line 32, characters 4-28
Called from file "runner.ml", line 61, characters 19-43

str.isspace() illegal escape

When testing str.isspace(), an uncaught exception appeared for '\v' and '\f'.
Please see the image below for reference.

[Screenshot: Screen Shot 2019-09-11 at 1 45 01 PM]

Algebraic Data Types

Support basic ADTs, e.g.

type Foo = A | B(foo: int) | C(bar: str, i: int) | D

y = Foo.A
match y:
   case B(foo): ...

Closures

Provide full support for closures:

  • Lambdas
  • Nested functions
  • Decorators

Uncaught OCaml exception on misplaced quotes

Only one character is needed to reproduce this:

'

Output:

Uncaught exception:

  (Failure "lexing: empty token")

Raised by primitive operation at file "lexing.ml", line 65, characters 15-37
Called from file "seqaml/lexer.ml", line 1298, characters 8-65
Called from file "grammar.ml", line 24008, characters 15-27
Called from file "grammar.ml", line 22305, characters 22-49
Called from file "seqaml/codegen.ml", line 17, characters 14-56
Called from file "runner.ml", line 15, characters 12-77
Called from file "runner.ml", line 32, characters 4-28
Called from file "runner.ml", line 61, characters 19-43

install.sh script doesn't install correctly (libomp.so* missing)

I am trying to download/install the prebuilt binaries for seq, but the install.sh script does not seem to do the job. I am running the command wget -O - https://raw.githubusercontent.com/seq-lang/seq/master/install.sh | bash as found in the install documentation.

On CentOS 7.6.180:

$> wget -O - https://raw.githubusercontent.com/seq-lang/seq/master/install.sh | bash

--2020-01-10 11:30:54--  https://github.com/seq-lang/seq/releases/latest/download/seq-linux-x86_64.tar.gz
Resolving github.com (github.com)... 140.82.114.4
Connecting to github.com (github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/seq-lang/seq/releases/download/v0.9.2/seq-linux-x86_64.tar.gz [following]
--2020-01-10 11:30:55--  https://github.com/seq-lang/seq/releases/download/v0.9.2/seq-linux-x86_64.tar.gz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/118039967/d6bed000-2060-11ea-8c79-7fd1a81d11ec?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200110%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200110T163055Z&X-Amz-Expires=300&X-Amz-Signature=0c835357968e918d639ad5887954c1cb7d52d8bd1276464bcf01576d622c8427&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dseq-linux-x86_64.tar.gz&response-content-type=application%2Foctet-stream [following]
--2020-01-10 11:30:55--  https://github-production-release-asset-2e65be.s3.amazonaws.com/118039967/d6bed000-2060-11ea-8c79-7fd1a81d11ec?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200110%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200110T163055Z&X-Amz-Expires=300&X-Amz-Signature=0c835357968e918d639ad5887954c1cb7d52d8bd1276464bcf01576d622c8427&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dseq-linux-x86_64.tar.gz&response-content-type=application%2Foctet-stream
Resolving github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)... 52.216.141.164
Connecting to github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)|52.216.141.164|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 44849156 (43M) [application/octet-stream]
Saving to: 'STDOUT'

100%[======================================================================================================================================================================>] 44,849,156  12.3MB/s   in 3.7s   

2020-01-10 11:30:59 (11.7 MB/s) - written to stdout [44849156/44849156]

--2020-01-10 11:30:59--  https://github.com/seq-lang/seq/releases/latest/download/seq-stdlib.tar.gz
Resolving github.com (github.com)... 140.82.113.3
Connecting to github.com (github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://github.com/seq-lang/seq/releases/download/v0.9.2/seq-stdlib.tar.gz [following]
--2020-01-10 11:30:59--  https://github.com/seq-lang/seq/releases/download/v0.9.2/seq-stdlib.tar.gz
Reusing existing connection to github.com:443.
HTTP request sent, awaiting response... 302 Found
Location: https://github-production-release-asset-2e65be.s3.amazonaws.com/118039967/d7effd00-2060-11ea-9a3e-34d1163a9de2?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200110%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200110T163059Z&X-Amz-Expires=300&X-Amz-Signature=817015814cf6b79955ad0caccebbe1c8ec74a32b29f2f32fadec17bdb7fa38da&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dseq-stdlib.tar.gz&response-content-type=application%2Foctet-stream [following]
--2020-01-10 11:30:59--  https://github-production-release-asset-2e65be.s3.amazonaws.com/118039967/d7effd00-2060-11ea-9a3e-34d1163a9de2?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIWNJYAX4CSVEH53A%2F20200110%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20200110T163059Z&X-Amz-Expires=300&X-Amz-Signature=817015814cf6b79955ad0caccebbe1c8ec74a32b29f2f32fadec17bdb7fa38da&X-Amz-SignedHeaders=host&actor_id=0&response-content-disposition=attachment%3B%20filename%3Dseq-stdlib.tar.gz&response-content-type=application%2Foctet-stream
Resolving github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)... 52.216.141.204
Connecting to github-production-release-asset-2e65be.s3.amazonaws.com (github-production-release-asset-2e65be.s3.amazonaws.com)|52.216.141.204|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 59956 (59K) [application/octet-stream]
Saving to: 'STDOUT'

100%[======================================================================================================================================================================>] 59,956      --.-K/s   in 0.03s   

2020-01-10 11:30:59 (1.65 MB/s) - written to stdout [59956/59956]


Seq installed at: /workdir/err87/.seq
Make sure to add the following lines to ~/.bash_profile:
  export PATH="/workdir/err87/.seq:$PATH"
  export SEQ_PATH="/workdir/err87/.seq/stdlib"
  export LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/workdir/err87/.seq"

After adding the last three lines above to ~/.bash_profile and starting a new shell:

$> seqc -h
seqc: error while loading shared libraries: libomp.so.5: cannot open shared object file: No such file or directory

I can see that the symlink at /workdir/err87/.seq/libomp.so.5 is broken, as there is nothing in /usr/lib64 matching /usr/lib64/libomp.so*, so I'm guessing this is just a missing dependency. However, I don't have root privileges on this server (university HPC), so I am not able to manually install the library.

I tried the same install command on two other machines which I DO have root privileges on, one running macOS and the other Ubuntu, but encountered similar problems.

On macOS (10.14.6)

The output of the install command is the same, except that LD_LIBRARY_PATH is replaced by DYLD_LIBRARY_PATH.

$> seqc -h
dyld: Library not loaded: /usr/local/opt/libomp/lib/libomp.dylib
  Referenced from: /Users/err87-admin/.seq/seqc
  Reason: image not found
Abort trap: 6

On Ubuntu (18.04.3):

The install command also complains:

find: ‘/usr/lib64/’: No such file or directory
find: ‘/usr/lib64/’: No such file or directory

But seqc returns the same error as on CentOS:

$> seqc -h
seqc: error while loading shared libraries: libomp.so.5: cannot open shared object file: No such file or directory

I'm guessing the fix is to install libomp, but the documentation doesn't mention this as necessary when using the pre-built binaries.

Looking forward to trying out the language!

Sample programs and benchmarks

Ultimately we have to start writing real programs and benchmarking them, both in terms of runtime/memory and code usage.

Potential applications:

  • EMA, since we are familiar with the codebase
  • Some aligner (e.g. BWA, SNAP)
  • Some assembler (e.g. SGA)

We don't necessarily need to reimplement the entire application, just some key kernels of it. Using Seq should lead to some performance boost in each of these cases.

Another interesting avenue (perhaps down the road) is to look at applications using GPUs. LLVM supports an NVPTX back-end that makes targeting the GPU relatively easy. Potential applications to look at would be e.g. MEGAHIT or BarraCUDA.

Stack overflow on recursive imports

A file test.seq containing

import test

will fail to parse with a stack overflow OCaml exception. We should catch these cases and handle them gracefully.
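
One common way to handle this is to track the modules currently being loaded and report a cycle as soon as one is re-entered. The Python sketch below shows the idea in isolation. The imports_of mapping, resolve_imports and CircularImportError are hypothetical names; none of this reflects the compiler's actual data structures.

```python
# Illustrative sketch: all names here are hypothetical, not compiler APIs.
class CircularImportError(Exception):
    pass

def resolve_imports(module, imports_of, in_progress=None, done=None):
    """Return modules in load order, raising on a circular import."""
    if in_progress is None:
        in_progress = []
    if done is None:
        done = []
    if module in done:
        return done  # already fully loaded
    if module in in_progress:
        # Re-entered a module that is still loading: report the cycle
        # instead of recursing until the stack overflows.
        cycle = in_progress[in_progress.index(module):] + [module]
        raise CircularImportError(" -> ".join(cycle))
    in_progress.append(module)
    for dep in imports_of.get(module, []):
        resolve_imports(dep, imports_of, in_progress, done)
    in_progress.pop()
    done.append(module)
    return done
```

With imports_of = {"test": ["test"]}, calling resolve_imports("test", imports_of) raises CircularImportError("test -> test") rather than overflowing the stack.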

Hard generics cases

These are generics corner-cases that seem to be hard to deal with, and currently break the generics system. Will update this as I find more bad cases.

1. A lot of nested generics

EDIT: This is fixed in the latest commit.

This is a case where we end up with T -> S -> Float and Q -> Int. This breaks somewhere along the line, and we end up with the error "generic type 'S' not yet realized". Have not been able to simplify this case any further while preserving the error.

class B[T](b of T):
  def bar[Q](self of B[T]):
    pass

def foo[S](s of S):
  b = B[S](s)
  b.bar[Int]()

foo[Float](3.14)

Incremental Menhir parsing

We should provide a .messages file to Menhir or switch to the incremental API to produce more informative error messages during parsing (currently any malformed input yields just "Menhir error").

Documentation

Fix documentation:

  • Seq compiler
  • LLVM logic
  • OCaml parser
  • Seq stdlib

Type specialization and generic magics

Support

    def __mul__(x: Secure[`t], y): # Errors as y is generic
        """
        Protocol 5: MultiplyPublic (p.8)
        """
        return Secure[`t](x.sh * y)
    def __mul__(a: Secure[`t], b: Secure[`t]): # Only call this if b is Secure
        """
        Protocol 9: EvaluatePolynomial (p.12)
        restricted to f := xy
        """
        ar, am = __mpc_env__.beaver_partition(a.sh)
        br, bm = __mpc_env__.beaver_partition(b.sh)
        c = __mpc_env__.beaver_mult(ar, am, br, bm)
        return Secure[`t](__mpc_env__.beaver_reconstruct(c))

Outstanding issues/features

  • Fix __argv__ (i.e. link that symbol to module->getArgVar(); accessible with get_module_arg in ocaml.cpp)

  • Multiple for-loop iteration variables: for a,b in c: ... (same for comprehensions). Can be converted in parser to:

    for _x in c:
        a = _x[0]
        b = _x[1]
        ...
    
  • Explicitly realizing generic methods (on master these should be MethodExpr objects -- similar to GetElemExpr but with type parameters; see method_expr in ocaml.cpp) :

    class A():
        def foo[`s](s: `s):
            pass
    A.foo[int](10)  # error: cannot find method '__getitem__' for type 'function[void,`s]' with specified argument types (int)
    A().foo[int](10)  # similar error
    
  • global variables (should be as simple as calling ->setGlobal() on corresponding Var*)

  • assert (not on master?)

  • Generators (done on LLVM side; see gen_expr in ocaml.cpp)

  • Lambdas (I suggest we disallow outer variable references in lambdas for now, as they just complicate things)

  • Exceptions (need LLVM-side support)

  • realize_type and realize_func throw uncaught exceptions (low priority)

  • Allow referencing generics in return type of a function. Example:

    def none[`t]() -> `t:  # right now the `t return type fails -- symbol not found
        return None
    
  • Allow implicit generator argument. Example:

    print sum(i*i for i in range(10))    # fails: parsing error
    print sum((i*i for i in range(10)))  # works
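
The parser-level rewrite for multiple for-loop iteration variables, listed above, can be checked in plain Python. This is a sketch only: with_unpacking, rewritten and pairs are names introduced for illustration.

```python
# Illustrative sketch: function and variable names are hypothetical.
pairs = [(1, "x"), (2, "y"), (3, "z")]

def with_unpacking(c):
    out = []
    for a, b in c:          # source form
        out.append((b, a))
    return out

def rewritten(c):
    out = []
    for _x in c:            # form after the proposed rewrite
        a = _x[0]
        b = _x[1]
        out.append((b, a))
    return out
```

Both functions return the same list for indexable items. One difference is that Python's own unpacking accepts any iterable element, while the index-based rewrite requires each element to support __getitem__.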
    

C/Python/R interoperability

We require C interoperability, which should be easy to implement using LLVM. Python and R interoperability will likely be harder to implement, but are also essential.

One possibility for language-level syntax:

extern c  fun my_c_function(n: Int, s: Str) -> Int   # defines external C function
extern py fun my_py_function(n: Int, s: Str) -> Int  # defines external Python function
extern r  fun my_r_function(n: Int, s: Str) -> Int   # defines external R function

Automatic class member deduction

Two approaches:

  1. Use parsing to deduce members (harder; many corner cases)
  2. Use self.foo as in Python and then deduce types/members during the compile stage (might be easier?)

Idea: try (2) and see whether it works.

Automatic member deduction does not work with global variables

This fails miserably:

N = 5
class Generator:
    def __init__(self: Generator):
        self.state = array[u64](N)
        self.state_size = 0
        self.initf = 0
        self.next = 0
g = Generator()

Error:

Assertion failed: (type), function getType, file /Users/inumanag/Desktop/Projekti/seq/test/compiler/lang/var.cpp, line 159.

Function called with different types

Shouldn't this code work? I thought that if a function was called with different arg types, the compiler would generate two different functions, one for each arg type. Instead, it looks like the compiler is complaining because it is inferring the type of x to be both int and str.

[root@hbseq ~]# cat argtype2.py
def fn(x):
    if x == 'abc':
        print 'str'
    elif x == 1:
        print 'int'
    else:
        print 'other'

fn('abc')
fn(1)
fn([1,2])
[root@hbseq ~]# python2 argtype2.py
str
int
other
[root@hbseq ~]# ./s argtype2
+ seqc -o argtype2.bc argtype2.seq
seqc: /lib64/libtinfo.so.5: no version information available (required by /root/.seq/libseq.so)
argtype2.py:4:9: error: unsupported operand type(s) for ==: 'str' and 'int'
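
One way to keep every instantiation type-correct is to split the body by argument type, so that no version ever compares a str with an int. The sketch below illustrates this with Python's functools.singledispatch. It is not Seq syntax, and whether Seq offers an equivalent mechanism is not stated in this issue.

```python
from functools import singledispatch

# Illustrative sketch in plain Python, not Seq.
@singledispatch
def fn(x):
    # Fallback for any type without a dedicated version.
    return "other"

@fn.register
def _(x: str):
    # Only str-to-str comparisons appear in this version.
    return "str" if x == "abc" else "other"

@fn.register
def _(x: int):
    # Only int-to-int comparisons appear in this version.
    return "int" if x == 1 else "other"
```

Here fn('abc'), fn(1) and fn([1,2]) select the str, int and fallback versions respectively, mirroring the three calls in the report.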

Case study: HapTreeX

What needs to be done to run HapTreeX with Seq

Compiler tasks:

  • Comprehensions
  • Default args and named arguments
  • try/catch
  • lambdas
  • globals
  • Libraries
    • File open/close (readlines, write)
    • map
    • copy.copy
    • dict.has_key
    • sorted, reversed
    • abs
    • string.split
    • math.log, math.erf, math.sqrt, math.factorial, math.exp
    • random.randint, random.choice
    • itertools.combinations
    • time.time
    • os.path
  • auto class member detection

Manual interventions:

  • type ambiguities ({}, set() etc)
  • argparse

Parallelism & parallelism syntax

Parallelism can perhaps be tied to branching. One possible syntax is to have a special parallel pipe operator like |>(42) which would use 42 threads to process the previous stage's output.
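
To make the idea concrete, here is a rough Python sketch of a pipeline stage that processes the previous stage's output with a fixed number of threads. The names parallel_stage and pipeline are hypothetical, and the sketch only shows the shape of the construct. In CPython, threads do not speed up CPU-bound work because of the GIL.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch: parallel_stage and pipeline are hypothetical names.
def parallel_stage(func, num_threads):
    """Build a stage that applies `func` to upstream items on a thread pool."""
    def stage(items):
        with ThreadPoolExecutor(max_workers=num_threads) as pool:
            # pool.map yields results in input order.
            yield from pool.map(func, items)
    return stage

def pipeline(source, *stages):
    """Feed `source` through each stage in turn."""
    for stage in stages:
        source = stage(source)
    return source

def square(x):
    return x * x
```

A call such as list(pipeline(range(5), parallel_stage(square, 42))) plays the role of range(5) |>(42) square in the syntax suggested above.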

Bioinformatics features

Possible bioinformatics features:

  • Sequence pattern matching
    • Matching against an index
  • Convenient sequence pipelining (splitting, substring, etc.)
  • Easy + fast I/O
  • Index abstraction (e.g. with caching optimization, frequent-lookups optimization)
  • Load-balancing parallelism (esp. in the genomics context, where sequence processing time can vary by orders of magnitude)
  • Sequence representation optimizations (e.g. in some contexts sequences can be represented with ASCII strings, but sometimes using a 2-bit encoding may be superior). Another idea is to use LLVM vectors to represent fixed-length sequences, so that a k-mer would be represented by a <k x i2> or something similar.
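
As an illustration of the 2-bit encoding idea in the last item, the Python sketch below packs a k-mer into an integer and computes its reverse complement on the packed form. The mapping A=0, C=1, G=2, T=3 and all function names are assumptions made for this example, not a description of Seq's internal representation.

```python
# Illustrative sketch: the encoding and all names are assumptions.
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}
DECODE = "ACGT"

def encode_kmer(s: str) -> int:
    """Pack a DNA string into an integer, 2 bits per base, first base highest."""
    value = 0
    for base in s:
        value = (value << 2) | ENCODE[base]
    return value

def decode_kmer(value: int, k: int) -> str:
    """Unpack a 2-bit encoded k-mer of length k back into a string."""
    bases = []
    for _ in range(k):
        bases.append(DECODE[value & 3])
        value >>= 2
    return "".join(reversed(bases))

def revcomp_kmer(value: int, k: int) -> int:
    """Reverse complement on the packed form."""
    # With this encoding, the complement of base code b is 3 - b.
    result = 0
    for _ in range(k):
        result = (result << 2) | (3 - (value & 3))
        value >>= 2
    return result
```

For example, encode_kmer("ACGT") is 27, and decode_kmer(revcomp_kmer(encode_kmer("AACG"), 4), 4) gives "CGTT".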
