
irelate's Introduction

irelate.go

Streaming relation (overlap, distance, KNN) testing of (any number of) sorted files of intervals.

Currently supports BED, BAM, GFF, VCF.

GoDoc: https://godoc.org/github.com/brentp/irelate

Motivation

We want to relate (e.g. by intersection or by distance) sets of intervals. For example, we may want to report the nearest gene to each of a set of ChIP-Seq peaks. BEDTools does this extremely well; irelate is an attempt to provide an API so that users can write their own tools with little effort in Go.

Design

  • Data sources must support the Relatable interface (we provide parsers for common formats).
  • A user-defined function returns true if two Relatables are related. Only a small number of interval pairs are sent to be tested; IRelate handles this automatically. We provide CheckRelatedByOverlap to perform overlap testing.
  • i.Related() gives access to all of the related intervals (after they are added internally by IRelate).
  • The "API" is a for loop.
  • A parallel chrom-sweep algorithm is used that avoids problems with chromosome order and parallelizes nicely up to about a dozen CPUs.

Example

(also see main/main.go which is similar to bedtools intersect -sorted -sortout -c)

Print the number of records in b that overlap each interval in a:

// CheckRelatedByOverlap returns true if Relatables overlap.
func CheckRelatedByOverlap(a Relatable, b Relatable) bool {
        // note: with distance == 0 this is just an overlap test.
        return (b.Start() < a.End()) && (b.Chrom() == a.Chrom())
}

// determine ordering of Relatables.
func Less(a Relatable, b Relatable) bool {
    if a.Chrom() != b.Chrom() {
        return a.Chrom() < b.Chrom()
    }
    return a.Start() < b.Start() // || (a.Start() == b.Start() && a.End() < b.End())
}



// a and b are channels that send Relatables.
// Errors are ignored here for brevity; check them in real code.
a, _ := bix.New("intervals.bed.gz")
b, _ := bix.New("some.vcf.gz")
for interval := range IRelate(CheckRelatedByOverlap, 0, Less, a, b) {
    fmt.Printf("%s\t%d\t%d\t%d\n", interval.Chrom(), interval.Start(), interval.End(), len(interval.Related()))
}

The 2nd argument determines the query set of intervals. Here it is 0, so only intervals from a (the 0th source) will be sent from IRelate. If it is set to -1, then all intervals from all sources will be sent. After the Less function, any number of interval streams can be passed to IRelate.
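To make the role of the query argument concrete, here is a minimal single-threaded sketch of a sweep over sorted sources. It is not irelate's implementation: the names `iv`, `less`, `overlaps`, and `sweep` are hypothetical, the sources are plain slices rather than streams, and only intervals from different sources are related to each other.

```go
package main

import (
	"fmt"
	"sort"
)

// iv is a hypothetical interval type used only in this sketch.
type iv struct {
	chrom      string
	start, end uint32
	source     int
	related    []*iv
}

// less orders by chromosome, then by start.
func less(a, b *iv) bool {
	if a.chrom != b.chrom {
		return a.chrom < b.chrom
	}
	return a.start < b.start
}

// overlaps assumes b does not sort before a.
func overlaps(a, b *iv) bool {
	return a.chrom == b.chrom && b.start < a.end
}

// sweep relates intervals across sources and returns those from the query
// source, or from every source when query == -1.
func sweep(check func(a, b *iv) bool, query int, sources ...[]*iv) []*iv {
	var all []*iv
	for i, src := range sources {
		for _, v := range src {
			v.source = i // tag each interval with the index of its source
			all = append(all, v)
		}
	}
	sort.SliceStable(all, func(i, j int) bool { return less(all[i], all[j]) })

	var cache, out []*iv
	for _, cur := range all {
		kept := cache[:0] // filter the cache in place
		for _, c := range cache {
			if !check(c, cur) {
				continue // later intervals start even further right: drop c
			}
			kept = append(kept, c)
			if c.source != cur.source {
				c.related = append(c.related, cur)
				cur.related = append(cur.related, c)
			}
		}
		cache = append(kept, cur)
		if query == -1 || cur.source == query {
			out = append(out, cur)
		}
	}
	return out
}

func main() {
	a := []*iv{{chrom: "chr1", start: 100, end: 200}, {chrom: "chr1", start: 500, end: 600}}
	b := []*iv{
		{chrom: "chr1", start: 150, end: 160},
		{chrom: "chr1", start: 190, end: 510},
		{chrom: "chr2", start: 100, end: 200},
	}
	for _, q := range sweep(overlaps, 0, a, b) {
		fmt.Printf("%s\t%d\t%d\t%d\n", q.chrom, q.start, q.end, len(q.related))
	}
	// Prints:
	// chr1	100	200	2
	// chr1	500	600	1
}
```

With query set to 0, only the two intervals from a come back, each carrying its overlapping records from b. Dropping a cached interval as soon as the check fails is only valid for checks, such as overlap or a fixed distance, that stay false for every later interval.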

If we only want to count variants with a score above a given threshold (20 here), the loop becomes:

for interval := range IRelate(CheckRelatedByOverlap, 0, Less, a, b) {
    n := 0
    for _, r := range interval.Related() {
         // type-assert to *Variant to get the score.
         if int(r.(*Variant).Score()) > 20 {
             n++
         }
    }
    fmt.Printf("%s\t%d\t%d\t%d\n", interval.Chrom(), interval.Start(), interval.End(), n)
}

Note that any number of interval sources is supported, even though the example uses 2. We can see the source of each interval with interval.Source(); that value is set automatically inside irelate.

This is a very simple example, but because the interface is just a function (as in CheckRelatedByOverlap) and a for loop, it is easy to create custom applications.

For example, here is the function to relate all intervals within 2KB:

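// CheckRelatedBy2KB returns true if intervals are within 2KB.
func CheckRelatedBy2KB(a Relatable, b Relatable) bool {
        distance := uint32(2000)
        // Add the distance to a.End() rather than subtracting it from b.Start():
        // unsigned subtraction wraps around when b.Start() < distance.
        // With distance == 0 this is just an overlap test.
        return (b.Start() < a.End()+distance) && (b.Chrom() == a.Chrom())
}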

Note that we are guaranteed that b.Start() >= a.Start() so the check is quite simple.
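One subtlety with unsigned coordinates: subtracting a distance from a start position smaller than that distance wraps around in uint32 arithmetic. The sketch below, using a hypothetical `span` type and `withinDistance` helper, adds the distance to the end coordinate instead and shows the wraparound directly.

```go
package main

import "fmt"

// span is a hypothetical minimal interval type for this sketch.
type span struct {
	chrom      string
	start, end uint32
}

// withinDistance reports whether b starts fewer than distance bases past the
// end of a on the same chromosome. It assumes b.start >= a.start and that
// a.end+distance fits in a uint32.
func withinDistance(a, b span, distance uint32) bool {
	return a.chrom == b.chrom && b.start < a.end+distance
}

func main() {
	a := span{"chr1", 100, 200}

	fmt.Println(withinDistance(a, span{"chr1", 150, 300}, 0))      // true: plain overlap
	fmt.Println(withinDistance(a, span{"chr1", 2100, 2200}, 2000)) // true: 2100 < 2200
	fmt.Println(withinDistance(a, span{"chr1", 2200, 2300}, 2000)) // false: 2200 is not < 2200
	fmt.Println(withinDistance(a, span{"chr2", 150, 300}, 2000))   // false: different chromosome

	// Why not subtract: 150 - 2000 wraps to a very large uint32.
	var start uint32 = 150
	fmt.Println(start-2000 > start) // true
}
```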

Relatable

A key interface in irelate is:

// Relatable provides all the methods for irelate to function.
// See Interval in interval.go for a class that satisfies this interface.
// Related() likely returns and AddRelated() likely appends to a slice of
// relatables. Note that for performance reasons, Relatable should be implemented
// as a pointer to your data-structure (see Interval).
type Relatable interface {
        Chrom() string
        Start() uint32
        End() uint32
        Related() []Relatable // A slice of related Relatable's filled by IRelate
        AddRelated(Relatable) // Adds to the slice of relatables
        SetSource() uint32    // Internally marks the source (file/stream) of the Relatable
}
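As an illustration, a custom type might satisfy an interface of this shape as sketched below. The `Peak` type is hypothetical, and the local interface splits the source accessor into `Source()` and `SetSource(uint32)`; those two signatures are assumptions made for this sketch, not a confirmed copy of irelate's API.

```go
package main

import "fmt"

// Relatable is a local stand-in for the interface above. Source() and
// SetSource(uint32) are assumed signatures for this sketch.
type Relatable interface {
	Chrom() string
	Start() uint32
	End() uint32
	Related() []Relatable
	AddRelated(Relatable)
	Source() uint32
	SetSource(uint32)
}

// Peak is a hypothetical interval type. Its methods use pointer receivers so
// that values passed around as Relatable are pointers, not copies.
type Peak struct {
	chrom      string
	start, end uint32
	source     uint32
	related    []Relatable
}

func (p *Peak) Chrom() string          { return p.chrom }
func (p *Peak) Start() uint32          { return p.start }
func (p *Peak) End() uint32            { return p.end }
func (p *Peak) Related() []Relatable   { return p.related }
func (p *Peak) AddRelated(r Relatable) { p.related = append(p.related, r) }
func (p *Peak) Source() uint32         { return p.source }
func (p *Peak) SetSource(s uint32)     { p.source = s }

// Compile-time check that *Peak satisfies Relatable.
var _ Relatable = (*Peak)(nil)

func main() {
	a := &Peak{chrom: "chr1", start: 100, end: 200}
	b := &Peak{chrom: "chr1", start: 150, end: 300}
	b.SetSource(1)
	a.AddRelated(b)
	fmt.Println(len(a.Related()), a.Related()[0].Source()) // 1 1
}
```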

Performance

irelate is quite fast, but use PIRelate for parallel intersection. It is less flexible than irelate, but skips parsing of database intervals for sparse regions in the query. In addition, it has very good (automatic) parallelization.

irelate's People

Contributors

arq5x, brentp


irelate's Issues

error parsing VCF query file

Hello Brent!

While working with PCGR we found the following issue when parsing this particular INFO field in a VCF file:

vcfanno.go:132: error parsing VCF query file /workdir/output/test.pcgr_ready.pcgr_vep.vcf.gz: INFO error: ##INFO=<ID=ACGTNacgtnMINUS,Number=10,Type=Integer,Description="The first five numbers correspond to the number of bases on forward reads found to be A, C, G, T, or N, while the last five numbers correspond to bases on reverse reads found to be a, c, g, t, or n on minus stranded PCR templates (only reads with a mapping quality greater or equal to 30, and bases with a base quality greater or equal to 13 were considered).">, []. [line: 132]

INFO error: ##INFO=<ID=ACGTNacgtnPLUS,Number=10,Type=Integer,Description="The first five numbers correspond to the number of bases on forward reads found to be A, C, G, T, or N, while the last five numbers correspond to bases on reverse reads found to be a, c, g, t, or n on plus stranded PCR templates (only reads with a mapping quality greater or equal to 30, and bases with a base quality greater or equal to 13 were considered).">, []. [line: 133]

Can you spot what's off? I'm not sure what's wrong with it other than perhaps multiline, excessive length, or...?

cc @ohofmann @chapmanb

Collection of related objects from a single source

Hello,

I would like to create a collection of related objects from a single source using IRelate.
A simple concrete example would be creating a collection of sets of features which have exactly same start and end from a single GFF file.
I imagine that in contrast with existing IRelate functionality this would require comparing relatables from the same source to each other.

Thanks,
Botond

Qual not right after query() method

func (b *BamQueryable) Query(region interfaces.IPosition) (interfaces.RelatableIterator, error) {

While this method is running, the first record from it.Next() is written by Read() to index 0 of the interfaces.Relatable channel. The second record is then written to index 1 and its Record.Qual is updated, but this unexpectedly also updates the Record.Qual value of the record at index 0.

As a result, the BAM record's name and position no longer correspond to its Qual.

Github Releases?

Hi,
This software is vendored in Debian, and it would be great to have release tags.
