gsort

Binaries Available Here

gsort is a tool to sort genomic files according to a genomefile.

For example, you may for some reason want to sort your VCF in the order X, Y, 2, 1, 3, ... while keeping the header at the top.

As a more likely example, you may want to sort your file to match GATK order (1 ... X, Y, MT), which general-purpose sorting tools do not readily support. With gsort, you simply place MT as the last chrom in the .genome file.

Given a genome file (lines of chrom\tlength), you can sort a BED/VCF/GTF/... file in the order dictated by that file with:

gsort --memory 1500 my.vcf.gz crazy.genome | bgzip -c > my.crazy-order.vcf.gz

Here, memory use is limited to 1500 megabytes.

We will use this to enforce chromosome ordering in ggd.

It will also be useful for getting your files ready for use in bedtools.

GFF parent

In GFF, the Parent attribute may refer to a row that would otherwise be sorted after it (based on the end position). However, some programs require that the row referenced in a Parent attribute be sorted first. If this is required, use the --parent flag introduced in version 0.0.6.

Performance

gsort can sort the 2 million variants in ESP in 15 seconds. It takes a few minutes to sort the ~10 million ExAC variants because of the very large INFO strings in that file.

Usage

gsort will exit with an error if your genome file uses the 'chr' prefix and your file does not (or vice versa).
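How gsort detects this mismatch is not described here. As a minimal sketch of the idea, assuming the check compares one chromosome name from each file, a hypothetical helper (prefixMismatch, not part of gsort) could look like this:

```go
package main

import (
	"fmt"
	"strings"
)

// prefixMismatch (hypothetical) reports whether exactly one of the two
// chromosome names carries a "chr" prefix.
func prefixMismatch(genomeChrom, dataChrom string) bool {
	return strings.HasPrefix(genomeChrom, "chr") != strings.HasPrefix(dataChrom, "chr")
}

func main() {
	fmt.Println(prefixMismatch("chr1", "1"))    // true: only the genome file uses "chr"
	fmt.Println(prefixMismatch("chr1", "chrX")) // false: both use the prefix
	fmt.Println(prefixMismatch("1", "MT"))      // false: neither uses the prefix
}
```

Running a check like this on your own files before sorting can save a failed run on a large input.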

It will write temporary files to your $TMPDIR (usually /tmp/) as needed to avoid using too much memory.

TODO

  • Specify a VCF for the genome file and pull order from the @SQ tags
  • Avoid temp file when everything can fit in memory. (more universally, last chunk can always be kept in memory).

API Documentation

-- import "github.com/brentp/gsort"

Package gsort is a library for sorting a stream of tab-delimited lines ([]byte) from a reader using the amount of memory requested.

Instead of using a compare function as most sorts do, this accepts a user-defined function with signature: func(line []byte) []int where the returned ints are used to determine ordering. For example, if we were sorting on 2 columns, one of months and another of day of the month, the function would replace "Jan" with 1 and "Feb" with 2 for the first column and just return the Atoi of the 2nd column.
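The month/day example above could be sketched as follows. The Processor type is redeclared locally so the block is self-contained. The name monthDay, the months table, and the choice to map unknown months and unparsable days to 0 are assumptions made for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

// Processor mirrors the signature described above; it is redeclared here
// only so that this example compiles without importing gsort.
type Processor func(line []byte) []int

// months maps three-letter month names to their number (lookup table for this example).
var months = map[string]int{
	"Jan": 1, "Feb": 2, "Mar": 3, "Apr": 4, "May": 5, "Jun": 6,
	"Jul": 7, "Aug": 8, "Sep": 9, "Oct": 10, "Nov": 11, "Dec": 12,
}

// monthDay turns a line of the form "Mon\tDay\t..." into []int{month, day}.
// Assumption: unknown months and unparsable days become 0, so they sort first.
func monthDay(line []byte) []int {
	cols := bytes.SplitN(bytes.TrimRight(line, "\r\n"), []byte("\t"), 3)
	if len(cols) < 2 {
		return []int{0, 0}
	}
	day, err := strconv.Atoi(string(cols[1]))
	if err != nil {
		day = 0
	}
	return []int{months[string(cols[0])], day}
}

// Compile-time check that monthDay has the Processor signature.
var _ Processor = monthDay

func main() {
	fmt.Println(monthDay([]byte("Feb\t14\tsome note"))) // prints [2 14]
}
```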

func Sort

func Sort(rdr io.Reader, wtr io.Writer, preprocess Processor, memMB int) error

Sort accepts a tab-delimited io.Reader and writes to wtr, using preprocess to determine ordering.

type Processor

type Processor func(line []byte) []int

Processor is a function that takes a line and returns a slice of ints that determine ordering.
