License: GPL v3

prokgc

This program computes the GC content of a genome and of its coding sequences (CDS), each provided as a multi-FASTA file.

prokgc \
    --cds examples/example_cds.fasta \
    --genome examples/example_genome.fasta \
    --header \
    --taxid 62977

produces the following output:

taxid,n_chr,genome_len,genome_gc,genome_n,n_seq,n_good_seq,cds_cumlen,gc_cds,gc1,gc2,gc3
62977,1,4809037,2505158,0,4600,4473,4073316,2167988,804625,559570,803793

which can be piped to the wonderful csvtk pretty command to get:

taxid   n_chr   genome_len   genome_gc   genome_n   n_seq   n_good_seq   cds_cumlen   gc_cds    gc1      gc2      gc3
62977   1       4809037      2505158     0          4600    4473         4073316      2167988   804625   559570   803793
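The columns are raw base counts, not percentages. As a minimal sketch, the GC fractions can be derived from the CSV row shown above. The function name `gc_fractions` and the choice to divide by the full `genome_len` (including any N bases) are assumptions made for illustration, not part of prokgc.

```python
import csv
import io

# CSV output copied from the example run above.
raw = """taxid,n_chr,genome_len,genome_gc,genome_n,n_seq,n_good_seq,cds_cumlen,gc_cds,gc1,gc2,gc3
62977,1,4809037,2505158,0,4600,4473,4073316,2167988,804625,559570,803793
"""


def gc_fractions(csv_text):
    """Return the genome and CDS GC fractions for each row of prokgc output.

    Assumption: fractions are taken over the total length reported by
    prokgc (genome_len and cds_cumlen), without excluding N bases.
    """
    results = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        results.append({
            "taxid": row["taxid"],
            "genome_gc_frac": int(row["genome_gc"]) / int(row["genome_len"]),
            "cds_gc_frac": int(row["gc_cds"]) / int(row["cds_cumlen"]),
        })
    return results


for r in gc_fractions(raw):
    print(f'{r["taxid"]}: genome {r["genome_gc_frac"]:.4f}, CDS {r["cds_gc_frac"]:.4f}')
```

The same approach works on output produced without `--header` if the field names are passed to `csv.DictReader` explicitly.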

Synopsis

This program computes the number of GC bases in a multi-FASTA file containing the whole genome sequence, and the number of GC bases in a multi-FASTA file containing only the coding sequences. Each output row contains the following fields:

  • taxid: the taxid provided via --taxid
  • n_chr: the number of chromosomes in the genome file
  • genome_len: cumulated chromosome length
  • genome_gc: number of GC bases in the genome file
  • genome_n: number of N bases in the genome file
  • n_seq: number of sequences in the CDS file
  • n_good_seq: number of sequences whose length is a multiple of 3 and that have a canonical stop codon
  • cds_cumlen: cumulated CDS length
  • gc_cds: number of GC bases in the CDS file
  • gcX: number of GC bases at codon position X
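To make the `n_good_seq` and `gcX` definitions concrete, here is a minimal Python sketch of the counting logic. It is not the program's actual C code. It assumes that the canonical stop codons are TAA, TAG and TGA, that the stop codon sits at the end of the sequence, and that it is included in the per-position counts; the names `is_good_cds` and `gc_by_codon_position` are hypothetical.

```python
# Assumption: these three codons are what "canonical stop codon" refers to.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_good_cds(seq):
    """True if the length is a non-zero multiple of 3 and the last codon is a stop."""
    seq = seq.upper()
    return len(seq) > 0 and len(seq) % 3 == 0 and seq[-3:] in STOP_CODONS


def gc_by_codon_position(seq):
    """Count G or C bases at codon positions 1, 2 and 3 (returned in that order)."""
    counts = [0, 0, 0]
    for i, base in enumerate(seq.upper()):
        if base in ("G", "C"):
            counts[i % 3] += 1  # index 0 is position 1, index 1 is position 2, ...
    return tuple(counts)


cds = "ATGGCCTAA"  # three codons: ATG GCC TAA
print(is_good_cds(cds))           # True
print(gc_by_codon_position(cds))  # (1, 1, 2)
```

Summing the three values over all retained sequences would give gc1, gc2 and gc3 under these assumptions.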

This program uses the kseq.h header from the klib library and is really fast:

time (prokgc -c example_cds.fasta -g example_genome.fasta -t 1234 --header)
# => 0.05s user 0.01s system 81% cpu 0.083 total

This speed comes from being lenient about the input: prokgc does not strictly check that the files conform to the FASTA format, as BioPerl or Biopython can.
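As an illustration of what lenient parsing means in practice, the sketch below reads FASTA records without validating the alphabet, the line width or the header format. It is a Python approximation written for this README, not a port of kseq.h, and `read_fasta` is a hypothetical name.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of text lines.

    Lenient by design: any non-header line is appended to the current
    record as-is, and nothing is checked against the FASTA alphabet.
    """
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines are skipped silently
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            name, chunks = line[1:].strip(), []
        elif name is not None:
            chunks.append(line)  # text before the first header is ignored
    if name is not None:
        yield name, "".join(chunks)


records = list(read_fasta([">chr1 example", "ACGT", "NNAC", "", ">chr2", "GG"]))
print(records)  # [('chr1 example', 'ACGTNNAC'), ('chr2', 'GG')]
```

The trade-off is that malformed input is counted rather than rejected, so files should be validated separately if their origin is uncertain.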

It also works on gzip-compressed multi-FASTA files without prior decompression.
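For scripts that post-process the same input files, transparent handling of compressed and uncompressed input can be sketched in Python as follows. This only illustrates the idea and says nothing about how prokgc implements it; `open_text` is a hypothetical helper that detects gzip data by its two-byte magic number.

```python
import gzip


def open_text(path):
    """Open a plain or gzip-compressed file for reading as text."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip magic number
        return gzip.open(path, "rt")
    return open(path, "r")
```

Checking the content rather than the file extension means that a compressed file without a `.gz` suffix is still read correctly.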

Help

Run

prokgc --help

to get:

Usage:
  prokgc [OPTION?] - computes gc contents

Synopsis:
  Computes number of GC bases in the genome and coding sequences in three
  codons positions.

Help Options:
  -h, --help       Show help options

Application Options:
  -H, --header     Print csv header
  -c, --cds        CDS sequence file
  -g, --genome     genome sequence file
  -t, --taxid      Taxon ID of corresponding organism

Description:
  This script computes cds_gc contents in a multi-fasta file using kseq.h
  included in the htseq library. It prints out a summary of the number of
  chromosomes, the genome length, the number of GC and N bases, the number
  of sequences in the CDS file, the number of sequence with canonical stop
  codon, the cumulated length of coding sequences, the number of GC bases
  in CDS, and the number of GC at positions 1, 2 and 3 of codons.

  Return a csv file to STDOUT without header by default, with a header if
  --header is specified.

Install

Download

Download a copy of the release here and install using

tar xvzf prokgc-0.0.1.tar.gz

cd prokgc-0.0.1
./configure
make
make install

If the last step fails, try running it with sudo.

Prerequisites

This software requires glib-2.0 >= 2.50. macOS users can install a recent glib version with Homebrew:

brew install glib

Linux users either already have a recent glib version or can follow the instructions here. glib is made for portability and should be easy to install on most platforms.

Disclaimer

The kseq.h header is part of the klib library, which is distributed under the MIT/X11 licence, and was not written by me. The MIT licence is permissive and GPLv3-compatible.

This software is distributed without any warranty under the terms of the GNU General Public License 3.0, as mentioned in the enclosed LICENCE file.
