arq5x / gemini

a lightweight db framework for exploring genetic variation.

Home Page: http://gemini.readthedocs.org

License: MIT License

Python 55.75% Shell 37.10% Perl 0.67% CSS 0.02% HTML 6.47%

gemini's Issues

error during annotate function

I tabixed a bed format file and tried to use the gemini annotate function and received the following error:

Traceback (most recent call last):
File "/usr/local/bin/gemini", line 5, in
pkg_resources.run_script('gemini==0.1.0', 'gemini')
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.35-py2.7.egg/pkg_resources.py", line 505, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.35-py2.7.egg/pkg_resources.py", line 1245, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/EGG-INFO/scripts/gemini", line 5, in
gemini.gemini_main.main()
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_main.py", line 545, in main
args.func(parser, args)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_annotate.py", line 124, in annotate
annotate_variants_bool(args, conn)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_annotate.py", line 67, in annotate_variants_bool
return _annotate_variants(args, conn, has_anno_hit)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_annotate.py", line 46, in _annotate_variants
get_val_fn(annotations_in_region(row, annos, "tuple", naming))))
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/annotations.py", line 224, in annotations_in_region
return _get_hits(coords, anno, parser_type)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/annotations.py", line 172, in _get_hits
hit_iter = annotation.fetch(chrom, start, end, parser=parser)
File "ctabix.pyx", line 214, in pysam.ctabix.Tabixfile.fetch (pysam/ctabix.c:3590)
File "ctabix.pyx", line 180, in pysam.ctabix.Tabixfile._parseRegion (pysam/ctabix.c:3157)
File "ctabix.pyx", line 52, in pysam.ctabix._force_bytes (pysam/ctabix.c:1752)
TypeError: Expected bytes, got unicode

Enforcing chromosome order for variants

Even though records are inserted into the variants table in reference order (viz., chrom, then start pos.), sort order is not preserved when records are selected back out of the table. This is expected, given the way RDBMS query engines strive to optimize queries so that data is retrieved as quickly as possible.

It seems that the best solution is to index that table based on chrom and start and then when sorted-order is necessary, auto-populate an ORDER BY chrom, start to the query.

This is most necessary for tools that will exploit the map functionality in bedtools for quickly measuring variant density, windowed HWE, windowed LOH, etc.
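The index-plus-ORDER BY idea above could look roughly like the following. This is a minimal sketch using Python's sqlite3 module and a toy variants table; the index name and the add_order_by helper are made up for illustration and are not part of gemini.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE variants (variant_id INTEGER PRIMARY KEY, chrom TEXT, start INTEGER)"
)
# Insert deliberately out of reference order.
conn.executemany(
    "INSERT INTO variants (chrom, start) VALUES (?, ?)",
    [("chr2", 500), ("chr1", 900), ("chr1", 100)],
)
# Index on (chrom, start) so the sort can be served from the index.
conn.execute(
    "CREATE INDEX IF NOT EXISTS var_chrom_start_idx ON variants (chrom, start)"
)


def add_order_by(query):
    """Append the reference-order clause unless the query already sorts (hypothetical helper)."""
    if "order by" in query.lower():
        return query
    return query.rstrip("; \n") + " ORDER BY chrom, start"


rows = conn.execute(add_order_by("SELECT chrom, start FROM variants")).fetchall()
```

Note that sorting on a TEXT chrom column is lexical, so chr10 would come before chr2; a true reference order would need a separate rank column or similar.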

VEP logic assumes Polyphen and SIFT have been annotated

Uma needs to document exactly how VEP and snpEff must be run.

For example, if -t VEP is used yet VEP was run without Polyphen options, line 58 in vep.py fails:

self.polyphen2 = self.polyphen_b[1].split(")")
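One way to make this parsing tolerant of missing predictions is sketched below. It assumes the field looks like label(score), e.g. probably_damaging(0.998), and that a missing annotation arrives as None or an empty string. parse_prediction is a hypothetical helper, not the code in vep.py.

```python
def parse_prediction(field):
    """Split 'label(score)' into (label, score).

    Returns (None, None) when the field is missing and (label, None)
    when only a label is present or the score is not numeric.
    """
    if not field:
        return None, None
    if "(" not in field:
        return field, None
    label, _, rest = field.partition("(")
    try:
        return label, float(rest.rstrip(")"))
    except ValueError:
        return label, None
```

With this, a VCF annotated without the Polyphen or SIFT options would simply yield empty columns instead of an IndexError.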

Use a "project" directory to store DB and VCF/BCF

Keeping the DB and the VCF together in a project directory will enable new tools and will allow us to eventually get back to the original VCF records.

write tool for compound heterozygotes

If genotypes are phased, it's trivial.

If genotypes are unphased, report candidates.
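A rough sketch of the phased/unphased distinction for one sample and two variants in the same gene. It assumes biallelic VCF-style genotype strings ("0|1" phased, "0/1" unphased) and that both phased calls belong to the same phase set; classify_pair and its return labels are invented for illustration.

```python
def classify_pair(gt_a, gt_b):
    """Classify two genotypes from one sample as a compound-het pair or not."""

    def is_het(gt):
        return sorted(gt.replace("|", "/").split("/")) == ["0", "1"]

    if not (is_het(gt_a) and is_het(gt_b)):
        return "not_het_pair"
    if "|" in gt_a and "|" in gt_b:
        # Phased: alt alleles on opposite haplotypes means in trans.
        return "compound_het" if gt_a != gt_b else "same_haplotype"
    # At least one call is unphased, so we can only report a candidate.
    return "candidate"
```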

parallelize the loading step

Even with all of the Cython optimizations, loading very large (many variants & many samples) VCF files into the DB still takes a tremendous amount of time.

One option for speeding this up would be to use Python's multiprocessing module to assign specific chunks of the VCF file to individual processes. Each process would populate its own temporary version of each table in order to avoid deadlocks. At the end, the final tables would be created by taking a union of the temp tables. Index creation would have to be delayed until the end.

Open questions:

What is the best way to assign arbitrary chunks of Records from a VCF file to each process? Does the pysam tabix API support this? We will most likely have to roll our own.
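If we roll our own, the record ranges themselves are easy to compute once the number of records is known. A minimal sketch, assuming records are addressed by 1-based line number; chunk_ranges is a hypothetical helper and says nothing about how the records are then fetched from the file.

```python
def chunk_ranges(num_records, num_chunks):
    """Return 1-based inclusive (start, end) ranges covering every record exactly once."""
    if num_records < 0 or num_chunks < 1:
        raise ValueError("need num_records >= 0 and num_chunks >= 1")
    base, extra = divmod(num_records, num_chunks)
    ranges = []
    start = 1
    for i in range(num_chunks):
        # The first `extra` chunks take one additional record each.
        size = base + (1 if i < extra else 0)
        if size == 0:
            break
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

Because each range starts exactly one past the previous end, there are no gaps or overlaps, which is the property to check if records go missing after a multi-core load.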

Breakup variants table by attribute type

The variants table is getting rather large. It might make sense to break it up into a logical set of subtables and join to a core variants table based on variant_id.

Add unit tests for all tools.

The INFO keys have been revised for the new dbSNP137 dataset. The logic for dbSNP annotations, especially clinicalsig, needs to be changed.

Add pybedtools to setup.py

The windower tool uses pybedtools, so pybedtools should be auto-installed as part of the setup.py script.

Cleanup crumbs in multi-core loading

Need to remove the chunked VCF after each chunk is loaded into the intermediate DB
Remove all chunked DBs after merging is complete.

Add 1000 Genomes variant allele frequencies.

Could we do overall frequencies, plus freqs broken down by ethnicity and sub-population?

Record a variant's amino acid and codon number.

Certainly not all LOF mutations are created equal. For example, stop codons that truncate merely the terminal 1% of the polypeptide are far less pernicious than those truncating 99%.

Existing tools do a poor job of using this information to refine the list of functional variants for a given genome. We need to record the amino acid and codon number for each variant so we can develop new methods for refining the putative functional impact of a variant with respect to the position in the protein.

SnpEff seems to report some of this information, but perhaps we are not parsing it correctly?
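To make the 1% versus 99% example concrete: once the codon number and the protein length are recorded, the truncated fraction is simple arithmetic. A sketch, assuming 1-based codon numbering and that a stop gained at codon n removes residues n through the end; fraction_truncated is a hypothetical name.

```python
def fraction_truncated(codon_number, protein_length):
    """Fraction of the polypeptide lost to a stop codon at the given position."""
    if not 1 <= codon_number <= protein_length:
        raise ValueError("codon_number must lie within the protein")
    return (protein_length - codon_number + 1) / protein_length
```

For a 1000-residue protein, a stop at codon 991 loses 10 residues (1%), while a stop at codon 11 loses 990 (99%).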

Query for multiple Hets

A suggestion for enhancement from the current comp_het identification would be for a way to quickly screen for genes with multiple heterozygous variants. We may not have phased data but being able to quickly pull out all genes with multiple hets can be useful. Especially if we can also filter for genes with too many, two rare hits, etc.
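The screening step could be as simple as counting het calls per gene for a sample. A minimal sketch, assuming the input is an iterable of (gene, genotype) pairs with VCF-style biallelic genotype strings; the function name and the min_hets/max_hets parameters are made up for illustration.

```python
from collections import defaultdict


def genes_with_multiple_hets(variants, min_hets=2, max_hets=None):
    """Return {gene: het_count} for genes whose het count falls in the requested range."""
    counts = defaultdict(int)
    for gene, genotype in variants:
        alleles = sorted(genotype.replace("|", "/").split("/"))
        if alleles == ["0", "1"]:
            counts[gene] += 1
    return {
        gene: n
        for gene, n in counts.items()
        if n >= min_hets and (max_hets is None or n <= max_hets)
    }
```

Setting max_hets gives a crude way to drop genes with too many hits.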

Separate variant gene annotations into distinct table

Currently, the variants table contains a row for each variant/prediction combination. For example, if a non-synonymous variant affects 5 transcripts, 5 rows will be inserted into the table.

It would be much more convenient for basic queries to have the variants table only house one row for said variant, and have another table, variant_impacts, house the 5 rows for each transcript, while using a FK back to the variants table.
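A sketch of what the split might look like in SQLite. Only the table names and variant_id come from the proposal above; the other columns (transcript, impact) and the sample values are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.execute(
    """CREATE TABLE variants (
           variant_id INTEGER PRIMARY KEY, chrom TEXT, start INTEGER, "end" INTEGER)"""
)
conn.execute(
    """CREATE TABLE variant_impacts (
           variant_id INTEGER NOT NULL REFERENCES variants (variant_id),
           transcript TEXT, impact TEXT)"""
)

# One row for the variant, five rows for its per-transcript predictions.
conn.execute("INSERT INTO variants VALUES (1, 'chr1', 100, 101)")
conn.executemany(
    "INSERT INTO variant_impacts VALUES (1, ?, 'non_syn_coding')",
    [("tx%d" % i,) for i in range(1, 6)],
)

n_variants = conn.execute("SELECT COUNT(*) FROM variants").fetchone()[0]
n_impacts = conn.execute(
    "SELECT COUNT(*) FROM variants v JOIN variant_impacts i USING (variant_id)"
).fetchone()[0]
```

Basic queries then touch one row per variant, and the join is only paid when transcript-level detail is needed.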

is_lof is not loaded correctly using VEP annotations.

All variants have is_lof = 0

de novo tool

Add a new tool to report de novo mutations when a child and his/her parents are available.
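The core test for a trio is small. A rough sketch, assuming VCF-style genotype strings and the simplest definition (both parents homozygous reference, child carries the alt allele); real filtering on depth and genotype quality is left out, and is_de_novo_candidate is a hypothetical name.

```python
def is_de_novo_candidate(child, father, mother):
    """True if the child carries an alt allele that neither parent carries."""

    def alleles(gt):
        return set(gt.replace("|", "/").split("/"))

    parents_hom_ref = alleles(father) == {"0"} and alleles(mother) == {"0"}
    child_carries_alt = "1" in alleles(child)
    return parents_hom_ref and child_carries_alt
```

A missing parental call such as ./. is never treated as homozygous reference, so those sites are not reported.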

Installation error

Hi,
Right now the installation on OSX is a bit clunky, have to do python setup.py install --prefix /Users/aquinom/gemini to get an error message that the directory and subdirectories don't exist.
First you have to manually create gemini/lib/python2.7/site-packages then 'export PYTHONPATH=/Users/aquinom/gemini/lib/python2.7/site-packages'. Then when you run the install it crashes with the error message 'error: Setup script exited with error: unknown file type '.pyx' (from 'pysam/csamtools.pyx')' which a quick google search reveals is an issue of Cython vs Pyrex. At this point I'm a little stumped.

Record the INFO field of a VCF

The INFO fields of an input VCF need to be recorded. This would make it possible to issue a warning for database columns that are set to None because the corresponding INFO fields are missing from the VCF.

Update 1000G, ESP, and dbSNP

New versions of these annotation files have been released. We need to upgrade to the latest versions.

Add support for Gerstein lab's VAT

Need a means for users to easily add custom annotation files

This was the primary feedback received at CSHL BoG.

gemini load error

I get an error upon loading a database with or without the --cores option. Seems like it's looking for a file called 'clinvar_20130118.vcf.gz'

The VCF was downloaded here (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/trio/snps/CEU.trio.2010_03.genotypes.vcf.gz), run through snpEff, and tabixed with no other modifications. It validates with a warning only (see below)

pwd
#/mnt/thor_pool1/user_data/cc2qe/projects/plat_trio

gemini load -v CEU.trio.2010_03.genotypes.snpEff.vcf.gz -t snpEff --cores 10 CEU.trio.2010_03.genotypes.snpEff.vcf.gz.db
# Traceback (most recent call last):
#   File "/mnt/thor_pool1/user_data/cc2qe/software/bin/gemini", line 5, in <module>
#     pkg_resources.run_script('gemini==0.1.0', 'gemini')
#   File "build/bdist.linux-i686/egg/pkg_resources.py", line 489, in run_script
#   File "build/bdist.linux-i686/egg/pkg_resources.py", line 1207, in run_script
#   File "/mnt/thor_pool1/user_data/cc2qe/software/Python-2.7.3/lib/python2.7/site-packages/gemini-0.1.0-py2.7.egg/EGG-INFO/scripts/gemini", line 5, in <module>
#     gemini.gemini_main.main()
#   File "/mnt/thor_pool1/user_data/cc2qe/software/Python-2.7.3/lib/python2.7/site-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_main.py", line 554, in main
#     args.func(parser, args)
#   File "/mnt/thor_pool1/user_data/cc2qe/software/Python-2.7.3/lib/python2.7/site-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_load.py", line 405, in load
#     annotations.load_annos()
#   File "/mnt/thor_pool1/user_data/cc2qe/software/Python-2.7.3/lib/python2.7/site-packages/gemini-0.1.0-py2.7.egg/gemini/annotations.py", line 153, in load_annos
#     annos[anno] = pysam.Tabixfile(anno_files[anno])
#   File "ctabix.pyx", line 92, in pysam.ctabix.Tabixfile.__cinit__ (pysam/ctabix.c:2241)
#   File "ctabix.pyx", line 132, in pysam.ctabix.Tabixfile._open (pysam/ctabix.c:2661)
# IOError: file `/mnt/thor_pool1/user_data/cc2qe/software/gemini-master/share/gemini/data/clinvar_20130118.vcf.gz` not found

vcf-validator CEU.trio.2010_03.genotypes.snpEff.vcf.gz
# Leading or trailing space in attr_key-attr_value pairs is discouraged:
#         [Description] [Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change| Amino_Acid_length | Gene_Name | Gene_BioType | Coding | Transcript | Exon [ | ERRORS | WARNINGS ] )' ]
#         INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change| Amino_Acid_length | Gene_Name | Gene_BioType | Coding | Transcript | Exon [ | ERRORS | WARNINGS ] )' ">

Missing records in multi-core

Well, we're doing bioinformatics. Looks like there are occasionally some missing records in the chunk when loading with multi-core. Either a bug in grabix or a bug in the logic to compute the chunk ranges.

Mode for simply annotating a VCF file

Add a mode that will simply update an existing VCF file with the annotations provided by gemini using the INFO field.

SQLite floating point precision

Several columns in the variants table use FLOAT as the data type. Unfortunately, this appears to lack the necessary precision to represent very rare variants. For example, 8.99E-5 is stored as 0.0. Based on a bit of research, it seems that either a "REAL" data type or a DOUBLE(x,y) data type may work.
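A quick way to check the REAL suggestion is to round-trip the example value through a REAL column with Python's sqlite3 module. This is only a sketch with a toy table; the aaf column name is a placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE variants (variant_id INTEGER PRIMARY KEY, aaf REAL)")
conn.execute("INSERT INTO variants (aaf) VALUES (?)", (8.99e-5,))

# typeof() reports the storage class SQLite actually used for the value.
stored_value, storage_class = conn.execute(
    "SELECT aaf, typeof(aaf) FROM variants"
).fetchone()
```

Here the value comes back intact with storage class real. SQLite's documented type-affinity rules also give FLOAT and DOUBLE columns REAL affinity, so if values still arrive as 0.0 it may be worth checking whether the loss happens before insertion rather than in the column type.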

Shortcut for range queries

We need a shortcut for finding all variants within a certain chromosome interval. Since SVs and INDELs are in the variants table, it's not as simple as:

select * from variants where chrom = 'chr1' and start >= 100 and end <= 200

Two shortcuts should be provided:

Single intervals. Provide an interval a la samtools and all of the correct interval arithmetic will be auto-generated in a proper query behind the scenes.

gemini range -r chr1:100-200 my.db

A file of intervals. Provide a file of intervals and, behind the scenes, a query will be combined with bedtools to return all variants that overlap >=1 interval in the file.

gemini range -f intervals.bed my.db

Lastly, one can always just pipe to bedtools:

gemini get -s variants my.db | bedtools intersect -a - -b intervals.bed

Can we use the special "range" index in sqlite?
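For the single-interval case, the interval arithmetic generated behind the scenes might look like the sketch below. It assumes half-open coordinates for both the table and the region string, and uses the usual overlap test (start < region_end AND end > region_start) so that a variant spanning the region boundary is still reported. parse_region and variants_in_region are hypothetical helpers.

```python
import re
import sqlite3


def parse_region(region):
    """Parse 'chr1:100-200' into ('chr1', 100, 200)."""
    match = re.match(r"^([^:]+):(\d+)-(\d+)$", region)
    if match is None:
        raise ValueError("expected chrom:start-end, got %r" % region)
    return match.group(1), int(match.group(2)), int(match.group(3))


# "end" is quoted because END is also an SQL keyword.
OVERLAP_SQL = (
    'SELECT chrom, start, "end" FROM variants '
    'WHERE chrom = ? AND start < ? AND "end" > ? ORDER BY start'
)


def variants_in_region(conn, region):
    """Return every variant that overlaps the region by at least one base."""
    chrom, start, end = parse_region(region)
    return conn.execute(OVERLAP_SQL, (chrom, end, start)).fetchall()
```

Unlike the containment query above, this returns a deletion at 50-150 for the region chr1:100-200.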

Export query results to JSON

This will support integration with JavaScript visualization frameworks such as scribl, D3, etc.
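A minimal sketch of what the export could do, using only the standard library: run the query, pair each row with the column names from the cursor, and serialize the result as a JSON array of objects. query_to_json is a hypothetical helper, not an existing gemini function.

```python
import json
import sqlite3


def query_to_json(conn, query, params=()):
    """Run a query and return its rows as a JSON array of {column: value} objects."""
    cursor = conn.execute(query, params)
    names = [column[0] for column in cursor.description]
    return json.dumps([dict(zip(names, row)) for row in cursor])
```

An array of flat objects is the shape most JavaScript plotting code expects, though a very large result set would be better streamed than built in memory.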

Add pathway support via KEGG or Reactome

Uma is working on this. Possibly use the pre-computed files available through the Broad Institute?

Blank columns for gene names using HGNC symbols in VEP

When run with the --hgnc option, VEP gives HGNC symbols for genes when available and returns none otherwise. Since Gemini uses only the HGNC option for VEP genes, the gene column ends up empty whenever no HGNC symbol is available for the affected gene.

The code needs to be fixed to return the ensembl gene id (default output by VEP) for gene whenever HGNC is none.
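The fallback itself is a one-liner; a sketch is below. The set of values treated as "no symbol" (None, empty string, "-", "None") is an assumption about how a missing HGNC symbol shows up after parsing, and gene_label is a hypothetical name.

```python
EMPTY_MARKERS = (None, "", "-", "None")  # assumed spellings of a missing symbol


def gene_label(hgnc_symbol, ensembl_gene_id):
    """Prefer the HGNC symbol; fall back to the Ensembl gene id when it is missing."""
    if hgnc_symbol in EMPTY_MARKERS:
        return ensembl_gene_id
    return hgnc_symbol
```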

Use pre-computed predictions for polyphen 2 and SIFT

Instead of relying upon snpEff or VEP, it may be better to simply create a pre-computed annotation file from the SIFT and PolyPhen tools.

Speed up merging of chunks.

When parallel loading exome size files with, for example 16 cores, the slowest part will now be the merging of the chunks. Currently this is done sequentially. It should be trivial to do this with a "merge-sort" algorithm, whereby you have 8 cores each merging two files, then 4 cores, merging the 8 from the previous step, etc. Boom.
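The pairwise scheme can be sketched independently of how two chunk DBs are actually merged. In the sketch below, merge_pair is a caller-supplied function (an assumption; the real one would merge two chunk databases), and the pairs within a round are run sequentially here even though they are independent and could be handed to separate workers.

```python
def tree_merge(items, merge_pair):
    """Merge items pairwise in rounds; return (merged_result, number_of_rounds)."""
    items = list(items)
    if not items:
        raise ValueError("nothing to merge")
    rounds = 0
    while len(items) > 1:
        # Every pair in this round is independent of the others.
        merged = [
            merge_pair(items[i], items[i + 1])
            for i in range(0, len(items) - 1, 2)
        ]
        if len(items) % 2:
            merged.append(items[-1])  # odd one out advances unchanged
        items = merged
        rounds += 1
    return items[0], rounds
```

With 16 chunks this takes 4 rounds (8, 4, 2, then 1 merge) instead of 15 sequential merges.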

gemini_load.py refactor

There is some duplicated and crufty code in here that should be refactored. Possibly wrapping some of the stuff up into objects to fit the style of the rest of the code base.

Add a tool for runs of homozygosity.

Support for SVs - requires VCF 4.1 support

Currently, only SNPs and INDELs are supported, as PyVcf only supports VCF 4.0.

We need to first enhance PyVcf to support 4.1 in order to support SVs in Gemini.

Allow users to specify a data installation directory

Currently, gemini installs all data files to /usr/share/data/gemini. We need to adjust this because many users won't have admin privileges. We need to think about how best to do this. Perhaps we could just keep a file in the package installation directory whose sole content is the path of the data installation directory. When gemini needs to look for annotation files, it will query this file, grab the path and be on its merry way.
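The pointer-file idea could be as small as the sketch below: one helper writes the chosen data directory at install time, the other reads it back when annotation files are needed. The function names and the one-path-per-file format are assumptions, not existing gemini behavior.

```python
import os


def write_data_dir(config_path, data_dir):
    """Record the data installation directory as the only line of the pointer file."""
    with open(config_path, "w") as handle:
        handle.write(data_dir + "\n")


def read_data_dir(config_path):
    """Return the data installation directory stored in the pointer file."""
    with open(config_path) as handle:
        path = handle.readline().strip()
    if not path:
        raise ValueError("no data directory recorded in %s" % config_path)
    return os.path.expanduser(path)
```

Expanding ~ on read lets a user without admin privileges point the install at a directory under their home.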

Store virtual offset in db

Storing the virtual offset will allow us to quickly get back to the original VCF record if the VCF is tabixed or a BCF.

Good idea from Heng Li: http://www.biostars.org/p/65920/

Need a GeminiQuery class

The query interface should be object oriented to facilitate interaction with other libraries.

Add clinvar?

dbSNP now has a separate VCF for annotations derived from clinvar. This may be worth adding, as currently, the existing dbSNP VCF merely states whether a variant has (or not) clinical significance. ClinVar ostensibly annotates exactly what the significance is (e.g., the disease names, etc.).

Add Exome Sequencing Project annotations

New columns should be added to the variants table describing the EA, AA, and ALL allele frequencies observed in the NHLBI Exome Sequencing Project data. In addition, we need a column indicating whether or not a variant is on the "Exome chip".

basic plotting functionality

use genometools' python API?
use matplotlib, similar to the work that @daler has done with pybedtools?

Need default tracks.
Need derived tracks.
Coloring
Density
Size

use pipes when loading a chunk

There is no need to land a temp file when chunking the input VCF. Just pipe from grabix to gemini_load_chunk, so as to save space. This makes a big difference with huge input VCF files.

Populate - VCF Error

/usr/local/src/gemini/build/scripts-2.6/gemini load -v /usr/local/projects/EdgeBio-20120608-Exome_WGS/secondary/NA12877-NG-MM1/snp/NA12877-NG-MM1.snp.snpEff.vcf -t snpEff edgebio-exome.db

/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/sql_extended.py:114: DeprecationWarning: Upcase class is deprecated, use upcaseTokens parse action instead
columnName = Upcase( delimitedList( ident, ".", combine=True ) )
/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/sql_extended.py:116: DeprecationWarning: Upcase class is deprecated, use upcaseTokens parse action instead
tableName = Upcase( delimitedList( ident, ".", combine=True ) )
Traceback (most recent call last):
File "/usr/local/src/gemini/build/scripts-2.6/gemini", line 5, in
gemini.gemini_main.main()
File "/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/gemini_main.py", line 233, in main
args.func(parser, args)
File "/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/gemini_load.py", line 331, in load
gemini_loader.populate_from_vcf()
File "/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/gemini_load.py", line 54, in populate_from_vcf
for var in self.vcf_reader:
File "parser.pyx", line 1052, in cyvcf.parser.Reader.next (cyvcf/parser.c:14214)
File "parser.pyx", line 928, in cyvcf.parser.Reader._parse_samples (cyvcf/parser.c:12670)
File "parser.pyx", line 128, in cyvcf.parser._Call.cinit (cyvcf/parser.c:2818)
KeyError: 'GT'

IPython cluster startup during merge

The way it is implemented, we spin up a new IPython cluster multiple times during the merge step, which causes a hit in performance.

tool for variant info across genomic windows

Write a generic tool for reporting BEDGRAPH variation stats for windows across the genome.

For example, using the nucl_diversity column in the variants table, we can report a BEDGRAPH of the average nucleotide diversity in each window:

chr1 10000 20000 2.3
chr1 20000 30000 3.7
...

Use pybedtools.window_maker to generate the windows, along with mapbed to do the calculations. Should be easy and quite powerful.
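To pin down the expected output before wiring up pybedtools, here is a pure-Python stand-in for the window-and-map step. It is not the pybedtools approach itself: it assumes fixed-size windows, assigns each variant to a window by its start position only, and omits windows with no variants. windowed_means is a hypothetical name.

```python
from collections import defaultdict


def windowed_means(records, window_size):
    """Average a per-variant value in fixed windows.

    records: iterable of (chrom, start, value).
    Returns sorted BEDGRAPH-style tuples (chrom, win_start, win_end, mean).
    """
    sums = defaultdict(lambda: [0.0, 0])  # (chrom, win_start) -> [total, count]
    for chrom, start, value in records:
        window_start = (start // window_size) * window_size
        acc = sums[(chrom, window_start)]
        acc[0] += value
        acc[1] += 1
    return [
        (chrom, ws, ws + window_size, total / count)
        for (chrom, ws), (total, count) in sorted(sums.items())
    ]
```

Each returned tuple maps directly onto one tab-separated BEDGRAPH line.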