Git Product home page Git Product logo

gemini's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

gemini's Issues

Breakup variants table by attribute type

The variants table is getting rather large. It might make sense to break it up into a logical ste of subtables and join to a core variants table based on variant_id.

error during annotate function

I tabixed a bed format file and tried to use the gemini annotate function and received the following error:

Traceback (most recent call last):
File "/usr/local/bin/gemini", line 5, in
pkg_resources.run_script('gemini==0.1.0', 'gemini')
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.35-py2.7.egg/pkg_resources.py", line 505, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.35-py2.7.egg/pkg_resources.py", line 1245, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/EGG-INFO/scripts/gemini", line 5, in
gemini.gemini_main.main()
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_main.py", line 545, in main
args.func(parser, args)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_annotate.py", line 124, in annotate
annotate_variants_bool(args, conn)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_annotate.py", line 67, in annotate_variants_bool
return _annotate_variants(args, conn, has_anno_hit)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_annotate.py", line 46, in _annotate_variants
get_val_fn(annotations_in_region(row, annos, "tuple", naming))))
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/annotations.py", line 224, in annotations_in_region
return _get_hits(coords, anno, parser_type)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/annotations.py", line 172, in _get_hits
hit_iter = annotation.fetch(chrom, start, end, parser=parser)
File "ctabix.pyx", line 214, in pysam.ctabix.Tabixfile.fetch (pysam/ctabix.c:3590)
File "ctabix.pyx", line 180, in pysam.ctabix.Tabixfile._parseRegion (pysam/ctabix.c:3157)
File "ctabix.pyx", line 52, in pysam.ctabix._force_bytes (pysam/ctabix.c:1752)
TypeError: Expected bytes, got unicode

Allow users to specify a data installation directory

Currently, gemini installs all data files to /usr/share/data/gemini. We need to adjust this because many users won't have admin priveleges. We need to think about how best to do this. Perhaps we could just store a file in the package installation directory that solely stores the path of the data installation directory in a file. When gemini needs to look for annotation files, it will query this file, grab the path and be on its merry way.

Separate variant gene annotations into distinct table

Currently, the variants table contains a row for each variant/prediction combination. For example, if a non-synonymous variant affects 5 transcripts, 5 rows will be inserted into the table.

It would be much more convenient for basic queries to have the variants table only house one row for said variant, and have another table, variant_impacts, house the 5 rows for each transcript, while using a FK back to the variants table.

Query for multiple Hets

A suggestion for enhancement from the current comp_het identification would be for a way to quickly screen for genes with multiple heterozygous variants. We may not have phased data but being able to quickly pull out all genes with multiple hets can be useful. Especially if we can also filter for genes with too many, two rare hits, etc.

Blank columns for gene names using HGNC symbols in VEP

VEP gives HGNC symbols for genes when available and returns none otherwise, while using the option --hgnc. Since Gemini is using only HGNC option for VEP genes, this would basically return a none value when HGNC gene symbols are unavailable for the otherwise affected gene.

The code needs to be fixed to return the ensembl gene id (default output by VEP) for gene whenever HGNC is none.

parallelize the loading step

Even with all of the Cython optimizations, loading very large (many variants & many samples) VCF files into the DB still takes a tremendous amount of time.

One option for speeding this up would be to use Python's multiprocessing module to assign specific chunks of the VCF file to individual processes. Each process would populate it's own temporary version of each table in order to avoid deadlocks. At the end, the final tables would be created by taking a union of the temp tables. Index creation would have to be delayed until the end.

Open questions:

  1. What is the best way to assign arbitrary chunks of Records from a VCF file to each process? Does the pysam tabix API support this? We will most likely have to roll our own.

gemini load error

I get an error upon loading a database with or without the --cores option. Seems like it's looking for a file called 'clinvar_20130118.vcf.gz'

The VCF was downloaded here (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/trio/snps/CEU.trio.2010_03.genotypes.vcf.gz), run through snpEff, and tabixed with no other modifications. It validates with a warning only (see below)

pwd
#/mnt/thor_pool1/user_data/cc2qe/projects/plat_trio

gemini load -v CEU.trio.2010_03.genotypes.snpEff.vcf.gz -t snpEff --cores 10 CEU.trio.2010_03.genotypes.snpEff.vcf.gz.db
# Traceback (most recent call last):
#   File "/mnt/thor_pool1/user_data/cc2qe/software/bin/gemini", line 5, in <module>
#     pkg_resources.run_script('gemini==0.1.0', 'gemini')
#   File "build/bdist.linux-i686/egg/pkg_resources.py", line 489, in run_script
#   File "build/bdist.linux-i686/egg/pkg_resources.py", line 1207, in run_script
#   File "/mnt/thor_pool1/user_data/cc2qe/software/Python-2.7.3/lib/python2.7/site-packages/gemini-0.1.0-py2.7.egg/EGG-INFO/scripts/gemini", line 5, in <module>
#     gemini.gemini_main.main()
#   File "/mnt/thor_pool1/user_data/cc2qe/software/Python-2.7.3/lib/python2.7/site-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_main.py", line 554, in main
#     args.func(parser, args)
#   File "/mnt/thor_pool1/user_data/cc2qe/software/Python-2.7.3/lib/python2.7/site-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_load.py", line 405, in load
#     annotations.load_annos()
#   File "/mnt/thor_pool1/user_data/cc2qe/software/Python-2.7.3/lib/python2.7/site-packages/gemini-0.1.0-py2.7.egg/gemini/annotations.py", line 153, in load_annos
#     annos[anno] = pysam.Tabixfile(anno_files[anno])
#   File "ctabix.pyx", line 92, in pysam.ctabix.Tabixfile.__cinit__ (pysam/ctabix.c:2241)
#   File "ctabix.pyx", line 132, in pysam.ctabix.Tabixfile._open (pysam/ctabix.c:2661)
# IOError: file `/mnt/thor_pool1/user_data/cc2qe/software/gemini-master/share/gemini/data/clinvar_20130118.vcf.gz` not found

vcf-validator CEU.trio.2010_03.genotypes.snpEff.vcf.gz
# Leading or trailing space in attr_key-attr_value pairs is discouraged:
#         [Description] [Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change| Amino_Acid_length | Gene_Name | Gene_BioType | Coding | Transcript | Exon [ | ERRORS | WARNINGS ] )' ]
#         INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change| Amino_Acid_length | Gene_Name | Gene_BioType | Coding | Transcript | Exon [ | ERRORS | WARNINGS ] )' ">

gemini_load.py refactor

There is some duplicated and crufty code in here that should be refactored. Possibly wrapping some of the stuff up into objects to fit the style of the rest of the code base.

Missing records in multi-core

Well, we're doing bioinformatics. Looks like there are occasionally some missing records in the chunk when loading with multi-core. Either a bug in grabix or a bug in the logic to compute the chunk ranges.

Record a variant's amino acid and codo number.

Certainly not all LOF mutations are created equal. For example, stop codons that occur will truncate merely the terminal 1% of the polypeptide are far less pernicious that those truncating 99%.

Existing tools do a poor job of using this information to refine the list of functional variants for a given genome. We need to record the amino acid and codon number for each variant so we can develop new methods for refining the putative functional impact of a variant with respect to the position in the protein.

SnpEff seems to report some of this information, but perhaps we are not parsing it correctly?

Need a means for user to be reminded of table schema

As more and more annotations are integrated, it will become increasingly harder for the user to remember the columns that are in each table for ad hoc queries.

As a supplement to a good, oft-updated documentation site, we need a command line tool (perhaps gemini schema) for the user to quickly get a sense of the columsn, their types, and their defaults. It should also generate a link to the official docs.

SQLite floating point precision

Several columns in the variants table use FLOAT as the data type. Unfortunately, this appears to lack the necessary precision to represent very rare variants. For example, 8.99E-5 is stored as 0.0. Based on a bit of research, it seems that either a "REAL" data type or a DOUBLE(x,y) data type may work.

Record the INFO field of a VCF

The INFO fields of an input VCF need to be recorded. This would help issue a warning message for database columns that are set to None, due to missing INFO fields in the vcf.

tool for variant info across genomic windows

Write a generic tool for reporting BEDGRAPH variation stats for windows across the genome.

For example, using the nucl_diversity column in the variants table, we can report a BEDGRAPH of the average nucleotide diversity in each window:

chr1 10000 20000 2.3
chr1 20000 30000 3.7
...

Use pybedtools.window_maker to generate the windows, along with mapbed to do the calculations. Should be easy and quite powerful.

Add Exome Sequencing Project annotations

New columns should be added to the variants table describing the EA, AA, and ALL allele frequencies observed in the NHLBI Exome Sequencing Project data. In addition, we need a column indicating whether or not a variant is on the "Exome chip".

de novo tool

Add a new tool to report de novo mutations when a child and his/her parents are available.

Add clinvar?

dbSNP now has a separate VCF for annotations derived from clinvar. This may be worth adding, as currently, the existing dbSNP VCF merely states whether a variant has (or not) clinical significance. ClinVar ostensibly annotates exactly what the significance is (e.g., the disease names, etc.).

Add pybedtools to setup.py

The windower tool uses pybedtools, so pybedtools should be auto-installed as part of the setup.py script.

Restructure shortcuts into submodules

Theere are now too many shortcuts under the gemini get submodule. Each shortcut should be split out into distinct submodules to avoid confusion owing to the various options.

Populate - VCF Error

/usr/local/src/gemini/build/scripts-2.6/gemini load -v /usr/local/projects/EdgeBio-20120608-Exome_WGS/secondary/NA12877-NG-MM1/snp/NA12877-NG-MM1.snp.snpEff.vcf -t snpEff edgebio-exome.db

/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/sql_extended.py:114: DeprecationWarning: Upcase class is deprecated, use upcaseTokens parse action instead
columnName = Upcase( delimitedList( ident, ".", combine=True ) )
/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/sql_extended.py:116: DeprecationWarning: Upcase class is deprecated, use upcaseTokens parse action instead
tableName = Upcase( delimitedList( ident, ".", combine=True ) )
Traceback (most recent call last):
File "/usr/local/src/gemini/build/scripts-2.6/gemini", line 5, in
gemini.gemini_main.main()
File "/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/gemini_main.py", line 233, in main
args.func(parser, args)
File "/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/gemini_load.py", line 331, in load
gemini_loader.populate_from_vcf()
File "/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/gemini_load.py", line 54, in populate_from_vcf
for var in self.vcf_reader:
File "parser.pyx", line 1052, in cyvcf.parser.Reader.next (cyvcf/parser.c:14214)
File "parser.pyx", line 928, in cyvcf.parser.Reader._parse_samples (cyvcf/parser.c:12670)
File "parser.pyx", line 128, in cyvcf.parser._Call.cinit (cyvcf/parser.c:2818)
KeyError: 'GT'

use pipes when loading a chunk

There is not need to land a temp fille when chunking the inout VCF. Just pipe from grabix to gemini_load_chunk, so as to save space. Makes a big difference with huge input VCF files.

basic plotting functionality

use genometools' python API?
use matplotlib, similar to the work that @daler has done with pybedtools?

  1. Need default tracks.
  2. Need derived tracks.
  3. Coloring
  4. Density
  5. Size

Installation error

Hi,
Right now the installation on OSX is a bit clunky, have to do python setup.py install --prefix /Users/aquinom/gemini to get an error message that the directory and subdirectories don't exist.
First you have to manually create gemini/lib/python2.7/site-packages then 'export PYTHONPATH=/Users/aquinom/gemini/lib/python2.7/site-packages'. Then when you run the install it crashes with the error message 'error: Setup script exited with error: unknown file type '.pyx' (from 'pysam/csamtools.pyx')' which a quick google search reveals is an issue of Cython vs Pyrex. At this point I'm a little stumped.

Speed up merging of chunks.

When parallel loading exome size files with, for example 16 cores, the slowest part will now be the merging of the chunks. Currently this is done sequentially. It should be trivial to do this with a "merge-sort" algorithm, whereby you have 8 cores each merging two files, then 4 cores, merging the 8 from the previous step, etc. Boom.

Enforcing chromosome order for variants

Even though records are inserted into the variants table in reference order (viz., chrom, then start pos.), when records are selected back out of the table, sort order is not preserved. This is expected based on the way RDBMS query engines strive to otimize queries such that data is retrieved as quickly as possible.

It seems that the best solution is to index that table based on chrom and start and then when sorted-order is necessary, auto-populate an ORDER BY chrom, start to the query.

This is most necessary for tools that will exploit the map functionality in bedtools for quickly measuring variant density, windowed HWE, windowed LOH, etc.

Need a GeminiQuery class

THe query interface should be object oriented to facilitate interaction with other libraries.

Shortcut for range queries

We need a shortcut for finding all variants within a certain chromosome interval. Since SVs and INDELs are in the variants table, it's not a simple as:

select * from variants where chrom = 'chr1' and start >= 100 and end <= 200

Two shortcuts should be provided:

  1. Single intervals. Provide an interval a la samtools and all of the correct interval arithmetic will be auto-generated in a proper query behind the scenes.

    gemini range -r chr1:100-200 my.db

  2. A file of intervals. Provide a file of intervals and behind the scenes, a query will be combined with bedtools to return all variants that overlap >=1 interval in the file

    gemini range -f intervals.bed my.db

Lastly, one can always just pipe to bedtools:

gemini get -s variants my.db | bedtools intersect -a - b.intervals.bed

Can we use the special "range" index in sqlite?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.