arq5x / gemini Goto Github PK
View Code? Open in Web Editor NEWa lightweight db framework for exploring genetic variation.
Home Page: http://gemini.readthedocs.org
License: MIT License
a lightweight db framework for exploring genetic variation.
Home Page: http://gemini.readthedocs.org
License: MIT License
The variants table is getting rather large. It might make sense to break it up into a logical ste of subtables and join to a core variants table based on variant_id.
I tabixed a bed format file and tried to use the gemini annotate function and received the following error:
Traceback (most recent call last):
File "/usr/local/bin/gemini", line 5, in
pkg_resources.run_script('gemini==0.1.0', 'gemini')
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.35-py2.7.egg/pkg_resources.py", line 505, in run_script
self.require(requires)[0].run_script(script_name, ns)
File "/usr/local/lib/python2.7/dist-packages/distribute-0.6.35-py2.7.egg/pkg_resources.py", line 1245, in run_script
execfile(script_filename, namespace, namespace)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/EGG-INFO/scripts/gemini", line 5, in
gemini.gemini_main.main()
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_main.py", line 545, in main
args.func(parser, args)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_annotate.py", line 124, in annotate
annotate_variants_bool(args, conn)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_annotate.py", line 67, in annotate_variants_bool
return _annotate_variants(args, conn, has_anno_hit)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_annotate.py", line 46, in _annotate_variants
get_val_fn(annotations_in_region(row, annos, "tuple", naming))))
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/annotations.py", line 224, in annotations_in_region
return _get_hits(coords, anno, parser_type)
File "/usr/local/lib/python2.7/dist-packages/gemini-0.1.0-py2.7.egg/gemini/annotations.py", line 172, in _get_hits
hit_iter = annotation.fetch(chrom, start, end, parser=parser)
File "ctabix.pyx", line 214, in pysam.ctabix.Tabixfile.fetch (pysam/ctabix.c:3590)
File "ctabix.pyx", line 180, in pysam.ctabix.Tabixfile._parseRegion (pysam/ctabix.c:3157)
File "ctabix.pyx", line 52, in pysam.ctabix._force_bytes (pysam/ctabix.c:1752)
TypeError: Expected bytes, got unicode
Currently, gemini
installs all data files to /usr/share/data/gemini. We need to adjust this because many users won't have admin priveleges. We need to think about how best to do this. Perhaps we could just store a file in the package installation directory that solely stores the path of the data installation directory in a file. When gemini
needs to look for annotation files, it will query this file, grab the path and be on its merry way.
Currently, the variants
table contains a row for each variant/prediction combination. For example, if a non-synonymous variant affects 5 transcripts, 5 rows will be inserted into the table.
It would be much more convenient for basic queries to have the variants
table only house one row for said variant, and have another table, variant_impacts
, house the 5 rows for each transcript, while using a FK back to the variants
table.
A suggestion for enhancement from the current comp_het identification would be for a way to quickly screen for genes with multiple heterozygous variants. We may not have phased data but being able to quickly pull out all genes with multiple hets can be useful. Especially if we can also filter for genes with too many, two rare hits, etc.
VEP gives HGNC symbols for genes when available and returns none otherwise, while using the option --hgnc. Since Gemini is using only HGNC option for VEP genes, this would basically return a none value when HGNC gene symbols are unavailable for the otherwise affected gene.
The code needs to be fixed to return the ensembl gene id (default output by VEP) for gene whenever HGNC is none.
All variants have is_lof = 0
The way it is implemented we spin up a new IPython cluster mulitple times during the merge step, which causes a hit in performance.
gemini_load_cmd hard codes snpEff. need to tolerate VEP as well.
Even with all of the Cython optimizations, loading very large (many variants & many samples) VCF files into the DB still takes a tremendous amount of time.
One option for speeding this up would be to use Python's multiprocessing module to assign specific chunks of the VCF file to individual processes. Each process would populate it's own temporary version of each table in order to avoid deadlocks. At the end, the final tables would be created by taking a union of the temp tables. Index creation would have to be delayed until the end.
Open questions:
Record
s from a VCF file to each process? Does the pysam tabix API support this? We will most likely have to roll our own.I get an error upon loading a database with or without the --cores option. Seems like it's looking for a file called 'clinvar_20130118.vcf.gz'
The VCF was downloaded here (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/pilot_data/release/2010_07/trio/snps/CEU.trio.2010_03.genotypes.vcf.gz), run through snpEff, and tabixed with no other modifications. It validates with a warning only (see below)
pwd
#/mnt/thor_pool1/user_data/cc2qe/projects/plat_trio
gemini load -v CEU.trio.2010_03.genotypes.snpEff.vcf.gz -t snpEff --cores 10 CEU.trio.2010_03.genotypes.snpEff.vcf.gz.db
# Traceback (most recent call last):
# File "/mnt/thor_pool1/user_data/cc2qe/software/bin/gemini", line 5, in <module>
# pkg_resources.run_script('gemini==0.1.0', 'gemini')
# File "build/bdist.linux-i686/egg/pkg_resources.py", line 489, in run_script
# File "build/bdist.linux-i686/egg/pkg_resources.py", line 1207, in run_script
# File "/mnt/thor_pool1/user_data/cc2qe/software/Python-2.7.3/lib/python2.7/site-packages/gemini-0.1.0-py2.7.egg/EGG-INFO/scripts/gemini", line 5, in <module>
# gemini.gemini_main.main()
# File "/mnt/thor_pool1/user_data/cc2qe/software/Python-2.7.3/lib/python2.7/site-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_main.py", line 554, in main
# args.func(parser, args)
# File "/mnt/thor_pool1/user_data/cc2qe/software/Python-2.7.3/lib/python2.7/site-packages/gemini-0.1.0-py2.7.egg/gemini/gemini_load.py", line 405, in load
# annotations.load_annos()
# File "/mnt/thor_pool1/user_data/cc2qe/software/Python-2.7.3/lib/python2.7/site-packages/gemini-0.1.0-py2.7.egg/gemini/annotations.py", line 153, in load_annos
# annos[anno] = pysam.Tabixfile(anno_files[anno])
# File "ctabix.pyx", line 92, in pysam.ctabix.Tabixfile.__cinit__ (pysam/ctabix.c:2241)
# File "ctabix.pyx", line 132, in pysam.ctabix.Tabixfile._open (pysam/ctabix.c:2661)
# IOError: file `/mnt/thor_pool1/user_data/cc2qe/software/gemini-master/share/gemini/data/clinvar_20130118.vcf.gz` not found
vcf-validator CEU.trio.2010_03.genotypes.snpEff.vcf.gz
# Leading or trailing space in attr_key-attr_value pairs is discouraged:
# [Description] [Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change| Amino_Acid_length | Gene_Name | Gene_BioType | Coding | Transcript | Exon [ | ERRORS | WARNINGS ] )' ]
# INFO=<ID=EFF,Number=.,Type=String,Description="Predicted effects for this variant.Format: 'Effect ( Effect_Impact | Functional_Class | Codon_Change | Amino_Acid_change| Amino_Acid_length | Gene_Name | Gene_BioType | Coding | Transcript | Exon [ | ERRORS | WARNINGS ] )' ">
There is some duplicated and crufty code in here that should be refactored. Possibly wrapping some of the stuff up into objects to fit the style of the rest of the code base.
If genotypes are phased, it's trivial.
If genotypes are unphased, report candidates.
Well, we're doing bioinformatics. Looks like there are occasionally some missing records in the chunk when loading with multi-core. Either a bug in grabix or a bug in the logic to compute the chunk ranges.
Keeping the DB and the VCF together in a project directory will enable new tools and will allow us to eventually get back to the original VCF records.
Uma is working on this. Possibly use the pre-computed files available through the Broad Institute?
Certainly not all LOF mutations are created equal. For example, stop codons that occur will truncate merely the terminal 1% of the polypeptide are far less pernicious that those truncating 99%.
Existing tools do a poor job of using this information to refine the list of functional variants for a given genome. We need to record the amino acid and codon number for each variant so we can develop new methods for refining the putative functional impact of a variant with respect to the position in the protein.
SnpEff seems to report some of this information, but perhaps we are not parsing it correctly?
As more and more annotations are integrated, it will become increasingly harder for the user to remember the columns that are in each table for ad hoc queries.
As a supplement to a good, oft-updated documentation site, we need a command line tool (perhaps gemini schema
) for the user to quickly get a sense of the columsn, their types, and their defaults. It should also generate a link to the official docs.
This will support integration with JavaScript visualization frameworks such as scribl, D3, etc.
Instead of relying upon snpEff or VEP, it may be better to simply create a pre-computed annotation file from the SIFT and PolyPhen tools.
Several columns in the variants table use FLOAT as the data type. Unfortunately, this appears to lack the necessary precision to represent very rare variants. For example, 8.99E-5 is stored as 0.0. Based on a bit of research, it seems that either a "REAL" data type or a DOUBLE(x,y) data type may work.
The INFO fields of an input VCF need to be recorded. This would help issue a warning message for database columns that are set to None, due to missing INFO fields in the vcf.
Write a generic tool for reporting BEDGRAPH variation stats for windows across the genome.
For example, using the nucl_diversity
column in the variants
table, we can report a BEDGRAPH of the average nucleotide diversity in each window:
chr1 10000 20000 2.3
chr1 20000 30000 3.7
...
Use pybedtools.window_maker to generate the windows, along with mapbed to do the calculations. Should be easy and quite powerful.
New columns should be added to the variants
table describing the EA, AA, and ALL allele frequencies observed in the NHLBI Exome Sequencing Project data. In addition, we need a column indicating whether or not a variant is on the "Exome chip".
Storing the virtual offset will allow us to quickly get back to the original VCF record if the VCF is tabixed or a BCF.
Good idea from Heng Li: http://www.biostars.org/p/65920/
This was the primary feedback received at CSHL BoG.
Add a new tool to report de novo mutations when a child and his/her parents are available.
dbSNP now has a separate VCF for annotations derived from clinvar. This may be worth adding, as currently, the existing dbSNP VCF merely states whether a variant has (or not) clinical significance. ClinVar ostensibly annotates exactly what the significance is (e.g., the disease names, etc.).
The windower
tool uses pybedtools, so pybedtools should be auto-installed as part of the setup.py script.
Theere are now too many shortcuts under the gemini get
submodule. Each shortcut should be split out into distinct submodules to avoid confusion owing to the various options.
/usr/local/src/gemini/build/scripts-2.6/gemini load -v /usr/local/projects/EdgeBio-20120608-Exome_WGS/secondary/NA12877-NG-MM1/snp/NA12877-NG-MM1.snp.snpEff.vcf -t snpEff edgebio-exome.db
/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/sql_extended.py:114: DeprecationWarning: Upcase class is deprecated, use upcaseTokens parse action instead
columnName = Upcase( delimitedList( ident, ".", combine=True ) )
/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/sql_extended.py:116: DeprecationWarning: Upcase class is deprecated, use upcaseTokens parse action instead
tableName = Upcase( delimitedList( ident, ".", combine=True ) )
Traceback (most recent call last):
File "/usr/local/src/gemini/build/scripts-2.6/gemini", line 5, in
gemini.gemini_main.main()
File "/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/gemini_main.py", line 233, in main
args.func(parser, args)
File "/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/gemini_load.py", line 331, in load
gemini_loader.populate_from_vcf()
File "/usr/local/python/lib/python2.6/site-packages/gemini-0.1.0-py2.6.egg/gemini/gemini_load.py", line 54, in populate_from_vcf
for var in self.vcf_reader:
File "parser.pyx", line 1052, in cyvcf.parser.Reader.next (cyvcf/parser.c:14214)
File "parser.pyx", line 928, in cyvcf.parser.Reader._parse_samples (cyvcf/parser.c:12670)
File "parser.pyx", line 128, in cyvcf.parser._Call.cinit (cyvcf/parser.c:2818)
KeyError: 'GT'
There is not need to land a temp fille when chunking the inout VCF. Just pipe from grabix to gemini_load_chunk, so as to save space. Makes a big difference with huge input VCF files.
use genometools' python API?
use matplotlib, similar to the work that @daler has done with pybedtools?
Hi,
Right now the installation on OSX is a bit clunky, have to do python setup.py install --prefix /Users/aquinom/gemini to get an error message that the directory and subdirectories don't exist.
First you have to manually create gemini/lib/python2.7/site-packages then 'export PYTHONPATH=/Users/aquinom/gemini/lib/python2.7/site-packages'. Then when you run the install it crashes with the error message 'error: Setup script exited with error: unknown file type '.pyx' (from 'pysam/csamtools.pyx')' which a quick google search reveals is an issue of Cython vs Pyrex. At this point I'm a little stumped.
When parallel loading exome size files with, for example 16 cores, the slowest part will now be the merging of the chunks. Currently this is done sequentially. It should be trivial to do this with a "merge-sort" algorithm, whereby you have 8 cores each merging two files, then 4 cores, merging the 8 from the previous step, etc. Boom.
Even though records are inserted into the variants
table in reference order (viz., chrom, then start pos.), when records are selected back out of the table, sort order is not preserved. This is expected based on the way RDBMS query engines strive to otimize queries such that data is retrieved as quickly as possible.
It seems that the best solution is to index that table based on chrom
and start
and then when sorted-order is necessary, auto-populate an ORDER BY chrom, start
to the query.
This is most necessary for tools that will exploit the map
functionality in bedtools
for quickly measuring variant density, windowed HWE, windowed LOH, etc.
Uma needs to document exactly how VEP and snpEff must be run.
For example, if -t VEP is used yet VEP was run without Polyphen options, line 58 in vep.py fails:
self.polyphen2 = self.polyphen_b[1].split(")")
Currently, only SNPs and INDELs are supported, as PyVcf only supports VCF 4.0.
We need to first enhance PyVcf to support 4.1 in order to support SVs in Gemini.
Add a mode that will simply update an existing VCF file with the annotations provided by gemini
using the INFO field.
THe query interface should be object oriented to facilitate interaction with other libraries.
We need a shortcut for finding all variants within a certain chromosome interval. Since SVs and INDELs are in the variants
table, it's not a simple as:
select * from variants where chrom = 'chr1' and start >= 100 and end <= 200
Two shortcuts should be provided:
Single intervals. Provide an interval a la samtools and all of the correct interval arithmetic will be auto-generated in a proper query behind the scenes.
gemini range -r chr1:100-200 my.db
A file of intervals. Provide a file of intervals and behind the scenes, a query will be combined with bedtools
to return all variants that overlap >=1 interval in the file
gemini range -f intervals.bed my.db
Lastly, one can always just pipe to bedtools
:
gemini get -s variants my.db | bedtools intersect -a - b.intervals.bed
Can we use the special "range" index in sqlite
?
Could we do overall frequencies, plus freqs broken down by ethnicity and sub-population?
New versions of these annotation files have been released. We need to upgrade to the latest versions.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.