brentp / cruzdb

python access to UCSC genomes database

License: MIT License

Python 49.92% TeX 42.24% Shell 0.07% Jupyter Notebook 7.50% Makefile 0.27%

cruzdb's Introduction

A rendered version of the docs is available at: http://pythonhosted.org/cruzdb/

A paper describing cruzdb is in Bioinformatics: https://doi.org/10.1093/bioinformatics/btt534

cruzdb overview

The UCSC Genomes Database is a great resource for annotations, regulation and variation and all kinds of data for a growing number of taxa. This library aims to make utilizing that data simple so that we can do sophisticated analyses without resorting to awk-ful, error-prone manipulations. As motivation, here's an example of some of the capabilities:

>>> from cruzdb import Genome

>>> g = Genome(db="hg18")

>>> muc5b = g.refGene.filter_by(name2="MUC5B").first()
>>> muc5b
refGene(chr11:MUC5B:1200870-1239982)

>>> muc5b.strand
'+'

# the first 4 introns
>>> muc5b.introns[:4]
[(1200999L, 1203486L), (1203543L, 1204010L), (1204082L, 1204420L), (1204682L, 1204836L)]

# the first 4 exons.
>>> muc5b.exons[:4]
[(1200870L, 1200999L), (1203486L, 1203543L), (1204010L, 1204082L), (1204420L, 1204682L)]

# note that some of these are not coding because they are < cdsStart
>>> muc5b.cdsStart
1200929L

# the extent of the 5' utr.
>>> muc5b.utr5
(1200870L, 1200929L)

# we can get the (first 4) actual CDS's with:
>>> muc5b.cds[:4]
[(1200929L, 1200999L), (1203486L, 1203543L), (1204010L, 1204082L), (1204420L, 1204682L)]

# the cds sequence from the UCSC DAS server as a list with one entry per cds
>>> muc5b.cds_sequence #doctest: +ELLIPSIS
['atgggtgccccgagcgcgtgccggacgctggtgttggctctggcggccatgctcgtggtgccgcaggcag', ...]


>>> transcript = g.knownGene.filter_by(name="uc001aaa.2").first()
>>> transcript.is_coding
False

# convert a genome coordinate to a local coordinate.
>>> transcript.localize(transcript.txStart)
0L

# or localize to the CDNA position.
>>> print transcript.localize(transcript.cdsStart, cdna=True)
None
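The relationship between exons, introns, CDS and the 5' UTR shown above can be reproduced with plain tuples. This is a minimal sketch, not cruzdb's implementation: `introns_from_exons` and `clip_to_cds` are hypothetical helpers, the feature is assumed to be on the plus strand with half-open intervals, and `cds_end` is a placeholder because the real cdsEnd is not shown in the session.

```python
# First four exons of MUC5B (hg18) and its cdsStart, copied from the session above.
exons = [(1200870, 1200999), (1203486, 1203543),
         (1204010, 1204082), (1204420, 1204682)]
cds_start = 1200929
# Placeholder upper bound: the real cdsEnd is not shown above, so the end of
# the fourth exon is used purely for illustration.
cds_end = exons[-1][1]


def introns_from_exons(exons):
    """Return the gaps between consecutive exons (hypothetical helper)."""
    return [(prev_end, next_start)
            for (_, prev_end), (next_start, _) in zip(exons, exons[1:])]


def clip_to_cds(exons, cds_start, cds_end):
    """Trim exons to the coding region, dropping fully non-coding ones."""
    clipped = []
    for start, end in exons:
        start, end = max(start, cds_start), min(end, cds_end)
        if start < end:
            clipped.append((start, end))
    return clipped


introns = introns_from_exons(exons)
cds = clip_to_cds(exons, cds_start, cds_end)
utr5 = (exons[0][0], cds_start)  # plus strand: txStart up to cdsStart
```

Four exons yield only three introns here; the fourth intron in the session output needs the fifth exon, which is not listed.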

Command-Line Interface

With cruzdb 0.5.4+ installed, given a file input.bed, you can run:

python -m cruzdb hg18 input.bed refGene cpgIslandExt

to have the intervals annotated with the refGene and cpgIslandExt tables from the hg18 assembly.

DataFrames

... are so in. We can get one from a table as:

>>> df = g.dataframe('cpgIslandExt') 
>>> df.columns #doctest: +ELLIPSIS
Index([chrom, chromStart, chromEnd, name, length, cpgNum, gcNum, perCpg, perGc, obsExp], dtype=object)

All of the above can be repeated using knownGene annotations by changing 'refGene' to 'knownGene'. And, it can be done easily for a set of genes.

Spatial

k-nearest neighbors, upstream, and downstream searches are available. Up and downstream searches use the strand of the query feature to determine the direction:

>>> nearest = g.knearest("refGene", "chr1", 9444, 9555, k=6)
>>> up_list = g.upstream("refGene", "chr1", 9444, 9555, k=6)
>>> down_list = g.downstream("refGene", "chr1", 9444, 9555, k=6)
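To make the strand rule concrete, here is a small in-memory sketch of the same three searches. It is an illustration under assumptions, not how cruzdb queries the database: the functions are hypothetical, features are bare `(start, end)` tuples on one chromosome, and the coordinates are made-up toy values.

```python
def interval_distance(a_start, a_end, b_start, b_end):
    """Gap between two half-open intervals; 0 if they overlap or touch."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def knearest(features, start, end, k=1):
    """Return the k features closest to the query interval."""
    ranked = sorted(features,
                    key=lambda f: interval_distance(start, end, f[0], f[1]))
    return ranked[:k]


def upstream(features, start, end, strand="+", k=1):
    """Upstream means lower coordinates on '+', higher coordinates on '-'."""
    if strand == "+":
        candidates = [f for f in features if f[1] <= start]
    else:
        candidates = [f for f in features if f[0] >= end]
    return knearest(candidates, start, end, k)


def downstream(features, start, end, strand="+", k=1):
    """Downstream is simply upstream on the opposite strand."""
    opposite = "-" if strand == "+" else "+"
    return upstream(features, start, end, opposite, k)


# Toy intervals, not real annotations.
features = [(100, 200), (300, 400), (800, 900), (1500, 1600)]
nearest_two = knearest(features, 500, 600, k=2)
up_plus = upstream(features, 500, 600, strand="+")
up_minus = upstream(features, 500, 600, strand="-")
```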

Mirror

The above uses the MySQL interface from UCSC. It is now possible to mirror any tables from UCSC to a local SQLite database via:

# cleanup

>>> import os
>>> if os.path.exists("/tmp/u.db"): os.unlink('/tmp/u.db')

>>> g = Genome('hg18')

>>> gs = g.mirror(['chromInfo'], 'sqlite:////tmp/u.db')

and then use as:

>>> gs.chromInfo
<class 'cruzdb.sqlsoup.chromInfo'>

Code

Most of the per-row features are implemented in the Feature class in cruzdb/models.py. If you want to add something to a feature (like the existing feature.utr5), add it there.

The tables are reflected using sqlalchemy and mapped in the __getattr__ method of the Genome class in cruzdb/__init__.py.

So a call like:

genome.knownGene

calls the __getattr__ method with the table arg set to 'knownGene'. That table is then reflected, and an object whose parent classes are Feature and sqlalchemy's declarative_base is returned.
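The lazy-lookup pattern described here can be sketched without a database. Everything below is a stand-in: `LazyTables`, the empty `Feature` class and `fake_reflect` are hypothetical names used only to show how `__getattr__` can build and cache one class per table name; the real code reflects the table through sqlalchemy.

```python
class Feature(object):
    """Stand-in for the per-row behaviour that lives in cruzdb/models.py."""


class LazyTables(object):
    """Builds one class per table name the first time it is accessed."""

    def __init__(self, reflect):
        self._reflect = reflect   # callable: table name -> column names
        self._cache = {}

    def __getattr__(self, table):
        # __getattr__ only runs when normal attribute lookup fails.
        if table.startswith("_"):
            raise AttributeError(table)
        if table not in self._cache:
            columns = self._reflect(table)
            self._cache[table] = type(table, (Feature,), {"columns": columns})
        return self._cache[table]


calls = []


def fake_reflect(table):
    """Pretend reflection; records which tables were looked up."""
    calls.append(table)
    return ["chrom", "txStart", "txEnd"]


genome = LazyTables(fake_reflect)
known_gene = genome.knownGene
```

Caching matters in this design because reflection is the expensive step; a second `genome.knownGene` returns the class built the first time.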

Contributing

YES PLEASE!

To start coding, it is probably polite to grab your own copy of some of the UCSC tables so as not to overload the UCSC server. You can run something like:

Genome('hg18').mirror(["refGene", "cpgIslandExt", "chromInfo", "knownGene", "kgXref"], "sqlite:////tmp/hg18.db")

Then the connection would be something like:

g = Genome("sqlite:////tmp/hg18.db")

If you have a feature you would like to use or implement, open a ticket on GitHub for discussion. Below are some ideas.

cruzdb's Issues

Gene ABR Cruzdb shows 8 different entries in RefGene whereas RefGene in UCSC shows 5

Hi
I have noticed a discrepancy between the number of entries returned by cruzdb for the gene ABR and the number shown by UCSC. If I search for all the refGene entries like this:

g = cruzdb.Genome(db="hg19")
genelist=g.refGene.filter_by(name2="ABR").all()

for gene in genelist:
     print gene

chr17   906757  1012340 ABR 0.00    -
chr17   906757  1090616 ABR 0.00    -
chr17   906757  982386  ABR 0.00    -
chr17   906757  935081  ABR 0.00    -
chr17   906757  1083268 ABR 0.00    -
chr17   906757  1029900 ABR 0.00    -
chr17   906757  1132974 ABR 0.00    -
chr17   1129184 1132974 ABR 0.00    -

This indicates 8 different entries for this gene. However, if I look in the UCSC Table Browser using the RefSeq Genes table and select all fields from the selected table as my output format, I only get 5 entries back.

#bin    name    chrom   strand  txStart txEnd   cdsStart    cdsEnd  exonCount   exonStarts  exonEnds    score   name2   cdsStartStat    cdsEndStat  exonFrames
73  NM_001092   chr17   -   906757  1012340 909319  1012308 22  906757,910404,912918,913968,915085,915927,916344,953289,953776,959274,960237,961209,961984,970316,973208,975853,976864,982569,986759,994904,1003876,1012173,    909409,910552,913024,914103,915225,916037,916404,953421,953874,959349,960342,961285,962107,970482,973330,975994,976917,982630,986867,995090,1003975,1012340,    0   ABR cmpl    cmpl    0,2,1,1,2,0,0,0,1,1,1,0,0,2,0,0,1,0,0,0,0,0,
9   NM_001159746    chr17   -   906757  1090616 909319  1028625 23  906757,910404,912918,913968,915085,915927,916344,953289,953776,959274,960237,961209,961984,970316,973208,975853,976864,982569,986759,994904,1003876,1028517,1090093,    909409,910552,913024,914103,915225,916037,916404,953421,953874,959349,960342,961285,962107,970482,973330,975994,976917,982630,986867,995090,1003975,1028702,1090616,    0   ABR cmpl    cmpl    0,2,1,1,2,0,0,0,1,1,1,0,0,2,0,0,1,0,0,0,0,0,-1,
73  NM_001256847    chr17   -   906757  935081  909319  934981  8   906757,910404,912918,913968,915085,915927,916344,934837,    909409,910552,913024,914103,915225,916037,916404,935081,    0   ABR cmpl    cmpl    0,2,1,1,2,0,0,0,
73  NM_001282149    chr17   -   906757  982386  909319  982099  18  906757,910404,912918,913968,915085,915927,916344,953289,953776,959274,960237,961209,961984,970316,973208,975853,976864,982053,  909409,910552,913024,914103,915225,916037,916404,953421,953874,959349,960342,961285,962107,970482,973330,975994,976917,982386,  0   ABR cmpl    cmpl    0,2,1,1,2,0,0,0,1,1,1,0,0,2,0,0,1,0,
9   NM_021962   chr17   -   906757  1083268 909319  1083021 23  906757,910404,912918,913968,915085,915927,916344,953289,953776,959274,960237,961209,961984,970316,973208,975853,976864,982569,986759,994904,1003876,1028517,1082960,    909409,910552,913024,914103,915225,916037,916404,953421,953874,959349,960342,961285,962107,970482,973330,975994,976917,982630,986867,995090,1003975,1028702,1083268,    0   ABR cmpl    cmpl    0,2,1,1,2,0,0,0,1,1,1,0,0,2,0,0,1,0,0,0,0,1,0,

In particular, I notice that the region chr17:1129184-1132974 is not even mentioned in the UCSC Table Browser output. Could you shed some light on what is happening here, please?

Best wishes

Kevin

protein sequence

Hi Brent

Would you be able to add a function for retrieving a protein coding sequence from an mRNA RefSeq ID (NM_XXXXXX)?

Cheers
Joon

p.s) hope this is the ticket open..

report which exon from annotate

Genome.annotate currently reports "exon" if an exon feature overlaps the feature to be annotated. It should be able to report "exon_1", "exon_2", etc.
Perhaps also "exon_last" and "exon_first"?

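One way the numbered labels could be produced is sketched below. This is only an illustration of the request: `exon_label` is a hypothetical helper, exons are assumed to be sorted by genomic coordinate, and numbering is assumed to follow transcription order (so it is reversed on the minus strand).

```python
def exon_label(exons, strand, start, end):
    """Return 'exon_N' for the first exon overlapping [start, end), else None."""
    ordered = exons if strand == "+" else list(reversed(exons))
    for number, (ex_start, ex_end) in enumerate(ordered, 1):
        if ex_start < end and start < ex_end:
            return "exon_%d" % number
    return None


# First four MUC5B exons from the introduction, reused as sample input.
exons = [(1200870, 1200999), (1203486, 1203543),
         (1204010, 1204082), (1204420, 1204682)]
label_plus = exon_label(exons, "+", 1203500, 1203510)
label_minus = exon_label(exons, "-", 1203500, 1203510)
label_intron = exon_label(exons, "+", 1201000, 1201100)
```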
Add as PyPI package

This is a very useful package that should, IMO, be on PyPI.

If you don't want to mess with uploading it, I'd be happy to do that.

calculate bin to do more efficient query

I'd be happy to help with this, as I have logic to do this already in BEDTools. Basically, you need to be able to compute a bin for a given start/end. When searching for overlaps, you also need to create a list of higher-level bins that should also be checked. These end up going into a where clause like:

where feature in ["..."].

Based on the five minutes I've spent looking at this, I surmise this should be a method in the Mixin class?

e.g.

def getBin(chrom, start, end) # this is used to compute the bin for a single feature.

and

def getBinList(chrom, start, end) # this is used to compute the possible overlapping bins that should be inspected as part of a SQL query.

Thoughts?
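The two functions proposed above can be sketched with the standard UCSC binning scheme (five levels, smallest bins of 128 kb, each level eight times larger). This is a sketch, not code from cruzdb or BEDTools: the names `get_bin` and `get_bin_list` are adapted from the signatures above, the `chrom` argument is dropped because the bin number depends only on the coordinates, and the extended scheme for positions beyond 512 Mb is not handled.

```python
# Offsets of the first bin at each level, finest level first.
BIN_OFFSETS = (512 + 64 + 8 + 1, 64 + 8 + 1, 8 + 1, 1, 0)
FIRST_SHIFT = 17  # finest bins span 2**17 bases (128 kb)
NEXT_SHIFT = 3    # each coarser level is 2**3 = 8 times larger


def get_bin(start, end):
    """Smallest bin that fully contains the half-open range [start, end)."""
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        if start_bin == end_bin:
            return offset + start_bin
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    raise ValueError("range %d-%d is outside the standard scheme" % (start, end))


def get_bin_list(start, end):
    """All bins, at every level, that could hold a feature overlapping the range."""
    bins = []
    start_bin = start >> FIRST_SHIFT
    end_bin = (end - 1) >> FIRST_SHIFT
    for offset in BIN_OFFSETS:
        bins.extend(range(offset + start_bin, offset + end_bin + 1))
        start_bin >>= NEXT_SHIFT
        end_bin >>= NEXT_SHIFT
    return bins


abr_bin = get_bin(906757, 1012340)
abr_query_bins = get_bin_list(906757, 1012340)
```

The results agree with the `bin` column in the ABR rows quoted in the first issue on this page (73 for chr17:906757-1012340, 9 for chr17:906757-1090616). The list from `get_bin_list` would go into an `IN (...)` condition on `bin`, alongside the usual start/end overlap test.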

Generic table access

Currently the Mixin class is pretty focused on gene tables (refGene, knownGenes, etc). It would be really nice to have access to all of UCSC's tables, but to also support mixins on as many as possible.

How best to implement? Here are some ideas to kick around:

There are probably a handful of different kinds of tables in UCSC -- gene tables, like refGene and knownGenes, BED12 tables, and generic lookup tables like flybaseXref2004. There's also tables that point to bigWig files (which would be great to integrate with bx-python's bigWig parsing).

These are just some examples; I haven't gone through the schema to get an idea of how many kinds of tables there really are. But maybe it would be possible to map each kind of table to a Mixin class, with a fallback mixin that does nothing special (except general methods like Mixin.sql()). So as it stands now, such a dictionary would simply be something like {'refGene': Mixin, 'knownGenes': Mixin}

An additional challenge is that I couldn't get sqlalchemy's reflection to automatically recognize compound primary keys which the dm3.xenoMrna table has. Hopefully I'm overlooking something, otherwise this would mean having a curated mapping of such tables to what their primary keys should be.
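The table-to-mixin mapping with a do-nothing fallback could look roughly like this. All names are hypothetical (`BaseMixin`, `GeneMixin`, `TABLE_MIXINS`, `mixin_for`), and the sketch uses the UCSC table name `knownGene` rather than the `knownGenes` spelling above.

```python
class BaseMixin(object):
    """Fallback: only the general-purpose behaviour, nothing table-specific."""
    kind = "generic"


class GeneMixin(BaseMixin):
    """Would carry gene-model helpers such as exons, introns and UTRs."""
    kind = "gene"


# Curated mapping; any table not listed gets the fallback.
TABLE_MIXINS = {
    "refGene": GeneMixin,
    "knownGene": GeneMixin,
}


def mixin_for(table):
    """Pick the mixin class for a table name."""
    return TABLE_MIXINS.get(table, BaseMixin)


gene_kind = mixin_for("refGene").kind
lookup_kind = mixin_for("flybaseXref2004").kind
```

A similar curated dictionary could hold primary-key overrides for tables whose compound keys are not picked up by reflection.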

list of tables

Hi Brent!
Can you tell me where I can get an overview of all the tables (like cpgIslandExt) for, say, mm10?
I'm using cruzdb from the command line: python -m cruzdb mm10 metilene.bed cpgIslandExt ...

unable to use package after install

I fixed this locally by commenting out line #5 in cruzdb/__init__.py

error:

Python 2.7.1 (r271:86832, Jul 31 2011, 19:30:53)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.

>>> from cruzdb import Genome
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "cruzdb/__init__.py", line 5, in <module>
    from tests import test
ImportError: No module named tests

Installation Issue

It may just be me, but after installing cruzdb...

pip install cruzdb

... and trying to import it in python and python3 I am getting this error:

>>> import cruzdb
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/josephsolvason/miniconda3/lib/python3.6/site-packages/cruzdb/__init__.py", line 5, in <module>
    from . import soup
  File "/Users/josephsolvason/miniconda3/lib/python3.6/site-packages/cruzdb/soup.py", line 1, in <module>
    from . import sqlsoup
  File "/Users/josephsolvason/miniconda3/lib/python3.6/site-packages/cruzdb/sqlsoup.py", line 458
    except KeyError, ke:
                   ^
SyntaxError: invalid syntax
>>>

Thoughts?

Joe
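For context, the line flagged in the traceback uses the Python 2-only `except KeyError, ke:` form. The `as` spelling is accepted by Python 2.6+ and Python 3 alike. A minimal sketch follows; `lookup` is a hypothetical function, not the code at sqlsoup.py line 458.

```python
def lookup(mapping, key):
    """Return mapping[key], or a short message if the key is absent."""
    try:
        return mapping[key]
    except KeyError as ke:  # Python 2-only spelling: "except KeyError, ke:"
        return "missing %s" % ke


result = lookup({"chr1": 1}, "chrX")
```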

cruzdb issues when called within Python 3

I have installed cruzdb in Python 3.8.3. A few calls do not work. For example, it uses "basestring", which no longer exists in Python 3; it should be "str" instead. Also, wherever there is the statement "import MapperExtension", it is not found; instead, it seems like it should be "import MapperOption as MapperExtension". The print statements are also incompatible with Python 3. Once I try to correct those (I am not really sure about "MapperOption" or the correct print statements), the annotation still doesn't seem to work. The annotation file is empty and I get messages like the following:

<_io.TextIOWrapper name='/tmp/tmpbq2cm8yh.cruzdb.chr20' mode='w' encoding='UTF-8'> chr20 73779902 73779936 0.0264 2 0.09704 1
<_io.TextIOWrapper name='/tmp/tmpbq2cm8yh.cruzdb.chr20' mode='w' encoding='UTF-8'> chr20 73793541 73793640 0.02017 4 5.766e-05 0.6151
<_io.TextIOWrapper name='/tmp/tmpbq2cm8yh.cruzdb.chr20' mode='w' encoding='UTF-8'> chr20 73816547 73816630 0.004666 5 7.165e-05 0.7571
<_io.TextIOWrapper name='/tmp/tmpbq2cm8yh.cruzdb.chr20' mode='w' encoding='UTF-8'> chr20 74035220 74035296 0.04682 3 9.501e-05 0.8712
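One possible explanation for the output above, offered as a guess: a Python 2 statement of the form `print >>fh, ...` that is rewritten as `print(fh, ...)` prints the file object itself to the screen and writes nothing to the file. Under Python 3 the handle goes in the `file=` keyword. The sketch below shows that, plus a common shim for `basestring`; `write_row` and `string_types` are illustrative names, not cruzdb code.

```python
import io

# basestring exists only in Python 2; fall back to str elsewhere.
try:
    string_types = (basestring,)  # noqa: F821 (Python 2 only)
except NameError:
    string_types = (str,)


def write_row(handle, fields):
    """Write one tab-separated line to an open text handle."""
    print(*fields, sep="\t", file=handle)


buf = io.StringIO()
write_row(buf, ["chr20", 73779902, 73779936])
written = buf.getvalue()
```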

Misleading docstring for dataframe()

Currently, the dataframe() docstring states:

"table : table
a table in this database or a query"

However it only supports full tables, as it does a single select() without any possibility of interaction.

sqlalchemy needs to be added to requirements

When installing via pip, sqlalchemy isn't installed but it is a dependency

add ncbi link when annotating

use knownGeneToLocus table and then

http://www.ncbi.nlm.nih.gov/gene/{locus}/

Helpers to handle canonical forms of genes

The title is a bit vague, so I'll explain my use case: I have a list of genes for which I want to extract information on the promoter (hence upstream of the 5' UTR). However, I might get more than one result because of different splicing isoforms, while I might only be interested in the so-called "canonical" (the longest) form.

A helper for this would be very useful.
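A minimal sketch of such a helper, assuming "canonical" means the transcript with the longest genomic span (txEnd minus txStart). `Transcript` and `longest_per_gene` are hypothetical names; the rows are the five ABR refGene entries quoted in the first issue on this page.

```python
from collections import namedtuple

Transcript = namedtuple("Transcript", "name name2 txStart txEnd")

transcripts = [
    Transcript("NM_001092", "ABR", 906757, 1012340),
    Transcript("NM_001159746", "ABR", 906757, 1090616),
    Transcript("NM_001256847", "ABR", 906757, 935081),
    Transcript("NM_001282149", "ABR", 906757, 982386),
    Transcript("NM_021962", "ABR", 906757, 1083268),
]


def longest_per_gene(transcripts):
    """Keep, for each gene symbol, the transcript with the longest span."""
    best = {}
    for tx in transcripts:
        current = best.get(tx.name2)
        if current is None or (tx.txEnd - tx.txStart) > (current.txEnd - current.txStart):
            best[tx.name2] = tx
    return best


canonical = longest_per_gene(transcripts)
```

Longest span is only one definition; ranking by summed exon length or by CDS length would need the exon coordinates as well.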

Is it possible to continue to mirror a database to the same destination if the connection breaks?

Hi, I am trying to mirror the gbCdnaInfo table, which is pretty large (~40 GB), with:

import cruzdb

g = cruzdb.Genome(db="hg19")

gbCdnaInfo = g.mirror(['gbCdnaInfo'], 'sqlite:////home/test/gbCdnaInfo160104.db')

When I ran the code, it managed to mirror 19 GB before the connection went down. I tried to restart the above script and the error I got was:

attempting to add to existing sqlite database
Mirroring gbCdnaInfo
Traceback (most recent call last):
File "mirrordatabase160104.py", line 11, in <module>
gbCdnaInfo = g.mirror(['gbCdnaInfo'], 'sqlite:////home/test/gbCdnaInfo160104.db')
File "/usr/local/lib/python2.7/dist-packages/cruzdb/__init__.py", line 97, in mirror
return mirror(self, tables, dest_url)
File "/usr/local/lib/python2.7/dist-packages/cruzdb/mirror.py", line 110, in mirror
destination.execute(ins, records)
File "/usr/local/lib/python2.7/dist-packages/SQLAlchemy-0.9.7-py2.7-linux-x86_64.egg/sqlalchemy/orm/session.py", line 991, in execute
bind, close_with_result=True).execute(clause, params or {})
File "/usr/local/lib/python2.7/dist-packages/SQLAlchemy-0.9.7-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 729, in execute
return meth(self, multiparams, params)
File "/usr/local/lib/python2.7/dist-packages/SQLAlchemy-0.9.7-py2.7-linux-x86_64.egg/sqlalchemy/sql/elements.py", line 321, in _execute_on_connection
return connection._execute_clauseelement(self, multiparams, params)
File "/usr/local/lib/python2.7/dist-packages/SQLAlchemy-0.9.7-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 826, in _execute_clauseelement
compiled_sql, distilled_params
File "/usr/local/lib/python2.7/dist-packages/SQLAlchemy-0.9.7-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 958, in _execute_context
context)
File "/usr/local/lib/python2.7/dist-packages/SQLAlchemy-0.9.7-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 1160, in _handle_dbapi_exception
exc_info
File "/usr/local/lib/python2.7/dist-packages/SQLAlchemy-0.9.7-py2.7-linux-x86_64.egg/sqlalchemy/util/compat.py", line 199, in raise_from_cause
reraise(type(exception), exception, tb=exc_tb)
File "/usr/local/lib/python2.7/dist-packages/SQLAlchemy-0.9.7-py2.7-linux-x86_64.egg/sqlalchemy/engine/base.py", line 928, in _execute_context
context)
File "/usr/local/lib/python2.7/dist-packages/SQLAlchemy-0.9.7-py2.7-linux-x86_64.egg/sqlalchemy/engine/default.py", line 433, in do_executemany
cursor.executemany(statement, parameters)
sqlalchemy.exc.IntegrityError: (IntegrityError) UNIQUE constraint failed: gbCdnaInfo.id u'INSERT INTO "gbCdnaInfo" (id, acc, version, moddate, type, direction, source, organism, library, "mrnaClone", sex, tissue, development, cell, cds, keyword, description, "geneName", "productName", author, gi, mol) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)' ((1L, 'AB004856', 1, '2008-11-23', 'mRNA', '0', 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L, 1L, 3036932L, 'mRNA'), (2L, 'AB005263', 1, '2008-11-23', 'mRNA', '0', 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 1L, 0L, 2L, 2L, 1L, 3036936L, 'mRNA'), (3L, 'AB011407', 1, '2008-11-03', 'mRNA', '0', 1L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 3L, 1L, 0L, 0L, 3L, 1L, 4190955L, 'mRNA'), (4L, 'AB012144', 1, '1999-01-24', 'mRNA', '0', 2L, 2L, 0L, 1L, 0L, 0L, 0L, 0L, 4L, 2L, 0L, 3L, 0L, 2L, 3882098L, 'mRNA'), (5L, 'AB012145', 1, '1999-04-01', 'mRNA', '0', 3L, 3L, 0L, 2L, 0L, 0L, 0L, 0L, 5L, 3L, 0L, 0L, 0L, 3L, 4730806L, 'mRNA'), (6L, 'AB017109', 1, '2006-11-27', 'mRNA', '0', 4L, 4L, 0L, 0L, 0L, 0L, 0L, 0L, 6L, 1L, 0L, 6L, 4L, 4L, 4239966L, 'mRNA'), (7L, 'AB019621', 1, '1999-07-01', 'mRNA', '0', 5L, 5L, 0L, 0L, 0L, 0L, 0L, 0L, 7L, 4L, 0L, 0L, 5L, 5L, 4586513L, 'mRNA'), (8L, 'AB026157', 2, '1999-08-01', 'mRNA', '0', 6L, 6L, 0L, 3L, 0L, 0L, 0L, 0L, 8L, 5L, 0L, 0L, 6L, 6L, 5811598L, 'mRNA')  ... displaying 10 of 20001 total bound parameter sets ...  (20000L, 'AB229080', 1, '2007-05-15', 'mRNA', '0', 449L, 445L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 178L, 0L, 0L, 0L, 451L, 84576905L, 'mRNA'), (20001L, 'AB229081', 1, '2007-05-15', 'mRNA', '0', 449L, 445L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 178L, 0L, 0L, 0L, 451L, 84576906L, 'mRNA'))

This suggests that the mirror function is trying to pick up where it left off. Do you know what might be the problem here? The second time round, the error indicates a UNIQUE constraint failure, but the original reason for the failure was:

sqlalchemy.exc.OperationalError: (OperationalError) (2013, 'Lost connection to MySQL server during query') 'SELECT `gbCdnaInfo`.id, `gbCdnaInfo`.acc, `gbCdnaInfo`.version, `gbCdnaInfo`.moddate, `gbCdnaInfo`.type, `gbCdnaInfo`.direction, `gbCdnaInfo`.source, `gbCdnaInfo`.organism, `gbCdnaInfo`.library, `gbCdnaInfo`.`mrnaClone`, `gbCdnaInfo`.sex, `gbCdnaInfo`.tissue, `gbCdnaInfo`.development, `gbCdnaInfo`.cell, `gbCdnaInfo`.cds, `gbCdnaInfo`.keyword, `gbCdnaInfo`.description, `gbCdnaInfo`.`geneName`, `gbCdnaInfo`.`productName`, `gbCdnaInfo`.author, `gbCdnaInfo`.gi, `gbCdnaInfo`.mol \nFROM `gbCdnaInfo` \n LIMIT %s, %s' (48032000, 8000)
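A restart that re-sends rows already present in the destination will hit exactly this kind of UNIQUE constraint error. One way to make such a copy restartable, sketched here with the standard-library `sqlite3` module rather than cruzdb's `mirror`, is to let SQLite skip rows whose primary key already exists. The two-column table and `copy_rows` are simplifications for illustration; the ids and accessions are taken from the traceback above.

```python
import sqlite3


def copy_rows(dest, rows):
    """Insert rows, silently skipping any whose primary key already exists."""
    dest.executemany(
        "INSERT OR IGNORE INTO gbCdnaInfo (id, acc) VALUES (?, ?)", rows)
    dest.commit()


dest = sqlite3.connect(":memory:")
dest.execute("CREATE TABLE gbCdnaInfo (id INTEGER PRIMARY KEY, acc TEXT)")

# First run is interrupted after two rows.
copy_rows(dest, [(1, "AB004856"), (2, "AB005263")])
# Restart from the top: the first two rows are ignored, the third is added.
copy_rows(dest, [(1, "AB004856"), (2, "AB005263"), (3, "AB011407")])

count = dest.execute("SELECT COUNT(*) FROM gbCdnaInfo").fetchone()[0]
```

An alternative is to read `MAX(id)` from the destination and resume the source query from there, which avoids re-downloading the rows already copied.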

ImportError: cannot import name MapperExtension

Hello,

Since there are many problems with cruzdb and Python 3, I decided to create a conda environment for Python 2.7 and try there:

conda create --name py27 python=2.7
conda activate py27
pip install cruzdb
pip install SQLAlchemy 
pip install Flask-SQLAlchemy
#to avoid another error that appeared missing sqlalchemy
python

However when I try (in python):

from cruzdb import Genome
I get the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/anaconda3/envs/py27/lib/python2.7/site-packages/cruzdb/__init__.py", line 5, in <module>
    from . import soup
  File "/home/user/anaconda3/envs/py27/lib/python2.7/site-packages/cruzdb/soup.py", line 1, in <module>
    from . import sqlsoup
  File "/home/user/anaconda3/envs/py27/lib/python2.7/site-packages/cruzdb/sqlsoup.py", line 11, in <module>
    from sqlalchemy.orm.interfaces import MapperExtension, EXT_CONTINUE
ImportError: cannot import name MapperExtension

Any thoughts?

one exon missing

Hi,

I've just installed cruzdb in Python 2.7. I found that the first exon is missing from the gene A4GNT (NM_016161). Here is how I fetched the sequence:
g = Genome(db="hg19")
gene='A4GNT'
gene_obj = g.refGene.filter_by(name2=gene).first()
print(len(gene_obj.mrna_sequence))

got 2, but RefSeq shows 3 exons

print(len(gene_obj.mrna_sequence[0]))

1161

print(len(gene_obj.mrna_sequence[1]))

434

print(len(gene_obj.mrna_sequence[2]))

index out of range error

I cross-checked NM_016161 in NCBI; it seems the 1st exon (176 bp long) was not included.

Any help appreciated.

Eric.

script to create knownGene with name

The knownGene.name field is the UCSC kgID. It would be nice to have a script to create a local copy of knownGene with name2 == kgXref.geneSymbol so that we don't need a join to get the gene name.
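The join described here can be done once and stored, for example in a local SQLite mirror. The sketch below uses the standard-library `sqlite3` module with cut-down versions of both tables; the rows are made-up toy values, `knownGeneNamed` is a hypothetical table name, and only the `kgID` and `geneSymbol` columns of kgXref are modelled.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE knownGene (name TEXT, chrom TEXT, txStart INTEGER, txEnd INTEGER);
CREATE TABLE kgXref (kgID TEXT, geneSymbol TEXT);

-- Toy rows for illustration only.
INSERT INTO knownGene VALUES ('ucTOY1.1', 'chr1', 100, 500);
INSERT INTO knownGene VALUES ('ucTOY2.1', 'chr2', 700, 900);
INSERT INTO kgXref VALUES ('ucTOY1.1', 'GENE_A');

-- Materialise the join so later queries need no join at all.
CREATE TABLE knownGeneNamed AS
    SELECT kg.*, x.geneSymbol AS name2
    FROM knownGene AS kg
    LEFT JOIN kgXref AS x ON x.kgID = kg.name;
""")

rows = db.execute(
    "SELECT name, name2 FROM knownGeneNamed ORDER BY name").fetchall()
```

The `LEFT JOIN` keeps transcripts without a kgXref entry, leaving their `name2` as NULL.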

access to bioconductor's TxDb's

Bioconductor has a number of TxDb's (e.g. http://bioconductor.org/packages/2.11/data/annotation/html/TxDb.Hsapiens.UCSC.hg18.knownGene.html) that would be complementary and useful here. The data is stored in an sqlite database with schema

CREATE TABLE cds (
  _cds_id INTEGER PRIMARY KEY,
  cds_name TEXT NULL,
  cds_chrom TEXT NOT NULL,
  cds_strand TEXT NOT NULL,
  cds_start INTEGER NOT NULL,
  cds_end INTEGER NOT NULL,
  FOREIGN KEY (cds_chrom) REFERENCES chrominfo (chrom)
);
CREATE TABLE chrominfo (
  _chrom_id INTEGER PRIMARY KEY,
  chrom TEXT UNIQUE NOT NULL,
  length INTEGER NULL,
  is_circular INTEGER NULL
);
CREATE TABLE exon (
  _exon_id INTEGER PRIMARY KEY,
  exon_name TEXT NULL,
  exon_chrom TEXT NOT NULL,
  exon_strand TEXT NOT NULL,
  exon_start INTEGER NOT NULL,
  exon_end INTEGER NOT NULL,
  FOREIGN KEY (exon_chrom) REFERENCES chrominfo (chrom)
);
CREATE TABLE gene (
  gene_id TEXT NOT NULL,
  _tx_id INTEGER NOT NULL,
  UNIQUE (gene_id, _tx_id),
  FOREIGN KEY (_tx_id) REFERENCES transcript
);
CREATE TABLE metadata 
( name TEXT,
    "value" TEXT 
);
CREATE TABLE splicing (
  _tx_id INTEGER NOT NULL,
  exon_rank INTEGER NOT NULL,
  _exon_id INTEGER NOT NULL,
  _cds_id INTEGER NULL,
  UNIQUE (_tx_id, exon_rank),
  FOREIGN KEY (_tx_id) REFERENCES transcript,
  FOREIGN KEY (_exon_id) REFERENCES exon,
  FOREIGN KEY (_cds_id) REFERENCES cds
);
CREATE TABLE transcript (
  _tx_id INTEGER PRIMARY KEY,
  tx_name TEXT NULL,
  tx_chrom TEXT NOT NULL,
  tx_strand TEXT NOT NULL,
  tx_start INTEGER NOT NULL,
  tx_end INTEGER NOT NULL,
  FOREIGN KEY (tx_chrom) REFERENCES chrominfo (chrom)
);

maybe access as Genome(local="TxDb.Hsapiens.UCSC.hg18.knownGene/inst/extdata/TxDb.Hsapiens.UCSC.hg18.knownGene.sqlite")

invalid syntax?

Hello,
I just installed the cruzdb package and came across this error as soon as I tried to import the sequences (from cruzdb import sequence)

  File "/Users/Radwa/anaconda/lib/python3.6/site-packages/cruzdb/sqlsoup.py", line 458
    except KeyError, ke:
                   ^
SyntaxError: invalid syntax

Figure out what needs to be done for Python 3 compatibility

Given that SQLSoup and SQLAlchemy are both Python 3 compatible, the next step is to find the potential problem spots in cruzdb itself.

In particular:

Figure out how the bundled SQLSoup is different than the stock one
Track issues that 2to3 can't fix (strings vs bytes etc)
Move to a library like six to keep everything compatible with a single codebase (optional?)

Other MySQL dialects (aside MySQLdb) are not supported

cruzdb hardcodes mysql:// URLs in its initialization, so MySQL drivers other than MySQLdb aren't supported even though SQLAlchemy handles them fine.

Related to this is that the engine parameter for Genome is not used at all.

Removing this hardcoded limitation would ease a Python 3 port (given that both SQLSoup and SQLAlchemy support Python 3).
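SQLAlchemy selects the driver from the `dialect+driver://` prefix of the URL, so the scheme could be built from a parameter instead of being fixed. A sketch under assumptions: `ucsc_url` is a hypothetical helper, and the default `user` and `host` are what I understand the public UCSC MySQL server to use; verify them before relying on this.

```python
def ucsc_url(db, driver=None, user="genome",
             host="genome-mysql.soe.ucsc.edu"):
    """Build a SQLAlchemy URL; driver=None keeps the plain mysql:// default."""
    dialect = "mysql" if driver is None else "mysql+%s" % driver
    return "%s://%s@%s/%s" % (dialect, user, host, db)


default_url = ucsc_url("hg19")
pymysql_url = ucsc_url("hg19", driver="pymysql")
```

With `driver="pymysql"` the URL asks SQLAlchemy for the pure-Python PyMySQL driver, which runs on Python 3.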