tariqdaouda / pygeno

Personalized Genomics and Proteomics. Main diet: Ensembl, side dishes: SNPs

Home Page: http://pygeno.iric.ca

License: Apache License 2.0

Python 92.70% Makefile 3.60% Batchfile 3.70%

bioinformatics biology genomics proteomics genome genome-annotation genome-browser genome-sequencing genomes medical

pygeno's Introduction

CODE FREEZE:

PyGeno has long been limited by its backend. We are now ready to take it to the next level.

We are working on a major port of pyGeno to the open-source multi-modal database ArangoDB. PyGeno's code on both branches master and bloody is frozen until we are finished. No pull request will be merged until then, and we won't implement any new features.

pyGeno: A Python package for precision medicine and proteogenomics

pyGeno is (to our knowledge) the only tool available that will gladly build your specific genomes for you.

pyGeno is developed by Tariq Daouda at the Institute for Research in Immunology and Cancer (IRIC). Its logo is the work of the freelance designer Sawssan Kaddoura. For the latest news about pyGeno, you can follow me on Twitter @tariqdaouda.

Click here for the full documentation.

Citing pyGeno:

Please cite this paper.

Installation:

It is recommended to install pyGeno within a virtual environment. To set one up, you can use:

virtualenv ~/.pyGenoEnv
source ~/.pyGenoEnv/bin/activate

pyGeno can be installed through pip:

pip install pyGeno #for the latest stable version

Or from GitHub, for the latest developments:

git clone https://github.com/tariqdaouda/pyGeno.git
cd pyGeno
python setup.py develop

A brief introduction

pyGeno is a personal bioinformatic database that runs directly in Python, on your laptop, and does not depend on any REST API. pyGeno makes extracting data such as gene sequences a breeze, and is designed to cope with huge queries. The most exciting feature of pyGeno is that it lets you work seamlessly with both reference and Personalized Genomes.

Personalized Genomes are custom genomes that you create by combining a reference genome, sets of polymorphisms and an optional filter. pyGeno takes care of applying the filter and inserting the polymorphisms at the right place, so you get direct access to the DNA and protein sequences of your patients.

from pyGeno.Genome import *

g = Genome(name = "GRCh37.75")
prot = g.get(Protein, id = 'ENSP00000438917')[0]
#print the protein sequence
print prot.sequence
#print the protein's gene biotype
print prot.gene.biotype
#print protein's transcript sequence
print prot.transcript.sequence

#fancy queries
for exon in g.get(Exon, {"CDS_start >": x1, "CDS_end <=" : x2, "chromosome.number" : "22"}) :
        #print the exon's coding sequence
        print exon.CDS
        #print the exon's transcript sequence
        print exon.transcript.sequence

#You can do the same for your subject specific genomes
#by combining a reference genome with polymorphisms
g = Genome(name = "GRCh37.75", SNPs = ["STY21_RNA"], SNPFilter = MyFilter())

And if you ever get lost, there's an online help() function for each object type:

from pyGeno.Genome import *

print Exon.help()

Should output:

Available fields for Exon: CDS_start, end, chromosome, CDS_length, frame, number, CDS_end, start, genome, length, protein, gene, transcript, id, strand

Creating a Personalized Genome:

Personalized Genomes are a powerful feature that allows you to work on the specific genomes and proteomes of your patients. You can even mix several SNP sets together.

from pyGeno.Genome import Genome
#the name of the snp set is defined inside the datawrap's manifest.ini file
dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY')
#you can also define a filter (ex: a quality filter) for the SNPs
dummy = Genome(name = 'GRCh37.75', SNPs = 'dummySRY', SNPFilter = myFilter())
#and even mix several snp sets
dummy = Genome(name = 'GRCh37.75', SNPs = ['dummySRY', 'anotherSet'], SNPFilter = myFilter())

Filtering SNPs:

pyGeno allows you to select the polymorphisms that end up in the final sequences. It supports SNPs, insertions and deletions.

from pyGeno.SNPFiltering import SNPFilter, SequenceSNP

class QMax_gt_filter(SNPFilter) :

        def __init__(self, threshold) :
                self.threshold = threshold

        #Here SNPs is a dictionary: SNPSet Name => polymorphism
        #This filter ignores deletions and insertions
        #but applies all SNPs
        def filter(self, chromosome, **SNPs) :
                sources = {}
                alleles = []
                for snpSet, snp in SNPs.iteritems() :
                        pos = snp.start
                        if snp.alt[0] == '-' :
                                pass
                        elif snp.ref[0] == '-' :
                                pass
                        else :
                                sources[snpSet] = snp
                                alleles.append(snp.alt) #if not an indel append the polymorphism

                #appends the reference allele to the lot
                refAllele = chromosome.refSequence[pos]
                alleles.append(refAllele)
                sources['ref'] = refAllele

                #optional: we keep a record of the polymorphisms that were used during the process
                return SequenceSNP(alleles, sources = sources)

The filter function can also be made more specific by using arguments that have the same names as the SNPSets:

def filter(self, chromosome, dummySRY = None) :
        if dummySRY.Qmax_gt > self.threshold :
                #other possibilities of return are SequenceInsert(<bases>), SequenceDelete(<length>)
                return SequenceSNP(dummySRY.alt)
        return None #None means keep the reference allele

To apply the filter, simply specify it when loading the genome.

persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = 'dummySRY', SNPFilter = QMax_gt_filter(10))

To include several SNPSets use a list.

persGenome = Genome(name = 'GRCh37.75_Y-Only', SNPs = ['ARN_P1', 'ARN_P2'], SNPFilter = myFilter())

Getting an arbitrary sequence:

You can ask for any sequence of any chromosome:

chr12 = myGenome.get(Chromosome, number = "12")[0]
print chr12.sequence[x1:x2]
# for the reference sequence
print chr12.refSequence[x1:x2]

Batteries included (bootstrapping):

pyGeno's database is populated by importing datawraps. pyGeno comes with a few datawraps; to list them, you can use:

import pyGeno.bootstrap as B
B.printDatawraps()

Available datawraps for boostraping

SNPs
~~~~|
    |~~~:> Human_agnostic.dummySRY.tar.gz
    |~~~:> Human.dummySRY_casava.tar.gz
    |~~~:> dbSNP142_human_common_all.tar.gz


Genomes
~~~~~~~|
       |~~~:> Human.GRCh37.75.tar.gz
       |~~~:> Human.GRCh37.75_Y-Only.tar.gz
       |~~~:> Human.GRCh38.78.tar.gz
       |~~~:> Mouse.GRCm38.78.tar.gz

To get a list of remote datawraps that pyGeno can download for you, do:

B.printRemoteDatawraps()

Importing whole genomes is a demanding process that takes more than an hour and requires (according to tests) at least 3GB of memory. Depending on your configuration, more might be required.

That being said, importing a datawrap is a one-time operation, and once the importation is complete the datawrap can be discarded without consequences.

The bootstrap module also has some handy functions for importing built-in packages.

Some of them are just for playing around with pyGeno (fast importation and small memory requirements):

import pyGeno.bootstrap as B

#Imports only the Y chromosome from the human reference genome GRCh37.75
#Very fast, requires even less memory. No download required.
B.importGenome("Human.GRCh37.75_Y-Only.tar.gz")

#A dummy datawrap for human SNPs and indels in pyGeno's AgnosticSNP format.
#This one has one SNP at the beginning of the gene SRY
B.importSNPs("Human.dummySRY_casava.tar.gz")

And for more serious work, the whole reference genome:

#Downloads the whole genome (205MB, sequences + annotations), may take an hour or more.
B.importGenome("Human.GRCh38.78.tar.gz")

Importing a custom datawrap:

from pyGeno.importation.Genomes import *
importGenome('GRCh37.75.tar.gz')

To import a patient's specific polymorphisms

from pyGeno.importation.SNPs import *
importSNPs('patient1.tar.gz')

For a list of datawraps available for download, please have a look here.

You can easily make your own datawraps with any tar.gz compressor. For more details on how datawraps are made, check the wiki or have a look inside the bootstrap_data folder.

Instantiating a genome:

from pyGeno.Genome import Genome
#the name of the genome is defined inside the package's manifest.ini file
ref = Genome(name = 'GRCh37.75')

Printing all the proteins of a gene:

from pyGeno.Genome import Genome
from pyGeno.Gene import Gene
from pyGeno.Protein import Protein

Or simply:

from pyGeno.Genome import *

then:

ref = Genome(name = 'GRCh37.75')
#get returns a list of elements
gene = ref.get(Gene, name = 'TPST2')[0]
for prot in gene.get(Protein) :
      print prot.sequence

Making queries, get() vs iterGet():

iterGet is a faster version of get that returns an iterator instead of a list.

Making queries, syntax:

pyGeno's get function uses the expressivity of rabaDB.

These are all possible query formats:

ref.get(Gene, name = "SRY")
ref.get(Gene, { "name like" : "HLA"})
chr12.get(Exon, { "start >=" : 12000, "end <" : 12300 })
ref.get(Transcript, { "gene.name" : 'SRY' })

Creating indexes to speed up queries:

from pyGeno.Gene import Gene
#creating an index on gene names if it does not already exist
Gene.ensureGlobalIndex('name')
#removing the index
Gene.dropIndex('name')

Find in sequences:

Internally, pyGeno uses a binary representation for nucleotides and amino acids to deal with polymorphisms. For example, both "AGC" and "ATC" will match the following sequence: "...AT/GCCG...".

#returns the position of the first occurrence
transcript.find("AT/GCCG")
#returns the positions of all occurrences
transcript.findAll("AT/GCCG")

#similarly, you can also do
transcript.findIncDNA("AT/GCCG")
transcript.findAllIncDNA("AT/GCCG")
transcript.findInUTR3("AT/GCCG")
transcript.findAllInUTR3("AT/GCCG")
transcript.findInUTR5("AT/GCCG")
transcript.findAllInUTR5("AT/GCCG")

#same for proteins
protein.find("DEV/RDEM")
protein.findAll("DEV/RDEM")

#and for exons
exon.find("AT/GCCG")
exon.findAll("AT/GCCG")
exon.findInCDS("AT/GCCG")
exon.findAllInCDS("AT/GCCG")
#...
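
The matching behaviour described above can be illustrated with a small stand-alone sketch. It does not use pyGeno and is not its implementation: the bit values in BITS and the helpers encode() and find_all() are assumptions made for this example. Each position becomes a bitmask, a polymorphic position such as T/G sets two bits, and two positions match when their masks share at least one bit.

```python
# Hypothetical encoding, chosen for this sketch only: one bit per nucleotide.
BITS = {"A": 1, "C": 2, "G": 4, "T": 8}


def encode(seq):
    """Turn a string such as 'AT/GCCG' into one bitmask per position.

    A '/' joins alternative alleles at the same position, so 'T/G'
    becomes a single mask with both the T and the G bits set.
    """
    masks = []
    i = 0
    while i < len(seq):
        mask = BITS[seq[i]]
        i += 1
        # Absorb every '/X' alternative that follows the current base.
        while i + 1 < len(seq) and seq[i] == "/":
            mask |= BITS[seq[i + 1]]
            i += 2
        masks.append(mask)
    return masks


def find_all(sequence, query):
    """Return every position (counted in bases) where query matches sequence."""
    hay, needle = encode(sequence), encode(query)
    hits = []
    for start in range(len(hay) - len(needle) + 1):
        # Two positions are compatible when their masks share a bit.
        if all(h & n for h, n in zip(hay[start:], needle)):
            hits.append(start)
    return hits


print(find_all("AT/GCCG", "AGC"))  # [0]
print(find_all("AT/GCCG", "ACC"))  # []
```

Positions are counted per base, not per character, so the "/" separators do not shift the reported offsets.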

Progress Bar:

from pyGeno.tools.ProgressBar import ProgressBar
pg = ProgressBar(nbEpochs = 155)
for i in range(155) :
      pg.update(label = '%d' %i) # or simply pg.update()
pg.close()

pygeno's Issues

Quick example on the home page does not work

The quick example on the homepage contains several typos and omissions (six in total!) that make the code unrunnable.

In general, and especially for the quick example, code should be run through Python to ensure that it works before it is pasted into the docs. It should also be self-sufficient and not need other information to run.

Since we are here, I will note that this page does not do a good job of demonstrating what the library actually does. The interesting part for me as a Python programmer is not that the library can extract the sequences for Ensembl proteins; that is a job I can already do with many tools.

What interests me, getting to the next level, is combining and querying the Ensembl genes and the SNPs at the same time.

But the quick start stops at the most interesting line:

g = Genome(name = "GRCh37.75", SNPs = ["STY21_RNA"], SNPFilter = MyFilter()

OK, there is promise here, but what can I do once I have this construct? What is MyFilter() and what does it do?

Installation not working on Mac OS Yosemite 10.10.5

Hi Tariq,

Following the recommended installation, I tried to run this simple script:

#! /usr/local/bin/python
from pyGeno.Genome import *
g = Genome(name = "GRCh37.75")
sys.exit()

But I get this error:

Traceback (most recent call last):
  File "./script.py", line 8, in <module>
    g = Genome(name = "GRCh37.75")
  File "/usr/local/pyGeno/pyGeno/Genome.py", line 67, in __init__
    pyGenoRabaObjectWrapper.__init__(self, *args, **kwargs)
  File "/usr/local/pyGeno/pyGeno/pyGenoObjectBases.py", line 83, in __init__
    self.wrapped_object = self._wrapped_class(*args, **kwargs)
  File "/usr/local/lib/python2.7/site-packages/rabaDB/Raba.py", line 301, in __call__
    raise KeyError("Couldn't find any object that fit the arguments you've prodided to the constructor")
KeyError: "Couldn't find any object that fit the arguments you've prodided to the constructor"

Any ideas?

AgnosticSNP quality is a string and should be float

This is an issue for comparison because in python 2.7:

'0.001' > 20
True

For an SNP filter, this will silently fail:

    def filter(self, chromosome, **kwargs):
        for snp_set, snp in kwargs.iteritems():
            if snp.quality > self.threshold:
                return SequenceSNP(snp.alt)

snp.quality must be cast to float to obtain the expected result.
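
A minimal sketch of the cast, written as a stand-alone helper so that it runs without pyGeno. The name passes_quality() and the choice to reject unparsable values are assumptions made for this example, not part of pyGeno. Note that Python 3 raises a TypeError for a str-to-int comparison instead of silently returning True.

```python
def passes_quality(raw_quality, threshold):
    """Return True when the quality, parsed as a number, exceeds threshold.

    raw_quality may be a string (as stored for AgnosticSNP) or a number.
    """
    try:
        quality = float(raw_quality)
    except (TypeError, ValueError):
        # Assumption for this sketch: a quality that cannot be parsed is rejected.
        return False
    return quality > threshold


print(passes_quality("0.001", 20))  # False
print(passes_quality("255", 20))    # True
```

Inside a filter, the comparison would then read along the lines of `if passes_quality(snp.quality, self.threshold):`.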

Asking for SNPs through get() does not work

Asking for SNPs through get() does not work. Please use the raba interface for retrieving SNPs:

from rabaDB.filters import *

f = RabaQuery('dbSNPSNP')
f.addFilter({"chromosomeNumber =" : 22, "start >":  x1, "end <": x2})
snps = f.run()

This will be fixed in the next issue.

Python 3 support

Make pyGeno compatible with python 3.

Unknown SNP type in manifest dbSNP, for dbSNP149

Below the error message:

In [1]: import pyGeno.bootstrap as B
In [2]: B.importSNPs("GRCh38p7_dbSNP149_common_all.tar.gz")
Importing polymorphism set: /u/eaudemard/dev/pyGeno/pyGeno/bootstrap_data/SNPs/GRCh38p7_dbSNP149_common_all.tar.gz... (This may take a while)
Downloading file: ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b149_GRCh38p7/VCF/common_all_20161122.vcf.gz...
done.
---------------------------------------------------------------------------
FutureWarning                             Traceback (most recent call last)
<ipython-input-2-e9741c79a81d> in <module>()
----> 1 B.importSNPs("GRCh38p7_dbSNP149_common_all.tar.gz")

/u/eaudemard/dev/pyGeno/pyGeno/bootstrap.pyc in importSNPs(name)
    108         """Import a SNP set shipped with pyGeno. Most of the datawraps only contain URLs towards data provided by third parties."""
    109         path = os.path.join(this_dir, "bootstrap_data", "SNPs/" + name)
--> 110         PS.importSNPs(path)

/u/eaudemard/dev/pyGeno/pyGeno/importation/SNPs.pyc in importSNPs(packageFile)
     69                         return _importSNPs_AgnosticSNP(setName, species, genomeSource, snpsFile)
     70                 else :
---> 71                         raise FutureWarning('Unknown SNP type in manifest %s' % typ)
     72         else :
     73                 raise KeyError("There's already a SNP set by the name %s. Use deleteSNPs() to remove it first" %setName)

FutureWarning: Unknown SNP type in manifest dbSNP

here the manifest.ini:

[package_infos]
description = SNP set for dbSNP149 that contains only common SNP. For more details: http://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/
maintainer = Eric Audemard
maintainer_contact = [email protected]
version = 1

[set_infos]
species = human
name = GRCh38p7_dbSNP149_common_all
type = dbSNP
source = ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b149_GRCh38p7/VCF/common_all_20161122.vcf.gz

[snps]
filename = ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b149_GRCh38p7/VCF/common_all_20161122.vcf.gz

pyGeno does not

The following issue has arisen

Using the pip install pyGeno command, pyGeno was installed as seen below.

"C:\Windows\system32>pip install pyGeno
Requirement already satisfied (use --upgrade to upgrade): pyGeno in c:\python27\lib\site-packages
Requirement already satisfied (use --upgrade to upgrade): rabaDB>=1.0.2 in c:\python27\lib\site-packages (from pyGeno)
You are using pip version 8.1.1, however version 8.1.2 is available.
You should consider upgrading via the 'python -m pip install --upgrade pip' command."

However, the genome import did not work with the suggested command "import pyGeno.bootstrap as B".

It gave the following error:

"Traceback (most recent call last):
  File "", line 1, in
  File "C:\Python27\lib\site-packages\pyGeno\__init__.py", line 4, in
    pyGeno_init()
  File "C:\Python27\lib\site-packages\pyGeno\configuration.py", line 103, in pyGeno_init
    db = rabaDB.rabaSetup.RabaConnection(pyGeno_RABA_NAMESPACE)
  File "C:\Python27\lib\site-packages\rabaDB\rabaSetup.py", line 21, in __call__
    cls._instances[key] = type.__call__(cls, *args, **kwargs)
  File "C:\Python27\lib\site-packages\rabaDB\rabaSetup.py", line 48, in __init__
    self.connection = sq.connect(RabaConfiguration(namespace).dbFile)
sqlite3.OperationalError: unable to open database file"

When we checked C:\Python27\Lib\site-packages, both pyGeno and rabaDB were present in that folder. We then uninstalled and reinstalled pyGeno and Python 2.7.12, but the problem persists. Unfortunately, we could not figure out a way to resolve it.

An incompatibility between pyGeno and non-English Windows 10 is the only thing we could think of, as there are non-English characters. Any idea what to do?
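
One generic way to get "unable to open database file" from sqlite3 is to point it at a file inside a directory that does not exist (or cannot be written to). Whether that is what happened in this report is not known. The sketch below only reproduces that general failure mode and shows a defensive open; open_db() is a hypothetical helper, not a pyGeno or rabaDB function.

```python
import os
import sqlite3
import tempfile


def open_db(db_file):
    """Create the parent directory if needed, then open the SQLite file."""
    parent = os.path.dirname(os.path.abspath(db_file))
    os.makedirs(parent, exist_ok=True)
    return sqlite3.connect(db_file)


with tempfile.TemporaryDirectory() as tmp:
    db_file = os.path.join(tmp, "not_yet_created", "demo.db")
    try:
        # The parent directory is missing, so SQLite cannot create the file.
        sqlite3.connect(db_file)
    except sqlite3.OperationalError as err:
        print("direct connect failed:", err)
    conn = open_db(db_file)
    conn.execute("CREATE TABLE demo (x)")
    conn.close()
```

Checking that the database directory exists and is writable by the current user is a cheap first diagnostic for this error message.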

Error: Genome object instantiation

Hi,
When I tried to create a Genome object, I received the following error:

from pyGeno.Genome import *
g = Genome(name = "GRCh37.75")

KeyError: "Couldn't find any object that fit the arguments you've prodided to the constructor".

I installed pyGeno with pip.

Thanks.

Write tests

We need more tests

Connecting data and logic

Connecting the data from ArangoDB to the logic in pyGeno.

printDatawraps fails when using setup.py install

The directory containing pyGeno's datawraps (bootstrap_data) doesn't get copied automatically when installing with 'python setup.py install' (as opposed to 'develop') or when installing with pip.

Importation of SNPs

Untouched, should remove Casava and TopHat

Unsupported translation of selenocysteine

Proteins with a selenocysteine are translated with a stop codon instead of U. The GTF file stores the codon position of each selenocysteine.

2 ensembl_havana Selenocysteine 84670381 84670383 . - . gene_id "ENSMUSG00000076437"; gene_version "10"; transcript_id "ENSMUST00000117299"; transcript_version "8"; gene_name "Selenoh"; gene_source "ensembl_havana"; gene_biotype "protein_coding"; havana_gene "OTTMUSG00000013495"; havana_gene_version "5"; transcript_name "Selenoh-001"; transcript_source "ensembl_havana"; transcript_biotype "protein_coding"; tag "CCDS"; ccds_id "CCDS16187"; havana_transcript "OTTMUST00000032615"; havana_transcript_version "2"; tag "seleno"; tag "basic"; transcript_support_level "1";

Any suggestions on how to support this feature?
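
One possible approach, sketched here independently of pyGeno's translation code: translate with the standard codon table, but emit U for a TGA codon whose index in the coding sequence has been flagged as a selenocysteine. The function translate() and its sec_codons parameter are hypothetical names for this example; mapping the genomic coordinates of the GTF Selenocysteine feature onto codon indices is assumed to have been done beforehand.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds, sec_codons=()):
    """Translate a coding sequence, writing U at flagged TGA codons.

    sec_codons holds 0-based codon indices (not base positions) that an
    annotation marks as selenocysteine.
    """
    sec_codons = set(sec_codons)
    protein = []
    for index in range(len(cds) // 3):
        codon = cds[3 * index:3 * index + 3]
        if index in sec_codons and codon == "TGA":
            protein.append("U")
        else:
            protein.append(CODON_TABLE[codon])
    return "".join(protein)


cds = "ATGTGAGGCTAA"
print(translate(cds))                  # M*G*
print(translate(cds, sec_codons=[1]))  # MUG*
```

Requiring the flagged codon to be TGA is a safety check: if the coordinates are off by one, the sequence falls back to the ordinary translation instead of inserting a spurious U.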

Genome importation

Make sure everything is imported the right way

0 based vs 1 based (ensembl)
Selenocysteines
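
For the first point, GTF coordinates are 1-based with an inclusive end, while Python slices are 0-based with an exclusive end. A minimal conversion sketch follows; the helper names gtf_to_slice() and slice_to_gtf() are made up for this example and are not pyGeno functions.

```python
def gtf_to_slice(start, end):
    """Convert a 1-based inclusive GTF interval to 0-based half-open bounds."""
    return start - 1, end


def slice_to_gtf(start, end):
    """Convert 0-based half-open bounds back to a 1-based inclusive interval."""
    return start + 1, end


sequence = "ACGTACGT"
# GTF bases 2..4 are the second, third and fourth letters.
lo, hi = gtf_to_slice(2, 4)
print(sequence[lo:hi])  # CGT

# The Selenocysteine feature quoted in the issue above spans exactly one codon.
lo, hi = gtf_to_slice(84670381, 84670383)
print(hi - lo)  # 3
```

Only the start changes between the two conventions; the end keeps the same numeric value because inclusive 1-based and exclusive 0-based ends coincide.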

no module named configuration

when calling
from pyGeno.Genome import *

ImportError: No module named 'configuration'

pip installation does not work

pip install pyGeno

then

>>> import pyGeno.bootstrap as B
>>> B.printDatawraps()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/ialbert/.virtualenvs/work/lib/python2.7/site-packages/pyGeno/bootstrap.py", line 24, in printDatawraps
    l = listDatawraps()
  File "/Users/ialbert/.virtualenvs/work/lib/python2.7/site-packages/pyGeno/bootstrap.py", line 12, in listDatawraps
    for f in os.listdir(os.path.join(this_dir, "bootstrap_data/genomes")) :
OSError: [Errno 2] No such file or directory: '/Users/ialbert/.virtualenvs/work/lib/python2.7/site-packages/pyGeno/bootstrap_data/genomes'

Problem with insertions

I'm trying to get a translation for an insertion. Using the default SNPFiltering, I do not see the insertion in the sequence. I think it is not added to the sequence (although it is present in the db) and/or is treated as a SequenceSNP. Everything works great for SNPs.

I tried creating my own filter to support the insertion as explained in #33, but I get this error :

File "/u/boucherg/.virtualenvs/pyGeno_git/pyGeno/pyGeno/SNP.py", line 64, in __getattribute__
    return Raba.__getattribute__(self, k)
File "build/bdist.linux-x86_64/egg/rabaDB/Raba.py", line 648, in __getattribute__
TypeError: attribute name must be string, not 'int'

I'm not sure what I'm doing wrong. Here is the code and the SNP set entry.

chromosomeNumber uniqueId start end ref alleles quality caller
5 1 170837542 170837543 - TCTT 0 custom

from pyGeno.Genome import *
from pyGeno.importation.SNPs import *
from pyGeno.SNPFiltering import SNPFilter

class MyFilter(SNPFilter) :
        def __init__(self) :
                SNPFilter.__init__(self)
        def filter(self, chromosome, snp_custom) :
                from pyGeno.SNPFiltering import SequenceInsert, SequenceSNP, SequenceDel
                for s in snp_custom:
                        if s.alleles != '-' and s.ref != '-':
                                return SequenceSNP(s.alleles)
                        elif s.alleles == '-':
                                return SequenceDel(len(s.ref))
                        elif s.ref == '-':
                                return SequenceInsert(s.alleles)

if 'snp_custom' in getSNPSetsList() :
        deleteSNPs('snp_custom')

importSNPs("snps_tmp")
genome = Genome(name = 'GRCh37.75', SNPs='snp_custom', SNPFilter = MyFilter())
gene = genome.get(Gene, name='NPM1')[0]
tr = gene.get(Transcript, name='NPM1-001')[0]
tr.sequence

[package_infos]
description = SNPs for testing purposes
maintainer = The Maintainer
maintainer_contact = maintainer [at] email.ca
version = 1

[set_infos]
species = human
name = snp_custom
type = agnosticsnp
source = Where do these snps come from?

[snps]
filename = snps.txt

GenomicLink (edges) does not work

Linking between different objects is not happening.

Remote datawraps

The following example doesn't work.

B.printRemoteDatawraps()
Traceback (most recent call last):
  File "", line 1, in
  File "/usr/src/app/pyGeno/bootstrap.py", line 45, in printRemoteDatawraps
    l = listRemoteDatawraps(location)
  File "/usr/src/app/pyGeno/bootstrap.py", line 15, in listRemoteDatawraps
    js = json.loads(response.read())
  File "/usr/local/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/local/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 6 column 3 (char 113)
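
The traceback shows that the text fetched from the remote location is not valid JSON. The actual content of the remote file is not shown in the report, so the cause is unknown; a trailing comma before a closing brace is one common way to trigger this kind of error. The sketch below is a stand-alone illustration: parse_datawrap_index() is a hypothetical wrapper, not pyGeno's listRemoteDatawraps, and the sample strings are invented.

```python
import json


def parse_datawrap_index(text):
    """Parse the remote index, adding context to the error if it is malformed."""
    try:
        return json.loads(text)
    except ValueError as err:
        # json.JSONDecodeError is a subclass of ValueError.
        raise ValueError("remote datawrap index is not valid JSON: %s" % err)


good = '{"Genomes": ["Human.GRCh37.75.tar.gz"]}'
bad = '{"Genomes": ["Human.GRCh37.75.tar.gz"],}'  # note the trailing comma

print(parse_datawrap_index(good)["Genomes"])
try:
    parse_datawrap_index(bad)
except ValueError as err:
    print(err)
```

The line and column reported by the decoder (line 6, column 3 in the traceback above) point at the place in the remote file where parsing stopped.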

Multiprocessing problem with sqlite

Problems arise when importing pyGeno in main thread, and accessing the DB from spawned processes.

Example error: pygeno: DatabaseError: file is encrypted or is not a database
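
A general precaution with SQLite is to avoid carrying an open connection from a parent process into its children, and to open a separate connection inside each worker instead. The sketch below shows that pattern with the standard library only. It does not use pyGeno or rabaDB, and whether the pattern can be applied to pyGeno depends on when its connection is opened; the table and function names are invented for the example.

```python
import os
import sqlite3
import tempfile
from multiprocessing import Pool


def make_demo_db():
    """Create a small throwaway database and return its path."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = sqlite3.connect(path)
    with conn:  # commits on success
        conn.execute("CREATE TABLE gene (name TEXT)")
        conn.executemany("INSERT INTO gene VALUES (?)", [("SRY",), ("TPST2",)])
    conn.close()
    return path


def count_genes(path):
    """Worker: open a private connection, query, and close it again."""
    conn = sqlite3.connect(path)
    try:
        return conn.execute("SELECT COUNT(*) FROM gene").fetchone()[0]
    finally:
        conn.close()


def count_in_workers(path, n_workers=2):
    """Run the query in several processes; no connection is shared."""
    with Pool(n_workers) as pool:
        return pool.map(count_genes, [path] * n_workers)
```

From a script, count_in_workers(make_demo_db()) should be called under an `if __name__ == "__main__":` guard so that worker processes can import the module safely.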

urllib error during B.importGenome in pip version

The pip version of the package generates an FTP error,
"IOError: [Errno ftp error] 200 Switching to Binary mode.", during bootstrap import of at least Human.GRCh37.75.tar.gz.

The traceback indicates line 46 of importation/Genomes.py may be causing the issue.
_getFile function:
line 46: urllib.urlretrieve (fil, finalFile)

The GitHub bloody branch replaces line 46 of importation/Genomes.py with an iterator and seems to resolve this issue.

You might want to update the pip version of pyGeno. Thanks for making this package available under Apache 2.0!

TRACEBACK:

IOError Traceback (most recent call last)
in ()
----> 1 get_ipython().magic(u'time B.importGenome("Human.GRCh37.75.tar.gz")')

/opt/conda/envs/python2/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in magic(self, arg_s)
2156 magic_name, _, magic_arg_s = arg_s.partition(' ')
2157 magic_name = magic_name.lstrip(prefilter.ESC_MAGIC)
-> 2158 return self.run_line_magic(magic_name, magic_arg_s)
2159
2160 #-------------------------------------------------------------------------

/opt/conda/envs/python2/lib/python2.7/site-packages/IPython/core/interactiveshell.pyc in run_line_magic(self, magic_name, line)
2077 kwargs['local_ns'] = sys._getframe(stack_depth).f_locals
2078 with self.builtin_trap:
-> 2079 result = fn(*args,**kwargs)
2080 return result
2081

in time(self, line, cell, local_ns)

/opt/conda/envs/python2/lib/python2.7/site-packages/IPython/core/magic.pyc in (f, *a, **k)
186 # but it's overkill for just that one bit of state.
187 def magic_deco(arg):
--> 188 call = lambda f, *a, **k: f(*a, **k)
189
190 if callable(arg):

/opt/conda/envs/python2/lib/python2.7/site-packages/IPython/core/magics/execution.pyc in time(self, line, cell, local_ns)
1179 if mode=='eval':
1180 st = clock2()
-> 1181 out = eval(code, glob, local_ns)
1182 end = clock2()
1183 else:

in ()

/opt/conda/envs/python2/lib/python2.7/site-packages/pyGeno/bootstrap.pyc in importGenome(name, batchSize)
100 """Import a genome shipped with pyGeno. Most of the datawraps only contain URLs towards data provided by third parties."""
101 path = os.path.join(this_dir, "bootstrap_data", "genomes/" + name)
--> 102 PG.importGenome(path, batchSize)
103
104 def importSNPs(name) :

/opt/conda/envs/python2/lib/python2.7/site-packages/pyGeno/importation/Genomes.pyc in importGenome(packageFile, batchSize, verbose)
149 raise KeyError("The directory %s already exists, Please call deleteGenome() first if you want to reinstall" % seqTargetDir)
150
--> 151 gtfFile = _getFile(parser.get('gene_set', 'gtf'), packageDir)
152
153 chromosomesFiles = {}

/opt/conda/envs/python2/lib/python2.7/site-packages/pyGeno/importation/Genomes.pyc in _getFile(fil, directory)
44 printf("Downloading file: %s..." % fil)
45 finalFile = os.path.normpath('%s/%s' %(directory, fil.split('/')[-1]))
---> 46 urllib.urlretrieve (fil, finalFile)
47 printf('done.')
48 else :

/opt/conda/envs/python2/lib/python2.7/urllib.pyc in urlretrieve(url, filename, reporthook, data, context)
96 else:
97 opener = _urlopener
---> 98 return opener.retrieve(url, filename, reporthook, data)
99 def urlcleanup():
100 if _urlopener:

/opt/conda/envs/python2/lib/python2.7/urllib.pyc in retrieve(self, url, filename, reporthook, data)
243 except IOError:
244 pass
--> 245 fp = self.open(url, data)
246 try:
247 headers = fp.info()

/opt/conda/envs/python2/lib/python2.7/urllib.pyc in open(self, fullurl, data)
211 try:
212 if data is None:
--> 213 return getattr(self, name)(url)
214 else:
215 return getattr(self, name)(url, data)

/opt/conda/envs/python2/lib/python2.7/urllib.pyc in open_ftp(self, url)
556 value in ('a', 'A', 'i', 'I', 'd', 'D'):
557 type = value.upper()
--> 558 (fp, retrlen) = self.ftpcache[key].retrfile(file, type)
559 mtype = mimetypes.guess_type("ftp:" + url)[0]
560 headers = ""

/opt/conda/envs/python2/lib/python2.7/urllib.pyc in retrfile(self, file, type)
904 try:
905 cmd = 'RETR ' + file
--> 906 conn, retrlen = self.ftp.ntransfercmd(cmd)
907 except ftplib.error_perm, reason:
908 if str(reason)[:3] != '550':

/opt/conda/envs/python2/lib/python2.7/ftplib.pyc in ntransfercmd(self, cmd, rest)
332 size = None
333 if self.passiveserver:
--> 334 host, port = self.makepasv()
335 conn = socket.create_connection((host, port), self.timeout)
336 try:

/opt/conda/envs/python2/lib/python2.7/ftplib.pyc in makepasv(self)
310 def makepasv(self):
311 if self.af == socket.AF_INET:
--> 312 host, port = parse227(self.sendcmd('PASV'))
313 else:
314 host, port = parse229(self.sendcmd('EPSV'), self.sock.getpeername())

/opt/conda/envs/python2/lib/python2.7/ftplib.pyc in parse227(resp)
828
829 if resp[:3] != '227':
--> 830 raise error_reply, resp
831 global _227_re
832 if _227_re is None:

IOError: [Errno ftp error] 200 Switching to Binary mode.

sqlite3:OperationalError:No such table:

Hi,

I was trying to import my genome file, following the instructions in the docs.

The start of the importation is OK: it writes "Importation begins!" and progress reaches 100%, then an error occurs at the following step:
"almost done saving chromosomes...
\ progress[ --~-?:> ] ?% (1/?) runtime: ..."
The last line of the message is:
"sqlite3.OperationalError: no such table: main.RabaList_exons_for_Transcript_Raba"

How can I solve this problem?

Thanks for your help.

Integration of the new query method

Queries with "ORs" have to be integrated

checkPythonVersion is not ready for python 3

checkPythonVersion will fail for some versions of Python 3.

It is not yet documented which versions of Python 3 are supported by pyGeno 2.0.0.

mutant (SNV/indel) protein sequence generation

Hi Tariq,

Thank you for helping me with loading the genome yesterday.

I further tested pyGeno on generating a mutant protein sequence.

Actually, I uploaded two ERBB2 variants and two EGFR indels with known AA change annotation, (ENST00000445658:c.T1888G:p.W630G and ENST00000445658:exon16:c.A1879G:p.S627G)

(ENST00000455089:c.2100_2114del:p.700_705del and ENST00000455089:c.2161_2162insTGGCCAGCG:p.M721delinsMASV)

However, I find that the mutant protein sequence generated by pyGeno is exactly the same as the corresponding reference protein sequence. It seems the filter doesn't work. Would you point out what's wrong with my commands in pyGeno? (The test_var file is enclosed as an attachment.) Also, is there any way to efficiently implement both indels and SNPs in one filter?

Thank you.

Best

Hao

from pyGeno.importation.Genomes import *
from pyGeno.importation.SNPs import *
from pyGeno.Genome import *
from pyGeno.Transcript import Transcript
from pyGeno.SNPFiltering import SNPFilter
from pyGeno.SNPFiltering import SequenceSNP
from pyGeno.SNPFiltering import SequenceInsert
from pyGeno.SNPFiltering import SequenceDel
importSNPs('/test_snp_path/test_var')
class QMax_gt_filter(SNPFilter) :
        def __init__(self, threshold) :
                self.threshold = threshold
        def filter(self, chromosome, test_var = None) :
                if test_var.quality > self.threshold :
                        #other possibilities of return are SequenceInsert(), SequenceDel()
                        if test_var.alt[0] == '-':
                                return SequenceDel(len(test_var.ref))
                        if test_var.ref[0] == '-':
                                return SequenceInsert(test_var.alt)
                        elif test_var.alt[0] != '-' and test_var.ref[0] != '-':
                                return SequenceSNP(test_var.alt)
                return None

mut_G = Genome(name = 'GRCh37.75', SNPs = 'test_var', SNPFilter = QMax_gt_filter(8))
mut_trans = mut_G.get(Transcript, id ='ENST00000445658')
mut_prot = mut_trans[0].protein
mut_prot.sequence

ref_G = Genome(name = 'GRCh37.75')
ref_trans = ref_G.get(Transcript, id ='ENST00000445658')
ref_prot = ref_trans[0].protein
ref_prot.sequence

mut_prot sequence output:
'MELAALCRWGLLLALLPPGAASTQDNYLSTDVGSCTLVCPLHNQEVTAEDGTQRCEKCSKPCARVCYGLGMEHLREVRAVTSANIQEFAGCKKIFGSLAFLPESFDGDPASNTAPLQPEQLQVFETLEEITGYLYISAWPDSLPDLSVFQNLQVIRGRILHNGAYSLTLQGLGISWLGLRSLRELGSGLALIHHNTHLCFVHTVPWDQLFRNPHQALLHTANRPEDECVGEGLACHQLCARGHCWGPGPTQCVNCSQFLRGQECVEECRVLQGLPREYVNARHCLPCHPECQPQNGSVTCFGPEADQCVACAHYKDPPFCVARCPSGVKPDLSYMPIWKFPDEEGACQPCPINCTHSCVDLDDKGCPAEQRASPLTSIISAVVGILLVVVLGVVFGILIKRRQQKIRKYTMRRLLQETELVEPLTPSGAMPNQAQMRILKETELRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMAGVGSPYVSRLLGICLTSTVQLVTQLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYHADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKCWMIDSECRPRFRELVSEFSRMARDPQRFVVIQNEDLGPASPLDSTFYRSLLEDDDMGDLVDAEEYLVPQQGFFCPDPAPGAGGMVHHRHRSSSTRSGGGDLTLGLEPSEEEAPRSPLAPSEGAGSDVFDGDLGMGAAKGLQSLPTHDPSPLQRYSEDPTVPLPSETDGYVAPLTCSPQPEYVNQPDVRPQPPSPREGPLPAARPAGATLERPKTLSPGKNGVVKDVFAFGGAVENPEYLTPQGGAAPQPHPPPAFSPAFDNLYYWDQDPPERGAPPSTFKGTPTAENPEYLGLDVPV`

ref_prot.sequence output:
'MELAALCRWGLLLALLPPGAASTQDNYLSTDVGSCTLVCPLHNQEVTAEDGTQRCEKCSKPCARVCYGLGMEHLREVRAVTSANIQEFAGCKKIFGSLAFLPESFDGDPASNTAPLQPEQLQVFETLEEITGYLYISAWPDSLPDLSVFQNLQVIRGRILHNGAYSLTLQGLGISWLGLRSLRELGSGLALIHHNTHLCFVHTVPWDQLFRNPHQALLHTANRPEDECVGEGLACHQLCARGHCWGPGPTQCVNCSQFLRGQECVEECRVLQGLPREYVNARHCLPCHPECQPQNGSVTCFGPEADQCVACAHYKDPPFCVARCPSGVKPDLSYMPIWKFPDEEGACQPCPINCTHSCVDLDDKGCPAEQRASPLTSIISAVVGILLVVVLGVVFGILIKRRQQKIRKYTMRRLLQETELVEPLTPSGAMPNQAQMRILKETELRKVKVLGSGAFGTVYKGIWIPDGENVKIPVAIKVLRENTSPKANKEILDEAYVMAGVGSPYVSRLLGICLTSTVQLVTQLMPYGCLLDHVRENRGRLGSQDLLNWCMQIAKGMSYLEDVRLVHRDLAARNVLVKSPNHVKITDFGLARLLDIDETEYHADGGKVPIKWMALESILRRRFTHQSDVWSYGVTVWELMTFGAKPYDGIPAREIPDLLEKGERLPQPPICTIDVYMIMVKCWMIDSECRPRFRELVSEFSRMARDPQRFVVIQNEDLGPASPLDSTFYRSLLEDDDMGDLVDAEEYLVPQQGFFCPDPAPGAGGMVHHRHRSSSTRSGGGDLTLGLEPSEEEAPRSPLAPSEGAGSDVFDGDLGMGAAKGLQSLPTHDPSPLQRYSEDPTVPLPSETDGYVAPLTCSPQPEYVNQPDVRPQPPSPREGPLPAARPAGATLERPKTLSPGKNGVVKDVFAFGGAVENPEYLTPQGGAAPQPHPPPAFSPAFDNLYYWDQDPPERGAPPSTFKGTPTAENPEYLGLDVPV`

manifest.ini

`[package_infos]
description = mutant peptide generation
maintainer = Tariq Daouda
maintainer_contact = [email protected]
version = 1

[set_infos]
species = human
name = test_var
type = Agnostic
source = TCGA variants

[snps]
filename = test_var.txt
`

test_var.txt

chromosomeNumber uniqueId start end ref alt quality caller
17 1 37881637 37881637 A G 255 GATK
17 2 37881646 37881646 T G 255 GATK
7 3 55242465 55242479 GGAATTAAGAGAAGC - 255 GATK
7 4 55242465 55242479 - TGGCCAGCG 255 GATK

Out of frame protein sequences

Problem: proteins whose translation start sites are uncertain give out-of-frame sequences.
Solution: the frame of the first exon should be taken into account when generating the CDS.

from pyGeno.Genome import *
from pyGeno.tools.UsefulFunctions import translateDNA, showDifferences

refGenome = Genome(name="GRCh38.80")
refProt = refGenome.get(Protein, id="ENSP00000349216")[0]
print("pyGeno")
print(refProt.sequence)
# protein sequence from GENCODE, kept for comparison
gencode_seq = (
    "XHIRIMKRRVHTHWDVNISFREASCSQDGNLPTLISSVHRSRHLVMPEHQS"
    "RCEFQRGSLEIGLRPAGDLLGKRLGRSPRISSDCFSEKRARSESPQEALLLPRELGPSMAPEDHYRRLV"
    "SALSEASTFEDPQRLYHLGLPSHDLLRVRQEVAAAALRGPSGLEAHLPSSTAGQRRKQGL"
    "AQHREGAAPAAAPSFSERELPQPPPLLSPQNAPHVALGPHLRPPFLGVPSALCQTPGYGF"
    "LPPAQAEMFAWQQELLRKQNLARLELPADLLRQKELESARPQLLAPETALRPNDGAEELQ"
    "RRGALLVLNHGAAPLLALPPQGPPGSGPPTPSRDSARRAPRKGGPGPASARPSESKEMTG"
    "ARLWAQDGSEDEPPKDSDGEDPETAAVGCRGPTPGQAPAGGAGAEGKGLFPGSTLPLGFP"
    "YAVSPYFHTGAVGGLSMDGEEAPAPEDVTKWTVDDVCSFVGGLSGCGEYTRVFREQGIDG"
    "ETLPLLTEEHLLTNMGLKLGPALKIRAQVARRLGRVFYVASFPVALPLQPPTLRAPEREL"
    "GTGEQPLSPTTATSPYGGGHALAGQTSPKQENGTLALLPGAPDPSQPLC"
)
print("GENCODE")
print(gencode_seq)
# frame of the first exon, used to shift the reading frame of the translation
first_exon_frame = refProt.transcript.exons[0].frame
print(first_exon_frame)
new_seq = "X" + translateDNA(refProt.transcript.cDNA[0:-3], frame="f" + str(1 + first_exon_frame))
print("Corrected sequence")
print(new_seq)
print(showDifferences(gencode_seq, new_seq))

Error in importing my_SNP.tar.gz

Hi, I was trying to import my SNP file and followed your instructions here. However, I got the following error message:
Importing polymorphism set: my/path/to/my_SNP.tar.gz... (This may take a while)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyGeno/importation/SNPs.py", line 65, in importSNPs
    return _importSNPs_dbSNPSNP(setName, species, genomeSource, snpsFile)
  File "pyGeno/importation/SNPs.py", line 192, in _importSNPs_dbSNPSNP
    snpData = VCFFile(snpsFile, gziped = True, stream = True)
  File "pyGeno/tools/parsers/VCFTools.py", line 89, in __init__
    self.parse(filename, gziped, stream)
  File "pyGeno/tools/parsers/VCFTools.py", line 106, in parse
    ll = self.f.readline()
  File "/usr/lib/python2.7/gzip.py", line 464, in readline
    c = self.read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 268, in read
    self._read(readsize)
  File "/usr/lib/python2.7/gzip.py", line 303, in _read
    self._read_gzip_header()
  File "/usr/lib/python2.7/gzip.py", line 197, in _read_gzip_header
    raise IOError, 'Not a gzipped file'
IOError: Not a gzipped file

I zipped both manifest.ini and snps.txt to my_SNP.tar.gz using tar -cvzf.

Do you have any idea why this issue comes up and how I could fix this?

Thanks
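The traceback shows the SNP file being opened with `gziped = True`, so one possible reading is that the importer expected the variant file inside the archive to be gzip-compressed itself. That is only a guess from the traceback, not confirmed pyGeno behaviour. The sketch below uses only the Python standard library to check which files are actually gzip-compressed; the helper names are illustrative.

```python
import gzip
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def is_gzip(path):
    """True if the file at `path` starts with the gzip magic bytes."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def gzip_status_of_members(archive_path):
    """Map each regular file in a tar archive to True if it is itself gzipped."""
    status = {}
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            if member.isfile():
                header = archive.extractfile(member).read(2)
                status[member.name] = header == GZIP_MAGIC
    return status


# Usage (path taken from the message above):
# print(is_gzip("my/path/to/my_SNP.tar.gz"))
# print(gzip_status_of_members("my/path/to/my_SNP.tar.gz"))
```

If the archive is gzipped but the SNP file inside it is plain text, the "Not a gzipped file" error would be consistent with the importer trying to read that inner file as gzip.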

Problem with the documented syntax for advanced queries (e.g., getting a list of genes within start and end coordinates)

Good afternoon, this question is related to making advanced queries on pyGeno with .get

Based on this documentation, we can query for specific details as such:

#even complex stuff
exons = myChromosome.get(Exons, {'start >=' : x1, 'stop <' : x2})
hlaGenes = myGenome.get(Gene, {'name like' : 'HLA'})

sry = myGenome.get(Transcript, { "gene.name" : 'SRY' })

Unfortunately, none of these commands seem to work, while basic commands for getting specific genes based on their ids work:

#in this case both queries will yield the same result
myGene.get(Protein, id = "ENSID...")
myGenome.get(Protein, id = "ENSID...")

In this situation, I am attempting to call a list of genes within a particular set of coordinates on a particular chromosome. To illustrate the problem, I use .get to call p53 (Chr17 in humans):

# getting the gene based on id
gene_example = g.get(Gene, id = 'ENSG00000141510')

# confirming the gene based on chromosome - note that I give the index [0] because for some reason, .get seems to generate a single-index list of the Raba object
print(gene_example[0].chromosome.number)
>17

# now, I get the start and end coords
x1 = gene_example[0].start
x2 = gene_example[0].end

# finally, I test getting the gene using the coords
gene_test = g.get(Gene, {'start >=': x1, 'end <=': x2, 'chromosome.number': 17})

Ultimately, gene_test is not assigned to any value because g.get can't find anything within those coordinates. Even when I tested by replacing x1 and x2 with nearly the entire chromosomal length, no genes were identified.

Would anyone happen to know the correct syntax for this? Perhaps it has changed in recent updates. Thank you!
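To make the intended result explicit, here is a minimal sketch of the filter in plain Python. It does not use the pyGeno API; `GeneRecord` and the coordinates are hypothetical stand-ins. The chromosome is compared as text so that `17` and `'17'` both match; whether pyGeno stores `chromosome.number` as a string is an assumption worth checking, since a type mismatch would be one way for such a query to return nothing.

```python
from collections import namedtuple

# Hypothetical stand-in for gene records; not pyGeno objects, made-up coordinates.
GeneRecord = namedtuple("GeneRecord", ["name", "chromosome", "start", "end"])


def genes_within(genes, chromosome, x1, x2):
    """Keep genes that lie entirely inside [x1, x2] on the given chromosome."""
    chromosome = str(chromosome)  # compare as text so 17 and '17' are equivalent
    return [
        gene
        for gene in genes
        if str(gene.chromosome) == chromosome
        and gene.start >= x1
        and gene.end <= x2
    ]


toy_genes = [
    GeneRecord("geneA", "17", 100, 200),
    GeneRecord("geneB", "17", 150, 450),
    GeneRecord("geneC", "7", 120, 180),
]

# Only geneA is on chromosome 17 and fully inside 100..300
print([gene.name for gene in genes_within(toy_genes, 17, 100, 300)])
```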

Translate mitochondrial chromosome with Vertebrate Mitochondrial Code

Current sequence:
("gene:Gene, name: MT-ND1, id: ENSG00000198888, strand: '+' > Chromosome: number MT > <Raba obj: ('Genome_Raba', 0.11207103328668244), raba_id: 1>", '\n')
('seq:IPMANLLLLIVPILIAMAFLMLTERKILGYIQLRKGPNVVGPYGLLQPFADAIKLFTKEPLKPATSTITLYITAPTLALTIALLLTPLPIPNPLVNLNLGLLFILATSSLAVYSILSGASNSNYALIGALRAVAQTISYEVTLAIILLSTLLISGSFNLSTLITTQEHLLLLPSPLAIIFISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFALFFIAEYTNIIIINTLTTTIFLGTTYDALSPELYTTYFVTKTLLLTSLFLIRTAYPRFRYDQLIHLLKNFLPLTLALLI*YVSIPITISSIPPQT', '\n')

Expected sequence:
MPMANLLLLIVPILIAMAFLMLTERKILGYMQLRKGPNVVGPYGLLQPFADAMKLFTKEP
LKPATSTITLYITAPTLALTIALLLWTPLPMPNPLVNLNLGLLFILATSSLAVYSILWSG
WASNSNYALIGALRAVAQTISYEVTLAIILLSTLLMSGSFNLSTLITTQEHLWLLLPSWP
LAMMWFISTLAETNRTPFDLAEGESELVSGFNIEYAAGPFALFFMAEYTNIIMMNTLTTT
IFLGTTYDALSPELYTTYFVTKTLLLTSLFLWIRTAYPRFRYDQLMHLLWKNFLPLTLAL
LMWYVSMPITISSIPPQT
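For reference, the vertebrate mitochondrial code (NCBI translation table 2) differs from the standard code in four codons: ATA encodes M rather than I, TGA encodes W rather than a stop, and AGA and AGG are stops rather than R. The sketch below is plain Python and independent of pyGeno; the table and function names are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI table 1), codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), STANDARD_AAS)
}

# Vertebrate mitochondrial code (NCBI table 2): four codons are reassigned
VERTEBRATE_MITO_TABLE = dict(STANDARD_TABLE)
VERTEBRATE_MITO_TABLE.update({"ATA": "M", "TGA": "W", "AGA": "*", "AGG": "*"})


def translate(dna, table):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    return "".join(table.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


print(translate("ATATGAAGA", STANDARD_TABLE))         # I*R
print(translate("ATATGAAGA", VERTEBRATE_MITO_TABLE))  # MW*
```

The I versus M and missing W differences between the current and expected sequences above would be consistent with the standard table having been used for the MT chromosome.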

Cannot install pyGeno successfully

I tried to install pyGeno with Python 2.7 on Ubuntu 16.04.10, but could not install it successfully. The error message is as follows:
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/home/dongl/.local/lib/python2.7/site-packages/pyGeno/__init__.py", line 3, in <module>
    from .configuration import pyGeno_init
  File "/home/dongl/.local/lib/python2.7/site-packages/pyGeno/configuration.py", line 3, in <module>
    import rabaDB.rabaSetup
  File "/home/dongl/.local/lib/python2.7/site-packages/rabaDB/rabaSetup.py", line 24
    class RabaConfiguration(object, metaclass=RabaNameSpaceSingleton) :
    ^
SyntaxError: invalid syntax

How can I solve this problem? Please help me.
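The line flagged in the traceback uses the `metaclass=` keyword in a class statement, which is Python 3 syntax that Python 2.7 cannot parse. That suggests the installed rabaDB release targets Python 3, so running pyGeno under Python 3 is one likely way out. The sketch below illustrates the difference; `Singleton` is a hypothetical stand-in for `RabaNameSpaceSingleton`, not rabaDB's code, and the file as a whole only parses on Python 3.

```python
class Singleton(type):
    """Hypothetical metaclass: every class using it gets a single instance."""

    _instances = {}

    def __call__(cls, *args, **kwargs):
        if cls not in cls._instances:
            cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
        return cls._instances[cls]


# Python 3 only: this is the form that raises SyntaxError on Python 2.7
class ConfigPy3(object, metaclass=Singleton):
    pass


# Calling the metaclass directly builds an equivalent class and is valid
# syntax on both Python 2 and Python 3
ConfigPortable = Singleton("ConfigPortable", (object,), {})

print(ConfigPy3() is ConfigPy3())            # True
print(ConfigPortable() is ConfigPortable())  # True
```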