sebriois / biomart

Python biomart API

License: BSD 2-Clause "Simplified" License

Python 100.00%

biomart's Issues

KeyError: 'displayName' when calling BiomartDataset.search()

I'm not sure if this is a usage issue, or an actual bug.

The code:

from biomart import BiomartServer

server = BiomartServer( "http://www.biomart.org/biomart" )

ens = server.databases['ensembl']
hsapiens = ens.datasets['hsapiens_gene_ensembl']
hsapiens.search()

The output:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-55-c144feb5b758> in <module>()
----> 1 hsapiens.search()

/Library/Python/2.7/site-packages/biomart/dataset.pyc in search(self, params, header, count)
     66 
     67         if not self._filters or not self._attributes:
---> 68             self.fetch_configuration()
     69 
     70         root = Element( 'Query' )

/Library/Python/2.7/site-packages/biomart/dataset.pyc in fetch_configuration(self)
     52         for filter_description in xml.iter( 'FilterDescription' ):
     53             name = filter_description.attrib['internalName']
---> 54             self._filters[name] = biomart.BiomartFilter( filter_description.attrib )
     55 
     56         # Attributes

/Library/Python/2.7/site-packages/biomart/filter.pyc in __init__(self, params)
      2     def __init__(self, params):
      3         self.name = params['internalName']
----> 4         self.displayName = params['displayName']
      5         self.type = params['type']
      6         self.default = ('default' in params and params['default'] == 'true')

KeyError: 'displayName'

I presume this is coming from the BiomartDataset.fetch_configuration() function?

Your help is greatly appreciated.
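
The traceback above shows the filter constructor indexing params['displayName'] directly, so any FilterDescription element without that attribute raises KeyError. A minimal sketch of a more tolerant constructor follows. TolerantFilter is a hypothetical stand-in, not the package's class, and falling back to the internal name is an assumption about what a reasonable default would be.

```python
class TolerantFilter:
    """Hypothetical stand-in for the filter class shown in the traceback."""

    def __init__(self, params):
        # 'internalName' is still required; the caller uses it as the dict key.
        self.name = params["internalName"]
        # Fall back to the internal name when the server omits 'displayName'.
        self.display_name = params.get("displayName", self.name)
        # 'type' may be missing as well; None marks it as unknown.
        self.type = params.get("type")
        self.default = params.get("default") == "true"


f = TolerantFilter({"internalName": "chromosome_name"})
print(f.display_name)
```

With dict.get, a sparse configuration entry degrades to sensible defaults instead of aborting the whole configuration fetch.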

http://www.biomart.org/biomart

Visiting this address returns a 404 error, which makes the package impossible to use.
The only working link is this one:
http://useast.ensembl.org/biomart/martview
However, it does not expose the same attributes.

TypeError: list object is not an iterator when running search()

Hello!
Following the documentation I ran the code below:

from biomart import BiomartServer
import pandas as pd

# Connect to biomart
server = BiomartServer( "http://www.ensembl.org/biomart" )
server.verbose = True

# Check available databases
#server.show_databases()

# Select Genes database
db = server.databases['ENSEMBL_MART_ENSEMBL']

# Check available datasets (species)
#db.show_datasets()

# Select H. sapiens dataset
ds = db.datasets['hsapiens_gene_ensembl']

response = ds.search()

And everything works properly until I run the ds.search() command, which triggers an error. Here is the output of the code above:

[BiomartServer:'http://www.ensembl.org/biomart/martservice'] Fetching databases
[BiomartDatabase:'Ensembl Genes 89'] Fetching datasets
[BiomartDataset:'hsapiens_gene_ensembl'] Searching using following params:
{}
[BiomartDataset:'hsapiens_gene_ensembl'] Fetching attributes
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<string>", line 46, in <module>
  File "/usr/local/lib/python2.7/dist-packages/biomart/dataset.py", line 187, in search
    page = next(filter(lambda attr_page: attr_page.is_default, self._attribute_pages.values()))
TypeError: list object is not an iterator

Any idea of what I'm doing wrong?

Many thanks in advance!

p.s.: the ds.count() command works properly :)
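
The failing line calls next() on the result of filter(). On Python 2.7, which the traceback paths indicate, filter() returns a list, and next() only accepts iterators. A minimal sketch of a lookup that behaves the same on Python 2 and 3 is shown below; AttributePage is a hypothetical stand-in for the package's page objects, and returning None when no page is default is an illustrative choice.

```python
from collections import namedtuple

# Hypothetical stand-in for the package's attribute-page objects.
AttributePage = namedtuple("AttributePage", ["name", "is_default"])


def default_page(pages):
    """Return the first page flagged as default, or None if there is none."""
    # A generator expression is an iterator on Python 2 and 3 alike,
    # unlike filter(), which returns a list on Python 2.
    return next((page for page in pages if page.is_default), None)


pages = [AttributePage("sequences", False), AttributePage("feature_page", True)]
print(default_page(pages).name)
```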

Human Gene Ensembl Retrieval by Filter Issue

I have been trying to query BioMart using your package, but the filters do not seem to work, and I am unsure which type to use for the "gene_id" filter in the Homo sapiens Ensembl gene dataset. The most confusing part is that the query works without filters but does not seem to work with them.

from biomart import BiomartServer
server = BiomartServer( "http://www.biomart.org/biomart" ) 
server.verbose = True 
hman_ens_genome = server.datasets['hsapiens_gene_ensembl']
result = hman_ens_genome.search(header=1)

The above query works, retrieving the list of Ensembl gene IDs and Ensembl transcript IDs. However, when I try any filter, as below, nothing seems to be retrieved.

result = hman_ens_genome.search({
    'filters': {
        'gene_id': ['ENSG00000050393']
    }
}, header=1)

Any idea whether I am doing something wrong with the query?

BiomartException when using boolean filter

Hi,

I'm trying to query biomart with the following script:

from biomart import BiomartServer

server = BiomartServer( "http://www.ensembl.org/biomart")
server.verbose = False
dataset = server.datasets['hsapiens_gene_ensembl']

response = dataset.search({
        'filters': {
            'transcript_gencode_basic': 'only',
            'ensembl_gene_id': 'ENSG00000007372',
            'transcript_biotype': 'protein_coding',
        },

        'attributes': [
            'ensembl_gene_id',
            'ensembl_transcript_id',
            'external_gene_name',
            'cdna'
        ]
}, header = 1)

This gives me the following exception:

Query ERROR: caught BioMart::Exception: non-BioMart die(): Can't locate object method "setTable" via package "BioMart::Configuration::BooleanFilter" at /nfs/public/release/ensweb/latest/live/mart/www_96/biomart-perl/lib/BioMart/Query.pm line 2132.

From my tests, the issue appears to be with the "transcript_gencode_basic" filter, which is of type boolean but accepts 'only' and 'excluded' rather than True or False.

Is this fixable or is it a problem with biomart?

Thanks
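
In the XML queries that BioMart's web client generates, boolean filters appear to carry an excluded attribute ("0" or "1") rather than a value attribute, which would explain why the server rejects a boolean filter sent with a value. The sketch below builds such elements with the standard library. The add_filter helper and the mapping of 'only' to "0" and 'excluded' to "1" are assumptions for illustration, not the package's confirmed behavior.

```python
from xml.etree.ElementTree import Element, SubElement, tostring


def add_filter(dataset_elem, name, value, is_boolean=False):
    """Append a <Filter> element; boolean filters get 'excluded', not 'value'."""
    filter_elem = SubElement(dataset_elem, "Filter", name=name)
    if is_boolean:
        # Assumed mapping: 'only' keeps matching rows, 'excluded' drops them.
        mapping = {"only": "0", "excluded": "1"}
        if value not in mapping:
            raise ValueError("boolean filter %r needs 'only' or 'excluded'" % name)
        filter_elem.set("excluded", mapping[value])
    else:
        filter_elem.set("value", str(value))
    return filter_elem


dataset = Element("Dataset", name="hsapiens_gene_ensembl")
add_filter(dataset, "transcript_gencode_basic", "only", is_boolean=True)
add_filter(dataset, "ensembl_gene_id", "ENSG00000007372")
print(tostring(dataset).decode("utf-8"))
```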

Virtual Schema Error

hi,

I am trying to retrieve transcript information from the "http://plants.ensembl.org/biomart" server using the dataset "athaliana_eg_gene". Unfortunately, it is not working and I get an error:

Query ERROR: caught BioMart::Exception::Usage: WITHIN Virtual Schema : default, Dataset athaliana_eg_gene NOT FOUND

I found out that the virtual schema is somehow set to 'default', but in my case I need a different one, named 'plants_mart'. Is there a way to set this option myself? Otherwise, what solution do you suggest?

best regards

mary
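
In BioMart's XML query format the virtual schema is an attribute of the root Query element (virtualSchemaName). The sketch below builds such a query by hand with the standard library; it does not show how to pass the setting through the biomart package, which the report above leaves open. The build_query helper, the attribute names in the example, and the other Query attributes are illustrative assumptions.

```python
from xml.etree.ElementTree import Element, SubElement, tostring


def build_query(dataset_name, attributes, virtual_schema="default"):
    """Build a BioMart <Query> element with an explicit virtual schema."""
    query = Element(
        "Query",
        virtualSchemaName=virtual_schema,
        formatter="TSV",
        header="0",
        uniqueRows="1",
        datasetConfigVersion="0.6",
    )
    dataset = SubElement(query, "Dataset", name=dataset_name, interface="default")
    for attribute in attributes:
        SubElement(dataset, "Attribute", name=attribute)
    return query


# Attribute names here are placeholders for whatever the dataset exposes.
query = build_query(
    "athaliana_eg_gene",
    ["ensembl_gene_id", "ensembl_transcript_id"],
    virtual_schema="plants_mart",
)
print(tostring(query).decode("utf-8"))
```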

UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 1386: invalid start byte

Hi,

I am trying to print the filters of my dataset, but I get an error that I cannot resolve:

import os
import pprint as pp
from biomart import BiomartServer

server = BiomartServer( "http://feb2014.archive.ensembl.org/biomart/" )
gene = server.datasets['hsapiens_gene_ensembl']
print "gene",gene
pp.pprint(gene.show_filters()) # uses pprint

gene Homo sapiens genes (GRCh37.p13)

Traceback (most recent call last):
  File "ensembl.py", line 55, in <module>
    pp.pprint(gene.show_filters(),width=1) # uses pprint
  File "/home/clerc/.local/lib/python2.7/site-packages/biomart/dataset.py", line 68, in show_filters
    pprint.pprint(self.filters)
  File "/home/clerc/.local/lib/python2.7/site-packages/biomart/dataset.py", line 64, in filters
    self.fetch_filters()
  File "/home/clerc/.local/lib/python2.7/site-packages/biomart/dataset.py", line 82, in fetch_filters
    line = line.decode('utf8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 1386: invalid start byte
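
Byte 0xfc is not a valid UTF-8 start byte, but it is "ü" in Latin-1, which suggests (as an assumption, since the raw response is not shown) that this archive server sends some filter descriptions in Latin-1. A minimal sketch of a decoding helper that tries UTF-8 first and then falls back; decode_line is a hypothetical name, not part of the package.

```python
def decode_line(raw, encodings=("utf-8", "latin-1")):
    """Decode bytes by trying each encoding in turn."""
    for encoding in encodings:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Latin-1 accepts every byte, so this is reached only when the caller
    # passes a different list of encodings and all of them fail.
    return raw.decode(encodings[0], errors="replace")


print(decode_line(b"M\xfcller"))
```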

I can't use the GRCh37 dataset

Hi,
I would like to use the GRCh37 dataset, since I'm still working with this assembly.

After setting up the connection to the server with the command:

server = BiomartServer("http://www.ensembl.org/biomart")

if I type:

server.show_datasets()

I only get the most recent version (GRCh38) for the species I am interested in, but I would like to use the GRCh37 dataset.
How could I solve this issue? Thank you.
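
Ensembl serves the GRCh37 assembly from a dedicated site, grch37.ensembl.org. Whether the biomart package works against it unchanged is an assumption that has not been verified here. The sketch below only selects a base URL by assembly name; biomart_url and the ASSEMBLY_HOSTS table are hypothetical helpers.

```python
# Assumed BioMart base URLs per human assembly; the GRCh37 host is
# Ensembl's dedicated GRCh37 site.
ASSEMBLY_HOSTS = {
    "GRCh38": "http://www.ensembl.org/biomart",
    "GRCh37": "http://grch37.ensembl.org/biomart",
}


def biomart_url(assembly):
    """Return the BioMart base URL for a human assembly name."""
    try:
        return ASSEMBLY_HOSTS[assembly]
    except KeyError:
        raise ValueError("unknown assembly: %r" % assembly)


print(biomart_url("GRCh37"))
# The result would then be passed to BiomartServer(...), as in the report above.
```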

Addition to Anaconda cloud?

Is there any possibility of adding this to Anaconda cloud? Would be nice to have the ability to install it into a conda environment.

Biomart queries to Ensembl require less restrictive assertions of filters existence

Some BioMarts, particularly the one hosted by Ensembl, use filters to pass information about attributes. It might sound odd, but there is a rationale. Look at the Flanking Sequences attributes page: the additional parameter of the Upstream flank attribute is sent as a filter, as the preview of the XML query clearly shows. Filters allow passing additional values, and BioMart takes advantage of that. It is quite odd, but it works (at least in the web client).

The problem is that the biomart package responds to the corresponding query with:
biomart.BiomartException: The filter 'downstream_flank' does not exist.

Here is the code for reproduction:

from biomart import BiomartDataset
som_snp = BiomartDataset('www.ensembl.org/biomart', name='hsapiens_snp_som')
query_dict = {'attributes': [u'refsnp_id'], 'filters': {u'downstream_flank': 100}}
response = som_snp.search(query_dict)
for line in response.iter_lines():
    line = line.decode('utf-8')
    print(line.split("\t"))

I've developed a workaround for this issue on my fork and will open a pull request soon. Please leave feedback here (if any) and, if you are also affected by this bug, more test cases. Cheers :)

show_databases() not working (XML Parsing Broken?)

Hi,

A script using your API started returning the following error:

Traceback (most recent call last):

  File "/home/shiny/anaconda3/lib/python3.6/site-packages/IPython/core/interactiveshell.py", line 2910, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-4-b441ad38d3c3>", line 1, in <module>
    server.show_datasets()

  File "/home/shiny/anaconda3/lib/python3.6/site-packages/biomart/server.py", line 93, in show_datasets
    pprint.pprint(self.datasets)

  File "/home/shiny/anaconda3/lib/python3.6/site-packages/biomart/server.py", line 62, in datasets
    self.fetch_datasets()

  File "/home/shiny/anaconda3/lib/python3.6/site-packages/biomart/server.py", line 86, in fetch_datasets
    for database in self.databases.values():

  File "/home/shiny/anaconda3/lib/python3.6/site-packages/biomart/server.py", line 56, in databases
    self.fetch_databases()

  File "/home/shiny/anaconda3/lib/python3.6/site-packages/biomart/server.py", line 70, in fetch_databases
    xml = fromstring(r.text)

  File "/home/shiny/anaconda3/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
    parser.feed(text)

  File "<string>", line unknown
ParseError: undefined entity: line 36, column 5

All I'm trying to do is use show_databases() or show_datasets():

> import biomart
> server = biomart.BiomartServer("http://useast.ensembl.org/biomart")
> server.show_datasets()

The API is up to date and I'm running Python 3.6.4. I noticed this error today; the script was working about a week ago, so I'm not sure when the actual change occurred (whatever that change may be). My script has not been touched in the meantime.

This script is pretty essential to my (and others') work flow, so it'd be great to get this resolved ASAP. Thanks!
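
An "undefined entity" ParseError typically appears when the XML parser meets an HTML entity such as &nbsp;, which suggests (as an assumption; the raw response is not shown) that the server answered with an HTML page rather than the registry XML. The sketch below wraps the parse call so the start of the offending payload is visible in the error. parse_registry is a hypothetical helper, not part of the package.

```python
from xml.etree.ElementTree import ParseError, fromstring


def parse_registry(text):
    """Parse server output as XML, reporting a readable error for non-XML."""
    try:
        return fromstring(text)
    except ParseError as exc:
        snippet = text.strip()[:80]
        raise ValueError(
            "response is not valid XML (%s); starts with: %r" % (exc, snippet)
        )


root = parse_registry("<MartRegistry><MartURLLocation name='demo'/></MartRegistry>")
print(root.tag)
```

Seeing the first characters of the response makes it easy to tell a maintenance page or redirect notice apart from a genuinely malformed registry.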

No module named server

Traceback (most recent call last):
  File "./maf_symbol_updater.py", line 3, in <module>
    from biomart import BiomartServer
  File "/usr/lib/python3.4/site-packages/biomart/__init__.py", line 1, in <module>
    from server import BiomartServer
ImportError: No module named 'server'

Nonetheless, ls /usr/lib/python3.4/site-packages/biomart shows:
attribute.py database.py dataset.py filter.py __init__.py lib __pycache__ server.py test

Does this not work with Python 3 at all, or is something else wrong?
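
The traceback points at `from server import BiomartServer` inside the package's `__init__.py`. Python 3 no longer resolves such imports relative to the package, so sibling modules need the explicit form `from .server import ...`. The sketch below demonstrates the difference on a throwaway package written to a temporary directory; all module and package names in it are made up for the demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def import_works(init_line, pkg_name):
    """Create a tiny package whose __init__.py is init_line and try to import it."""
    with tempfile.TemporaryDirectory() as root:
        pkg = Path(root) / pkg_name
        pkg.mkdir()
        # Sibling module inside the package, analogous to biomart/server.py.
        (pkg / "demo_server_mod.py").write_text("class DemoServer:\n    pass\n")
        (pkg / "__init__.py").write_text(init_line + "\n")
        sys.path.insert(0, root)
        try:
            importlib.invalidate_caches()
            importlib.import_module(pkg_name)
            return True
        except ImportError:
            return False
        finally:
            # Undo the side effects so repeated calls stay independent.
            sys.path.remove(root)
            sys.modules.pop(pkg_name, None)
            sys.modules.pop(pkg_name + ".demo_server_mod", None)


print(import_works("from .demo_server_mod import DemoServer", "demo_pkg_rel"))
print(import_works("from demo_server_mod import DemoServer", "demo_pkg_abs"))
```

On Python 3 only the first form, with the leading dot, finds the sibling module.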

Querying with filters not working for Ensembl datasets

I'm trying to perform a filtered query:

mus_ensembl = server.datasets['mmusculus_gene_ensembl']
response = mus_ensembl.search({'filters': {'strand': '+'}})

This produces an empty response. Any ideas? I have tried different filters, and none of them seems to work.

Listed multiple values in chromosome_name filter error?

Hi,

I am trying to filter a query by chromosome name using a list of chromosomes, but I think lists of multiple filter values are not currently supported. Example:

gene = 'BRCA1'
response = hsapiens_gene_ensembl.search({
    'filters':{
        'chromosome_name':['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', 'X', 'Y', 'MT'],
        'hgnc_symbol':gene,
        'transcript_gencode_basic':'only',
    },
    'attributes':[
        'chromosome_name','exon_chrom_start',
        'exon_chrom_end','external_gene_name',
        'ensembl_transcript_id','rank'
    ]
}, header = 1)

Thus, I get the following error:

---------------------------------------------------------------------------
BiomartException                          Traceback (most recent call last)
<ipython-input-57-4fc8303edde6> in <module>()
     12         'ensembl_transcript_id','rank'
     13     ]
---> 14 }, header = 1) # if you need the columns header

~/python_venvs/py3.6_in_silico/lib/python3.6/site-packages/biomart/dataset.py in search(self, params, header, count)
    174                 error_msg = "The value '%s' for filter '%s' cannot be used." % (filter_value, filter_name)
    175                 error_msg += " Use one of: [%s]" % ", ".join(map(str, dataset_filter.accepted_values))
--> 176                 raise biomart.BiomartException(error_msg)
    177 
    178         # check attributes unless we're only counting

BiomartException: The value '1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,MT,X,Y' for filter 'chromosome_name' cannot be used. Use one of: [1, 2, 3, 4, 5, 6,....

Are there plans to support this in the near future?
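
The error message above shows the whole comma-joined string being compared against the accepted values, so a list can never match even when every element is valid. A minimal sketch of per-value validation follows; check_filter_values is a hypothetical helper, and the accepted values in the example are a shortened illustration.

```python
def check_filter_values(filter_name, values, accepted_values):
    """Validate each value separately and return the comma-joined string."""
    if isinstance(values, str):
        values = [values]
    rejected = [v for v in values if v not in accepted_values]
    if rejected:
        raise ValueError(
            "values %s for filter '%s' cannot be used" % (rejected, filter_name)
        )
    # BioMart filters take several values as one comma-separated string.
    return ",".join(values)


accepted = {"1", "2", "X", "Y", "MT"}
print(check_filter_values("chromosome_name", ["1", "X", "MT"], accepted))
```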

Unable to retrieve data from Ensembl BioMart due to conflict of attribute names on different pages

As the linked Ensembl BioMart page shows, some marts have multiple attributes with the same name placed on different attribute pages.

To verify this, compare (using the link above):

on "Variant" attributes' page: in "Variant Information" group: "Chromosome name", "Chromosome position start (bp)", "Chromosome position end (bp)", "Strand", etc.
on "Flanking Sequences" page: in "Flanking Sequences" group: (of course the same as above) "Chromosome name", "Chromosome position start (bp)", "Chromosome position end (bp)", "Strand"...

So subsets of the attribute names are identical. The XML view shows that not only the displayNames but also the names are the same. When the biomart package is used with a server configured as in the example above, the following error occurs:
biomart.BiomartException: You must use attributes that belong to the same attribute page.

This might be reproduced with the following code:

from biomart import BiomartDataset
som_snp = BiomartDataset('www.ensembl.org/biomart', name='hsapiens_snp_som')
response = som_snp.search({'attributes': [u'chr_name', u'ensembl_gene_stable_id']})
for line in response.iter_lines():
    line = line.decode('utf-8')
    print(line.split("\t"))

Of course, the attributes do belong to the same attribute page; 'chr_name' simply belongs to both the 'Variant' and 'Flanking Sequences' pages. I've developed a fix for this bug on my fork and will open a pull request soon. Please leave feedback here (if any) and, if you are also affected by this bug, more test cases. Cheers.
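
One way to handle attributes that live on several pages is to ask which pages contain every requested attribute, instead of mapping each attribute to a single page. The sketch below illustrates that idea and is not the fix from the fork mentioned above; pages_containing and the page contents in the example are hypothetical.

```python
def pages_containing(requested, pages):
    """Return names of the pages that contain every requested attribute."""
    wanted = set(requested)
    return [name for name, attrs in pages.items() if wanted <= set(attrs)]


# Illustrative page contents; 'chr_name' appears on both pages.
pages = {
    "Variant": {"chr_name", "ensembl_gene_stable_id", "refsnp_id"},
    "Flanking Sequences": {"chr_name", "downstream_flank"},
}
print(pages_containing(["chr_name", "ensembl_gene_stable_id"], pages))
```

A query is valid whenever the returned list is non-empty, even if some of its attributes also appear elsewhere.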

PS. Compatibility with Ensembl might be crucial, since the BioMart community servers have recently faced migration issues.

dataset.search doesn't use formatter

Processing The response

Hi,
I wrote a small script for Ensembl and got a response, but when I try to print it, it just shows <Response [200]>. Can someone help me out?
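
<Response [200]> is the printed form of a requests response object; the data itself is available through its text attribute or its iter_lines() method, which other reports on this page use. The sketch below turns such byte lines into rows of fields. rows_from_lines is a hypothetical helper, and the sample bytes stand in for what response.iter_lines() would yield.

```python
def rows_from_lines(byte_lines):
    """Turn tab-separated byte lines into lists of text fields."""
    rows = []
    for raw in byte_lines:
        if not raw:
            continue  # skip empty lines
        rows.append(raw.decode("utf-8").split("\t"))
    return rows


# Stand-in for response.iter_lines(); real output depends on the query.
sample = [b"ENSG00000007372\tPAX6", b"", b"ENSG00000012048\tBRCA1"]
for row in rows_from_lines(sample):
    print(row)
```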

requests.exceptions.TooManyRedirects: Exceeded 30 redirects

I am trying to fetch some data from

server = BiomartServer("http://mar2016.archive.ensembl.org/biomart/martview")

However, I get this error:

Traceback (most recent call last):
  File "./importBiomart.py", line 29, in <module>
    server = BiomartServer("http://mar2016.archive.ensembl.org/biomart/martview")
  File "/usr/local/lib/python2.7/dist-packages/biomart/server.py", line 27, in __init__
    self.assert_alive()
  File "/usr/local/lib/python2.7/dist-packages/biomart/server.py", line 48, in assert_alive
    self.get_request()
  File "/usr/local/lib/python2.7/dist-packages/biomart/server.py", line 103, in get_request
    r = requests.get(self.url, proxies = proxies)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 70, in get
    return request('get', url, params=params, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 56, in request
    return session.request(method=method, url=url, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
    resp = self.send(prep, **send_kwargs)
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 630, in send
    history = [resp for resp in gen] if allow_redirects else []
  File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 111, in resolve_redirects
    raise TooManyRedirects('Exceeded %s redirects.' % self.max_redirects, response=resp)
requests.exceptions.TooManyRedirects: Exceeded 30 redirects.

Please help, pretty urgent.
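
Verbose output in another report on this page shows the package appending /martservice to the URL it is given, so it seems to expect the base address ending in /biomart rather than the /martview web interface. Whether the extra path segment is what triggers the redirect loop is an assumption. The sketch below only normalizes the URL; normalize_biomart_url is a hypothetical helper.

```python
def normalize_biomart_url(url):
    """Strip a trailing /martview or /martservice so only the base remains."""
    url = url.rstrip("/")
    for suffix in ("/martview", "/martservice"):
        if url.endswith(suffix):
            url = url[: -len(suffix)]
    return url


print(normalize_biomart_url("http://mar2016.archive.ensembl.org/biomart/martview"))
```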

use pointer attributes for linking datasets

UnicodeEncodeError: 'ascii' codec can't encode characters in position 749834-749837: ordinal not in range(128)

Error Msg:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 749834-749837: ordinal not in range(128)

File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1641, in feed

Has anyone else run into the same error?