
nosql-biosets's Introduction

Project aim and summary

The NoSQL-biosets project includes scripts for indexing and querying selected free bioinformatics datasets. In addition to datasets, the project aims to support common bioinformatics data types and formats, such as GFF. Elasticsearch and MongoDB are the two databases supported for most datasets included in the project. Neo4j and PostgreSQL support was implemented as a third database option for a few datasets, namely IntEnz, PubTator and HGNC.

Datasets supported

Datasets that had more attention and have more stable support:

Datasets that have been added recently:

Datasets that have had less attention since initial support was added to the project:

The project aims to connect the above datasets by implementing query APIs for common query patterns over individual and multiple indexes. It also includes initial work on returning query results from the IntEnz, DrugBank, HMDB, ModelSEEDdb, and MetaNetX datasets as graphs.

A sister project aims to develop index scripts for sequence similarity search results, either in NCBI-BLAST JSON format or in BLAST tabular format, which is also used by other search programs such as LAMBDA and DIAMOND. The HSPsDB project aims to link the indexed search results to the datasets indexed with this project, nosqlbiosets.
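For readers unfamiliar with the BLAST tabular format mentioned above, the following is a minimal sketch of reading it in Python. It assumes the default twelve columns of the tabular output and is not taken from the HSPsDB code; the names `BLAST_TABULAR_FIELDS` and `read_blast_tabular` are hypothetical.

```python
# Default column names of BLAST tabular output (assumption: no custom columns)
BLAST_TABULAR_FIELDS = (
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore")

FLOAT_FIELDS = ("pident", "evalue", "bitscore")


def read_blast_tabular(lines):
    """Yield one dict per hit from tab-separated BLAST tabular lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(BLAST_TABULAR_FIELDS, line.split("\t")))
        for name in BLAST_TABULAR_FIELDS[2:]:
            # Identity, e-value and bit score are decimals; the rest are counts
            # or coordinates
            convert = float if name in FLOAT_FIELDS else int
            hit[name] = convert(hit[name])
        yield hit
```

Each yielded dictionary could then be sent to a database as one document, for example by passing an open file to `read_blast_tabular`.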

Installation

Download the nosqlbiosets project source code and install the required libraries:

git clone https://bitbucket.org/hspsdb/nosql-biosets.git
cd nosql-biosets
pip install -r requirements.txt --user

Since this project is still in its early stages, you may need to check and modify the source code of the scripts from time to time. For this reason, install nosqlbiosets in development mode into your local Python library/package folders using the setup.py develop and --user options; this should allow you to run the index scripts from the project source folders:

python setup.py develop --user

Default values for the hostnames and port numbers of the Elasticsearch and MongoDB servers are read from the ./conf/dbservers.json file. Save your settings in this file to avoid entering the --host and --port parameters on the command line.
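How file defaults and command-line values can be combined is sketched below. The JSON layout shown in the docstring and the function names `load_server_defaults` and `resolve_server` are assumptions made for illustration only; check ./conf/dbservers.json in the project for the actual structure.

```python
import json


def load_server_defaults(path, db):
    """Return (host, port) defaults for `db` from a JSON settings file.

    Assumed (hypothetical) layout, not necessarily the project's own:
        {"Elasticsearch": {"host": "localhost", "port": 9200},
         "MongoDB": {"host": "localhost", "port": 27017}}
    """
    with open(path) as conf_file:
        settings = json.load(conf_file)
    entry = settings[db]
    return entry["host"], int(entry["port"])


def resolve_server(path, db, host=None, port=None):
    """Prefer explicit host/port values, fall back to the file defaults."""
    default_host, default_port = load_server_defaults(path, db)
    return (host if host is not None else default_host,
            port if port is not None else default_port)
```

With this arrangement, a value given on the command line always wins, and the settings file only fills in what was left out.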

Usage

Example command lines for downloading the UniProt Knowledgebase Swiss-Prot data set (~690M) and for indexing it:

$ wget ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/\
knowledgebase/complete/uniprot_sprot.xml.gz

Make sure your Elasticsearch server is running on localhost. If you are new to Elasticsearch and you are using Linux, the easiest way is to download Elasticsearch with the TAR option (~32M). After extracting the tar file, cd to your Elasticsearch folder and run the ./bin/elasticsearch command.
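One way to confirm that the server is reachable before starting a long indexing run is sketched below, using only the Python standard library. It assumes an unsecured HTTP endpoint on the default port 9200, as in the curl example in this README; the function name `elasticsearch_version` is hypothetical.

```python
import json
import urllib.request


def elasticsearch_version(host="localhost", port=9200, timeout=5):
    """Return the server's version string, or None if it is not reachable."""
    url = "http://%s:%d/" % (host, port)
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            info = json.load(response)
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts and HTTP errors;
        # ValueError covers replies that are not valid JSON
        return None
    if not isinstance(info, dict):
        return None
    return info.get("version", {}).get("number")
```

A return value of None suggests the server is not running, is still starting up, or is listening on a different host or port.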

The downloaded UniProt XML file can be indexed by running the following command from the nosqlbiosets project root folder. Indexing typically requires 2 to 8 hours with Elasticsearch, and between 1 and 5 hours with MongoDB:

./nosqlbiosets/uniprot/index.py ./uniprot_sprot.xml.gz\
   --host localhost --db Elasticsearch --index uniprot

Example query: list most mentioned gene names

curl -XGET "http://localhost:9200/uniprot/_search?pretty=true"\
 -H 'Content-Type: application/json' -d'
{
  "size": 0,
  "aggs": {
    "genes": {
      "terms": {
        "field": "gene.name.#text.keyword",
        "size": 5
      },
      "aggs": {
        "tids": {
          "terms": {
            "field": "gene.name.type.keyword",
            "size": 5
          }
        }
      }
    }
  }
}'
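The same aggregation can be sent from Python. The sketch below uses only the standard library and is not the project's own query code; the function names are hypothetical, and the host, port and index name are taken from the commands above. The request is sent with POST, which Elasticsearch accepts for _search in the same way as GET.

```python
import json
import urllib.request


def gene_names_query(size=5):
    """Build the same terms aggregation as the curl example above."""
    return {
        "size": 0,
        "aggs": {
            "genes": {
                "terms": {"field": "gene.name.#text.keyword", "size": size},
                "aggs": {
                    "tids": {
                        "terms": {"field": "gene.name.type.keyword",
                                  "size": size}
                    }
                },
            }
        },
    }


def top_gene_names(response):
    """Extract (gene name, document count) pairs from a search response."""
    buckets = response["aggregations"]["genes"]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


def search(body, host="localhost", port=9200, index="uniprot"):
    """POST the query to the _search endpoint and decode the JSON reply."""
    request = urllib.request.Request(
        "http://%s:%d/%s/_search" % (host, port, index),
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST")
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)
```

With a running server and the uniprot index in place, `top_gene_names(search(gene_names_query()))` should return the most frequent gene names together with their document counts.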

Check ./tests/test_uniprot_queries.py and ./nosqlbiosets/uniprot/query.py for example queries with Elasticsearch and MongoDB.

Similar Work

  • https://github.com/daler/gffutils: "GFF and GTF files are loaded into SQLite3 databases, allowing much more complex manipulation of hierarchical features (e.g., genes, transcripts, and exons) than is possible with plain-text methods alone"

    We are inspired by the gffutils project. Needless to say, the nosql-biosets project does not yet have a level of maturity comparable to the gffutils library.

  • https://github.com/quinlan-lab/vcf2db (SQLite, MySQL, PostgreSQL)

Copyright

The NoSQL-biosets project has been developed at King Abdullah University of Science and Technology, http://www.kaust.edu.sa

The NoSQL-biosets project is licensed under the MIT license. If you would like to support the project by selecting a different license, you can discuss this by contacting the relevant offices of KAUST.

Acknowledgements

  • Computers and systems used in developing this work have been maintained by John Hanks and Arnaud Hungler

Contributors

  • uludag
