SIFTER

###A pipeline for large-scale phylogeny-based protein function prediction using SIFTER (Statistical Inference of Function Through Evolutionary Relationships)

SIFTER is a statistical approach to predicting protein function that uses a protein family's phylogenetic tree, as the natural structure for representing protein relationships, overlaid with all known protein functions in the family.

This package provides a pipeline for large-scale protein function prediction using the SIFTER algorithm, so it can be used for genome-wide protein function prediction.

The large-scale implementation of SIFTER was developed by Sayed Mohammad Ebrahim Sahraeian at the Department of Plant and Microbial Biology, University of California, Berkeley.

Please cite the new paper: Sahraeian SME, Luo KR, Brenner SE (2015)

The original SIFTER algorithm was developed by Barbara E Engelhardt. Original paper:

Engelhardt BE, Jordan MI, Srouji JR, Brenner SE. 2011. Genome-scale phylogenetic function annotation of large and diverse protein families. Genome Research 21:1969-1980. doi:10.1101/gr.104687.109

You can also use the SIFTER web server at http://sifter.berkeley.edu to access predictions online for 16,863,537 proteins across 232,403 species. More information:

Sahraeian SME, Luo KR, Brenner SE. 2015. SIFTER Search: A web server for accurate phylogeny-based protein function prediction. Nucleic Acids Research, to appear

Other previous developers: Philip Johnson, Steven R. Chan, Micheal Souza

##Download Package

mkdir sifter_large_scale
cd sifter_large_scale
git clone https://github.com/BrennerLab/SIFTER.git

Extract lib and data folders:

cd SIFTER/large_scale_v1.0
tar -xzvf data.tar.gz
tar -xzvf lib.tar.gz

##Scripts:

sifter_find_families.py             Finds Pfam families for your 
                                    query protein or species.

sifter_gather_family_data.py        Gathers necessary 'alignment', 
                                    'tree', and 'evidence' files 
                                    needed to run SIFTER for each 
                                    query family.
                                    [NOTE: Only use this script if you 
                                    don't wish to use the precomputed 
                                    data files built based on latest 
                                    releases of Pfam and GOA 
                                    databases, OR if you have a set 
                                    of sequences (of a novel genome) 
                                    which is not already in Pfam]

sifter_prepare.py                   Prepares necessary files for
                                    your query to run SIFTER on.

sifter_run.py                       Runs SIFTER on the prepared
                                    files generated by
                                    'sifter_prepare.py'.

sifter_extract.py                   Extracts SIFTER predictions for
                                    your query proteins, species, or
                                    families, using the outputs of
                                    'sifter_run.py'.

sifter_build_sql_database.py        Builds the MySQL database that is
                                    needed to run all other scripts.
                                    [NOTE: Only use this script if 
                                    you don't wish to use the 
                                    precomputed SQL database built 
                                    based on latest releases of 
                                    Pfam and GOA databases.]

##Requirements:

1-Install the necessary prerequisite packages:

sudo apt-get update
sudo apt-get install mysql-server build-essential python-dev python-pip \
    libmysqlclient-dev liblapack-dev libatlas-dev gfortran \
    default-jdk git 

Python packages:
    sudo pip install numpy 
    sudo pip install scipy
    sudo pip install Biopython
    sudo pip install ete2
    sudo pip install MySQL-python
    sudo pip install sqlite

TO BE ABLE TO RUN ON NEW GENOMES (which don't exist in the current Pfam release):

2-Install PfamScan source code and data (Documentation at: https://wiki.gacrc.uga.edu/wiki/Pfam_scan)

wget ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools/PfamScan.tar.gz
tar -zxvf PfamScan.tar.gz
cd PfamScan
wget ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.dat.gz
gunzip Pfam-A.hmm.dat.gz
wget ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
wget ftp://ftp.sanger.ac.uk/pub/databases/Pfam/current_release/active_site.dat.gz
gunzip active_site.dat.gz
hmmpress Pfam-A.hmm

3-Install Perl packages:
    sudo apt-get install libmoose-perl
    sudo cpan BioPerl
    sudo cpan IPC::Run

4-Install hmmer:
    wget ftp://selab.janelia.org/pub/software/hmmer3/3.1b1/hmmer-3.1b1.tar.gz
    tar -zxvf hmmer-3.1b1.tar.gz
    cd hmmer-3.1b1
    mkdir build
    ./configure
    make
    sudo make install

5-Install TAXTASTIC:
    sudo pip install taxtastic

6-Fix the paths to hmmbuild, hmmpress, hmmalign, and taxit if not already in your PATH.
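
One way to confirm step 6 before running the pipeline is to check that each tool resolves on your PATH. The following is a minimal sketch, not part of the SIFTER package: the helper name `find_missing_tools` is hypothetical, and it targets Python 3 (`shutil.which`), independently of the pipeline scripts.

```python
import shutil

# Tools named in step 6 of the requirements list.
REQUIRED_TOOLS = ("hmmbuild", "hmmpress", "hmmalign", "taxit")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH.

    `which` is injectable so the check can be tested without the
    tools being installed; by default it is shutil.which.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Not on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

If a tool is reported as missing, add its install directory to PATH (or adjust the paths as described in step 6) and run the check again.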

##Usage

Before running SIFTER you should perform the one-time data setup (STEPs 1 and 2 below).

NOTE: To run the scripts first change directory to the scripts folder

	cd scripts/large_scale_v1.0/

STEP 1-Build SIFTER MySQL databases:

To run the Python scripts in this package you need a MySQL
database that holds the necessary information in its tables.

--You can download and use the precomputed database as follows
(the .gz file is ~34GB):

    cd data
    wget sifter.berkeley.edu/media/sifter_db.gz
    mysqladmin -u user -p create sifter_db
    gunzip < sifter_db.gz | mysql -u user -p sifter_db

[Replace 'user' with your MySQL user name; -p prompts for the
password. You may need to set max_allowed_packet=1024M in the
[mysqld] section of your MySQL my.cnf and then restart MySQL
before loading the dump.]

--Alternatively, you can build the MySQL database from scratch from
the latest available data using the 'sifter_build_sql_database.py'
script (this may take 1-2 days):

    python sifter_build_sql_database.py ../example/tmp_output

STEP 2-Obtain family data

To run SIFTER you need to prepare the necessary tree and evidence files.
We have prepared the necessary data files based on the latest releases of
the Pfam and Gene Ontology Annotation databases. You can download those
files as follows (the .tar.gz file is ~25GB):

    wget sifter.berkeley.edu/media/families_data.tar.gz
    tar -xzvf families_data.tar.gz
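
Both downloads in Steps 1 and 2 are large, so it can help to check free disk space first. The sketch below is an illustration only: the helper `has_room_for` is hypothetical, the sizes are the approximate compressed sizes quoted in this README, and the space needed after extraction is not stated here, so treat the result as a lower bound.

```python
import shutil

# Approximate compressed sizes quoted in this README, in gigabytes.
ARCHIVE_SIZES_GB = {"sifter_db.gz": 34, "families_data.tar.gz": 25}


def has_room_for(required_gb, path=".", disk_usage=shutil.disk_usage):
    """Return True if `path` has at least `required_gb` GB free.

    One GB is counted as 10**9 bytes. `disk_usage` is injectable so the
    check can be tested without touching the real file system.
    """
    free_bytes = disk_usage(path).free
    return free_bytes >= required_gb * 10**9
```

For example, `has_room_for(ARCHIVE_SIZES_GB["families_data.tar.gz"])` checks the current directory before you start the `wget` download.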

Alternatively, you may gather the necessary family data using the
'sifter_gather_family_data.py' script. You can use this script to
gather data for ALL Pfam families (this may take 1-2 days) or ONLY
the Pfam families you need for running your query.

Example:
-To create family data for two Pfam families, PF12491 and PF13820:
    python sifter_gather_family_data.py -f PF12491,PF13820 ../example/fam_data

-To create family data for the families in which a specific gene has domains:
    python sifter_find_families.py -p C0JYY2_HUMAN ../example/family_list.txt
    python sifter_gather_family_data.py -i ../example/family_list.txt ../example/fam_data

-To create family data for all Pfam families:
    python sifter_find_families.py -A ../example/family_list.txt
    python sifter_gather_family_data.py -i ../example/family_list.txt ../example/fam_data
    
-If you already have the precomputed family data and want to keep
the alignment and tree data (since the Pfam database does not change
quickly) but update the evidence annotations to the most recent version,
you may use the following:
    mv families_data/annotations families_data/old_annotations
    python sifter_build_sql_database.py --goa --ont --id ../example/tmp_output
    python sifter_find_families.py -A ../example/family_list.txt
    python sifter_gather_family_data.py -i ../example/family_list.txt path/to/families_data
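
The `-f` option in the examples above takes a comma-separated list of Pfam accessions. If you assemble that list programmatically, a small validation step can catch typos before a long run starts. This is a sketch with hypothetical helper names (`parse_family_ids`, `as_family_flag`); it relies only on the fact that Pfam accessions have the form `PF` followed by five digits, and it is written for Python 3.

```python
import re

# Pfam accessions are "PF" followed by five digits, e.g. PF12491.
PFAM_ACCESSION = re.compile(r"PF\d{5}")


def parse_family_ids(text):
    """Split a comma-separated string into validated Pfam accessions."""
    ids = [item.strip() for item in text.split(",") if item.strip()]
    invalid = [item for item in ids if not PFAM_ACCESSION.fullmatch(item)]
    if invalid:
        raise ValueError("Not a Pfam accession: " + ", ".join(invalid))
    return ids


def as_family_flag(ids):
    """Join accessions into the comma-separated value passed to -f."""
    return ",".join(ids)
```

For instance, `as_family_flag(parse_family_ids("PF12491, PF13820"))` yields the value used with `-f` in the first example.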

STEP 3-Run SIFTER for your query

Given that the MySQL database (Step 1) and family data (Step 2) are ready, you can
start running SIFTER on your queries.

Example:
-To run SIFTER on some families:
    python sifter_prepare.py -f PF12491,PF13820 path/to/families_data ../example/queries
    python sifter_run.py ../example/queries ../example/results
    python sifter_extract.py -f PF12491,PF13820 ../example/results ../example/preds.txt

-To run SIFTER on a gene:
    python sifter_prepare.py -p C0JYY2_HUMAN path/to/families_data ../example/queries
    python sifter_run.py ../example/queries ../example/results
    python sifter_extract.py -p C0JYY2_HUMAN ../example/results ../example/preds.txt

-To run SIFTER on a species (NOTE: this usually takes a long time, as a species has
domains in many families):
    python sifter_prepare.py -s 9823 path/to/families_data ../example/queries
    python sifter_run.py ../example/queries ../example/results
    python sifter_extract.py -s 9823 ../example/results ../example/preds.txt
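
All three examples follow the same prepare, run, extract sequence and differ only in the query option. The sketch below shows one way to script that sequence from Python 3. The wrapper functions `build_pipeline` and `run_pipeline` are hypothetical and not part of the package; the script names, option letters, and argument order are taken from the examples above.

```python
import subprocess


def build_pipeline(query_flag, query_value, families_dir, queries_dir,
                   results_dir, preds_file):
    """Return the prepare, run, and extract commands as argv lists."""
    if query_flag not in ("-f", "-p", "-s"):
        raise ValueError("query_flag must be -f, -p, or -s")
    query = [query_flag, str(query_value)]
    return [
        ["python", "sifter_prepare.py"] + query + [families_dir, queries_dir],
        ["python", "sifter_run.py", queries_dir, results_dir],
        ["python", "sifter_extract.py"] + query + [results_dir, preds_file],
    ]


def run_pipeline(commands, runner=subprocess.check_call):
    """Run each command in order.

    With the default runner, check_call raises on a non-zero exit
    status, so a failed step stops the steps that depend on it.
    """
    for command in commands:
        runner(command)
```

Calling `run_pipeline(build_pipeline("-s", 9823, "path/to/families_data", "../example/queries", "../example/results", "../example/preds.txt"))` from the scripts folder corresponds to the species example.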

##Run SIFTER on a new genome (or set of genes) not already in the Pfam database

If your query genes are not already in the Pfam database, you cannot
use the precomputed family data to build the phylogenetic tree
and evidence files that SIFTER needs.

So, you need to find the families in which the query genes have domains
and add those genes to the current trees. The following steps help you
do this (NOTE: you should have followed Steps 2-6 of the
Requirements):
    a) First make sure your query data is not in Pfam. Let's say you have
       a list of genes (UniProt IDs) in protein_list.txt. Run:
       python sifter_find_families.py --ip ../example/protein_list.txt ../example/family_list.txt

       If it returns zero hits in family_list.txt, then you have a
       set of new genes.
       
    b) Find the Pfam domains of your query genes (we assume the FASTA file
       of your query genes is at '../example/myseq.fasta'), then gather
       the family data and run SIFTER:

           perl /path/to/PfamScan/pfamscan.pl --fasta ../example/myseq.fasta --dir /path/to/PfamScan/ -e_dom 1 -e_seq 1 -outfile  ../example/pfam_res.txt
           python sifter_gather_family_data.py -A --seq_file ../example/myseq.fasta --hit_file ../example/pfam_res.txt --taxid 1192197 path/to/families_data
           python sifter_prepare.py -A --hit_file ../example/pfam_res.txt path/to/families_data ../example/queries
           python sifter_run.py ../example/queries ../example/results
           python sifter_extract.py -A --hit_file ../example/pfam_res.txt ../example/results ../example/preds.txt

       If you want to run only on a specific set of families, you can use:
           perl /path/to/PfamScan/pfamscan.pl --fasta ../example/myseq.fasta --dir /path/to/PfamScan/ -e_dom 1 -e_seq 1 -outfile  ../example/pfam_res.txt
           python sifter_gather_family_data.py -f PF12491,PF13820 --seq_file ../example/myseq.fasta --hit_file ../example/pfam_res.txt --taxid 1192197 path/to/families_data
           python sifter_prepare.py -f PF12491,PF13820 --hit_file ../example/pfam_res.txt path/to/families_data ../example/queries
           python sifter_run.py ../example/queries ../example/results
           python sifter_extract.py -f PF12491,PF13820 --hit_file ../example/pfam_res.txt ../example/results ../example/preds.txt
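
Before gathering family data, it can be useful to see which families PfamScan actually found for each sequence, for example to choose the list passed to `-f`. The sketch below is not part of the package and the function name is hypothetical. It assumes the usual PfamScan text layout: whitespace-separated columns, `#` comment lines, the sequence ID in column 1, and the versioned HMM accession (such as `PF12491.3`) in column 6. Check the header of your own `pfam_res.txt` before relying on it.

```python
from collections import defaultdict


def families_by_sequence(lines):
    """Map each sequence ID to the sorted Pfam accessions it hits.

    Assumes column 1 is the sequence ID and column 6 is the HMM
    accession, which is an assumption about the PfamScan layout.
    """
    hits = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.split()
        if len(fields) < 6:
            continue  # not a hit row in the assumed layout
        seq_id, hmm_acc = fields[0], fields[5]
        # Drop the version suffix: PF12491.3 -> PF12491
        hits[seq_id].add(hmm_acc.split(".")[0])
    return {seq_id: sorted(accs) for seq_id, accs in hits.items()}
```

With a real results file you would call it as `families_by_sequence(open("../example/pfam_res.txt"))`, since an open file iterates line by line.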

##Estimating the running time

Once you run the 'sifter_prepare.py' script to prepare your query, a
file ('running_estimation.csv') is generated in the output folder.
It lists the individual families SIFTER will be run on, along with
the estimated running time for each family.
You can also use the following webpage to estimate the SIFTER
running time based on different family features:
http://sifter.berkeley.edu/complexity/

##Control on Speed

You can control the speed of running SIFTER using the -x and -t options in the 'sifter_prepare.py' script:

    -x   INT     Maximum number of nonzero elements
                 in the transition matrix. Should be
                 a number in [1e5,1e7] for a reasonable
                 balance of time and accuracy [Default: 1e6].
                 A smaller value leads to a faster running time.

    -t   INT     Number of functions to truncate
                 to in approximation [Default: adaptive based
                 on -x option].
                 A smaller value leads to a faster running time.
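
If you generate `sifter_prepare.py` commands from a script, the two options can be built and range-checked in one place. This is a sketch with a hypothetical helper, `speed_options`, written for Python 3. Note one design choice: it treats the recommended range for `-x` as a hard limit, which is stricter than the option description requires.

```python
X_MIN, X_MAX = 10**5, 10**7  # range recommended for -x in the option text


def speed_options(max_nonzero=10**6, truncate_to=None):
    """Build the -x/-t arguments for sifter_prepare.py as a list of strings.

    Values are converted to plain integers, so 1e6 becomes "1000000".
    -t is omitted when truncate_to is None, leaving the adaptive default.
    """
    max_nonzero = int(max_nonzero)
    if not X_MIN <= max_nonzero <= X_MAX:
        raise ValueError("-x should be within [1e5, 1e7]")
    args = ["-x", str(max_nonzero)]
    if truncate_to is not None:
        truncate_to = int(truncate_to)
        if truncate_to < 1:
            raise ValueError("-t must be a positive integer")
        args += ["-t", str(truncate_to)]
    return args
```

For a faster but coarser run you might use `speed_options(1e5)`, and append the returned list to the rest of the `sifter_prepare.py` arguments.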

##Multi-threading:

You can run the 'sifter_gather_family_data.py' and 'sifter_run.py' scripts with multiple threads using the -n option (default: 4).

##Datasets used:

Used in Version 1.0:

Pfam 27.0 
(March 2013, 14831 families)

GOA 
ftp://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/gene_association.goa_uniprot.gz 
(Updated on March 31, 2015)

UniProt ID Mapping
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/idmapping/
(Updated on Jan 7th, 2015)

Gene Ontology
http://archive.geneontology.org/latest-termdb/go_daily-termdb.rdf-xml.gz
(Updated on March 31, 2015)

NCBI Species tree
ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
(Updated on March 3rd, 2015)

##Version History

5/15/2015 large_scale_v1.0 source code and data released
