wglab / phen2gene

Phenotype driven gene prioritization for HPO

License: MIT License

phen2gene's Introduction

Phen2Gene

Phen2Gene is a phenotype-driven gene prioritization tool that takes HPO (Human Phenotype Ontology) IDs as input, then searches for and prioritizes candidate causal disease genes. It is distributed under the MIT License by the Wang Genomics Lab. We also provide a web server and an associated RESTful API service for running Phen2Gene. Finally, a mobile app for Phen2Gene and several other genetic diagnostic tools from our lab is being tested and will be available soon.

Citing Phen2Gene

Please cite: Zhao, M., Havrilla, J. M., Fang, L., Chen, Y., Peng, J., Liu, C., Wu, C., Sarmady, M., Botas, P., Isla, J., Lyon, G., Weng, C., Wang, K. (2020). Phen2Gene: Rapid Phenotype-Driven Gene Prioritization for Rare Diseases. NAR Genomics and Bioinformatics, Volume 2, Issue 2, June 2020, lqaa032

Prerequisites

If you do not wish to use Anaconda, simply install the packages listed in the file environment.yml using pip. If you use conda, run conda update conda first; some packages may not install properly with an outdated conda.

Installation

Using Docker

If you are lucky enough to have Docker, or an equivalent such as Singularity or Podman, installation is easy as pie. Just download the Docker image with the following command:

docker pull genomicslab/phen2gene

Test out your Docker image with the below commands:

On Unix/Linux:

docker run -it --rm -v $PWD/out:/code/out -t genomicslab/phen2gene -m HP:0001250 -out out/prioritizedgenelist

On Windows Powershell:

docker run -it --rm -v ${PWD}/out:/code/out -t genomicslab/phen2gene -m HP:0001250 -out out/prioritizedgenelist

As of the Jan 2021 version of the HPO2Gene KnowledgeBase, the file out/prioritizedgenelist/output_file.associated_gene_list should begin with:

Rank    Gene    ID      Score       Status
1       KCNQ2   3785    1.0         SeedGene
2       KCNQ3   3786    0.936339    SeedGene
3       UBE3A   7337    0.93565     SeedGene
4       MECP2   4204    0.89883     SeedGene
5       FGFR2   2263    0.830351    SeedGene

If it does, you have succeeded. The arguments described below in this document also work with Docker if you replace python3 phen2gene.py with docker run -it --rm -v $PWD/out:/code/out -t genomicslab/phen2gene.
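
For scripted checks, the same comparison can be done in a few lines of Python. This is a minimal sketch that assumes only the five whitespace-separated columns shown in the listing above; the function name parse_gene_list and the inline SAMPLE text are illustrative and are not part of Phen2Gene.

```python
from typing import Dict, List


def parse_gene_list(text: str) -> List[Dict[str, object]]:
    """Parse a Rank/Gene/ID/Score/Status table into a list of dicts."""
    lines = [line for line in text.splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:  # lines[0] is the header row
        rank, gene, gene_id, score, status = line.split()
        rows.append({
            "rank": rank,          # kept as text; its exact numeric type is not assumed
            "gene": gene,
            "id": gene_id,
            "score": float(score),
            "status": status,
        })
    return rows


# First two rows of the expected output, copied from the listing above.
SAMPLE = (
    "Rank\tGene\tID\tScore\tStatus\n"
    "1\tKCNQ2\t3785\t1.0\tSeedGene\n"
    "2\tKCNQ3\t3786\t0.936339\tSeedGene\n"
)

rows = parse_gene_list(SAMPLE)
print(rows[0]["gene"], rows[0]["score"])  # KCNQ2 1.0
```

To check a real run, read the text from out/prioritizedgenelist/output_file.associated_gene_list instead of SAMPLE.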

In Anaconda

First, install Miniconda, a minimal installation of Anaconda that is much smaller and installs faster. Note that the script below is for Linux; macOS and Windows use different installers:

curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh

Go through all the prompts (installation in $HOME is recommended). After Miniconda is installed successfully, simply run:

git clone https://github.com/WGLab/Phen2Gene.git
cd Phen2Gene
conda env create -f environment.yml
conda activate phen2gene
bash setup.sh

General Use Case

This software can be used in one of three scenarios:

1. Ideally, you have a list of physician-curated HPO terms describing a patient phenotype, plus a list of potential candidate disease variants that overlap gene space, and you want to narrow down the list of variants by prioritizing candidate disease genes. This is often done in tandem with variant prioritization software, which cannot yet score STR expansions or SVs, unlike Phen2Gene, which is variant agnostic.
2. You do not have variants, but you have HPO terms and would like candidate genes for your disease that you may want to target for sequencing, as targeted sequencing is much cheaper than whole-genome or whole-exome sequencing.
3. If you have clinical notes, you can use tools like EHR-Phenolyzer or Doc2HPO to convert them into HPO terms using natural language processing (NLP) techniques, then apply scenario 1 or 2 as relevant.

Input files

Input files to Phen2Gene should contain HPO IDs, separated by UNIX-recognized new line characters (i.e., \n). Alternatively, you can pass a space-separated list of HPO IDs on the command line.
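
Either form can be checked before a run. The sketch below assumes that every token is an HPO ID of the form HP: followed by seven digits; read_hpo_ids is a hypothetical helper, not a Phen2Gene function.

```python
import re
from typing import List

# An HPO ID is "HP:" followed by exactly seven digits, e.g. HP:0001250.
HPO_ID = re.compile(r"HP:\d{7}")


def read_hpo_ids(text: str) -> List[str]:
    """Split on any whitespace (newlines or spaces) and validate each token."""
    ids = []
    for token in text.split():
        if not HPO_ID.fullmatch(token):
            raise ValueError(f"not an HPO ID: {token!r}")
        ids.append(token)
    return ids


print(read_hpo_ids("HP:0000021\nHP:0000027\nHP:0030905\n"))
```

Because the text is split on any whitespace, the same helper accepts both the newline-separated file contents and a space-separated command-line list.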

Examples of how to run Phen2Gene with the provided `HPO_sample.txt` file

Input HPO IDs via input file (typical use case)

python3 phen2gene.py -f example/HPO_sample.txt -out out/prioritizedgenelist

Input HPO IDs via input file, and candidate gene list file (another common use case)

python3 phen2gene.py -f example/HPO_sample.txt -out out/prioritizedgenelist -l example/1000genetest.txt

Use Skewness and Information Content

-w sk uses a skewness-based weighting of genes for each HPO term (default, and recommended)
-w w and -w ic do not use skew, but utilize information content in the tree structure (slightly worse performance)
-w u is unweighted

python3 phen2gene.py -f example/HPO_sample.txt -w sk -out out/prioritizedgenelist

Run Phen2Gene with verbose messages

python3 phen2gene.py -f example/HPO_sample.txt -v -out out/prioritizedgenelist

Input HPO IDs manually, if desired

python3 phen2gene.py -m HP:0000021 HP:0000027 HP:0030905 HP:0010628 -out out/prioritizedgenelist

Add H2GKB location manually, if desired

python3 phen2gene.py -f example/HPO_sample.txt -d full_path_to_H2GKB.zip_extraction_folder -out out/prioritizedgenelist
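
When Phen2Gene is called from another script, the command lines above can be assembled programmatically. The following is a sketch only: build_phen2gene_command is a hypothetical helper that uses just the flags shown in this section (-f, -m, -l, -w, -v, -d, -out), and it does not check whether Phen2Gene accepts every combination of them.

```python
import shlex
from typing import List, Optional, Sequence

# Weighting models listed in this README.
WEIGHT_MODELS = {"sk", "w", "ic", "u"}


def build_phen2gene_command(
    out_dir: str,
    hpo_file: Optional[str] = None,
    hpo_ids: Sequence[str] = (),
    gene_list: Optional[str] = None,
    weight: Optional[str] = None,
    verbose: bool = False,
    kb_dir: Optional[str] = None,
) -> List[str]:
    """Return an argument list for phen2gene.py without running anything."""
    if not hpo_file and not hpo_ids:
        raise ValueError("provide an HPO file (-f) or HPO IDs (-m)")
    if weight is not None and weight not in WEIGHT_MODELS:
        raise ValueError(f"unknown weighting model: {weight!r}")
    cmd = ["python3", "phen2gene.py"]
    if hpo_file:
        cmd += ["-f", hpo_file]
    if hpo_ids:
        cmd += ["-m", *hpo_ids]
    if gene_list:
        cmd += ["-l", gene_list]
    if weight:
        cmd += ["-w", weight]
    if verbose:
        cmd.append("-v")
    if kb_dir:
        cmd += ["-d", kb_dir]
    cmd += ["-out", out_dir]
    return cmd


cmd = build_phen2gene_command(
    "out/prioritizedgenelist",
    hpo_file="example/HPO_sample.txt",
    weight="sk",
)
print(shlex.join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from inside the cloned repository.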

RESTful API and Web Server

Examples of how to use the Web Server and the RESTful API can be found in the Docs.

Getting Help

Please use the Phen2Gene issues page if you have any questions!

Creating the benchmark data figures from the manuscript

In order, run:

bash setup.sh     # You can skip it if you ran it in the installation.

bash runtest.sh

If you only want the benchmark data and nothing else:

bash getbenchmark.sh /directory/to/download/to

The figures are in the folder figures.

Example of Use Case #1, where you have filtered candidate variants (also in the manuscript)

After editing example/ANKRD11example.sh so that the ANNOVAR db is built where you would like it, simply run:

bash example/ANKRD11example.sh

Going through the code in example/ANKRD11example.sh, first one downloads a list of candidate variants from the article referenced in the manuscript where the patient has KBG syndrome.

Then, we annotate with ANNOVAR to retrieve gene annotations for these variants, functional consequence information (exonic, intronic, nonsynonymous), amino acid change information, and population frequency.

We next filter out common variants (>1% in gnomAD 2.1.1) and use Phen2Gene to rank the candidate genes based on HPO terms.

Combining this information with the variants, we can re-rank Phen2Gene's candidate list as in the script filterbyannovar.py and discover that the causal gene ANKRD11 is now ranked number 1 after being ranked number 2 by HPO terms alone. The number 1 ranked gene by HPO, VPS13B, is filtered out because its only candidate variant (8-100133706-T-G) has an extremely high allele frequency in gnomAD (74%!).
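
The re-ranking idea can be illustrated with a small Python sketch. This is not the code in filterbyannovar.py; it is a simplified stand-in that assumes each gene maps to a list of gnomAD allele frequencies for its candidate variants, and it keeps a gene only if at least one variant is at or below the 1% threshold. The function name is hypothetical and the ANKRD11 frequency is a placeholder, since only the VPS13B value (74%) is given in the text.

```python
from typing import Dict, List


def rerank_by_rare_variants(
    ranked_genes: List[str],
    variant_afs: Dict[str, List[float]],
    max_af: float = 0.01,
) -> List[str]:
    """Keep genes with at least one candidate variant at or below max_af,
    preserving the original Phen2Gene order."""
    return [
        gene
        for gene in ranked_genes
        if any(af <= max_af for af in variant_afs.get(gene, []))
    ]


phen2gene_rank = ["VPS13B", "ANKRD11"]  # order by HPO terms, as described above
variant_afs = {
    "VPS13B": [0.74],   # 8-100133706-T-G, 74% in gnomAD
    "ANKRD11": [0.0],   # placeholder value: assumed rare
}
print(rerank_by_rare_variants(phen2gene_rank, variant_afs))  # ['ANKRD11']
```

Genes with no candidate variant at all are dropped as well, which matches the goal of ranking only genes that carry a plausible variant.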

phen2gene's Issues

Add code to filter on candidate genes

Mengge and I need to add an option to filter on candidate genes, because we need it to properly compare against tools that require gene lists, like Phevor or AMELIE. Additionally, it is a good option for users who already have a gene list in mind. All users should note that this is still 100% optional and not required to use the tool at all.

None of the HPO terms in the example is valid?

Hi, just doing a first pass here, but none of the HPO terms in the example is recognized as valid:

$ ./phen2gene.py -v -m HP:0000001 HP:0000021 HP:0000027 HP:0030905 HP:0030910 HP:0010628 -out out/out

HPO weighting model: Ontology-based Informatin Content

HP:0000001 is not a valid human phenotype.
Phen2Gene skipped it.
HP:0030910 is not a valid human phenotype.
Phen2Gene skipped it.
HP:0000021 is not a valid HPO term. cal
HP:0000027 is not a valid HPO term. cal
HP:0030905 is not a valid HPO term. cal
HP:0010628 is not a valid HPO term. cal
Finished.
Output path: out/out/

I haven't checked the source code yet, but any hints?

My environment

Python 3.7.3
anaconda Command line client (version 1.7.2)
Linux

Phen2Gene package vs publication

Hi!
Looking into this repository, it seems like it is built to replicate the results from the publication. Am I right?

We are going to deploy Phen2Gene, which means we will probably remove the items that are intended for the publication only. To keep full traceability with you, it may make sense for you to split the repository so that we can fork a new one intended for the Phen2Gene code only. My point is that any development we do on the package should point to the Wang Lab's GitHub, and with the current setup the two repositories will quickly diverge.

Suggestion: simply create a copy of this named Phen2Gene-publication (or something like that) and we would work on a fork of this current repository.

The first step for us would be to remove unnecessary files and add a dockerfile.

Any thoughts about this? :)

Updated Knowledgebase

Hi!
Do you have pipelines to update the KB?

Thanks!

input file format

How can I use an input file of manually extracted phenotype terms to run Phen2Gene? If I use EHR-phenotype to convert the terms to HPO IDs, is the resulting input different from Phenolyzer's input?

Issue to install `decorator`

As asked by Prof. Wang, I tested the installation with my Anaconda on the CHOP HPC, and it failed to install decorator. The error is given below:

2020-04-29 11:43:04,489 - conda.core.link - ERROR - An error occurred while installing package 'conda-forge::decorator-4.4.1-py_0'.
OSError(2, 'No such file or directory')
Attempting to roll back.

This might be related to my old version of anaconda (4.4.8), since I can successfully install it in biocluster (conda version 4.6.3).

This might be helpful for those users with older anaconda.

webserver down

Is the service down temporarily or permanently?

Suggested improvements to environment yaml and to include absolute path of "lib" folder in the python scripts

Hi,
I have been trying to work with Phen2Gene. It seems like a great tool, but I found a few small issues that, if resolved, would make it more user-friendly.

Issues with installation on Mac OSX: There are dependencies that are specific to Linux (like ld_impl_linux, libgcc-ng, libgfortran-ng and libstdcxx-ng). The build tags on other dependencies are also Linux-specific. For example, in python=3.8.0=h357f687_5 the tag "h357f687_5" is specific to Linux, and I don't think it should be included in the YAML, so that the file is compatible with Mac OSX as well. Just "python=3.8.0" should work fine for all machines.
This yaml file should work for all machines:
"
name: phen2gene
channels:
  - conda-forge
  - bioconda
  - defaults
dependencies:
  - _libgcc_mutex=0.1=main
  - ca-certificates=2019.11.28=hecc5488_0
  - certifi=2019.11.28=py38_0
  - cycler=0.10.0=py_2
  - decorator=4.4.1=py_0
  - freetype=2.10.0
  - icu=64.2
  - kiwisolver=1.1.0
  - libblas=3.8.0=14_openblas
  - libcblas=3.8.0=14_openblas
  - libffi=3.2.1
  - liblapack=3.8.0=14_openblas
  - libopenblas=0.3.7
  - libpng=1.6.37
  - matplotlib-base=3.1.2
  - ncurses=6.1
  - networkx=2.4=py_0
  - numpy=1.17.3
  - openssl=1.1.1d
  - pandas=0.25.3
  - patsy=0.5.1=py_0
  - pip=19.3.1=py38_0
  - pyparsing=2.4.5=py_0
  - python=3.8.0
  - python-dateutil=2.8.1=py_0
  - pytz=2019.3=py_0
  - readline=8.0
  - scipy=1.3.2
  - seaborn=0.9.0=py_2
  - setuptools=42.0.2=py38_0
  - six=1.13.0=py38_0
  - sqlite=3.30.1
  - statsmodels=0.10.2
  - tk=8.6.10
  - tornado=6.0.3
  - wheel=0.33.6=py38_0
  - xz=5.2.4
  - zlib=1.2.11
"

'./lib' reference in several python scripts like: phen2gene.py, lib/weight_assignment.py, lib/calculation.py. If you run phen2gene from outside the cloned folder, it does not throw any error; it just skips each term, saying the HPO ID is not a valid human phenotype, when the real problem is that the knowledgebase was not found. The error should be more explicit, and the user should be able to run phen2gene from outside the cloned repo. A nice thing would be to have a setup.py that installs phen2gene and updates the path of the knowledgebase etc. based on where the user wants to install it.
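
One way to address this, shown here only as a sketch and not as the project's actual fix, is to resolve the knowledgebase directory relative to the script's own location and fail loudly when it is missing. The helper name resolve_kb_dir and the default subdirectory name are assumptions for illustration.

```python
import tempfile
from pathlib import Path


def resolve_kb_dir(script_path: str, kb_subdir: str = "lib") -> Path:
    """Locate kb_subdir next to the script, independent of the current
    working directory, and raise if it does not exist."""
    kb_dir = Path(script_path).resolve().parent / kb_subdir
    if not kb_dir.is_dir():
        raise FileNotFoundError(f"knowledgebase directory not found: {kb_dir}")
    return kb_dir


# Demonstration in a throwaway directory standing in for the cloned repo.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "lib").mkdir()
    found = resolve_kb_dir(str(Path(tmp) / "phen2gene.py"))
    print(found.name)  # lib
```

Inside a real script, the call would be resolve_kb_dir(__file__), so the lookup no longer depends on where the user runs the command from.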

Hope my suggestions are useful and I am excited to try phen2gene.
Thank you.

Update KB

Could you provide a status report on this matter? Additionally, I'd like to revise the Knowledge Base. If you have any scripts for modifying Phenolyzer outputs corresponding to each HPO term, I would be happy to use them. I appreciate your help.

BUG: the score is different between json and tsv when weight_model=='s'

When weight_model=='s':

json: in lib/json_format.py, the score is gene_dict[gene_ID[j]][1]/highest_score

gene_info_dict[gene_dict[gene_ID[j]][0]] = {'Rank':str(rank_ave) , 'gene_id':str(gene_ID[j]),'score':str( float(gene_dict[gene_ID[j]][1]/highest_score)), 'status':gene_dict[gene_ID[j]][2]}

tsv: in lib/output.py, the score is gene_dict[gene_symbols[j]][1]

output_file.write(str(rank_ave) + "\t" + gene_dict[gene_symbols[j]][0] + "\t" + str(gene_dict[gene_symbols[j]][-1]) + "\t" + str( gene_dict[gene_symbols[j]][1]  ) + "\t" + gene_dict[gene_symbols[j]][2] + "\n" )

One divides by highest_score; the other does not.
I think it's a bug.
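
A possible way to keep the two outputs consistent, sketched here under the assumption that the normalized score is the intended one, is to compute the normalization once and let both writers use the same values. The function name, gene names, and scores below are made up for illustration.

```python
from typing import Dict


def normalize_scores(raw_scores: Dict[str, float]) -> Dict[str, float]:
    """Divide every score by the highest one so the top gene scores 1.0."""
    if not raw_scores:
        return {}
    highest = max(raw_scores.values())
    if highest == 0:
        return dict(raw_scores)  # nothing to scale
    return {gene: score / highest for gene, score in raw_scores.items()}


raw = {"GENE_A": 4.0, "GENE_B": 2.0}  # made-up raw scores
shared = normalize_scores(raw)

# Both output formats read from the same normalized dictionary.
json_scores = {gene: str(score) for gene, score in shared.items()}
tsv_lines = [f"{gene}\t{score}" for gene, score in shared.items()]
print(json_scores)
print(tsv_lines)
```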

Huge amount of files in Knowledge database

Hi Phen2gene authors
I'm using your tool on a cluster and I have an issue with the number of files that make up the DB. This DB has the following folders:

Knowledgebase: 14259 files
skewness: 14259 files
weights: 14259 files

My cluster has a file-count limit, and this DB reaches it. In the skewness folder there is only one number per file, so those files could be merged into one file. The same could be done for the weights folder, although its files hold more information. Could you offer a solution to this problem?
Thank you in advance
Pedro Seoane
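
As a user-side workaround, the single-number files could be merged into one tab-separated file and loaded back as a dictionary. The sketch below assumes only what the post states, one number per file; the file names and values in the demonstration are made up, and Phen2Gene itself would still need changes to read the merged file.

```python
import tempfile
from pathlib import Path
from typing import Dict


def merge_single_value_files(folder: Path, merged: Path) -> int:
    """Write one 'file name <TAB> value' line per file; return the file count."""
    sources = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.resolve() != merged.resolve()
    )
    with merged.open("w") as out:
        for path in sources:
            out.write(f"{path.name}\t{path.read_text().strip()}\n")
    return len(sources)


def load_merged(merged: Path) -> Dict[str, float]:
    """Read the merged file back into a {file name: value} dictionary."""
    values = {}
    with merged.open() as handle:
        for line in handle:
            name, value = line.rstrip("\n").split("\t")
            values[name] = float(value)
    return values


# Demonstration with two made-up files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "skewness"
    folder.mkdir()
    (folder / "term_a").write_text("1.5\n")
    (folder / "term_b").write_text("0.25\n")
    merged_file = Path(tmp) / "skewness.tsv"
    print(merge_single_value_files(folder, merged_file))  # 2
    print(load_merged(merged_file))
```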

Possible issue for HP:0003002 (Breast carcinoma)

Hi there,
When I try to get the prioritized gene list for HP:0003002, there is a gene that pops up at the top with a score of 1.00. The gene is WIST1. This gene symbol is non-existent (not even present as an alias).

Could you have a look?

I am using the web version - https://phen2gene.wglab.org/