bionf / phyloprofile

A phylogenetic profile analysis tool

Home Page: https://applbio.biologie.uni-frankfurt.de/phyloprofile/

License: Other

R 99.23% Python 0.57% HTML 0.20% CSS 0.01%

orthologs shiny phylogenetic-profile visualization r heatmap interactive-visualizations bioinformatics

phyloprofile's Introduction

PhyloProfile

Click here for the full PDF version of the BOSC2017 poster

PhyloProfile is an R package that comes together with a Shiny-based tool for integrating, visualizing and exploring multi-layered phylogenetic profiles.

Alongside the presence/absence patterns of orthologs across large taxon collections, PhyloProfile allows the integration of any two additional information layers. These complementary data, like sequence similarity between orthologs, similarities in their domain architecture, or differences in functional annotations enable a more informed interpretation of phylogenetic profiles.

By utilizing the NCBI taxonomy, PhyloProfile can dynamically collapse taxa into higher systematic groups. This enables rapidly changing the resolution from the comparative analyses of proteins in individual species to that of entire kingdoms or even domains without changes to the input data.

PhyloProfile furthermore allows for a dynamic filtering of profiles – taking the taxonomic distribution and the additional information layers into account. This, along with functions to estimate the age of genes and core gene sets facilitates the exploration and analysis of large phylogenetic profiles.

Take a look at the functionality of PhyloProfile and explore the installation-free online version to learn more.

Installation & Usage
Input Data
Walkthrough & Examples
Bugs
Acknowledgements
License
How-To Cite
Contact

Installation & Usage

PhyloProfile requires the latest version of R (check for required R version here). Please install or update R on your computer before continuing.

Then start R to install and use PhyloProfile.

Using BiocManager

PhyloProfile is available on Bioconductor (requires Bioconductor version ≥ 3.14). To install PhyloProfile, start R and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("PhyloProfile")

To install the development version of PhyloProfile, please use the devel version of Bioconductor:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(version='devel')
BiocManager::install("PhyloProfile")

Alternatively, install the development version of PhyloProfile from our GitHub repository using devtools.

Using devtools

The development version of PhyloProfile can be installed from this GitHub repository using devtools:

if (!requireNamespace("devtools"))
    install.packages("devtools")
devtools::install_github("BIONF/PhyloProfile", INSTALL_opts = c('--no-lock'), build_vignettes = TRUE)

Using Conda

PhyloProfile can also be installed within a conda environment. First, add bioconda to the list of your conda channels:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

then install PhyloProfile using the standard conda install command:

conda install bioconductor-phyloprofile

This installation step can take a while regardless of the method used, as all necessary dependencies will be downloaded and installed automatically. (Note: depending on your system, this sometimes fails. Please check the console log for error messages concerning the dependency installation.)

Start PhyloProfile's Shiny app

From the R terminal, enter:

library(PhyloProfile)
runPhyloProfile()

Check your web browser; PhyloProfile will be displayed there ;-) On the first run, the tool will download pre-calculated taxonomy data. Please be patient until you see a message prompting you to upload input files.

Please check our detailed instructions if you encounter any problems while installing and starting the program.

Input Data

PhyloProfile can read a number of different input files, including multi-FASTA files, regular tab-separated files, OMA ID lists or OrthoXML files. The additional information layers can be embedded in the OrthoXML or be provided separately as tab-separated files.

All supported input formats are described in the Input Data section of the PhyloProfile Wiki.

Walkthrough & Examples

Read the walkthrough slides to explore the functionality of the PhyloProfile GUI.

Check the vignette to learn how to use PhyloProfile's functions in specific use cases:

browseVignettes("PhyloProfile")

Bugs

Bug reports, comments and suggestions are highly appreciated. Please open an issue on GitHub or get in touch via email.

Acknowledgements

We would like to thank

Bastian for the great initial idea and his kind support,
Members of the Ebersberger group for many valuable suggestions and ...bug reports :)

License

This tool is released under the MIT license.

How-To Cite

Ngoc-Vinh Tran, Bastian Greshake Tzovaras, Ingo Ebersberger, PhyloProfile: dynamic visualization and exploration of multi-layered phylogenetic profiles, Bioinformatics, Volume 34, Issue 17, 01 September 2018, Pages 3041–3043, https://doi.org/10.1093/bioinformatics/bty225

or use the citation function in R to get it directly in BibTeX or LaTeX format

citation("PhyloProfile")

Contact

Vinh Tran [email protected]

phyloprofile's People

carlamoelbert altingia ajwije hannahbioi ermali7 gedankenstuecke stefanbiermann trvinh jurudo

phyloprofile's Issues

Using the GH Wiki-pages

I think that we should put the wiki pages to some use (probably after #31 is solved). This could be useful to declutter the README.md, which otherwise might become too overwhelming for people reading about PhyloProfile for the first time.

Things that could go there:

More detailed how-tos for the different functions
Explanations of the file formats
the FAQ that's now inside PhyloProfile

Deactivating Auto update plot is not working

Whether the check box of Auto update plot is checked or not does not seem to have an effect, because the plot is always updating when the user makes any changes.

Version

Observed in version 0.3.0/0.3.2

Type of Issue

Bug

Logfile

There is no error or other message in the R log when checking or unchecking the box

show absent orthologs in detailed plot

detailed plot does not show absent orthologs when input data doesn't have any value for var1 and var2

Customized profile: "plot sequences" button hidden by profile

The plot sequences button in the customized profile view is sometimes (partially) hidden by the profile as soon as one is already plotted.

Provide some test data

Maybe create a subfolder testdata where there's a small data set that can be used to play around with the tool after downloading & installing it? Could be the same as for a potential online demo.

Reconfigure as a package?

Reconfiguring the code as an R package would simplify installation, especially of dependencies, compared with requiring users to clone the repository with Git and then run the setup script in R.

Description

Basically same as above, just caught my attention from looking through the installation instructions in the README.

Type of Issue

feature-request

Related Issue(s)

This might also help to organize the tests for #60.

improve documentation

Description

new walk-through slide
new performance test

Type of Issue

Related Issue(s)

Logfile

Add ToC to README

By now the README.md is rather extensive. A Table of Contents that links to the subsections would be great for new people to find their way around. Luckily there are some simple bash scripts which can do this.

Change theme of histogram(s)❓

The standard ggplot2 theme looks a bit ugly for the FAS score histogram. Maybe one could change it to theme_minimal or comparable?

improve input / output options

Description

METADATA for input (e.g. refspec, number of genes for preData, plot size, clustering...)
input file contains more than 5 columns (users will choose 2 for visualising)
additional input option for OMA (either list of protein IDs or list of OMA groups from file or copy/paste)
option to input oneSeq folder (*.phyloprofile, *.extended.fa, *. *.domains)
upload 2 domains files and add switch button to choose which will be used for plotting
use shinyFile for input domain and fasta folder
output taxonomy tree together with domains of selected taxon (like DoMosaics)

Type of Issue

Related Issue(s)

Logfile

Long format input too large for upload

Description

Hi there,
I have an ortholog table from OrthoFinder with 144 species, which makes it quite large. I have converted the OrthoFinder groups into the long format, which looks like this:

geneID  ncbiID  orthoID FAS     traceability
OG0000000       ncbi81532       ACANSP|TRINITY_DN10128_c0_g3__TRINITY_DN10128_c0_g3_i1__g.39236__m.39236        NA      NA
OG0000000       ncbi81532       ACANSP|TRINITY_DN10739_c0_g2__TRINITY_DN10739_c0_g2_i1__g.43040__m.43040        NA      NA
OG0000000       ncbi81532       ACANSP|TRINITY_DN10739_c0_g2__TRINITY_DN10739_c0_g2_i2__g.43043__m.43043        NA      NA
OG0000000       ncbi81532       ACANSP|TRINITY_DN10797_c0_g1__TRINITY_DN10797_c0_g1_i1__g.43004__m.43004        NA      NA
OG0000000       ncbi81532       ACANSP|TRINITY_DN1109_c0_g1__TRINITY_DN1109_c0_g1_i1__g.3023__m.3023    NA      NA

However, when I try to upload this into the PhyloProfile GUI it complains that the file size is too large. It is indeed ~440 MB. Can the file size limit be increased? Is there another way to get the OrthoFinder output into PhyloProfile?
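
One possible workaround, sketched under the assumption that the limit reported by the GUI is Shiny's own upload cap (5 MB by default): Shiny reads the cap from the shiny.maxRequestSize option, so it can be raised before the app is launched. Whether PhyloProfile sets its own value at startup is not confirmed here, so this may or may not take effect.

```r
# Raise Shiny's upload cap to 1 GB before launching the app.
# Assumption: the app does not overwrite this option at startup.
maxUploadBytes <- 1024 * 1024^2
options(shiny.maxRequestSize = maxUploadBytes)

# Then start the app as usual:
# library(PhyloProfile)
# runPhyloProfile()
```

The option is global to the R session, so it only needs to be set once per session.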

Type of Issue

Potential bug, or feature request.

Related Issue(s)

Logfile

Making a first release

With the submission of PhyloProfile we should probably start using some kind of versioning-system. This is partly inspired by the milestone that our PR to taxize got.

GitHub makes that easy with their tags/releases feature. A new release for PhyloProfile can be created here.

The suggested guideline is to use semantic versioning, with MAJOR.MINOR.PATCH being the numbering system. So the current state could be v1.0.0 or v1.0.0-beta, depending on how conservative one wants to be.

2000001 is not larger than the largest NCBI taxonomy ID anymore

This is a problem when adding IDs that are not from the NCBI taxonomy database. Previously the largest NCBI ID was 1834343, but this number has since increased. Therefore, adding new taxon IDs starting at 2000001 may result in duplicated IDs.
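
A minimal sketch of one way to avoid the collision, assuming the current set of NCBI IDs is available as a numeric vector: derive the starting point from the current maximum instead of a fixed constant. The helper nextCustomTaxonIds and its margin parameter are hypothetical, and the IDs below are toy values.

```r
# Hypothetical helper: pick IDs for user-defined taxa that cannot collide
# with the NCBI taxonomy IDs currently known to the tool.
nextCustomTaxonIds <- function(existingIds, n, margin = 1e6) {
    # Start well above the current maximum instead of at a fixed constant
    start <- max(existingIds) + margin
    seq(from = start + 1, length.out = n)
}

ncbiIds <- c(9606, 10090, 1834343, 2500000)  # toy values, not real data
newIds <- nextCustomTaxonIds(ncbiIds, n = 3)
print(newIds)
```

The margin only postpones the problem as NCBI keeps growing, so a duplicate check against the current taxonomy files would still be needed.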

speed up PhyloProfile

Description

Some interesting posts:
https://datascienceplus.com/strategies-to-speedup-r-code/
https://shiny.rstudio.com/articles/plot-caching.html
https://ryanhafen.com/blog/plot-lots-of-data/

Type of Issue

Related Issue(s)

Logfile

improve the source code

Description

Follow this style guide from Hadley Wickham to get well-formatted code.
Try using the lintr package to improve the current code.
Consider adding this format check to Travis-CI.

Type of Issue

Enhancement

Related Issue(s)

Logfile

color of domain plot

Description

Color of the same domains between 2 proteins should be the same to make it easier to compare.

Type of Issue

Improvement

Related Issue(s)

Logfile

OMA Standalone OrthoXML error

Hello dear developers,

I encountered a problem loading an OrthoXML file from an OMA Standalone analysis. As far as I know, OMA doesn't generate any scores for orthologs (similar to the FAS score from hamstr/fdog). But PhyloProfile doesn't allow uploading an OrthoXML file without any variables.
Is it somehow possible to use files without scores for visualization in PhyloProfile?

Thank you!

missing taxa when download customized profile - is it ok?

Description

The downloaded customized profile could have fewer taxa than the original data, which could cause a wrong calculation of the percentage of present taxa in each superkingdom, simply because the number of included taxa has changed.

Type of Issue

Related Issue(s)

Logfile

Handling missing inputs with req(...)

Description

try this https://shiny.rstudio.com/articles/req.html to check missing input

Type of Issue

improvement

plot features

Description

PLOT CONFIGURATION

plot configuration as yml file (if not, use default values)

DOMAIN PLOT

desc for domains (with URL)
show domain plot directly on main/customized profile tab
option to filter out common domains; or dropdown menu for selecting domains of interest
simplify text (seq ID, feature ID)
should be more appealing

MAIN PLOT

add lines to split taxon groups (e.g. euk, bac, archaea), user can choose the resolution (e.g. phylum/kingdom/superkingdom,..)
highlight a group of genes (e.g. gene cluster)
auto adjust plot area to number of genes and species
link seq ID with UNIPROT or NCBI if possible
set number of proteins to be displayed should be in the input page
option to lock clustered profile from the filtering (i.e. profile after filtering will not change its order)
option to show all taxa in the input independent on the current genes

CUSTOMIZED PLOT

[option to] expand the selectize input for gene list

DETAILED PLOT

Simplify seq ID by showing Species + gene ID (not the default hamstr ID)
Make sure to sort the Y-axis (gene IDs)
Display systematic string for the two taxa (?)

Type of Issue

Related Issue(s)

Logfile

Idea: Customized Profile, Filter by Rank

So far the Customized Profile only allows filtering on the taxonomic rank given for the main plot. Which means if you're on the species level and have 400 taxa with 100 fungi you can't easily filter for those 100 fungi, as you'd need to give all 100 fungal species manually in the customized profile.

It would be great if the Customized Profile allowed selecting a taxonomic rank and then sub-selecting within that rank (e.g. Rank: Kingdom, Supertaxa: Fungi).
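
The idea can be sketched as a lookup in a taxonomy table that has one column per rank. The table layout, the column names and the helper selectByRank are assumptions made for illustration, not PhyloProfile's internal data structures.

```r
# Toy taxonomy table: one row per species, one column per rank (illustrative)
taxonomy <- data.frame(
    species = c("Saccharomyces cerevisiae", "Neurospora crassa", "Homo sapiens"),
    kingdom = c("Fungi", "Fungi", "Metazoa"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: return all species that belong to a supertaxon
# at the chosen rank
selectByRank <- function(taxonomy, rank, supertaxon) {
    taxonomy$species[taxonomy[[rank]] == supertaxon]
}

fungi <- selectByRank(taxonomy, rank = "kingdom", supertaxon = "Fungi")
print(fungi)
```

The resulting species list could then be passed to the customized profile in place of a manually entered selection.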

write tests for new functions

Description

Copied from @gedankenstuecke email: We really should start to write some tests for PhyloProfile to make sure that all functions do put out what they’re supposed to. If we get started with this a tiny bit before the sprint adding more tests could make a good sprint task. :-)

Type of Issue

To make the distribution easier and this tool ... cooler :D

Related Issue(s)

Logfile

check the correct format for domain input files

Description

The domain input files must contain the correct info in the corresponding columns, and the number of columns must be between 6 and 8 for v0.3.0-beta and between 5 and 7 for v0.2.1.

The tool shows a user-unfriendly error message when the input file does not follow those requirements.

Type of Issue

Improve the validity check

Related Issue(s)

Logfile

Comments on branch `v0.3.0-beta`

Feedback on branch for Mozilla OL

Description

I'm concerned a bit about the python system calls. Would using rPython be better in that it would handle checking for python being installed and possibly avoid other issues with using the system() function to call python from the command line?
Agree with issue #60 on adding tests for functions. This could expand to using the demo data files for testing (though re-iterating #63, I think this works a bit more seamlessly if the project follows an R package structure).
Add documentation to specify the inputs/outputs of functions, especially in functions.R.
Check style on code in functions.R. My linter produced several reasonable complaints about the formatting here and variable names.
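
As a lighter alternative to switching to rPython, the interpreter check mentioned in the first point can be sketched with base R alone. The helper findPython is hypothetical; Sys.which and system2 are standard base R functions.

```r
# Hypothetical wrapper: locate an interpreter before shelling out.
findPython <- function(candidates = c("python3", "python")) {
    paths <- Sys.which(candidates)
    found <- paths[nzchar(paths)]
    if (length(found) == 0) return(NA_character_)
    unname(found[1])
}

pythonPath <- findPython()
if (is.na(pythonPath)) {
    message("No Python interpreter found on PATH; skipping the Python step.")
} else {
    # system2() takes the command and its arguments separately
    versionInfo <- system2(pythonPath, "--version", stdout = TRUE, stderr = TRUE)
    message("Using ", pythonPath, " (", paste(versionInfo, collapse = " "), ")")
}
```

Failing early with a clear message is usually easier to debug than a bare system() call that returns a non-zero status.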

Type of Issue

feature request

Related Issue(s)

#60 to add tests.
#63 to make the project into an R package.

privileges gone through repo-transfer

Poor @trvinh can't edit the settings of the repository any longer as he accidentally gifted them away with the transfer from his personal repo to the @BIONF repo. Should be fixed. 😂

automatically install all packages via script

problem with online version

Somehow the Detailed plot of the online version doesn't work the same as the offline version. The order of sequences is incorrect :(

Modularising app code

Description

Use shiny modules to modularise the code for better maintaining & collaboration.
Ref: https://shiny.rstudio.com/articles/modules.html

Type of Issue

Code enhancement

Related Issue(s)

#56
#63

Logfile

rewrite ncbiTaxonomyParser.pl using python

Description

Rewrite scripts/ncbiTaxonomyParser.pl in Python to remove the dependency from Perl.

Type of Issue

Enhancement

Related Issue(s)

#64

Logfile

issue with checkNewick function

Description

checkNewick function should be checked.

The check (number_of_commas) = (number_of_open_parentheses + 1) does not always hold. For example, ((a,b),c); is also a correct Newick tree, but there (number_of_commas) = (number_of_open_parentheses).
missingTaxa returns an empty list and creates an error msg " not exist in main input file!"
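
The counterexample from the first point can be reproduced in a few lines of base R. The second function is a looser sanity check (balanced parentheses and a trailing semicolon) offered as a sketch only; it is neither a full Newick parser nor the package's checkNewick.

```r
# Count occurrences of a single character in a string (base R only)
countChar <- function(x, ch) {
    sum(strsplit(x, "", fixed = TRUE)[[1]] == ch)
}

tree <- "((a,b),c);"
nComma <- countChar(tree, ",")
nOpen <- countChar(tree, "(")

# Looser check: parentheses never close before they open, end balanced,
# and the string ends with a semicolon
looksLikeNewick <- function(x) {
    chars <- strsplit(x, "", fixed = TRUE)[[1]]
    depth <- cumsum((chars == "(") - (chars == ")"))
    grepl(";\\s*$", x) && all(depth >= 0) && depth[length(depth)] == 0
}

print(c(commas = nComma, openParens = nOpen))
print(looksLikeNewick(tree))
```

For this tree both counts are 2, so a rule that requires commas = open parentheses + 1 would reject a valid input.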

Type of Issue

Bug

Related Issue(s)

Logfile

move taxonomy info file to dropbox

Bug: Could not find file server.R to run

Description

After all libraries were downloaded and installed, R couldn't find the file server.R which is included in the package.

Solution:

Check default directory
Change the path so that it points to your package

Note: I tested this package using windows system

Type of Issue

Bug

Related Issue(s)

This could be related to #63

Logfile

demo files for OMA Standalone OrthoXML-converter are gone.

Whoops, these should be in some place where people can see them right away. I'd still be in favor of having some demo folder with some small example files to that end. Especially as "🐵 see, 🐵 do" is often the best way to learn how to run something 😂

rooting user input tree

Description

Currently the user input taxonomy tree has to be rooted (and sorted) before it is uploaded into the tool. Selecting (or changing) the reference species has no effect on the plot.

Type of Issue

Make it possible to use the chosen reference species to root the input taxonomy tree if it is not rooted with that species.

Related Issue(s)

Logfile

Gene age estimation plots: Wrong ordering

It appears that the gene age distribution plot sometimes puts the bars in the wrong order. The 161 genes to the left belong to the Ascomycota, and clicking on that range will successfully list the Ascomycota genes. Unfortunately the underlying stacked bar plot is showing the color for the LECA genes (blue) instead of red for the Ascomycota.

Make binary colors look nicer

Right now the dots are getting a rather ugly brown color if you don't upload any data for var1. Guess it would be nicer if this was either a nicer color (e.g. the blue used otherwise) or user selectable?

link to demo data including files in dropbox

check the user-defined taxonomy file before processing

Description

An incorrect user-defined taxonomy file (newTaxa.txt) can cause many invisible problems by corrupting the taxonomy matrix file (taxonomyMatrix.txt; in some cases idList.txt and rankList.txt are affected as well).
Those problems occur mostly because of duplicated IDs or taxon names between newTaxa.txt and idList.txt, rankList.txt or taxonomyMatrix.txt.

So, this newTaxa.txt file should be checked before being processed for:

duplicated IDs with existing taxonomy files
duplicated names with existing taxonomy files
validity of new taxon IDs (greater than the current highest ncbi taxon ID)

It would also be good to implement a progress indicator for this process, so that the user can estimate how long the job will take.
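
The three checks listed above could look roughly like this. The column names ncbiID and fullName, the helper checkNewTaxa and all values are assumptions for illustration; the real layout of newTaxa.txt is not shown here.

```r
# Hypothetical validator: returns a character vector of problems (empty if OK)
checkNewTaxa <- function(newTaxa, knownIds, knownNames, maxNcbiId) {
    problems <- character(0)
    # 1. IDs already used in the existing taxonomy files or repeated in the input
    dupId <- newTaxa$ncbiID[newTaxa$ncbiID %in% knownIds |
                            duplicated(newTaxa$ncbiID)]
    if (length(dupId) > 0)
        problems <- c(problems, paste("duplicated ID:", dupId))
    # 2. Names already used in the existing taxonomy files or repeated in the input
    dupName <- newTaxa$fullName[newTaxa$fullName %in% knownNames |
                                duplicated(newTaxa$fullName)]
    if (length(dupName) > 0)
        problems <- c(problems, paste("duplicated name:", dupName))
    # 3. New IDs must be greater than the current highest NCBI taxon ID
    lowId <- newTaxa$ncbiID[newTaxa$ncbiID <= maxNcbiId]
    if (length(lowId) > 0)
        problems <- c(problems, paste("ID not above NCBI maximum:", lowId))
    problems
}

newTaxa <- data.frame(
    ncbiID = c(9606, 3000001, 3000002),
    fullName = c("My strain A", "Homo sapiens", "My strain C"),
    stringsAsFactors = FALSE
)
problems <- checkNewTaxa(newTaxa,
                         knownIds = c(9606, 10090),
                         knownNames = c("Homo sapiens", "Mus musculus"),
                         maxNcbiId = 2000000)  # toy maximum
print(problems)
```

Returning all problems at once, rather than stopping at the first, lets the user fix the file in a single pass.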

Type of Issue

Improve the validity check

Related Issue(s)

Logfile

Obtain ncbi taxonomy from local database

Description

Add option for getting ncbi taxonomy from local database (e.g. nodes.dmp, names.dmp or nodeDB).

Type of Issue

New feature

Related Issue(s)

Suggestion of @evolgenomology in #78

Logfile

Add histogram for taxon presence distribution❓

First off: well done on adding the histogram on the FAS scores, it makes it much easier to see how different cutoffs will impact how much data is "hidden" in the end. Do you think it would be useful to add the same for the % of taxa being present in a given supertaxon? As you offer to filter according to that it might be useful as well. 😄

Readme-Link

In PhyloProfile we have now linked out to the GitHub page that renders the README.md. While this is a nice hack to show the README to people, it has the drawback that it takes people away from the app, which is unwanted, especially if people are already actively using the app.

We should fix this by either:

Make sure the link opens in a new tab/window
The whole thing is rendered inside the PhyloProfile-App, without losing the main navigation and all progress/analyses made.

The first is probably easier, the second would be nicer and could be used for rendering out GH-Wiki as well.

X Axis labels horizontal adjustment

Description

With large X axis labels and with any angle, the leftmost X axis label can reach outside the plotting space. Then, the plotting function cuts off the label. If you set the angle at 90° to avoid this issue, the horizontal adjustment disjoints the label from the corresponding data column.

There are two suggestions:

Less favorable: Increase the plotting space to the left by 0.5 to 1 px by overriding the margin.
More favorable: Set the hjust value to 0.5 or 0 when setting the angle to 90°, whatever would centralize the label. I am aware that conditional formatting like this may not be straightforward with R plotting functions.
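
A sketch of the conditional justification, assuming the plot is drawn with ggplot2. In ggplot2, hjust acts along the direction of the rotated text, so at 90° it is vjust = 0.5 that centres a label on its column; the common idiom is angle = 90, hjust = 1, vjust = 0.5. The helper xLabelJust is hypothetical.

```r
# Hypothetical helper: choose text justification from the label angle.
xLabelJust <- function(angle) {
    if (angle == 90) {
        # hjust = 1 pushes the end of the label against the axis;
        # vjust = 0.5 centres the label on its tick mark
        list(hjust = 1, vjust = 0.5)
    } else if (angle == 0) {
        list(hjust = 0.5, vjust = 1)
    } else {
        list(hjust = 1, vjust = 1)
    }
}

just <- xLabelJust(90)

# Only build the example plot if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
    df <- data.frame(taxon = c("A long taxon name", "Another long name"),
                     n = c(3, 5))
    p <- ggplot2::ggplot(df, ggplot2::aes(x = taxon, y = n)) +
        ggplot2::geom_col() +
        ggplot2::theme(axis.text.x = ggplot2::element_text(
            angle = 90, hjust = just$hjust, vjust = just$vjust
        ))
}
```

With hjust = 1 the labels hang below the axis instead of extending to the left, which would also address the cut-off of the leftmost label.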

Type of Issue

Feature request.

Related Issue(s)

None that I am aware of.

Logfile

Observed in the Anaconda package:

Name                            Version           Build            Channel
bioconductor-phyloprofile       1.4.9             r40hdfd78af_0    bioconda

Evidence:

convergence point

Description

Identifying the time point when an ancestral gene got duplicated?

Type of Issue

Related Issue(s)

Logfile

Profiles clustering

So far, the clustering of profiles is done before any of the filters (% present, variable 1, variable 2) are applied. Furthermore, it always clusters on the binary presence/absence of species, even if the subsequent display is done on the grouping on a higher taxonomic level.

It would be awesome if the clustering is done

on the filtered data, after individual taxon presences are removed,
on the grouped level, using the % of species present as the values for the distance matrix.

Otherwise the resulting clusterings look very strange if you look at filtered & grouped profiles, as the clustering does not correspond to what you actually observe.
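
A minimal sketch of the second suggestion, using only base R and the stats package. The table layout and all names are illustrative; in the app, the filters would be applied to the presence table before this step.

```r
# Toy long-format presence table; all names and values are illustrative.
presence <- data.frame(
    geneID = rep(c("g1", "g2", "g3"), each = 4),
    species = rep(c("sp1", "sp2", "sp3", "sp4"), times = 3),
    supertaxon = rep(c("Fungi", "Fungi", "Metazoa", "Metazoa"), times = 3),
    present = c(1, 1, 0, 0,   1, 0, 0, 0,   0, 0, 1, 1)
)

# Fraction of species present per gene and supertaxon (rows = genes)
fracPresent <- tapply(presence$present,
                      list(presence$geneID, presence$supertaxon), mean)

# Cluster genes on the grouped values instead of binary species presence
geneDist <- dist(fracPresent, method = "euclidean")
geneTree <- hclust(geneDist, method = "complete")
geneOrder <- rownames(fracPresent)[geneTree$order]
print(fracPresent)
print(geneOrder)
```

Because the distance matrix is built from the same values that are displayed, the resulting gene order matches what the user sees in the grouped profile.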

Filter for number of co-orthologs

Description

It would be nice to have an additional filter so that the user is able to only select orthologous groups without co-orthologs. This could also be a numerical filter with slider. The user could also select a maximum number of co-orthologs to be present.
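
Such a filter could be sketched on long-format input as follows. The column names follow the long format shown in another issue (geneID, ncbiID, orthoID); the helper filterByCoOrthologs and its maxCoOrtho parameter are hypothetical.

```r
orthologs <- data.frame(
    geneID = c("OG1", "OG1", "OG1", "OG2", "OG2", "OG3"),
    ncbiID = c("ncbi9606", "ncbi9606", "ncbi10090",
               "ncbi9606", "ncbi10090", "ncbi9606"),
    orthoID = c("a1", "a2", "a3", "b1", "b2", "c1"),
    stringsAsFactors = FALSE
)

# Number of sequences per gene/taxon pair; more than one means co-orthologs
counts <- aggregate(orthoID ~ geneID + ncbiID, data = orthologs, FUN = length)
names(counts)[names(counts) == "orthoID"] <- "nOrtho"

# Hypothetical filter: keep groups whose largest number of co-orthologs
# in any taxon does not exceed maxCoOrtho
filterByCoOrthologs <- function(df, counts, maxCoOrtho = 0) {
    perGene <- tapply(counts$nOrtho, counts$geneID, max)
    keep <- names(perGene)[perGene - 1 <= maxCoOrtho]
    df[df$geneID %in% keep, , drop = FALSE]
}

strict <- filterByCoOrthologs(orthologs, counts, maxCoOrtho = 0)
print(unique(strict$geneID))
```

In a Shiny interface, maxCoOrtho would map naturally onto the proposed slider.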

Type of Issue

Feature request

Provide scripts for input conversion

So far the input data needs to be in a specific matrix format (see data/demo) in order to be compatible with PhyloProfile.

Most people will not have their data ready in this format. In order to make it easier it would be cool if there was a way to automate the data preprocessing, either by including the corresponding conversion inside PhyloProfile or by providing a script that takes long format data and converts them into the appropriate matrices.

decide whether to provide external conversion or do it inside PhyloProfile
implement the preferred solution 😉
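
For the external-script option, a conversion from long format to a gene-by-taxon matrix might start like this. The exact matrix layout expected in data/demo is not shown here, so the sketch only produces a presence/absence matrix, and the column names are assumptions.

```r
longInput <- data.frame(
    geneID = c("OG1", "OG1", "OG2"),
    ncbiID = c("ncbi9606", "ncbi10090", "ncbi9606"),
    orthoID = c("a1", "a2", "b1"),
    stringsAsFactors = FALSE
)

# Cross-tabulate genes against taxa, then reduce counts to 0/1
counts <- table(longInput$geneID, longInput$ncbiID)
profile <- matrix(as.integer(counts > 0),
                  nrow = nrow(counts),
                  dimnames = dimnames(counts))
print(profile)

# Write a tab-separated file with genes as rows and taxa as columns
outFile <- tempfile(fileext = ".txt")
write.table(data.frame(geneID = rownames(profile), profile, check.names = FALSE),
            file = outFile, sep = "\t", quote = FALSE, row.names = FALSE)
```

Additional information layers would need to be merged into the cells in whatever encoding the matrix format prescribes.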

Integrate into a 'Galaxy Interactive Environment'

Description

This is a suggestion for a Mozilla Global Sprint 2018 activity. It would be great to be able to install PhyloProfile as a Galaxy Interactive Environment. That might allow researchers to build PhyloProfile into a larger analysis workflow or automate input data preparation.

There are a few other projects that have integrated Shiny apps into Galaxy that we might be able to build from:

https://github.com/65MO/Galaxy-E
https://github.com/ValentinChCloud/shiny-GIE
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5691353/

Type of Issue

Suggestion & Feature Request

Related Issue(s)

Relates to #63 because it may involve adjusting the installation process.

data filter using uploaded list of genes

Description

Some of the functions (e.g. total number of genes, list of genes to highlight) still use the whole input data instead of only the subset based on the uploaded list of genes of interest.

problem with distribution plots when using long-format input

fix problem with distribution plots when using long-format input
change size of distribution plot title
make download function for distribution plots

wrong information for clicked point

Description

After clustering the profile, the "point info" shows incorrect info for the clicked point.
It seems that only the plot is reordered, but the data frame used for plotting is not.
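
One way to keep the two in sync, sketched with toy data: apply the clustered order to the data frame itself and use that same data frame for the click lookup. How the app actually maps click coordinates to rows is an assumption here.

```r
profileData <- data.frame(
    geneID = c("g1", "g2", "g3"),
    value = c(0.2, 0.9, 0.5),
    stringsAsFactors = FALSE
)
clusteredOrder <- c("g3", "g1", "g2")  # e.g. derived from hclust()$order

# Apply the clustered order to the data itself, not only to the plot
profileData$geneID <- factor(profileData$geneID, levels = clusteredOrder)
profileData <- profileData[order(profileData$geneID), ]

# A click at axis position 1 now maps to the first factor level,
# and the lookup uses the same reordered table
clickedPos <- 1
clickedGene <- levels(profileData$geneID)[round(clickedPos)]
clickedValue <- profileData$value[profileData$geneID == clickedGene]
print(clickedGene)
print(clickedValue)
```

Using factor levels as the single source of order means the plot axis and the lookup cannot drift apart.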

Type of Issue

a BIG bug 👎