phylonext's Introduction

PhyloNext - PD (Phylogenetic Diversity) in the cloud

PhyloNext is an automated pipeline for the analysis of phylogenetic diversity using GBIF occurrence data, species phylogenies from the Open Tree of Life, and the Biodiverse software.

Introduction

The pipeline brings together two critical research data infrastructures, the Global Biodiversity Information Facility (GBIF) and the Open Tree of Life (OToL), to make them more accessible to non-experts.

The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a portable manner. It uses Docker containers, which simplifies installation and makes results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process, which makes it much easier to maintain and update software dependencies.

The pipeline can also be launched in a cloud environment (e.g., Microsoft Azure, Amazon Web Services, or Google Cloud).

Pipeline summary

  1. Filtering of GBIF species occurrences for various taxonomic clades and geographic areas
  2. Removal of non-terrestrial records and spatial outliers (using density-based clustering)
  3. Preparation of a phylogenetic tree (currently, only pre-constructed phylogenetic trees are available; with the update of OToL, phylogenetic trees will be downloaded automatically using the API) and name-matching with GBIF species keys
  4. Spatial binning of species occurrences using Uber’s H3 system (hexagonal hierarchical spatial index)
  5. Estimation of phylogenetic diversity and endemism indices using the Biodiverse program
  6. Visualization of the obtained results
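
As a rough illustration of step 5, phylogenetic diversity (PD) of a grid cell is commonly defined as Faith's PD: the summed length of all branches that connect the species found in the cell to the root of the tree. The Python sketch below works under that assumption. The toy tree, its branch lengths, and the function name `faith_pd` are hypothetical, and the Biodiverse implementation may differ in detail.

```python
# Toy tree: child -> (parent, length of the branch leading to the parent).
# The root has no entry. All names and lengths are made up for illustration.
TOY_TREE = {
    "A": ("n1", 1.0),
    "B": ("n1", 1.0),
    "n1": ("root", 2.0),
    "C": ("root", 3.0),
}


def faith_pd(tree, tips):
    """Sum the branch lengths on the paths from the given tips to the root.

    Branches shared by several tips are counted only once.
    """
    visited = set()
    total = 0.0
    for tip in tips:
        node = tip
        # Walk towards the root until we reach it or hit a counted branch.
        while node in tree and node not in visited:
            parent, length = tree[node]
            visited.add(node)
            total += length
            node = parent
    return total


print(faith_pd(TOY_TREE, {"A"}))       # 3.0
print(faith_pd(TOY_TREE, {"A", "B"}))  # 4.0
```

In this toy case, adding species B to a cell that already holds A adds only B's terminal branch, because the internal branch to the root is shared.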

Quick Start

An example command to run the pipeline:

nextflow run vmikk/phylonext -r main \
  --input "/mnt/GBIF/Parquet/2022-01-01/occurrence.parquet/" \
  --classis "Mammalia" --family  "Felidae,Canidae" \
  --country "DE,PL,CZ"  \
  --minyear 2000  \
  --dbscan true  \
  --phytree $(realpath "${HOME}/.nextflow/assets/vmikk/phylonext/test_data/phy_trees/Mammals.nwk") \
  --iterations 100  \
  -resume

Web GUI

To make the PhyloNext pipeline easier to explore, Thomas Stjernegaard Jeppesen developed a user-friendly, web-based graphical user interface (GUI).

The GUI is available at https://phylonext.gbif.org/.

NB! To access the GUI, users must have a GBIF user account. To register an account, please visit https://www.gbif.org/.

Documentation

The PhyloNext pipeline comes with documentation about the pipeline usage at https://phylonext.github.io/.

The main pipeline parameters and outputs are described there.

To show a help message, run nextflow run vmikk/phylonext -r main --help.

=====================================================================
PhyloNext: GBIF phylogenetic diversity pipeline :  Version 1.4.0
=====================================================================

Pipeline Usage:
To run the pipeline, enter the following in the command line:
    nextflow run vmikk/phylonext -r main --input ... --outdir ...

Options:
REQUIRED:
    --input               Path to the directory with parquet files (GBIF occurrence dump)
    --outdir              The output directory where the results will be saved
OPTIONAL:
    --phylum              Phylum to analyze (multiple comma-separated values allowed); e.g., "Chordata"
    --classis             Class to analyze (multiple comma-separated values allowed); e.g., "Mammalia"
    --order               Order to analyze (multiple comma-separated values allowed); e.g., "Carnivora"
    --family              Family to analyze (multiple comma-separated values allowed); e.g., "Felidae,Canidae"
    --genus               Genus to analyze (multiple comma-separated values allowed); e.g., "Felis,Canis,Lynx"
    --specieskeys         Custom list of GBIF specieskeys (file with a single column, with header)

    --phytree             Custom phylogenetic tree
    --taxgroup            Specific taxonomy group in Open Tree of Life (default, "All_life")
    --phylabels           Type of tip labels on a phylogenetic tree ("OTT" or "Latin")
    --maxage              Manually assign root age for a tree obtained from Open Tree of Life; e.g., 127
    --phyloonly           Prune Open Tree tips for which there are no phylogenetic inputs; logical, default, false

    --country             Country code, ISO 3166 (multiple comma-separated values allowed); e.g., "DE,PL,CZ"
    --latmin              Minimum latitude of species occurrences (decimal degrees); e.g., 5.1
    --latmax              Maximum latitude of species occurrences (decimal degrees); e.g., 15.5
    --lonmin              Minimum longitude of species occurrences (decimal degrees); e.g., 47.0
    --lonmax              Maximum longitude of species occurrences (decimal degrees); e.g., 55.5
    --minyear             Minimum year of record's occurrences; default, 1945
    --maxyear             Maximum year of record's occurrences; default, none
    --coordprecision      Coordinate precision threshold (less than maximum allowed value; default, 0.1)
    --coorduncertainty    Maximum allowed coordinate uncertainty, meters (default, 10000)
    --coorduncertaintyexclude Black list of coordinate uncertainty values (default, "301,3036,999,9999")
    --basisofrecordinclude Basis of record to include from the data; e.g., "PRESERVED_SPECIMEN"
    --basisofrecordexclude Basis of record to exclude from the data; e.g., "FOSSIL_SPECIMEN,LIVING_SPECIMEN"
    --polygon             Custom area of interest (a file with polygons in GeoPackage format)
    --wgsrpd              Polygons of World Geographical Regions; e.g., "pipeline_data/WGSRPD.RData"
    --regions             Names of World Geographical Regions; e.g., "L1_EUROPE,L1_ASIA_TEMPERATE"
    --noextinct           File with extinct species specieskeys for their removal (file with a single column, with header)
    --excludehuman        Logical, exclude genus "Homo" from occurrence data (default, true)
    --roundcoords         Numeric, round spatial coordinates to N decimal places, to reduce the dataset size (default, 2; set to negative to disable rounding)
    --h3resolution        Spatial resolution of the H3 geospatial indexing system; e.g., 4

    --dbscan              Logical, remove spatial outliers with density-based clustering; e.g., "false"
    --dbscannoccurrences  Minimum species occurrence to perform DBSCAN; e.g., 30
    --dbscanepsilon       DBSCAN parameter epsilon, km; e.g., "700"
    --dbscanminpts        DBSCAN min number of points; e.g., "3"

    --terrestrial         Land polygon for removal of non-terrestrial occurrences; e.g., "pipeline_data/Land_Buffered_025_dgr.RData"
    --rmcountrycentroids  Polygons with country and province centroids; e.g., "pipeline_data/CC_CountryCentroids_buf_1000m.RData"
    --rmcountrycapitals   Polygons with country capitals; e.g., "pipeline_data/CC_Capitals_buf_10000m.RData"
    --rminstitutions      Polygons with biological institutions and museums; e.g., "pipeline_data/CC_Institutions_buf_100m.RData"
    --rmurban             Polygons with urban areas; e.g., "pipeline_data/CC_Urban.RData"

    --deriveddataset      Prepare a list of DOIs for the datasets used (default, true)

    --indices             Comma-separated list of diversity and endemism indices; e.g., "calc_richness,calc_pd,calc_pe"
    --randname            Randomisation scheme type; e.g., "rand_structured"
    --iterations          Number of randomisation iterations; e.g., 1000
    --biodiversethreads   Number of Biodiverse threads; e.g., 10
    --randconstrain       Polygons to perform spatially constrained randomization (GeoPackage format)

Leaflet interactive visualization:
    --leaflet_var         Variables to plot; e.g., "RICHNESS_ALL,PD,SES_PD,PD_P,ENDW_WE,SES_ENDW_WE,PE_WE,SES_PE_WE,CANAPE,Redundancy"
    --leaflet_canapesuper Include the `superendemism` class in CANAPE results (default, false)
    --leaflet_color       Color scheme for continuous variables (default, "RdYlBu")
    --leaflet_palette     Color palette for continuous variables (default, "quantile")
    --leaflet_bins        Number of color bins for continuous variables (default, 5)
    --leaflet_sescolor    Color scheme for standardized effect sizes, SES (default, "threat"; alternative - "hotspots")
    --leaflet_redundancy  Redundancy threshold for hiding the grid cells with low number of records (default, 0 = display all grid cells)

Static visualization:
    --plotvar             Variables to plot (multiple comma-separated values allowed); e.g., "RICHNESS_ALL,PD,PD_P"
    --plottype            Plot type
    --plotformat          Plot format (jpg,pdf,png)
    --plotwidth           Plot width (default, 18 inches)
    --plotheight          Plot height (default, 18 inches)
    --plotunits           Plot size units (in,cm)
    --world               World basemap

NEXTFLOW-SPECIFIC:
    -qs                   Queue size (max number of processes that can be executed in parallel); e.g., 8
    -w                    Path to the working directory to store intermediate results (default, "./work")
    -resume               Execute the pipeline using the cached results; useful to continue executions that were stopped by an error
    -profile              Configuration profile; e.g., "docker"
    -params-file          Parameter file in YAML or JSON format (e.g., "Mammals.yaml")
    -c / -C               Configuration file (`-C` ignores all default values) (default, "nextflow.config")
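
The `-params-file` option listed above accepts JSON as well as YAML, so a run can be described in a file instead of a long command line. The Python sketch below writes such a file using values from the Quick Start example. Its keys are the pipeline options without the leading dashes, which is how Nextflow maps a parameter file to `--` options. The file name `Mammals.json` and the helper `write_params` are hypothetical, and `--phytree` is left out because its path depends on the local installation.

```python
import json

# Keys mirror the "--" options from the help message, without the dashes.
params = {
    "input": "/mnt/GBIF/Parquet/2022-01-01/occurrence.parquet/",
    "classis": "Mammalia",
    "family": "Felidae,Canidae",
    "country": "DE,PL,CZ",
    "minyear": 2000,
    "dbscan": True,
    "iterations": 100,
}


def write_params(path, values):
    """Write pipeline parameters to a JSON file readable by -params-file."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(values, handle, indent=2)
        handle.write("\n")


write_params("Mammals.json", params)
# Afterwards, the run could be started with:
#   nextflow run vmikk/phylonext -r main -params-file Mammals.json -resume
```

Keeping the parameters in a file makes a run easier to repeat and to attach to an issue report.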

Source code for the documentation can be found at https://github.com/PhyloNext/phylonext.github.io.

Credits

The PhyloNext pipeline was developed by Vladimir Mikryukov and Kessy Abarenkov.

The Biodiverse program and the Perl scripts accompanying PhyloNext were written by Shawn Laffan (Laffan et al., 2010).

Scripts for getting an induced subtree from the Open Tree of Life were developed by Emily Jane McTavish.

We thank the following people for their extensive assistance in the development of this pipeline: Joe Miller, Shawn Laffan, Tim Robertson, Emily Jane McTavish, John Waller, Thomas Stjernegaard Jeppesen, and Matthew Blissett.

We are also very grateful to Manuele Simi and the nf-core community for helpful advice on the development of this pipeline.

For more details, please see the Acknowledgments section in the docs.

Funding

The work is supported by a grant “PD (Phylogenetic Diversity) in the Cloud” to GBIF Supplemental funds from the GEO-Microsoft Planetary Computer Programme.

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

For further information or help, don't hesitate to file an issue on GitHub.

Future plans

Citations

If you use the PhyloNext pipeline for your analysis, please cite it using the following DOI: 10.5281/zenodo.7974081

Laffan SW, Lubarsky E, Rosauer DF (2010) Biodiverse, a tool for the spatial analysis of biological and related diversity. Ecography, 33: 643-647. DOI: 10.1111/j.1600-0587.2010.06237.x

An extensive list of references for the tools used by the pipeline can be found in the Citations section in the documentation.

phylonext's Issues

pandoc document conversion failed with error 137

I ran the pipeline and everything worked up until the leaflet generation, which failed with "pandoc document conversion failed with error 137". Any idea what this could be about?
It seems that it was the actual export of the Choropleth.html file that errored.

  Loading Biodiverse results
  ..Observed indices
  ..SES-scores
  ..P-values
  Preparing data for CANAPE analysis
  ..Running the CANAPE analysis
  ...Inferring endemism type
  Estimating sampling redundancy
  ..Loading a file with the total number of GBIF-records per H3-cell
  ..Adding N records to the main table
  ..Estimating the index
  Exporting the data table
  Preparing gridcell polygons
  ..Adding diversity estimates to polygons
  ..Adding H3 cell names
  ..Fixing antimeridian issue
  Creating leaflet map
  ..Generating polygon labels
  ..Building basemap
  ..Preparing color palettes
  ..Adding polygons
  ...  RICHNESS_ALL 
  ...  PD 
  ...  SES_PD 
  ...  PD_P 
  ...  ENDW_WE 
  ...  SES_ENDW_WE 
  ...  PE_WE 
  ...  SES_PE_WE 
  ... CANAPE
  ... Redundancy
  ..Adding variable selector
  ..Hiding variables
  ..Exporting the results
Command error:
  Killed
  Error: pandoc document conversion failed with error 137
  Execution halted
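
A note on the exit status: by shell convention, an exit code above 128 means that the process was terminated by the signal numbered (code - 128). Code 137 therefore corresponds to signal 9 (SIGKILL), which, together with the "Killed" line, often indicates that the process ran out of memory and was killed by the operating system or the container runtime. That cause is an assumption here, not something the log confirms. A minimal Python sketch of the decoding (the function name is hypothetical):

```python
import signal


def signal_from_exit_code(code):
    """Return the signal name for a shell exit code above 128, else None."""
    if code <= 128:
        return None
    try:
        return signal.Signals(code - 128).name
    except ValueError:
        # The number does not map to a known signal on this platform.
        return None


print(signal_from_exit_code(137))  # SIGKILL on Linux
```

If memory is indeed the cause, giving the plotting step more RAM or plotting fewer variables with `--leaflet_var` may be worth trying.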

Percentages vs absolute PD

On the map of phylogenetic diversity, the legend has percentages of phylogenetic diversity, but when you hover over a cell it shows the absolute value of PD. It is not clear what the percentages are based on.

Intervals of legend change unexpectedly

The legend shows intervals of percentages (we are looking at PD), but the number of intervals varies between 4 and 5 across runs, which makes the runs hard to compare.
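
One possible explanation, offered as an assumption rather than a confirmed diagnosis: the help message lists "quantile" as the default `--leaflet_palette` and 5 as the default `--leaflet_bins`. With quantile binning, legend classes are bounded by percentiles of the data, and when many cells share the same value, neighbouring percentile breaks can coincide, leaving fewer distinct classes. The Python sketch below illustrates the effect with made-up values. The function names and data are hypothetical, and the R code used by PhyloNext may handle ties differently.

```python
def quantile_breaks(values, bins):
    """Return bins + 1 break points at evenly spaced quantiles.

    Uses linear interpolation between the sorted values.
    """
    xs = sorted(values)
    breaks = []
    for i in range(bins + 1):
        pos = i * (len(xs) - 1) / bins
        lo = int(pos)
        hi = min(lo + 1, len(xs) - 1)
        frac = pos - lo
        breaks.append(xs[lo] + (xs[hi] - xs[lo]) * frac)
    return breaks


def distinct_intervals(values, bins):
    """Count the intervals that remain after merging identical breaks."""
    return len(set(quantile_breaks(values, bins))) - 1


run_a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]  # all values differ
run_b = [1, 1, 1, 1, 2, 3, 4, 5, 6, 7, 8]    # many tied low values

print(distinct_intervals(run_a, 5))  # 5
print(distinct_intervals(run_b, 5))  # 4
```

The same reasoning would explain why the legend is labelled in percentages: under quantile binning, the classes refer to percentile ranges of the data rather than to PD values.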

Different amount of PD between 2 runs

These two phylogenies are with and without Wollemia nobilis. We ran PhyloNext for both of them. All parameters were left at their defaults except for the type of phylogenetic tree labels (Latin), family (Araucariaceae), minimum year (1980), country (Australia), and number of randomisations (20).

The problem is that some cells where Wollemia nobilis is not present have different amounts of PD between the two runs. We would expect cells without Wollemia nobilis to have exactly the same PD in both runs.

These are the two phylogenies we ran:

(((Agathis_australis,Agathis_palmerstonii,((((Agathis_atropurpurea,Agathis_microstachya),Agathis_robusta),((Agathis_dammara,Agathis_borneensis),(Agathis_silbae,Agathis_macrophylla))),((Agathis_moorei,(Agathis_ovata,Agathis_corbassonii)),Agathis_lanceolata)),Agathis_montana,Agathis_obtusa,Agathis_vitiensis,Agathis_brownii,Agathis_alba,Agathis_philippinensis,Agathis_flavescens,Agathis_orbicula,Agathis_kinabaluensis,Agathis_labillardierei,Agathis_lenticula,Agathis_endertii,Agathis_silbaii)Agathis,(Wollemia_nobilis)Wollemia),(((Araucaria_heterophylla,((((Araucaria_columnaris,(Araucaria_luxurians,Araucaria_nemorosa)),((Araucaria_humboldtensis,Araucaria_laubenfelsii),Araucaria_rulei)),((Araucaria_bernieri,(Araucaria_schmidii,Araucaria_subulata)),Araucaria_scopulorum)),(Araucaria_biramulata,Araucaria_montana))),(Araucaria_cunninghamii_var._cunninghamii)Araucaria_cunninghamii),((Araucaria_araucana,Araucaria_angustifolia),(Araucaria_bidwillii,Araucaria_hunsteinii)),Araucaria_excelsa,Araucaria_muelleri,Araucaria_braziliana,Araucaria_lignitici,Araucaria_bernieri,Araucaria_annulata,Araucaria_balcombensis,Araucaria_fimbriatus,Araucaria_grandifolia,Araucaria_hastiensis,Araucaria_longifolia,Araucaria_marensii,Araucaria_nathorsti,Araucaria_planus,Araucaria_prominens,Araucaria_readiae,Araucaria_taeriensis,Araucaria_uncinatus,Araucaria_araucoensis,Araucaria_brownii,Araucaria_jeffreyi,Araucaria_marylandica,Araucaria_neocookii,Araucaria_pichileufensis,Araucaria_prodromus)Araucaria,(Agathoxylon_pseudoparenchymatosum)Agathoxylon,(Balmeisporites_minutus)Balmeisporites)Araucariaceae;

((Agathis_australis,Agathis_palmerstonii,((((Agathis_atropurpurea,Agathis_microstachya),Agathis_robusta),((Agathis_dammara,Agathis_borneensis),(Agathis_silbae,Agathis_macrophylla))),((Agathis_moorei,(Agathis_ovata,Agathis_corbassonii)),Agathis_lanceolata)),Agathis_montana,Agathis_obtusa,Agathis_vitiensis,Agathis_brownii,Agathis_alba,Agathis_philippinensis,Agathis_flavescens,Agathis_orbicula,Agathis_kinabaluensis,Agathis_labillardierei,Agathis_lenticula,Agathis_endertii,Agathis_silbaii)Agathis,(((Araucaria_heterophylla,((((Araucaria_columnaris,(Araucaria_luxurians,Araucaria_nemorosa)),((Araucaria_humboldtensis,Araucaria_laubenfelsii),Araucaria_rulei)),((Araucaria_bernieri,(Araucaria_schmidii,Araucaria_subulata)),Araucaria_scopulorum)),(Araucaria_biramulata,Araucaria_montana))),(Araucaria_cunninghamii_var._cunninghamii)Araucaria_cunninghamii),((Araucaria_araucana,Araucaria_angustifolia),(Araucaria_bidwillii,Araucaria_hunsteinii)),Araucaria_excelsa,Araucaria_muelleri,Araucaria_braziliana,Araucaria_lignitici,Araucaria_bernieri,Araucaria_annulata,Araucaria_balcombensis,Araucaria_fimbriatus,Araucaria_grandifolia,Araucaria_hastiensis,Araucaria_longifolia,Araucaria_marensii,Araucaria_nathorsti,Araucaria_planus,Araucaria_prominens,Araucaria_readiae,Araucaria_taeriensis,Araucaria_uncinatus,Araucaria_araucoensis,Araucaria_brownii,Araucaria_jeffreyi,Araucaria_marylandica,Araucaria_neocookii,Araucaria_pichileufensis,Araucaria_prodromus)Araucaria,(Agathoxylon_pseudoparenchymatosum)Agathoxylon,(Balmeisporites_minutus)Balmeisporites)Araucariaceae;
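
To double-check that two Newick strings differ only in the expected tips, their tip labels can be extracted and compared. The Python sketch below is a minimal illustration that assumes unquoted labels without spaces, as in the trees above. The function name and the short toy trees are hypothetical stand-ins for the full strings.

```python
import re


def tip_labels(newick):
    """Return the set of tip labels in a Newick string.

    A label that directly follows "(" or "," is a tip; a label that
    follows ")" names an internal node and is skipped.
    """
    return set(re.findall(r"(?<=[(,])[^(),:;]+", newick))


# Toy trees standing in for the two phylogenies (with and without one tip).
with_tip = "((A,B)G1,(W)G2,C)R;"
without_tip = "((A,B)G1,C)R;"

print(tip_labels(with_tip) - tip_labels(without_tip))  # {'W'}
print(tip_labels(without_tip) - tip_labels(with_tip))  # set()
```

Because the result is a set, this comparison does not reveal labels that occur more than once within a single tree.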

indices and leaflet_var params

I have a couple of questions

  1. Is it correct that omitting the --indices param will make the pipeline default to all indices?
  2. If so, and the pipeline is started with only a few indices, like --indices calc_richness,calc_phylo_rpd1, is it correct that the user then also needs to adjust the --leaflet_var param, because otherwise some input data will be missing for that step?

If this is correctly understood, is it possible to get a mapping of --leaflet_var values to which indices they depend on?

Best,
Thomas

Identification of title on execution report

When you look at the execution report, you cannot tell which run it is from. It would be useful to have the title and description of the run in the execution report, and just the title on the map.

Is it possible to query the Open Tree directly?

I was able to run a PhyloNext analysis on GBIF using the Acacia_Mishler,2014_Latin-labels.nwk tree on Australia without any problems: https://phylonext.gbif.org/run/eeb8f380-5716-4e47-bb72-2ba6de29da91

I have been trying to use PhyloNext with the synthetic Open Tree to look up patterns of distribution in the genus Erica in Africa. However, this consistently fails for me (see e.g. https://phylonext.gbif.org/run/d8c5b4ea-abb7-455d-9184-1d53403b2e95). Is this because I neither chose an OToL tree in the "Choose a predefined tree from OTL" field on the form nor entered a phylogeny in the "Phylogenetic tree in newick format (phytree param)" field? I assumed that if I left both of these fields empty, PhyloNext would look up the occurrence data as per the phylogenetic filters and then pick up the minimum tree for those taxa from OpenTree. Is that incorrect? If leaving these two fields empty is what caused the Internal Server Error in the results, then the form should make that clearer (by making one of those fields required, or by adding a note to indicate that one of those fields needs to be filled in); otherwise, let me know and I'll send you the pipeline arguments I provided for that run so we can figure out why it's resulting in an Internal Server Error.

Single colour gradient PD and richness

Multi-colour gradients are meant for values that fall above and below a threshold, which is not the case for PD and richness. A single colour gradient would be more appropriate here.

To do

Known issues

  • Data filtering by occurrence issues not implemented.
    Currently, Apache Arrow does not support filtering data using columns with array-type data (e.g., the issue column in the GBIF occurrence dump). See ARROW-16702 [apache/arrow issue 31991] and ARROW-16641 [apache/arrow issue 32045]. It is possible to filter using DuckDB, but the query consumes >100 GB RAM.

  • Estimation of PD for grid cells with a single species

  • Estimation of memory required for each process not implemented
    (could be useful if the pipeline is launched on HPC or in the cloud)
    Given the size of the input, it is possible to guess the amount of RAM required for a task.
    But we need to collect data for various use-case scenarios to make a rough estimate.

Intervals and gradients mixed

The map in the user interface shows a legend with colour intervals, but the generated PDF shows gradients. These should be made consistent.

Feature request

It would be nice to get an overall value for PD in the results, considering all data (without the spatial component). The reason we want this is to understand phylogenetic diversity loss when we submit multiple different phylogenies.

Filter for establishment means

Currently there is not much data, but in the long term we want to encourage people to indicate whether species are domestic or cultivated. A filter for establishment means would be useful.
