biobakery / maaslin2

MaAsLin2: Microbiome Multivariate Association with Linear Models

Home Page: http://huttenhower.sph.harvard.edu/maaslin2

License: Other

R 100.00%

biobakery public tools bioconductor microbiome metagenomics differential-abundance-analysis false-discovery-rate repeated-measures multiple-covariates

maaslin2's Introduction

MaAsLin2 User Manual

MaAsLin2 is the next generation of MaAsLin (Microbiome Multivariable Association with Linear Models).

MaAsLin2 is a comprehensive R package for efficiently determining multivariable associations between clinical metadata and microbial meta-omics features. MaAsLin2 relies on general linear models to accommodate most modern epidemiological study designs, including cross-sectional and longitudinal, along with a variety of filtering, normalization, and transform methods.

If you use the MaAsLin2 software, please cite our manuscript:

Mallick H, Rahnavard A, McIver LJ, Ma S, Zhang Y, Nguyen LH, Tickle TL, Weingart G, Ren B, Schwager EH, Chatterjee S, Thompson KN, Wilkinson JE, Subramanian A, Lu Y, Waldron L, Paulson JN, Franzosa EA, Bravo HC, Huttenhower C (2021). Multivariable Association Discovery in Population-scale Meta-omics Studies. PLoS Computational Biology, 17(11):e1009442.

Check out the MaAsLin 2 tutorial for an overview of analysis options.

If you have questions, please direct them to:
MaAsLin2 Forum
Google Groups (Read only)

Description
Requirements
Installation
How to Run
Visualization
Troubleshooting

Description

MaAsLin2 finds associations between microbiome multi-omics features and complex metadata in population-scale epidemiological studies. The software includes multiple analysis methods (including support for multiple covariates and repeated measures), filtering, normalization, and transform options to customize analysis for your specific study.

Requirements

MaAsLin2 is an R package that can be run on the command line or as an R function.

Installation

MaAsLin2 can be run from the command line or as an R function.

If only running from the command line, you do not need to install the MaAsLin2 package but you will need to install the MaAsLin2 dependencies.

From command line

Download the source: MaAsLin2.master.zip
Decompress the download:
- $ unzip master.zip
Install the Bioconductor dependencies edgeR and metagenomeSeq.
Install the CRAN dependencies:
- $ R -q -e "install.packages(c('lmerTest','pbapply','car','dplyr','vegan','chemometrics','ggplot2','pheatmap','hash','logging','data.table','glmmTMB','MASS','cplm','pscl'), repos='http://cran.r-project.org')"
Install the MaAsLin2 package (only required if running as an R function):
- $ R CMD INSTALL maaslin2

From R

To install the latest release version of MaAsLin 2:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("Maaslin2")

To install the latest development version of MaAsLin 2:

install.packages("devtools")
library("devtools")
install_github("biobakery/Maaslin2")

How to Run

MaAsLin2 can be run from the command line or as an R function. Both methods require the same arguments, have the same options, and use the same default settings.

Input Files

MaAsLin2 requires two input files.

Data (or features) file
- This file is tab-delimited.
- Formatted with features as columns and samples as rows.
- The transpose of this format is also okay.
- Possible features in this file include taxonomy or genes.
Metadata file
- This file is tab-delimited.
- Formatted with metadata variables as columns and samples as rows.
- The transpose of this format is also okay.
- Possible metadata in this file include gender or age.

The data file can contain samples that are not in the metadata file, and vice versa. In either case, samples that do not appear in both files are removed from the analysis. The samples do not need to be in the same order in the two files.

NOTE: If running MaAsLin2 as a function, the data and metadata inputs can be of type data.frame instead of a path to a file.
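
As a rough illustration of the sample matching described above, the following base R sketch builds two small data frames and keeps only the samples found in both. The table contents and the variable names (features, metadata, shared) are hypothetical, and the code only mimics the documented behavior; it is not MaAsLin2's internal implementation.

```r
# Hypothetical toy inputs: rows are samples, row names are sample IDs.
features <- data.frame(
  taxon_a = c(0.10, 0.25, 0.00),
  taxon_b = c(0.90, 0.75, 1.00),
  row.names = c("S1", "S2", "S3")
)
metadata <- data.frame(
  age = c(34, 51, 47),
  diagnosis = c("CD", "nonIBD", "UC"),
  row.names = c("S3", "S1", "S4")  # different order; S4 has no feature data
)

# Samples present in both tables; all others would be dropped.
shared <- intersect(rownames(features), rownames(metadata))

# Reorder both tables to the same sample order.
features_matched <- features[shared, , drop = FALSE]
metadata_matched <- metadata[shared, , drop = FALSE]

print(shared)
```

Data frames shaped like these (samples as rows, sample IDs as row names) are the kind of object the note above refers to when it says a data.frame can be passed instead of a file path.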

Output Files

MaAsLin2 generates two types of output files: data and visualization.

Data output files
- all_results.tsv
  - This includes the same data as the data.frame returned.
  - This file contains all results ordered by increasing q-value.
  - The first columns are the metadata and feature names.
  - The next two columns are the value and coefficient from the model.
  - The next column is the standard error from the model.
  - The N column is the total number of data points.
  - The N.not.zero column is the total of non-zero data points.
  - The pvalue from the calculation is the second to last column.
  - The qvalue is computed with p.adjust with the correction method.
- significant_results.tsv
  - This file is a subset of the results in the first file.
  - It only includes associations with q-values less than or equal to the significance threshold (max_significance).
- features
  - This folder includes the filtered, normalized, and transformed versions of the input feature table.
  - These steps are performed sequentially in the above order.
  - If an option is set such that a step does not change the data, the resulting table will still be output.
- models.rds
  - This file contains a list with every model fit object.
  - It will only be generated if save_models is set to TRUE.
- residuals.rds
  - This file contains a data frame with residuals for each feature.
- fitted.rds
  - This file contains a data frame with fitted values for each feature.
- ranef.rds
  - This file contains a data frame with extracted random effects for each feature (if random effects are specified).
- maaslin2.log
  - This file contains all log information for the run.
  - It includes all settings, warnings, errors, and steps run.
Visualization output files
- heatmap.pdf
  - This file contains a heatmap of the significant associations.
- [a-z/0-9]+.pdf
  - A plot is generated for each significant association.
  - Scatter plots are used for continuous metadata.
  - Box plots are for categorical data.
  - Data points plotted are after filtering but prior to normalization and transform.
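
The relationship between all_results.tsv and significant_results.tsv can be sketched in base R. The p-values below are made up for illustration, and the column names pval and qval are assumptions; the sketch only shows how p.adjust with the BH method yields q-values and how the threshold selects the significant subset.

```r
# Hypothetical p-values for six feature-metadata associations.
pvalues <- c(0.001, 0.02, 0.03, 0.20, 0.45, 0.80)

# Benjamini-Hochberg adjustment, the documented default correction (BH).
qvalues <- p.adjust(pvalues, method = "BH")

# All results, ordered by increasing q-value.
results <- data.frame(pval = pvalues, qval = qvalues)
results <- results[order(results$qval), ]

# Keep associations at or below the documented default threshold.
max_significance <- 0.25
significant <- results[results$qval <= max_significance, ]
print(significant)
```

With these toy values, three of the six associations fall at or below the 0.25 threshold.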

Run a Demo

Example input files can be found in the inst/extdata folder of the MaAsLin2 source. The files provided were generated from the HMP2 data which can be downloaded from https://ibdmdb.org/ .

HMP2_taxonomy.tsv is a tab-delimited file with species as columns and samples as rows. It is a subset of the taxonomy file so it just includes the species abundances for all samples.

HMP2_metadata.tsv: is a tab-delimited file with samples as rows and metadata as columns. It is a subset of the metadata file so that it just includes some of the fields.

Command line

$ Maaslin2.R --fixed_effects="diagnosis,dysbiosisnonIBD,dysbiosisUC,dysbiosisCD,antibiotics,age" --random_effects="site,subject" --standardize=FALSE inst/extdata/HMP2_taxonomy.tsv inst/extdata/HMP2_metadata.tsv demo_output

Make sure to provide the full path to the MaAsLin2 executable (i.e., ./R/Maaslin2.R).
In the demo command:
- HMP2_taxonomy.tsv is the path to your data (or features) file
- HMP2_metadata.tsv is the path to your metadata file
- demo_output is the path to the folder to write the output

In R

library(Maaslin2)

# Locate the example input files shipped with the package
input_data <- system.file(
    'extdata', 'HMP2_taxonomy.tsv', package = "Maaslin2")
input_metadata <- system.file(
    'extdata', 'HMP2_metadata.tsv', package = "Maaslin2")

# Fit the models and write all outputs to the demo_output folder
fit_data <- Maaslin2(
    input_data, input_metadata, 'demo_output',
    fixed_effects = c('diagnosis', 'dysbiosisnonIBD', 'dysbiosisUC', 'dysbiosisCD', 'antibiotics', 'age'),
    random_effects = c('site', 'subject'),
    reference = c("diagnosis,CD"),
    standardize = FALSE, cores = 1)
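
The reference argument in the demo names the baseline level of a categorical variable. What a baseline level means can be illustrated in base R with factor and relevel. This is a sketch of the general idea with a hypothetical diagnosis vector, not a description of how MaAsLin2 handles the argument internally.

```r
# Hypothetical metadata column with three levels.
diagnosis <- c("UC", "CD", "nonIBD", "CD", "nonIBD")

# By default R sorts factor levels, so "CD" becomes the baseline here.
default_factor <- factor(diagnosis)
print(levels(default_factor)[1])

# Choosing the baseline explicitly: model coefficients for the other
# levels would then be expressed relative to nonIBD.
releveled <- relevel(default_factor, ref = "nonIBD")
print(levels(releveled)[1])

# The same choice written in the 'variable,reference' string form.
reference <- paste("diagnosis", levels(releveled)[1], sep = ",")
print(reference)
```

A string built this way has the same shape as the reference value used in the demo call.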

Session Info

Session info from running the demo in R can be displayed with the following command.

sessionInfo()

Options

Run MaAsLin2 help to print a list of the options and the default settings.

$ Maaslin2.R --help
Usage: ./R/Maaslin2.R [options] <data.tsv> <metadata.tsv> <output_folder>

Options:
-h, --help
    Show this help message and exit

-a MIN_ABUNDANCE, --min_abundance=MIN_ABUNDANCE
    The minimum abundance for each feature [ Default: 0 ]   

-p MIN_PREVALENCE, --min_prevalence=MIN_PREVALENCE
    The minimum percent of samples for which a feature 
    is detected at minimum abundance [ Default: 0.1 ]

-b MIN_VARIANCE, --min_variance=MIN_VARIANCE
   Keep features with variance greater than [ Default: 0.0 ]

-s MAX_SIGNIFICANCE, --max_significance=MAX_SIGNIFICANCE
    The q-value threshold for significance [ Default: 0.25 ]

-n NORMALIZATION, --normalization=NORMALIZATION
    The normalization method to apply [ Default: TSS ]
    [ Choices: TSS, CLR, CSS, NONE, TMM ]

-t TRANSFORM, --transform=TRANSFORM
    The transform to apply [ Default: LOG ]
    [ Choices: LOG, LOGIT, AST, NONE ]

-m ANALYSIS_METHOD, --analysis_method=ANALYSIS_METHOD
    The analysis method to apply [ Default: LM ]
    [ Choices: LM, CPLM, NEGBIN, ZINB ]

-r RANDOM_EFFECTS, --random_effects=RANDOM_EFFECTS
    The random effects for the model, comma-delimited
    for multiple effects [ Default: none ]

-f FIXED_EFFECTS, --fixed_effects=FIXED_EFFECTS
    The fixed effects for the model, comma-delimited
    for multiple effects [ Default: all ]

-c CORRECTION, --correction=CORRECTION
    The correction method for computing the 
    q-value [ Default: BH ]

-z STANDARDIZE, --standardize=STANDARDIZE
    Apply z-score so continuous metadata are 
    on the same scale [ Default: TRUE ]

-l PLOT_HEATMAP, --plot_heatmap=PLOT_HEATMAP
    Generate a heatmap for the significant 
    associations [ Default: TRUE ]

-i HEATMAP_FIRST_N, --heatmap_first_n=HEATMAP_FIRST_N
    In heatmap, plot top N features with significant 
    associations [ Default: 50 ]

-o PLOT_SCATTER, --plot_scatter=PLOT_SCATTER
    Generate scatter plots for the significant
    associations [ Default: TRUE ]
    
-g MAX_PNGS, --max_pngs=MAX_PNGS
    The maximum number of scatter plots for significant associations 
    to save as png files [ Default: 10 ]

-O SAVE_SCATTER, --save_scatter=SAVE_SCATTER
    Save all scatter plot ggplot objects
    to an RData file [ Default: FALSE ]

-e CORES, --cores=CORES
    The number of R processes to run in parallel
    [ Default: 1 ]
    
-j SAVE_MODELS, --save_models=SAVE_MODELS
    Return the full model outputs and save to an RData file
    [ Default: FALSE ]

-d REFERENCE, --reference=REFERENCE
    The factor to use as a reference level for a categorical variable 
    provided as a string of 'variable,reference', semi-colon delimited for 
    multiple variables. Not required if metadata is passed as a factor or 
    for variables with two or fewer levels but can be set regardless.
    [ Default: NA ]
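
To make the min_abundance, min_prevalence, normalization, and transform options more concrete, here is a minimal base R sketch on a made-up count table. Several details are assumptions rather than confirmed MaAsLin2 behavior: the order of the steps (filter, then normalize, then transform), the exact prevalence rule, and the pseudo-count used before taking logarithms.

```r
# Hypothetical count table: rows are samples, columns are features.
counts <- matrix(
  c(10, 0, 90,
    20, 0, 80,
     0, 5, 95,
    30, 0, 70),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("S", 1:4), c("f1", "f2", "f3"))
)

min_abundance <- 0
min_prevalence <- 0.5  # illustrative value; the documented default is 0.1

# Prevalence filter (assumed rule): keep features detected above
# min_abundance in more than min_prevalence of the samples.
prevalence <- colMeans(counts > min_abundance)
filtered <- counts[, prevalence > min_prevalence, drop = FALSE]

# TSS normalization: divide each sample (row) by its total.
tss <- filtered / rowSums(filtered)

# LOG transform: log2 after replacing zeros with half the smallest
# non-zero value (this pseudo-count is an assumption).
pseudo <- min(tss[tss > 0]) / 2
transformed <- log2(replace(tss, tss == 0, pseudo))

print(colnames(filtered))
print(round(transformed, 3))
```

In this toy table, f2 is detected in only one of four samples, so it is dropped by the prevalence filter before normalization.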

Contributions

Thanks go to these wonderful people:

Nick Waters
- Design of the PR and attribution process

Troubleshooting

Question: When I run from the command line I see the error Maaslin2.R: command not found. How do I fix this?
- Answer: Provide the full path to the executable when running Maaslin2.R.
Question: When I run as a function I see the error Error in library(Maaslin2): there is no package called 'Maaslin2'. How do I fix this?
- Answer: Install the R package and then try loading the library again.
Question: When I try to install the R package I see errors about dependencies not being installed. Why is this?
- Answer: Installing the R package will not automatically install the packages MaAsLin2 requires. Please install the dependencies and then install the MaAsLin2 R package.

maaslin2's Issues

Include version in log file?

It would be really nice if Maaslin2.log could include the version of Maaslin used as one of the lines in the log.

Edit Maaslin2 plot font sizes

Hello all,

I was wondering if there are any options available for reducing or adjusting the font size in maaslin2 plots.

Thanks!

error while running maaslin2 in R

Hi all,

Here is the error I get when I run maaslin2 in R. I was able to run it in Galaxy using a pcl file, though. I don't know why it says the values are character.

fit_data = Maaslin2(pathabundance_relab, metadata_t, "results/2018-06-29-seq-QC-and-trimming/HUMANn2/pathabundance/MaAsLin")
[1] "Warning: Deleting existing log file: results/2018-06-29-seq-QC-and-trimming/HUMANn2/pathabundance/MaAsLin/maaslin2.log"
2019-07-22 18:43:54 INFO::Writing function arguments to log file
2019-07-22 18:43:54 INFO::Verifying options selected are valid
2019-07-22 18:43:54 INFO::Determining format of input files
2019-07-22 18:43:54 INFO::Input format is data samples as columns and metadata samples as columns
2019-07-22 18:43:55 INFO::Formula for fixed effects: expr ~ SampleID + FeedingType + DeliveryMode + Comments
2019-07-22 18:43:55 INFO::Running selected normalization method: TSS
Error in FUN(newX[, i], ...) : invalid 'type' (character) of argument
In addition: Warning message:
In vegan::decostand(features_norm, method = "total", MARGIN = 1, :
input data contains negative entries: result may be non-sense

Order changes of category variables in x-axis

Thanks for a good tool.
I have four factors in a predictive variable; Q1, Q2, Q3, and Q4.
I've set Q1 as a reference.
However, I found that the order of the x-axis labels was Q1, Q3, Q2, and Q4 in scatter plots.
I tried relabeling the levels as A, B, C, and D, but the problem persisted: the order became A, C, B, and D.
Could you check this situation?
The version of Maaslin2 I used is 1.8.0 in R.

Confusion over factors of the "value" column in significant_results.tsv

Hi,

I have been running Maaslin for a while and the values in my "value" column from the significant_results.tsv were always the different factors of the column. Recently, for some reason, the values are now changed to 1, 2, 3... instead. I don't know what the values 1, 2, 3 represent. I reran the code from before (where I got the actual names of my factors), however, am still getting this issue.

For example, this is an example of my code:

fit_data = Maaslin2(
input_data = input_data,
input_metadata = input_metadata,
normalization = "CSS",
standardize = FALSE ,
transform = "NONE",
analysis_method = "NEGBIN" ,
max_significance = 0.05,
output = "xxx",
fixed_effects = c("Sample"),
correction = "BH",
reference = c("Sample,aaa"),
min_abundance = 0,
min_prevalence = 0,
heatmap = TRUE,
plot_scatter = TRUE)

My four factors in Sample are aaa, bbb, ccc, ddd. aaa is my reference.

In the past, the "value" column of the significant_results.tsv would be:

value
bbb
bbb
ccc
ddd
bbb
ccc

Now, when I run Maaslin2, the value I get is:

value
2
2
3
4
2
3

Is there something I can change to return to bbb, ccc, ddd instead of 2, 3, 4?

I hope my explanation isn't confusing! Thank you for the great tool!

Carmen

NOTE: Instead of opening issues in github, please consider creating a new topic in https://forum.biobakery.org/. Please read for more details.

The bioBakery support forum provides software support and tutorials for methods for microbial community profiling developed by the Huttenhower lab. Please consider creating a new topic in the bioBakery support forum before opening issues in Github.

xtfrm.data.frame issue

I was running the ready-made example in the function Maaslin2 help page:

input_data <- system.file(
             'extdata','HMP2_taxonomy.tsv', package="Maaslin2")

input_metadata <-system.file(
             'extdata','HMP2_metadata.tsv', package="Maaslin2")

fit_data <- Maaslin2(
             input_data, input_metadata,'demo_output', transform = "AST",
             fixed_effects = c('diagnosis', 'dysbiosisnonIBD','dysbiosisUC','dysbiosisCD', 'antibiotics', 'age'),
             random_effects = c('site', 'subject'),
             normalization = 'NONE',
             reference = 'diagnosis,nonIBD',
             standardize = FALSE)

This leads to the following error:

....
2023-04-28 10:41:54.896843 INFO::Writing heatmap of significant results to file: demo_output/heatmap.pdf
Error in xtfrm.data.frame(x) : cannot xtfrm data frames
In addition: Warning messages:
1: Model failed to converge with 1 negative eigenvalue: -5.6e+00 
2: Model failed to converge with 1 negative eigenvalue: -1.1e+01 
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv,  :
  Model failed to converge with max|grad| = 0.00291214 (tol = 0.002, component 1)
4: Model failed to converge with 1 negative eigenvalue: -2.1e+02 
5: Model failed to converge with 1 negative eigenvalue: -2.2e+02

Information my R session:

> sessionInfo()
R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.2 LTS

Matrix products: default
BLAS:   /home/xxx/bin/R-4.3.0/lib/libRblas.so 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Mariehamn
tzcode source: system (glibc)

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] doRNG_1.8.6                     rngtools_1.5.2                 
 [3] foreach_1.5.2                   ANCOMBC_2.2.0                  
 [5] lubridate_1.9.2                 forcats_1.0.0                  
 [7] stringr_1.5.0                   dplyr_1.1.2                    
 [9] purrr_1.0.1                     readr_2.1.4                    
[11] tidyr_1.3.0                     tibble_3.2.1                   
[13] ggplot2_3.4.2                   tidyverse_2.0.0                
[15] knitr_1.42                      MicrobiomeStat_1.1             
[17] Maaslin2_1.13.0                 ALDEx2_1.32.0                  
[19] zCompositions_1.4.0-1           truncnorm_1.0-9                
[21] NADA_1.6-1.1                    survival_3.5-5                 
[23] MASS_7.3-59                     tidySummarizedExperiment_1.10.0
[25] patchwork_1.1.2.9000            mia_1.8.0                      
[27] MultiAssayExperiment_1.26.0     TreeSummarizedExperiment_2.8.0 
[29] Biostrings_2.68.0               XVector_0.40.0                 
[31] SingleCellExperiment_1.22.0     SummarizedExperiment_1.30.0    
[33] Biobase_2.60.0                  GenomicRanges_1.52.0           
[35] GenomeInfoDb_1.36.0             IRanges_2.34.0                 
[37] S4Vectors_0.38.0                BiocGenerics_0.46.0            
[39] MatrixGenerics_1.12.0           matrixStats_0.63.0             
[41] BiocStyle_2.28.0                rebook_1.9.0                   

loaded via a namespace (and not attached):
  [1] bitops_1.0-7                DirichletMultinomial_1.42.0
  [3] doParallel_1.0.17           httr_1.4.5                 
  [5] numDeriv_2016.8-1.1         backports_1.4.1            
  [7] tools_4.3.0                 utf8_1.2.3                 
  [9] R6_2.5.1                    vegan_2.6-4                
 [11] lazyeval_0.2.2              mgcv_1.8-42                
 [13] rhdf5filters_1.12.0         permute_0.9-7              
 [15] withr_2.5.0                 gridExtra_2.3              
 [17] cli_3.6.1.9000              logging_0.10-108           
 [19] biglm_0.9-2.1               sandwich_3.0-2             
 [21] mvtnorm_1.1-3               robustbase_0.95-1          
 [23] pbapply_1.7-0               proxy_0.4-27               
 [25] yulab.utils_0.0.6           foreign_0.8-84             
 [27] scater_1.28.0               decontam_1.20.0            
 [29] readxl_1.4.2                rstudioapi_0.14            
 [31] RSQLite_2.3.1               generics_0.1.3             
 [33] Matrix_1.5-4                biomformat_1.28.0          
 [35] ggbeeswarm_0.7.1            fansi_1.0.4                
 [37] DescTools_0.99.48           DECIPHER_2.28.0            
 [39] lifecycle_1.0.3             multcomp_1.4-23            
 [41] yaml_2.3.7                  rhdf5_2.44.0               
 [43] grid_4.3.0                  blob_1.2.4                 
 [45] crayon_1.5.2                dir.expiry_1.8.0           
 [47] lattice_0.21-8              beachmat_2.16.0            
 [49] CodeDepends_0.6.5           pillar_1.9.0               
 [51] optparse_1.7.3              statip_0.2.3               
 [53] boot_1.3-28.1               gld_2.6.6                  
 [55] estimability_1.4.1          codetools_0.2-19           
 [57] glue_1.6.2                  data.table_1.14.8          
 [59] Rdpack_2.4                  vctrs_0.6.2                
 [61] treeio_1.24.0               cellranger_1.1.0           
 [63] gtable_0.3.3                cachem_1.0.7               
 [65] xfun_0.39                   rbibutils_2.2.13           
 [67] Rfast_2.0.7                 coda_0.19-4                
 [69] pcaPP_2.0-3                 modeest_2.4.0              
 [71] timeDate_4022.108           iterators_1.0.14           
 [73] statmod_1.5.0               gmp_0.7-1                  
 [75] TH.data_1.1-2               ellipsis_0.3.2             
 [77] nlme_3.1-162                phyloseq_1.44.0            
 [79] bit64_4.0.5                 filelock_1.0.2             
 [81] fBasics_4022.94             irlba_2.3.5.1              
 [83] vipor_0.4.5                 rpart_4.1.19               
 [85] colorspace_2.1-0            DBI_1.1.3                  
 [87] Hmisc_5.0-1                 nnet_7.3-18                
 [89] ade4_1.7-22                 Exact_3.2                  
 [91] tidyselect_1.2.0            emmeans_1.8.5              
 [93] timeSeries_4021.105         bit_4.0.5                  
 [95] compiler_4.3.0              graph_1.78.0               
 [97] htmlTable_2.4.1             BiocNeighbors_1.18.0       
 [99] expm_0.999-7                DelayedArray_0.25.0        
[101] plotly_4.10.1               checkmate_2.2.0            
[103] scales_1.2.1                DEoptimR_1.0-12            
[105] spatial_7.3-16              digest_0.6.31              
[107] minqa_1.2.5                 rmarkdown_2.21.3           
[109] base64enc_0.1-3             htmltools_0.5.5            
[111] pkgconfig_2.0.3             lme4_1.1-33                
[113] sparseMatrixStats_1.12.0    lpsymphony_1.28.0          
[115] stabledist_0.7-1            fastmap_1.1.1              
[117] rlang_1.1.0                 htmlwidgets_1.6.2          
[119] DelayedMatrixStats_1.22.0   energy_1.7-11              
[121] zoo_1.8-12                  jsonlite_1.8.4             
[123] BiocParallel_1.34.0         BiocSingular_1.16.0        
[125] RCurl_1.98-1.12             magrittr_2.0.3             
[127] Formula_1.2-5               scuttle_1.10.0             
[129] GenomeInfoDbData_1.2.10     Rhdf5lib_1.22.0            
[131] munsell_0.5.0               Rcpp_1.0.10                
[133] ape_5.7-1                   viridis_0.6.2              
[135] RcppZiggurat_0.1.6          CVXR_1.0-11                
[137] stringi_1.7.12              rootSolve_1.8.2.3          
[139] stable_1.1.6                zlibbioc_1.46.0            
[141] plyr_1.8.8                  parallel_4.3.0             
[143] ggrepel_0.9.3               lmom_2.9                   
[145] splines_4.3.0               hash_2.2.6.2               
[147] multtest_2.56.0             hms_1.1.3                  
[149] igraph_1.4.2                reshape2_1.4.4             
[151] ScaledMatrix_1.7.1          rmutil_1.1.10              
[153] XML_3.99-0.14               evaluate_0.20              
[155] BiocManager_1.30.20         nloptr_2.0.3               
[157] tzdb_0.3.0                  getopt_1.20.3              
[159] clue_0.3-64                 rsvd_1.0.5                 
[161] xtable_1.8-4                Rmpfr_0.9-2                
[163] e1071_1.7-13                tidytree_0.4.2             
[165] viridisLite_0.4.1           class_7.3-21               
[167] gsl_2.1-8                   lmerTest_3.1-3             
[169] memoise_2.0.1               beeswarm_0.4.0             
[171] cluster_2.1.4               timechange_0.2.0

Use weight in the Maaslin2

Hi. I have a weighted population, which means every person in my data may represent a different number of people. Is it possible to use weights in Maaslin2, like the "weights" parameter in the following command?

model1=glm(nodegree~treat,data=lalonde,family=binomial(),weights=iptw)

Install Maaslin2 in TSD

Hi! I am working within TSD, and to install packages I need to import the [.tar.gz] file into TSD and then install it within the "safe cloud". I downloaded it from here, https://bioconductor.org/packages/release/bioc/html/Maaslin2.html, but it is the 1.2.0 version. Can you share the most recent version.tar.gz with me? That would be much appreciated. Thanks,
Maria

Add options to choose colour and/or shape of point in scatterplot

Dear Maaslin2 developers,

using the function, everything looks fine and nice. Very useful indeed, Thanks!

I'm using far less data than in your example data sets, and I would find it extremely useful to have parameters that (optionally) map one of the metadata variables to point colour or shape in the scatter plots. It could help to better interpret the relation between microbes and continuous data.
An option to plot labels could also be useful.

Best,
Francesco.

rCLR transformation

Hi,

Thanks for this wonderful tool.

I had an inquiry regarding the transformation implemented in the package.

For the CLR, the imputation is calculated as half the minimum feature value for each sample. Why did you not consider using the robust CLR? Did you find any issues with it in your testing?

Thanks!

Version numbers are inconsistent between Github and Bioconductor

I asked this on the forum and got no answer.

It is not clear where to find the "latest" version of Maaslin2. Installing from Github results in a v1.7.3, and installing from Bioconductor results in v1.10.0.

However the Bioconductor version is missing code that is > 9 months old according to the git blame (example: https://github.com/biobakery/Maaslin2/blame/master/R/Maaslin2.R#L974) .

This poses a problem when trying to update packages as it claims that Maaslin2 is outdated if you installed the Github repo. This would probably be fixed with simply making a new release on Github with a higher version than v1.10.0

Please provide the reference for the variable 'diagnosis' which includes more than 2 levels: UC, CD, nonIBD

Hi,
thank you for your package.
I've just done a small test and had the error:

fit_data = Maaslin2(

input_data = input_data,
input_metadata = input_metadata,
output = "demo_output",
fixed_effects = c("diagnosis", "dysbiosis"))
[1] "Creating output folder"
[1] "Creating output figures folder"
2021-04-23 16:56:06 INFO::Writing function arguments to log file
2021-04-23 16:56:06 INFO::Verifying options selected are valid
2021-04-23 16:56:06 INFO::Determining format of input files
2021-04-23 16:56:06 INFO::Input format is data samples as rows and metadata samples as rows
2021-04-23 16:56:06 INFO::Formula for fixed effects: expr ~ diagnosis + dysbiosis
Error in Maaslin2(input_data = input_data, input_metadata = input_metadata, :
Please provide the reference for the variable 'diagnosis' which includes more than 2 levels: UC, CD, nonIBD

Can you let me know how I can resolve this error before I use my own data?
Thank you,
Virgg

a

Hello MaAsLin2 Users,

Issue with Windows system: cannot open file './Scratch/tmp_1/maaslin2.log'

Hi, I am trying to run Maaslin2 on Windows system using following command

Maaslin2(
input_data = otu.tab,
input_metadata = metadata,
output = 'C:\Users\lenovo\Desktop\tmp'
#transform = "AST",
fixed_effects = c('HBP','X1','PA', 'age', 'dietscore'),
random_effects = c('study'),
#normalization = 'NONE',
plot_heatmap = F,
plot_scatter = F,
standardize = FALSE)

Where it fails due to:

Error in file(file, ifelse(append, "a", "w")) :
cannot open the connection
In addition: Warning message:
In file(file, ifelse(append, "a", "w")) :
cannot open file './Scratch/tmp_1/maaslin2.log': No such file or directory

However, it works if I run it on a Linux system, so I am wondering whether Maaslin2 cannot run on Windows.

ERROR: dependency 'lpsymphony' is not available for package 'Maaslin2'

Hi,

when installing with:

if(!requireNamespace("BiocManager", quietly = TRUE))
	install.packages("BiocManager")
BiocManager::install("Maaslin2")

I get the following error:

ERROR: dependency 'lpsymphony' is not available for package 'Maaslin2'

R version 4.2.2

Thanks,

Theo

Add `scatter_first_n`

Could you add scatter_first_n (particularly given that heatmap_first_n exists)?

Thanks,
Liam

"Please provide the reference for the variable" error when running Maaslin2

Hello!

I am trying to run Maaslin2 with the code:

input_data = read.table(file = "4Masslin2_input.data_kos.taxonomy.archaea.mt.2group.tsv",
                        header = TRUE, sep = "\t")
rownames(input_data) <- input_data$Geneid_ord
input_data$Geneid_ord = NULL

metadata = read.table(file = "4Masslin2_metadata_kos.taxonomy.archaea.mt.2group.tsv",
                      header = TRUE, sep = "\t")
rownames(metadata) <- metadata$Geneid_ord
metadata$Geneid_ord = NULL

# Create the 'Ctrl' column
metadata$Ctrl <- ifelse(metadata$Diagnosis == "Ctrl", "Yes", "No")

# Create the 'PD' column
metadata$PD <- ifelse(metadata$Diagnosis == "PD", "Yes", "No")

# Create the 'iRBD' column
metadata$iRBD <- ifelse(metadata$Diagnosis == "iRBD", "Yes", "No")

reference <- unique(metadata$S)
reference <- c("Methanobrevibacter_A smithii","Methanobrevibacter_A smithii_A","Methanosphaera stadtmanae","Methanomethylophilus alvus","DTU008 sp001421185","Methanomassiliicoccus luminyensis","MX-02 sp006954405","Coprobacillus cateniformis","Methanobrevibacter_C arboriphilus_A","Methanosphaera cuniculi")

Maaslin2(input_data = input_data,
         input_metadata = metadata,
         fixed_effects = c("Ctrl", "PD", "iRBD", "S"),
         reference = reference,
         min_prevalence = 0,
         output = "test",
         transform = "LOG",
         plot_heatmap = TRUE,
         plot_scatter = TRUE,
         heatmap_first_n = 50,
         max_significance = 1)

Examples of my metadata and input data are below:

metadata:

         Diagnosis       D                 P               C                       O                       F                    G
K00053_1      Ctrl Archaea Methanobacteriota Methanobacteria      Methanobacteriales     Methanobacteriaceae Methanobrevibacter_A
K00053_2      Ctrl Archaea Methanobacteriota Methanobacteria      Methanobacteriales     Methanobacteriaceae Methanobrevibacter_A
K00053_3      Ctrl Archaea Methanobacteriota Methanobacteria      Methanobacteriales     Methanobacteriaceae       Methanosphaera
K00053_4      Ctrl Archaea  Thermoplasmatota  Thermoplasmata Methanomassiliicoccales Methanomethylophilaceae Methanomethylophilus
K00053_5        PD Archaea Methanobacteriota Methanobacteria      Methanobacteriales     Methanobacteriaceae Methanobrevibacter_A
K00053_6        PD Archaea Methanobacteriota Methanobacteria      Methanobacteriales     Methanobacteriaceae Methanobrevibacter_A
                                      S Ctrl  PD iRBD
K00053_1   Methanobrevibacter_A smithii  Yes  No   No
K00053_2 Methanobrevibacter_A smithii_A  Yes  No   No
K00053_3      Methanosphaera stadtmanae  Yes  No   No
K00053_4     Methanomethylophilus alvus  Yes  No   No
K00053_5   Methanobrevibacter_A smithii   No Yes   No
K00053_6 Methanobrevibacter_A smithii_A   No Yes   No

input_data:

                tpm
K00053_1 166.502489
K00053_2 188.409788
K00053_3  69.970092
K00053_4   2.219452
K00053_5 642.522944
K00053_6 136.308126

As a result I receive an error:

2023-05-11 17:25:04 INFO::Writing function arguments to log file
2023-05-11 17:25:04 INFO::Verifying options selected are valid
2023-05-11 17:25:04 INFO::Determining format of input files
2023-05-11 17:25:04 INFO::Input format is data samples as rows and metadata samples as rows
2023-05-11 17:25:04 INFO::Formula for fixed effects: expr ~  Ctrl + PD + iRBD + S
Error in Maaslin2(input_data = input_data, input_metadata = metadata,  : 
  Please provide the reference for the variable 'S' which includes more than 2 levels: Methanobrevibacter_A smithii, Methanobrevibacter_A smithii_A, Methanosphaera stadtmanae, Methanomethylophilus alvus, Methanomassiliicoccus_A intestinalis, UBA71 sp905187815, DTU008 sp001421185, Methanomassiliicoccus luminyensis, MX-02 sp006954405, Coprobacillus cateniformis, Methanobrevibacter_C arboriphilus_A, Methanosphaera cuniculi, Methanobrevibacter ruminantium_A.

Could you please suggest the likely source of this error and a way to fix it?
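For context, the MaAsLin2 manual describes the `reference` option as a string of 'variable,reference' pairs, semicolon-delimited when several variables need a baseline. The sketch below builds such a string in base R. The helper name `make_reference` is hypothetical, and the baseline level picked for `S` is only an assumption for illustration; any level listed in the error message could be used instead.

```r
# Hypothetical helper: build a "variable,reference" string from a named
# character vector (names = metadata variables, values = baseline levels).
# Multiple variables are joined with semicolons.
make_reference <- function(levels_by_variable) {
  pairs <- paste(names(levels_by_variable), levels_by_variable, sep = ",")
  paste(pairs, collapse = ";")
}

# Assumed baseline for 'S'; pick whichever level suits the study design.
reference <- make_reference(c(S = "Methanobrevibacter_A smithii"))
print(reference)
```

The resulting string would then be passed as the `reference` argument in the call above.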

Update maaslin2 bioconda recipe

maaslin2 is a great tool, and we are planning to include it in Galaxy and use it in trainings focused on comparative analysis.
There is already a wrapper for it, but it needs to be fixed and updated: https://github.com/galaxyproject/tools-iuc/tree/main/tools/maaslin2 - we will do that.
Could you update the bioconda recipe to the newest release? https://github.com/bioconda/bioconda-recipes/blob/master/recipes/maaslin2/meta.yaml
That way we can also bump the version in the Galaxy wrapper and provide users with the newest version.

If you would like to be involved in the planned training or can provide specific scenarios (longitudinal, fixed-effects ... ) you would like to see in the training, please feel free to reach out as well.

Filtering of features by abundance

Hi,

I'm having issues with filtering by abundance when using different normalization methods. Maaslin2 seems to normalize the data first and then filter, but it is hard to choose an abundance cut-off for normalized data. It would make more sense to determine which features need to be filtered, then normalize, and then filter.

An example with the test data provided by Maaslin2:

library(Maaslin2)

input_data <- system.file('extdata','HMP2_taxonomy.tsv', package="Maaslin2")

input_metadata <- system.file('extdata','HMP2_metadata.tsv', package="Maaslin2")

Model_1 <- Maaslin2(input_data, 
                    input_metadata, 
                    "/ebio/abt3_projects/small_projects/jdelacuesta/scratchpad", 
                    normalization = "CLR",
                    fixed_effects = c('diagnosis', 'dysbiosisnonIBD','dysbiosisUC','dysbiosisCD', 'antibiotics', 'age'),
                    random_effects = c('site', 'subject'),
                    standardize = FALSE)

Model_2 <- Maaslin2(input_data, 
                    input_metadata, 
                    "/ebio/abt3_projects/small_projects/jdelacuesta/scratchpad", 
                    normalization = "NONE",
                    fixed_effects = c('diagnosis', 'dysbiosisnonIBD','dysbiosisUC','dysbiosisCD', 'antibiotics', 'age'),
                    random_effects = c('site', 'subject'),
                    standardize = FALSE)

Using CLR normalization results in a different number of filtered features:

From Model_1

2020-03-12 15:58:04 INFO::Writing function arguments to log file
2020-03-12 15:58:04 INFO::Verifying options selected are valid
2020-03-12 15:58:04 INFO::Determining format of input files
2020-03-12 15:58:04 INFO::Input format is data samples as rows and metadata samples as rows
2020-03-12 15:58:04 INFO::Formula for random effects: expr ~ (1 | site) + (1 | subject)
2020-03-12 15:58:04 INFO::Formula for fixed effects: expr ~  diagnosis + dysbiosisnonIBD + dysbiosisUC + dysbiosisCD + antibiotics + age
2020-03-12 15:58:04 INFO::Running selected normalization method: CLR
2020-03-12 15:58:04 INFO::Filter data based on min abundance and min prevalence
2020-03-12 15:58:04 INFO::Total samples in data: 1595
2020-03-12 15:58:04 INFO::Min samples required with min abundance for a feature not to be filtered: 159.500000
2020-03-12 15:58:04 INFO::Total filtered features: 51

From Model_2

2020-03-12 15:58:30 INFO::Writing function arguments to log file
2020-03-12 15:58:30 INFO::Verifying options selected are valid
2020-03-12 15:58:30 INFO::Determining format of input files
2020-03-12 15:58:30 INFO::Input format is data samples as rows and metadata samples as rows
2020-03-12 15:58:30 INFO::Formula for random effects: expr ~ (1 | site) + (1 | subject)
2020-03-12 15:58:30 INFO::Formula for fixed effects: expr ~  diagnosis + dysbiosisnonIBD + dysbiosisUC + dysbiosisCD + antibiotics + age
2020-03-12 15:58:30 INFO::Running selected normalization method: NONE
2020-03-12 15:58:30 INFO::Filter data based on min abundance and min prevalence
2020-03-12 15:58:30 INFO::Total samples in data: 1595
2020-03-12 15:58:30 INFO::Min samples required with min abundance for a feature not to be filtered: 159.500000
2020-03-12 15:58:30 INFO::Total filtered features: 0
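
The ordering proposed above (decide on the raw scale, normalize, then drop) can be sketched in base R outside of Maaslin2. This is a minimal illustration on a made-up table, not Maaslin2's internal code; the `clr` helper and its pseudocount of 1 are assumptions chosen for the example.

```r
# Toy abundance table: samples as rows, features as columns (made-up values).
abund <- matrix(
  c(10, 0, 5,
    20, 0, 1,
    30, 2, 0,
    40, 0, 3),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("s", 1:4), c("f1", "f2", "f3"))
)

min_abundance  <- 0    # threshold applied on the raw scale
min_prevalence <- 0.5  # fraction of samples, not a percentage

# Step 1: decide which features to keep using the raw values.
prevalence <- colMeans(abund > min_abundance)
keep <- prevalence >= min_prevalence

# Step 2: CLR-normalize all features (pseudocount of 1 is an assumption).
clr <- function(x) {
  logx <- log(x + 1)
  logx - rowMeans(logx)  # subtract each sample's mean log abundance
}
normalized <- clr(abund)

# Step 3: drop the features flagged in step 1.
filtered <- normalized[, keep, drop = FALSE]
print(colnames(filtered))
```

Because the keep/drop decision is made before normalization, the same features are removed regardless of which normalization method follows.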

Clarification for min_prevalence option

Hello,

It might be helpful to adjust the description of the min_prevalence option from "The minimum percent of samples for which a feature is detected at minimum abundance" to "The minimum proportion (fraction) of samples for which a feature is detected at minimum abundance".

The way it is worded now, it sounds like the default value "min_prevalence = 0.1" is a percentage, but it actually means 10%, not 0.1 percent.
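
A quick check of the arithmetic, using the sample count from the HMP2 log output quoted in the filtering issue above (1595 samples). The variable names are only illustrative.

```r
n_samples      <- 1595  # "Total samples in data" from the quoted log
min_prevalence <- 0.1   # default value

# Read as a fraction: minimum number of samples a feature must appear in.
as_fraction <- n_samples * min_prevalence
# Read as a percentage (0.1 percent): a much smaller threshold.
as_percent <- n_samples * min_prevalence / 100

print(as_fraction)
print(as_percent)
```

Only the fraction reading reproduces the 159.5 samples reported in that log.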