Git Product home page Git Product logo

microbiomestat's Introduction

Caffery's GitHub stats Top Langs

📊 Statistics PhD Student | 🧬 Multi-Omics Analyst | 🖥 R Package Developer

Hello! I'm Caffery Yang👋

I am thrilled to be joining the Statistics PhD program at Texas A&M University in Fall 2024. As a recent graduate from Southern Medical University, where I majored in Biostatistics and achieved a 3.95/4.00 GPA (top 2% of my class), I am currently visiting Jian Yang's Lab at Westlake University to further enrich my research experience before starting my doctoral studies.

🎖 Achievements

  • Received the National Scholarship from the Ministry of Education of China.
  • Secured First Prize in the National College Student Mathematics Competition.
  • Honored with Meritorious Winner (M Award) in the Mathematical Contest in Modeling (MCM) in the USA.

🧪 Research Interests

Under the guidance of Prof. Jun Chen at Mayo Clinic and Prof. Liangliang Zhang at Case Western Reserve University, my research primarily revolves around microbiome data analysis.

  • Developed the MicrobiomeStat R package, writing over 17,000 lines of R code. Comprehensive documentation is available on the package wiki, which contains in-depth guidance and tutorials exceeding 20,000 words. Additionally, an interactive MicrobiomeStat Shiny application has been created to provide a user-friendly platform for longitudinal statistical analysis and visualization of microbiome data.
  • Pioneered the MicrobiomeGallery collaborative platform.
  • Independently developed the ggpicrust2 R package, which amassed 85 stars on GitHub and over 11,000 downloads on CRAN.

My future vision embraces a multi-omics perspective, synthesizing insights from genomics, proteomics, transcriptomics, and more.

📚 Publications

  • Sole first author on a paper accepted by Bioinformatics.
  • Preparing to submit another paper as sole first author to Microbiome.

🤝 Collaboration

I am extremely open to collaborations. If you have any projects related to microbiomics or multi-omics and think we could work together, please don't hesitate to reach out. In addition to being open to new projects, I am particularly enthusiastic about assisting others with their microbiome/multi-omics data analysis. I'm also keen on collaborating with researchers and authors who have utilized my pipelines, such as ggpicrust2 or MicrobiomeStat, in their work. If you are considering or have already used these tools in your paper, I would be thrilled to explore the possibility of co-authorship. Whether it's contributing to data analysis, interpretation, or manuscript preparation, I am ready to bring my expertise to your research. Let's connect and see how we can advance the field of microbiomics together!

🛠 Projects

1. Application of LLM in Microbiome Taxonomy

Claude's innovative approach using the entire NCBI database with the LLM offers direct taxonomic insights from FASTQ sequences. A potential game-changer when compared with traditional methods.

2. Performance of Microbiome DA Methods in Other Omics

Exploring how microbiome DA methods, known for processing the compositional nature of microbiome data, perform on other omics.

3. Generative AI in Microbiome Data Analysis and Accuracy Assessment

Assessing the LLM model's capability as a microbiome data analysis expert against real-world experts.

4. PI_find

A tool designed to assist graduate school aspirants in discovering PI names from specific professional journals.

5. Fine-Tuning Platform for Bioinformatics Repositories

Aiming to develop a platform to fine-tune ChatGPT with bioinformatics repositories, enhancing query results for newer tools.

📫 Connect with me

Twitter: CafferyYang

microbiomestat's People

Contributors

cafferychen777 avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

microbiomestat's Issues

generate_taxa_test_pair() issue with time.var = NULL in random effect

Dear developer,
I am encountering an issue when using the generate_taxa_test_pair() function to test for random effect.

When I use the basic linda function as below it works:

linda.obj.all <- linda(asv_table[, ind], meta[ind, ], formula = '~SampleType1 + (1|SSF)',
feature.dat.type = 'count',
prev.filter = 0.1,
mean.abund.filter = 0.004 ,
is.winsor = TRUE, outlier.pct = 0.03,
p.adj.method = "BH", alpha = 0.05)

However, the generate_taxa_test_pair() that should work the same but used as below gives me an error:
#create test.list
test.list <- generate_taxa_test_pair(
data.obj = data.obj,
group.var = "sampletype1", adj.vars = NULL,
subject.var = "SSF", time.var = NULL,
feature.dat.type = "count",
feature.level = "Genus",
prev.filter = 0.1,
abund.filter = 0.004)

Output:
Rule 1 passed: data.obj is a list.
Rule 2 passed: meta.dat is a data.frame.
Rule 3 passed: The row names of feature.tab match the row names of feature.ann.
Rule 4 passed: The column names of feature.tab match the row names of meta.dat.
Rule 5 passed: feature.tab is a matrix.
Rule 6 passed: feature.ann is a matrix.
Please note: The data components should follow base R data.frame and matrix structures, not phyloseq's formal class.
Validation passed.
Note: Passing validation does not guarantee the absence of all data issues. Further data exploration may be needed.
Your data is in raw format ('Raw'). Normalization is crucial for further analyses. Now, 'mStat_normalize_data' function is automatically applying 'TSS' transformation.
Data has been successfully normalized using TSS method.
0 features are filtered!

The filtered data has 98 samples and 49 features that will be tested!

Fit linear mixed effects models ...
Completed.
Error in grepl(pattern = time.var, x = df_name) :
invalid 'pattern' argument

The generate_taxa_test_pair() function seems designed to optionally handle longitudinal data, given it accepts a time.var parameter, which I set to NULL. This design choice suggests it should be capable of running analyses without a time component, but the implementation does not gracefully handle a NULL time.var.

I write here a piece of the core code of the generate_taxa_test_pair() function:

Case where time.var is NULL

  if (is.null(time.var)) {
    if (is.null(group.var)) {
      fixed_effects <- adj.vars_str
      if (is.null(fixed_effects)) {
        fixed_effects <- "1"  # Intercept-only model
      }
    } else {
      if (!is.null(adj.vars_str)) {
        fixed_effects <- paste(adj.vars_str, "+", group.var)
      } else {
        fixed_effects <- group.var
      }
    }
    random_effects <- paste("(1 |", subject.var, ")")

Do you have any solution for this issue?

Thank you in advance I hope I can improve to implement the use of this package.
Best Regards,
Valentina

Wiki Example Code Seems Outdated

In Feature-level Analysis, I see generate_taxa_volcano_single example but function is not present in version 1.2 on CRAN.

> library(MicrobiomeStat)
> generate_taxa_volcano_single
Error: object 'generate_taxa_volcano_single' not found

Also, is there plan to add MA plot to package? It would be a nice addiiton.

Error: number of observations (=44) <= number of random effects (=44) for term (1 + time | subject)

First of all thank you for a very useful package!
It would be great if you can give some clarifications regarding the following error:
Error in linda: Error in fun(i): task 1 failed - "number of observations (=44) <= number of random effects (=44) for term (1 + time | subject); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable"
This error occurs frequently when generate_taxa_test_pair() function is used, including dataset provided together with the MicrobiomeStat package(peerj32.obj).
Despite the error, the function produce results, however I am not sure if this results are reliable.
Can you please clarify this issue?

strata.var in mStat_generate_report_long is not optional right now

If strata.var=NULL, then the params_data has a different row amount. A line need to be added before the params_data:
custom_strata_status <- ifelse(is.null(strata.var), 'NULL', toString(strata.var))
and then strata.var in the params_data$Value changed to custom_strata_status.

GMPR transformation function is missing

Describe the Bug
When I try to normalise the data using the GMPR method, I get an error that says that GMPR function is missing

Reproducible Example
Sorry, to reproduce the data is too complicated. But given that MicrobiomeData is a MicrobiomeStat object...

MicrobiomeData.gmpr <- mStat_normalize_data(MicrobiomeData, "GMPR")

Expected Behavior
Outputs transformed values

Actual Behavior
Outputs and error:

Error in GMPR(otu_tab) : could not find function "GMPR"

Screenshots
N/A

Environment Information:

  • Operating System: Windows 11
  • R Version: 4.3.2
  • Package Version: 1.13

Additional Context
N/A

phyloseq to microbiomestat-object

Hi!
I am not able to use the "mStat_convert_phyloseq_to_data_obj". Any work around this?
I have loaded the libraries phyloseq, microbiomestat and microbiome.

Thanks,
Maria

Impossible to find the function "GUniFrac"

Describe the Bug
When I use the function generate_beta_change_test_long to calculate BC, Jaccard, UniFrac distances, I get the error "Impossible to find the function "GUniFrac""

Screenshots

error_unifrac

Environment Information:

  • R Version: 4.3.1
  • Package Version: 1.1.3

Attempted Solutions
I updated MicrobiomeStats, the error also triggers when using other functions like generate_beta_trend_test_long or generate_beta_volatility_test_long.
The data.obj contains a tree object (rooted), I managed to compute Unifrac distances with the ordinate function from phyloseq package.

generate_taxa_areaplot_long “No shared levels found between names(values)..."

Hi I'm trying to run the function generate_taxa_areaplot_long and get the below error:

Rule 1 passed: data.obj is a list.

Rule 2 passed: meta.dat is a data.frame.

Rule 3 passed: The row names of feature.tab match the row names of feature.ann.

Rule 4 passed: The column names of feature.tab match the row names of meta.dat.

Rule 5 passed: feature.tab is a matrix.

Rule 6 passed: feature.ann is a matrix.

Please note: The data components should follow base R data.frame and matrix structures, not phyloseq's formal class.

Validation passed.

Note: Passing validation does not guarantee the absence of all data issues. Further data exploration may be needed.

Warning message:
“No shared levels found between names(values) of the manual scale and the data's colour
values.”
Warning message:
“No shared levels found between names(values) of the manual scale and the data's colour
values.”

ps_to_ms <- mStat_convert_phyloseq_to_data_obj(ps)

generate_taxa_areaplot_long(
  data.obj = ps_to_ms,
  subject.var = "Sample.ID", 
  time.var = "Site_Num",
  group.var = "Site.Status",
  feature.level = "Family",
  feature.dat.type = "proportion",  
  feature.number = 10,
  base.size = 10,
  theme.choice = "bw",
  palette = NULL,
  pdf = TRUE,
  file.ann = NULL
)

A figure is printed with the Family key, but the actual area plot is empty with no taxonomy in the main figure.

Screenshots

Screenshot 2024-04-02 at 1 58 02 PM

Environment Information:

  • Operating System:
    R version 4.3.2 (2023-10-31)
    Platform: x86_64-apple-darwin13.4.0 (64-bit)
    Running under: macOS Big Sur ... 10.16

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] qiime2R_0.99.6 pivottabler_1.5.5 vegan_2.6-4
[4] lattice_0.22-5 permute_0.9-7 lubridate_1.9.3
[7] forcats_1.0.0 stringr_1.5.1 dplyr_1.1.4
[10] purrr_1.0.2 readr_2.1.4 tidyr_1.3.1
[13] tidyverse_2.0.0 microbiome_1.24.0 phyloseq_1.46.0
[16] ggplot2_3.5.0 MicrobiomeStat_1.1.3 tibble_3.2.1
[19] rlang_1.1.3

Any advice is greatly appreciated.

Maintain Sample Order Consistency in mStat_aggregate_by_taxonomy() Output

Current Feature
The current mStat_aggregate_by_taxonomy() outputs where the sample order of the table is different from the sample order in the input table. This behavior is not documented, which can lead to errors in subsequent analyses.

Proposed Enhancement
Modify the function to ensure that the sample order in the output table matches the sample order in the input table.

Benefits of the Enhancement
Ensuring the sample order consistency between the input and output tables will:

  • Prevent potential errors in subsequent analyses, such as incorrect calculations when combining results with the original table.
  • Improve the usability and reliability of the function for users.
  • Reduce the need for users to manually check and reorder samples, saving time and reducing the risk of human error.

If a user calculates a metric like the F/B ratio base on the outputs of mStat_aggregate_by_taxonomy(data.obj, feature.level = "phylum") and directly cbind()s this result to the original table without noticing the order change, the resulting table will have mismatched data.

Ensuring the output order matches the input order will help users avoid such pitfalls.

Thank you for considering this enhancement.

"$ operator is invalid for atomic vectors" when using mStat_import_biom_as_data_obj()

Describe the Bug
Hello! I've been using this package as a huge resource in my project. It's incredibly easy to use and the graphs look great. However, I ran into an issue with atomic vectors when creating a mStat object from a biom object.

Reproducible Example
I use the biomformat package to create my BiomData object as below:

BiomData <- make_biom(data = otu_df_biom,
                      sample_metadata = metadata_df,
                      observation_metadata = taxonomy_df_biom,
                      id = "axiom_data",
                      matrix_element_type = "int")
write_biom(BiomData, biom_file = "data_file.biom") #I truncated the filename, there is usually a full path

Then, I use the following code to create the mStat object:

biom_from_mStat <- mStat_import_biom_as_data_obj(BIOMfilename  = "data_file.biom")

Error Message
Error in i$metadata$taxonomy : $ operator is invalid for atomic vectors

Environment Information:

  • R Version: 4.2.3
  • Package Version: 1.1.3

`MicrobiomeStat::generate_beta_test_single` represents errors

Hello, when I ran the One-Click Reports Generation by using mStat_generate_report_single, it showed the error information.

in `colnames<-`:
! attempt to set 'colnames' on an object with less than two dimensions
Backtrace:
 1. MicrobiomeStat::generate_beta_test_single(...)
 2. GUniFrac::PermanovaG2(formula, data = data.obj$meta.dat)
 3. GUniFrac::adonis3(as.formula(paste("Y", "~", rhs)), data, ...)
 4. base::`colnames<-`(`*tmp*`, value = colnames(lhs))
                                                                                                                                                                                      
Error in `colnames<-`(`*tmp*`, value = colnames(lhs)) :

after debuging the code, the error actually was from GUniFrac::adonis3. I have no idea on how to solve it. Could you please give me some help?

The following codes were that I ran

data(peerj32.obj)

data.obj = peerj32.obj

# Specify variable names
group.var = "group" # Variable used for grouping samples, primary variable of interest
vis.adj.vars = c("sex") # Covariates whose effects need to be removed before visualization
test.adj.vars = c("sex") # Covariates whose effects need to be adjusted in statistical tests
subject.var = "subject" # Variable used for subject identification
time.var = "time" # Variable used for time points

# Specify diversity indices
alpha.name = c("shannon", "observed_species") # Alpha diversity indices to calculate
dist.name = c("BC",'Jaccard') # Beta diversity indices to calculate

# Specify feature levels for visualization and testing
vis.feature.level = c("Phylum", "Family", "Genus") # Feature levels for visualization to have an overview of the data
test.feature.level = "Family" # Feature level to use for testing

# Specify other parameters
feature.dat.type = "count" # Type of the feature data
theme.choice = "bw" # Theme choice for the plots
base.size = 20 # Base size for the plots

# Parameters for multiple testing. Following setting is just for illustration. In real data analysis, multiple testing correction should always be applied.  
feature.mt.method = "none" # Multiple testing method for differential feature analysis
feature.sig.level = 0.2 # Significance level cutoff for highlighting differential features 

# Specify output file
output.file = "mStat_generate_report_single_example.pdf" # Replace with your own file path for the output report

# Specify optional parameters
dist.obj = NULL # Replace with a pre-computed distance matrix if available
alpha.obj = NULL # Replace with a pre-computed alpha diversity matrix if available
depth = NULL # Replace with a desired rarefaction depth, if NULL, minimum depth will be used
t.level = "1" # Replace with a desired time level if time points have multiple levels
feature.box.axis.transform = "sqrt" # Axis transformation for feature boxplots
strata.var = "sex" # Variable to stratify on in visualization

# Specify parameters for feature retention
bar.area.feature.no = 20 # Number of top abundant features to retain in barplot and areaplot
heatmap.feature.no = 20 # Number of top abundant features to retain in heatmap
dotplot.feature.no = 40 # Number of top abundant features to retain in dotplot

# Run the function
mStat_generate_report_single(
   data.obj = data.obj,
   dist.obj = dist.obj,
   alpha.obj = alpha.obj,
   group.var = group.var,
   vis.adj.vars = vis.adj.vars,
   test.adj.vars = test.adj.vars,
   subject.var = subject.var,
   time.var = time.var,
   alpha.name = alpha.name,
   depth = depth,
   dist.name = dist.name,
   t.level = t.level,
   feature.box.axis.transform = feature.box.axis.transform,
   strata.var = strata.var,
   vis.feature.level = vis.feature.level,
   test.feature.level = test.feature.level,
   feature.dat.type = feature.dat.type,
   bar.area.feature.no = bar.area.feature.no,
   heatmap.feature.no = heatmap.feature.no,
   dotplot.feature.no = dotplot.feature.no,
   feature.mt.method = feature.mt.method,
   feature.sig.level = feature.sig.level,
   theme.choice = theme.choice,
   base.size = base.size,
   output.file = output.file
 )

My Environment Information

R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.2

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.3      forcats_1.0.0        stringr_1.5.1        dplyr_1.1.4          purrr_1.0.2          readr_2.1.5         
 [7] tidyr_1.3.1          ggplot2_3.5.1        tidyverse_2.0.0      MicrobiomeStat_1.1.3 tibble_3.2.1         rlang_1.1.3         

loaded via a namespace (and not attached):
 [1] utf8_1.2.4          generics_0.1.3      stringi_1.8.3       lattice_0.22-6      hms_1.1.3           statip_0.2.3       
 [7] lme4_1.1-35.1       magrittr_2.0.3      timechange_0.3.0    grid_4.3.3          iterators_1.0.14    foreach_1.5.2      
[13] Matrix_1.6-5        modeest_2.4.0       fansi_1.0.6         scales_1.3.0        stabledist_0.7-1    codetools_0.2-19   
[19] numDeriv_2016.8-1.1 cli_3.6.2           timeSeries_4032.109 munsell_0.5.0       splines_4.3.3       withr_3.0.0        
[25] tools_4.3.3         rmutil_1.1.10       parallel_4.3.3      tzdb_0.4.0          nloptr_2.0.3        stable_1.1.6       
[31] minqa_1.2.6         colorspace_2.1-0    boot_1.3-30         rpart_4.1.23        vctrs_0.6.5         R6_2.5.1           
[37] matrixStats_1.2.0   lifecycle_1.0.4     clue_0.3-65         MASS_7.3-60.0.1     cluster_2.1.6       pkgconfig_2.0.3    
[43] pillar_1.9.0        lmerTest_3.1-3      gtable_0.3.4        glue_1.7.0          Rcpp_1.0.12         xfun_0.43          
[49] tidyselect_1.2.1    rstudioapi_0.16.0   knitr_1.45          spatial_7.3-17      nlme_3.1-164        fBasics_4032.96    
[55] timeDate_4032.109   compiler_4.3.3     

error while running the generate_beta_trend_test_long function

Describe the Bug
I encountered an error while running the generate_beta_trend_test_long function with my dataset. The error message is as follows:

error: number of observations (=108) <= number of random effects (=108) for term (1 + visitWeek | subjectID); the random-effects parameters and the residual variance (or scale parameter) are probably unidentifiable

Data Background

My dataset consists of two groups, each with approximately 30 samples and data collected at 3 different time points.

I used the function with the following parameters:

res = generate_beta_trend_test_long(
  data.obj = ms.obj,
  subject.var = "subjectID",
  time.var = "visitWeek",
  group.var = "group1",
  dist.name = "BC",
  adj.vars = NULL
)

Additional Information

Validation checks passed without issues.
The time.var variable (visitWeek) is coded as numeric.
Data components (data.obj, meta.dat, feature.tab, feature.ann) conform to base R data structures as required.

Problem Explanation

The error suggests that the number of observations (108) equals the number of random effects (108) specified for the term (1 + visitWeek | subjectID). This results in unidentifiability of random-effects parameters and residual variance.

Request

I would appreciate guidance on resolving this issue or recommendations on adjusting the model specification to ensure proper identification of random effects.

Datatype of feature.tab not matrix after normalization

Describe the Bug
When I use mStat_normalize_data() with the Rarefy-TSS method, the mStat_validate_data() function no longer passes because it doesn't recognize feature.tab as a matrix (Rule 5). When I don't use mStat_normalize_data(), all the tests pass.

Example

The following code fails at step 5 (Rule 5 failed: feature.tab should be a matrix.)

MicrobiomeData <- list(feature.tab = otu_table_matrix, 
                       meta.dat = metadata_df, 
                       feature.ann = taxonomy_matrix)

#normalize the data using rarefaction and total sum scaling
MicrobiomeData <- mStat_normalize_data(data.obj = MicrobiomeData, method = "Rarefy-TSS")
MicrobiomeData$data.obj.norm$feature.tab <- as.matrix(MicrobiomeData$data.obj.norm$feature.tab)

mStat_validate_data(MicrobiomeData)

However, the following code passes all validations. Furthermore, when I rarefy the data with mStat_rarefy_data(data.obj = MicrobiomeData) prior to validation, all validations pass.

MicrobiomeData <- list(feature.tab = otu_table_matrix, 
                       meta.dat = metadata_df, 
                       feature.ann = taxonomy_matrix)

#normalize the data using rarefaction and total sum scaling
#MicrobiomeData <- mStat_normalize_data(data.obj = MicrobiomeData, method = "Rarefy-TSS")
#MicrobiomeData$data.obj.norm$feature.tab <- as.matrix(MicrobiomeData$data.obj.norm$feature.tab)

mStat_validate_data(MicrobiomeData)

Environment Information:

  • R Version: 4.2.3
  • Package Version: 1.1.3

Example "Inspecting Paired Samples: Taxa Analysis" results in GUniFrac error

Hello,

Thanks to the team for this package, the help/wiki is very straightforward, easy to follow and clear.

I ran into an error trying to reproduce the tutorial "Paired Samples Analysis", specifically the part "Inspecting Paired Samples: Taxa Analysis".
The generate_taxa_change_test_pair code

generate_taxa_change_test_pair(
  data.obj = peerj32.obj,
  subject.var = "subject",
  time.var = "time",
  group.var = "group",
  adj.vars = c("sex"),
  change.base = "1",
  change.func = "lfc",
  feature.level = "original",
  prev.filter = 0.01,
  abund.filter = 0.01,
  feature.dat.type = "count"
)

resulted in the following error:

Error in GUniFrac::ZicoSeq(meta.dat = sorted_unique_meta_tab, feature.dat = na.omit(value_diff_matrix),  :
  unused argument (change.func = "lfc")

Not sure where this comes from, here is my session info:

R version 4.3.1 (2023-06-16)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS/LAPACK: /home/sciliciumtheo/miniconda3/envs/r_microbiomestats/lib/libopenblasp-r0.3.23.so;  LAPACK version 3.11.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] vegan_2.6-4          lattice_0.21-8       permute_0.9-7
[4] MicrobiomeStat_1.1.1 tibble_3.2.1         rlang_1.1.1

loaded via a namespace (and not attached):
  [1] remotes_2.4.2.1     magrittr_2.0.3      clue_0.3-64
  [4] matrixStats_1.0.0   compiler_4.3.1      mgcv_1.9-0
  [7] systemfonts_1.0.4   callr_3.7.3         vctrs_0.6.3
 [10] rmutil_1.1.10       stringr_1.5.0       profvis_0.3.8
 [13] pkgconfig_2.0.3     crayon_1.5.2        fastmap_1.1.1
 [16] backports_1.4.1     ellipsis_0.3.2      labeling_0.4.3
 [19] pander_0.6.5        utf8_1.2.3          modeest_2.4.0
 [22] promises_1.2.1      rmarkdown_2.24      sessioninfo_1.2.2
 [25] ps_1.7.5            nloptr_2.0.3        ragg_1.2.5
 [28] purrr_1.0.2         xfun_0.40           cachem_1.0.8
 [31] highr_0.10          later_1.3.1         broom_1.0.5
 [34] parallel_4.3.1      prettyunits_1.1.1   cluster_2.1.4
 [37] R6_2.5.1            RColorBrewer_1.1-3  stringi_1.7.12
 [40] boot_1.3-28.1       pkgload_1.3.2.1     rpart_4.1.19
 [43] numDeriv_2016.8-1.1 Rcpp_1.0.11         iterators_1.0.14
 [46] knitr_1.43          usethis_2.2.2       httpuv_1.6.11
 [49] Matrix_1.6-1        splines_4.3.1       tidyselect_1.2.0
 [52] yaml_2.3.7          timeDate_4022.108   codetools_0.2-19
 [55] miniUI_0.1.1.1      curl_5.0.2          processx_3.8.2
 [58] pkgbuild_1.4.2      lmerTest_3.1-3      shiny_1.7.5
 [61] withr_2.5.0         evaluate_0.21       stable_1.1.6
 [64] desc_1.4.2          urlchecker_1.0.1    pillar_1.9.0
 [67] foreach_1.5.2       generics_0.1.3      rprojroot_2.0.3
 [70] ggplot2_3.4.3       munsell_0.5.0       scales_1.2.1
 [73] minqa_1.2.5         timeSeries_4031.107 xtable_1.8-4
 [76] glue_1.6.2          statip_0.2.3        pheatmap_1.0.12
 [79] tools_4.3.1         lme4_1.1-34         spatial_7.3-17
 [82] fBasics_4022.94     fs_1.6.3            grid_4.3.1
 [85] ape_5.7-1           tidyr_1.3.0         devtools_2.4.5
 [88] colorspace_2.1-0    nlme_3.1-163        cli_3.6.1
 [91] textshaping_0.3.6   fansi_1.0.4         dplyr_1.1.2
 [94] gtable_0.3.4        ggh4x_0.2.6         stabledist_0.7-1
 [97] digest_0.6.33       ggrepel_0.9.3       htmlwidgets_1.6.2
[100] farver_2.1.1        memoise_2.0.1       htmltools_0.5.6
[103] lifecycle_1.0.3     statmod_1.5.0       mime_0.12
[106] GUniFrac_1.7        MASS_7.3-60

generate_alpha_trend_test_long giving fatal error, aborting R session

Thank you for this great package for longitudinal microbiome analyses. Every time I try to run the generate_alpha_trend_test_long function my R is crashing. it says "R encountered a fatal error, The session was terminated". I have tried both with my data and with the example data (subset_T2D), and it crashes on that function each time.

Code:
data("subset_T2D.obj")
alpha_trend_test_results <- generate_alpha_trend_test_long(
data.obj = subset_T2D.obj,
alpha.name = c("shannon","observed_species"),
time.var = "visit_number_num",
subject.var = "subject_id",
group.var = "subject_race",
adj.vars = NULL
)

and also my data:
alpha_trend_test_results <- generate_alpha_trend_test_long(
data.obj = mbstat.obj,
alpha.name = c("shannon","simpson","observed_species"),
time.var = "time.num",
subject.var = "ID",
group.var = "Study.Group",
adj.vars = NULL
)
Screenshot 2024-02-08 145801

I have tried to use debug() and find the spot in the code where it is crashing and it seems to be after the alpha_df line. I have tried this many times and can get all other functions in the alpha diversity longitudinal analysis to run except for this one.

  • Operating System: Windows 11
  • R Version: 4.3.1
  • Package Version: 1.1.3

generate_beta_pc_boxplot_long not plotting all paired lines

Hi Chen,

This is my plot, I have paired data inputted. But it seems to not plot paired lines for all. I have tried it with peerj32.obj and it worked fine. Wondering if you can help?

Thanks
Kenny

generate_beta_pc_boxplot_long(

  • data.obj = object,
  • dist.obj = NULL,
  • pc.obj = NULL,
  • subject.var = "ID",
  • time.var = "Tech",
  • group.var = "Sample_type",
  • t0.level = "1",
  • ts.levels = "2",
  • strata.var = NULL,
  • dist.name = c('BC'),
  • base.size = 20,
  • theme.choice = "bw",
  • custom.theme = NULL,
  • palette = NULL,
  • pdf = FALSE,
  • file.ann = NULL,
  • pdf.wid = 11,
  • pdf.hei = 8.5
  • )
    Using table 'meta.dat' in 'data.obj'.
    Using table 'feature.tab' in 'data.obj'.
    Using table 'feature.ann' in 'data.obj'.
    Using table 'meta.dat' in 'data.obj'.
    Initializing distance objects...
    Calculating Bray-Curtis dissimilarity...
    All calculations complete.
    No pc.obj provided, using MDS (PCoA) for dimension reduction by default.
    If you prefer other methods such as NMDS, t-SNE or UMAP, you can use the mStat_calculate_PC function with a specified method.
    Calculating PC...
    Processing BC distance...
    Calculating MDS...
    Calculation complete.

Rplot

Create Graphs/Analyses without Performing Rarefaction

Hello,
I acknowledge that I have submitted several posts to this GitHub Discussions in the past few weeks, and I always appreciate your prompt and helpful replies! This package and your replies have been incredibly helpful for my analysis.

Feedback
I am analyzing presence/absence microbiome data, so rarefaction is not appropriate for my project. However, I noticed that rarefaction automatically occurs when I run any analyses or create graphs. Is there a way to turn off the automatic rarefaction? If not, adding that option would be incredibly useful.

calculating alpha diversity pr feature, not sample

Helllo,

I used the function "mSTAT_convert_phyloseq_to_data_obj" to convert from phyloseq to ms-object. This works well.
However, when I am exploring alpha diversity, by mStat_calculate_alpha_diveristy, it by defaults calcluate by feature, not sample.
Testing the other alpha_div functions doesnt work, and I am suspecting it to be because of this.

It becomes correct if I do: msdata_rare.alpha = mStat_calculate_alpha_diversity(t(msdata_rare$feature.tab), "shannon").

generate_alpha_test_pair: Error in t(do.call(sparseMatrix, do.call(rbind, lapply(seq_along(blist), : invalid 'lazy' to R_sparse_transpose()

Hi Chen,

I ran the codes below as per the tutorial. However I got this error "Error in t(do.call(sparseMatrix, do.call(rbind, lapply(seq_along(blist), : invalid 'lazy' to R_sparse_transpose()"

data(peerj32.obj)
alpha_test_results <- generate_alpha_test_pair(
data.obj = peerj32.obj,
alpha.obj = NULL,
time.var = "time",
alpha.name = c("shannon"),
subject.var = "subject",
group.var = "group",
adj.vars = "sex"
)

Error
Using table 'feature.tab' in 'data.obj'.
Using table 'feature.tab' in 'data.obj'.
Calculating shannon diversity...
Diversity calculations complete.
Using table 'meta.dat' in 'data.obj'.
Error in t(do.call(sparseMatrix, do.call(rbind, lapply(seq_along(blist), :
invalid 'lazy' to R_sparse_transpose()

Hope you can help

thanks
Kenny

Zero-handling Doesn't Work if a Sample is All Zero

I have some samples which have zero species of bacteria detected. For example, Oral 2.

Browse[1]> Y[1:5, 1:5]
                          Oral_1-N Oral_2-N Oral_3-N Oral_4-N Oral_5-N
Fusobacterium nucleatum       0.00        0    0.038    0.000     0.00
Cutibacterium acnes           0.00        0    0.113    0.102     0.69
Prevotella melaninogenica     0.00        0    0.000    0.000     0.00
Streptococcus mitis           0.67        0    0.000    0.013     0.20
Prevotella intermedia         0.00        0    0.000    0.000     0.00

The line temp <- t(t(Y) / colSums(Y)) causes problems soon after.

NA  features are filtered!
Error in if (any(colSums(Y) == 0)) { : missing value where TRUE/FALSE needed

I think it is biologically informative data and I wish to keep such samples for linear modelling. Can LinDA be modified to handle it? Or, should a dummy row named UNCLASSIFIED be present with 1.00 in it for samples without any species?

mStat_generate_report_single() does not work with compositional data

I tried to run the function mStat_generate_report_single() on an mStat object containing relative abundance data (feature.dat.type = "proportion").
mStat_generate_report_single() fails due to vegan needing integer data:

processing file: file197c2743955b.Rmd
  |.....                          |  16% [object-pre-calculation]
Quitting from lines 123-166 [object-pre-calculation] (file197c2743955b.Rmd)
Error in `vegan::rrarefy()`:
! function is meaningful only for integers (counts)
Backtrace:
 1. MicrobiomeStat::mStat_normalize_data(...)
 3. vegan::rrarefy(t(otu_tab), sample = rarefy_depth)

Error installing from github

Hi,

I have been trying to install the MetagenomStat package in rstudio and keep coming accross the following error:

devtools::install_github("cafferychen777/MicrobiomeStat")
Downloading GitHub repo cafferychen777/MicrobiomeStat@HEAD
── R CMD build ───────────────────────────────────────────────────────────────
✔ checking for file 'C:\Users\warrenf\AppData\Local\Temp\RtmpAZmkKW\remotes2ed41c6b2a8b\cafferychen777-MicrobiomeStat-4c387fb/DESCRIPTION' (513ms)
─ preparing 'MicrobiomeStat': (7s)
✔ checking DESCRIPTION meta-information ...
─ checking for LF line-endings in source and make files and shell scripts (1.3s)
─ checking for empty or unneeded directories
─ building 'MicrobiomeStat_1.1.3.tar.gz'

Installing package into ‘C:/Users/warrenf/AppData/Local/R/win-library/4.2’
(as ‘lib’ is unspecified)

  • installing source package 'MicrobiomeStat' ...
    ** using staged installation
    ** R
    ** data
    *** moving datasets to lazyload DB
    ** inst
    ** byte-compile and prepare package for lazy loading
    Note: wrong number of arguments to 'sqrt'
    ** help
    *** installing help indices
    *** copying figures
    ** building package indices
    ** testing if installed package can be loaded from temporary location
    Error: package or namespace load failed for 'MicrobiomeStat' in namespaceExport(ns, exports):
    undefined exports: generate_taxa_latticeplot_single
    Error: loading failed
    Execution halted
    ERROR: loading failed
  • removing 'C:/Users/warrenf/AppData/Local/R/win-library/4.2/MicrobiomeStat'

I am not sure what is causing this error, I have installed all fo the dependencies.

Thanks

Fred

microbiomestat ts.levels not plotting correctly in generate_taxa_areaplot_long

Hello,

Thank you for the great tool!

I am trying to visualise the change in taxas across multiple timepoints using generate_taxa_areaplot_long. However, in the visualisation, I am missing some (1)timepoints listed in ts.levels and (2) subject.var (please see screenshot included).

My initial timepoint is "0", and the following timepoints are "1", "2" , "3" , "5" ,"6" , "7", "8" , "10", "12", "18" ,"24". However not all timepoints are present in all samples, and I think this is where the problem lies. For example, one sample might have months 6, 10, 12, 18, 24 and another has 0, 6, 12, 18, etc.

Here are the binary breakdown of the presence/absence of timepoints in each sample_donation.
sample_donation - each sample which has repeated measurements
sample_id - each sample_donation comes from an individual; not balanced, some individuals may have more sample_donations

sample_donation sample 0 1 2 3 5 6 7 8 10 12 18 24
Individual_1 a 0 0 0 0 0 1 0 0 1 1 1 1
Individual_1 b 0 0 0 0 1 0 0 0 1 1 1 1
Individual_2 c 0 0 0 0 1 0 0 0 1 1 1 0
Individual_3 d 0 0 0 1 0 0 0 1 0 1 1 0
Individual_4 e 0 1 0 0 0 0 1 0 0 1 0 0
Individual_4 f 0 1 0 0 0 1 0 0 0 1 1 0
Individual_5 g 1 0 0 0 0 1 0 0 0 1 1 0
Individual_6 h 0 1 0 0 0 1 0 0 0 1 0 0
Individual_6 I 0 1 0 0 0 1 0 0 0 1 0 0
Individual_6 j 0 1 0 0 0 1 0 0 0 1 0 0
Individual_6 k 0 1 0 0 0 1 0 0 0 1 0 0
Individual_6 l 0 1 0 0 0 1 0 0 0 1 0 0
Individual_7 m 0 1 0 0 0 1 0 0 0 1 0 0
Individual_7 n 0 1 0 0 0 1 0 0 0 1 0 0
Individual_7 o 1 0 0 0 0 1 0 0 0 1 0 0
Individual_7 p 0 0 1 0 0 1 0 0 0 1 0 0

My code is as follows:

data.obj <- mStat_convert_phyloseq_to_data_obj(physeq)  #my microbiomestat data object converted from a phyloseq object

generate_taxa_areaplot_long(
  data.obj = data.obj,   #
  subject.var = "sample_donation",    #16 levels. each sample_donation has repeating measurements from different months
  time.var ="time",   #numeric, time in months
  group.var = "sample_donation",    #16 levels. each sample_donation has repeating measurements from different months
  strata.var = "sample_id",   #7 levels, each sample_id may have multiple sample_donations (basically sample_donations is nested within sample_id)
  feature.level = "Genus",
  feature.dat.type = "proportion",  
  feature.number = 20,
  t0.level = "0",
  ts.levels = c("1",  "2" , "3" , "5"  ,"6" , "7",  "8" , "10", "12", "18" ,"24"),
  base.size = 12,
  theme.choice = "bw",
  palette = NULL,
  pdf = TRUE,
  file.ann = NULL
)

I expected an areaplot with all the different time (months) on the x-axis for each subject.var

Instead, I see only times 10. 12, 18, or 24 shown. Additionally, only the first 6 sample_donations can be seen. The others are blank.

test

Attempted Solutions
I have tried to enter the values into a vector and use ts.levels = later_tp, but it did not work.
later_tp <- c("1", "2" , "3" , "5" ,"6" , "7", "8" , "10", "12", "18" ,"24")

I have tried to convert the values to numeric, but that also did not work (c(1, 2 , 3 , 5 ,6 , 7, 8, 10, 12, 18 ,24) )

I tried to save the image wider,as I thought there was not enough space.

If you have any suggestions, please let me know. Thank you so much (:

Wishing you a wonderful 2024 !

Format time.var in generate_alpha_trend_test_long() function

Dear Chen Yang,

I have a question regarding the use of the generate_alpha_trend_test_long() function. I am working on the microbiome communities of three coral species. Individual colonies of each species were monitored over a three year period. I would like to test for differences in alpha diversity across timepoints.

Here is the structure of my data object:
Capture d’écran 2024-02-29 à 11 05 29

I have precalculated alpha diversity indices and passed them to the alpha.obj parameter:
Capture d’écran 2024-02-29 à 11 12 09

I would like to use my column “Date_format” as time.var since it includes information on both year and month but I am not sure which format I should use. I tried to make it numeric using:

data.obj$meta.dat$Date_format<-as.numeric(data.obj$meta.dat$Date_format)
Warning message:
NAs introduced by coercion

All rows were replaced by NAs

I did the same using my column “m_format” just as a test. And got the following output:
Capture d’écran 2024-02-29 à 11 04 05

My questions is thus:
How should I write my year + month factor so it can be used as time.var in the generate_alpha_trend_test_long() function?

Thanks for your help!
Gaëlle

Problems with MicrobiomeStat.rdb

Dear Chen Yang,
Thank You for creating the MicrobiomeStat tools. I would really like to process my data using your package and generate_beta_change_test_long() , but I am facing a problem Warning: restarting interrupted promise evaluationWarning: internal error -3 in R_decompress1Error in FUN(X[[i]], ...) :
The lazy-load database '/R/x86_64-pc-linux-gnu-library/4.2/MicrobiomeStat/R/MicrobiomeStat.rdb' is corrupted.
My code looks like:
image
Downloaded using devtools::install_github("cafferychen777/MicrobiomeStat")
Many thanks in advance,
Polina

Level of factor missing in output generate_alpha_trend_test_long() function.

Dear Chen Yang,

Following issue #35, I have converted my Date_format column to a factor to be used as time.var in the generate_alpha_trend_test_long() function and it allowed for the function to run.

Here is the output I got:
Capture d’écran 2024-03-01 à 17 19 02

My concern is that the third species, M_exa does not appear in the output table and I am not sure why as it is definitely present in the data.obj:
Capture d’écran 2024-03-01 à 17 19 11

I have tried running the function again after converting Sp_field and Colony_ID as factors and got completely different results.

My two questions would then be:
-Should Sp_field and Colony_ID also be converted as factors?
-Why is there a missing level of Sp_field in the output?

Sorry if this all sounds trivial. I am used to phyloseq so all this a new to me. I really want to use your package because I am interested in looking at the microbiome dynamics across time series and MicrobiomeStat seems the best choice for that.

Many thanks in advance,
Gaëlle

Error in midpoint(data.obj$tree) : could not find function "midpoint"

Dear developer,

Congratulations for this great tool.
I have a problem while trying to convert my phyloseq object to microbiomeStat object.
It seems it does not found midpoint function.

I installed all the dependencies following the installation guide. Then, I execute the code below.
Do you have any clue of what is going on here?

Thank you for your time!

library(MicrobiomeStat)
library(tidyverse)
library(phyloseq)

Required package

library(microbiome)
library(ape)

setwd("/Users/virginia/Dropbox/rrr/emu_phyloseq/") # mac

################################################################################

load Phyloseq object

################################################################################
ps <- readRDS(file= "./phylo_obj/ps.rds")

Check if taxa are rows in the Phyloseq object

phyloseq::taxa_are_rows(ps)
str(ps)

Convert the Phyloseq object to a MicrobiomeStat data object

data.obj <- mStat_convert_phyloseq_to_data_obj(ps)

output:

``

phyloseq::taxa_are_rows(ps)
[1] TRUE
str(ps)
Formal class 'phyloseq' [package "phyloseq"] with 5 slots
..@ otu_table:Formal class 'otu_table' [package "phyloseq"] with 2 slots
.. .. ..@ .Data : num [1:275, 1:24] 0 0 14 0 0 0 0 0 0 0 ...
.. .. .. ..- attr(, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:275] "100176" "1042156" "105841" "106588" ...
.. .. .. .. ..$ : chr [1:24] "AC.16" "TC.16" "DC.16" "AC.18" ...
.. .. ..@ taxa_are_rows: logi TRUE
.. .. ..$ dim : int [1:2] 275 24
.. .. ..$ dimnames:List of 2
.. .. .. ..$ : chr [1:275] "100176" "1042156" "105841" "106588" ...
.. .. .. ..$ : chr [1:24] "AC.16" "TC.16" "DC.16" "AC.18" ...
..@ tax_table:Formal class 'taxonomyTable' [package "phyloseq"] with 1 slot
.. .. ..@ .Data: chr [1:275, 1:7] "Bacteria" "Bacteria" "Bacteria" "Bacteria" ...
.. .. .. ..- attr(
, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:275] "100176" "1042156" "105841" "106588" ...
.. .. .. .. ..$ : chr [1:7] "Kingdom" "Phylum" "Class" "Order" ...
.. .. ..$ dim : int [1:2] 275 7
.. .. ..$ dimnames:List of 2
.. .. .. ..$ : chr [1:275] "100176" "1042156" "105841" "106588" ...
.. .. .. ..$ : chr [1:7] "Kingdom" "Phylum" "Class" "Order" ...
..@ sam_data :'data.frame': 24 obs. of 6 variables:
Formal class 'sample_data' [package "phyloseq"] with 4 slots
.. .. ..@ .Data :List of 6
.. .. .. ..$ : Factor w/ 3 levels "AC","TC","DC": 1 2 3 1 2 3 1 2 3 1 ...
.. .. .. ..$ : num [1:24] 16 16 16 18 18 18 1 1 1 2 ...
.. .. .. ..$ : chr [1:24] "AC.16" "TC.16" "DC.16" "AC.18" ...
.. .. .. ..$ : Factor w/ 3 levels "stage1","stage2",..: 3 3 3 3 3 3 1 1 1 1 ...
.. .. .. ..$ : Factor w/ 2 levels "distal","proximal": 2 1 1 2 1 1 2 1 1 2 ...
.. .. .. ..$ : chr [1:24] "1" "2" "3" "4" ...
.. .. ..@ names : chr [1:6] "Compartment" "Day" "label" "Stage" ...
.. .. ..@ row.names: chr [1:24] "AC.16" "TC.16" "DC.16" "AC.18" ...
.. .. ..@ .S3Class : chr "data.frame"
..@ phy_tree :List of 4
.. ..$ edge : int [1:547, 1:2] 276 276 276 277 278 296 298 341 341 298 ...
.. ..$ edge.length: num [1:547] 0.0811 0.1241 0.0112 0.0166 0.0134 ...
.. ..$ tip.label : chr [1:275] "100176" "1042156" "105841" "106588" ...
.. ..$ Nnode : int 273
.. ..- attr(, "class")= chr "phylo"
.. ..- attr(
, "order")= chr "cladewise"
..@ refseq : NULL

Convert the Phyloseq object to a MicrobiomeStat data object

data.obj <- mStat_convert_phyloseq_to_data_obj(ps)
Root the tree by midpointing ...
Error in midpoint(data.obj$tree) : could not find function "midpoint"
``

generate_alpha_change_test_pair(): Error in `dplyr::group_by()`: ! Must group by variables found in `.data`. ✖ Column `time` is not found.

This is my script:

generate_alpha_change_test_pair(
  data.obj = MicrobiomeData,
  alpha.obj = NULL,
  alpha.name = c("shannon", "simpson", "observed_species", "chao1", "ace", "pielou"),
  time.var = "TimePoint",
  subject.var = "Individual",
  group.var = "Treatment",
  adj.vars = NULL,
  change.base = 1,
  alpha.change.func = "absolute change"
)

My time.var is "TimePoint". I got a error show that "Column time is not found."
I have check the script of generate_alpha_change_test_pair() function, and I see the variable time first appears on line 30; before that, it was not defined (time.var was not assigned to it).

qiime2 style metadata import error

When importing a qiime2 style metadata file, the function mStat_import_qiime2_as_data_obj() does not remove the annotation row "q2:types" specifing the data type

data.obj <- mStat_import_qiime2_as_data_obj(
otu_qza = "qiime2_mp_data/rarefied_table.qza",
taxa_qza = "qiime2_mp_data/taxonomy.qza",
sam_tab = "qiime2_mp_data/sample_metadata.tsv",
tree_qza = "qiime2_mp_data/rooted-tree.qza"
)

head(data.obj$meta.dat)
sample.id barcode.sequence body.site year month day subject reported.antibiotic.usage days.since.experiment.start
#q2:types #q2:types categorical categorical numeric numeric numeric categorical categorical numeric
L1S8 L1S8 AGCTGACTAGTC gut 2008 10 28 subject-1 Yes 0
L1S57 L1S57 ACACACTATGGC gut 2009 1 20 subject-1 No 84
L1S76 L1S76 ACTACGTGTGGT gut 2009 2 17 subject-1 No 112
L1S105 L1S105 AGTGCGATGCGT gut 2009 3 17 subject-1 No 140
L2S155 L2S155 ACGATGCGACCA left palm 2009 1 20 subject-1 No 84

Unspecified dependency: ggsci

I installed MicrobiomeStat through devtools.

The example for function generate_taxa_barplot_pair

 generate_taxa_barplot_pair(
  data.obj = peerj32.obj,
  subject.var = "subject",
  time.var = "time",
  group.var = "group",
  strata.var = NULL,
  feature.level = "Family",
  feature.dat.type = c("count"),
  feature.number = 8,
  base.size = 10,
  theme.choice = "bw",
  custom.theme = NULL,
  palette = ggsci::pal_npg()(9),
  pdf = TRUE,
  file.ann = NULL,
  pdf.wid = 11,
  pdf.hei = 8.5
)

failed with the following error:
Error in loadNamespace(x) : there is no package called ‘ggsci’

Installing ggsci solved the issue.

I suggest that you add ggsci to the list of dependencies on https://www.microbiomestat.wiki/setting-up-microbiomestat-installation-and-data-preparation/kick-start-your-journey-the-microbiomestat-installation-guide

Feasibility of Interaction Specification Using generate_taxa_test Functions

Let's say I have a clinical data set with Age (Young or Old) and Gender (Male or Female) factors. Can the Age*Gender interaction be specified to generate_taxa_test_single? Or is it impossible and a model formula should be written for linda function? Our idea is that Young Female patients have a special cancer microbiome that neither the Young (overall) nor Female (overall) group has.

Unable to install zipped-version

Hi,
thank you for your wonderful package.

I want to install the github version of it inside TSD. I then download the zip file from github, import it into TSD and there install it. However, when I try to install it I get the error message:

Warning messages:

  1. In loadNamespace(package, ..) : package "microbiomeStat" has no "package.rd" in Meta/
  2. S3 methods "as.grob.aplot", "is.na.null" were declared in NAMESPACE but not found

Any workaround this?
Thanks again,
Maria

mStat_generate_report_single(): error in linda() call

mStat_generate_report_single() fails due invalid call to linda() in MicrobiomeStat_1.1.3:

processing file: file10828c3bf4e.Rmd
  |........................       |  78% [taxa-test-execution]
Quitting from lines 615-626 [taxa-test-execution] (file10828c3bf4e.Rmd)
Error in `linda()`:
! unused arguments (feature.sig.level = 0.2, feature.mt.method = "fdr")
Backtrace:
 1. MicrobiomeStat::generate_taxa_test_single(...)
 2. base::lapply(...)
 3. MicrobiomeStat (local) FUN(X[[i]], ...)

When I replace linda function with a modified version accepting accessory arguments (this code is a copy of https://github.com/cafferychen777/MicrobiomeStat/blob/main/R/linda.R, only difference with the current linda() version is the accessory argument ... after verbose = TRUEin function call), mStat_generate_report_single() runs normally until the end.

Modified linda version:

linda_modified <- function(feature.dat, meta.dat, phyloseq.obj = NULL, formula, feature.dat.type = c("count", "proportion","other"),
                  prev.filter = 0, mean.abund.filter = 0, max.abund.filter = 0,
                  is.winsor = TRUE, outlier.pct = 0.03,
                  adaptive = TRUE, zero.handling = c("pseudo-count", "imputation"),
                  pseudo.cnt = 0.5, corr.cut = 0.1,
                  p.adj.method = "BH", alpha = 0.05,
                  n.cores = 1, verbose = TRUE, ...) {
  ############################################
  # only difference with linda() is the accessory argument `...` after verbose = TRUE
  ############################################
  feature.dat.type <- match.arg(feature.dat.type)

  if (!is.null(phyloseq.obj)) {
    feature.dat.type <- "count"

    feature.dat <- phyloseq.obj@otu_table %>%
      as.data.frame() %>%
      as.matrix()

    meta.dat <- phyloseq.obj@sam_data %>% as.matrix() %>%
      as.data.frame()
  }

  if (any(is.na(feature.dat))) {
    stop(
      "The feature table contains NA values. Please remove or handle them before proceeding.\n"
    )
  }

  allvars <- all.vars(as.formula(formula))
  Z <- as.data.frame(meta.dat[, allvars])

  ###############################################################################
  # Filter sample
  keep.sam <- which(rowSums(is.na(Z)) == 0)
  Y <- feature.dat[, keep.sam]
  Z <- as.data.frame(Z[keep.sam, ])
  names(Z) <- allvars

  # Filter features
  temp <- t(t(Y) / colSums(Y))

  if (feature.dat.type == "other" & (max.abund.filter != 0 | mean.abund.filter != 0 | prev.filter != 0 )){
    message("Note: Since feature.dat.type is set to 'other', all filters (max.abund.filter, mean.abund.filter, and prev.filter) are reset to 0.")
    max.abund.filter <- 0
    mean.abund.filter <- 0
    prev.filter <- 0
  }

  keep.tax <- rowMeans(temp != 0) >= prev.filter & rowMeans(temp) >= mean.abund.filter & matrixStats::rowMaxs(temp) >= max.abund.filter
  names(keep.tax) <- rownames(Y)
  rm(temp)
  if (verbose) {
    message(
      sum(!keep.tax), " features are filtered!\n"
    )
  }
  Y <- Y[keep.tax, ]

  n <- ncol(Y)
  m <- nrow(Y)

  ## some samples may have zero total counts after screening taxa
  if (any(colSums(Y) == 0)) {
    ind <- which(colSums(Y) > 0)
    Y <- Y[, ind]
    Z <- as.data.frame(Z[ind, ])
    names(Z) <- allvars
    keep.sam <- keep.sam[ind]
    n <- ncol(Y)
  }

  if (verbose) {
    message(
      "The filtered data has ", n, " samples and ", m, " features that will be tested!\n"
    )
  }

  if (sum(rowSums(Y != 0) <= 2) != 0) {
    warning(
      "Some features have less than 3 nonzero values!\n",
      "They have virtually no statistical power. You may consider filtering them in the analysis!\n"
    )
  }

  ###############################################################################
  ## scaling numerical variables
  ind <- sapply(1:ncol(Z), function(i) is.numeric(Z[, i]))
  Z[, ind] <- scale(Z[, ind])

  ## winsorization
  if (is.winsor) {
    Y <- winsor.fun(Y, 1 - outlier.pct, feature.dat.type)
  }

  ##
  if (grepl("\\(", formula)) {
    random.effect <- TRUE
  } else {
    random.effect <- FALSE
  }

  if (is.null(rownames(feature.dat))) {
    taxa.name <- (1:nrow(feature.dat))[keep.tax]
  } else {
    taxa.name <- rownames(feature.dat)[keep.tax]
  }
  if (is.null(rownames(meta.dat))) {
    samp.name <- (1:nrow(meta.dat))[keep.sam]
  } else {
    samp.name <- rownames(meta.dat)[keep.sam]
  }

  ## handling zeros
  if (feature.dat.type == "count") {
    if (any(Y == 0)) {
      N <- colSums(Y)
      if (adaptive) {
        logN <- log(N)
        if (random.effect) {
          tmp <- lmer(as.formula(paste0("logN", formula)), Z)
        } else {
          tmp <- lm(as.formula(paste0("logN", formula)), Z)
        }
        corr.pval <- coef(summary(tmp))[-1, "Pr(>|t|)"]
        if (any(corr.pval <= corr.cut)) {
          if (verbose) {
            message("Imputation approach is used.")
          }
          zero.handling <- "Imputation"
        } else {
          if (verbose) {
            message("Pseudo-count approach is used.")
          }
          zero.handling <- "Pseudo-count"
        }
      }
      if (zero.handling == "imputation") {
        N.mat <- matrix(rep(N, m), nrow = m, byrow = TRUE)
        N.mat[Y > 0] <- 0
        tmp <- N[max.col(N.mat)]
        Y <- Y + N.mat / tmp
      } else {
        Y <- Y + pseudo.cnt
      }
    }
  }

  if (feature.dat.type == "proportion") {
    if (any(Y == 0)) {
      # Half minimum approach

      Y <- t(apply(Y, 1, function(x) {
        x[x == 0] <- 0.5 * min(x[x != 0])
        return(x)
      }))
      colnames(Y) <- samp.name
      rownames(Y) <- taxa.name
    }

  }


  ## CLR transformation
  logY <- log2(Y)
  W <- t(logY) - colMeans(logY)

  ## linear regression

  # 	oldw <- getOption('warn')
  # 	options(warn = -1)
  if (!random.effect) {
    if (verbose) {
      message("Fit linear models ...")
    }
    suppressMessages(fit <- lm(as.formula(paste0("W", formula)), Z))
    res <- do.call(rbind, coef(summary(fit)))
    df <- rep(n - ncol(model.matrix(fit)), m)
  } else {
    if (verbose) {
      message("Fit linear mixed effects models ...")
    }
    fun <- function(i) {
      w <- W[, i]
      suppressMessages(fit <- lmer(as.formula(paste0("w", formula)), Z))
      coef(summary(fit))
    }
    if (n.cores > 1) {
      res <- mclapply(c(1:m), function(i) fun(i), mc.cores = n.cores)
    } else {
      suppressMessages(res <- foreach(i = 1:m) %do% fun(i))
    }
    res <- do.call(rbind, res)
  }
  # 	options(warn = oldw)

  res.intc <- res[which(rownames(res) == "(Intercept)"), ]
  rownames(res.intc) <- NULL
  baseMean <- 2^res.intc[, 1]
  baseMean <- baseMean / sum(baseMean) * 1e6

  output.fun <- function(x) {
    res.voi <- res[which(rownames(res) == x), ]
    rownames(res.voi) <- NULL

    if (random.effect) {
      df <- res.voi[, 3]
    }

    log2FoldChange <- res.voi[, 1]
    lfcSE <- res.voi[, 2]
    # 		oldw <- getOption('warn')
    # 		options(warn = -1)
    suppressMessages(bias <- modeest::mlv(sqrt(n) * log2FoldChange,
      method = "meanshift", kernel = "gaussian"
    ) / sqrt(n))
    # 		options(warn = oldw)
    log2FoldChange <- log2FoldChange - bias
    stat <- log2FoldChange / lfcSE

    pvalue <- 2 * pt(-abs(stat), df)
    padj <- p.adjust(pvalue, method = p.adj.method)
    reject <- padj <= alpha
    output <- cbind.data.frame(baseMean, log2FoldChange, lfcSE, stat, pvalue, padj, reject, df)
    rownames(output) <- taxa.name
    return(list(bias = bias, output = output))
  }

  variables <- unique(rownames(res))[-1]
  variables.n <- length(variables)
  bias <- rep(NA, variables.n)
  output <- list()
  for (i in 1:variables.n) {
    tmp <- output.fun(variables[i])
    output[[i]] <- tmp[[2]]
    bias[i] <- tmp[[1]]
  }
  names(output) <- variables

  rownames(Y) <- taxa.name
  colnames(Y) <- samp.name
  rownames(Z) <- samp.name
  if (verbose) {
    message("Completed.")
  }
  return(list(variables = variables, bias = bias, output = output, feature.dat.use = Y, meta.dat.use = Z))
}

Replacing linda version in the MicrobiomeStat namespace:

assignInNamespace("linda", linda_modified, ns="MicrobiomeStat")

I also need to define a winsor.fun in the current environment -this is just a copy of the winsor.fun found in https://github.com/cafferychen777/MicrobiomeStat/blob/main/R/linda.R-

winsor.fun <- function(Y, quan, feature.dat.type) {
  # 如果feature.dat.type是 "count"
  if (feature.dat.type == "count") {
    # 计算矩阵Y每列的和,保存在变量N中
    N <- colSums(Y)

    # 将Y矩阵中的每个元素除以其相应列的和,得到矩阵P
    # 对矩阵P进行两次转置以得到与Y相同的维度
    P <- t(t(Y) / N)

    # 对P矩阵的每行应用quantile函数,以quan为参数,得到一个向量cut
    cut <- apply(P, 1, quantile, quan)

    # 复制cut向量以创建一个与Y矩阵相同维度的矩阵Cut
    Cut <- matrix(rep(cut, ncol(Y)), nrow(Y))

    # 创建一个逻辑矩阵ind
    # 其中每个元素为TRUE,如果相应的P矩阵元素大于相应的Cut矩阵元素
    ind <- P > Cut

    # 将P矩阵中大于Cut矩阵中对应元素的值,用Cut矩阵中的对应元素替换
    P[ind] <- Cut[ind]

    # 四舍五入P矩阵中的每个元素,并将结果保存在Y矩阵中
    Y <- round(t(t(P) * N))
  }

  # 如果feature.dat.type是 "proportion"
  if (feature.dat.type == "proportion") {
    # 对Y矩阵的每行应用quantile函数,以quan为参数,得到一个向量cut
    cut <- apply(Y, 1, quantile, quan)

    # 复制cut向量以创建一个与Y矩阵相同维度的矩阵Cut
    Cut <- matrix(rep(cut, ncol(Y)), nrow(Y))

    # 创建一个逻辑矩阵ind
    # 其中每个元素为TRUE,如果相应的Y矩阵元素大于相应的Cut矩阵元素
    ind <- Y > Cut

    # 将Y矩阵中大于Cut矩阵中对应元素的值,用Cut矩阵中的对应元素替换
    Y[ind] <- Cut[ind]
  }

  # 返回修改后的Y矩阵
  return(Y)
}

sessionInfo():

R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 20.04.5 LTS

Matrix products: default
BLAS/LAPACK: /home/sciliciumtheo/miniconda3/envs/R_microbiomestat/lib/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0

locale:
 [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8
 [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C
[10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C

time zone: Europe/Paris
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] MicrobiomeStat_1.1.3 tibble_3.2.1         rlang_1.1.3
[4] microbiome_1.24.0    ggplot2_3.4.4        phyloseq_1.46.0

loaded via a namespace (and not attached):
  [1] bitops_1.0-7            inline_0.3.19           permute_0.9-7
  [4] magrittr_2.0.3          clue_0.3-65             ade4_1.7-22
  [7] matrixStats_1.2.0       compiler_4.3.2          mgcv_1.9-1
 [10] systemfonts_1.0.5       vctrs_0.6.5             reshape2_1.4.4
 [13] rmutil_1.1.10           stringr_1.5.1           pkgconfig_2.0.3
 [16] crayon_1.5.2            fastmap_1.1.1           XVector_0.42.0
 [19] labeling_0.4.3          pander_0.6.5            utf8_1.2.4
 [22] modeest_2.4.0           rmarkdown_2.25          nloptr_2.0.3
 [25] ragg_1.2.7              tinytex_0.49            purrr_1.0.2
 [28] xfun_0.41               cachem_1.0.8            zlibbioc_1.48.0
 [31] aplot_0.2.2             GenomeInfoDb_1.38.1     jsonlite_1.8.8
 [34] biomformat_1.30.0       rhdf5filters_1.14.1     Rhdf5lib_1.24.0
 [37] parallel_4.3.2          cluster_2.1.6           R6_2.5.1
 [40] RColorBrewer_1.1-3      stringi_1.8.3           boot_1.3-28.1
 [43] rpart_4.1.23            numDeriv_2016.8-1.1     Rcpp_1.0.12
 [46] iterators_1.0.14        knitr_1.45              IRanges_2.36.0
 [49] Matrix_1.6-5            splines_4.3.2           igraph_1.6.0
 [52] tidyselect_1.2.0        yaml_2.3.8              vegan_2.6-4
 [55] timeDate_4032.109       codetools_0.2-19        lattice_0.22-5
 [58] lmerTest_3.1-3          plyr_1.8.9              Biobase_2.62.0
 [61] withr_3.0.0             evaluate_0.23           stable_1.1.6
 [64] Rtsne_0.17              gridGraphics_0.5-1      survival_3.5-7
 [67] Biostrings_2.70.1       pillar_1.9.0            foreach_1.5.2
 [70] stats4_4.3.2            ggfun_0.1.4             generics_0.1.3
 [73] RCurl_1.98-1.14         S4Vectors_0.40.2        munsell_0.5.0
 [76] scales_1.3.0            minqa_1.2.6             timeSeries_4032.109
 [79] glue_1.7.0              statip_0.2.3            pheatmap_1.0.12
 [82] tools_4.3.2             data.table_1.14.10      lme4_1.1-35.1
 [85] spatial_7.3-17          fBasics_4032.96         forcats_1.0.0
 [88] fs_1.6.3                rhdf5_2.46.1            grid_4.3.2
 [91] tidyr_1.3.1             ape_5.7-1               colorspace_2.1-0
 [94] patchwork_1.2.0         nlme_3.1-164            GenomeInfoDbData_1.2.11
 [97] cli_3.6.2               textshaping_0.3.7       fansi_1.0.6
[100] dplyr_1.1.4             ggh4x_0.2.8             gtable_0.3.4
[103] yulab.utils_0.1.4       stabledist_0.7-1        digest_0.6.34
[106] BiocGenerics_0.48.1     ggrepel_0.9.5           ggplotify_0.1.2
[109] farver_2.1.1            memoise_2.0.1           htmltools_0.5.7
[112] multtest_2.58.0         lifecycle_1.0.4         statmod_1.5.0
[115] GUniFrac_1.8            MASS_7.3-60

Weights for Samples of LinDA Modelling

I notice that the top-ranked bacteria are sometimes caused by outliers of linda analysis. This happens when some samples have five species and other species have fifty species detected, for instance. I would like linda to have an option for weighting of samples inversely to the number of species detected for each sample. lm has weights parameter. linda should also support weights.

mStat_generate_report_long() : Error in update_data_obj_count()

Dear Chen,

I hope this message finds you well.

I have encountered an issue while using the mStat_generate_report_long() function from the MicrobiomeStat package. Below is the script I am using, which is based on the tutorial script provided. Despite following the tutorial closely, I am receiving the following error message:

Error in update_data_obj_count():
! The provided count table must have row and column names.
Backtrace:
MicrobiomeStat::generate_taxa_areaplot_long(...)
MicrobiomeStat::mStat_normalize_data(data.obj, method = "TSS")
MicrobiomeStat:::update_data_obj_count(...)
Quitting from lines 195-215 [taxa-areaplot-longitudinal-generation]

Here is the script I am using:

setwd("/Users/tongfiles/Downloads/FLX micro")

library(MicrobiomeStat)
library(ape)

# Import the data
feature.tab <- read.csv("/Users/tongfiles/Downloads/FLX micro/otu_table.csv", header = TRUE, row.names = 1)
feature.tab <- feature.tab[1:10000,]
feature.tab[] <- lapply(feature.tab, function(x) as.numeric(as.character(x)))

meta.dat <- read.csv("/Users/tongfiles/Downloads/FLX micro/groupmap_FLX.csv", header = TRUE, row.names = 1)
feature.ann <- read.csv("/Users/tongfiles/Downloads/FLX micro/otu_taxa_table2.csv", header = TRUE, row.names = 1)
feature.ann <- feature.ann[1:10000,]

tree <- read.tree("/Users/tongfiles/Downloads/FLX micro/phylogeny.tre")

# Create data object
data.obj <- list(
  feature.tab = as.matrix(feature.tab),
  meta.dat = meta.dat,
  feature.ann = as.matrix(feature.ann),
  tree = tree
)

# Specify variable names
group.var = "Group"
subject.var = "Subject"
time.var = "Timepoint"
strata.var = NULL

# Specify diversity indices
alpha.name = c("shannon", "observed_species")
dist.name = c("BC", "Jaccard")

# Specify feature levels for visualization and testing
vis.feature.level = c("Genus")
test.feature.level = c("Genus")

# Specify other parameters
feature.dat.type = "count"
theme.choice = "bw"
base.size = 20
feature.mt.method = "none"
feature.sig.level = 0.3
feature.box.axis.transform = "sqrt"

# Specify output file
output.file = "Omics Analysis Report.pdf"

# Specify parameters for feature retention
bar.area.feature.no = 10
heatmap.feature.no = 40

# Specify optional parameters
dist.obj = NULL
alpha.obj = NULL
feature.change.func = "relative change"

# Run the function
mStat_generate_report_long(
  data.obj = data.obj,
  group.var = group.var,
  test.adj.vars = NULL,
  vis.adj.vars = NULL,
  strata.var = strata.var,
  subject.var = subject.var,
  time.var = time.var,
  t0.level = NULL,
  ts.levels = NULL,
  alpha.obj = alpha.obj,
  alpha.name = alpha.name,
  dist.obj = dist.obj,
  dist.name = dist.name,
  feature.change.func = feature.change.func,
  vis.feature.level = vis.feature.level,
  test.feature.level = test.feature.level,
  bar.area.feature.no = bar.area.feature.no,
  heatmap.feature.no = heatmap.feature.no,
  feature.dat.type = feature.dat.type,
  feature.mt.method = feature.mt.method,
  feature.sig.level = feature.sig.level,
  feature.box.axis.transform = feature.box.axis.transform,
  theme.choice = theme.choice,
  base.size = base.size,
  output.file = output.file
)

Here is the structure of my object:

```r
str(data.obj)
List of 4
 $ feature.tab: num [1:10000, 1:37] 24 0 14 44 38 51 134 4 50 9 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:10000] "OTU1" "OTU2" "OTU3" "OTU4" ...
  .. ..$ : chr [1:37] "pre_10" "pre_1" "pre_2" "pre_3" ...
 $ meta.dat   :'data.frame':	37 obs. of  3 variables:
  ..$ Subject  : chr [1:37] "m10" "m1" "m2" "m3" ...
  ..$ Group    : chr [1:37] "FLX" "FLX" "FLX" "control" ...
  ..$ Timepoint: chr [1:37] "baseline" "baseline" "baseline" "baseline" ...
 $ feature.ann: chr [1:10000, 1:7] "d__Bacteria" "d__Bacteria" "d__Bacteria" "d__Bacteria" ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:10000] "OTU1" "OTU2" "OTU3" "OTU4" ...
  .. ..$ : chr [1:7] "Kingdom" "Phylum" "Class" "Order" ...
 $ tree       :List of 5
  ..$ edge       : int [1:77485, 1:2] 38754 38755 38755 38754 38756 38756 38754 38757 38758 38758 ...
  ..$ edge.length: num [1:77485] 0.00011 0.00171 0.00085 0.00014 0.00428 0.00428 0.00014 0.00014 0.00343 0.00086 ...
  ..$ Nnode      : int 38733
  ..$ node.label : chr [1:38733] "" "0.902" "0.978" "0.929" ...
  ..$ tip.label  : chr [1:38753] "OTU11301" "OTU7973" "OTU5811" "OTU2336" ...
  ..- attr(*, "class")= chr "phylo"
  ..- attr(*, "order")= chr "cladewise"

I have verified that my count table (feature.tab) and other data frames (meta.dat and feature.ann) have both row and column names. Here are the checks I performed to ensure that:

# Check for NA values
if (any(is.na(feature.tab))) {
  stop("feature.tab contains NA values.")
}

# Check if row names and column names are character strings
if (!all(sapply(rownames(feature.tab), is.character)) || !all(sapply(colnames(feature.tab), is.character))) {
  stop("Row names and column names of feature.tab must be character strings.")
}

# Print first few rows and columns to visually inspect the data
print(head(feature.tab))
print(colnames(feature.tab))
print(rownames(feature.tab))

All checks passed successfully, yet the issue persists. I would greatly appreciate any guidance or suggestions you may have to resolve this problem.

Thank you very much for your time and assistance.

Parameter Names Incorrect in mStat_import_biom_as_data_obj

Documentation Section
SETTING UP MICROBIOMESTAT: INSTALLATION AND DATA PREPARATION --> Creating the MicrobiomeStat Data Object --> Importing Data from BIOM into MicrobiomeStat
The specific page is linked here.

Issue with Current Documentation
On the webpage linked above, the parameter names for the mStat_import_biom_as_data_obj() function are listed as biom_filename and tree_filename. However, in R, they are BIOMfilename and treefilename.

Screenshot from R
image

Error "contrasts can be applied only to factors with 2 or more levels" while using generate_alpha_change_test_pair()

Hello,
First of all, thank you for your work in MicrobiomeStat package. I have successfully created the MicrobiomeStat data object importing several .qza from QIIME2. While using generate_alpha_change_test_pair() , the following error occurs: Error in contrasts<-(*tmp*, value = contr.funs[1 + isOF[nn]]) : contrasts can be applied only to factors with 2 or more levels. As far as I know this is caused by the presence of a column containing only one value (hence generate_alpha_change_test_pair() cannot fit the data into a linear regression model)

I have tried to troubleshoot the problem running : values_count <- sapply(lapply(data.obj, unique), length)
values_count to check for columns in my data object containing only 1 value but have not work so far.

This is the workflow I have used (I am afraid I cannot attach any .qza for you to reproduce the analysis, but If necessary I will be happy to provide it via email...)

Define paths to your QIIME2 files

otuqza_file <- "/path/merged-table.qza"
taxaqza_file <- "/path/taxonomy.qza"
sample_file <- "/path/metadata2.csv"
treeqza_file <- "/path/rooted-tree.qza"

Import QIIME2 data into a MicrobiomeStat data object

data.obj <- mStat_import_qiime2_as_data_obj(
otu_qza = otuqza_file,
taxa_qza = taxaqza_file,
sam_tab = sample_file,
tree_qza = treeqza_file
)

#Create alpha box plot
AlphaChangeTest <- generate_alpha_change_test_pair(
data.obj,
alpha.obj = NULL,
alpha.name = "shannon",
depth = NULL,
time.var = "time",
subject.var = "subject",
group.var = "group",
adj.vars = "Plastic.Type",
change.base ="1",
alpha.change.func = "log fold change"
)

Again, thank you very much for your work

Duplicate records used in the linear model within the "generate_beta_change_test_pair" function

Dear MicrobiomeStat team,

Thank you for developing this package and the examples, very useful!

Describe the Bug
Prior to my colleague discovering your package, we were analysing paired microbiome data. The approach we took to the comparison of groups for beta diversity was very similar to your "generate_beta_change_test_pair" function.

After discovering your package, we compared results and noticed some discrepancies. Looking at your function source code, it looks like the issue might be caused by the left join on line 132. It looks like this attempts to identify distinct metadata records for each subject (1 per subject) after excluding "time.var". However, without also excluding the "sample" variable created earlier, this retains records for both timepoints per subject. So the number of records when subsequently fitting the linear model is double what it should be.

Reproducible Example

### Load peerj32.obj data ###
library(microbiome)
library(MicrobiomeStat)
library(dplyr)
data(peerj32.obj)
data.obj <- peerj32.obj

### Define variables required in the "generate_beta_change_test_pair" function ###
time.var = "time"
subject.var = "subject"
group.var = "group"
adj.vars = c("sex")
change.base = "1"
dist.name = "BC"

### Run selected parts of the "generate_beta_change_test_pair" function code ###
dist.obj <-
  mStat_calculate_beta_diversity(data.obj = data.obj, dist.name = dist.name)
metadata <- data.obj$meta.dat %>% dplyr::select(all_of(c(subject.var,group.var,time.var, adj.vars))) %>% rownames_to_column("sample")

change.after <-
  unique(metadata %>% dplyr::select(all_of(c(time.var))))[unique(metadata %>% dplyr::select(all_of(c(time.var)))) != change.base]

dist.df <- as.matrix(dist.obj[[dist.name]]) %>%
  as.data.frame() %>%
  rownames_to_column("sample")

# The following has 1 row per subject, 22 rows total
long.df <- dist.df %>%
  tidyr::gather(key = "sample2", value = "distance", -sample) %>%
  dplyr::left_join(metadata, by = "sample") %>%
  dplyr::left_join(metadata, by = c("sample2" = "sample"), suffix = c(".subject", ".sample")) %>%
  filter(!!sym(paste0(subject.var, ".subject")) == !!sym(paste0(subject.var, ".sample"))) %>%
  dplyr::group_by(!!sym(paste0(subject.var, ".subject"))) %>%
  filter(!!sym(paste0(time.var,".sample")) == change.base) %>%
  filter(!!sym(paste0(time.var,".subject")) != !!sym(paste0(time.var,".sample"))) %>%
  dplyr::ungroup() %>%
  dplyr::select(!!sym(paste0(subject.var, ".subject")), !!sym(paste0(time.var, ".subject")), distance) %>%
  dplyr::rename(!!sym(subject.var) := !!sym(paste0(subject.var, ".subject")), !!sym(time.var) := !!sym(paste0(time.var, ".subject")))
dim(long.df) 

# Running the current code (line 132 of source code), this now contains 2 rows per subject, 44 rows total
long.df <- long.df %>% dplyr::left_join(metadata %>% dplyr::select(-any_of(c(time.var))) %>% dplyr::distinct(), by = subject.var)
dim(long.df)
head(long.df)

Expected Behavior
Using your "peerj32.obj" dataset, I believe "long.df" should have 22 rows, 1 per subject.

Actual Behavior
Because the variable "sample" is not excluded when selecting unique records, "long.df" has 44 rows, 2 per subject (1 per timepoint per subject).
Screenshot (14)

Environment Information:

  • Operating System: Windows 10 Education (version 22H2)
  • R Version: 4.2.1
  • Package Version: microbiome 1.23.1; phyloseq 1.42.0; ggplot2 3.4.3; rlang 1.1.1; tibble 3.2.1; MicrobiomeStat 1.1.3; dplyr 1.1.2

Attempted Solutions
I think modifying line 132 as follows should correct this:

long.df <- long.df %>% dplyr::left_join(metadata %>% dplyr::select(-any_of(c("sample",paste(time.var)))) %>% dplyr::distinct(), by = subject.var)

Screenshot (15)

One further issue for the specific example we were working on was that we had covariates that changed for subjects between baseline and the post-baseline visit. So to ensure we only used the baseline covariates for adjustment, we changed line 132 to the following:

long.df <- long.df %>% dplyr::left_join(metadata %>% dplyr::filter(!!sym(time.var)  == change.base) %>% dplyr::select(-any_of(c("sample",paste(time.var)))), by = subject.var)

Could you please have a look and see if this is indeed an issue, and see if either of the suggested solutions above correct this?
If it is an issue, I'm not sure if there is a similar problem in any other functions - this is the only one I have come across.

Many thanks!

Jamie

Cannot acces to data with generate_beta_ordination_pair object

Good morning/afternoon everyone,

I have used MicrobiomeStat for a longitudinal study, especially I've used the method generate_beta_ordination_pair to plot the Bray-Curtis distance overtime compared to inclusion in two groups, and I observed differences between these two biological groups overtime.

I wanted to access to the data (Bray-Curtis disance calculated by MicrobiomeStat), and I cannot find the data attached to the object.

data_spag

Will this option will be included in the future ?

Error when running generate_beta_trend_test_long() function

Dear Chen Yang,

Thank you for your work and your time in trying to answer all the questions users are aiming at you. I know you are very busy and I know I have already one question pending (issue #35) but I keep moving forward with my microbial analyses and new questions arise.

I am now trying to quantify temporal changes in microbial community composition using the generate_beta_trend_test_long() function but when running, I get the following error message:
Capture d’écran 2024-03-07 à 10 14 01

I have tried running the function using a dist.obj with my own distance matrices (as above) and also not using my own - dist.obj=NULL and dist.name= c("BC") - I get the same error in both cases.

I have also tried using a subset of my data with less samples and clean data (no missing values) to see if that would help but I still got the same error.

I usually try my best to solve most of my issues myself looking at forums on line but since your package is so new, there isn’t much available, that is why I am contacting you.

Hope you can help,
Best,
Gaëlle

time.var = NULL, in mStat_generate_report_single () causes an error and stops rendering

Hi, I'm running into an issue with the mStat_generate_report_single() when I am using data without a time variable. Hoping you can help.
Cheers
Rachele

Describe the Bug

Trying to run the report without a time variable, but the report fails.
If the time.var field = NULL in mStat_generate_report_single() , an error appears and the report stops rendering.

The error may be related to issue #15 - seems similar

Reproducible Example
Using the data set code below generates the error

data(ecam.obj)

mStat_generate_report_single(
  data.obj = ecam.obj,
  dist.obj = NULL,
  alpha.obj = NULL,
  group.var = "delivery",
  vis.adj.vars = "diet",
  test.adj.vars = "diet",
  subject.var = "subject.id",
  time.var = NULL,
  alpha.name = c("shannon", "observed_species"),
  depth = NULL,
  dist.name = c("BC",'Jaccard'),
  t.level = NULL,
  feature.box.axis.transform = "sqrt",
  strata.var = "antiexposedall",
  vis.feature.level = c("Phylum", "Family", "Genus"),
  test.feature.level = "Family",
  feature.dat.type = "proportion",
  theme.choice = "bw",
  base.size = 20,
  feature.mt.method = "none",
  feature.sig.level = 0.2,
  output.file = "C:/Scratch/Testecam.obj_report.pdf"
)

Expected Behavior
I expected the report to run either with NULL in the field or with it removed. The documentation does not make it clear what the options are.

Actual Behavior

The function will not run if it is removed ("argument "time.var" is missing, with no default"), and fails if NULL is used with the error message below:

processing file: file57fc6c274895.Rmd
|.... | 4% [input-parameters-summary]
Quitting from lines 18-105 [input-parameters-summary] (file57fc6c274895.Rmd)
Error in data.frame():
! arguments imply differing number of rows: 34, 33
Backtrace:

  1. base::data.frame(...)

Screenshots
Traceback:
image

Environment Information:

sessionInfo()
R version 4.3.2 (2023-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 11 x64 (build 22621)

Matrix products: default

locale:
[1] LC_COLLATE=English_Australia.utf8 LC_CTYPE=English_Australia.utf8 LC_MONETARY=English_Australia.utf8 LC_NUMERIC=C
[5] LC_TIME=English_Australia.utf8

time zone: Australia/Sydney
tzcode source: internal

attached base packages:
[1] stats graphics grDevices utils datasets methods base

loaded via a namespace (and not attached):
[1] digest_0.6.33 styler_1.10.2 fastmap_1.1.1 xfun_0.41 magrittr_2.0.3 glue_1.6.2 R.utils_2.12.3
[8] knitr_1.45 htmltools_0.5.7 lifecycle_1.0.4 cli_3.6.1 R.methodsS3_1.8.2 vctrs_0.6.5 reprex_2.0.2
[15] data.table_1.14.10 withr_2.5.2 compiler_4.3.2 R.oo_1.25.0 R.cache_0.16.0 purrr_1.0.2 rstudioapi_0.15.0
[22] tools_4.3.2 clipr_0.8.0 rlang_1.1.2 fs_1.6.3 htmlwidgets_1.6.4

Additional Context
Add any other context about the problem here, e.g., is this issue sporadic or consistent? Did it work in previous versions?

All samples were excluded after subsetting data

Hi I'm new to using this great tool. I have a small dataset of 20 samples. Including two groups and 10 timepoints for each group. I was trying to use the generate_alpha_test_long function to do alpha diversity analysis. I also made sure to make the Subject, Group, and Timepoint variables a factor.

my metadata:
image

The code:

MicrobiomeData <- list(
feature.tab = feature.tab,
meta.dat = meta.dat,
feature.ann = feature.ann,
tree = tree
)

MicrobiomeData$meta.dat$Timepoint <- factor(as.factor(MicrobiomeData$meta.dat$Timepoint), levels = c("T1", "T2", "T3", "T4","T5","T6", "T7", "T8", "T9", "T10"))

MicrobiomeData$feature.tab <- as.data.frame(MicrobiomeData$feature.tab) %>%
as.matrix()

MicrobiomeData$meta.dat <- as.matrix(MicrobiomeData$meta.dat) %>%
as.data.frame()

MicrobiomeData$feature.ann <- MicrobiomeData$feature.ann %>%
as.data.frame() %>%
as.matrix()

alpha_test_results <- generate_alpha_test_long(
data.obj = MicrobiomeData,
alpha.name = c("shannon", "simpson", "observed_species", "chao1", "ace", "pielou"),
time.var = "Timepoint",
t0.level = unique(MicrobiomeData$meta.dat$Timepoint)[1],
ts.levels = unique(MicrobiomeData$meta.dat$Timepoint)[-1],
group.var = "Group"
)

But I got these explanations and an empty list at the end:

Rule 1 passed: data.obj is a list.
Rule 2 passed: meta.dat is a data.frame.
Rule 3 passed: The row names of feature.tab match the row names of feature.ann.
Rule 4 passed: The order of rows in meta.dat has been adjusted to match feature.tab.
Rule 5 passed: feature.tab is a matrix.
Rule 6 passed: feature.ann is a matrix.
Please note: The data components should follow base R data.frame and matrix structures, not phyloseq's formal class.
Validation passed.
Note: Passing validation does not guarantee the absence of all data issues. Further data exploration may be needed.
Diversity analysis needs rarefaction! Call 'mStat_rarefy_data' to rarefy the data!
No depth specified, using the smallest column sum: 332393
Removed 1 rows from the OTU table and feature annotations due to all-zero counts after rarefication. The removed rows are: 1547 .
Data has been subsetted based on the provided samIDs.
Updated metadata to match the subsetted data.
The following samples were excluded: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
Updated feature table to match the subsetted data.
Updated feature annotation to match the subsetted data.
Data subsetting complete. Returning updated data object.
Calculating shannon diversity...
Calculating simpson diversity...
Calculating observed_species diversity...
Calculating chao1 diversity...
Calculating ace diversity...
Calculating pielou diversity...
Diversity calculations complete.

I'm not sure why all samples are excluded. Is samID the same as subject or individual samples' IDs? Could you please explain the step in which some samples are excluded.

Thanks,

Could not find function "mStat_convert_phyloseq_to_data_obj"

Hello,

I am excited to use the MicrobioStat package in order to test for differences in alpha diversity accross timepoints.
I have been working with phyloseq object so far and I would like to convert my Phyloseq object to a MicrobiomeStat data object.

I have been following the installation guide but somehow I keep getting the following error:
Error in mStat_convert_phyloseq_to_data_obj(peerj32.phy) :
could not find function "mStat_convert_phyloseq_to_data_obj"

I am working on Mac and I am using R version 4.3.1 (2023-06-16)

Here is the script I used:

#Install MicroBiomStat
install.packages("MicrobiomeStat")
#For complete set of functionalities
install.packages("devtools")
devtools::install_github("cafferychen777/MicrobiomeStat")

#List of packages to install
packages_to_install <- c(
"rlang", ....) # I have copied the list of packages as indicated in the installation guide
#Installing packages
install.packages(packages_to_install)

#Required package
library(microbiome)
library(MicrobiomeStat)

#Load the dataset
data(peerj32)
peerj32.phy <- peerj32$phyloseq

data.obj <- mStat_convert_phyloseq_to_data_obj(peerj32.phy)
Error in mStat_convert_phyloseq_to_data_obj(peerj32.phy) :
could not find function "mStat_convert_phyloseq_to_data_obj"

I am very sorry for the trivial question and for struggling on the very first step of using MicrobioStat.
Hope you can help.
Gaëlle

LinDA Unusual p-value Distribution

Using the data set shared previously via e-mail and fitting

adjNormalCancerFit <- linda(bacteriaMatrix, clinicalTablePairs, "~ Age * `Tissue Type` + Smoking * `Tissue Type` +
                            Gender * `Tissue Type` + `Primary Site` + (1|Patient)", "proportion", is.winsor = FALSE)

I get a strange-looking p-value distribution for many of the coefficients. For example,

image

How can it be made to be more uniform, as expected by statistics theory?

handling the false discovery rate ("fdr")

Dear MicrobiomeStat developer,
I have a question regarding the false discovery rate (FDR) correction.

I see on the documentation that the linda() function already takes into account FDR. However, in the function generate_taxa_volcano_single() there is an option called feature.mt.method = "none" or "fdr".

"feature.mt.method: There are two options available for this parameter: "fdr" (False Discovery Rate) and "none". Regardless of how this parameter is set, it's crucial to note that the generate_taxa_test_single function always performs adjustments post-testing. However, the feature.mt.method specifically influences the visualization in the volcano plot"

I am wondering what is the reason of adding another fdr layer on top of the LinDa calculation.
Could you provide an explanation on this?

Best regards

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.