qadabra's Introduction

Qadabra: Quantitative Analysis of Differential Abundance Ranks

(Pronounced ka-da-bra)

Qadabra is a Snakemake workflow for running and comparing several differential abundance (DA) methods on the same microbiome dataset.

Importantly, Qadabra focuses on both FDR-corrected p-values and feature ranks, and generates visualizations of differential abundance results.

Schematic

Please note this software is currently a work in progress. Your patience is appreciated as we continue to develop and enhance its features. Please leave an issue on GitHub should you run into any errors.

Installation

Option 1: Pip install from PyPI

pip install qadabra

Qadabra requires the following dependencies:

  • snakemake
  • click
  • biom-format
  • pandas
  • numpy
  • cython
  • iow

Check out the tutorial for more in-depth instructions on installation.

Option 2: Install from source (this GitHub repository)

Prerequisites

Before you begin, ensure you have Git and the necessary build tools installed on your system.

Clone the Repository

git clone https://github.com/biocore/qadabra.git

Navigate to the repository root directory, where the setup.py file is located, and install Qadabra in editable mode:

cd qadabra
pip install -e .

Usage

1. Creating the workflow directory

Qadabra can be used on multiple datasets at once. First, we want to create the workflow directory to perform differential abundance with all methods:

qadabra create-workflow --workflow-dest <directory_name>

This command will initialize the workflow, but we still need to point to our dataset(s) of interest.

2. Adding a dataset

We can add datasets one-by-one with the add-dataset command:

qadabra add-dataset \
    --workflow-dest <directory_name> \
    --table <directory_name>/data/table.biom \
    --metadata <directory_name>/data/metadata.tsv \
    --tree <directory_name>/data/my_tree.nwk \
    --name my_dataset \
    --factor-name case_control \
    --target-level case \
    --reference-level control \
    --confounder <confounding_var> \
    --verbose

Let's walk through the arguments provided here, which represent the inputs to Qadabra:

  • workflow-dest: The location of the workflow that we created earlier
  • table: Feature table (features by samples) in BIOM format
  • metadata: Sample metadata in TSV format
  • tree: Phylogenetic tree in .nwk or other tree format (optional)
  • name: Name to give this dataset
  • factor-name: Metadata column to use for differential abundance
  • target-level: The value in the chosen factor to use as the target
  • reference-level: The reference level to which we want to compare our target
  • confounder: Any confounding variable metadata columns (optional)
  • verbose: Flag to show all preprocessing performed by Qadabra

Your dataset should now be added as a line in <directory_name>/config/datasets.tsv (for example, my_qadabra/config/datasets.tsv).

You can use qadabra add-dataset --help for more details. To add another dataset, just run this command again with the new dataset information.
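For scripted setups, the add-dataset call described above can be assembled in Python. This is a minimal sketch, not part of Qadabra: the helper name build_add_dataset_cmd and its parameter names are hypothetical, only the flags listed above are used, and repeating --confounder once per column is an assumption.

```python
from typing import List, Optional, Sequence


def build_add_dataset_cmd(
    workflow_dest: str,
    table: str,
    metadata: str,
    name: str,
    factor_name: str,
    target_level: str,
    reference_level: str,
    tree: Optional[str] = None,
    confounders: Sequence[str] = (),
    verbose: bool = False,
) -> List[str]:
    """Assemble the argument list for `qadabra add-dataset` (hypothetical helper)."""
    cmd = [
        "qadabra", "add-dataset",
        "--workflow-dest", workflow_dest,
        "--table", table,
        "--metadata", metadata,
        "--name", name,
        "--factor-name", factor_name,
        "--target-level", target_level,
        "--reference-level", reference_level,
    ]
    # The tree is optional, so only add the flag when a path is given.
    if tree is not None:
        cmd += ["--tree", tree]
    # Assumption: one --confounder flag per confounding metadata column.
    for column in confounders:
        cmd += ["--confounder", column]
    if verbose:
        cmd.append("--verbose")
    return cmd


example_cmd = build_add_dataset_cmd(
    workflow_dest="my_qadabra",
    table="my_qadabra/data/table.biom",
    metadata="my_qadabra/data/metadata.tsv",
    name="my_dataset",
    factor_name="case_control",
    target_level="case",
    reference_level="control",
    verbose=True,
)
print(" ".join(example_cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) to execute the command, which avoids shell quoting problems with paths that contain spaces.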

3. Running the workflow

The previous commands create the workflow structure inside the directory you named (my_qadabra in the examples here). From the command line, execute the following to start the workflow:

snakemake --use-conda --cores <number of cores preferred> <other options>

Please read the Snakemake documentation for how to run Snakemake best on your system.

When this process is completed, you should have directories figures, results, and log. Each of these directories will have a separate folder for each dataset you added.
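A quick way to confirm that layout is to check for the per-dataset folders directly. The sketch below is illustrative only: missing_output_dirs is a hypothetical helper, and it assumes the figures, results, and log directories sit at the top level of the workflow directory.

```python
from pathlib import Path
from typing import Iterable, List


def missing_output_dirs(workflow_dir: str, dataset_names: Iterable[str]) -> List[Path]:
    """List expected per-dataset output folders that are not present."""
    root = Path(workflow_dir)
    names = list(dataset_names)  # materialize so it can be looped over repeatedly
    missing = []
    for top_level in ("figures", "results", "log"):
        for name in names:
            candidate = root / top_level / name
            if not candidate.is_dir():
                missing.append(candidate)
    return missing


# Example: report anything that is missing for one dataset.
for path in missing_output_dirs("my_qadabra", ["my_dataset"]):
    print(f"missing: {path}")
```

An empty result means every dataset has its folder under all three directories; it does not verify that the individual result files inside them were written.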

4. Generating a report

After Qadabra has finished running, you can generate a Snakemake report of the workflow with the following command:

snakemake --report report.zip

This will create a zipped directory containing the report. Unzip this file and open the report.html file to view the report containing results and visualizations in your browser.

Tutorial

See the tutorial page for a walkthrough on using the Qadabra workflow with a microbiome dataset.

FAQs

Coming soon: an FAQ page of commonly asked questions on the statistics and code pertaining to Qadabra.

Citation

The manuscript for Qadabra is currently in progress. Please cite this GitHub page if Qadabra is used for your analysis. This project is licensed under the BSD-3 License. See the license file for details.

qadabra's People

Contributors

amandabirmingham, gibsramen, yangchen2

qadabra's Issues

Possible syntax error regarding a HoverTool attribute and use of iteritems on a Series

Below is a snippet from the log of a recent Qadabra run that seems to have ended due to 2 errors. From what I can tell, the first one occurs in plot_rank_comparison.py in the line hover = HoverTool(mode="mouse", names=["points"], attachment="below") because it was expecting name instead of names. The second seems to occur in plot_pca.py because iteritems is called on a Series instead of items. I checked the log files but they did not shed any more light on the errors.

Given that these might be relatively quick fixes if they are indeed what caused it to error out, could you please let me know if this issue has been encountered before?

Here is the snippet from the log:

Activating conda environment: .snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_
Activating conda environment: .snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_
Traceback (most recent call last):
  File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/scripts/tmpzhmmn4p4.plot_rank_comparison.py", line 52, in <module>
    hover = HoverTool(mode="mouse", names=["points"], attachment="below")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_/lib/python3.11/site-packages/bokeh/models/tools.py", line 1279, in __init__
    super().__init__(*args, **kwargs)
  File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_/lib/python3.11/site-packages/bokeh/models/tools.py", line 346, in __init__
    super().__init__(*args, **kwargs)
  File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_/lib/python3.11/site-packages/bokeh/models/tools.py", line 256, in __init__
    super().__init__(*args, **kwargs)
  File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_/lib/python3.11/site-packages/bokeh/models/tools.py", line 181, in __init__
    super().__init__(*args, **kwargs)
  File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_/lib/python3.11/site-packages/bokeh/model/model.py", line 110, in __init__
    super().__init__(**kwargs)
  File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_/lib/python3.11/site-packages/bokeh/core/has_props.py", line 302, in __init__
    setattr(self, name, value)
  File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_/lib/python3.11/site-packages/bokeh/core/has_props.py", line 340, in __setattr__
    self._raise_attribute_error_with_matches(name, properties)
  File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_/lib/python3.11/site-packages/bokeh/core/has_props.py", line 375, in _raise_attribute_error_with_matches
    raise AttributeError(f"unexpected attribute {name!r} to {self.__class__.__name__}, {text} attributes are {nice_join(matches)}")
AttributeError: unexpected attribute 'names' to HoverTool, similar attributes are name
[Wed Jan 17 16:35:46 2024]
Finished job 19.
115 of 152 steps (76%) done
Select jobs to execute...

[Wed Jan 17 16:35:47 2024]
rule plot_pca:
    input: results/environment_comparisons/pca/pca_features.tsv, results/environment_comparisons/pca/pca_tools.tsv, results/environment_comparisons/pca/proportion_explained.tsv
    output: figures/environment_comparisons/pca.svg
    log: log/environment_comparisons/plot_pca.log
    jobid: 30
    reason: Missing output files: figures/environment_comparisons/pca.svg; Input files updated by another job: results/environment_comparisons/pca/pca_tools.tsv, results/environment_comparisons/pca/proportion_explained.tsv, results/environment_comparisons/pca/pca_features.tsv
    wildcards: dataset=environment_comparisons
    resources: tmpdir=/var/folders/_w/22jzw70d6jx1ngxrkkj2p5g80000gp/T

Activating conda environment: .snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_
[Wed Jan 17 16:35:47 2024]
Error in rule plot_rank_comparison:
    jobid: 29
    input: results/environment_comparisons/concatenated_differentials.tsv
    output: figures/environment_comparisons/rank_comparisons.html
    log: log/environment_comparisons/plot_rank_comparison.log (check log file(s) for error details)
    conda-env: /Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_

RuleException:
CalledProcessError in file /Users/aphillip/anaconda3/envs/qadabra_env/lib/python3.10/site-packages/qadabra/workflow/rules/visualization.smk, line 149:
Command 'source /Users/aphillip/anaconda3/bin/activate '/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_'; set -euo pipefail;  python /Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/scripts/tmpzhmmn4p4.plot_rank_comparison.py' returned non-zero exit status 1.
  File "/Users/aphillip/anaconda3/envs/qadabra_env/lib/python3.10/site-packages/qadabra/workflow/rules/visualization.smk", line 149, in __rule_plot_rank_comparison
  File "/Users/aphillip/anaconda3/envs/qadabra_env/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Activating conda environment: .snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_
[Wed Jan 17 16:35:48 2024]
Finished job 139.
116 of 152 steps (76%) done
[Wed Jan 17 16:35:50 2024]
Finished job 28.
117 of 152 steps (77%) done
Traceback (most recent call last):
  File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/scripts/tmp1th9bjlw.plot_pca.py", line 106, in <module>
    for i, x in prop_exp.iteritems()
                ^^^^^^^^^^^^^^^^^^
  File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_/lib/python3.11/site-packages/pandas/core/generic.py", line 6204, in __getattr__
    return object.__getattribute__(self, name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Series' object has no attribute 'iteritems'
[Wed Jan 17 16:35:52 2024]
Error in rule plot_pca:
    jobid: 30
    input: results/environment_comparisons/pca/pca_features.tsv, results/environment_comparisons/pca/pca_tools.tsv, results/environment_comparisons/pca/proportion_explained.tsv
    output: figures/environment_comparisons/pca.svg
    log: log/environment_comparisons/plot_pca.log (check log file(s) for error details)
    conda-env: /Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_

RuleException:
CalledProcessError in file /Users/aphillip/anaconda3/envs/qadabra_env/lib/python3.10/site-packages/qadabra/workflow/rules/visualization.smk, line 131:
Command 'source /Users/aphillip/anaconda3/bin/activate '/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/conda/b93b3e3540dfc9b230066ff51de94cd0_'; set -euo pipefail;  python /Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/prok_slc_environments/.snakemake/scripts/tmp1th9bjlw.plot_pca.py' returned non-zero exit status 1.
  File "/Users/aphillip/anaconda3/envs/qadabra_env/lib/python3.10/site-packages/qadabra/workflow/rules/visualization.smk", line 131, in __rule_plot_pca
  File "/Users/aphillip/anaconda3/envs/qadabra_env/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-01-17T161851.931421.snakemake.log
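On the second error: pandas removed Series.iteritems() in version 2.0 in favor of Series.items(), which is consistent with the AttributeError in the plot_pca.py traceback. The direct fix is to call prop_exp.items() instead. Where a script has to run under both old and new versions, a small shim is one option. The sketch below is an assumption-laden illustration: iter_pairs and LegacySeries are hypothetical names, not part of Qadabra or pandas, and plain Python stand-ins are used so that it runs without pandas installed.

```python
from typing import Any, Iterator, Tuple


def iter_pairs(series_like: Any) -> Iterator[Tuple[Any, Any]]:
    """Yield (label, value) pairs, preferring .items() over legacy .iteritems()."""
    if hasattr(series_like, "items"):
        return iter(series_like.items())
    return iter(series_like.iteritems())


class LegacySeries:
    """Stand-in for an object that only offers the old iteritems() method."""

    def __init__(self, data):
        self._data = dict(data)

    def iteritems(self):
        return iter(self._data.items())


# A dict stands in for a modern Series here; both expose .items().
prop_exp = {"PC1": 0.6, "PC2": 0.3}
labels = [f"{i} ({x:.0%})" for i, x in iter_pairs(prop_exp)]
print(labels)
```

If only recent pandas versions need to be supported, replacing the call with .items() is simpler than carrying a shim.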

Update Songbird output

Currently the output is differentials.tsv which is not actually called in the shell command. Normally this isn't a problem because Songbird will make it in the correct location. However, this becomes an issue when using Qadabra as a module, for example in https://github.com/gibsramen/qadabra-analyses. Could probably use the directory() command rather than the file.

AttributeError: unexpected attribute 'names' to HoverTool, similar attributes are name

The Snakemake run got stuck after 115 of 152 jobs with the error below. I am not familiar with HoverTool, but the error seems to suggest that it takes name instead of names(?). Could you help look into this? Thank you!

Activating conda environment: .snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_
Activating conda environment: .snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_
[Wed Nov 29 19:05:36 2023]
Finished job 19.
115 of 152 steps (76%) done
Select jobs to execute...

[Wed Nov 29 19:05:37 2023]
rule plot_rank_comparison:
    input: results/my_gutb/concatenated_differentials.tsv
    output: figures/my_gutb/rank_comparisons.html
    log: log/my_gutb/plot_rank_comparison.log
    jobid: 29
    reason: Missing output files: figures/my_gutb/rank_comparisons.html; Input files updated by another job: results/my_gutb/concatenated_differentials.tsv
    wildcards: dataset=my_gutb
    resources: tmpdir=/var/folders/p3/2jjw49sd6jz7zxhgh3c4q3m40000gp/T

Activating conda environment: .snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_
Activating conda environment: .snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_
Traceback (most recent call last):
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t12_qadabra_out/.snakemake/scripts/tmpmvh3_enj.plot_rank_comparison.py", line 52, in <module>
    hover = HoverTool(mode="mouse", names=["points"], attachment="below")
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t12_qadabra_out/.snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_/lib/python3.11/site-packages/bokeh/models/tools.py", line 1266, in __init__
    super().__init__(*args, **kwargs)
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t12_qadabra_out/.snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_/lib/python3.11/site-packages/bokeh/models/tools.py", line 345, in __init__
    super().__init__(*args, **kwargs)
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t12_qadabra_out/.snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_/lib/python3.11/site-packages/bokeh/models/tools.py", line 255, in __init__
    super().__init__(*args, **kwargs)
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t12_qadabra_out/.snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_/lib/python3.11/site-packages/bokeh/models/tools.py", line 180, in __init__
    super().__init__(*args, **kwargs)
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t12_qadabra_out/.snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_/lib/python3.11/site-packages/bokeh/model/model.py", line 110, in __init__
    super().__init__(**kwargs)
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t12_qadabra_out/.snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_/lib/python3.11/site-packages/bokeh/core/has_props.py", line 302, in __init__
    setattr(self, name, value)
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t12_qadabra_out/.snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_/lib/python3.11/site-packages/bokeh/core/has_props.py", line 340, in __setattr__
    self._raise_attribute_error_with_matches(name, properties)
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t12_qadabra_out/.snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_/lib/python3.11/site-packages/bokeh/core/has_props.py", line 375, in _raise_attribute_error_with_matches
    raise AttributeError(f"unexpected attribute {name!r} to {self.__class__.__name__}, {text} attributes are {nice_join(matches)}")
AttributeError: unexpected attribute 'names' to HoverTool, similar attributes are name
[Wed Nov 29 19:05:39 2023]
Error in rule plot_rank_comparison:
    jobid: 29
    input: results/my_gutb/concatenated_differentials.tsv
    output: figures/my_gutb/rank_comparisons.html
    log: log/my_gutb/plot_rank_comparison.log (check log file(s) for error details)
    conda-env: /Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t12_qadabra_out/.snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_

RuleException:
CalledProcessError in file /Users/y1weng/miniconda3/envs/snakemake/lib/python3.11/site-packages/qadabra/workflow/rules/visualization.smk, line 149:
Command 'source /Users/y1weng/miniconda3/bin/activate '/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t12_qadabra_out/.snakemake/conda/19fe2a2a4b9842ba47b6f7aa7be3b1e4_'; set -euo pipefail;  python /Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t12_qadabra_out/.snakemake/scripts/tmpmvh3_enj.plot_rank_comparison.py' returned non-zero exit status 1.
  File "/Users/y1weng/miniconda3/envs/snakemake/lib/python3.11/site-packages/qadabra/workflow/rules/visualization.smk", line 149, in __rule_plot_rank_comparison
  File "/Users/y1weng/miniconda3/envs/snakemake/lib/python3.11/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2023-11-29T185655.958966.snakemake.log

Issue adding the tutorial dataset

I created workflow/dataset directories and installed Qadabra as specified in the tutorial (using conda instead of mamba), but when I try to add the dataset I keep getting an error stating that the table IDs don't match the metadata IDs. I checked the IDs from the .biom and the .tsv file and they seem to match (shown below), so I was just wondering if this issue has come up for anyone else before. If not, are there any alternative datasets that you would recommend for the tutorial?

I also tried doing this in a different environment in which python 3.9 wasn't specified and conda was used to install everything except Qadabra, which was installed using pip. I ran into the same issue.

Note: qadabra_env_2 was created as specified in the tutorial, and qadabra_env was created without explicitly using python 3.9 and using pip to install Qadabra and conda to install all dependencies.

My directory structure:

.
├── check_metadata_coverage.py <-- custom script, everything else is following the tutorial
└── my_qadabra
    ├── config
    │   ├── config.yaml
    │   └── qadabra.mplstyle
    ├── data
    │   ├── qadabra_tutorial_metadata.tsv
    │   └── qadabra_tutorial_table.biom
    └── workflow
        └── Snakefile

Attempting to add the dataset to the workflow:

(qadabra_env_2) aphilliplt-osx:qadabra_tutorial aphillip$ qadabra add-dataset \
>     --workflow-dest my_qadabra \
>     --table my_qadabra/data/qadabra_tutorial_table.biom \
>     --metadata my_qadabra/data/qadabra_tutorial_metadata.tsv \
>     --name skin_microbiome \
>     --factor-name group \
>     --target-level Day_90 \
>     --reference-level Baseline \
>     --verbose
[2024-01-10 12:28:16 - INFO] :: Validating input...
[2024-01-10 12:28:16 - INFO] :: Loading metadata...
[2024-01-10 12:28:16 - INFO] :: Making sure factor & levels are all present in metadata...
[2024-01-10 12:28:16 - INFO] :: Factor counts:
group
Baseline    19
Day_90      19
Name: count, dtype: int64
[2024-01-10 12:28:16 - INFO] :: Making sure confounders are all metadata columns...
[2024-01-10 12:28:16 - INFO] :: Loading table...
[2024-01-10 12:28:16 - INFO] :: Table shape: (11, 38)
Traceback (most recent call last):
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/bin/qadabra", line 8, in <module>
    sys.exit(qadabra())
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/qadabra/qadabra.py", line 148, in add_dataset
    _validate_input(logger, table, metadata, factor_name, target_level,
  File "/Users/aphillip/anaconda3/envs/qadabra_env_2/lib/python3.9/site-packages/qadabra/utils.py", line 44, in _validate_input
    raise ValueError("Table IDs are not a subset of metadata IDs!")
ValueError: Table IDs are not a subset of metadata IDs!

Checking the table IDs and the metadata IDs:

(qadabra_env) aphilliplt-osx:qadabra_tutorial aphillip$ biom table-ids -i my_qadabra/data/qadabra_tutorial_table.biom
25
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
52
53
54
55
56
57
58
59
60
61
62
63
64
(qadabra_env) aphilliplt-osx:qadabra_tutorial aphillip$ python check_metadata_coverage.py
Error: Biom IDs without corresponding metadata entries: {'43', '64', '63', '45', '32', '33', '42', '44', '47', '38', '39', '52', '28', '49', '50', '48', '27', '34', '62', '57', '37', '54', '30', '61', '29', '31', '60', '55', '56', '36', '25', '35', '58', '59', '53', '41', '46', '40'}
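The report does not show what check_metadata_coverage.py does, and no resolution is recorded here. One common cause of this symptom, offered only as an assumption, is a type mismatch: BIOM sample IDs are strings, while an all-numeric ID column can be parsed as integers when the metadata is loaded. A minimal sketch of a comparison that reports both the raw and the string-normalized mismatch (diagnose_id_mismatch is a hypothetical name):

```python
from typing import Dict, Iterable, Set


def diagnose_id_mismatch(table_ids: Iterable, metadata_ids: Iterable) -> Dict[str, Set]:
    """Compare IDs as given and again after converting both sides to strings."""
    table_raw = set(table_ids)
    meta_raw = set(metadata_ids)
    table_str = {str(i).strip() for i in table_raw}
    meta_str = {str(i).strip() for i in meta_raw}
    return {
        # IDs in the table with no exact (type-sensitive) match in the metadata.
        "missing_raw": table_raw - meta_raw,
        # IDs still unmatched once both sides are compared as strings.
        "missing_as_str": table_str - meta_str,
    }


# String IDs from the table versus integer IDs from the metadata (illustrative values).
report = diagnose_id_mismatch(["25", "27", "28"], [25, 27, 28])
print(report)
```

If missing_raw is non-empty but missing_as_str is empty, the IDs agree in content and differ only in type, which would point at how the metadata index is being read rather than at the files themselves.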

Confounders not accepted as string

Got this error when testing qadabra yesterday:

Building DAG of jobs...
InputFunctionException in rule ancombc in file /home/adilmore/Y/envs/qadabra_env/lib/python3.9/site-packages/qadabra/workflow/rules/diffab.smk, line 23:
Error: TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Wildcards:

Fixed locally by changing np.isnan to pd.isnull here and here.
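The reason the swap helps: np.isnan only accepts numeric input, so a confounder stored as a string raises the TypeError above, whereas pd.isnull accepts arbitrary objects. The same distinction can be shown with the standard library alone. In the sketch below, is_missing is a hypothetical helper for illustration and is not Qadabra's code.

```python
import math
from typing import Any


def is_missing(value: Any) -> bool:
    """Return True for None or a float NaN; any other value counts as present."""
    if value is None:
        return True
    return isinstance(value, float) and math.isnan(value)


# math.isnan, like np.isnan, refuses non-numeric input outright.
try:
    math.isnan("age")  # a string confounder name, used here as an example value
except TypeError as err:
    print(f"math.isnan rejects strings: {err}")

# The type-tolerant check handles strings, numbers, and missing values alike.
print(is_missing("age"), is_missing(float("nan")), is_missing(None))
```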

Target and Reference Level Issue

There is an issue when the target-level and reference-level equal True/False, 1/0, "True"/"False".
The error is "ValueError: 1 not found in $factor-name values!"
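The cause is not stated in the report. A plausible explanation, labeled here as an assumption, is a type mismatch: levels passed on the command line arrive as strings, while a metadata column of 1/0 or True/False may be parsed as integers or booleans. A sketch of a string-normalized membership check (level_present is a hypothetical helper, not Qadabra's code):

```python
from typing import Any, Iterable


def level_present(level: Any, factor_values: Iterable[Any]) -> bool:
    """Check membership after converting the level and all values to strings."""
    return str(level) in {str(value) for value in factor_values}


factor_values = [1, 0, 1, 0]  # a column parsed as integers (illustrative)

# A type-sensitive check misses the level because "1" != 1 ...
print("1" in factor_values)
# ... while the normalized check finds it.
print(level_present("1", factor_values))
```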

n_splits=5 cannot be greater than the number of members in each class.

I might have missed this in the documentation: what is the minimum number of control and treatment samples required for the analysis?

Follow-up: the error disappeared after including more samples. It would be helpful to add the minimum number of samples required to the documentation.

Traceback (most recent call last):
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t1_qadabra_out/.snakemake/scripts/tmp9w0fsvnp.logistic_regression.py", line 54, in <module>
    folds = [(train, test) for train, test in cv.split(X, y)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t1_qadabra_out/.snakemake/scripts/tmp9w0fsvnp.logistic_regression.py", line 54, in <listcomp>
    folds = [(train, test) for train, test in cv.split(X, y)]
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t1_qadabra_out/.snakemake/conda/ae102f57fe1fb6ba91271bbd0ace3284_/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 1527, in split
    for train_index, test_index in cv.split(X, y, groups):
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t1_qadabra_out/.snakemake/conda/ae102f57fe1fb6ba91271bbd0ace3284_/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 377, in split
    for train, test in super().split(X, y, groups):
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t1_qadabra_out/.snakemake/conda/ae102f57fe1fb6ba91271bbd0ace3284_/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 108, in split
    for test_index in self._iter_test_masks(X, y, groups):
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t1_qadabra_out/.snakemake/conda/ae102f57fe1fb6ba91271bbd0ace3284_/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 770, in _iter_test_masks
    test_folds = self._make_test_folds(X, y)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/y1weng/Downloads/20231129_qadabra_gutb/mi04t1_qadabra_out/.snakemake/conda/ae102f57fe1fb6ba91271bbd0ace3284_/lib/python3.11/site-packages/sklearn/model_selection/_split.py", line 732, in _make_test_folds
    raise ValueError(
ValueError: n_splits=5 cannot be greater than the number of members in each class.
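The traceback comes from the cross-validation split in scikit-learn's _split.py, called from the logistic_regression.py script with n_splits=5. A conservative rule of thumb, inferred from the error message rather than stated in the Qadabra documentation, is to have at least n_splits samples in each of the two groups for this step. A pre-flight check can be sketched with the standard library; classes_below_n_splits is a hypothetical name, and other Qadabra steps may impose requirements not covered here.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def classes_below_n_splits(labels: Iterable[Hashable], n_splits: int = 5) -> Dict[Hashable, int]:
    """Return the classes (with their counts) that have fewer than n_splits members."""
    counts = Counter(labels)
    return {label: n for label, n in counts.items() if n < n_splits}


# Illustrative labels: 3 target samples and 6 reference samples.
labels = ["case"] * 3 + ["control"] * 6
too_small = classes_below_n_splits(labels, n_splits=5)
print(too_small)
```

An empty dictionary means every class has enough members for the requested number of folds.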

error when installing scikit-bio

Had an issue with scikit-bio (skbio) specifically during installation:

ERROR: Failed building wheel for scikit-bio
Failed to build scikit-bio
ERROR: Could not build wheels for scikit-bio, which is required to install pyproject.toml-based projects

Fixed by forcing pip install scikit-bio==0.5.6

Strangely, this did not happen on a local install, only on a barnacle2 install.

Qadabra Snakemake file using unexpected p-values from corncob and metagenomeseq

I was reviewing the Snakemake file that shows which fields from each tool are used for the differentials/p-values when I noticed something unexpected about the values taken from corncob and metagenomeseq. It seems that the pvalues field from metagenomeseq is used instead of adjPvalues, and the fit.p field from corncob is used instead of adjusted_p_values. At first I thought this could be because Qadabra performs FDR correction internally, but in that case I'd expect the p-value fields for the other tools to be different as well.

Here is a link to the Snakemake file I'm referring to: https://github.com/biocore/qadabra/blob/main/qadabra/workflow/rules/common.smk .
Here's the snippet that shows where the p-values are coming from:

def get_pvalue_tool_columns(wildcards):
    d = datasets.loc[wildcards.dataset].to_dict()
    covariate = d["factor_name"]
    target = d["target_level"]
    reference = d["reference_level"]

    columns = {
        "edger": "PValue",
        "deseq2": "pvalue",
        "ancombc": "pvals",
        "aldex2": f"model.{covariate}{target} Pr(>|t|)",
        "maaslin2": "pval",
        "metagenomeseq": "pvalues",
        "corncob": "fit.p",
    }
    return columns[wildcards.tool]

Can someone please explain what's going on there? I'll continue to look into the documentation for each tool and update this issue if I find anything relevant.
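For reference when comparing raw and adjusted columns, a Benjamini-Hochberg adjustment can be sketched as follows. This is a generic illustration of FDR correction, not a statement of how Qadabra or the individual tools compute their adjusted values; the thread does not say which procedure, if any, Qadabra applies internally. The function name is hypothetical.

```python
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the original input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is capped by
    # the one above it and the result stays monotone in the ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


raw = [0.01, 0.04, 0.03, 0.005]  # illustrative raw p-values
print(benjamini_hochberg(raw))
```

Running raw p-values from one tool through such a function and comparing against that tool's own adjusted column is one way to see whether the two columns differ only by the correction step.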

Wrong LICENSE referenced in README

README mentions MIT license but we are using BSD-3.

The manuscript for Qadabra is currently in progress. Please cite this GitHub page if Qadabra is used for your analysis. This project is licensed under the MIT License. See the [license](LICENSE) file for details.

Add better logging

Currently a lot of the logging is pretty sparse. This will likely be a problem when debugging issues. We should log simple messages like "Running model..." to determine where bugs occur (especially in R).

Move PCA from interactive app to own rules

Can try out PC1 as "consensus" ranks to see if it outperforms individual tools

Also maybe we scrap the whole interactive app altogether and just keep the interactive tool rank comparisons as an HTML to include in the report.

Issue using corncob and metagenomeseq

I encountered an issue when running Qadabra that involves two of the tools it uses: corncob and metagenomeseq. I checked both log files (see below), but it was not clear to me whether the error is caused by my input files or how to resolve it.
[Tue Feb 13 15:11:19 2024]
Error in rule metagenomeseq:
jobid: 15
input: /Users/bpessemier/my_qadabra3/data/metaphlan4species_readstats_transposed.biom, /Users/bpessemier/my_qadabra3/data/metadata_Psoriasis.tsv
output: results/skin_microbiome/tools/metagenomeseq/differentials.tsv, results/skin_microbiome/tools/metagenomeseq/results.rds
log: log/skin_microbiome/metagenomeseq.log (check log file(s) for error details)
conda-env: /Users/bpessemier/my_qadabra3/.snakemake/conda/656a891cb8f6dd51f64ad50a1c6b0f26_
RuleException:
CalledProcessError in file /Users/bpessemier/anaconda3/envs/qadabra_env/lib/python3.9/site-packages/qadabra/workflow/rules/diffab.smk, line 131:
Command 'source /Users/bpessemier/anaconda3/envs/qadabra_env/bin/activate '/Users/bpessemier/my_qadabra3/.snakemake/conda/656a891cb8f6dd51f64ad50a1c6b0f26_'; set -euo pipefail; Rscript --vanilla /Users/bpessemier/my_qadabra3/.snakemake/scripts/tmpyrssflqd.metagenomeseq.R' returned non-zero exit status 1.
File "/Users/bpessemier/anaconda3/envs/qadabra_env/lib/python3.9/site-packages/qadabra/workflow/rules/diffab.smk", line 131, in rule_metagenomeseq
File "/Users/bpessemier/anaconda3/envs/qadabra_env/lib/python3.9/concurrent/futures/thread.py", line 58, in run
[Tue Feb 13 15:11:39 2024]
Finished job 7.
25 of 163 steps (15%) done
[Tue Feb 13 15:12:05 2024]
Error in rule corncob:
jobid: 17
input: /Users/bpessemier/my_qadabra3/data/metaphlan4species_readstats_transposed.biom, /Users/bpessemier/my_qadabra3/data/metadata_Psoriasis.tsv
output: results/skin_microbiome/tools/corncob/differentials.tsv, results/skin_microbiome/tools/corncob/results.rds
log: log/skin_microbiome/corncob.log (check log file(s) for error details)
conda-env: /Users/bpessemier/my_qadabra3/.snakemake/conda/656a891cb8f6dd51f64ad50a1c6b0f26

RuleException:
CalledProcessError in file /Users/bpessemier/anaconda3/envs/qadabra_env/lib/python3.9/site-packages/qadabra/workflow/rules/diffab.smk, line 147:
Command 'source /Users/bpessemier/anaconda3/envs/qadabra_env/bin/activate '/Users/bpessemier/my_qadabra3/.snakemake/conda/656a891cb8f6dd51f64ad50a1c6b0f26'; set -euo pipefail; Rscript --vanilla /Users/bpessemier/my_qadabra3/.snakemake/scripts/tmpfdrzbei1.corncob.R' returned non-zero exit status 1.
File "/Users/bpessemier/anaconda3/envs/qadabra_env/lib/python3.9/site-packages/qadabra/workflow/rules/diffab.smk", line 147, in __rule_corncob
File "/Users/bpessemier/anaconda3/envs/qadabra_env/lib/python3.9/concurrent/futures/thread.py", line 58, in run
Removing output files of failed job corncob since they might be corrupted:
results/skin_microbiome/tools/corncob/results.rds
[Tue Feb 13 15:13:08 2024]
Finished job 9.
26 of 163 steps (16%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-02-13T150501.551322.snakemake.log

This was the log output of corncob:
[1] "Loading table..."
[1] "Loading metadata..."
[1] "Harmonizing table and metadata samples..."
[1] "Converting to phyloseq..."
[1] "Creating design formula..."
~Pool
[1] "Running corncob..."
[1] "Saved RDS!"
[1] "Aggregating models..."
Error: $ operator is invalid for atomic vectors
Execution halted

This was the log output of metagenomeSeq:
[1] "Loading table..."
[1] "Loading metadata..."
[1] "Harmonizing table and metadata samples..."
Default value being used.
[1] "Creating design formula..."
[1] "Running metagenomeSeq..."
it= 0, nll=210.47, log10(eps+1)=Inf, stillActive=738
it= 1, nll=210.34, log10(eps+1)=Inf, stillActive=253
it= 2, nll=206.62, log10(eps+1)=Inf, stillActive=242
it= 3, nll=204.02, log10(eps+1)=Inf, stillActive=194
it= 4, nll=202.55, log10(eps+1)=Inf, stillActive=142
it= 5, nll=202.81, log10(eps+1)=Inf, stillActive=74
it= 6, nll=203.25, log10(eps+1)=Inf, stillActive=49
it= 7, nll=203.77, log10(eps+1)=Inf, stillActive=38
it= 8, nll=203.71, log10(eps+1)=Inf, stillActive=38
it= 9, nll=203.66, log10(eps+1)=Inf, stillActive=38
Error in if (max(df.residual) == 0) stop("No residual degrees of freedom in linear model fits") :
missing value where TRUE/FALSE needed
Calls: -> .do_fitZig -> -> .ebayes
Execution halted

Error regarding MaAsLin2 log ratios

Context

I tried running Qadabra on some prokaryotic species-level cluster (PSLC) counts data recently and ran into the error below. At first I thought it could have had something to do with the formatting of the index names in the .biom file I used as the table, but when I re-ran it I encountered the same error.

Specifically, there seems to be something awry when it comes to the calculation of log ratios for MaAsLin2. From what I could tell, there were 2 errors regarding the log_ratio function used to create results/environment_comparisons/ml/maaslin2/log_ratios/log_ratios.pctile_15.tsv and its pctile_20 counterpart (shown in the error message). The log_ratios.pctile_20.tsv file has 274 out of 687 total PSLCs in my original dataset/table, but the error message states that 137 are not in the columns (I'm not sure which columns it's referring to). I'm having trouble locating the tmpflekjq_3.get_log_ratios.py script mentioned in the error message so I figured I'd report this and ask to see if something like this has come up before.

My questions

  1. I pasted the header of the log_ratios.pctile_20.tsv file at the bottom; are the values of the 3 rightmost columns supposed to be the same regardless of the feature_id? I don't know if that's causing the error but I figured I'd check.
  2. How can I find the script that's mentioned in the error message so I can find out what the columns it's referring to are? When I ls -a the .snakemake/scripts/ directory (looking for tmpflekjq_3.get_log_ratios.py) I don't see anything in it.
  3. Have you encountered a similar issue?

Thank you and I'm happy to provide more info if needed.

Code

Adding the dataset:

(qadabra_env) aphilliplt-osx:differential_abundance_analysis aphillip$ qadabra add-dataset \
> --workflow-dest my_qadabra \
> --table my_qadabra/data/slc_prok.biom \
> --metadata my_qadabra/data/environment_metadata.tsv \
> --name environment_comparisons \
> --factor-name group \
> --target-level sediment \
> --reference-level water \
> --verbose

Running the workflow:

(qadabra_env) aphilliplt-osx:differential_abundance_analysis aphillip$ cd my_qadabra/
(qadabra_env) aphilliplt-osx:my_qadabra aphillip$ snakemake --use-conda --cores 4

Portion of the error message:

Activating conda environment: .snakemake/conda/398bc325f983272b15d35276018e95bd_
Traceback (most recent call last):
  File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/my_qadabra/.snakemake/scripts/tmpflekjq_3.get_log_ratios.py", line 41, in <module>
    lr_df = log_ratio(table, top_feats, bot_feats).reset_index()
...
File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/my_qadabra/.snakemake/conda/398bc325f983272b15d35276018e95bd_/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['PSLC.198', 'PSLC.643', 'PSLC.340', 'PSLC.357', 'PSLC.32', 'PSLC.662',\n       'PSLC.38', 'PSLC.375', 'PSLC.389', 'PSLC.23',\n       ...\n       'PSLC.53', 'PSLC.668', 'PSLC.39', 'PSLC.40', 'PSLC.1', 'PSLC.270',\n       'PSLC.303', 'PSLC.291', 'PSLC.195', 'PSLC.683'],\n      dtype='object', length=137)] are in the [columns]"
[Fri Jan 12 17:30:38 2024]
Error in rule log_ratios:
    jobid: 143
    input: /Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/my_qadabra/data/slc_prok.biom, results/environment_comparisons/ml/maaslin2/pctile_feats/pctile_20.tsv
    output: results/environment_comparisons/ml/maaslin2/log_ratios/log_ratios.pctile_20.tsv
    log: log/environment_comparisons/log_ratios.maaslin2.pctile_20.log (check log file(s) for error details)
    conda-env: /Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/my_qadabra/.snakemake/conda/398bc325f983272b15d35276018e95bd_

RuleException:
CalledProcessError in file /Users/aphillip/anaconda3/envs/qadabra_env/lib/python3.10/site-packages/qadabra/workflow/rules/ml.smk, line 55:
Command 'source /Users/aphillip/anaconda3/bin/activate '/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/my_qadabra/.snakemake/conda/398bc325f983272b15d35276018e95bd_'; set -euo pipefail;  python /Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/my_qadabra/.snakemake/scripts/tmpflekjq_3.get_log_ratios.py' returned non-zero exit status 1.
  File "/Users/aphillip/anaconda3/envs/qadabra_env/lib/python3.10/site-packages/qadabra/workflow/rules/ml.smk", line 55, in __rule_log_ratios
  File "/Users/aphillip/anaconda3/envs/qadabra_env/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Traceback (most recent call last):
  File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/my_qadabra/.snakemake/scripts/tmppagdql8a.get_log_ratios.py", line 41, in <module>
    lr_df = log_ratio(table, top_feats, bot_feats).reset_index()
...
File "/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/my_qadabra/.snakemake/conda/398bc325f983272b15d35276018e95bd_/lib/python3.11/site-packages/pandas/core/indexes/base.py", line 6176, in _raise_if_missing
    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Index(['PSLC.198', 'PSLC.643', 'PSLC.340', 'PSLC.357', 'PSLC.32', 'PSLC.662',\n       'PSLC.38', 'PSLC.375', 'PSLC.389', 'PSLC.23',\n       ...\n       'PSLC.52', 'PSLC.665', 'PSLC.318', 'PSLC.685', 'PSLC.276', 'PSLC.349',\n       'PSLC.309', 'PSLC.632', 'PSLC.630', 'PSLC.359'],\n      dtype='object', length=103)] are in the [columns]"
[Fri Jan 12 17:30:39 2024]
Error in rule log_ratios:
    jobid: 122
    input: /Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/my_qadabra/data/slc_prok.biom, results/environment_comparisons/ml/maaslin2/pctile_feats/pctile_15.tsv
    output: results/environment_comparisons/ml/maaslin2/log_ratios/log_ratios.pctile_15.tsv
    log: log/environment_comparisons/log_ratios.maaslin2.pctile_15.log (check log file(s) for error details)
    conda-env: /Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/my_qadabra/.snakemake/conda/398bc325f983272b15d35276018e95bd_

RuleException:
CalledProcessError in file /Users/aphillip/anaconda3/envs/qadabra_env/lib/python3.10/site-packages/qadabra/workflow/rules/ml.smk, line 55:
Command 'source /Users/aphillip/anaconda3/bin/activate '/Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/my_qadabra/.snakemake/conda/398bc325f983272b15d35276018e95bd_'; set -euo pipefail;  python /Users/aphillip/Library/CloudStorage/Desktop/Projects/counts_tables/differential_abundance_analysis/my_qadabra/.snakemake/scripts/tmppagdql8a.get_log_ratios.py' returned non-zero exit status 1.
  File "/Users/aphillip/anaconda3/envs/qadabra_env/lib/python3.10/site-packages/qadabra/workflow/rules/ml.smk", line 55, in __rule_log_ratios
  File "/Users/aphillip/anaconda3/envs/qadabra_env/lib/python3.10/concurrent/futures/thread.py", line 58, in run
[Fri Jan 12 17:30:40 2024]
Finished job 92.
57 of 152 steps (38%) done
[Fri Jan 12 17:32:03 2024]
Finished job 9.
58 of 152 steps (38%) done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2024-01-12T171542.544654.snakemake.log

Header of the log_ratios.pctile_20.tsv file:

feature_id	location	pctile	num_feats
PSLC.198	numerator	20	137
PSLC.643	numerator	20	137
PSLC.340	numerator	20	137
PSLC.357	numerator	20	137
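
On the question of which "columns" the KeyError refers to: pandas raises "None of [...] are in the [columns]" when a DataFrame is indexed with a list of labels and none of them match its column labels. A minimal diagnostic sketch, assuming you can get the table's row and column labels as plain lists, is shown below. The helper `locate_feature_ids` and the sample names are hypothetical and are not part of Qadabra.

```python
def locate_feature_ids(wanted_ids, column_labels, row_labels):
    """Report where the requested feature IDs occur in a table's labels."""
    wanted = set(wanted_ids)
    columns = set(column_labels)
    rows = set(row_labels)
    return {
        "in_columns": sorted(wanted & columns),
        "in_rows": sorted(wanted & rows),
        "missing": sorted(wanted - columns - rows),
    }


# Toy data (hypothetical): feature IDs sit on the rows, samples on the columns.
report = locate_feature_ids(
    ["PSLC.198", "PSLC.643"],
    column_labels=["sample_A", "sample_B"],
    row_labels=["PSLC.198", "PSLC.643", "PSLC.340"],
)
print(report)
```

If every ID lands under "in_rows", the table orientation may differ from what the script expects; if they land under "missing", the IDs were probably altered somewhere between the table and the feature list.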

Rework repeated K-fold CV

It is probably better to use something like StratifiedShuffleSplit. Otherwise, unbalanced datasets can cause problems when a split contains no samples of the positive ("truth") class.
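
To illustrate the idea, here is a minimal standard-library sketch of a stratified shuffle split. It is not Qadabra's code and not scikit-learn's implementation; the function name, the `test_fraction` parameter, and the toy labels are assumptions made for the example. Each class is shuffled and divided separately, so every test split keeps at least one sample of each class.

```python
import random
from collections import defaultdict


def stratified_shuffle_split(labels, n_splits=5, test_fraction=0.25, seed=0):
    """Yield (train_idx, test_idx) pairs that draw from every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    for _ in range(n_splits):
        train, test = [], []
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Send at least one sample of each class to the test split.
            n_test = max(1, round(len(shuffled) * test_fraction))
            test.extend(shuffled[:n_test])
            train.extend(shuffled[n_test:])
        yield sorted(train), sorted(test)


# Toy unbalanced dataset (hypothetical): 18 controls and 2 cases.
labels = ["control"] * 18 + ["case"] * 2
splits = list(stratified_shuffle_split(labels, n_splits=10))
for train_idx, test_idx in splits[:3]:
    print(len(train_idx), len(test_idx), sorted({labels[i] for i in test_idx}))
```

Note that a class with a single sample would end up only in the test split under this sketch, so very rare classes still need separate handling.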

Reading biom files in R causes an X to be added to column names that start with a number

Hey Gibs,

I tried running the workflow and ran into an error that I believe is caused by the way the biom tables are read in R. When a biom table is read into R, an X is prepended to any OTU/ASV ID that starts with a number. For example, if the original ASV was called:

"6d5af1db8934361889a6fb8d80c70836"

it will be changed in the R table to
"X6d5af1db8934361889a6fb8d80c70836"

This doesn't cause an issue when running the tools. However, I believe it causes downstream issues when the original biom file is read by the log_ratios Python script, which to my knowledge doesn't add an X character to column names that start with a number.

rule log_ratios:
input: ../data/Chemerin_ASVs_table.biom, results/ml/maaslin2/pctile_feats/pctile_20.tsv
output: results/ml/maaslin2/log_ratios/log_ratios.pctile_20.tsv
log: log/log_ratios.maaslin2.pctile_20.log
jobid: 139
reason: Missing output files: results/ml/maaslin2/log_ratios/log_ratios.pctile_20.tsv
wildcards: tool=maaslin2, pctile=20
resources: tmpdir=/tmp

It would be easiest to fix this by turning off the check.names option where the R scripts read the table. However, looking at the read_biom function, that doesn't seem to be an option.
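
A possible workaround on the Python side is to map the R-altered IDs back to the originals before looking them up in the biom table. The sketch below is a rough approximation of R's `make.names` rules (it ignores reserved words, non-ASCII letters, and duplicate handling). The helper names `r_make_name` and `restore_ids` are hypothetical and are not part of Qadabra.

```python
import re


def r_make_name(name):
    """Approximate how R's make.names rewrites a single name."""
    # Characters other than letters, digits, dot, and underscore become dots.
    cleaned = re.sub(r"[^A-Za-z0-9._]", ".", name)
    # Valid names start with a letter, or a dot not followed by a digit.
    if not re.match(r"([A-Za-z]|\.(?![0-9]))", cleaned):
        cleaned = "X" + cleaned
    return cleaned


def restore_ids(mangled_ids, original_ids):
    """Map IDs as written by R back to the IDs in the original table."""
    lookup = {r_make_name(orig): orig for orig in original_ids}
    # IDs with no known original are passed through unchanged.
    return [lookup.get(mangled, mangled) for mangled in mangled_ids]


original = ["6d5af1db8934361889a6fb8d80c70836", "Xylose"]
from_r = [r_make_name(feature_id) for feature_id in original]
print(from_r)
print(restore_ids(from_r, original))
```

Building the lookup from the original IDs, rather than simply stripping a leading X, avoids damaging IDs that legitimately start with X.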
