
aviary's Introduction


Aviary

An easy-to-use wrapper around a robust snakemake pipeline for metagenomic short-read, long-read, and hybrid assembly. Aviary also performs binning, annotation, and strain diversity analyses, and provides an easy way to quickly combine and dereplicate the results of many Aviary runs. The pipeline currently includes a series of distinct, yet flexible, modules that can seamlessly communicate with each other. Each module can be run independently or as a single pipeline depending on the provided input.

Please refer to the full docs here

Quick Installation

Ideally, your conda channels should be configured in this order:

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge

Your resulting .condarc file should look something like:

channels:
  - conda-forge
  - bioconda
  - defaults

Option 1: Install from Bioconda

Conda can handle the creation of the environment for you directly:

conda create -n aviary -c bioconda aviary

Or install into an existing environment:

conda install -c bioconda aviary

Option 2: Install from pip

Create the environment using the aviary.yml file, then install from pip:

conda env create -n aviary -f aviary.yml
conda activate aviary
pip install aviary-genome

Option 3: Install from source

Aviary's initial requirements can be installed using the aviary.yml file:

git clone https://github.com/rhysnewell/aviary.git
cd aviary
conda env create -n aviary -f aviary.yml
conda activate aviary
pip install -e .

The aviary executable can then be run from any directory. Since the code in this directory is what is executed, any updates made there are immediately available. We recommend this mode for developing and debugging aviary.

Checking installation

Whatever option you choose, running aviary --help should return the following output:

                    ......:::::: AVIARY ::::::......

           A comprehensive metagenomics bioinformatics pipeline

Metagenome assembly, binning, and annotation:
        assemble  - Perform hybrid assembly using short and long reads, 
                    or assembly using only short reads
        recover   - Recover MAGs from provided assembly using a variety 
                    of binning algorithms 
        annotate  - Annotate MAGs using EggNOG and GTDB-Tk
        genotype  - Perform strain diversity analysis of MAGs using Lorikeet
        complete  - Runs each stage of the pipeline: assemble, recover, 
                    annotate, genotype in that order.
        cluster   - Combines and dereplicates the MAGs from multiple Aviary runs
                    using Galah

Isolate assembly, binning, and annotation:
        isolate   - Perform isolate assembly **PARTIALLY COMPLETED**
        
Utility modules:
        configure - Set or overwrite the environment variables for future runs.

Databases

Aviary uses programs which require access to locally stored databases. These databases can be quite large; as such, we recommend setting up one instance of Aviary and its databases per machine or machine cluster.

The required databases are as follows: GTDB, EggNOG, the CheckM2 database, and the SingleM metapackage (see the environment variables below for download sources).

Installing databases

Aviary can handle the download and installation of these databases via the --download flag. Using --download will download and install the databases into the folders corresponding to their associated environment variables. Aviary will ask you to set these environment variables upon first run if they are not already set. Otherwise, users can use the aviary configure subcommand to reset the environment variables:

aviary configure -o logs/ --eggnog-db-path /shared/db/eggnog/ --gtdb-path /shared/db/gtdb/ --checkm2-db-path /shared/db/checkm2db/ --singlem-metapackage-path /shared/db/singlem/ --download

This command will check whether the databases exist at the given locations; if they don't, Aviary will download them and update the conda environment variables to match those paths.

N.B. Again, these databases are VERY large. Please talk to your sysadmin/bioinformatics specialist about setting up a shared location to install these databases to prevent unnecessary storage use. Additionally, the --download flag can be used within any aviary module to check that databases are configured properly.
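For example (a sketch only; file and directory names are placeholders), --download can be added to a normal run so the databases are checked, and fetched if missing, before the pipeline starts:

aviary recover --assembly assembly.fasta -1 sample.1.fq.gz -2 sample.2.fq.gz --output aviary_out --download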

Environment variables

Upon first running Aviary, you will be prompted to input the location for several database folders if they haven't already been provided. If at any point the location of these folders changes, you can use the aviary configure module to update the environment variables used by aviary.

These environment variables can also be configured manually by setting the following variables in your .bashrc file:

export GTDBTK_DATA_PATH=/path/to/gtdb/gtdb_release207/db/ # https://gtdb.ecogenomic.org/downloads
export EGGNOG_DATA_DIR=/path/to/eggnog-mapper/2.1.8/ # https://github.com/eggnogdb/eggnog-mapper/wiki/eggNOG-mapper-v2.1.5-to-v2.1.8#setup
export SINGLEM_METAPACKAGE_PATH=/path/to/singlem_metapackage.smpkg/
export CHECKM2DB=/path/to/checkm2db/
export CONDA_ENV_PATH=/path/to/conda/envs/

Workflow

Aviary workflow

Citations

If you use aviary, please be aware that you are using a great number of other programs that aviary wraps around. You should cite all of these tools as well, or at least those you know you are using. To make this easy, we provide an alphabetical list of citations, which will be updated as new modules are added to aviary.

A constantly updating list of citations can be found in the Citations document.

License

Code is GPL-3.0

aviary's People

Contributors

aroneys, julianzaugg, rhysnewell, sternp, wwood


aviary's Issues

Installation of GTDB-Tk environment: nothing provides boost >=1.70.0...needed by fastani-1.32-he1c1bb9_0

A student I am assisting had the following error when Aviary tried to create the GTDB-Tk environment.

Thoughts on what change to the yaml file is required to address this?

05/30/2022 09:26:33 AM INFO: Executing: snakemake --snakefile /srv/home/user/ace_software/aviary_0.3.3__200522/aviary/modules/Snakefile --directory /srv/projects4/mgII/20.user_plastic/05.aviary_metaspades/SD0970_S97 --jobs 60 --rerun-incomplete --configfile '/srv/projects4/mgII/20.user_plastic/05.aviary_metaspades/SD0970_S97/config.yaml' --nolock  --conda-frontend mamba --use-conda --conda-prefix /srv/home/user/.conda/envs/  recover_mags
Building DAG of jobs...
Creating conda environment /srv/home/user/ace_software/aviary_0.3.3__200522/aviary/modules/annotation/../../envs/gtdbtk.yaml...
Downloading and installing remote packages.
CreateCondaEnvironmentException:
Could not create conda environment from /srv/home/user/ace_software/aviary_0.3.3__200522/aviary/modules/annotation/../../envs/gtdbtk.yaml:
Command:
mamba env create --quiet --file "/srv/home/user/.conda/envs/e0a7141d5bfac1416ebc7128f8703c97.yaml" --prefix "/srv/home/user/.conda/envs/e0a7141d5bfac1416ebc7128f8703c97"
Output:
Encountered problems while solving:
  - nothing provides boost >=1.70.0,<1.70.1.0a0 needed by fastani-1.32-he1c1bb9_0

EDIT: This was using the version from 20/05/22.

install

Thanks for this nice tool! Do I understand correctly that before I can use this pipeline, I need to install each required tool on my system first? The current installation guide does not specify which tools we should install. Is that right?

TypeError in Aviary assemble with multiple short-reads

Aviary v0.5.0 coassembly error after spades finishes.

TypeError: '>' not supported between instances of 'TBDString' and 'int'

Simplified command:

aviary assemble \
  -1 sample_1.1.fq.gz sample_2.1.fq.gz \
  -2 sample_1.2.fq.gz sample_2.2.fq.gz \
  --output coassembly_4/assemble \
  -n 64 \
  -m 500
Thank you for using SPAdes!
[Wed Oct 12 03:00:51 2022]
Finished job 2.
1 of 3 steps (33%) done
Removing temporary output data/short_read_assembly.
Select jobs to execute...
Traceback (most recent call last):
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/__init__.py", line 730, in snakemake
    success = workflow.execute(
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/workflow.py", line 1074, in execute
    raise e
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/workflow.py", line 1070, in execute
    success = self.scheduler.schedule()
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/scheduler.py", line 492, in schedule
    run = self.job_selector(needrun)
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/scheduler.py", line 639, in job_selector_ilp
    return self.job_selector_greedy(jobs)
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/scheduler.py", line 825, in job_selector_greedy
    a = list(map(self.job_weight, jobs))  # resource usage of jobs
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/scheduler.py", line 899, in job_weight
    return [
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/scheduler.py", line 900, in <listcomp>
    self.calc_resource(name, res.get(name, 0)) for name in self.global_resources
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/lib/python3.10/site-packages/snakemake/scheduler.py", line 879, in calc_resource
    if value > gres:
TypeError: '>' not supported between instances of 'TBDString' and 'int'
10/12/2022 03:01:00 AM CRITICAL: Command 'snakemake --snakefile /mnt/hpccs01/work/microbiome/sw/aviary_repos/aviary-v0.5.0/aviary/aviary/modules/Snakefile --directory results/cockatoo/coassemble/20220929/palsa/coassemble/coassemble/coassembly_4/assemble --jobs 64 --rerun-incomplete --configfile 'results/cockatoo/coassemble/20220929/palsa/coassemble/coassemble/coassembly_4/assemble/config.yaml' --nolock  --conda-frontend mamba --default-resources "tmpdir='/data1/pbs.3020244.pbs'" --resources mem_mb=512000   --use-conda --conda-prefix /mnt/hpccs01/work/microbiome/conda  complete_assembly' returned non-zero exit status 1.

pigz = 2.4

Hey Rhys,

I went to install via conda as per the instructions but could not run mamba env create because "pigz 2.4 ... is excluded by strict repo priority". I fixed this by putting anaconda as the top-priority channel, but I doubt that is the right thing to do. Is there some reason pigz is pinned?

conda problem?

Hey,

Running into this, after creating a new conda env from scratch

(/home/woodcrob/e/binsnek-dev) lyra04:20201114:~/m/msingle-tmp-ben/mess/45_r95_on_wierd_L11/possible_finds/BinSnek$ snakemake --use-conda --cores 128 recover_mags
Building DAG of jobs...
Creating conda environment envs/maxbin2.yaml...
Downloading and installing remote packages.
CreateCondaEnvironmentException:
Could not create conda environment from /work2/microbiome/msingle-tmp-ben/mess/45_r95_on_wierd_L11/possible_finds/BinSnek/envs/maxbin2.yaml:
Collecting package metadata (repodata.json): ...working... failed

# >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<

    Traceback (most recent call last):
      File "/home/woodcrob/e/binsnek-dev/lib/python3.7/site-packages/conda/gateways/disk/update.py", line 107, in touch
        mkdir_p_sudo_safe(dirpath)
      File "/home/woodcrob/e/binsnek-dev/lib/python3.7/site-packages/conda/gateways/disk/__init__.py", line 84, in mkdir_p_sudo_safe
        os.mkdir(path)
    FileExistsError: [Errno 17] File exists: '/home/woodcrob/e/binsnek-dev/pkgs'

    During handling of the above exception, another exception occurred:
...

Any ideas?
Thanks.

Type error when using Aviary cluster

Different from Sam's type error, but you might be able to reproduce this one...

aviary cluster \
-t 24 \
-n 24 \
--ani 95 \
--precluster-ani 90 \
--max-contamination 10 \
--min-completeness 90 \
--use-checkm2-scores TRUE \
-i /work/microbiome/human/cognobiome/barlowk/swinburne/2022_july_26/output/BBL5911-PFH5-LFH5_binning/bins/final_bins \
/work/microbiome/human/cognobiome/barlowk/swinburne/2022_july_26/output/BBL4572-PFG9-LFG9_binning/bins/final_bins

gives

10/12/2022 02:04:18 PM INFO: Time - 14:04:18 12-10-2022
10/12/2022 02:04:18 PM INFO: Command - /mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/bin/aviary cluster -t 24 -n 24 --ani 95 --precluster-ani 90 --max-contamination 10 --min-completeness 90 --use-checkm2-scores TRUE -i /work/microbiome/human/cognobiome/barlowk/swinburne/2022_july_26/output/BBL5911-PFH5-LFH5_binning/bins/final_bins /work/microbiome/human/cognobiome/barlowk/swinburne/2022_july_26/output/BBL4572-PFG9-LFG9_binning/bins/final_bins /work/microbiome/human/cognobiome/barlowk/swinburne/2022_july_26/output/BBL4573-PFG9-LFG9_binning/bins/final_bins /work/microbiome/human/cognobiome/barlowk/swinburne/2022_july_26/output/BBL4496-PFG9-LFG9_binning/bins/final_bins /work/microbiome/human/cognobiome/barlowk/swinburne/2022_july_26/output/BBL4557-PFG9-LFG9_binning/bins/final_bins
10/12/2022 02:04:18 PM INFO: Version - 0.5.0
Traceback (most recent call last):
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.0/bin/aviary", line 33, in <module>
    sys.exit(load_entry_point('aviary-genome', 'console_scripts', 'aviary')())
  File "/mnt/hpccs01/work/microbiome/sw/aviary_repos/aviary-v0.5.0/aviary/aviary/aviary.py", line 1059, in main
    processor = Processor(args)
  File "/mnt/hpccs01/work/microbiome/sw/aviary_repos/aviary-v0.5.0/aviary/aviary/modules/processor.py", line 235, in __init__
    self.precluster_ani = fraction_to_percent(args.precluster_ani)
  File "/mnt/hpccs01/work/microbiome/sw/aviary_repos/aviary-v0.5.0/aviary/aviary/modules/processor.py", line 550, in fraction_to_percent
    if val <= 1:
TypeError: '<=' not supported between instances of 'str' and 'int'
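A likely minimal fix (a sketch only, not the project's actual code; the failing function is fraction_to_percent in aviary/modules/processor.py per the traceback) is to coerce the argparse value to a number before comparing, since argparse passes values through as strings unless a type is set:

def fraction_to_percent(val):
    # argparse hands '--precluster-ani 90' through as the string '90'
    val = float(val)
    if val <= 1:
        return val * 100
    return val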

coverm failure

(/home/woodcrob/e/binsnek-dev) cl5n012:20210108:~/m/wierdbin1/ben/57_sample_binning/2_binsnek$ binsnek recover --assembly ~/m/wierdbin1/coassembly/coassembly.spades/scaffolds.fasta --paired_reads_1 ~/m/wierdbin1/coassembly/[78]*R1.fastq.gz --paired_reads_2 ~/m/wierdbin1/coassembly/[78]*R2.fastq.gz --output . 
01/08/2021 01:27:46 PM INFO: Time - 13:27:46 08-01-2021
01/08/2021 01:27:46 PM INFO: Command - /home/woodcrob/e/binsnek-dev/bin/binsnek recover --assembly /home/woodcrob/m/wierdbin1/coassembly/coassembly.spades/scaffolds.fasta --paired_reads_1 /home/woodcrob/m/wierdbin1/coassembly/788.normal.R1.fastq.gz /home/woodcrob/m/wierdbin1/coassembly/794.normal.R1.fastq.gz /home/woodcrob/m/wierdbin1/coassembly/795.normal.R1.fastq.gz /home/woodcrob/m/wierdbin1/coassembly/805.normal.R1.fastq.gz --paired_reads_2 /home/woodcrob/m/wierdbin1/coassembly/788.normal.R2.fastq.gz /home/woodcrob/m/wierdbin1/coassembly/794.normal.R2.fastq.gz /home/woodcrob/m/wierdbin1/coassembly/795.normal.R2.fastq.gz /home/woodcrob/m/wierdbin1/coassembly/805.normal.R2.fastq.gz --output .
01/08/2021 01:27:46 PM INFO: Configuration file written to ./template_config.yaml
You may want to edit it using any text editor.
01/08/2021 01:27:46 PM INFO: Executing: snakemake --snakefile /home/woodcrob/git/BinSnek/binsnek/Snakefile --directory . --jobs 16 --rerun-incomplete --configfile './template_config.yaml' --nolock   --use-conda --conda-prefix ~/.conda/envs/   recover_mags   
Building DAG of jobs...
Creating conda environment /home/woodcrob/git/BinSnek/binsnek/envs/coverm.yaml...
Downloading and installing remote packages.
Environment for ../../../../../../home/woodcrob/git/BinSnek/binsnek/envs/coverm.yaml created (location: ../../../../../../home/woodcrob/.conda/envs/58a5c6e7)
Creating conda environment /home/woodcrob/git/BinSnek/binsnek/envs/concoct.yaml...
Downloading and installing remote packages.
Environment for ../../../../../../home/woodcrob/git/BinSnek/binsnek/envs/concoct.yaml created (location: ../../../../../../home/woodcrob/.conda/envs/935c2aae)
Creating conda environment /home/woodcrob/git/BinSnek/binsnek/envs/das_tool.yaml...
Downloading and installing remote packages.
Environment for ../../../../../../home/woodcrob/git/BinSnek/binsnek/envs/das_tool.yaml created (location: ../../../../../../home/woodcrob/.conda/envs/4759ad87)
Creating conda environment /home/woodcrob/git/BinSnek/binsnek/envs/checkm.yaml...
Downloading and installing remote packages.
Environment for ../../../../../../home/woodcrob/git/BinSnek/binsnek/envs/checkm.yaml created (location: ../../../../../../home/woodcrob/.conda/envs/2eef4795)
Creating conda environment /home/woodcrob/git/BinSnek/binsnek/envs/maxbin2.yaml...
Downloading and installing remote packages.
Environment for ../../../../../../home/woodcrob/git/BinSnek/binsnek/envs/maxbin2.yaml created (location: ../../../../../../home/woodcrob/.conda/envs/a9a788ef)
Creating conda environment /home/woodcrob/git/BinSnek/binsnek/envs/metabat2.yaml...
Downloading and installing remote packages.
Environment for ../../../../../../home/woodcrob/git/BinSnek/binsnek/envs/metabat2.yaml created (location: ../../../../../../home/woodcrob/.conda/envs/ce6ed2a2)
Creating conda environment /home/woodcrob/git/BinSnek/binsnek/envs/gtdbtk.yaml...
Downloading and installing remote packages.
Environment for ../../../../../../home/woodcrob/git/BinSnek/binsnek/envs/gtdbtk.yaml created (location: ../../../../../../home/woodcrob/.conda/envs/3a1e3607)
Using shell: /bin/bash
Provided cores: 16
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	checkm
	1	concoct_binning
	1	das_tool
	1	get_abundances
	1	get_bam_indices
	1	gtdbtk
	1	maxbin_binning
	1	metabat_binning_2
	1	prepare_binning_files
	1	recover_mags
	10
busco_folder does not point to a folder

[Fri Jan  8 13:53:11 2021]
rule prepare_binning_files:
    input: /home/woodcrob/m/wierdbin1/coassembly/coassembly.spades/scaffolds.fasta
    output: data/maxbin.cov.list, data/coverm.cov
    jobid: 8
    threads: 16

Activating conda environment: /home/woodcrob/.conda/envs/58a5c6e7
[2021-01-08T03:53:16Z INFO  coverm] CoverM version 0.6.0
[2021-01-08T03:53:16Z INFO  coverm] Setting single read percent identity threshold at 0.97 for MetaBAT adjusted coverage, and not filtering out supplementary, secondary and improper pair alignments
[2021-01-08T03:53:16Z INFO  coverm] Using min-covered-fraction 0%
[2021-01-08T03:53:16Z INFO  bird_tool_utils::external_command_checker] Found minimap2 version 2.17-r941 
[2021-01-08T03:53:17Z INFO  bird_tool_utils::external_command_checker] Found samtools version 1.9 
[2021-01-08T03:53:17Z INFO  coverm] Creating cache directory data/binning_bams/
[2021-01-08T03:53:17Z INFO  coverm::mapping_index_maintenance] Generating MINIMAP2_SR index for /home/woodcrob/m/wierdbin1/coassembly/coassembly.spades/scaffolds.fasta ..
[2021-01-08T03:53:45Z ERROR coverm::mapping_index_maintenance] Error when running MINIMAP2_SR index process.
[2021-01-08T03:53:45Z ERROR coverm::mapping_index_maintenance] The STDERR was: ""
[2021-01-08T03:53:45Z ERROR coverm::mapping_index_maintenance] Cannot continue after MINIMAP2_SR index failed.
Traceback (most recent call last):
  File "/work2/microbiome/wierdbin1/ben/57_sample_binning/2_binsnek/.snakemake/scripts/tmp3jxc4yev.get_coverage.py", line 103, in <module>
    for i in range(len(cov_list[0])):
IndexError: list index out of range
[Fri Jan  8 13:53:45 2021]
Error in rule prepare_binning_files:
    jobid: 8
    output: data/maxbin.cov.list, data/coverm.cov
    conda-env: /home/woodcrob/.conda/envs/58a5c6e7

RuleException:
CalledProcessError in line 94 of /home/woodcrob/git/BinSnek/binsnek/Snakefile:
Command 'source /home/woodcrob/e/woodcrob/bin/activate '/home/woodcrob/.conda/envs/58a5c6e7'; set -euo pipefail;  python /work2/microbiome/wierdbin1/ben/57_sample_binning/2_binsnek/.snakemake/scripts/tmp3jxc4yev.get_coverage.py' returned non-zero exit status 1.
  File "/home/woodcrob/git/BinSnek/binsnek/Snakefile", line 94, in __rule_prepare_binning_files
  File "/home/woodcrob/e/binsnek-dev/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Removing output files of failed job prepare_binning_files since they might be corrupted:
data/maxbin.cov.list, data/coverm.cov
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
An error occurred
Complete log: /work2/microbiome/wierdbin1/ben/57_sample_binning/2_binsnek/.snakemake/log/2021-01-08T132746.939540.snakemake.log

assembly folder name can be misleading

For a coassembly of short reads, the assembly is done with megahit, but the output folder is "spades_assembly".

One quick fix could be to rename it "short_read_assembly" instead.

Running from outside root dir

Activating conda environment: /home/woodcrob/.conda/envs/92870acd
python: can't open file 'scripts/write_vamb_bins.py': [Errno 2] No such file or directory
[Fri Jan 15 16:25:36 2021]
Error in rule vamb_make_bins:
    jobid: 9
    output: data/vamb_bins/done
    conda-env: /home/woodcrob/.conda/envs/92870acd
    shell:
        python scripts/write_vamb_bins.py --reference data/vamb_bams/renamed_contigs.fasta --clusters data/vamb_bins/clusters.tsv --output data/vamb_bins/; touch data/vamb_bins/done
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

checkm.out strange 0 completeness and marker genes for large bins

Hi Rhys,

I was successfully able to run aviary very easily on a test sample (it was much easier to install, lorikeet issue notwithstanding, and runs a lot smoother for me than atlas for what it is worth).

It's awesome and I love the output from rosella, so thanks for that too - the UMAPs make the bins very clear!

I am parsing the output files now and I noticed possible issues with the checkm.out output file. I have uploaded it. I have also uploaded the equivalent from atlas for the same sample. (fwiw, aviary found 3 extra small bins vs atlas, which is nice!)

Essentially, I think there must be some issue with checkm because I am getting 0 or near 0 completeness for bins that I am sure are quite complete (based off the atlas output).

9 of the bins are over 1MB, with 8 around 2MB, and also I have blasted large chunks of the contigs just to confirm that they are indeed the correct species/genera. But they all seem to have 0 completeness and 0 marker genes found in the checkm.out file, which seems very wrong to me. So I am thinking it is likely an issue with checkm.

checkm.out.txt

atlas_completeness.txt

George

aviary recover with short reads

Hi!
I wanted to use aviary to recover bins using the combined binning strategies, especially Rosella and vamb.
For the moment I only have short read libraries. When I try to run aviary recover without specifying --longreads I get the following error:

MissingInputException in line 10 of /tools/aviary/aviary/modules/quality_control/qc.smk:
Missing input files for rule link_reads:
    output: data/long_reads.fastq.gz
    affected files:
        none

Is it possible to run aviary on a "short-read" mode only?

Best,
Jeronimo

citations

For the future, it would be good to keep a citations section on GitHub and/or within the software so people have a list of the software used within it that they can easily cite. There might be quite a few, actually.

Is the list just the ones in the flow diagram figure shown in the readme?
Thanks.

Offline Installation

Hi Rhys,

Aviary looks like a fantastic tool (I watched Ben's Abacus talk, the improvement over atlas, which I have used for my dataset, looks amazing).

My question is: what is the best way to install aviary (namely, all the snakemake environments and databases) on an HPC where I cannot access the internet on the compute nodes, only on the login node?

George

singleM output. Filesystem latency?

Building DAG of jobs...

Using shell: /bin/bash
Provided cores: 100
Rules claiming more threads will be scaled down.
Job counts:
	count	jobs
	1	recover_mags
	1
busco_folder does not point to a folder

[Thu Feb  4 12:43:12 2021]
rule recover_mags:
    input: data/das_tool_bins/done, data/gtdbtk/done, data/checkm.out, data/coverm_abundances.tsv
    output: data/singlem_out/singlem_appraise.svg, data/done
    jobid: 0
    threads: 100

Activating conda environment: /home/sternesp/.conda/envs/d7c39ec7
[2021-02-04T02:43:14Z INFO  bird_tool_utils::clap_utils] coverm version 0.3.0
[2021-02-04T02:43:14Z INFO  bird_tool_utils::external_command_checker] Found fastANI version 1.32 
[2021-02-04T02:43:14Z INFO  galah::cluster_argument_parsing] Reading CheckM tab table ..
[2021-02-04T02:43:14Z INFO  galah::cluster_argument_parsing] Calculating num_contigs etc. for genome quality assessment ..
[2021-02-04T02:43:14Z INFO  galah::cluster_argument_parsing] Read in genome qualities for 52 genomes. 52 passed quality thresholds
[2021-02-04T02:43:14Z INFO  galah::cluster_argument_parsing] Creating output-representative-fasta-directory ..
[2021-02-04T02:43:14Z INFO  galah::cluster_argument_parsing] Clustering 52 genomes ..
[2021-02-04T02:43:14Z INFO  galah::finch] Sketching MinHash representations of each genome with finch ..
[2021-02-04T02:43:14Z INFO  galah::finch] Finished sketching genomes
[2021-02-04T02:43:14Z INFO  galah::clusterer] Preclustering ..
[2021-02-04T02:43:14Z INFO  galah::clusterer] Found 52 preclusters. The largest contained 1 genomes
[2021-02-04T02:43:14Z INFO  galah::clusterer] Finding representative genomes and assigning all genomes to these ..
[2021-02-04T02:43:14Z INFO  galah::cluster_argument_parsing] Found 52 genome clusters
[2021-02-04T02:43:14Z INFO  galah::cluster_argument_parsing] Finished printing genome clusters
Waiting at most 5 seconds for missing files.
MissingOutputException in line 460 of /home/sternesp/git/aviary/aviary/Snakefile:
Missing files after 5 seconds:
data/singlem_out/singlem_appraise.svg
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Removing output files of failed job recover_mags since they might be corrupted:
data/done
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
An error occurred
Complete log: /lustre/scratch/microbiome/sternesp/preeclampsia/aviary_output/M360w28/.snakemake/log/2021-02-04T124312.637218.snakemake.log
02/04/2021 12:43:19 PM CRITICAL: Command 'snakemake --snakefile /home/sternesp/git/aviary/aviary/Snakefile --directory aviary_output/M360w28 --jobs 100 --rerun-incomplete --configfile 'aviary_output/M360w28/template_config.yaml' --nolock   --use-conda --conda-prefix ~/.conda/envs/   recover_mags   ' returned non-zero exit status 1

It recommends retrying with --latency-wait when trying to output data/singlem_out/singlem_appraise.svg. However, I can't figure out which command accepts --latency-wait as a parameter.

Also, the busco_folder as specified in the config doesn't exist. Is busco required for this pipeline?

Thanks!

configure: Namespace error

This is with current main branch

(aviary_after_0.4.3_conda_env)cl5n008:20220828:~/m/red_sea/4_individual_assemblies$ aviary configure --gtdb-path ~/m/db/gtdb/gtdb_release207_v2/ --eggnog-db-path ~/m/db/eggnog-mapper/2.1.9 --checkm2-db-path ~/m/db/CheckM2_database/
08/28/2022 06:54:29 AM INFO: Time - 06:54:29 28-08-2022
08/28/2022 06:54:29 AM INFO: Command - /mnt/hpccs01/work/microbiome/red_sea/4_individual_assemblies/aviary_after_0.4.3_conda_env/bin/aviary configure --gtdb-path /home/woodcrob/m/db/gtdb/gtdb_release207_v2/ --eggnog-db-path /home/woodcrob/m/db/eggnog-mapper/2.1.9 --checkm2-db-path /home/woodcrob/m/db/CheckM2_database/
08/28/2022 06:54:29 AM INFO: Version - 0.4.3
Traceback (most recent call last):
  File "/mnt/hpccs01/work/microbiome/red_sea/4_individual_assemblies/aviary/aviary/modules/processor.py", line 161, in __init__
    self.short_percent_identity = args.short_percent_identity
AttributeError: 'Namespace' object has no attribute 'short_percent_identity'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/mnt/hpccs01/work/microbiome/red_sea/4_individual_assemblies/aviary_after_0.4.3_conda_env/bin/aviary", line 33, in <module>
    sys.exit(load_entry_point('aviary-genome', 'console_scripts', 'aviary')())
  File "/mnt/hpccs01/work/microbiome/red_sea/4_individual_assemblies/aviary/aviary/aviary.py", line 1037, in main
    processor = Processor(args)
  File "/mnt/hpccs01/work/microbiome/red_sea/4_individual_assemblies/aviary/aviary/modules/processor.py", line 176, in __init__
    logging.info(f"Exception {args.pe1} {args.pe2}")
AttributeError: 'Namespace' object has no attribute 'pe1'

misspecified input for rule gtdbtk

rule gtdbtk expects done input at data/final_bins/done:

rule gtdbtk:
    input:
        done_file = "data/final_bins/done",
        dereplicated_bin_folder = "bins/final_bins/"

However, after the latest updates, this should now be:

rule gtdbtk:
    input:
        done_file = "bins/final_bins/done",
        dereplicated_bin_folder = "bins/final_bins/"

Lorikeet Installation Fails 0.7.1

Hi Rhys,

I managed to install aviary and all Conda environments (and everything else has run nicely on a test sample!) other than lorikeet. I get the following error:

CreateCondaEnvironmentException:
Could not create conda environment from /hpcfs/users/a1667917/myconda/envs/aviary/lib/python3.10/site-packages/aviary/modules/strain_analysis/envs/lorikeet.yaml:
Command:
conda env create --quiet --file "/hpcfs/users/a1667917/myconda/envs/aviary/fcd3e55b2a2091296105e8c2f42ee964.yaml" --prefix "/hpcfs/users/a1667917/myconda/envs/aviary/fcd3e55b2a2091296105e8c2f42ee964"
Output:
Collecting package metadata (repodata.json): ...working... done
Solving environment: ...working... failed

ResolvePackageNotFound:

  • lorikeet-genome[version='>=0.7.1']

When I manually change lorikeet.yaml to 0.7.0, installation works, but then Lorikeet fails (probably why 0.7.1 is needed!).

George

Moving DAS Tool output may break other rules when binning

I have not tested it, but I noticed that galah_dereplicate will move the binning output from DAS Tool (data/das_tool_bins/das_tool_DASTool_bins) to bins/non_dereplicated_bins.

rule galah_dereplicate:
    input:
        checkm = 'data/checkm.out',
        das_tool = 'data/das_tool_bins/done'
    output:
        final_bins = temp('bins/final_bins/done')
    ...
    shell:
        "mv data/das_tool_bins/das_tool_DASTool_bins bins/non_dereplicated_bins; "
        "coverm cluster --precluster-method finch -t {threads} --checkm-tab-table {input.checkm} " \
        "--genome-fasta-directory bins/non_dereplicated_bins -x fa --output-representative-fasta-directory bins/final_bins --ani {params.derep_ani}; "
        "touch bins/final_bins/done"

However, GTDB-Tk and other rules depend on the bin files being in data/das_tool_bins/das_tool_DASTool_bins.

rule gtdbtk:
    input:
        done_file = "data/das_tool_bins/done"
    ...
    shell:
        "export GTDBTK_DATA_PATH={params.gtdbtk_folder} && "
        "gtdbtk classify_wf --cpus {threads} --pplacer_cpus {params.pplacer_threads} --extension fa "
        "--genome_dir data/das_tool_bins/das_tool_DASTool_bins --out_dir data/gtdbtk && touch data/gtdbtk/done"

Perhaps ensure DAS Tool is run first, bin files are moved, and downstream rules (galah_dereplicate, gtdbtk, checkm, etc.) all refer to the bins in bins/non_dereplicated_bins? This would also allow re-running of specific rules, as the expected directory structure would be maintained.
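A sketch of what the downstream rule could look like after such a change (adapted from the rules quoted above; untested, and the done-file path is assumed):

rule gtdbtk:
    input:
        # a done file written after the bins have been moved
        done_file = "bins/non_dereplicated_bins/done"
    ...
    shell:
        "export GTDBTK_DATA_PATH={params.gtdbtk_folder} && "
        "gtdbtk classify_wf --cpus {threads} --pplacer_cpus {params.pplacer_threads} --extension fa "
        "--genome_dir bins/non_dereplicated_bins --out_dir data/gtdbtk && touch data/gtdbtk/done"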

rules checkm_... assume bins have been found?

A researcher I am assisting had the following error thrown when running the latest aviary 0.3.3:

[Tue May 10 13:17:25 2022]
rule checkm_rosella:
    input: data/rosella_bins/done
    output: data/rosella_bins/checkm.out
    jobid: 17
    threads: 80
    resources: tmpdir=/tmp

Activating conda environment: ../../../../../../home/user/.conda/envs/7f69cd9282402138a5a52ef44587d7df
[2022-05-10 13:17:32] INFO: CheckM v1.1.3
[2022-05-10 13:17:32] INFO: checkm lineage_wf -t 80 --pplacer_threads 48 -x fna data/rosella_bins/ data/rosella_bins//checkm --tab_table -f data/rosella_bins/checkm.out
[2022-05-10 13:17:32] INFO: [CheckM - tree] Placing bins in reference genome tree.
[2022-05-10 13:17:32] ERROR: No bins found. Check the extension (-x) used to identify bins.

  Controlled exit resulting from an unrecoverable error or warning.
Waiting at most 5 seconds for missing files.
MissingOutputException in line 313 of /srv/home/user/temp/aviary/aviary/modules/binning/binning.smk:
Job 17 completed successfully, but some output files are missing. Missing files after 5 seconds:
data/rosella_bins/checkm.out
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: .snakemake/log/2022-05-10T131724.751376.snakemake.log
An error occurred
05/10/2022 01:17:38 PM CRITICAL: Command 'snakemake --snakefile /srv/home/user/temp/aviary/aviary/modules/Snakefile --directory /srv/projects3/IMOS/analysis/20220411_imos_db_binning/aviary_output_dirs/aviaryRecover_broomfield_1_singleSiteBinning --jobs 80 --rerun-incomplete --configfile '/srv/projects3/IMOS/analysis/20220411_imos_db_binning/aviary_output_dirs/aviaryRecover_broomfield_1_singleSiteBinning/config.yaml' --nolock  --conda-frontend mamba --use-conda --conda-prefix /srv/home/user/.conda/envs/  recover_mags' returned non-zero exit status 1.

Based on a quick look at the output/error and the code, it would appear that the rules checkm_rosella, checkm_metabat2, etc. assume there are bin files produced by their corresponding binning rules. However, CheckM will fail if no bin files are present, causing the pipeline to exit.

Is this correct?

lorikeet.smk snakemake typo

rule lorikeet:
input: bins/final_bins
output: strain_diversity
jobid: 0
reason: Missing output files: strain_diversity
threads: 32
resources: tmpdir=/hpcfs/users/a1667917/tmp

Activating conda environment: ../../../myconda/envs/aviary/fcd3e55b2a2091296105e8c2f42ee964
Activating conda environment: ../../../myconda/envs/aviary/fcd3e55b2a2091296105e8c2f42ee964
Traceback (most recent call last):
  File "/hpcfs/users/a1667917/Kevin/aviary/S1/.snakemake/scripts/tmpazygu3hf.run_lorikeet.py", line 29, in <module>
    f"-p {snakemake.params.parallel_genome} {short_reads} {long_reads} -o {snakemake.output.output_directory}", shell=True).wait()
AttributeError: 'Params' object has no attribute 'parallel_genome'
[Thu Jul 21 09:46:36 2022]
Error in rule lorikeet:
jobid: 0
output: strain_diversity
conda-env: /hpcfs/users/a1667917/myconda/envs/aviary/fcd3e55b2a2091296105e8c2f42ee964

RuleException:
CalledProcessError in line 71 of /hpcfs/users/a1667917/myconda/envs/aviary/lib/python3.10/site-packages/aviary/modules/strain_analysis/strain_analysis.smk:
Command 'source /hpcfs/users/a1667917/myconda/envs/aviary/bin/activate '/hpcfs/users/a1667917/myconda/envs/aviary/fcd3e55b2a2091296105e8c2f42ee964'; set -euo pipefail; python /hpcfs/users/a1667917/Kevin/aviary/S1/.snakemake/scripts/tmpazygu3hf.run_lorikeet.py' returned non-zero exit status 1.
File "/hpcfs/users/a1667917/myconda/envs/aviary/lib/python3.10/site-packages/aviary/modules/strain_analysis/strain_analysis.smk", line 71, in __rule_lorikeet
File "/hpcfs/users/a1667917/myconda/envs/aviary/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

And when I look at the Snakemake rule, the param is defined as "parallel_genomes"

When I fixed this, it solved the issue.
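That is, the failing f-string in run_lorikeet.py only needs the parameter name corrected to match the rule (based on the traceback above):

f"-p {snakemake.params.parallel_genomes} {short_reads} {long_reads} -o {snakemake.output.output_directory}", shell=True).wait()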

George

lorikeet error

Hi Rhys,

Sorry for the spam - I have another issue with lorikeet.

I am running in short-read-only mode.

(after fixing the typo with #56) I get the following error:

rule lorikeet:
input: bins/final_bins
output: strain_diversity
jobid: 0
reason: Missing output files: strain_diversity
threads: 32
resources: tmpdir=/hpcfs/users/a1667917/tmp

Activating conda environment: ../../../myconda/envs/aviary/fcd3e55b2a2091296105e8c2f42ee964
Activating conda environment: ../../../myconda/envs/aviary/fcd3e55b2a2091296105e8c2f42ee964
[2022-07-21T02:28:13Z INFO lorikeet] lorikeet version 0.7.1
[2022-07-21T02:28:13Z INFO lorikeet_genome] Using min-read-aligned-percent 0%
[2022-07-21T02:28:13Z INFO bird_tool_utils::clap_utils] Not using directory entry 'bins/final_bins/.snakemake_timestamp' as a genome FASTA file
[2022-07-21T02:28:14Z INFO lorikeet_genome::utils::utils] Not pre-generating minimap2 index
[2022-07-21T02:28:14Z WARN lorikeet_genome::utils::utils] Not using reference index...
[2022-07-21T02:28:14Z INFO lorikeet_genome::processing::lorikeet_engine] Processing short reads...
thread '<unnamed>' panicked at 'called Option::unwrap() on a None value', src/haplotype/haplotype_caller_engine.rs:305:14
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace
(the same panic is raised by several worker threads, with their output interleaved)

I also get the same error when I run lorikeet manually, e.g.:

lorikeet evolve -t 32 -d "/hpcfs/users/a1667917/Kevin/aviary/S1/bins/final_bins" -x "fna"
-p 8 -1 $FASTQ_DIR/Illumina_bacteria_fastp_R1.fastq.gz -2 $FASTQ_DIR/Illumina_bacteria_fastp_R2.fastq.gz -o lorikeet_test

I can confirm that "S1/bins/final_bins" contains symlinks that point to non-empty .fna bins, which seem fine.

George

License confusion

The license situation in this repo is confusing:

  • LICENSE (and thus Github) says BSD-3-Clause
  • README says GPL3
  • setup.py says GPL3

Would be great if this could be clarified.

Found while looking at the update PR in Bioconda (bioconda/bioconda-recipes#36233) where the software is marked GPL3.

Exporting environmental CHECKM2DB does not work?

I received the following error while running aviary recover (version 0.4.3), even after defining the environment variable CHECKM2DB:

Activating conda environment: ../../../../../home/user/.conda/envs/c794e5a97a89d3794674035f5793a0dc
Using CheckM2 database /srv/db/checkm2_data/0.1.2/uniref100.KO.1.dmnd
[08/15/2022 05:08:50 PM] INFO: Running quality prediction workflow with 60 threads.
[08/15/2022 05:08:50 PM] WARNING: Database not found using the environmental variable: CHECKM2DB. Please fix your $PATH. Using internal database path instead.
[08/15/2022 05:08:50 PM] ERROR: DIAMOND database not found. Please download database using <checkm2 database --download>
[Mon Aug 15 17:08:51 2022]
Error in rule checkm_rosella:
    jobid: 17
    output: data/rosella_bins/checkm2_out, data/rosella_bins/checkm.out
    conda-env: /srv/home/user/.conda/envs/c794e5a97a89d3794674035f5793a0dc
    shell:
        touch data/rosella_bins/checkm.out; export CHECKM2DB=/srv/db/checkm2_data/0.1.2/uniref100.KO.1.dmnd; echo "Using CheckM2 database $CHECKM2DB"; checkm2 predict -i data/rosella_bins// -x fna -o data/rosella_bins/checkm2_out -t 60 --force; cp data/rosella_bins/checkm2_out/quality_report.tsv data/rosella_bins/checkm.out
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job checkm_rosella since they might be corrupted:
data/rosella_bins/checkm2_out, data/rosella_bins/checkm.out
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

Setting the path with the --checkm2_db_path /srv/db/checkm2_data/0.1.3/CheckM2_database parameter does, however, work.

EmptyDataError in Aviary recover with long and short reads

Aviary v0.5.3 error in finalize_stats rule. 27/29 steps done, so I guess this is the last job and the other results are fine to use?

Simplified command (recovery from a long-read assembly using 20 short-read files and 2 long-read files):

aviary recover --assembly 719_E1_20-24.ccs.filter.fasta -1 MainAutochamber.201907_E_1_30to34.1.fq.gz ... -2 MainAutochamber.201907_E_1_30to34.2.fq.gz ... --longreads 719_E1_1-5.ccs.filter.fastq.gz 719_E1_20-24.ccs.filter.fastq.gz --longread-type ccs --output results/aviary/binning/long/20221013/719_E1_20-24.ccs.filter -n 64 -m 500

Error:

rule finalize_stats:
    input: bins/checkm.out, bins/checkm2_output/quality_report.tsv, data/coverm_abundances.tsv, data/gtdbtk/done
    output: bins/bin_info.tsv, bins/checkm_minimal.tsv
    jobid: 1
    reason: Missing output files: bins/bin_info.tsv; Input files updated by another job: data/coverm_abundances.tsv, bins/checkm2_output/quality_report.tsv, data/gtdbtk/done, bins/checkm.out
    resources: mem_mb=1000, disk_mb=1000, tmpdir=/data1/tmp

Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 64
Rules claiming more threads will be scaled down.
Provided resources: mem_mb=1000, disk_mb=1000
Select jobs to execute...
[Fri Oct 14 07:38:30 2022]
Error in rule finalize_stats:
    jobid: 0
    output: bins/bin_info.tsv, bins/checkm_minimal.tsv

RuleException:
EmptyDataError in line 715 of /mnt/hpccs01/work/microbiome/sw/aviary_repos/aviary-v0.5.3/aviary/aviary/modules/binning/binning.smk:
No columns to parse from file
  File "/mnt/hpccs01/work/microbiome/sw/aviary_repos/aviary-v0.5.3/aviary/aviary/modules/binning/binning.smk", line 715, in __rule_finalize_stats
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/util/_decorators.py", line 211, in wrapper
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/util/_decorators.py", line 317, in wrapper
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 950, in read_csv
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 605, in _read
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1442, in __init__
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/readers.py", line 1747, in _make_engine
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/site-packages/pandas/io/parsers/c_parser_wrapper.py", line 92, in __init__
  File "pandas/_libs/parsers.pyx", line 554, in pandas._libs.parsers.TextReader.__cinit__
  File "/mnt/hpccs01/work/microbiome/conda/envs/aviary-v0.5.3/lib/python3.10/concurrent/futures/thread.py", line 58, in run
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message

singlem

Hi Rhys,

The fix for #61 certainly worked - everything in the recover_mags module ran well on a re-run for me until the last 2 steps. All the files mentioned in #61 in the bins directory are created.

Alas, there seems to be an issue with singlem_appraise, specifically in the plot generation:

rule singlem_appraise:
input: data/singlem_out/metagenome.combined_otu_table.csv, data/gtdbtk/done, bins/checkm.out
output: data/singlem_out/singlem_appraise.svg
jobid: 30
reason: Missing output files: data/singlem_out/singlem_appraise.svg; Input files updated by another job: data/gtdbtk/done, bins/checkm.out, data/singlem_out/metagenome.combined_otu_table.csv
threads: 32
resources: tmpdir=/hpcfs/users/a1667917/tmp

Activating conda environment: ../../../myconda/envs/aviary/4c5875ddb0cf690b54611fae9c7a7e68
07/25/2022 04:17:09 PM INFO: Using as input 3 different sequence files e.g. bins/final_bins/concoct_bins.tsv.004.fna
07/25/2022 04:17:09 PM INFO: Searching with 14 SingleM package(s)
07/25/2022 04:17:09 PM INFO: Searching for reads matching 28 different protein HMM(s)
07/25/2022 04:17:11 PM INFO: Finished search phase
07/25/2022 04:17:11 PM INFO: Running separate alignments in GraftM..
07/25/2022 04:17:15 PM INFO: Finished extracting aligned sequences
07/25/2022 04:17:15 PM INFO: Running taxonomic assignment with GraftM..
07/25/2022 04:22:39 PM INFO: Finished running taxonomic assignment with GraftM
07/25/2022 04:22:39 PM INFO: Finished
07/25/2022 04:22:42 PM INFO: Using as input 1 different sequence files e.g. assembly/final_contigs.fasta
07/25/2022 04:22:42 PM INFO: Searching with 14 SingleM package(s)
07/25/2022 04:22:42 PM INFO: Searching for reads matching 28 different protein HMM(s)
07/25/2022 04:22:44 PM INFO: Finished search phase
07/25/2022 04:22:44 PM INFO: Running separate alignments in GraftM..
07/25/2022 04:22:48 PM INFO: Finished extracting aligned sequences
07/25/2022 04:22:48 PM INFO: Running taxonomic assignment with GraftM..
07/25/2022 04:28:11 PM INFO: Finished running taxonomic assignment with GraftM
07/25/2022 04:28:11 PM INFO: Finished
07/25/2022 04:28:14 PM INFO: Read in 29 markers from the different genomes
07/25/2022 04:28:14 PM INFO: After excluding duplicate markers that may indicate contamination, found 27 markers
07/25/2022 04:28:14 PM INFO: Read in 27 unique sequences from the 2 reference genomes
07/25/2022 04:28:14 PM INFO: Generating plot for marker: S1.12.ribosomal_protein_S12_S23
Traceback (most recent call last):
  File "/hpcfs/users/a1667917/myconda/envs/aviary/4c5875ddb0cf690b54611fae9c7a7e68/bin/singlem", line 744, in <module>
    output_svg=args.plot)
  File "/hpcfs/users/a1667917/myconda/envs/aviary/4c5875ddb0cf690b54611fae9c7a7e68/lib/python3.6/site-packages/singlem/appraisal_result.py", line 86, in plot
    doing_binning)
  File "/hpcfs/users/a1667917/myconda/envs/aviary/4c5875ddb0cf690b54611fae9c7a7e68/lib/python3.6/site-packages/singlem/appraisal_result.py", line 269, in _plot_gene
    self._plot_scale(axis, max_total_count)
  File "/hpcfs/users/a1667917/myconda/envs/aviary/4c5875ddb0cf690b54611fae9c7a7e68/lib/python3.6/site-packages/singlem/appraisal_result.py", line 352, in _plot_scale
    overlap = (ylim[1]-ylim[0]-sum(sides))/(len(sides)-1)
ZeroDivisionError: float division by zero
[Mon Jul 25 16:28:14 2022]
Error in rule singlem_appraise:
jobid: 30
output: data/singlem_out/singlem_appraise.svg
conda-env: /hpcfs/users/a1667917/myconda/envs/aviary/4c5875ddb0cf690b54611fae9c7a7e68
shell:
singlem pipe --threads 32 --sequences bins/final_bins/*.fna --otu_table data/singlem_out/genomes.otu_table.csv; singlem pipe --threads 32 --sequences assembly/final_contigs.fasta --otu_table data/singlem_out/assembly.otu_table.csv; singlem appraise --metagenome_otu_tables data/singlem_out/metagenome.combined_otu_table.csv --genome_otu_tables data/singlem_out/genomes.otu_table.csv --assembly_otu_table data/singlem_out/assembly.otu_table.csv --plot data/singlem_out/singlem_appraise.svg --output_binned_otu_table data/singlem_out/binned.otu_table.csv --output_unbinned_otu_table data/singlem_out/unbinned.otu_table.csv > data/singlem_out/singlem_appraisal.tsv
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

All the csvs seem to be created in the output (some are size 0).

George

Quick note: shebang issue in Fasta_to_Scaffolds2Bin on the DAS_tool step

The pipeline failed because the shebang in Fasta_to_Scaffolds2Bin.sh (#!/bin/env bash) doesn't work on my system, and may not work on others.

On Linux I believe env is normally in /usr/bin/

The following shebang worked for me:

#!/usr/bin/env bash

Just a quick note for future reference.

report version at startup

For the sake of reproducibility and to help bug reporting, it would be good to report the version at startup. Small thing.
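A minimal sketch of what this could look like (names assumed; aviary-genome is the package name used by pip install above):

import logging
from importlib.metadata import version

def main():
    # log the installed package version before anything else runs
    logging.info(f"aviary version {version('aviary-genome')}")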

Maxbin failing to bin contigs causes whole pipeline to fail

In rule maxbin_binning, if no bins can be found by MaxBin, the tool will fail and cause the whole pipeline to stop. Ideally this error should be handled so that, if it occurs, it is logged and the done file is still written to data/maxbin2_bins (see the sketch after the example log below).

Example error log from Maxbin:

MaxBin 2.2.7
Input contig: /my/input/file.fasta
Thread: 50
out header: data/maxbin2_bins/maxbin
Min contig length: 1500
Located abundance file [data/maxbin_cov/p0.cov]
Located abundance file [data/maxbin_cov/p1.cov]
Located abundance file [data/maxbin_cov/p2.cov]
Located abundance file [data/maxbin_cov/p3.cov]
Located abundance file [data/maxbin_cov/p4.cov]
Located abundance file [data/maxbin_cov/p5.cov]
Located abundance file [data/maxbin_cov/p6.cov]
Located abundance file [data/maxbin_cov/p7.cov]
Located abundance file [data/maxbin_cov/p8.cov]
Located abundance file [data/maxbin_cov/p9.cov]
Located abundance file [data/maxbin_cov/p10.cov]
Searching against 107 marker genes to find starting seed contigs for 
/my/input/file.fasta...
Try harder to dig out marker genes from contigs.
Marker gene search reveals that the dataset cannot be binned (the medium of marker gene number <= 1). Program stop.
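One way to handle this (a sketch only; the MaxBin invocation and input names here are assumed, not copied from aviary) is to let the shell command tolerate a MaxBin failure and still write the done file:

rule maxbin_binning:
    ...
    shell:
        "run_MaxBin.pl -contig {input.fasta} -abund_list {input.abundances} "
        "-thread {threads} -out data/maxbin2_bins/maxbin "
        "|| echo 'MaxBin could not bin the dataset' >> data/maxbin2_bins/maxbin.log; "
        "touch data/maxbin2_bins/done"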

Could not create conda environment

Hi
I tried to use aviary to do assembly but got the following errors:

07/16/2022 03:20:15 PM INFO: Version - 0.4.0
07/16/2022 03:20:15 PM INFO: Configuration file written to /scratch/project/alkane/Mengxiong/C3_anammox/aviary_assembly/config.yaml

07/16/2022 03:20:15 PM INFO: Executing: snakemake --snakefile /scratch/project_mnt/S0060/Mengxiong/aviary/aviary/modules/Snakefile --directory /scratch/project/alkane/Mengxiong/C3_anammox/aviary_assembly/ --jobs 16 --rerun-incomplete --configfile '/scratch/pr$
Building DAG of jobs...
Creating conda environment /scratch/project_mnt/S0060/Mengxiong/aviary/aviary/modules/binning/../../envs/coverm.yaml...
Downloading and installing remote packages.
CreateCondaEnvironmentException:
Could not create conda environment from /scratch/project_mnt/S0060/Mengxiong/aviary/aviary/modules/binning/../../envs/coverm.yaml:
Command:
mamba env create --quiet --file "/scratch/project/alkane/Mengxiong/41aae5c655da25559509d205af987a3a.yaml" --prefix "/scratch/project/alkane/Mengxiong/41aae5c655da25559509d205af987a3a"
Output:
/bin/bash: /home/uqmwu4/bin/mamba: /home/conda/feedstock_root/build_artifacts/mamba-split_1649138419879/_h_env_pl: bad interpreter: No such file or directory

07/16/2022 03:20:21 PM CRITICAL: Command 'snakemake --snakefile /scratch/project_mnt/S0060/Mengxiong/aviary/aviary/modules/Snakefile --directory /scratch/project/alkane/Mengxiong/C3_anammox/aviary_assembly/ --jobs 16 --rerun-incomplete --configfile '/scratch/$

What is the issue here? Thanks!

Working with large datasets and mapping reads: Missing SAM header / truncated file / no SQ lines in the header

In rule map_long_mega, when assembling, the following error is triggered when processing a large dataset:

[M::mm_idx_gen::92.453*1.77] collected minimizers
[M::mm_idx_gen::95.495*2.69] sorted minimizers
[WARNING] For a multi-part index, no @SQ lines will be outputted. Please use --split-prefix.
[M::main::95.495*2.69] loaded/built the index for 7078632 target sequence(s)
[M::mm_mapopt_update::98.280*2.64] mid_occ = 330
[M::mm_idx_stat] kmer size: 15; skip: 10; is_hpc: 0; #seq: 7078632
[M::mm_idx_stat::100.521*2.61] distinct minimizers: 131157591 (36.89% are singletons); average occurrences: 5.502; average spacing: 5.543; total length: 4000051290
[E::sam_parse1] missing SAM header
[W::sam_read1] Parse error at line 16
[main_samview] truncated file.
[Fri Jul  9 12:57:08 2021]
Error in rule map_long_mega:
    jobid: 5
    output: data/long_vs_mega.bam, data/long_vs_mega.bam.bai
    conda-env: /srv/home/uqjzaug1/.conda/envs/c707bdefb49c22a34af020f4f1ffb294
    shell:

        minimap2 -t 38 -ax map-ont -a data/spades_assembly.fasta data/long_reads.fastq.gz |  samtools view -@ 38 -b |
        samtools sort -@ 38 -o data/long_vs_mega.bam - &&         samtools index -@ 38 data/long_vs_mega.bam

        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

Removing output files of failed job map_long_mega since they might be corrupted:
data/long_vs_mega.bam
Shutting down, this might take some time.

Looks like it is the same issue raised in RasmussenLab/vamb#63

Changing:

rule map_long_mega:
        ...
        minimap2 -t {threads} -ax map-ont -a {input.fasta} {input.fastq} |  samtools view -@ {threads} -b |
        samtools sort -@ {threads} -o {output.bam} - && \
        samtools index -@ {threads} {output.bam}

...to

rule map_long_mega:
        ...
        minimap2 -I 64g -t {threads} -ax map-ont -a {input.fasta} {input.fastq} |  samtools view -@ {threads} -b |
        samtools sort -@ {threads} -o {output.bam} - && \
        samtools index -@ {threads} {output.bam}

appears to fix the issue (at least in my case).

Automatic handling of this error, or a config option to specify the -I value, would be helpful.

Biopython dependency

Very minor, but worth noting that vamb_make_bins depends on SeqIO from the Bio module. Some people may not have Biopython installed.
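The fix is simply to make Biopython available in the environment that runs vamb_make_bins, e.g.:

pip install biopython

or by adding biopython to the corresponding conda environment file.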

README suggestions

will eventually include and assembly stage and post binning analysis of MAGs

guess that is out of date now?

Also, it'd be helpful to link the environment variables to the websites where the databases can be downloaded, i.e. where the official instructions are. Right now it just says to point to a folder - it's not 100% clear what is supposed to be in those folders, especially EGGNOG.

Feature requests--flag MAGs with circular contigs and use CheckM output format 2 for final stats

Hey, just a friendly feature request.

Feature 1) would it be possible to get the final checkM output in the final_bins dir in checkM output format 2? That gives a whole bunch of useful genome stats, like bin size, # contigs, GC content, longest contig, N50, etc. Super useful stuff downstream. It shouldn't cost much extra time since checkM has already been run; it should just involve running "checkM qa --outfmt 2".

Feature 2) If you decide to output a final metadata sheet that collates taxonomy, checkM stats (hopefully includes the outfmt 2 stats), relative abundance, etc, it would be awesome to have a field that includes whether the genome includes a circular contig, what the contig name is, AND how long the circular contig is--all stuff that's in Flye's assembly_info.tsv file. I'm finding some MAGs contain circular contigs that make up the bulk of the MAG (awesome!), but about half of the MAGs containing circular contigs just have smaller circular contigs that are likely misbinned plasmids or viruses. Would be very informative to flag in final output for those with long read data.
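For reference, Feature 1 amounts to something like the following CheckM invocation (a sketch; the marker file and output paths are assumed):

checkm qa --out_format 2 --tab_table -f bins/checkm_extended.tsv data/checkm/lineage.ms data/checkm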

complete not supported yet?

(aviary-dev3)cl5n007:20220715:~/git/aviary$ aviary --help


                    ......:::::: AVIARY ::::::......

           A comprehensive metagenomics bioinformatics pipeline

Metagenome assembly, binning, and annotation:
        cluster   - Clusters samples based on OTU content using SingleM **TBC**
        assemble  - Perform hybrid assembly using short and long reads, 
                    or assembly using only short reads
        recover   - Recover MAGs from provided assembly using a variety 
                    of binning algorithms 
        annotate  - Annotate MAGs **TBC**
        genotype  - Perform strain level analysis of MAGs **TBC**
        complete  - Runs each stage of the pipeline: assemble, recover, 
                    annotate, genotype in that order.

Isolate assembly, binning, and annotation:
        isolate   - Perform isolate assembly **PARTIALLY COMPLETED**
        
Utility modules:
        configure - Set or overwrite the environment variables for future runs.


(aviary-dev3)cl5n007:20220715:~/git/aviary$ aviary complete --help
usage: aviary [--version] [--verbosity VERBOSITY] [--log LOG] {cluster,assemble,recover,annotate,genotype,viral,all,isolate,configure} ...
aviary: error: argument subparser_name: invalid choice: 'complete' (choose from 'cluster', 'assemble', 'recover', 'annotate', 'genotype', 'viral', 'all', 'isolate', 'configure')
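For what it's worth, the error suggests the help banner and the argparse subparsers have drifted apart: the banner advertises complete while the parser only knows all. A hedged sketch of how the two could be reconciled (subcommand names copied from the error message above; this is not aviary's actual parser code):

import argparse

parser = argparse.ArgumentParser(prog="aviary")
subparsers = parser.add_subparsers(dest="subparser_name")
for name in ("cluster", "assemble", "recover", "annotate",
             "genotype", "viral", "isolate", "configure"):
    subparsers.add_parser(name)
# Register 'complete' as advertised, keeping 'all' as an alias for older scripts.
subparsers.add_parser("complete", aliases=["all"])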

Multi-Sample Functionality

Hi Rhys,

I hope you've been well!

I have a suggestion for aviary (you've almost certainly thought of it before I'd guess!)

It would be great if there were a way to run aviary in multi-sample mode, with aviary cluster run on all the outputs once the pipeline has completed (e.g. with a CSV input specifying sample names and paths to read files).

I've written a little snakemake wrapper pipeline to do this with a parser function for my samples now, so I can re-purpose that if it would be helpful.

George
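To make the idea concrete, a minimal sketch of such a wrapper, assuming a samples.csv with sample, short_reads_1 and short_reads_2 columns; the aviary flags shown are assumptions and should be checked against the real CLI:

import csv

# Parse the sample sheet: sample name -> row of read paths.
SAMPLES = {row["sample"]: row for row in csv.DictReader(open("samples.csv"))}

rule all:
    input: "clustered/done"

rule recover_sample:
    output: touch("{sample}/done")
    params:
        r1 = lambda wc: SAMPLES[wc.sample]["short_reads_1"],
        r2 = lambda wc: SAMPLES[wc.sample]["short_reads_2"],
    shell:
        # Flags are illustrative; see 'aviary recover --help' for the real ones.
        "aviary recover -1 {params.r1} -2 {params.r2} --output {wildcards.sample}"

rule cluster_all:
    input: expand("{sample}/done", sample=SAMPLES)
    output: touch("clustered/done")
    shell:
        # Likewise illustrative; aviary cluster would take the per-sample output dirs.
        "aviary cluster --output clustered"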

finalize_stats and get_abundances does not seem to be in the DAG

Thought I'd make a new issue instead of continuing in #54.

Just to summarise: the full pipeline is not running to completion for me. I don't have 'bin_info.tsv', 'coverm_abundances.tsv', 'checkm_minimal.tsv' or 'done' in my output bins directory, only 'checkm.out' and a symlink to the final_bins directory, so when I run aviary cluster it does not work.

I had the same issue both before and after the CheckM2 update, so I don't think it is related to that either.

I have attached the log file. The only issue I see (prior to lorikeet) is that vamb and rosella run even though I specify "--skip-binners vamb rosella". Rosella completes with no error anyway; vamb errors, which is why I skip it.

Regardless, the rules finalise_stats and get_abundances aren't in the DAG to begin with, so I'm not sure the binner issue is what's causing this.

13748891_aviary.err.txt

Error when run aviary: ModuleNotFoundError: No module named 'aviary.config'

Hi there,
I just installed aviary following the instructions exactly, but an error occurred when I ran it: the 'aviary.config' module is missing, as you can see in the error message below. Please help. Thanks!

Kevin

#######################################################
Successfully installed aviary-genome-0.4.2 ruamel.yaml-0.17.21
(aviary1) [xxxxxx@login03 aviary]$
(aviary1) [xxxxxx@login03 aviary]$ aviary --help
Traceback (most recent call last):
  File "/home/xxxxxx/.local/bin/aviary", line 7, in <module>
    from aviary.aviary import main
  File "/home/xxxxxx/.local/lib/python3.8/site-packages/aviary/aviary.py", line 20, in <module>
    import aviary.config.config as Config
ModuleNotFoundError: No module named 'aviary.config'
#######################################################
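Not a confirmed diagnosis, but this pattern (the top-level package imports while a subpackage is missing) is commonly caused by packaging metadata that lists only the top-level package. Something like the following would pull in aviary.config as well (a sketch, not aviary's actual setup.py):

from setuptools import setup, find_packages

setup(
    name="aviary-genome",
    # find_packages() discovers aviary.config, aviary.modules, etc.
    # rather than shipping only the top-level 'aviary' package.
    packages=find_packages(),
)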

README update

An easy to use wrapper for a robust snakemake pipeline for metagenomic hybrid assembly

I think you should say

An easy to use wrapper for a robust snakemake pipeline for metagenomic short-read, long-read and hybrid assembly

Flight / skbio error. numpy.ndarray size changed, may indicate binary incompatibility

The following error is thrown when running Rosella/Flight on a fresh install. I suspect it is related to a known HDBSCAN + numpy binary-compatibility issue.

Also see https://stackoverflow.com/questions/66666380/issue-with-hdbscan-valueerror-numpy-ndarray-size-changed-may-indicate-binary for a possible solution.

EDIT: also see https://stackoverflow.com/questions/66060487/valueerror-numpy-ndarray-size-changed-may-indicate-binary-incompatibility-exp

2022-05-01T07:30:14Z ERROR bird_tool_utils::command] Error when running flight process. Exitstatus was : ExitStatus(unix_wait_status(256))
The STDERR was:
05/01/2022 05:30:09 PM INFO: Time - 17:30:09 01-05-2022
Traceback (most recent call last):
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/bin/flight", line 10, in <module>
    sys.exit(main())
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/lib/python3.9/site-packages/flight/flight.py", line 449, in main
    args.func(args)
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/lib/python3.9/site-packages/flight/flight.py", line 569, in refine
    rosella = rosella_engine_constructor(args)
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/lib/python3.9/site-packages/flight/flight.py", line 534, in rosella_engine_constructor
    from flight.rosella.rosella import Rosella
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/lib/python3.9/site-packages/flight/rosella/rosella.py", line 48, in <module>
    from flight.rosella.validating import Validator
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/lib/python3.9/site-packages/flight/rosella/validating.py", line 48, in <module>
    from flight.rosella.clustering import Clusterer, iterative_clustering_static, kmeans_cluster
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/lib/python3.9/site-packages/flight/rosella/clustering.py", line 56, in <module>
    from flight.rosella.binning import Binner
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/lib/python3.9/site-packages/flight/rosella/binning.py", line 43, in <module>
    import skbio.stats.composition
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/lib/python3.9/site-packages/skbio/__init__.py", line 11, in <module>
    import skbio.io  # noqa
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/lib/python3.9/site-packages/skbio/io/__init__.py", line 243, in <module>
    import_module('skbio.io.format.clustal')
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/lib/python3.9/importlib/__init__.py", line 127, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/lib/python3.9/site-packages/skbio/io/format/clustal.py", line 148, in <module>
    from skbio.alignment import TabularMSA
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/lib/python3.9/site-packages/skbio/alignment/__init__.py", line 204, in <module>
    from ._pairwise import (
  File "/srv/home/user/.conda/envs/5456b9737e12158ee6834b6c943f944c/lib/python3.9/site-packages/skbio/alignment/_pairwise.py", line 15, in <module>
    from skbio.alignment._ssw_wrapper import StripedSmithWaterman
  File "skbio/alignment/_ssw_wrapper.pyx", line 1, in init skbio.alignment._ssw_wrapper
ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
thread 'main' panicked at 'Failed to grab stdout from failed flight process', /home/conda/.cargo/registry/src/github.com-1ecc6299db9ec823/bird_tool_utils-0.3.0/src/command.rs:27:14
Error in rule checkm_rosella:
  jobid: 17
  output: data/rosella_bins/checkm.out
  conda-env: /srv/home/user/.conda/envs/b74c952d3cb03d84d232c6fd11bc410d
  shell:
    checkm lineage_wf -t 30 --pplacer_threads 30 -x fna data/rosella_bins/ data/rosella_bins//checkm --tab_table -f data/rosella_bins/checkm.out
    (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
An error occurred
Complete log: .snakemake/log/2022-05-01T150003.760975.snakemake.log
05/01/2022 05:30:22 PM CRITICAL: Command 'snakemake --snakefile /srv/home/user/temp/aviary/aviary/modules/Snakefile --directory /srv/projects/microbial_inducers/analysis/binning/20220428_aviary_recover/psin_15 --jobs 30 --rerun-incomplete --configfile /srv/projects/microbial_inducers/analysis/binning/20220428_aviary_recover/psin_15/config.yaml --nolock --conda-frontend mamba --use-conda --conda-prefix /srv/home/user/.conda/envs/ recover_mags' returned non-zero exit status 1.
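For reference, the linked answers come down to making the compiled extension and numpy agree again: either upgrade numpy in the environment, or rebuild the offending package from source so it compiles against the numpy that is present. A sketch of both (scikit-bio, not hdbscan, is the failing import in this particular traceback; treat these as suggestions from those threads, not a verified fix):

pip install --upgrade numpy

pip install --no-cache-dir --no-binary :all: scikit-bio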

pplacer_cpus parameter in GTDB-Tk

The --pplacer_cpus parameter doesn't seem to be recognised by GTDB-Tk v0.3.1:

export GTDBTK_DATA_PATH=/home/sternesp/microbiome/db/gtdbtk/release95 && gtdbtk classify_wf --cpus 3 --pplacer_cpus 3 --extension fa --genome_dir data/das_tool_bins/das_tool_DASTool_bins --out_dir data/gtdbtk && touch data/gtdbtk/done

gtdbtk: error: unrecognized arguments: --pplacer_cpus 3

Deleting --pplacer_cpus {threads} from the aviary Snakefile lets the pipeline run again.
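A hedged sketch of how the rule could guard the flag instead of dropping it outright; the version check and cutoff are assumptions (the flag simply does not exist in GTDB-Tk v0.3.1):

import re
import subprocess

def gtdbtk_supports_pplacer_cpus(min_version=(1, 0, 0)):
    # Parse the version out of 'gtdbtk --version'; the cutoff is an assumption.
    out = subprocess.run(["gtdbtk", "--version"],
                         capture_output=True, text=True).stdout
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", out)
    return bool(match) and tuple(map(int, match.groups())) >= min_version

# An empty string drops the flag entirely for older GTDB-Tk releases.
pplacer_opt = "--pplacer_cpus {threads}" if gtdbtk_supports_pplacer_cpus() else ""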

The /home/ directory quickly fills up, stopping aviary with an error at the CoverM step ("ERROR coverM::bamgenerator")

Hi there,

I'm running aviary on a server shared by many users, with a limited /home/ quota (400 GB). My aviary conda environment is installed in /home/MyAccount/.conda/envs/aviary/. When I run aviary, the /home/ directory quickly fills up and aviary stops with an error at the CoverM step ("ERROR coverM::bamgenerator"). Do you have a clue about this? Did I install or configure it correctly? The annoying part now is that /home/ is full and I don't know how to free the space, because I can't find the offending files in my folder or anyone else's, including hidden folders. Please help, thank you!

Kevin

$sudo df -h /home
Filesystem Size Used Avail Use% Mounted on
/dev/sdt1 413G 360G 33G 92% /

$sudo du -sh /home
208G /home
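Not an answer from the thread, but two places usually worth checking on a quota-limited /home: conda keeps environments and its package cache under ~/.conda by default, and temporary sorted BAMs typically land in TMPDIR. Both can be redirected to a larger filesystem, e.g. via .condarc (paths here are placeholders):

envs_dirs:
  - /scratch/MyAccount/conda/envs
pkgs_dirs:
  - /scratch/MyAccount/conda/pkgs

together with exporting TMPDIR to a scratch directory before running aviary.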
