koszullab / metator Goto Github PK

View Code? Open in Web Editor NEW

20.0 8.0 10.0 129.87 MB

Metagenomic binning based on Hi-C data

Home Page: https://research.pasteur.fr/fr/software/meta3c-metahic/

License: GNU General Public License v3.0

Python 99.73% Dockerfile 0.27%

metagenomics hi-c 3c metagenomic-pipeline metagenome-assembly louvain-community-detection

metator's Introduction

MetaTOR

Metagenomic Tridimensional Organisation-based Reassembly - A set of scripts that streamlines the processing and binning of metagenomic metaHiC datasets.

MetaTOR

Installation

Requirements

Python 3.8 to 3.10 or later is required.
The following librairies are required but will be automatically installed with the pip installation: numpy, scipy, sklearn, pandas, docopt, networkx biopython pyfastx, pysam, micomplete and pairix.
The following software should be installed separately if you used the pip installation:
- bowtie2
- samtools
- louvain (original implementation).
- networkanalysis (not necessary only if you want to use Leiden algorithm to partition the network)

Using pip

pip3 install metator

or, to use the latest version:

pip3 install -e git+https://github.com/koszullab/metator.git@master#egg=metator

Using conda

conda create -n metator -y --log-level warning -f metator.yaml

Louvain or Leiden dependency

In order to use Louvain or Leiden it's necessary to set a global variable LOUVAIN_PATH and LEIDEN_PATH depending on which algorithm you wan to use with the absolute path where the executable are.

For Louvain algorithm in the directory where you have the archive file (available in the external directory of this repository):

YOUR_DIRECTORY=$(pwd)
tar -xvzf louvain-generic.tar.gz
cd gen-louvain
make
export LOUVAIN_PATH=$YOUR_DIRECTORY/gen-louvain/

For Leiden algorithm, clone the networkanalysis repository from github and build the Java script. Then you can export the Leiden path:

export LEIDEN_PATH=/networkanalysis_repository_path/build/libs/networkanalysis-1.2.0.jar

Using docker container

A dockerfile is also available if that is of interest. You may fetch the image by running the following:

docker pull koszullab/metator

Usage

metator {network|partition|validation|pipeline} [parameters]

A metaTOR command takes the form metator action --param1 arg1 --param2 arg2 #etc.

There are three actions/steps in the metaTOR pipeline, which must be run in the following order:

network : Generate metaHiC contigs network from fastq reads or bam files and normalize it.
partition : Perform the Louvain or Leiden community detection algorithm many times to bin contigs according to the metaHiC signal between contigs.
validation : Use CheckM to validate the bins, then do a recursive decontamination step to remove contamination.

There are a number of other, optional, miscellaneous actions:

pipeline : Run all three of the above actions sequentially or only some of them depending on the arguments given. This can take a while.
contactmap : Generates a contact map from one bin from the final ouptut of metaTOR.
version : display current version number.
help : display help message.

A tutorial is available here to explain how to use metaTOR. More advanced tutorials to analyze the output files are also available:

Anvio manual curation of the contaminated bins. Available here.
Visualization and scaffolding of the MAGs with the contactmap modules of MetaTOR. Available here.

Principle of MetaTOR pipeline:

Output files

The output files will be in the ouput directory given as parmater or in working directory if no paramater were given. Depending on the command used, different files will be in the ouptut:

Files/Commands	description	network	partition	validation	pipeline
alignment_N_for.bam	Bam file of the forward alignment	X			X
alignment_N_rev.bam	Bam file of the reverse alignment	X			X
alignment_N.pairs	Pairs of the merge alignment	X			X
network.txt	Normalized network of the metaHiC library	X			X
contig_data_network.txt	Information on contigs after network step	X
clustering_matrix_partition.txt	Matrix of clustering from the partition iterations		X
contig_data_partition.txt	Information on contigs after partition step		X
overlapping_checkm_results.txt	CheckM results summary file from the partition step			X	X
overlapping_checkm_taxonomy.txt	CheckM taxonomy file from the partition step			X	X
recursive_checkm_results.txt	CheckM results summary file from the recursive step			X	X
recursive_checkm_taxonomy.txt	CheckM taxonomy file from the recursive step			X	X
clustering_matrix_validation.txt	Matrix of clustering from the recursive iterations			X
clustering_matrix.txt	Matrix of clustering from the partition and recursive iterations				X
contig_data_final.txt	Information on contigs after whole pipeline			X	X
bin_summary.txt	Information on the final bins			X	X
binning.txt	File with contigs names and their final clustering			X	X
overlapping_bin	Directory with the fasta of the partition bins		X	X	X
recursive_bin	Directory with the fasta of the recursive bins			X	X
final_bin	Directory with the fasta of the final bins			X	X

Bam alignment files For the bam alignments files, only the aligned reads are kept and the bam are sorted by name. The N value correspond to the id (order of the given fastq started at 0)

Pairs aligment files This format is used to store the relevant information of mapping of the merged alignment. It's a s The N value correspond to the id (order of the given fastq started at 0). It is a tab-separated format holding informations about Hi-C pairs. It has an official specification defined by the 4D Nucleome data coordination and integration center. Here we kept 7 columns readID-chr1-pos1-chr2-pos2-strand1-strand2.

Network file This is a tsv file of the network with edgelist form: Id of the first contig, id of the second contig and the weigth of edge normalized.

Contig data files These are the files with all the informations from the contigs:

ID	Name	Size	GC_content	Hit	Shotgun_coverage	Restriction_site	Core_bin_ID	Core_bin_contigs	Core_bin_size	Overlapping_bin_ID	Overlapping_bin_contigs	Overlapping_bin_size	Recursive_bin_ID	Recursive_bin_contigs	Recursive_bin_size	Final_bin
1	NODE_1	642311	38.6876450815882	3837	41.1565	2006	1	65	2175226	1	396	6322353	1	52	2158803	MetaTOR_1_1
2	NODE_2	576356	30.235826468363303	1724	24.509	1256	2	40	1735419	2	401	735419	0	-	-	MetaTOR_2_0
3	NODE_3	540571	42.305266098255366	2188	14.5855	3405	3	127	6409484	3	431	13615480	1	112	6385126	MetaTOR_3_1

They have to have the header when they are use as input but the order of the columns nd if they are others columns doesn't matter when they are used as input files. Depending on which step of the pipeline have been launch they just have some of these columns:

contig_data_network.txt: columns: ID, Name, Size, GC content, Hit, Shotgun_coverage and Restriction Site only.
contig_data_partition.txt: The same as the previous with the information of core bins and overlapping bins.
contig_data_final.txt: All the columns.

The shotgun coverage will be filled only if the depth.txt file is given, otherwise it will be filled with -. This column is only necessarry for the abundance and the theoritical_hit normalization. The restriction will also be filled with - if no enzyme are given. This column is only necessary for the RS and the theoritical_hit normalization. Moreover, if the contig is not binned (no HiC reads mapped on it) all the columns with binning information will be filled with -, and if a bin is not recursively decontamined because it's smaller than the size threshold or it doesn't have any contamination the recusive bin information will be filled 0, -, -. Finally, if the bin is not in a final bin, it will be annotated ND in the last column (for not determined).

clustering matrix files The clustering matrix files are at the .npz format which is a compresed foramt for sparsed matrix. This sparsed matrix contains the ratio of time each pairs of contigs are clusterize together by the algorithm of clustering (either Louvain or Leiden). The partition matrix contains the information for the partition step, the recursive one for the recursive step and the general one is the mean of both matrices. Be careful the index of the contigs are zero-based and not one-based as in the contig data file.

It's possible to read them in python using the scipy.sparse.load_npz() function. If the users wants a tsv file instead, he or she could load the matrix in python using load_npz and make sure to transform the matrix in the scipy.sparse.coo_matrix function and used the function from metator metator.io.save_sparse_matrix to save it as tsv file.

CheckM results Files from checkM output. Two types of files one with the main results of checkM checkm_results.txt and one with the taxonomy checkm_taxonomy.txt for both the partition and the recurisve bins.

binning.txt file This is a tsv file with two columns: the contig name and the final were the contig is. It only contains contigs which are binned. It could be use a an input to import a binning results in anvio.

Bin summary file This is the summary of the data of the final bins build with all the step of metaTOR. The HiC coverage is the number of contacts (intra and inter contigs) per kilobase in the whole bin. The Shotgun coverage is the mean coverage normalized by the size of the shotgun reads from the depth file.

	lineage	completness	contamination	size	contigs	N50	longest_contig	GC	coding_density	taxonomy	Coverage
MetaTOR_8_1	o__Clostridiales	68.29	2.46	1431612	15	116129	291620	26.36	87.97	k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales	146.46719755483332
MetaTOR_8_2	o__Clostridiales	58.42	2.01	1396934	58	41290	174682	28.89	83.70	k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales	22.252416224710686
MetaTOR_8_3	o__Clostridiales	49.37	0.94	1420821	82	33095	89964	30.29	83.24	k__Bacteria;p__Firmicutes;c__Clostridia;o__Clostridiales;f__Peptostreptococcaceae_3;g__Clostridium_3	44.27369196532141

References

Metagenomic chromosome conformation capture (meta3C) unveils the diversity of chromosome organization in microorganisms, Martial Marbouty, Axel Cournac, Jean-François Flot, Hervé Marie-Nelly, Julien Mozziconacci, and Romain Koszul, eLife, 2014
Meta3C analysis of a mouse gut microbiome, Martial Marbouty, Lyam Baudry, Axel Cournac, Romain Koszul, 2015
Scaffolding bacterial genomes and probing host-virus interactions in gut microbiome by proximity ligation (chromosome capture) assay, Martial Marbouty, Lyam Baudry, Axel Cournac, and Romain Koszul, Science Advances, 2017

Contact

Authors

Research lab

Spatial Regulation of Genomes (Institut Pasteur, Paris)

metator's People

Contributors

Stargazers

Watchers

Forkers

derekroy baudrly hivlab anborgi martynalukaszewicz js2264 dyxstat emagallong cergare mmarbout

metator's Issues

.hmm file existence in the HMM_databases folder

Dear metaTOR Developers,

I hope all is well. We are trying to run the metaTOR and we have been running into the HMM databases issue. More specifically the HMM_databases.hmm file seems not to exist:

We know that the link to databases in metator.py has been broken as mentioned in issue #6 therefore we forked metaTOR to our github and updated the link in metator.py like seen in metator.sh:

Our current databases folder extracted from metaTOR after updating the link looks like this:

Should the .hmm file(s) be missing in the HMM_databases folder but built in the process of running metator? We used the command 'metator dependencies' to fetch dependencies. Could this be the issue of the 'hmmer' dependency? Or maybe the issue with how we install the forked metator from our github? HMM_databases folder ended up being in src path instead of lib path so we copied over the databases from src path to a brand new environment that had the original metator from koszullab installed but empty HMM_databased folder ~/metator0513/lib/python3.6/site-packages/metator/bin/HMM_databsases .

Allocation of external dependencies & extra suggestions

Here is a report of some difficulties / indications that could be taken into account. I apologize in advance if they seem too obvious:

To properly get louvain working, the user should move to the working directory (i.e. the folder with all meta3cbox dependencies) and create a folder named tools, where louvain should be downloaded and compiled manually:

mkdir tools
cd tools
wget https://sourceforge.net/projects/louvain/files/louvain-generic.tar.gz
tar -xzvf louvain-generic.tar.gz
# Important to change the default name of the folder, otherwise it will not be recognized in the 2nd step (partition)
mv louvain-generic louvain
cd louvain
make
cd ..
rm louvain-generic.tar.gz

A similar procedure should be followed to allocate the HMMs that will be used in the 3rd step (annotation). From the working directory:

wget dl.pasteur.fr/fop/5eHgTGww/modele_HMM.tar.gz
tar -xzvf modele_HMM.tar.gz
rm modele_HMM.tar.gz

I encountered a permission error when accessing the distance file. Maybe this was just my case, but a full access to this file should be granted (chmod) before running the 2nd step (partition)
I encountered some errors when trying to re-run some of the individual steps, having to delete some of the previously generated files/folders, such as config_current.sh, config.sh or output. I did not really understand what was happening here, maybe I was not doing something right.

And I think this is it. I hope this is useful!

Best,
Juanma

ValueError: too many values to unpack

Hi Lyam,

I'm trying to run metaTOR from a dedicated conda environment with all required dependencies installed.

metaTOR.sh align starts running but exit with a Python error at the network creation step:
any idea on how to fix that?

Python 3.6.6

Mapping reads...
70604753 reads; of these:
  70604753 (100.00%) were unpaired; of these:
    65283672 (92.46%) aligned 0 times
    4041938 (5.72%) aligned exactly 1 time
    1279143 (1.81%) aligned >1 times
7.54% overall alignment rate
70604753 reads; of these:
  70604753 (100.00%) were unpaired; of these:
    64590673 (91.48%) aligned 0 times
    4480279 (6.35%) aligned exactly 1 time
    1533801 (2.17%) aligned >1 times
8.52% overall alignment rate
[bam_sort_core] merging from 1 files and 1 in-memory blocks...
[bam_sort_core] merging from 1 files and 1 in-memory blocks...
Alignment done.
Merging bam files...
Done.
Generating network from alignments...
Traceback (most recent call last):
  File "/mibi/users/jnesme/Bin/metaTOR/network.py", line 821, in <module>
    main()
  File "/mibi/users/jnesme/Bin/metaTOR/network.py", line 810, in main
    my_assembly, = reference_file
ValueError: too many values to unpack (expected 1)

Pyton 3.6.6 and all dependencies installed:

Requirement already up-to-date: numpy in /mibi/users/jnesme/miniconda2/envs/metator_env/lib/python3.6/site-packages (from -r requirements.txt (line 1)) (1.15.3)
Requirement already up-to-date: scipy in /mibi/users/jnesme/miniconda2/envs/metator_env/lib/python3.6/site-packages (from -r requirements.txt (line 2)) (1.1.0)
Requirement already up-to-date: matplotlib in /mibi/users/jnesme/miniconda2/envs/metator_env/lib/python3.6/site-packages (from -r requirements.txt (line 3)) (3.0.0)
Requirement already up-to-date: biopython in /mibi/users/jnesme/miniconda2/envs/metator_env/lib/python3.6/site-packages (from -r requirements.txt (line 4)) (1.72)
Requirement already up-to-date: pysam in /mibi/users/jnesme/miniconda2/envs/metator_env/lib/python3.6/site-packages (from -r requirements.txt (line 5)) (0.15.1)
Requirement already up-to-date: seaborn in /mibi/users/jnesme/miniconda2/envs/metator_env/lib/python3.6/site-packages (from -r requirements.txt (line 6)) (0.9.0)
Requirement already satisfied, skipping upgrade: cycler>=0.10 in /mibi/users/jnesme/miniconda2/envs/metator_env/lib/python3.6/site-packages (from matplotlib->-r requirements.txt (line 3)) (0.10.0)
Requirement already satisfied, skipping upgrade: kiwisolver>=1.0.1 in /mibi/users/jnesme/miniconda2/envs/metator_env/lib/python3.6/site-packages (from matplotlib->-r requirements.txt (line 3)) (1.0.1)
Requirement already satisfied, skipping upgrade: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /mibi/users/jnesme/miniconda2/envs/metator_env/lib/python3.6/site-packages (from matplotlib->-r requirements.txt (line 3)) (2.2.2)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.1 in /mibi/users/jnesme/miniconda2/envs/metator_env/lib/python3.6/site-packages (from matplotlib->-r requirements.txt (line 3)) (2.7.3)
Requirement already satisfied, skipping upgrade: pandas>=0.15.2 in /usr/local/home/jnesme/.local/lib/python3.6/site-packages (from seaborn->-r requirements.txt (line 6)) (0.20.1)
Requirement already satisfied, skipping upgrade: six in /mibi/users/jnesme/miniconda2/envs/metator_env/lib/python3.6/site-packages (from cycler>=0.10->matplotlib->-r requirements.txt (line 3)) (1.11.0)
Requirement already satisfied, skipping upgrade: setuptools in /mibi/users/jnesme/miniconda2/envs/metator_env/lib/python3.6/site-packages (from kiwisolver>=1.0.1->matplotlib->-r requirements.txt (line 3)) (40.4.3)
Requirement already satisfied, skipping upgrade: pytz>=2011k in /mibi/users/jnesme/miniconda2/envs/metator_env/lib/python3.6/site-packages (from pandas>=0.15.2->seaborn->-r requirements.txt (line 6)) (2018.5)

HMM database

Hi,

I guess the HMM URL in metator.sh doesn't work. When I run ./metator.sh dependencies, I still cannot download such databases.

Best

Failure to produce bin fasta files

Dear developers, I got the following error at the very last step of the pipeline:

metator binning --n-bins 300
Drawing enrichment vs. size plots...
Drawing distribution violinplot...
Extracting bin subnetworks and matrices...
INFO :: Loading partition...
INFO :: Loading network...
Extracting bin FASTA files...
WARNING :: /home/guritsk1/miniconda2/envs/metator-env/lib/python3.7/site-packages/metator/bin/../scripts/bins.py:216: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
  zip(*np.genfromtxt(partition_file, usecols=(0, 1), dtype=None))

Traceback (most recent call last):
  File "/home/guritsk1/miniconda2/envs/metator-env/lib/python3.7/site-packages/metator/bin/../scripts/bins.py", line 448, in <module>
    main()
  File "/home/guritsk1/miniconda2/envs/metator-env/lib/python3.7/site-packages/metator/bin/../scripts/bins.py", line 439, in main
    chunk_size=chunk_size,
  File "/home/guritsk1/miniconda2/envs/metator-env/lib/python3.7/site-packages/metator/bin/../scripts/bins.py", line 231, in extract_fasta
    fields = name.split("_")
TypeError: a bytes-like object is required, not 'str'

Could this be a syntax issue, or is this specific to my data? The chunkname_core_size_300.txt looks ok:

k141_8_0  208074 1
k141_11_0  164531 1
k141_12_0  119734 1
k141_15_0  74519 1
k141_17_0  26 1479
k141_18_0  2 16318
k141_20_0  2 16318
k141_21_0  361 30
k141_24_0  17 2418
k141_25_0  864 16

By the way, what do the two columns in chunkname_core_size_300.txt represent? Which column is the cluster number? If I know this I can just make a script to split the bin fasta files myself.

Recursive Louvain Binning

Hi Developer,

I follow the instructions from your tutorial to do the recursive Louvain clustering as:

'''
metator partition --network-file $subnetwork_dir/contaminated_subnetwork.dat --partition-dir subnetwork_partition_folder
metator binning
'''

However, I find that it will still run 300 times Louvain clustering for each contaminated bin, which consumes a long time and a large space. I change the parameter '--iterations' in the partition action but it still ran 300 times.

Moreover, I find a round of recursive Louvain binning can already drop the contamination to a low level. So will you run the recursive Louvain binning multiple times?

Thanks!

KeyError: 'LOUVAIN_PATH'

Link to annotation database is broken

Dear developers,

It looks like the link http://dl.pasteur.fr/fop/gzLz1lG8/hmm_databases.tgz is broken, at least from my location (US). Is it possible to have it updated so I can download the HMM model? Alternatively, can you direct me to the file I need to download to place and extract into python3.7/site-packages/metator/bin/HMM_databases?

VisibleDeprecationWarning when running "bins.py"

Good morning!

I am currently running the fourth step of the meta3box pipeline, i.e. the binning process, with these parameters:
bash meta3c.sh binning --n-bins 100 --iter 300 -p metabox_project

However, I ran into some troubles which I think are related to the arguments associated to read-in functions from Biopython

I hope you can help me out here! The version of Biopython that I am currently using is 1.68
If you need further information, I will be happy to provide it.

Best,
Juanma

Here is the full error report:

Drawing enrichment vs. size plots...
Drawing distribution violinplot...
Extracting bin subnetworks and matrices...
Loading partition...
Loading network...
Extracting bin FASTA files...

/Bin/meta3Cbox/bins.py:155: VisibleDeprecationWarning: Reading unicode strings without specifying the encoding argument is deprecated. Set the encoding, use None for the system default.
dtype=None))
Traceback (most recent call last):
File "/Bin/meta3Cbox/bins.py", line 305, in
chunk_size=chunk_size)
File "/Bin/meta3Cbox/bins.py", line 177, in extract_fasta
sequence = str(genome[header_name][pos_start:pos_end])
File "/usr/local/home/anaconda3/envs/meta3cbox/lib/python2.7/site-packages/Bio/Seq.py", line 235, in getitem
return Seq(self._data[index], self.alphabet)
TypeError: slice indices must be integers or None or have an index method

Link to MetaTOR manual in Tutorial gives 404

Just would like to report a dead link in https://github.com/koszullab/metaTOR/blob/master/doc/TUTORIAL.md

metaTOR manual

redirects me to a 404

network.py TypeError: Argument must be string or unicode.

Hello,

I've been trying to implement this tool and cannot get past the first alignment segment.

If I run $ ./meta3c.sh align -1 fastq_r1 -2 fastq_r2 -a assembly -q 20 -c 10000 -s 10000 -p new_project
after merging bams
I get this error:

Traceback (most recent call last):
File "/mnt/nas3/cory/home_dir/metaTOR-master/network.py", line 749, in
parameters=parameters)
File "/mnt/nas3/cory/home_dir/metaTOR-master/network.py", line 97, in alignment_to_contacts
with pysam.AlignmentFile(sam_merged, "rb") as alignment_merged_handle:
File "calignmentfile.pyx", line 302, in pysam.calignmentfile.AlignmentFile.cinit (pysam/calignmentfile.c:4834)
File "calignmentfile.pyx", line 372, in pysam.calignmentfile.AlignmentFile._open (pysam/calignmentfile.c:5769)
File "calignmentfile.pyx", line 58, in pysam.calignmentfile._encodeFilename (pysam/calignmentfile.c:3527)
TypeError: Argument must be string or unicode.

Any thoughts? Thanks!