uio-bmi / compairr Goto Github PK

Comparison of Adaptive Immune Receptor Repertoires

License: GNU Affero General Public License v3.0

Makefile 1.69% C++ 87.13% C 10.80% Shell 0.20% Dockerfile 0.18%

immune-repertoire airr bioinformatics immunology repertoire-analysis immunoinformatics rep-seq

compairr's Introduction

CompAIRR

CompAIRR (compairr) is a command line tool to compare two sets of adaptive immune receptor repertoires and compute their overlap. It can also identify which sequences are present in which repertoires. Furthermore, CompAIRR can cluster the sequences in a repertoire set. Sequence comparisons can be exact or approximate. CompAIRR has been shown to be very fast and to have a small memory footprint compared to similar tools, when up to 2 differences are allowed.

Installation

The code is C++11 standard compliant and should compile easily using make and a modern C++ compiler (e.g. GNU GCC or LLVM Clang). Run make clean, make, make test and make install in the main folder to clean, build, test and install the tool. There are no dependencies except for the C and C++ standard libraries.

Binaries for Linux (x86_64) and macOS (x86_64 and Arm64) are also distributed with each release.

A Dockerfile is included if you want to make a Docker image. A docker image may be built with the following command:

docker build -t compairr .

Ready-made Docker images for CompAIRR can be found on the Docker Hub.

CompAIRR can be installed on macOS using homebrew with brew install torognes/bioinf/compairr.

Tutorial

For an introduction to how to use CompAIRR, please have a look at the CompAIRR tutorial.

General options

Use the -h or --help option to show some help information.

Run the program with -v or --version for version information.

The type of operation that should be performed is specified with one of the options -m, -x, -c or -z (or the corresponding long option forms --matrix, --existence, --cluster, or --deduplicate).

The code is multi-threaded. The number of threads may be specified with the -t or --threads option.

The results will be written to standard out (stdout) unless a file name has been specified with the -o or --output-file option.

While the program is running it will print some status and progress information to standard error (stderr) unless a log file has been specified with the -l or --log option. Error messages and warnings will also be written here.

The default is to compare amino acid sequences, but nucleotide sequences are compared if the -n or --nucleotides option is given. The accepted amino acid symbols are ACDEFGHIKLMNPQRSTVWY, while the accepted nucleotide symbols are ACGTU. Lower case letters are also accepted. The program will abort with an error message if any other symbol is encountered in a sequence, unless one specifies the -u or --ignore-unknown option, in which case CompAIRR will simply ignore that sequence.

By default, the sequences should be given in the junction or junction_aa column of the input file, for nucleotide and amino acid sequences, respectively. Alternatively, the sequences may be present in the cdr3 or cdr3_aa column, if the --cdr3 option is given.

The user can specify how many differences are allowed when comparing sequences, using the option -d or --differences. To allow indels (insertions or deletions) the option -i or --indels may be specified, otherwise only substitutions are allowed. By default, no differences are allowed. The -i option is allowed only when d=1. The number of differences allowed strongly influences the speed of CompAIRR. The program will be slower as more differences are allowed. When d=0 or d=1 it is very fast, but it will be relatively slow with d=2 and even slower when d>2. See the section on performance below for an example.

The V and J gene alleles specified for each sequence must also match, unless the -g or --ignore-genes option is in effect.

Computing overlap between two repertoire sets

To compute the overlap between two repertoire sets, use the -m or --matrix option.

For each of the two repertoire sets there must an input file of tab-separated values formatted according to the AIRR standard for rearrangements. The two input files are specified on the command line without any preceding option letter. If only one filename is specified on the command line, or the same filename is specified twice, it is assumed that the set should be compared to itself. Each file must contain the repertoire ID and either the nucleotide or the amino acid sequence of the rearrangement. If the repertoire ID column is missing, all sequences are assumed to belong to the same repertoire (with ID 1 or 2, respectively, for the two sets). A sequence ID may also be included. Unless they should be ignored, the V gene, the J gene, and the duplicate count is also needed.

Each set can contain many repertoires and each repertoire can contain many sequences. The tool will find the sequences in the two sets that are similar and output a matrix with results.

CompAIRR assumes that all sequences within each repertoire are distinct, and that the abundance of each sequence is indicated in the duplicate_count field in the input file. Duplicated sequences, i.e. identical sequences (with the same V and J genes) within the same repertoire, may lead to unexpected results. CompAIRR will warn if it detects duplicates. Duplicates may be merged with the --deduplicate command.

The similar sequences of each repertoire in each set are found by comparing the sequences and their V and J genes. The duplicate count of each sequence is taken into account and a matrix is output containing a value for each combination of repertoires in the two sets. The value is usually the sum of the products of the duplicate counts of all pairs of sequences in the two repertoires that match. If the option -f or --ignore-counts is specified, the duplicate count information is ignored and all counts are treated as 1. Instead of summing the product of the counts, the ratio, min, max, or mean may be used if specified with the -s or --score option. The Morisita-Horn index or Jaccard index will be calculated if MH or Jaccard is specified with the -s option. These indices can only be computed when d=0.

The output will be a matrix of values in a tab-separated plain text file. Two different formats can be selected. In the default format, the first line contains the hash character (#) followed by the repertoire ID's from the second set. The following lines contains the repertoire ID from the first set, followed by the values corresponding to the comparison of this repertoire with each of the repertoires in the second set.

An alternative output format is used when the -a or --alternative option is specified. It will write the results in a three column format with the repertoire ID from set 1 and set 2 in the two first columns, respectively, and the value in the third column. There will be one line for each combination of repertoires in the sets. The very first line will contain a hash character (#) followed by the field names separated by tabs.

If the -p or --pairs option is specified, CompAIRR will write information about all pairs of matching sequences to a specified TSV file. Please note that such files may grow very large when there are many matches. Use of multithreading may be of little use in this case. The order of the lines in the file is unspecified. The following columns from both input files will be included in the output: repertoire_id, sequence_id, duplicate_count, v_call, j_call, and junction. The term junction will be replaced with junction_aa, cdr3, or cdr3_aa as appropriate. Additional columns from the input files may be copied to the pairs file using the -k or --keep-columns option. Multiple columns, separated by commas (but no spaces), may be given. A warning will be given if any of the specified columns are missing. In the header, columns from the first and second input file will be suffixed by _1 and _2, respectively. The distance between the sequences will be included if the --distance option is included. This is usually the Hamming distance (minimum number of substitutions), unless the --indel (or `-i´) option is specified, in which case the distance is the Levenshtein distance (minimum number of substitutions or indels).

Analysing in which repertoires a set of sequences are present

Use the option -x or --existence to analyse in which repertoires a set of sequences are present, and create a sequence presence matrix.

Two input files with repertoire sets in standard format must be specified on the command line. The first file should contain the different sequences to analyse. The sequence_id column must be present in this file. If the optional repertoire_id column is present, all those identifiers must be identical. The second file must contain the repertoires to match. The repertoire_id column must be present in the second file, otherwise the ID will be set to 2 for all sequences.

CompAIRR will identify in which repertoires each sequence is present and will output the results either as a matrix or as a three-column table (if the -a option is specified). The options -d, -i, -g, and -n (and the corresponding long option names --differences, --indels, --ignore-genes, and --nucleotides) will be taken into account when comparing sequences.

The output will be in a similar format as when computing the overlap (above), but the first column will contain the sequence_id from the first file instead of the repertoire_id.

The -p or --pairs option may be specified to output all pairs of matching sequences in the same way as for the overlap computation.

Clustering the sequences in a repertoire

To cluster the sequences in one repertoire, use the -c or --cluster option.

One input file in tab-separated format must be specified on the command line.

The tool will cluster the sequences using single linkage hierarchical clustering, according to the specified distance and indel options (-d, --distance, -i, --indels). The V and J gene alleles will be taken into account unless the -g or --ignore-genes option is specified. The options -n or --nucleotides indicate that the comparison should be performed with nucleotide sequences, not amino acid sequences. If the repertoire ID column is missing, all sequences are assumed to belong to the same repertoire (with ID 1).

The output will be in a similar TSV format as the input file, but preceded with two additional columns. The first column will contain a cluster number, starting at 1. The second column will contain the size of the cluster. The subsequent columns are repertoire_id, sequence_id, duplicate_count, v_call, j_call, and junction (or junction_aa, cdr3 or cdr3_aa, as appropriate).

The clusters are sorted by size, in descending order.

Deduplication

The --deduplicate command may be used to deduplicate a data set by merging entries in the same repertoire with identical sequences and identical V and J genes. This may be necessary to get correct results when computing overlaps between repertoires. Duplicates may be present for instance in cases were the data set contains both nucleotide and amino acid sequences from the same rearrangement, where the nucleotide sequences may be distinct while the amino acid sequences may not be, due to the degeneracy of the genetic code.

One input file in TSV format must be specified on the command line.

Strictly identical sequences in the same repertoire will be merged and their counts will be added together. If the -g or --ignore_genes option is specified, the V and J genes are ignored. The -n or --nucleotides option may be specified if the input is nucleotide sequences, otherwise amino acid sequences will be assumed. If the -f or --ignore_counts option is specified, the counts in the input file will be ignored, and just the number of identical sequences will be counted. If the repertoire ID column is missing, all sequences are assumed to belong to the same repertoire (with ID 1).

The output will be in a similar TSV format as the input file, with the following columns: repertoire_id, duplicate_count, v_call, j_call, and junction (or junction_aa, cdr3 or cdr3_aa, as appropriate). If the -g or --ignore_genes option is specified, the v_call and j_call columns will not be included.

Input files

The input files must be in tab-separated value (TSV) format accoring to the Rearrangement Schema of the AIRR standards 1.3 documentation.

The first line must contain the header. The rest of the file must contain one line per sequence. The following fields should be included:

repertoire_id: identifier of the repertoire
sequence_id: identifier of the sequence (optional except for for first file when using -x or --existence)
duplicate_count: number of identical copies of the same rearrangement (required unless -f option given)
v_call: V gene name with allele (required unless -g option given)
j_call: J gene name with allele (required unless -g option given)
junction: nucleotide sequence (required if -n option given and --cdr3 option not given)
junction_aa: amino acid sequence (single letter code) (required unless -n or --cdr3 options given)
cdr3: nucleotide sequence (required if both -n and --cdr3 options given)
cdr_aa: amino acid sequence (single letter code) (required if --cdr3 option given and -n option not given)

See below for an example. Other fields may be included, but will be ignored.

Command line option overview

The command line should look like this:

compairr OPTIONS TSVFILE1 [TSVFILE2]

Exactly one of the command options -m, -x or -c (or their long forms) must be specified. Other options as indicated in the table below could also be included. With the -m and -x command options, the names of two tab-separated value files with repertoires must also be specified on the command line, with the -c command option, only one such file should be specified.

Short	Long	Argument	Default	Description
`-a`	`--alternative`			Output results in three-column format, not matrix
	`--cdr3`			Use the `cdr3` or `cdr3_aa` column instead of `junction` or `junction_aa`
`-c`	`--cluster`			Cluster sequences in one repertoire
`-d`	`--differences`	INTEGER	0	Number of differences accepted
	`--distance`			Include sequence distance in pairs file
`-f`	`--ignore-counts`			Ignore duplicate count information
`-g`	`--ignore-genes`			Ignore V and J gene information
`-h`	`--help`			Display help text and exit
`-i`	`--indels`			Allow insertions or deletions
`-k`	`--keep-columns`	STRING		Copy given comma-separated columns to pairs file
`-l`	`--log`	FILENAME	(stderr)	Log to specified file instead of stderr
`-m`	`--matrix`			Compute overlap matrix between two sets
`-n`	`--nucleotides`			Compare nucleotides, not amino acids
`-o`	`--output`	FILENAME	(stdout)	Output results to specified file instead of stdout
`-p`	`--pairs`	FILENAME	(none)	Output matching pairs to specified file
`-s`	`--score`	STRING	product	Sum `product`, `ratio`, `min`, `max`, or `mean`; or compute `MH` or `Jaccard` index
`-t`	`--threads`	INTEGER	1	Number of threads to use (1-256)
`-u`	`--ignore-unknown`			Ignore sequences including unknown residue symbols
`-v`	`--version`			Display version information
`-x`	`--existence`			Check existence of sequences in repertoires
`-z`	`--deduplicate`			Deduplicate sequences

Example 1: Repertoire overlap

In this example we will compute the overlap of two repertoire sets.

Let's use two simple input files. The first is seta.tsv:

repertoire_id	sequence_id	duplicate_count	v_call	j_call	junction	junction_aa	sequence	rev_comp	productive	d_call	sequence_alignment	germline_alignment	v_cigar	d_cigar	j_cigar
A1	R	1	TCRBV07-06	TCRBJ02-01	tgcgcgagcagcaccagccatgaacagtatttt	CASSTSHEQYF									
A2	S	3	TCRBV07-09	TCRBJ01-02	tgcgcgagcagcctgcgcgtgggcggctatggctataccttt	CASSLRVGGYGYTF

The second is setb.tsv:

repertoire_id	sequence_id	duplicate_count	v_call	j_call	junction	junction_aa	sequence	rev_comp	productive	d_call	sequence_alignment	germline_alignment	v_cigar	d_cigar	j_cigar
B1	T	5	TCRBV07-09	TCRBJ01-02	tgcgcgagcagcctgcgcgtgggcggctatggctataccttt	CASSLRVGGYGYTF									
B1	U	10	TCRBV07-09	TCRBJ01-02	tgcgcgagcagcctgcgcgtgggcggctttggctataccttt	CASSLRVGGFGYTF									
B2	V	7	TCRBV07-06	TCRBJ02-01	tgcgcgagcagcaccagccatcagcagtatttt	CASSTSHQQYF

We run the following command:

compairr -m seta.tsv setb.tsv -d 1 -o output.tsv -p pairs.tsv

Here is the output to the console:

CompAIRR 1.7.0 - Comparison of Adaptive Immune Receptor Repertoires
https://github.com/uio-bmi/compairr

Start time:        Thu Mar 03 12:29:32 CET 2022
Command (m/c/x):   Overlap (-m)
Repertoire set 1:  seta.tsv
Repertoire set 2:  setb.tsv
Nucleotides (n):   No
Differences (d):   1
Indels (i):        No
Ignore counts (f): No
Ignore genes (g):  No
Ign. unknown (u):  No
Threads (t):       1
Output file (o):   output.tsv
Output format (a): Matrix
Score (s):         Sum of products of counts
Pairs file (p):    pairs.tsv
Log file (l):      (stderr)

Immune receptor repertoire set 1

Reading sequences: 100% (0s)
Repertoires:       2
Sequences:         2
Residues:          25
Shortest:          11
Longest:           14
Average length:    12.5
Total dupl. count: 4
Indexing:          100% (0s)

Repertoires in set:
# Sequences Count Repertoire ID
1         1     1 A1
2         1     3 A2

Immune receptor repertoire set 2

Reading sequences: 100% (0s)
Repertoires:       2
Sequences:         3
Residues:          39
Shortest:          11
Longest:           14
Average length:    13.0
Total dupl. count: 22
Indexing:          100% (0s)

Repertoires in set:
# Sequences Count Repertoire ID
1         2    15 B1
2         1     7 B2

Unique V genes:    2
Unique J genes:    2
Computing hashes:  100% (0s)
Computing hashes:  100% (0s)
Hashing sequences: 100% (0s)
Analysing:         100% (0s)
Writing results:   100% (0s)

End time:          Thu Mar 03 12:29:32 CET 2022

Repertoires will be sorted alphabetically by ID. The program gives some statistics on the input files after reading them.

Here is the result in the output.tsv file:

#	B1	B2
A1	0	7
A2	45	0

And here is the result in the pairs.tsv file:

#repertoire_id_1	sequence_id_1	duplicate_count_1	v_call_1	j_call_1	junction_aa_1	repertoire_id_2	sequence_id_2	duplicate_count_2	v_call_2	j_call_2	junction_aa_2
A1	R	1	TCRBV07-06	TCRBJ02-01	CASSTSHEQYF	B2	V	7	TCRBV07-06	TCRBJ02-01	CASSTSHQQYF
A2	S	3	TCRBV07-09	TCRBJ01-02	CASSLRVGGYGYTF	B1	T	5	TCRBV07-09	TCRBJ01-02	CASSLRVGGYGYTF
A2	S	3	TCRBV07-09	TCRBJ01-02	CASSLRVGGYGYTF	B1	U	10	TCRBV07-09	TCRBJ01-02	CASSLRVGGFGYTF

Here, sequence R in repertoire A1 is similar to sequence V in repertoire B2. The only difference is the E and Q in the 8th position. The gene allele names are also the same. They have duplicate counts of 1 and 7, respectively. The product is 7. That value is found in the third column on the second line in the main output file.

Sequence S in repertoire A2 with duplicate count 3 is similar to both sequence T and U in repertoire B1, with duplicate counts of 5 and 10, respectively. Sequence T in B1 is identical, while sequence U in B1 has an F instead of a Y in the 10th position. The result is 3 * (5 + 10) = 3 * 15 = 45. That value is found in the second column on the third line of the main output file.

Since there are no sequences from repertoire A1 similar to B1 or from A2 similar to B1, the other values in the matrix are zero.

This small dataset is included in the test folder and the tool can automatically be tested by running make test.

Example 2: Sequence existence

In this example we will use the -x or --existence option to find out in which repertoires a set of sequences are present.

The file setc.tsv contains the sequences that we will analyse:

repertoire_id	sequence_id	duplicate_count	v_call	j_call	junction	junction_aa	sequence	rev_comp	productive	d_call	sequence_alignment	germline_alignment	v_cigar	d_cigar	j_cigar
C	X	1	TCRBV07-09	TCRBJ01-02	tgcgcgagcagcctgcgcgtgggcggctttggctataccttt	CASSLRVGGFGYTF									
C	Y	1	TCRBV07-06	TCRBJ02-01	tgcgcgagcagcaccagccatcagcagtatttt	CASSTSHQQYF

The file above is included in the folder test in the distribution.

We will compare it to repertoire sets in the file setb.tsv described earlier.

We run the following command:

compairr -x setc.tsv setb.tsv -d 1 -f -o output.tsv -p pairs.tsv

Here is the output to the console:

CompAIRR 1.7.0 - Comparison of Adaptive Immune Receptor Repertoires
https://github.com/uio-bmi/compairr

Start time:        Thu Mar 03 12:31:16 CET 2022
Command (m/c/x):   Existence (-x)
Repertoire:        setc.tsv
Repertoire set:    setb.tsv
Nucleotides (n):   No
Differences (d):   1
Indels (i):        No
Ignore counts (f): Yes
Ignore genes (g):  No
Ign. unknown (u):  No
Threads (t):       1
Output file (o):   output.tsv
Output format (a): Matrix
Score (s):         Sum of products of counts
Pairs file (p):    pairs.tsv
Log file (l):      (stderr)

Immune receptor repertoire set 1

Reading sequences: 100% (0s)
Repertoires:       1
Sequences:         2
Residues:          25
Shortest:          11
Longest:           14
Average length:    12.5
Total dupl. count: 2
Indexing:          100% (0s)

Repertoires in set:
# Sequences Count Repertoire ID
1         2     2 C

Immune receptor repertoire set 2

Reading sequences: 100% (0s)
Repertoires:       2
Sequences:         3
Residues:          39
Shortest:          11
Longest:           14
Average length:    13.0
Total dupl. count: 22
Indexing:          100% (0s)

Repertoires in set:
# Sequences Count Repertoire ID
1         2    15 B1
2         1     7 B2

Unique V genes:    2
Unique J genes:    2
Computing hashes:  100% (0s)
Computing hashes:  100% (0s)
Hashing sequences: 100% (0s)
Analysing:         100% (0s)
Writing results:   100% (0s)

End time:          Thu Mar 03 12:31:16 CET 2022

Here is the result in the output.tsv file:

#	B1	B2
X	2	0
Y	0	1

Please note that the -f option was used to ignore the duplicate counts.

And here is the result in the pairs.tsv file:

#repertoire_id_1	sequence_id_1	duplicate_count_1	v_call_1	j_call_1	junction_aa_1	repertoire_id_2	sequence_id_2	duplicate_count_2	v_call_2	j_call_2	junction_aa_2
C	X	1	TCRBV07-09	TCRBJ01-02	CASSLRVGGFGYTF	B1	U	10	TCRBV07-09	TCRBJ01-02	CASSLRVGGFGYTF
C	X	1	TCRBV07-09	TCRBJ01-02	CASSLRVGGFGYTF	B1	T	5	TCRBV07-09	TCRBJ01-02	CASSLRVGGYGYTF
C	Y	1	TCRBV07-06	TCRBJ02-01	CASSTSHQQYF	B2	V	7	TCRBV07-06	TCRBJ02-01	CASSTSHQQYF

The results indicate that sequence X was found (twice) in repertoire B1 (matching sequences T and U) and that sequence Y was found in repertoire B2 (matching sequence V).

Example 3: Clustering sequences

This time we will cluster the nucleotide sequences in the file setb.tsv using the -c or --cluster option.

The command line to run is:

compairr -c setb.tsv -d 1 -n -o output.tsv

The output during the clustering is as follows:

CompAIRR 1.7.0 - Comparison of Adaptive Immune Receptor Repertoires
https://github.com/uio-bmi/compairr

Start time:        Thu Mar 03 12:33:05 CET 2022
Command (m/c/x):   Cluster (-c)
Repertoire:        setb.tsv
Nucleotides (n):   Yes
Differences (d):   1
Indels (i):        No
Ignore counts (f): No
Ignore genes (g):  No
Ign. unknown (u):  No
Threads (t):       1
Output file (o):   output.tsv
Log file (l):      (stderr)

Immune receptor repertoire clustering

Reading sequences: 100% (0s)
Repertoires:       2
Sequences:         3
Residues:          117
Shortest:          33
Longest:           42
Average length:    39.0
Total dupl. count: 22
Indexing:          100% (0s)

Unique V genes:    2
Unique J genes:    2

Computing hashes:  100% (0s)
Hashing sequences: 100% (0s)
Building network:  100% (0s)
Clustering:        100% (0s)
Sorting clusters:  100% (0s)
Writing clusters:  100% (0s)

Clusters:          2
End time:          Thu Mar 03 12:33:05 CET 2022

The result in the file output.tsv looks like this:

#cluster_no	cluster_size	repertoire_id	sequence_id	duplicate_count	v_call	j_call	junction
1	2	B1	T	5	TCRBV07-09	TCRBJ01-02	tgcgcgagcagcctgcgcgtgggcggctatggctataccttt
1	2	B1	U	10	TCRBV07-09	TCRBJ01-02	tgcgcgagcagcctgcgcgtgggcggctttggctataccttt
2	1	B2	V	7	TCRBV07-06	TCRBJ02-01	tgcgcgagcagcaccagccatcagcagtatttt

In this case, there are 2 clusters. The first contains 2 sequences (T and U from B1), while the second cluster contains 1 sequence (V from B2). The sequences are clustered across repertoires.

Example 4: Deduplication

This time we will deduplicate the amino acid sequences in the file setb.tsv using the -z or --deduplicate option.

The command line to run is:

compairr -z setb.tsv -o output.tsv

The output will look like this:

CompAIRR 1.8.0 - Comparison of Adaptive Immune Receptor Repertoires
https://github.com/uio-bmi/compairr

Start time:        Thu Sep 15 17:10:51 CEST 2022
Command:           Deduplicate (--deduplicate)
Repertoire:        setb.tsv
Nucleotides (n):   No
Differences (d):   0
Indels (i):        No
Ignore counts (f): No
Ignore genes (g):  No
Ign. unknown (u):  No
Threads (t):       1
Output file (o):   output.tsv
Log file (l):      (stderr)

Reading sequences: 100% (0s)
Repertoires:       2
Sequences:         3
Residues:          39
Shortest:          11
Longest:           14
Average length:    13.0
Total dupl. count: 22
Indexing:          100% (0s)
Unique V genes:    2
Unique J genes:    2
Computing hashes:  100% (0s)
Deduplicating:     100% (0s)
Duplicates merged: 0
Writing output:    100% (0s)

End time:          Thu Sep 15 17:10:51 CEST 2022

The result in the file output.tsv looks like this:

repertoire_id	duplicate_count	v_call	j_call	junction_aa
B1	5	TCRBV07-09	TCRBJ01-02	CASSLRVGGYGYTF
B1	10	TCRBV07-09	TCRBJ01-02	CASSLRVGGFGYTF
B2	7	TCRBV07-06	TCRBJ02-01	CASSTSHQQYF

There were no duplicates in this dataset so the output is essentially identical to the input data, but does not include all the original columns. If the two sequences in repertoire B1 had been identical, the two lines would have been merged and the new duplicate_count would have been 15.

Implementation

The program is written in C++. The strategy for finding similar sequences is based on a similar concept developed for the tool Swarm (Mahé et al. 2021). Basically, a 64-bit hash is computed for all sequences in the sets. All hashes for one set are stored in a Bloom filter and in a hash table. We then look for matches to sequences in the second set by looking them up in the Bloom filter and then, if there was a match, in the hash table. To find matches with 1 or 2 substitutions or indels, the hashes of all these variant sequences are generated and looked up. When d>2, a different strategy is used where all sequences are compared against each other and the number of differences is found.

Performance

As a preliminary performance test, Cohort 2 ("Keck") of the dataset by Emerson et al. (2017) was compared to itself. It contains 120 repertoires with a total of 24 205 557 extracted sequences. The test was performed with CompAIRR version 1.3.1. The timing results are shown below.

Distance	Indels	Threads	Time (s)	Time (mm:ss)
0	no	1	18	0:18
0	no	4	12	0:12
1	no	1	224	3:44
1	no	4	72	1:12
1	yes	1	367	6:07
1	yes	4	111	1:51
2	no	4	3200	53:20

When the distance is zero almost all of the time was used to read files.

Memory usage was 2.5GB, corresponding to an average of about 100 bytes per sequence.

Since this is a comparison of a repertoire set to itself, the dataset is only read once, and the memory needed is also reduced as compared to a situation were two different repertoire sets are compared.

Wall time and memory usage was measured by /usr/bin/time. The analysis was performed on an Apple Mac Mini M1 (2020) with 16GB RAM.

Benchmarking

The AIRR overlap functionality of CompAIRR has been thoroughly benchmarked against similar tools. All data, scripts, and results are available in a separate CompAIRR benchmarking repository.

Tips

If computer memory is limited, the dataset may be split into blocks before running CompAIRR on each block separately. Results then needs to be merged together again afterwards. This may be achieved with a simple script. We will consider providing such a script.

Development team

The code has been developed by Torbjørn Rognes based on code from Swarm where Frédéric Mahé and Lucas Czech made important contributions. Geir Kjetil Sandve had the idea of developing a tool for rapid repertoire set comparison. Lonneke Scheffer has tested and benchmarked the tool, and suggested new features. Milena Pavlovic and Victor Greiff have also contributed to the project.

Support

We will prioritize fixing important bugs. We will also try to answer questions, improve documentation and implement suggested enhancements as time permits. As we have no dedicated funding for this project we cannot make any guarantees on the level of support.

To report a potential bug, suggest enhancements or ask questions, please use one of the following means:

Submit an issue on GitHub (preferred)
Send an email to [email protected]

If you would like to contribute with code you are most welcome to submit a pull request.

Citing CompAIRR

Please cite the following if you use CompAIRR in any published work:

Rognes T, Scheffer L, Greiff V, Sandve GK (2021) CompAIRR: ultra-fast comparison of adaptive immune receptor repertoires by exact and approximate sequence matching. Bioinformatics, btac505. doi: 10.1093/bioinformatics/btac505

The article is also available in preprint form:

Rognes T, Scheffer L, Greiff V, Sandve GK (2021) CompAIRR: ultra-fast comparison of adaptive immune receptor repertoires by exact and approximate sequence matching. bioRxiv, 2021.10.30.466600. doi: 10.1101/2021.10.30.466600

References

Emerson RO, DeWitt WS, Vignali M, Gravley J, Hu JK, Osborne EJ, Desmarais C, Klinger M, Carlson CS, Hansen JA, Rieder M, Robins HS (2017) Immunosequencing identifies signatures of cytomegalovirus exposure history and HLA-mediated effects on the T cell repertoire. Nature Genetics, 49 (5): 659-665. doi: 10.1038/ng.3822
Mahé F, Czech L, Stamatakis A, Quince C, de Vargas C, Dunthorn M, Rognes T (2021) Swarm v3: Towards Tera-Scale Amplicon Clustering. Bioinformatics, btab493. doi: 10.1093/bioinformatics/btab493

compairr's People

Contributors

Stargazers

Watchers

Forkers

rnaimehaom aymanekricha mariliacascalho64 torognes nahid18

compairr's Issues

Max 65535 repertoires

If more than 65535 repertoires are used, the counting of matches in each repertoire will be wrong. Probably due to a 16-bit integer used somewhere.

Optionally compare junction nucleotide sequences instead of amino acid sequences

If a certain option is specified, comparisons should be performed on nucleotide sequences and the junction field in the AIRR TSV input file should be used.

Multithreading slows the program down

When using multiple threads, it seems to become much slower instead of faster.

Unable to upgrade version

Hi, I downloaded the latest version of compairr (1.7.0) and compiled it according to the instructions in the readme. However, when I check the version after installation (using compairr -v) it shows version 1.6.1.

Also, uninstalling via:

make uninstall

sudo make uninstall

doesn't work, and returns:

make: *** No rule to make target 'install'.  Stop.

Any idea how to solve this issue?

Handle non-standard sequence symbols better

CompAIRR will currently abort if any character except for the 20 standard amino acid symbols (ACDEFGHIKLMNPQRSTVWY in upper or lower case) appear in the junction_aa column.

In some datasets, the symbols _ and * appear in the amino acid sequence to represent a frame shift or a translation stop.

We could improve CompAIRR by either treating these symbols (_*) as any other amino acid symbol, or we could ignore the sequences were these symbols appear. This kind of behaviour could be indicated with additional options, but I think the default behaviour should be to abort with an informative error message.

This could perhaps also be extended to other non-standard amino acid symbols, like BJOUXZ, and even to non-standard nucleotide symbols – only ACGTU are allowed now.

Option to ignore V- and J-genes

Option to compare a set of sequences against a repertoire

Add an option to compare a set of sequences against a repertoire and report which sequences match which repertoires. The sequence id specified should be reported for each sequence.

Output results in alternative format

The output could optionally be presented in an alternative three column format with sample (repertoire) names in the two first columns and the overlap value in the third column.

AIRR standards compliance

Make CompAIRR compliant with the AIRR standard for software tools:

Needed:

Open source code in public repository
AIRR standard file formats etc
Include example data and automated check
Provide information about run parameters as part of the output.
Provide a container build file
Provide user support, clearly stating which level of support users can expect, and how and from whom to obtain it.

Further steps:

Apply for ratification
Obtain certificate of compliance
Add badge

Optionally output the matching sequences to a separate file

Optionally, CompAIRR should output the matching sequences to a separate TSV file for any matching pair from set A and set

Detect duplicates in the input

If there are exact duplicates in the input (same repertoire id, same sequence, same V-gene, same J-gene) when d=0, the resulting MH-index or Jaccard index would be bogus. This should be detected and the program should terminate, telling the user to deduplicate / dereplicate their data first.

It should be possible to perform this check quickly by looking up the hashes.

Compute table of hamming distances between all sequences

Computing the all-vs-all hamming distances between sequences in a set would be a useful feature, even if it is not very fast.

Allow splitting of datasets during processing to reduce memory usage

Option to selected alternative ways to compute output values

Option to selected alternative ways to compute output values in the cells of the matrix, e.g.:

multiply (for Morisita-Horn) (current, default)
divide (for Renyi divergence, note that this is asymmetrical)
min (for Jaccard)
max
mean

Overlap metrics

Hi,

Does CompAIRR allow the calculation of overlap metrics such as Jaccard or Morisita-Horn index or should this be done manually (afterwards)?

Thank you and kind regards,
Sebastiaan Valkiers

Extracting information on the distance

Hello,
Is it possible to add a column in the output file with the calculated distance when performing a clustering analysis?
Thanks !

Crash during sequence hashing when using indels

Crash during sequence hashing when using d=1 and indels.

Option to ignore frequencies, just count the number of pairs of similar sequences

Segmentation fault when clustering sequences

CompAIRR sometimes crashes with a segfault when clustering nucleotide sequences. A user have reported this issue.

The error has been reproduced and a potential reason has been identified.

Distance 2 with indels

Due to the way sequence variants are generated, in some cases variants at distance 2 with indels may be generated multiple times and the resulting values may be inaccurate.

Is sequence matching exact?

Hi,

I've been using CompAIRR for calculating the overlap between different repertoires. I noticed that, when computing the overlap between one sample and itself, CompAIRR produces a different (greater) number than the number of sequences present in that repertoire. For example, if repertoire X would contain 200,000 clonotypes, the CompAIRR n_x_n overlap matrix would return a number >200,000 for the overlap between X and X. My question is therefore: is the CompAIRR result an approximation of the overlap between two repertoires or is it exact and am I misinterpretting the results?

For reference I used the following parameters:

compairr -m -f -o output.tsv input.tsv

Thank you.

Kind regards,
Sebastiaan Valkiers

Allow d>3

To make the tool more general, we could add the possibility of allowing any value for d. An alternative algorithm for identifying similar sequences should probably be used when d>2.

Segfault with very short sequences with d=3

CompAIRR terminates with a segmentation fault if the -d 3 option is used and a sequence of length 1 is included in the dataset.

New feature: Copy additional columns from input to output files

Make a CompAIRR parameter called something like columns-to-keep, where the names of additional columns of interest that are not transferred by default could be specified. So if the user specifies --columns-to-keep epitope, the pairs file has additional columns epitope_1 and epitope_2. And if the epitope column is only present in one of the input files, the fields in the pairs file could just be empty rather than throwing an error. This feature could be of general use for people who want to further analyse the sequences in the pairs file, it's essentially just transferring additional sequence metadata so the user does not have to map this data back to the input files.

Self-comparison

Add possibility of comparing a repertoire set with itself, without having to read it twice.

compairr MH index strange results

Hello! I have used compairr with MH index, but the results seem to be very strange. For example the HM index compairing the same sample does not equal to 1 and also varies between different samples, ex: sample 1 vs sample 1 = 138.04, sample 2 vs sample 2 =76.74. I went through the publication and github but did not manage to find info regarding MH index values. This is the command I use run compairr compairr -d 0 -l $f1$f2.log -m -o $f1$f2.out -s MH -t 7 -g -u $f1 $f2; . Is there a reason for MH index values to be higher than 1 and why do the values change so much between different samples? I Can provide a matrix with all the values if needed.

Thanks:)

Read AIRR standard input files

Read AIRR standard input files as described here:

https://docs.airr-community.org/en/stable/datarep/rearrangements.html

Improve speed with distance 0 and smaller datasets

The program seems to be slower than Immunarch on datasets of ~100 repertoires or less.

Alternative computation: Compare a set of sequences against one repertoire

Given a set of sequences and a repertoire, produce a table of which sequences are found in which sample of the repertoire.

Allow multiple indels or more substitutions with a single indel

It would be useful if CompAIRR would allow more than one indel. The ability to specify the maximum number of substitutions (Hamming distance), indels, or total changes (Levenshtein distance) would be nice.

Allow partial V or J gene match

Partial matches with V or J gene names could optionally be allowed, perhaps based on prefix matches in the gene name. Need investigate further.

Cluster sequences based on single linkage

Default repertoire ID if column missing

If no repertoire_id column is provided, assume all sequences in the file belong to the same repertoire (could default to IDs 1 and 2 for the first and second file respectively).

Increase number of significant digits in output

Timestamps at major steps in the analysis

Log timestamps at major steps in the analysis. E.g. before and after reading files, etc.

A better name?

Sometimes partly incorrect results with > 4 threads

The results may be partly wrong in some cases when using more than four threads and a large enough dataset.

Increase resolution of timing of operations

CompAIRR shows the timing of some of the operations that take a significant amount of time. However, it is just shown with a resolution of seconds. The resolution should be increased to enable more precise measurements.

A flag to switch to cdr3 instead of junction

A flag to switch to cdr3 instead of junction. In that case, the sequence columns would be cdr3 (nt) and cdr3_aa, both for input and output. Nothing will change for how the sequences should be treated further, but AIRR files can contain either junction or cdr3, the difference is that junction has one extra leading/trailing amino acid.

Recommend Projects

React

A declarative, efficient, and flexible JavaScript library for building user interfaces.
Vue.js

🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
Typescript

TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
TensorFlow

An Open Source Machine Learning Framework for Everyone
Django

The Web framework for perfectionists with deadlines.
Laravel

A PHP framework for web artisans
D3

Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

javascript

JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
web

Some thing interesting about web. New door for the world.
server

A server is a program made to process requests and deliver data to clients.
Machine learning

Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Visualization

Some thing interesting about visualization, use data art
Game

Some thing interesting about game, make everyone happy.

Recommend Org

Facebook

We are working to build community through open source technology. NB: members must have two-factor auth.
Microsoft

Open source projects and samples from Microsoft.
Google

Google ❤️ Open Source for everyone.
Alibaba

Alibaba Open Source for everyone
D3

Data-Driven Documents codes.
Tencent

China tencent open source team.