uio-bmi / compairr

Comparison of Adaptive Immune Receptor Repertoires

License: GNU Affero General Public License v3.0

Makefile 1.69% C++ 87.13% C 10.80% Shell 0.20% Dockerfile 0.18%

airr bioinformatics immune-repertoire immunoinformatics immunology rep-seq repertoire-analysis

compairr's Issues

Segmentation fault while hashing sequences

The program crashes with a segmentation fault while hashing sequences.

compairr MH index strange results

Hello! I have used compairr with the MH index, but the results seem very strange. For example, the MH index from comparing a sample with itself does not equal 1 and also varies between samples, e.g. sample 1 vs sample 1 = 138.04, sample 2 vs sample 2 = 76.74. I went through the publication and the GitHub repository but did not manage to find any information about MH index values. This is the command I use to run compairr:

compairr -d 0 -l $f1$f2.log -m -o $f1$f2.out -s MH -t 7 -g -u $f1 $f2

Is there a reason for MH index values to be higher than 1, and why do the values change so much between different samples? I can provide a matrix with all the values if needed.

Thanks:)
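
For reference, here is a minimal sketch of the textbook Morisita-Horn index computed from per-sequence counts. It is not CompAIRR's code, and the issue does not say how CompAIRR scales its reported values. The function name and the toy samples are made up for illustration. With this definition, a sample compared with itself gives 1 (up to floating-point rounding).

```python
from collections import Counter


def morisita_horn(x: Counter, y: Counter) -> float:
    """Textbook Morisita-Horn similarity between two count tables."""
    total_x = sum(x.values())
    total_y = sum(y.values())
    # Cross term: product of counts for sequences present in both samples.
    cross = sum(x[k] * y[k] for k in x.keys() & y.keys())
    # Simpson-style concentration term for each sample.
    d_x = sum(c * c for c in x.values()) / (total_x ** 2)
    d_y = sum(c * c for c in y.values()) / (total_y ** 2)
    return 2 * cross / ((d_x + d_y) * total_x * total_y)


# Hypothetical toy repertoires: sequence -> duplicate count.
sample_1 = Counter({"CASSLGTDTQYF": 10, "CASSPGQGNEQFF": 3, "CASRDRGYEQYF": 1})
sample_2 = Counter({"CASSLGTDTQYF": 2, "CASSIRSSYEQYF": 7})

print(morisita_horn(sample_1, sample_1))  # self-comparison, approximately 1
print(morisita_horn(sample_1, sample_2))  # partial overlap, between 0 and 1
```

Values far above 1 on the diagonal would therefore suggest that the numbers being compared are not normalized in this way.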

Cluster sequences based on single linkage

Option to select alternative ways to compute output values

Option to select alternative ways to compute the output values in the cells of the matrix, e.g.:

multiply (for Morisita-Horn) (current, default)
divide (for Renyi divergence, note that this is asymmetrical)
min (for Jaccard)
max
mean
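
The list above could be sketched as a lookup table of combining functions, one per option. This is only an illustration of the request: the names COMBINERS and cell_value are hypothetical, and the assumption that a cell is the sum of combined counts over all matching pairs is mine, not something the issue states.

```python
from typing import Callable, Dict, Iterable, Tuple

# Hypothetical option names mapped to functions combining two counts.
COMBINERS: Dict[str, Callable[[float, float], float]] = {
    "multiply": lambda a, b: a * b,      # suits Morisita-Horn
    "divide": lambda a, b: a / b,        # asymmetric: order of a and b matters
    "min": min,                          # suits Jaccard-style overlap
    "max": max,
    "mean": lambda a, b: (a + b) / 2,
}


def cell_value(pairs: Iterable[Tuple[float, float]], how: str = "multiply") -> float:
    """Sum the combined counts over all matching pairs (assumed aggregation)."""
    combine = COMBINERS[how]
    return sum(combine(a, b) for a, b in pairs)


# Counts of two matching sequence pairs between repertoire A and repertoire B.
matches = [(2, 3), (4, 1)]
for name in COMBINERS:
    print(name, cell_value(matches, name))
```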

Read AIRR standard input files

Read AIRR standard input files as described here:

https://docs.airr-community.org/en/stable/datarep/rearrangements.html

Allow splitting of datasets during processing to reduce memory usage

Handle non-standard sequence symbols better

CompAIRR will currently abort if any character except for the 20 standard amino acid symbols (ACDEFGHIKLMNPQRSTVWY in upper or lower case) appear in the junction_aa column.

In some datasets, the symbols _ and * appear in the amino acid sequence to represent a frame shift or a translation stop.

We could improve CompAIRR by either treating these symbols (_ and *) like any other amino acid symbol, or by ignoring the sequences where these symbols appear. This kind of behaviour could be selected with additional options, but I think the default behaviour should be to abort with an informative error message.

This could perhaps also be extended to other non-standard amino acid symbols, like BJOUXZ, and even to non-standard nucleotide symbols – only ACGTU are allowed now.
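
A minimal sketch of the check discussed above, assuming a hypothetical helper that reports which symbols in a junction_aa value fall outside the allowed alphabet. The extra_allowed parameter stands in for the proposed option to accept symbols such as _ and *; it is not an existing CompAIRR option.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_symbols(seq: str, extra_allowed: str = "") -> list:
    """Return the sorted set of symbols in seq that are not allowed.

    Case is ignored, matching the current behaviour described in the issue.
    """
    allowed = STANDARD_AA | set(extra_allowed.upper())
    return sorted({ch for ch in seq.upper() if ch not in allowed})


print(invalid_symbols("CASSLGTDTQYF"))       # nothing to report
print(invalid_symbols("CAS*L_F"))            # frame shift and stop symbols
print(invalid_symbols("cas*l_f", "_*"))      # accepted when explicitly allowed
```

A caller could abort with an error message naming the offending symbols when the returned list is non-empty, or skip the sequence instead.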

Incorrect results with full Emerson dataset on 1 thread

Incorrect results were detected on the full Emerson dataset even with a single thread.

Option to ignore V- and J-genes

Optionally output the matching sequences to a separate file

Optionally, CompAIRR should output the matching sequences to a separate TSV file for any matching pair from set A and set

Optionally compare junction nucleotide sequences instead of amino acid sequences

If a certain option is specified, comparisons should be performed on nucleotide sequences and the junction field in the AIRR TSV input file should be used.

Improve speed with distance 0 and smaller datasets

The program seems to be slower than Immunarch on datasets of ~100 repertoires or less.

A flag to switch to cdr3 instead of junction

A flag to switch to cdr3 instead of junction. In that case, the sequence columns would be cdr3 (nt) and cdr3_aa, both for input and output. Nothing will change for how the sequences should be treated further, but AIRR files can contain either junction or cdr3, the difference is that junction has one extra leading/trailing amino acid.
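
To illustrate the relationship described above, here is a small sketch that derives a cdr3_aa value from a junction_aa value by dropping one residue at each end. The function name is hypothetical, and the sketch assumes the junction carries exactly one extra amino acid on both sides.

```python
def junction_to_cdr3_aa(junction_aa: str) -> str:
    """Drop the first and last residue of a junction to get the CDR3."""
    if len(junction_aa) < 2:
        raise ValueError("junction too short to contain flanking residues")
    return junction_aa[1:-1]


print(junction_to_cdr3_aa("CASSLGTDTQYF"))
```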

Increase resolution of timing of operations

CompAIRR shows the timing of some of the operations that take a significant amount of time. However, it is just shown with a resolution of seconds. The resolution should be increased to enable more precise measurements.
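
As an illustration of sub-second timing, the sketch below wraps a call with a high-resolution monotonic clock and reports the elapsed time with microsecond precision. It is written in Python for brevity and is not CompAIRR's code; the helper name timed is hypothetical.

```python
import time


def timed(label, func, *args):
    """Run func(*args) and return (result, message with elapsed seconds)."""
    start = time.perf_counter()          # monotonic, high-resolution clock
    result = func(*args)
    elapsed = time.perf_counter() - start
    return result, f"{label}: {elapsed:.6f} s"


result, message = timed("summing", sum, range(1_000_000))
print(message)
```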

Max 65535 repertoires

If more than 65535 repertoires are used, the counting of matches in each repertoire will be wrong. Probably due to a 16-bit integer used somewhere.
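
The suspected cause can be illustrated by emulating an unsigned 16-bit counter, which wraps around after 65535. This only demonstrates the general overflow behaviour; the issue itself does not confirm where, or whether, such a type is used.

```python
def add_uint16(value: int, increment: int = 1) -> int:
    """Add to a counter the way an unsigned 16-bit integer would (wraps at 2**16)."""
    return (value + increment) & 0xFFFF


print(add_uint16(65534))  # still representable
print(add_uint16(65535))  # wraps around to zero
```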

Detect duplicates in the input

If there are exact duplicates in the input (same repertoire id, same sequence, same V-gene, same J-gene) when d=0, the resulting MH-index or Jaccard index would be bogus. This should be detected and the program should terminate, telling the user to deduplicate / dereplicate their data first.

It should be possible to perform this check quickly by looking up the hashes.
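
A minimal sketch of such a check, using a hash set keyed on the four fields named above. The column names v_call and j_call are assumed from the AIRR rearrangement schema, and the function name is hypothetical.

```python
def find_duplicates(records):
    """Return the keys of records that repeat an earlier record exactly."""
    seen = set()
    duplicates = []
    for rec in records:
        key = (rec["repertoire_id"], rec["junction_aa"], rec["v_call"], rec["j_call"])
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)
    return duplicates


rows = [
    {"repertoire_id": "R1", "junction_aa": "CASSLGF", "v_call": "TRBV5-1", "j_call": "TRBJ2-3"},
    {"repertoire_id": "R1", "junction_aa": "CASSLGF", "v_call": "TRBV5-1", "j_call": "TRBJ2-3"},
    {"repertoire_id": "R2", "junction_aa": "CASSLGF", "v_call": "TRBV5-1", "j_call": "TRBJ2-3"},
]
print(find_duplicates(rows))
```

Only the second row is reported: the third has the same sequence and genes but belongs to a different repertoire.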

Default repertoire ID if column missing

If no repertoire_id column is provided, assume all sequences in the file belong to the same repertoire (could default to IDs 1 and 2 for the first and second file respectively).
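
A sketch of the proposed fallback, assuming records are read into dictionaries and that a missing column shows up as a missing key. The function name is hypothetical.

```python
def with_default_repertoire(records, default_id):
    """Fill in repertoire_id for records that lack it; leave others unchanged."""
    return [
        {**rec, "repertoire_id": rec.get("repertoire_id", default_id)}
        for rec in records
    ]


first_file = [{"junction_aa": "CASSLGF"}, {"junction_aa": "CASSPGF"}]
print(with_default_repertoire(first_file, "1"))
```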

Allow partial V or J gene match

Partial matches with V or J gene names could optionally be allowed, perhaps based on prefix matches in the gene name. This needs further investigation.
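
One possible reading of prefix matching is sketched below: two gene names match if one is a prefix of the other. The function name and the example gene names are illustrative only, and this is not a proposal for the final rule.

```python
def genes_match_by_prefix(a: str, b: str) -> bool:
    """True if either gene name is a prefix of the other."""
    return a.startswith(b) or b.startswith(a)


print(genes_match_by_prefix("TRBV5-1", "TRBV5-1*01"))     # gene vs. allele
print(genes_match_by_prefix("TRBV5-1*01", "TRBV5-1*02"))  # two different alleles
print(genes_match_by_prefix("TRBV5-1", "TRBV5-10"))       # different genes
```

The last call also returns True, which shows a weakness of naive prefix matching: unrelated genes whose names happen to share a prefix are treated as a match.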

Segfault with very short sequences with d=3

CompAIRR terminates with a segmentation fault if the -d 3 option is used and a sequence of length 1 is included in the dataset.

Option to ignore frequencies, just count the number of pairs of similar sequences

Alternative computation: Compare a set of sequences against one repertoire

Given a set of sequences and a repertoire, produce a table of which sequences are found in which sample of the repertoire.

Multithreading slows the program down

When using multiple threads, it seems to become much slower instead of faster.

Is sequence matching exact?

Hi,

I've been using CompAIRR for calculating the overlap between different repertoires. I noticed that, when computing the overlap between one sample and itself, CompAIRR produces a different (greater) number than the number of sequences present in that repertoire. For example, if repertoire X contains 200,000 clonotypes, the CompAIRR n_x_n overlap matrix returns a number >200,000 for the overlap between X and X. My question is therefore: is the CompAIRR result an approximation of the overlap between two repertoires, or is it exact and am I misinterpreting the results?

For reference I used the following parameters:

compairr -m -f -o output.tsv input.tsv

Thank you.

Kind regards,
Sebastiaan Valkiers

Extracting information on the distance

Hello,
Is it possible to add a column in the output file with the calculated distance when performing a clustering analysis?
Thanks !

Option to compare a set of sequences against a repertoire

Add an option to compare a set of sequences against a repertoire and report which sequences match which repertoires. The sequence id specified should be reported for each sequence.

Overlap metrics

Hi,

Does CompAIRR allow the calculation of overlap metrics such as Jaccard or Morisita-Horn index or should this be done manually (afterwards)?

Thank you and kind regards,
Sebastiaan Valkiers
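
For anyone computing this afterwards, here is a minimal sketch of the Jaccard index on sets of sequences, using exact matching. It is the standard set definition and says nothing about what CompAIRR itself outputs; the function name and the convention for two empty sets are my own choices.

```python
def jaccard(a, b) -> float:
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


rep_x = {"CASSLGF", "CASSPGF", "CASRDRF"}
rep_y = {"CASSPGF", "CASRDRF", "CASSIRF"}
print(jaccard(rep_x, rep_y))
print(jaccard(rep_x, rep_x))
```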

Allow multiple indels or more substitutions with a single indel

It would be useful if CompAIRR would allow more than one indel. The ability to specify the maximum number of substitutions (Hamming distance), indels, or total changes (Levenshtein distance) would be nice.
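
The two distance measures mentioned above can be sketched as follows. These are the standard definitions in plain Python, intended only to make the request concrete; they are not how CompAIRR searches for matches.

```python
def hamming(a: str, b: str) -> int:
    """Number of substituted positions; defined for equal-length strings only."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion from a
                cur[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),  # substitution or match
            ))
        prev = cur
    return prev[-1]


print(hamming("CASSLF", "CASTLF"))       # one substitution
print(levenshtein("CASSLF", "CASLF"))    # one deletion
```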

Output results in alternative format

The output could optionally be presented in an alternative three column format with sample (repertoire) names in the two first columns and the overlap value in the third column.
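
The conversion from a matrix to the three-column layout can be sketched as below. The function name is hypothetical and the matrix is assumed to be a list of rows.

```python
def matrix_to_three_columns(row_names, col_names, matrix):
    """Flatten an overlap matrix into (row name, column name, value) tuples."""
    return [
        (r, c, matrix[i][j])
        for i, r in enumerate(row_names)
        for j, c in enumerate(col_names)
    ]


names = ["rep1", "rep2"]
overlap = [[5, 2],
           [2, 7]]
for row in matrix_to_three_columns(names, names, overlap):
    print("\t".join(str(field) for field in row))
```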

Timestamps at major steps in the analysis

Log timestamps at major steps in the analysis. E.g. before and after reading files, etc.

Sometimes partly incorrect results with > 4 threads

The results may be partly wrong in some cases when using more than four threads and a large enough dataset.

Distance 2 with indels

Due to the way sequence variants are generated, in some cases variants at distance 2 with indels may be generated multiple times and the resulting values may be inaccurate.
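
The general effect can be illustrated with a simplified case: enumerating single-residue deletions position by position yields the same variant more than once when neighbouring residues are identical. This is a toy example of repeated variants, not CompAIRR's actual variant generator.

```python
from collections import Counter


def deletion_variants(seq: str) -> list:
    """All variants obtained by deleting one position, in position order."""
    return [seq[:i] + seq[i + 1:] for i in range(len(seq))]


counts = Counter(deletion_variants("CASSL"))
print(counts)
```

Deleting either S from CASSL gives CASL, so that variant is produced twice. Counting matches per generated variant without deduplication would then overcount.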

Compute additional metrics

Crash during sequence hashing when using indels

Crash during sequence hashing when using d=1 and indels.

New feature: Copy additional columns from input to output files

Make a CompAIRR parameter called something like columns-to-keep, where the names of additional columns of interest that are not transferred by default could be specified. So if the user specifies --columns-to-keep epitope, the pairs file has additional columns epitope_1 and epitope_2. And if the epitope column is only present in one of the input files, the fields in the pairs file could just be empty rather than throwing an error. This feature could be of general use for people who want to further analyse the sequences in the pairs file, it's essentially just transferring additional sequence metadata so the user does not have to map this data back to the input files.
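
A sketch of the requested behaviour, assuming each input record is a dictionary. The function name pair_row and the minimal set of default columns are hypothetical; only the _1 and _2 suffixes and the empty-field fallback come from the description above.

```python
def pair_row(rec_1: dict, rec_2: dict, columns_to_keep: list) -> dict:
    """Build one row of the pairs file, copying extra columns from both inputs."""
    row = {
        "junction_aa_1": rec_1["junction_aa"],
        "junction_aa_2": rec_2["junction_aa"],
    }
    for col in columns_to_keep:
        # A column missing from one input becomes an empty field, not an error.
        row[f"{col}_1"] = rec_1.get(col, "")
        row[f"{col}_2"] = rec_2.get(col, "")
    return row


left = {"junction_aa": "CASSLGF", "epitope": "GILGFVFTL"}
right = {"junction_aa": "CASSLGF"}
print(pair_row(left, right, ["epitope"]))
```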

Unable to upgrade version

Hi, I downloaded the latest version of compairr (1.7.0) and compiled it according to the instructions in the readme. However, when I check the version after installation (using compairr -v) it shows version 1.6.1.

Also, uninstalling via:

make uninstall

sudo make uninstall

doesn't work, and returns:

make: *** No rule to make target 'install'.  Stop.

Any idea how to solve this issue?

Open source code in public repository
AIRR standard file formats etc
Include example data and automated check
Provide information about run parameters as part of the output.
Provide a container build file
Provide user support, clearly stating which level of support users can expect, and how and from whom to obtain it.

Further steps:

Apply for ratification
Obtain certificate of compliance
Add badge

Increase number of significant digits in output

Self-comparison

Add possibility of comparing a repertoire set with itself, without having to read it twice.
