uio-bmi / compairr
Comparison of Adaptive Immune Receptor Repertoires
License: GNU Affero General Public License v3.0
The program crashes with a segmentation fault while hashing sequences.
Hello! I have used CompAIRR with the MH index, but the results seem very strange. For example, the MH index from comparing a sample with itself does not equal 1 and also varies between samples, e.g. sample 1 vs sample 1 = 138.04 and sample 2 vs sample 2 = 76.74. I went through the publication and the GitHub repository but did not manage to find any information about MH index values. This is the command I use to run CompAIRR:
compairr -d 0 -l $f1$f2.log -m -o $f1$f2.out -s MH -t 7 -g -u $f1 $f2
Is there a reason for MH index values to be higher than 1, and why do the values change so much between different samples? I can provide a matrix with all the values if needed.
Thanks:)
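For reference, the textbook Morisita-Horn index can be sketched as below. This is a minimal Python illustration of the standard definition, not CompAIRR's implementation; the function name `morisita_horn` and the toy counts are made up for the example. Under this definition a sample compared with itself gives 1, and no pair of samples gives a value above 1.

```python
from collections import Counter

def morisita_horn(x, y):
    """Morisita-Horn index for two repertoires given as {sequence: count} mappings."""
    total_x = sum(x.values())
    total_y = sum(y.values())
    # Simpson-type terms: sum of squared relative frequencies in each sample
    d_x = sum(c * c for c in x.values()) / total_x ** 2
    d_y = sum(c * c for c in y.values()) / total_y ** 2
    # Cross term: product of counts for sequences present in both samples
    shared = sum(c * y[s] for s, c in x.items() if s in y)
    return 2 * shared / ((d_x + d_y) * total_x * total_y)

# Toy data (hypothetical sequences and counts)
sample_1 = Counter({"CASSLGTDTQYF": 5, "CASSPGQGYEQYF": 2, "CASRDRGNTIYF": 1})
sample_2 = Counter({"CASSLGTDTQYF": 1, "CASSIRSSYEQYF": 4})

print(morisita_horn(sample_1, sample_1))
print(morisita_horn(sample_1, sample_2))
```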
Option to select alternative ways to compute the output values in the cells of the matrix, e.g.:
Read AIRR standard input files as described here:
https://docs.airr-community.org/en/stable/datarep/rearrangements.html
CompAIRR will currently abort if any character except for the 20 standard amino acid symbols (ACDEFGHIKLMNPQRSTVWY in upper or lower case) appears in the junction_aa column.
In some datasets, the symbols _ and * appear in the amino acid sequence to represent a frame shift or a translation stop.
We could improve CompAIRR by either treating these symbols (_*) as any other amino acid symbol, or we could ignore the sequences where these symbols appear. This kind of behaviour could be indicated with additional options, but I think the default behaviour should be to abort with an informative error message.
This could perhaps also be extended to other non-standard amino acid symbols, like BJOUXZ, and even to non-standard nucleotide symbols (only ACGTU are allowed now).
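The three behaviours discussed above (abort, ignore the sequence, or accept extra symbols) could look roughly like this. It is a Python sketch only: the function `filter_sequences` and its parameters `on_invalid` and `extra_symbols` are hypothetical and do not correspond to existing CompAIRR options.

```python
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def filter_sequences(sequences, on_invalid="abort", extra_symbols=""):
    """Return the accepted sequences.

    on_invalid="abort"  -> raise an informative error (the proposed default)
    on_invalid="ignore" -> silently drop sequences with illegal symbols
    extra_symbols       -> additional symbols to accept, e.g. "_*"
    """
    if on_invalid not in ("abort", "ignore"):
        raise ValueError(f"unknown mode: {on_invalid!r}")
    allowed = STANDARD_AA | set(extra_symbols.upper())
    kept = []
    for seq in sequences:
        # Symbols are compared in upper case, so lower-case input is accepted
        illegal = set(seq.upper()) - allowed
        if not illegal:
            kept.append(seq)
        elif on_invalid == "abort":
            raise ValueError(
                f"illegal symbol(s) {sorted(illegal)} in sequence {seq!r}"
            )
    return kept

print(filter_sequences(["CASSL", "CA*SL"], on_invalid="ignore"))
print(filter_sequences(["CASSL", "CA*SL"], extra_symbols="_*"))
```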
Incorrect results were detected on the full Emerson dataset even with a single thread.
Optionally, CompAIRR should output the matching sequences to a separate TSV file for any matching pair from set A and set
If a certain option is specified, comparisons should be performed on nucleotide sequences and the junction field in the AIRR TSV input file should be used.
The program seems to be slower than Immunarch on datasets of ~100 repertoires or less.
A flag to switch to cdr3 instead of junction. In that case, the sequence columns would be cdr3 (nt) and cdr3_aa, both for input and output. Nothing would change in how the sequences are treated further, but AIRR files can contain either junction or cdr3; the difference is that junction has one extra amino acid at each end.
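To make the relationship concrete, here is a small Python sketch that derives the cdr3 fields from the junction fields by trimming one residue at each end. The helper `junction_to_cdr3` is hypothetical, and trimming one codon (3 nt) per side in the nucleotide sequence is an assumption that follows from the one-amino-acid difference.

```python
def junction_to_cdr3(junction, junction_aa):
    """Trim the extra leading and trailing residue from a junction.

    Returns (cdr3, cdr3_aa). Assumes the nucleotide junction is in frame,
    so one amino acid corresponds to one codon of 3 nt.
    """
    cdr3 = junction[3:-3]
    cdr3_aa = junction_aa[1:-1]
    return cdr3, cdr3_aa

# Toy example: TGT GCC AGC TTT translates to C A S F
print(junction_to_cdr3("TGTGCCAGCTTT", "CASF"))
```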
CompAIRR shows the timing of some of the operations that take a significant amount of time. However, it is just shown with a resolution of seconds. The resolution should be increased to enable more precise measurements.
If more than 65535 repertoires are used, the counting of matches in each repertoire will be wrong, probably due to a 16-bit integer used somewhere.
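The suspected mechanism can be illustrated with a few lines of Python. This only demonstrates 16-bit wraparound in general; where (or whether) it happens in CompAIRR is not confirmed here, and `as_uint16` is a made-up helper.

```python
def as_uint16(n):
    """Value of n after being stored in an unsigned 16-bit integer."""
    return n & 0xFFFF

# Repertoire indices 0..65535 fit; the next one wraps around to 0
for index in (65534, 65535, 65536, 65537):
    print(index, "->", as_uint16(index))
```

With such a truncation, matches for repertoire 65536 would be counted under repertoire 0, which would be consistent with the wrong counts described.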
If there are exact duplicates in the input (same repertoire id, same sequence, same V-gene, same J-gene) when d=0, the resulting MH-index or Jaccard index would be bogus. This should be detected and the program should terminate, telling the user to deduplicate / dereplicate their data first.
It should be possible to perform this check quickly by looking up the hashes.
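A minimal Python sketch of such a check, using a hash set keyed on the four fields mentioned above. The function name `find_duplicates` is hypothetical and the rows are assumed to be dictionaries with AIRR column names.

```python
def find_duplicates(rows):
    """Return the keys of rows that repeat an earlier row exactly."""
    seen = set()
    duplicates = []
    for row in rows:
        key = (row["repertoire_id"], row["junction_aa"],
               row["v_call"], row["j_call"])
        if key in seen:
            duplicates.append(key)
        else:
            seen.add(key)  # set membership is a hash lookup
    return duplicates

rows = [
    {"repertoire_id": "R1", "junction_aa": "CASSL", "v_call": "TRBV5-1", "j_call": "TRBJ2-7"},
    {"repertoire_id": "R1", "junction_aa": "CASSL", "v_call": "TRBV5-1", "j_call": "TRBJ2-7"},
    {"repertoire_id": "R2", "junction_aa": "CASSL", "v_call": "TRBV5-1", "j_call": "TRBJ2-7"},
]
dups = find_duplicates(rows)
if dups:
    print(f"{len(dups)} duplicate(s) found; please dereplicate the input first")
```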
If no repertoire_id column is provided, assume all sequences in the file belong to the same repertoire (could default to IDs 1 and 2 for the first and second file respectively).
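As an illustration, the fallback could be sketched like this in Python. The reader function `read_rearrangements` is hypothetical; the default is applied only when the column is missing entirely.

```python
import csv
import io

def read_rearrangements(handle, default_repertoire_id):
    """Read an AIRR-style TSV, filling in repertoire_id if the column is absent."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # setdefault leaves existing repertoire_id values untouched
        row.setdefault("repertoire_id", default_repertoire_id)
        rows.append(row)
    return rows

# First input file without a repertoire_id column gets ID "1"
first = io.StringIO("sequence_id\tjunction_aa\ns1\tCASSLGTDTQYF\n")
rows_1 = read_rearrangements(first, default_repertoire_id="1")
print(rows_1[0]["repertoire_id"])
```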
Partial matches with V or J gene names could optionally be allowed, perhaps based on prefix matches in the gene name. This needs further investigation.
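A naive version of prefix matching is sketched below in Python, mainly to show one pitfall to investigate. The function `genes_match` and its `prefix_only` flag are hypothetical.

```python
def genes_match(gene_a, gene_b, prefix_only=False):
    """Compare two gene names exactly, or by plain string prefix."""
    if not prefix_only:
        return gene_a == gene_b
    return gene_a.startswith(gene_b) or gene_b.startswith(gene_a)

print(genes_match("TRBV5-1*01", "TRBV5-1", prefix_only=True))  # allele vs gene
print(genes_match("TRBV10-1", "TRBV1", prefix_only=True))      # unrelated genes
```

Both calls return True, so a plain string prefix would also match TRBV1 against TRBV10-1. A real implementation would probably need to compare at the separators used in gene names (such as `-` and `*`) instead.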
CompAIRR terminates with a segmentation fault if the -d 3 option is used and a sequence of length 1 is included in the dataset.
Given a set of sequences and a repertoire, produce a table of which sequences are found in which sample of the repertoire.
When using multiple threads, it seems to become much slower instead of faster.
Hi,
I've been using CompAIRR for calculating the overlap between different repertoires. I noticed that, when computing the overlap between one sample and itself, CompAIRR produces a different (greater) number than the number of sequences present in that repertoire. For example, if repertoire X contains 200,000 clonotypes, the CompAIRR n_x_n overlap matrix returns a number >200,000 for the overlap between X and X. My question is therefore: is the CompAIRR result an approximation of the overlap between two repertoires, or is it exact and am I misinterpreting the results?
For reference I used the following parameters:
compairr -m -f -o output.tsv input.tsv
Thank you.
Kind regards,
Sebastiaan Valkiers
Hello,
Is it possible to add a column in the output file with the calculated distance when performing a clustering analysis?
Thanks !
Add an option to compare a set of sequences against a repertoire and report which sequences match which repertoires. The sequence id specified should be reported for each sequence.
Hi,
Does CompAIRR allow the calculation of overlap metrics such as Jaccard or Morisita-Horn index or should this be done manually (afterwards)?
Thank you and kind regards,
Sebastiaan Valkiers
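For anyone computing this manually afterwards, the Jaccard index on two sets of clonotypes can be sketched as follows. This is plain Python on toy data, independent of CompAIRR; the function name `jaccard` is made up for the example.

```python
def jaccard(a, b):
    """Jaccard index: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    # Two empty sets have no defined overlap; 0.0 is returned by convention here
    return len(a & b) / len(union) if union else 0.0

rep_x = {"CASSLGTDTQYF", "CASSPGQGYEQYF", "CASRDRGNTIYF"}
rep_y = {"CASSPGQGYEQYF", "CASRDRGNTIYF", "CASSIRSSYEQYF"}
print(jaccard(rep_x, rep_y))
```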
It would be useful if CompAIRR would allow more than one indel. The ability to specify the maximum number of substitutions (Hamming distance), indels, or total changes (Levenshtein distance) would be nice.
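The two distance measures mentioned can be sketched in Python as below. These are the standard textbook definitions, written for clarity rather than speed, and say nothing about how CompAIRR would implement them.

```python
def hamming(a, b):
    """Number of substitutions between two sequences of equal length."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions (row-wise DP)."""
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,             # deletion from a
                current[j - 1] + 1,          # insertion into a
                previous[j - 1] + (x != y),  # match or substitution
            ))
        previous = current
    return previous[-1]

print(hamming("CASSL", "CASTL"))
print(levenshtein("CASSL", "CASL"))
```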
The output could optionally be presented in an alternative three column format with sample (repertoire) names in the two first columns and the overlap value in the third column.
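The conversion from the matrix to the three-column form is straightforward; a Python sketch with made-up repertoire names and values follows. The function `matrix_to_long` is hypothetical.

```python
def matrix_to_long(row_names, col_names, matrix):
    """Flatten an overlap matrix into (repertoire_1, repertoire_2, value) rows."""
    return [
        (row_name, col_name, matrix[i][j])
        for i, row_name in enumerate(row_names)
        for j, col_name in enumerate(col_names)
    ]

names = ["R1", "R2"]
matrix = [[10, 3],
          [3, 8]]
for rep_1, rep_2, value in matrix_to_long(names, names, matrix):
    print(rep_1, rep_2, value, sep="\t")
```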
Log timestamps at major steps in the analysis. E.g. before and after reading files, etc.
The results may be partly wrong in some cases when using more than four threads and a large enough dataset.
Due to the way sequence variants are generated, in some cases variants at distance 2 with indels may be generated multiple times and the resulting values may be inaccurate.
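The effect is easy to reproduce in a generic setting. The Python sketch below is not CompAIRR's variant generator; it only shows that deleting either of two identical neighbouring residues yields the same variant, so a naive enumeration counts it twice unless variants are deduplicated.

```python
def deletion_variants(seq):
    """All sequences obtained by deleting exactly one position of seq."""
    return [seq[:i] + seq[i + 1:] for i in range(len(seq))]

variants = deletion_variants("CASSL")
print(variants)                            # "CASL" is produced twice
print(len(variants), len(set(variants)))   # generated vs distinct variants
```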
Crash during sequence hashing when using d=1 and indels.
Make a CompAIRR parameter called something like columns-to-keep, where the names of additional columns of interest that are not transferred by default could be specified. So if the user specifies --columns-to-keep epitope, the pairs file has additional columns epitope_1 and epitope_2. And if the epitope column is only present in one of the input files, the fields in the pairs file could just be empty rather than throwing an error. This feature could be of general use for people who want to further analyse the sequences in the pairs file; it's essentially just transferring additional sequence metadata so the user does not have to map this data back to the input files.
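The proposed behaviour could be sketched like this in Python. The function `pair_row` and the output column names `sequence_id_1` and `sequence_id_2` are assumptions for illustration; only the `epitope_1`/`epitope_2` naming comes from the proposal above.

```python
def pair_row(row_1, row_2, columns_to_keep=()):
    """Build one row of the pairs file, carrying over extra columns."""
    out = {
        "sequence_id_1": row_1["sequence_id"],
        "sequence_id_2": row_2["sequence_id"],
    }
    for column in columns_to_keep:
        # A column missing from one input gives an empty field, not an error
        out[f"{column}_1"] = row_1.get(column, "")
        out[f"{column}_2"] = row_2.get(column, "")
    return out

a = {"sequence_id": "s1", "junction_aa": "CASSL", "epitope": "GILGFVFTL"}
b = {"sequence_id": "t9", "junction_aa": "CASSL"}  # no epitope column
print(pair_row(a, b, columns_to_keep=["epitope"]))
```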
Hi, I downloaded the latest version of compairr (1.7.0) and compiled it according to the instructions in the readme. However, when I check the version after installation (using compairr -v) it shows version 1.6.1.
Also, uninstalling via:
make uninstall
or
sudo make uninstall
doesn't work, and returns:
make: *** No rule to make target 'install'. Stop.
Any idea how to solve this issue?
To make the tool more general, we could add the possibility of allowing any value for d. An alternative algorithm for identifying similar sequences should probably be used when d>2.
CompAIRR sometimes crashes with a segfault when clustering nucleotide sequences. A user has reported this issue.
The error has been reproduced and a potential reason has been identified.
Computing the all-vs-all Hamming distances between sequences in a set would be a useful feature, even if it is not very fast.
Make CompAIRR compliant with the AIRR standard for software tools:
Needed:
Further steps:
Add possibility of comparing a repertoire set with itself, without having to read it twice.