Comments (10)
Dear Dima,
This is unexpected behaviour and not something I am able to reproduce with the test files I have here, so I suspect something unexpected is happening in the input file(s). The fact that the MH value exceeds 1 seems to indicate that there are a lot more matches being found than expected. Are there perhaps duplicate sequences available in your input files? I see you specified -g (ignoring genes), could there be duplicate junction sequences sharing the same genes? What happens when you remove the -g option?
If you are able to provide an example file for which compairr returns MH > 1, that could help us debug the issue.
from compairr.
(I did not mean to close this issue)
from compairr.
Hello Lonneke! Thanks for answering:) I was trying to sample the files not to send the entire ones and when taking 1000-10000 sequences (basically using head) Mh index was as it should be below 1. Going above 100 000 started giving problems. Sample files are bigger than 25 mb, I can send them by email if you want.
Taking out the -g
option did not help, the numbers simply changed but still were >1 for 100 000+ sequences. I am using files from Omniscope's data release so there is roughly 1million sequences per file. The problem might be (not sure) is that the original format is a csv and has one extra column conitg_id
. Besides transforming a file into a tsv I also had to add repertoire_id (I just put unique integer per file). Not sure if this might cause any issues. All the other columns follow AIRR format nomenclature and type.
I have attached the example output and log file of the bad results.
file1file2._log.txt
file1ile2.txt
from compairr.
Thanks! In principle additional columns shouldn't be a problem. Could you maybe email me a zipped version of the file, or for example a google drive download link or something similar? You can send it to [email protected]
from compairr.
Sent you an email with the data!
from compairr.
Hi again,
It looks indeed like your files contain a large number of duplicate sequences. File 1 has 499999 entries, of which 418094 unique sequences when considering columns junction_aa, v_call and j_call, and only 387430 unique entries when only considering the junction_aa column. For file 2, the numbers are 499999, 417415 and 384230 respectively. CompAIRR simply counts how many sequences between the repertoires are matching, so if a given sequence occurs on three different lines in file 1 and twice in file 2, the number of matching clonotypes that is counted is 2*3=6 instead of 1.
The solution would be to preprocess your files in order to remove those duplicates. I ran a quick test where I only kept the repertoire_id and junction_aa columns and removed duplicates, this yielded for me a Morisita-Horn index of 0.1016692745. If you want to use sequence frequency information, you will need to sum the values in the duplicate_count column when collapsing duplicates. Alternatively you can run CompAIRR with the -f flag to ignore sequence frequency information (this is what I did for the test).
Python code snippet I used for removing duplicates (ignoring frequency info), for your convenience:
import pandas as pd
df = pd.read_csv("file1.tsv", sep="\t", usecols=["repertoire_id", "junction_aa"])
df.drop_duplicates(inplace=True)
df.to_csv("file1_nodup.tsv", sep="\t", index=False)
I hope this was helpful!
from compairr.
from compairr.
CompAIRR uses the column named duplicate_count
, not umi_count
(see the description of the input file in the documentation)
When applying CompAIRR to a dataset, the input file(s) should not contain duplicates. If you want to represent how often a given clone occurred in a repertoire, this should be done by specifying the value in the duplicate_count
column. Whether that number means number of reads, or number of cells does not matter. The only thing that matters is that for one clone, there is one line in the input file. Having multiple lines containing the same sequence simply does not result in a sensible analysis, because the number of matches will exceed the total expected number of matches possible. This is why I suggested collapsing together duplicates, by keeping only one entry per clone and summing the duplicate_count values of all the duplicated entries in the input file. Duplicate entries commonly happen when clones have a different nucleotide sequence but the same amino acid sequence. But when doing matching on amino acid level, all amino acid sequences in the input file must be unique.
Morisita-Horn can be calculated with or without frequency information. Consider the formula on wikipedia.
- When including frequency information, the observed number of 'species' (xi and yi) are the values from the column duplicate_count. The total repertoire sizes (X and Y) are the sums of the duplicate_count columns of file 1 and 2.
- When excluding frequency information (
-f
flag), the values for xi and yi are always 1, and the total repertoire sizes are the number of clones (i.e., the number of lines in the file).
from compairr.
from compairr.
It's just a matter of the particular dataset (experimental methods, preprocessing) and research question. Sometimes accurate frequency information is not available or relevant.
I will close this issue now since it is resolved, but feel free to reach out if anything more arises.
from compairr.
Related Issues (20)
- Timestamps at major steps in the analysis HOT 1
- Option to selected alternative ways to compute output values HOT 3
- Max 65535 repertoires HOT 1
- Option to compare a set of sequences against a repertoire HOT 3
- Handle non-standard sequence symbols better HOT 2
- Segmentation fault when clustering sequences HOT 5
- Is sequence matching exact? HOT 3
- Overlap metrics HOT 3
- Segfault with very short sequences with d=3 HOT 1
- Allow d>3 HOT 1
- Unable to upgrade version HOT 3
- Detect duplicates in the input HOT 2
- Compute table of hamming distances between all sequences HOT 1
- Extracting information on the distance HOT 2
- Increase resolution of timing of operations HOT 2
- New feature: Copy additional columns from input to output files HOT 2
- Default repertoire ID if column missing HOT 1
- A flag to switch to cdr3 instead of junction HOT 1
- Allow multiple indels or more substitutions with a single indel HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from compairr.