Hello! I have used compairr with MH index, but the results seem to be very strange. Fo

compairr MH index strange results about compairr HOT 10 CLOSED

uio-bmi commented on September 24, 2024

compairr MH index strange results

from compairr.

Comments (10)

LonnekeScheffer commented on September 24, 2024

Dear Dima,
This is unexpected behaviour and not something I am able to reproduce with the test files I have here, so I suspect something unexpected is happening in the input file(s). The fact that the MH value exceeds 1 seems to indicate that there are a lot more matches being found than expected. Are there perhaps duplicate sequences available in your input files? I see you specified -g (ignoring genes), could there be duplicate junction sequences sharing the same genes? What happens when you remove the -g option?

If you are able to provide an example file for which compairr returns MH > 1, that could help us debug the issue.


LonnekeScheffer commented on September 24, 2024

(I did not mean to close this issue)


os-dima commented on September 24, 2024

Hello Lonneke! Thanks for answering :) I tried subsampling the files so as not to send the entire ones, and when taking 1000-10000 sequences (basically using head) the MH index was below 1, as it should be. Going above 100 000 sequences started giving problems. The sample files are bigger than 25 MB; I can send them by email if you want.

Taking out the -g option did not help: the numbers changed but were still >1 for 100 000+ sequences. I am using files from Omniscope's data release, so there are roughly 1 million sequences per file. The problem might be (I am not sure) that the original format is CSV and has one extra column, contig_id. Besides converting each file to TSV, I also had to add repertoire_id (I just used a unique integer per file). I am not sure whether this might cause any issues. All the other columns follow the AIRR format nomenclature and types.

I have attached the example output and log file of the bad results.

file1file2._log.txt
file1ile2.txt


LonnekeScheffer commented on September 24, 2024

Thanks! In principle additional columns shouldn't be a problem. Could you maybe email me a zipped version of the file, or for example a google drive download link or something similar? You can send it to [email protected]


os-dima commented on September 24, 2024

Sent you an email with the data!


LonnekeScheffer commented on September 24, 2024

Hi again,

It does indeed look like your files contain a large number of duplicate sequences. File 1 has 499999 entries, of which 418094 are unique when considering the columns junction_aa, v_call and j_call, and only 387430 are unique when considering the junction_aa column alone. For file 2, the numbers are 499999, 417415 and 384230 respectively. CompAIRR simply counts how many sequences match between the repertoires, so if a given sequence occurs on three different lines in file 1 and on two lines in file 2, the number of matching clonotypes counted is 2*3=6 instead of 1.
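
As an illustration of the counting described above, here is a toy sketch. It is not CompAIRR's actual implementation: the function name `count_line_matches` and the sequences are made up for this example, and matching is assumed to be exact on the amino acid sequence only.

```python
from collections import Counter


def count_line_matches(lines1, lines2):
    """Count exact matches between every line of lines1 and every line of lines2."""
    counts1 = Counter(lines1)
    counts2 = Counter(lines2)
    # A sequence on n lines in one file and m lines in the other yields n * m matches.
    return sum(n * counts2[seq] for seq, n in counts1.items())


# Hypothetical junction_aa values: one shared sequence, present 3 times and 2 times.
file1 = ["CASSLGQETQYF"] * 3 + ["CASSPTGELFF"]
file2 = ["CASSLGQETQYF"] * 2 + ["CASRDNEQFF"]

with_duplicates = count_line_matches(file1, file2)
without_duplicates = count_line_matches(set(file1), set(file2))
print(with_duplicates, without_duplicates)
```

With the duplicated lines the single shared sequence is counted six times. After deduplication it is counted once.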

The solution would be to preprocess your files in order to remove those duplicates. I ran a quick test in which I kept only the repertoire_id and junction_aa columns and removed duplicates; this yielded a Morisita-Horn index of 0.1016692745. If you want to use sequence frequency information, you will need to sum the values in the duplicate_count column when collapsing duplicates. Alternatively, you can run CompAIRR with the -f flag to ignore sequence frequency information (this is what I did for the test).

Python code snippet I used for removing duplicates (ignoring frequency info), for your convenience:

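import pandas as pd

# Read only the columns needed for matching on the amino acid sequence
df = pd.read_csv("file1.tsv", sep="\t", usecols=["repertoire_id", "junction_aa"])
# Drop repeated (repertoire_id, junction_aa) rows; frequency information is discarded
df.drop_duplicates(inplace=True)
df.to_csv("file1_nodup.tsv", sep="\t", index=False)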

I hope this was helpful!


os-dima commented on September 24, 2024

Understood, but wouldn't it take away the whole idea of computing the diversity between the two samples? Taking out replicates gives all the CDR3s a uniform distribution, thus making the calculation almost equal to a Jaccard distance, no? I am probably misunderstanding something, sorry for the confusion. CompAIRR uses umi_count as the count in the MH formula, right?


LonnekeScheffer commented on September 24, 2024

CompAIRR uses the column named duplicate_count, not umi_count (see the description of the input files in the documentation: https://github.com/uio-bmi/compairr#input-files).

When applying CompAIRR to a dataset, the input file(s) should not contain duplicates. If you want to represent how often a given clone occurred in a repertoire, this should be done by specifying the value in the duplicate_count column. Whether that number means number of reads, or number of cells does not matter. The only thing that matters is that for one clone, there is one line in the input file. Having multiple lines containing the same sequence simply does not result in a sensible analysis, because the number of matches will exceed the total expected number of matches possible. This is why I suggested collapsing together duplicates, by keeping only one entry per clone and summing the duplicate_count values of all the duplicated entries in the input file. Duplicate entries commonly happen when clones have a different nucleotide sequence but the same amino acid sequence. But when doing matching on amino acid level, all amino acid sequences in the input file must be unique.
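
The collapsing step described above can be sketched with pandas as follows. This is a minimal sketch under stated assumptions: the toy data are made up, and matching is assumed to be on junction_aa only. If genes are also used for matching, v_call and j_call would presumably need to be added to the grouping keys.

```python
import pandas as pd

# Hypothetical repertoire: the first sequence appears on three lines.
df = pd.DataFrame({
    "repertoire_id": ["1", "1", "1", "1"],
    "junction_aa": ["CASSLGQETQYF", "CASSLGQETQYF", "CASSPTGELFF", "CASSLGQETQYF"],
    "duplicate_count": [5, 2, 7, 1],
})

# Keep one line per (repertoire_id, junction_aa) and sum the counts of the duplicates.
collapsed = (
    df.groupby(["repertoire_id", "junction_aa"], as_index=False)["duplicate_count"]
    .sum()
)
print(collapsed)
```

The result has one line per clone, with duplicate_count holding the summed count of all the lines that were merged.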

Morisita-Horn can be calculated with or without frequency information. Consider the formula on Wikipedia (https://en.wikipedia.org/wiki/Morisita%27s_overlap_index).
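
For reference, Horn's form of the overlap index can be written as follows. Here x_i and y_i are the numbers of observations of species i in the two samples, X and Y are the total numbers of observations, and S (a symbol introduced here for illustration) is the number of distinct species:

```latex
C_H = \frac{2 \sum_{i=1}^{S} x_i y_i}
           {\left( \dfrac{\sum_{i=1}^{S} x_i^2}{X^2} + \dfrac{\sum_{i=1}^{S} y_i^2}{Y^2} \right) X Y}
```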

- When including frequency information, the number of observations of each 'species' (xi and yi; in AIRR analysis, the species are the clones) is the value from the duplicate_count column. The total repertoire sizes (X and Y) are the sums of the duplicate_count columns of file 1 and file 2.
- When excluding frequency information (-f flag), the values of xi and yi are always 1, and the total repertoire sizes are the numbers of clones (i.e., the number of lines in each file).
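
The two modes can be compared with a small sketch of the Morisita-Horn formula. This is an illustration, not CompAIRR's code: the function name `morisita_horn`, the `use_frequencies` parameter, and the toy counts are all hypothetical, and each repertoire is assumed to be already collapsed to one entry per clone.

```python
def morisita_horn(counts1, counts2, use_frequencies=True):
    """Morisita-Horn index for two repertoires given as {sequence: duplicate_count}."""
    if not use_frequencies:
        # Equivalent of ignoring frequency information: every clone counts once.
        counts1 = {seq: 1 for seq in counts1}
        counts2 = {seq: 1 for seq in counts2}
    X = sum(counts1.values())  # total repertoire size of sample 1
    Y = sum(counts2.values())  # total repertoire size of sample 2
    # Numerator term: products of counts over the clones shared by both samples.
    shared = sum(n * counts2[seq] for seq, n in counts1.items() if seq in counts2)
    d1 = sum(n * n for n in counts1.values()) / X**2
    d2 = sum(n * n for n in counts2.values()) / Y**2
    return 2 * shared / ((d1 + d2) * X * Y)


# Hypothetical repertoires sharing clones "A" and "B" with very different abundances.
rep1 = {"A": 10, "B": 1, "C": 1}
rep2 = {"A": 1, "B": 10, "D": 1}

mh_with = morisita_horn(rep1, rep2)
mh_without = morisita_horn(rep1, rep2, use_frequencies=False)
print(mh_with, mh_without)
```

When all counts are 1, the expression reduces to 2|A∩B| / (|A| + |B|), where A and B are the sets of clones. This is the Sørensen-Dice coefficient, so the frequency-free variant is closely related to, but not identical to, a Jaccard-type overlap.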


os-dima commented on September 24, 2024

Thanks for the explanation! I meant duplicate_count before, I just have it as umi_count in my head. I'll clean the data and adjust it to the needs of my experiment. The first case, with the observed number of species, is the one I am doing, but I can't think of a situation where xi and yi == 1 would be preferred for the analysis. Can you give an example, please? Thanks for your time and help :)

Cheers,
Dmytro Pravdyvets


LonnekeScheffer commented on September 24, 2024

It's just a matter of the particular dataset (experimental methods, preprocessing) and research question. Sometimes accurate frequency information is not available or relevant.

I will close this issue now since it is resolved, but feel free to reach out if anything more arises.
