Git Product home page Git Product logo

Comments (10)

LonnekeScheffer avatar LonnekeScheffer commented on September 24, 2024

Dear Dima,
This is unexpected behaviour and not something I am able to reproduce with the test files I have here, so I suspect something unexpected is happening in the input file(s). The fact that the MH value exceeds 1 seems to indicate that there are a lot more matches being found than expected. Are there perhaps duplicate sequences available in your input files? I see you specified -g (ignoring genes), could there be duplicate junction sequences sharing the same genes? What happens when you remove the -g option?

If you are able to provide an example file for which compairr returns MH > 1, that could help us debug the issue.

from compairr.

LonnekeScheffer avatar LonnekeScheffer commented on September 24, 2024

(I did not mean to close this issue)

from compairr.

os-dima avatar os-dima commented on September 24, 2024

Hello Lonneke! Thanks for answering:) I was trying to sample the files not to send the entire ones and when taking 1000-10000 sequences (basically using head) Mh index was as it should be below 1. Going above 100 000 started giving problems. Sample files are bigger than 25 mb, I can send them by email if you want.

Taking out the -g option did not help, the numbers simply changed but still were >1 for 100 000+ sequences. I am using files from Omniscope's data release so there is roughly 1million sequences per file. The problem might be (not sure) is that the original format is a csv and has one extra column conitg_id. Besides transforming a file into a tsv I also had to add repertoire_id (I just put unique integer per file). Not sure if this might cause any issues. All the other columns follow AIRR format nomenclature and type.

I have attached the example output and log file of the bad results.

file1file2._log.txt
file1ile2.txt

from compairr.

LonnekeScheffer avatar LonnekeScheffer commented on September 24, 2024

Thanks! In principle additional columns shouldn't be a problem. Could you maybe email me a zipped version of the file, or for example a google drive download link or something similar? You can send it to [email protected]

from compairr.

os-dima avatar os-dima commented on September 24, 2024

Sent you an email with the data!

from compairr.

LonnekeScheffer avatar LonnekeScheffer commented on September 24, 2024

Hi again,

It looks indeed like your files contain a large number of duplicate sequences. File 1 has 499999 entries, of which 418094 unique sequences when considering columns junction_aa, v_call and j_call, and only 387430 unique entries when only considering the junction_aa column. For file 2, the numbers are 499999, 417415 and 384230 respectively. CompAIRR simply counts how many sequences between the repertoires are matching, so if a given sequence occurs on three different lines in file 1 and twice in file 2, the number of matching clonotypes that is counted is 2*3=6 instead of 1.

The solution would be to preprocess your files in order to remove those duplicates. I ran a quick test where I only kept the repertoire_id and junction_aa columns and removed duplicates, this yielded for me a Morisita-Horn index of 0.1016692745. If you want to use sequence frequency information, you will need to sum the values in the duplicate_count column when collapsing duplicates. Alternatively you can run CompAIRR with the -f flag to ignore sequence frequency information (this is what I did for the test).

Python code snippet I used for removing duplicates (ignoring frequency info), for your convenience:

import pandas as pd

df = pd.read_csv("file1.tsv", sep="\t", usecols=["repertoire_id", "junction_aa"])
df.drop_duplicates(inplace=True)
df.to_csv("file1_nodup.tsv", sep="\t", index=False)

I hope this was helpful!

from compairr.

os-dima avatar os-dima commented on September 24, 2024

from compairr.

LonnekeScheffer avatar LonnekeScheffer commented on September 24, 2024

CompAIRR uses the column named duplicate_count, not umi_count (see the description of the input file in the documentation)

When applying CompAIRR to a dataset, the input file(s) should not contain duplicates. If you want to represent how often a given clone occurred in a repertoire, this should be done by specifying the value in the duplicate_count column. Whether that number means number of reads, or number of cells does not matter. The only thing that matters is that for one clone, there is one line in the input file. Having multiple lines containing the same sequence simply does not result in a sensible analysis, because the number of matches will exceed the total expected number of matches possible. This is why I suggested collapsing together duplicates, by keeping only one entry per clone and summing the duplicate_count values of all the duplicated entries in the input file. Duplicate entries commonly happen when clones have a different nucleotide sequence but the same amino acid sequence. But when doing matching on amino acid level, all amino acid sequences in the input file must be unique.

Morisita-Horn can be calculated with or without frequency information. Consider the formula on wikipedia.

  • When including frequency information, the observed number of 'species' (xi and yi) are the values from the column duplicate_count. The total repertoire sizes (X and Y) are the sums of the duplicate_count columns of file 1 and 2.
  • When excluding frequency information (-f flag), the values for xi and yi are always 1, and the total repertoire sizes are the number of clones (i.e., the number of lines in the file).

from compairr.

os-dima avatar os-dima commented on September 24, 2024

from compairr.

LonnekeScheffer avatar LonnekeScheffer commented on September 24, 2024

It's just a matter of the particular dataset (experimental methods, preprocessing) and research question. Sometimes accurate frequency information is not available or relevant.

I will close this issue now since it is resolved, but feel free to reach out if anything more arises.

from compairr.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.