Hi <a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

output knitted and corrected barcodes to fastq about pheniqs HOT 7 OPEN

biosails commented on May 27, 2024

output knitted and corrected barcodes to fastq

from pheniqs.

Comments (7)

moonwatcher commented on May 27, 2024 1

Hi @kh49

I am so glad you find Pheniqs useful with scifi-RNA-seq. At the moment Pheniqs only writes decoded (error corrected) barcodes to SAM AUX tags. That said, there have been plans to add the functionality you describe since it can potentially make for a more seamless integration with downstream tools. I am looking into enhancing the token and transform syntax to allow what you are asking for.

I will update this ticket soon with additional details.

Regards,
Lior

from pheniqs.

kh49 commented on May 27, 2024 1

@moonwatcher Thank you! I will give this a try in the next few days and get back to you. I'm working on Linux and should be fine building from source, but I'll let you know if I have issues.

As far as the quality scores go, I think there would be merit in keeping both options you mentioned if it is not too much work. I think the fixed score case would be good enough in most cases, especially with the ability to tweak and fine tune correction parameters and error sensitivity in pheniqs. The second option sounds more technically correct. Maybe you can add a parameter to the config that lets users set a phred quality cutoff they'd like to use for the corrected barcodes. Then during error correction, barcodes with error probability lower than the set acceptable threshold would get assigned some quality score scaled between the maximum and the cutoff. Barcodes with error probability that is too high would get assigned some quality score scaled between the minimum and the cutoff.

from pheniqs.

moonwatcher commented on May 27, 2024

Hi @kh49

I have a version for you test. Are you comfortable building from source? are you testing on MacOS or Linux? alternatively I can provide instructions to build a static version with the builder.

35966b8

I added the option to address the decoded (error corrected) barcode sequence as one long, concatenated, sequence. So even if you had 3 cellular barcodes and each had two segments the tokens will be addressing the result as one long set of coordinates. To use it simply substitute the input segment index in the token with s for sample, c for cellular and m for molecular.

For instance s:: is the entire sample barcode. c::10 is the first 10 nucleotides of the cellular barcode.

There is one lingering issue... Since we are now pretending the decoded, or error corrected, barcode is a real sequence and allow to embed it in the output read, the error corrected nucleotides need a quality score. This is not a trivial question.

The current version (35966b8) does not modify the quality scores of the corrected nucleotides which is probably a wrong behavior. but I wanted to allow you to test this and give feedback.

One option is to provide a config option and give those corrected nucleotides a fixed quality score, say 30. You basically say: "I trust the decoding and I don't want any downstream tool to filter the read because it thinks the barcode quality is too low".

Another option is use the overall error probability of decoding the entire barcode as the new quality but that will be probably too high for one mismatch and way too low for 3 or 4 mismatches.

We are still debating it, but what will be your preferred method?

from pheniqs.

kh49 commented on May 27, 2024

Hi @moonwatcher I tested the new version on my test case and the cell barcode redirection seems to work as intended and I can manipulate it as needed for output. However, I get no output from the molecular barcode. For example, if I have the following:

    "template": {
        "transform": {
            "token": [
                "1::","c::","m::","0:8:24"
            ],
			"knit": ["2:1","3"]
        }
    }

I'm expecting the first output file/interleaved line to have the molecular barcode and cellular barcode concatenated. Since I'm using naive molecular barcode decoding, I can always designate the UMI location in the input tokens instead. However, I also noticed that when outputting to BAM, the new version does not output the RX and QX tags, but does still provide OX and BZ. I ran the same configuration (without the new token addressing) on the new version as well as the current conda release and the conda version outputs RX and QX tags as expected but the new version does not.

from pheniqs.

moonwatcher commented on May 27, 2024

Hi @kh49

Everything you said is correct :)

I initially wanted to use s|c|m for decoded and S|C|M for raw barcodes but raw barcodes are addressable on the input read segment so its redundant. The naive decoder performs no error correction, It exists mostly to allow to move the molecular barcode to an auxiliary tag. So, as you mentioned, you can just address the location on the input. However for completeness sake I added the assembly step to the naive molecular decoder so it can support the m:: tokenization. NOTICE It will be slightly slower since pheniqs will first assemble a new sequence from the input segments and than, again, tokenize that sequence.
I realize this is just a snippet but I feel compelled to mention it. in the template directive you posted you never use token 0 in the knit. Pheniqs still assembles it but never uses it, so you are just wasting clock cycles ;)
You are correct that the new version writes the molecular barcode to OX/BZ instead of RX/QX. This is a delicate nuance which is open to interpretation... I try to follow the SAM specifications as much as I can. If you look at page 5 and 6 RX/QX state Sequence bases from the unique molecular identifier. These could be either corrected or uncorrected. while OX states Raw (uncorrected) unique molecular identifier bases. So OX/BZ is always for raw while RX/QX is for either corrected or uncorrected. If you used MDD or PAMLD Pheniqs writes the raw to OX/BZ and the corrected to RX/QX, however that is rarely possible since molecular barcodes almost always come from an unknown distribution. On the other hand, I figured that having naive write them to OX/BZ leaves room for a different application to write corrected barcodes to RX/QX. Bottom line is that while it does not defy the specs to write the raw to RX/QX it might be more correct to have them in OX/BZ. Does that interfere with downstream tools? do you prefer it to revert to the previous logic (RX/QX for naive)? I am considering allowing to specify the tags in the config if that helps making the output more compatible with other tools.

Thank you for all the feedback so far. Looking forward to hear your opinion.

from pheniqs.

kh49 commented on May 27, 2024

Hi @moonwatcher, this all sounds good to me!

I think the framework you've laid out is solid. I think a quick update to the validation that can warn users when they are using the m:: form inefficiently would be enough.
Thanks! I was messing around with the config and hadn't thought about leaving pieces in. Good to know that I should clean those up in finalized configs.
For the BAM tags, I think your approach makes sense. It doesn't affect my downstream work, and it's always nice to remove some redundancy. That said, I do think allowing for tag configuration in the config would be a great way to allow for any special cases. It would be nice to have that for both input and output so you could specify additional pairs of sequence and quality tags from an input BAM to load into tokens, then output tokens to specific tag pairs as well. I have a special case with scifi-RNA-seq data where their demultiplexer creates a weird looking BAM. If you look at the example layout in that link, the final output of their BAM generation pipeline results in the quality scores of the RX and r2 tags being lumped into QX with a ~ separator. Right now, I'm using an awk script run via GNU parallel to parse the BAM and patch together new FASTQs which includes teasing out the quality scores from the QX tag. I don't know how common concatenating quality sequences with a ~ is in BAM files so it may not even be worth addressing, but general tag functionality would be great.

from pheniqs.

moonwatcher commented on May 27, 2024

Hi @kh49

Sorry for the long delay but I am still working though this.
I added a parameter corrected quality that you can add either globally on the root or at every decoder. This is the phred quality for corrected bases when the corrected code is written to the output read. The parameter takes the numeric phred value and defaults to 30, which is the character ? or a probability value of 0.001. The validator allows values from 2 to 104 for now. I added a small script to print the phred value and their corresponding characters and probabilities.

This raised another question: What should be written when the decoder fails to decode the barcode?

One option is what is currently done on the aux tags, which is to write a sequence of = which is the SAM value for unknown nucleotide (not the same as N which means any nucleotide).

Another option is to write the original sequence with the original qualities.

The read will be marked as failing quality control, as was always the case.

Still working on the other features you mentioned ;)

from pheniqs.

output knitted and corrected barcodes to fastq about pheniqs HOT 7 OPEN

Comments (7)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent