Transcriptome alignment (nanoseq issue, 6 comments, closed)

lwratten commented on August 21, 2024

I had a brainstorm and came up with a few possible ways to do this:

1. Have a `--align_transcriptome` flag or similar; when it is activated, assume all reference fasta files are transcriptomes.
2. Allow a transcriptome to be specified in the samplesheet by entering `transcriptome path/to/ref.fa` in the genome column, i.e.

sample,fastq,barcode,genome
K562_RUN1_REP3,,1,transcriptome path/to/ref.fa
HEPG2_RUN3_REP5,,2,path/genome.fa
CALIBRATION_RUN,,3,

In this case we could parse the genome column to work out which references are transcriptomes and which are genomes.
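As a rough illustration of that parsing step, here is a minimal Python sketch. It assumes the marker is the literal word `transcriptome` followed by whitespace and a path, as in the example rows; the function names `classify_reference` and `classify_samplesheet` are hypothetical and not part of nanoseq.

```python
import csv
import io

MARKER = "transcriptome"  # keyword taken from the example samplesheet


def classify_reference(genome_field):
    """Return (kind, path) for one entry of the genome column."""
    value = genome_field.strip()
    if not value:
        return ("none", None)  # no reference given for this sample
    parts = value.split(None, 1)  # split on the first run of whitespace
    if parts[0] == MARKER:
        if len(parts) < 2:
            raise ValueError("'transcriptome' marker given without a path")
        return ("transcriptome", parts[1].strip())
    return ("genome", value)


def classify_samplesheet(text):
    """Map each sample name to its classified reference."""
    rows = csv.DictReader(io.StringIO(text))
    return {row["sample"]: classify_reference(row["genome"]) for row in rows}


SAMPLESHEET = """\
sample,fastq,barcode,genome
K562_RUN1_REP3,,1,transcriptome path/to/ref.fa
HEPG2_RUN3_REP5,,2,path/genome.fa
CALIBRATION_RUN,,3,
"""

if __name__ == "__main__":
    for sample, ref in classify_samplesheet(SAMPLESHEET).items():
        print(sample, ref)
```

One drawback this makes visible: a genome path that itself begins with the word `transcriptome` followed by a space would be misclassified, which is an argument for a separate column.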

3. Have a 5th column in the samplesheet for a gtf or a transcriptome fasta:

sample,fastq,barcode,genome,transcript
K562_RUN1_REP3,,1,,path/to/ref.fa
HEPG2_RUN3_REP5,,2,path/genome.fa,path/to/annot.gtf
CALIBRATION_RUN,,3,,

In this case the 5th column would be optional. The logic would be as follows:

if 5th col exists:
     if gtf:
          check the genome file exists
          if minimap2:
               convert to bed12
               perform transcript aware genome alignment using `--junc-bed` flag
          if graphmap2:
               perform transcript aware genome alignment using `--gtf` flag
     else if fa:
          perform transcriptome alignment
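The branching above could be sketched in Python roughly as follows. This is only an illustration: it returns a list of planned steps instead of running any tool, it assumes the file type can be told from the extension (`.gtf` versus `.fa`/`.fasta`), and it assumes an empty 5th column falls back to plain genome alignment. The name `plan_alignment` and the `exists` parameter are hypothetical.

```python
import os


def plan_alignment(genome, transcript, aligner, exists=os.path.exists):
    """Return the planned steps for one samplesheet row.

    `exists` is injectable so the check can be stubbed in tests.
    """
    genome = (genome or "").strip()
    transcript = (transcript or "").strip()

    # Assumption: no 5th-column entry means ordinary genome alignment.
    if not transcript:
        return ["genome alignment"] if genome else []

    lower = transcript.lower()
    if lower.endswith(".gtf"):
        # A gtf is only useful alongside a genome fasta.
        if not genome or not exists(genome):
            raise FileNotFoundError(
                f"gtf supplied but genome file not found: {genome!r}"
            )
        if aligner == "minimap2":
            return [
                "convert gtf to bed12",
                "transcript-aware genome alignment with --junc-bed",
            ]
        if aligner == "graphmap2":
            return ["transcript-aware genome alignment with --gtf"]
        raise ValueError(f"unsupported aligner: {aligner!r}")

    if lower.endswith((".fa", ".fasta")):
        return ["transcriptome alignment"]

    raise ValueError(f"unrecognised transcript file type: {transcript!r}")


if __name__ == "__main__":
    print(plan_alignment("", "path/to/ref.fa", "minimap2"))
```

Returning a plan keeps the decision logic testable on its own, separately from whatever actually invokes the aligner.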

Keen to get an agreement on how we should implement this as well as #31 so we can start developing.

drpatelh commented on August 21, 2024

This is a tricky one, especially since different samples in the samplesheet could have different genomes/transcriptomes.

I think we should just have an additional transcriptome entry in the samplesheet. This could either be a fasta transcriptome or a gtf file which we can use to extract the transcripts from the genome fasta file. That's the most flexible option.

- If genome is present and transcriptome is not, map to the genome. If it's an iGenomes reference, then we get the gtf automatically and generate the transcriptome on the fly.
- If genome and transcriptome are both present, use the transcriptome, but we will have to make sure the transcriptome is a fasta and not a gtf.
- If genome isn't present and transcriptome is, use the transcriptome.
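These three rules can be written down as a small decision function. The sketch below is an assumption-laden illustration, not the pipeline's code: `choose_reference` is a hypothetical name, a gtf is detected purely by its `.gtf` extension, the iGenomes case is not modelled, and the case where neither entry is given (which the rules above do not cover) simply returns `None`.

```python
def choose_reference(genome, transcriptome):
    """Pick the alignment target for one sample.

    Returns ("genome", path), ("transcriptome", path), or None when
    neither entry is provided.
    """
    genome = (genome or "").strip()
    transcriptome = (transcriptome or "").strip()

    # Rule 1: genome only, so map to the genome.
    if genome and not transcriptome:
        return ("genome", genome)

    # Rule 2: both present, so prefer the transcriptome, which must be a fasta.
    if genome and transcriptome:
        if transcriptome.lower().endswith(".gtf"):
            raise ValueError(
                "transcriptome must be a fasta, not a gtf, "
                "when it is used as the alignment target"
            )
        return ("transcriptome", transcriptome)

    # Rule 3: transcriptome only, so use it.
    if transcriptome:
        return ("transcriptome", transcriptome)

    # Not covered by the rules: nothing to align against.
    return None


if __name__ == "__main__":
    print(choose_reference("path/genome.fa", ""))
    print(choose_reference("path/genome.fa", "path/to/ref.fa"))
    print(choose_reference("", "path/to/ref.fa"))
```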

Working out and validating whether we need to use a gtf or a fasta for the transcriptome will involve quite a bit of refactoring, I suspect.

How does that sound?

lwratten commented on August 21, 2024

I think that sounds good!
I also feel like it's going to require a lot of refactoring, but the flexibility and functionality it gives our pipeline will be worth it, especially for downstream steps like nanopolish etc. where transcriptome alignment is required.

drpatelh commented on August 21, 2024

Fixed in #46

drpatelh commented on August 21, 2024

Some things left to do:

- There may still be some bugs in the logic, so it will need extensive testing with different entries for genome and transcriptome, and with the different `--skip` flags, to check that the channels are all defined properly.
- Add detailed documentation.

drpatelh commented on August 21, 2024

Additional tests have been added to GitHub Actions to cover this testing. Extensive documentation was also added in #57.
