mortazavilab / talon Goto Github PK

Technology agnostic long read analysis pipeline for transcriptomes

License: MIT License

Python 95.64% R 4.28% Makefile 0.08%

pacbio oxford-nanopore transcript-quantification transcriptome

talon's Issues

Build TALON annotation database

In order to track novel transcripts from run to run, it is crucial to maintain an external structure to store and update the annotations.

TODO

Take existing GTF parsing function from main TALON script and adapt it for creating a new annotation tracker instead (outside of main TALON script). Add filtering to remove irrelevant annotations.
Add dataset-tracking table to database and update TALON input options accordingly
Write a function to output the entire annotation tracker in GTF format for viewing in the UCSC genome browser, etc.
Update main TALON script to parse the annotation instead of a GTF file
Clean up transcript identification code
Write code in main TALON script to appropriately modify the annotation.

Requirements

Allow single runs of talon that combine multiple (1) technical and (2) biological replicates in a single run.
Track gene, transcript, and exon locations
Track how many datasets the transcript has been found in to date, and probably which ones (see point about 'bad' information). We need to identify when/where we first identify a new transcript, but we probably need to identify in a separate file when we observe existing transcripts and the abundance. This is just as interesting for "real" biology as finding new ones.
- This could also be helpful as a way of generating cell-line specific annotations (i.e. filter the annotation to include only transcripts from GM12878 datasets). Yes, for both known and novel transcripts.
Track whether an entry originated in an annotation like GENCODE or if it came from a long read dataset(s). We need to assign novel transcript IDs using a counter. It's ok if we decide to "reject" transcripts later, for example because they have non-canonical junctions.
In terms of our lab, the database/annotation needs to be set up in such a way that Gaby and I can both access it and write to it. Otherwise, we won't be able to compare our datasets. This does create a potential problem though- it means that we can't both run things on it at once. Also, how would this scale if we want to share our annotations with other people? More importantly, this pipeline needs to ultimately be run by the ENCODE DCC based on our raw data. So it's important that it works in a standard environment such as AWS.
Need to protect against addition of 'bad' information, i.e. if we decide that a dataset we ran can't be trusted, then there needs to be a mechanism for removing things that were discovered during that run. Yes, but that could be accomplished by keeping track of previous runs before running additional one (e.g. versioning and storing the previous runs).
- Check for collisions when adding information, including situations where a dataset is accidentally run in duplicate. If you rerun the same sample and add new information, then it's a bug. The only way I could see this as a potential problem is if you did abundance filtering, reran the same sample and now a solo transcript was "double-counted", so watch out for that.
- May want to apply filtering (abundance > 1) before adding a novel transcript to annotation. Not optional, but necessary for new transcripts. But more importantly, only accept novel transcripts seen in biological replicates.
The format should be convenient with respect to the TALON code, i.e. quick to read in. Not only quick, but also filtered from irrelevant Gencode stuff that we should not see.
Be able to output a GTF version
Be able to get a list of every dataset that is represented in the annotation. Ultimately, we want to also track abundances in a separate file that will match what's in this annotation file.

Anticipated challenges

5' and 3' differences: How large should the difference be in order to create a new transcript entry? What if this creates problems with retroactively sub-optimal transcript matches?

Example: Candidate 1 is assigned to the known transcript A.1 because its 5' difference is within some pre-determined threshold. However, when TALON encounters Candidate 2, a new transcript is created. If we see another transcript with the same structure as Candidate 1 (labeled 3 here), will the behavior be the same as for Candidate 1, or will it be assigned to Candidate 2? The concern here is to avoid different outcomes when the order of the input changes. We need TALON to behave deterministically.

Should we ever allow people to merge annotations?
- If we do this, there needs to be a clean mechanism for remapping IDs in datasets that were processed in the past. I am not sure that it's necessary. We can assume that the annotation is run serially.
If we upgrade to a new version of GENCODE, what happens? Also, how do we migrate to a new version of the genome assembly ?

Tests

If we run the same dataset twice, then the second run should not make any changes to the database at all. Yes.
TALON should produce the same results regardless of the order of the input file, except that the assigned new IDs would be different.

Ideas

Transcript and associated information should occupy a single line in order to make parsing as simple as possible. We might want to consider the way UCSC records transcripts in their tables.
- UCSC GENCODE v24 schema: https://genome.ucsc.edu/cgi-bin/hgTables
Right now, when a novel transcript is found, it is immediately added to the data structures of the current run. If I want to screen transcripts (i.e. for min abundance) before adding them to the database, this could be done in a separate iteration over novel transcripts at the end of the run. It's ok to add them to the structure, but we want to filter what's added at the end of the run based on being seen at least once in each of two biological replicates.
When outputting to GTF, include number of datasets each transcript is found in- maybe we could color based on this in the genome browser.

Missing "exon" table in TALON database

I have been trying to figure out the structure of the TALON database created by "initialize_talon_database.py" to fully understand how the program works. However, I have run into a missing table.
Within the "exon_annotation" table a foreign key is created referring to a "exon" table, it seems however this table is never created.
The "exon_annotation" table does get filled at database initialization, is a "add_exon_table" function missing from the script by design?

CREATE TABLE exon_annotations (
ID integer,
annot_name text,
source text,
attribute text,
value text,
/* Keys */
PRIMARY KEY (ID, source, attribute),
/* Foreign keys */
CONSTRAINT Foreign_key01
FOREIGN KEY (ID)
REFERENCES exon(ID)
);

Match candidate transcripts to gene(s)

High-level

If the candidate transcript overlaps with an annotated gene, then we consider it a transcript of that gene. We could potentially allow more than one match at first and then select the one that is "better". In addition, we may want to think about how much overlap should count.

Details: How will I assign transcripts to genes in practice?

One option would be to use Bedtools intersect. This would be similar in spirit to TranscriptClean. Other ideas might include taking advantage of sorting. I did a brief Google search for "data structures for comparing numeric intervals", and the results suggest to me that there might be good ways of doing this that would not require an outside tool. For instance, there is a data structure called an interval tree that seems appropriate. This would be applicable to other areas of the program as well.
https://en.wikipedia.org/wiki/Interval_tree
This Python library seems like a useful implementation because you can store any object linked to a given interval, for instance, a hypothetical gene class object. And we can safely assume that two genes never have the exact same coordinates.
https://pypi.python.org/pypi/intervaltree

Additional considerations:

Chromosomes need to be handled separately (can't just have one position interval tree)
- Solution: Use a dict: Each chromosome key points to its own interval tree
Strand
- I'm going to ignore this for the moment, but it may need to be its own issue

I decided it would be worthwhile to create a dedicated GeneTree class to manage the dict of interval trees, and create dedicated methods for it in order to keep the code readable. I have now implemented this class.

Tasks

Write the GeneTree class
Write a GeneTree class function: given an interval, get the genes that overlap with it
Write the Gene class
Deal with the strand issue

UNIQUE constraint failed: dataset.dataset_ID

This is happening at the end of about 1/2 of my runs for talon, with other combinations of .sam files, config files, and db's working just fine.

For example, this command produces the error:

talon --f ../TranscriptClean-master/config.csv --db ds2.2.db --build ds2.2 --threads 4 --o talon_out1

and this does not:

talon --f ../TranscriptClean-master/config_2.csv --db ds2.2.db --build ds2.2 --threads 4 --o talon_out2

The difference of course being which sam file is in the config file. The sam files were all produced by TranscriptClean after mapping by minimap2. I checked for non-unique read headers, there are none.

Any help on tracking down this error would be helpful.

SAM file lacks an MD tag

Hi, I tried to use TALON to process the alignment file I got from minimap2. It's a sam file.
python talon.py \ --f /private/.../TALON/hg38mini1BD.csv \ --db /private/.../TALON/hg38v27_talon.db \ --build hg38v27 --o /private/.../TALON/hg38tal1BD

The csv file has the alignment file name. however when I ran it, it gave me

Traceback (most recent call last):
  File "talon.py", line 2475, in <module>
    main()
  File "talon.py", line 2452, in main
    struct_collection, outprefix)
  File "talon.py", line 1694, in process_all_sam_files
    exon_annotations, abundance = annotate_sam_transcripts(sam, d_name, cursor, struct_collection, o)
  File "talon.py", line 1919, in annotate_sam_transcripts
    qc_metrics = check_read_quality(line, struct_collection)
  File "talon.py", line 2069, in check_read_quality
    raise ValueError("SAM transcript %s lacks an MD tag" % read_ID)
ValueError: SAM transcript 2d415520-3a14-44b9-af3c-1962ab413e36 lacks an MD tag

I am not sure what is happening here...

Move gene length check to annotation building step instead of TALON proper

Comparing query transcripts to annotation on a per-exon basis instead of per-transcript

In the course of comparing my results to MatchAnnot, I have become aware that it is not enough to assign the query to the best gene match and then do transcript comparisons from there. I need to be directly comparing exons. I think it is appropriate to create a new branch for this: exon-wise-comparison.

Create an exon tree data structure instead of a gene structure.
Create an exon class to populate the exon tree
Create method to identify known transcripts
- Show that it is correct (consistent with MatchAnnot)
Implement Match class that will be stored inside MatchTracker
Add smooth reporting of 5' and 3' differences at transcript ends
Write a method to create novel exons from query
Write a test for the 3' and 5' difference method

Making transcript identity assignments with adaptive exon searching

This is a fairly difficult problem. The first version I implemented was greedy, seeking to choose the most closely matched exon for every query exon in the sam transcript. Here is an example that illustrates why this doesn't work.

Transcript c33255/f1p0/3238: The exon coordinates are as follows:
827670-827775
829003-829104
847654-847806
849484-849602
851927-852110
852671-852766
853391-853424
853474-855121
856449-857242
When we try to match the very first exon, the greedy match is 827669-827775, which is found only in annotated transcript ENST00000609139.5. Visually, we can see in the UCSC genome browser shot below that there were other options. For instance, the first four exons of ENST00000608189.4 (second transcript) match the query.

Which types of genes should we allow in the GENCODE GTF file?

Should we be filtering out things like snoRNAs, miRNA, etc. At first, my plan was to stick to protein coding genes and lncRNAs, but then I visited the GENCODE page and found out that there is an absurd number of biotypes.

https://www.gencodegenes.org/gencode_biotypes.html

How do I use the create talon report R scrip

Sorry for all the questions.

I would like to use the R script to generate plots from my db generated through talon scripts.

I am not so familiar and I do not know how to input my files.

Develop the Transcript class

High Level

The Transcript class needs to

Allow fast comparison of transcripts on the level of "are they identical or not"
Be flexible- some users may care about differences at the 5' end and others may not (for example)
Be able to handle data from the GTF annotation or from the SAM file so that transcripts from these two sources can be readily compared

See #1 for additional considerations.

Details

Thoughts

Develop SamTranscript as a subclass of Transcript, with additional attributes for storing SAM information
I think it will be more appropriate to define transcripts in terms of their exons (instead of their introns) because GENCODE is based around exons. The SAM files are more intron-oriented, but it shouldn't be hard to convert over. Could use the CIGAR string
Initialization should not require gene information as a mandatory field, because we won't have this information right away for the SAM transcripts
Start and end relative to assigned gene: At first, I thought I should make it a requirement that the transcript be situated entirely within the gene start and end. But I actually think that assumption may be violated with novel transcripts on occasion.
Start, end, and strand should probably be mandatory initialization fields
It would be really cool if I had a way of computing the 5' UTR and/or 3' UTR length of a transcript object

Tasks

Write basic class structure
Write subclass for SAM transcripts
Write code to create transcript object from GTF entry
Implement string representation function (really simple for now)

GENCODE exon is zero basepairs long

ENSE00003758511.1

Novel Case: Handling sub-transcript case where all of the query exons match a transcript, but query has fewer exons

Case: c13526/f1p3/2925 (INPP5B)
Because the query transcript is shorter than the annotated ones, it is impossible to tell which isoform it came from. But more importantly than that, we need to make sure that we don't label this as (completely) known.

Need to return both of the best gene matches for a transcript if the distances are equal [FIXED]

Discovered this problem while working with #15

Clean up GTF parsing code by delegating sanity checks to a separate function

failed at test unit and integration

I am having issues getting python3.7 installed and currently version is 3.6.0 for python3.

make tests works with some warning,

but then make unit and make integration fail for most tests

I attach output.

Is this to be expected without using python3.7?

In any case I can get talon scripts to run and output in same environment
error_output.txt

Write a function to check whether a sam_transcript matches any transcripts on file for a gene

Transcript that maps well at first, but has a large 3' discrepancy

When I started running TALON on the PB36 ROIs (pre-TranscriptClean), I found the following transcript because it was showing up as a known match with a 3' difference of -14894

m171202_193234_42198_c101417902550000001823303602281843_s1_p0/124708/718_57_CCS

Also this: m171202_130838_42198_c101417902550000001823303602281842_s1_p0/67217/29_1817_CCS
3' difference of -32724

At first I thought they were bugs. But I think what is really happening is that the exons in the query only match the transcript on file that also happens to have a really long 3' UTR. Our query ends much earlier, resulting in the big 3' difference.

talon_generate_report error

Hi there,

I tried to run talon and everything run smoothly, but when I get to the point to run 'talon_generate_report', I got the following issue even though I have already installed all the necessary libraries in R.

$ talon_generate_report
Loading required packages...
One of the required packages is missing.
Here's the original error message:
there is no package called ‘DBI’Error in theme_bw(base_family = "Helvetica", base_size = 18) :
could not find function "theme_bw"
Calls: main ... custom_theme -> suppressMessages -> withCallingHandlers
Execution halted

I also couldn't find the exact arguments for running talon_generate_report. Is there somewhere in the documentation?

Thank you in advance!

Change GTF processing to allow exons that don't belong to a gene or transcript

Ali and I agreed that this is a good feature to have, but it does make filtering genes more complicated since a lot of these exons are less than 200 bp. And spike-ins fall into this category.

Database counter for 'dataset' does not match the number of entries in the table.

Hey,

I am testing out the new TALON version 4.4 with the same testset I used for 4.2 (which passes).
However, I am encountering the message in the title.

Database counter for 'dataset' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 2
counter_value: 1
Traceback (most recent call last):
  File "/home/miniconda/envs/py37/bin/talon", line 8, in <module>
    sys.exit(main())
  File "/home/miniconda/envs/py37/lib/python3.7/site-packages/talon/talon.py", line 2411, in main
    update_database(database, batch_size, run_info.outfiles, dataset_db_entries)
  File "/home/miniconda/envs/py37/lib/python3.7/site-packages/talon/talon.py", line 1786, in update_database
    check_database_integrity(cursor)
  File "/home/miniconda/envs/py37/lib/python3.7/site-packages/talon/talon.py", line 2087, in check_database_integrity
    raise RuntimeError("Discrepancy found in database. " + \
RuntimeError: Discrepancy found in database. Discarding changes to database and exiting...

I checked the database using DB Browser for SQLite to see if the "dataset" table has 2 dataset entries, but it contains just one. My config file also contains just one dataset entry. Any idea how these numbers are not matching up?

Process SAM file of transcripts

I will need to read in transcripts from SAM files and create representations of them that allow for comparisons against the annotation.

I think it would make sense for the transcripts to belong to the same class as the annotation-derived transcripts. Or at least be a subclass of them.

Tasks

Superficial SAM reading code that opens the file and goes line by line
Write and test a function that computes the end position of the transcript
Code to get introns from transcript (straight from jI field, or compute them if missing)
Code to get exon structure of a transcript. Similar to TranscriptClean.
Decide how we are going to treat multi-mapping transcripts. I could imagine a version of this where we allow multi-mapping by effectively doing several interval searches against the genes. Then, the gene lists could be combined and processed just like any other gene overlaps would be (i.e. trying to find the best overlap). But it might be simpler at first to exclude them like we do in TranscriptClean. It mostly matters from a data structure-ID point of view.

Develop the Gene class

Tasks

Write basic infrastructure of the class, i.e. documentation, initialization, print
Choose how to store transcripts belonging to the gene
For now, each gene has a dictionary of transcript ids mapped to transcript objects.

Interesting set of GM12878 transcripts for exon comparison/testing

While selecting test transcripts for the compute_transcript_end function, I stumbled upon this pair in GM12878_chr1_clean.sam: c30124/f1p15/2713 and c2986/f14p15/2748. It's interesting to me because the internal exon boundaries match, but there is a difference at the 3' end. The 5' ends are the same.

build_annotation crashes on pseudoautosomal region

Because I added exon validity checks, the build_annotation script now crashes on the pseudoautosomal region because there are two exons on different chromosomes that have the same gencode exon ID (even though their transcript and gene IDs are different). The ID is ENSE00001702291.1. Because of this, I am moving over to using chrome_start_end as an ID instead, with the gencode ID stored as a name instead.

Write a GeneTree class function: given an interval, get the genes that overlap with it

Novel cases: All query exons match a transcript, but the annotated transcript contains additional exons

This case is really two cases:

The exon matches are contiguous #23
The exon matches are not contiguous #24

Write function to get the best gene match for an input SAM transcript

It seems to be fairly common for transcripts to overlap more than one gene, but I suspect that the genes in question are not protein-coding. I may run TALON on a gencode file that excludes non-poly-A transcripts, but even so, it may be better to write a function that takes into account the possibility of multiple gene overlaps (with some pretty bad matches). My concern here is mostly for transcripts belonging to a novel gene. We don't want to accidentally assign them to a snoRNA or something just because it's the only overlap there is.

Write tests (At least for the overlap function)
Complete the function

GENCODE has occasional gene duplicates. How should this be handled?

When I added a check in the GeneTree add_gene class function to look for existing genes with the input gene_name before adding a new gene, I unexpectedly got a hit:

python talon.py --f test_files/GM12878_chr1_clean.sam -g test_files/gencode.v24.annotation.chr1.gtf
Traceback (most recent call last):
  File "talon.py", line 92, in <module>
    main()
  File "talon.py", line 84, in main
    read_gtf_file(gtf_file)
  File "talon.py", line 50, in read_gtf_file
    genes.add_gene_from_gtf(tab_fields)
  File "/dfs1/bio/dwyman/pacbio_f2016/TALON/genetree.py", line 103, in add_gene_from_gtf
    self.add_gene(gene_name, chromosome, start, end, strand)
  File "/dfs1/bio/dwyman/pacbio_f2016/TALON/genetree.py", line 71, in add_gene
    " is duplicated.")
KeyError: 'Gene names must be unique. TP73-AS1 is duplicated.'

I decided to look for this gene in the annotation to see what might be going on

awk '{if($3 == "gene") print $0}' test_files/gencode.v24.annotation.chr1.gtf | grep "TP73-AS1"
chr1	HAVANA	gene	3735601	3747336	.	-	.	gene_id "ENSG00000227372.10"; gene_type "antisense"; gene_status "KNOWN"; gene_name "TP73-AS1"; level 2; havana_gene "OTTHUMG00000003414.4";
chr1	ENSEMBL	gene	3736943	3737103	.	+	.	gene_id "ENSG00000276189.1"; gene_type "misc_RNA"; gene_status "NOVEL"; gene_name "TP73-AS1"; level 3;

Sure enough, these genes share the same name. They have different gene_id fields though. This means that one way around the problem is to use the gene_id instead of the gene_name to guard against collisions. I generally don't like using gene_ids because they are so difficult to read, but it is probably the responsible thing to do since they are must more precise than gene names. I can always retain use of the gene_name in the object printing functions.

Solution:

I modified the Gene and GeneTree classes so that they now use the gene_id as the primary gene identifier, with the gene_name included mainly for readability. The collision problem no longer happens.

Strand bug affecting application of 3' and 5' difference cutoffs

I was recovering only half of expected "score 5" matches because I didn't consider that for negative strand transcripts, the first exon processed is really its last exon. Need to handle this gracefully

Handle case where query transcript only overlaps a miRNA gene but clearly does not belong to it

Case observed in #15. This is a pathological case for MatchAnnot when the transcript in question has only one exon- the transcript erroneously gets assigned to the miRNA gene.

At present, TALON and MatchAnnot both assign c14773/f1p0/3486 to gene MIR4425, but TALON does not assign it a transcript match. I need to change this behavior so that c14773/f1p0/3486 gets assigned to a novel gene. This issue is closely related to #12.

What should I do about GTF fields beyond genes, transcripts, and exons?

In the course of working on GTF processing for the transcripts (#5), I found lines in the GTF file that went beyond the major three groups I work with (genes, transcripts, and exons). Here are some examples:

chr2	HAVANA	CDS	41611	41627	.	-	2	gene_id "ENSG00000184731.5"; transcript_id "ENST00000327669.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; exon_number 2; exon_id "ENSE00001537167.2"; level 2; protein_id "ENSP00000328347.4"; tag "basic"; transcript_support_level "1"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS42645.1"; havana_gene "OTTHUMG00000151321.1"; havana_transcript "OTTHUMT00000322220.1";

chr2	HAVANA	stop_codon	41608	41610	.	-	0	gene_id "ENSG00000184731.5"; transcript_id "ENST00000327669.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; exon_number 2; exon_id "ENSE00001537167.2"; level 2; protein_id "ENSP00000328347.4"; tag "basic"; transcript_support_level "1"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS42645.1"; havana_gene "OTTHUMG00000151321.1"; havana_transcript "OTTHUMT00000322220.1";

chr2	HAVANA	UTR	38814	41610	.	-	.	gene_id "ENSG00000184731.5"; transcript_id "ENST00000327669.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; exon_number 2; exon_id "ENSE00001537167.2"; level 2; protein_id "ENSP00000328347.4"; tag "basic"; transcript_support_level "1"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS42645.1"; havana_gene "OTTHUMG00000151321.1"; havana_transcript "OTTHUMT00000322220.1";

The simplest option of course is to just ignore them. But will I be missing important distinctions if I do? The UTR components may be particularly important.

Updating the annotation during TALON run

At the end of a TALON run, it is necessary to update the TALON SQLite database both to add novel genes/transcripts/exons, but also to update existing entries when new information comes to light.

What if we find a novel transcript for a known gene that starts further upstream/ends further downstream?
- Does not matter since genes are not explicitly attached to a particular set of locations.
Transcript entries in the database should be more or less stable.
The gene, transcript, and exon tables all have a "datasets_containing" field. If I want to keep this feature, it will need to be updated every run
- This was scrapped.
Must create and add new database entries for novel genes, transcripts, and exons

Files

ENCODE Spring Retreat, April 20, 2018 (TALON).pdf

pip install talon doesn't work?

Hello!

When I use PIP install talon, the installation is successful, but the command such as Talon is not displayed.

too many values to unpack(expected 2)

Hello,

I used talon_initialize_database, there's error.

cmd : talon_initialize_database --f Mus_musculus.GRCm38.87.gtf --a gencode.v29 --g hg38 --o example_talon 1>x1.log

Traceback (most recent call last):
File "/home/luoluo/.local/bin/talon_initialize_database", line 10, in
sys.exit(main())
File "/home/luoluo/.local/lib/python3.7/site-packages/talon/initialize_talon_database.py", line 996, in main
genes, transcripts, exons = read_gtf_file(gtf_file)
File "/home/luoluo/.local/lib/python3.7/site-packages/talon/initialize_talon_database.py", line 496, in read_gtf_file
transcript = Transcript.get_transcript_from_gtf(tab_fields)
File "/home/luoluo/.local/lib/python3.7/site-packages/talon/transcript.py", line 286, in get_transcript_from_gtf
annotations = extract_transcript_annotations_from_GTF(transcript_info)
File "/home/luoluo/.local/lib/python3.7/site-packages/talon/transcript.py", line 306, in extract_transcript_annotations_from_GTF
key, val = pair.split()
ValueError: too many values to unpack (expected 2)

is there something wrong I did???

IndexError

Hi,

I was trying to run talon on my minion data. But I am encountering index error. Bellow is the error. Can you please look into it.

Thanks

[ 2019-11-13 23:44:23 ] Started TALON run
Traceback (most recent call last):
File "/usr/local/bin/talon", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.7/site-packages/talon/talon.py", line 2346, in main
sam_files, dset_metadata = check_inputs(options)
File "/usr/local/lib/python3.7/site-packages/talon/talon.py", line 1411, in check_inputs
curr_sam = line[3]
IndexError: list index out of range

some guidance on running; cannot generate novel annotations

I was confused as to how to move through talon scripts.

I downsampled my reads to 10e4, then mapped to hg38 using minimap2 with --MD flag included to generated > aln.sam

I then used TranscriptClean but without -v or -j flags to error correct and output clean.sam

Next I used this to initialize a talon db using the initialize script feeding in gencode.v29.annotation.gtf and --g hg38_build --a hg38_annot

Following this, I used talon.py script with the db, and config.csv pointing to clean.sam file and metadata. This gave and log_QC.txt file and update the hg38.db.

Finally I ran the create_abundance_file_from_database and this gave me a .tsv file.

However when I check the columns for exon and gene, I cannot find any unknown. Even with small number of reads would expect something?

example_head.xlsx

FIXED: Novel Case: All of query exons are accounted for in the matched transcript, but the match contains additional exons (not the sub-transcript case)

Related to #23
c13856/f1p1/3056: In this case, TALON returns transcript match set(['ENST00000372597.5', 'ENST00000372596.5']) because it can't tell which of these transcripts is a better match. If we look closely, we can see that ENST00000372597.5 is better because ENST00000372596.5 contains all of the query exons, BUT it also contains one extra exon that is not in the query. Update: After looking more closely at this example, I noticed that neither of the candidates are a perfect match- they each contain a tiny exon at position 3 that the query does not.

I decided to address the spirit of the original problem by checking each match at the end to see if the query and match transcripts each contain the same number of exons. If they do, it's a match, and if not, it isn't.

Developing class structures that make sense for genes, transcripts, and exons

In designing Python classes for genes, transcripts, and exons, it is really important that there be enough complexity to allow for flexible handling of different situations, but I don't want things to become needlessly complicated.

This need is illustrated well by the following example.

Here, we have an annotated gene, A, which has an annotated transcript, A.1. Transcript A.1 has three exons, each of which have defined start and end positions. Using a long read sequencing platform, we detect four transcripts that are candidate isoforms of gene A, and which all happen to resemble transcript A.1. My software needs to be able to efficiently compare each candidate to transcript A.1, and decide whether each one is a match or not. The definition of a match is flexible. Here are some different rules that could be applied alone or in combination:

All exon start and end locations must match exactly (most strict)
- Under this framework, only Candidate 1 matches transcript A.1
Deviation is permitted at the 3' transcript end
- Candidate 1 and 2 are matches
Deviation is permitted at the 5' transcript end
- Candidate 1 and 3 are matches
Small differences are allowed on the internal exon boundaries (i.e. "fuzzy junctions")
- Candidate 1 and 4 are matches

Because of this structure, I think it is necessary to have exon entities be part of the transcript object. Then, I can have a method that creates an "exon string" representation of the transcript according to the rules in operation. If 3' variation is allowed, for instance, then the end_E3 position would be omitted from the exon string. If fuzzy junctions are permitted, then maybe I would need to create multiple permuted exon strings of transcript A.1 for the candidates to be compared to. But this could get really complicated fast for genes that have many exons, so I may need a smarter solution. For situations like that, it might make more sense to split the candidate transcript into exon "substrings" that could be matched adaptively to exons of the genes.

I can see already that I will need test cases to track the matching process under different rule structures.

I just had another thought- once we establish that a novel transcript has been found, the program should create a transcript object linked to the gene that is stored the same way that an annotated transcript would be. That way, every candidate matched to that gene from this point on will be screened against the novel transcript to see if it matches.

4/5/2018

I had a discussion with Ali, and we decided to avoid fuzzy junctions for now. I should probably still try to find out how often novel transcripts are just wobbly versions of annotated ones, but it will not drive the structure of the code. This means I can go forward with my plan to represent transcripts as objects that have a list of exons (or introns). To compare transcripts to each other, we call a class function i.e. get_transcript_string() to create a string representation of each transcript, which can be subject to different rules as outlined in the previous section.

4/6/2018

I need to work out how to deal with exons/introns without losing too much information.

4/9/2018

The basic structure is in place, but the next phase will involve more detailed exon comparisons.

Running slowly on PBS/Torque

I was wondering if there is a faster way to run this on large SAM files (>30gb) - will there be a parallel version of TALON?

How to choose best exon match for query when tot_diffs are the same

I haven't seen this case happen in my data yet, but I bet it will at some point so I should be prepared to deal with it.

Exon-based comparison tool

Hi, I noticed that there was mention of an exon-based comparison tool but cannot find that option in the talon commands. I am trying to quantify differences in 3' and 5' exon boundaries compared with known isoforms and this would be extremely helpful. Thank you!

Comparing TALON prototype results to Tofu and MatchAnnot

Tofu + STAR + MatchAnnot commands:

# Run Tofu
source activate myenv
collapse_isoforms_by_sam.py --fq --input GM12878_chr1_canonicalOnly.fq  -s sorted.GM12878_chr1_canonicalOnly.sam --dun-merge-5-shorter -o Tofu_STAR_MatchAnnot/Tofu/tofu
source deactivate

# Remap with STAR using bash script. Here is the essential command though
STARlong --runThreadN 4 --genomeDir /bio/dwyman/pacbio_f2016/data/STAR_hg38_ENCODE \
--readFilesIn /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/Tofu/tofu.collapsed.rep.fq \
--sjdbGTFfile /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
--outFileNamePrefix /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/STAR/ \
--outSAMattributes NH HI NM MD jM jI --readNameSeparator space --outFilterMultimapScoreRange 1 --outFilterMismatchNmax 2000 --scoreGapNoncan -20 --scoreGapGCAG -4 --scoreGapATAC -8 --scoreDelOpen -1 --scoreDelBase -1 --scoreInsOpen -1 --scoreInsBase -1 --alignEndsType Local --seedSearchStartLmax 50 --seedPerReadNmax 100000 --seedPerWindowNmax 1000 --alignTranscriptsPerReadNmax 100000 --alignTranscriptsPerWindowNmax 10000 --outSAMtype SAM

samtools sort Aligned.out.sam -o sorted.Aligned.out -O sam


# Run MatchAnnot
# Crashed when I didn't have --format alt. Must be some types in the v24 annotation
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
 /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/STAR/sorted.Aligned.out > \
/bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/match_annot_results.txt

TALON

python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf

Tofu + STAR + MatchAnnot Results:

Total transcripts at MatchAnnot stage: 1526
Number of transcripts assigned a score of 5: 684
- This means that the intron boundaries matched internally, but the leading or trailing 5'/3' ends may have varied
Number of transcripts with perfect 3' and 5' match: 0
Number of transcripts with perfect 3' match but not 5': 8
Number of transcripts with perfect 5' match but not 3': 5

grep "result" match_annot_results.txt | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($10 == 0 && $11 ==0 ) print $0}' | wc -l
ot_results.txt | grep "sc: 5" | awk
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($11 ==0 ) print $0}' | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($10 ==0 ) print $0}' | wc -l

TALON results (as of commit dewyman@568667cc4b1eaf59fc7b7eebd302756d5fafb1e):

Total transcripts processed: 1881
Number of transcripts with internal exon matches (MatchAnnot score 5 equivalent): 954

Unexpected case: Transcript appears to match gene, but is on the wrong strand

I found this case while working on #15. MatchAnnot assigns transcript c33225/f1p0/3407 to gene PHTF1, while TALON does not assign it to a gene at all. The reason for this discrepancy is that TALON requires the transcript and gene strand to match, while MatchAnnot does not. That's all well and good, but in this particular case, the exons line up really well, so I'm surprised that we sequenced this, especially since there are a bunch of similar transcripts on the correct strand. We can see this below. Transcript c33225/f1p0/3407 is the one in blue

Queries we might like to run on TALON database

What is the most commonly utilized 3' end for a given gene? What about 3' exon? Is this different for different datasets?
How many transcripts have 5' CAGE support?
How many genes or isoforms were shared across two experiments?
How often is a particular exon of a gene utilized in its transcripts?

Further downstream,

Given a gene and a cell type, what does the most likely transcript look like?

Storing exon coordinates in Transcript object vs exon objects

Right now, I just store the start and end positions of the exons in the transcript object. It may make more sense in the future to change it to storing the exon objects themselves. The issue will be the SAM exons, which don't graduate to objects until the query stage.

Handle case where longer gene covers more of transcript query, but shorter gene is a closer match

While working on #15, I came across the following case c18234/f1p0/3056 (red bar). Visually, we can see that the transcript matches the NAP1L4P1 gene, and indeed, that is how MatchAnnot assigns it. However, TALON assigns the transcript to the CD58 gene, which appears incorrect. This may be a pretty insidious problem to fix because of how I do my gene assignments. Maybe I need to move to a more directly exon-based approach.

I found a second such pathological case: c26372/f1p0/3232

Exact matches to FULL exon string are extremely rare

I decided to use the MatchAnnot output for PB14 to check how many sam_transcripts can be expected to exactly match an annotated transcript exon_string (including the 3' and 5' ends).

grep "result" /bio/dwyman/pacbio_f2016/GM12878/rep3/MatchAnnot/MatchAnnot_out/selection_2-10kb/selection_2-10kb_annotation | grep "sc: 5" | awk '{if($10 == 0 && $11 ==0 ) print $0}' | head

I only found two transcripts in the output:

result:   PB.435.2|chr1:206557165-206589282(+)|c23251/f37p58/3495  RASSF5                RASSF5-004                ex:  5  sc: 5  5-3:     0     0
result:   PB.2089.2|chr9:130579606-130638352(+)|c15546/f2p6/3123  FUBP3                 FUBP3-002                 ex: 19  sc: 5  5-3:     0     0

I was lucky- the first one is on chromosome 1, so it should be in my testing set. I think the transcript cluster ID is different because cluster was re-run at some point between my test set and the MatchAnnot file, but I was able to find the SAM entry in my test set using the start position

c25862/f52p88/3496	0	chr1	206557165	255	529M25575N111M1007N298M495N116M1530N2443M	*	0	0	GAACTGCTTTACGCGAGGGGCAGGAAAGGCGCGGGAGGCGGGGGAGGTGCGGAGATGGCGCTCTGCACGGCGGCGGAGGGAGGGCGCTGGCGCCGGGGACACGAAACCGCAGAGCCCGGACGAGTCAGGGAGTGAGGCGCGAGCCGGGCGCCCGGGGCTCTGCAGGCGCAGGCGGCGCGGGGACAGGAGCAGGTTACCGGGCCGCCCGAGCGCTCGCACCCCGCTGAAAGAAACGCAGGCGGCCCGCCGGCTCTGCCTGGTCCGCTACCCGACCAGCTCCCGGCTCGGGGCTCAGAGCTAGGGGCTTACGCCAAGCGGAGCCCGGGGAGGGGTGCCCACCTCCCTCCGCCGCATCCCAAGCCCGGCCCCCTTGATGCGCTGGCGGCCTCGGCCGGGAACTCCGGGGTAGATGACCGTGGACAGCAGCATGAGCAGTGGGTACTGCAGCCTGGACGAGGAACTGGAAGACTGCTTCTTCACTGCTAAGACTACCTTTTTCAGAAATGCGCAGAGCAAACATCTTTCAAAGAATGTCTGTAAACCTGTGGAGGAGACACAGCGCCCGCCCACACTGCAGGAGATCAAGCAGAAGATCGACAGCTACAACACGCGAGAGAAGAACTGCCTGGGCATGAAACTGAGTGAAGACGGCACCTACACGGGTTTCATCAAAGTGCATCTGAAACTCCGGCGGCCTGTGACGGTGCCTGCTGGGATCCGGCCCCAGTCCATCTATGATGCCATCAAGGAGGTGAACCTGGCGGCTACCACGGACAAGCGGACATCCTTCTACCTGCCCCTAGATGCCATCAAGCAGCTGCACATCAGCAGCACCACCACCGTCAGTGAGGTCATCCAGGGGCTGCTCAAGAAGTTCATGGTTGTGGACAATCCCCAGAAGTTTGCACTTTTTAAGCGGATACACAAGGACGGACAAGTGCTCTTCCAGAAACTCTCCATTGCTGACCGCCCCCTCTACCTGCGCCTGCTTGCTGGGCCTGACACGGAGGTCCTCAGCTTTGTGCTAAAGGAGAATGAAACTGGAGAGGTAGAGTGGGATGCCTTCTCCATCCCTGAACTTCAGAACTTCCTAACAATCCTGGAAAAAGAGGAGCAGGACAAAATCCAACAAGTGCAAAAGAAGTATGACAAGTTTAGGCAGAAACTGGAGGAGGCCTTAAGAGAATCCCAGGGCAAACCTGGGTAACCGGTCCTGCTTCCTCTCCTCCTGGTGCATTCAGATTTATTTGTATTATTAATTATTATTTTGCAACAGACACTTTTTCTCAGGACATCTCTGGCAGGTGCATTTGTGCCTGCCCAGCAGTTCCAGCTGTGGCAAAAGTCTCTTCCATGGACAAGTGTTTGCATGGGGGTTCAGCTGTGCCCGCCCCCAGGCTGTGCCCCACCACAGATTCTGCCAAGGATCAGAACTCATGTGAAACAAACAGCTGACGTCCTCTCTCGATCTGCAAGCCTTTCACCAACCAAATAGTTGCCTCTCTCGTCACCAAACTGGAACCTCACACCAGCCGGCAAAGGAAGGAAGAAAGGTTTTAGAGCTGTGTGTTCTTTCTCTGGCTTTGATTCTTCTTTGAGTTCTCTTACTTGCCACGTACAGGACCATTATTTATGAGTGAAAAGTTGTAGCACATTCCTTTTGCAGGTCTGAGCTAAGCCCCTGAAAGCAGGGTAATGCTCATAAAAGGACTGTTCCCGCGGCCCCAAGGTGCCTGTTGTTCACACTTAAGGGAAGTTTATAAAGCTACTGGCCCCAGATGCTCAGGGTAAGGAGCACCAAAGCTGAGGCTGGCTCAGAGATCTCCAGAGAAGCTGCAGCCTGCCCTGGCCCTGGCTCTGGCCCTGGCCCACATTGCACATGGAAACCCAAAGGCATATATCTGCGTATGTGTGGTACTTAGTCACATCTTTGTCAACAAACTGTTCGTTTTTAAGTTACAAATTTGAATTTAATGTTGTCATCATCGTCATGTGTTTCCCCAAAGGGAAGCCAGTCATTGACCATTTAAAAAGTCTCCTGCTAAGTATGGAAATCAGACAGTAAGAGAAAGCCAAAAAGCAATGCAGAGAAAGGTGTCCAAGCTGTCTTCAGCCTTCCCCAGCTAAAGAGCAGAGGAGGGCCTGGGCTACTTGGGTTCCCCATCGGCCTCCAGCACTGCCTCCCTCCTCCCACTGCGACTCTGGGATCTCCAGGTGCTGCCCAAGGAGTTGCCTTGATTACAGAGAGGGGAGCCTCCAATTCGGCCAACTTGGAGTCCTTTCTGTTTTGAAGCATGGGCCAGACCCGGCACTGCGCTCGGAGAGCCGGTGGGCCTGGCCTCCCCGTCGACCTCAGTGCCTTTTTGTTTTCAGAGAGAAATAGGAGTAGGGCGAGTTTGCCTGAAGCTCTGCTGCTGGCTTCTCCTGCCAGGAAGTGAACAATGGCGGCGGTGTGGGAGACAAGGCCAGGAGAGCCCGCGTTCAGTATGGGTTGAGGGTCACAGACCTCCCTCCCATCTGGGTGCCTGAGTTTTGACTCCAATCAGTGATACCAGACCACATTGACAGGGAGGATCAAATTCCTGACTTACATTTGCACTGGCTTCTTGTTTAGGCTGAATCCTAAAATAAATTAGTCAAAAAATTCCAACAAGTAGCCAGGACTGCAGAGACACTCCAGTGCAGAGGGAGAAGGACTTGTAATTTTCAAAGCAGGGCTGGTTTTCCAACCCAGCCTCTGAGAAACCATTTCTTTGCTATCCTCTGCCTTCCCAAGTCCCTCTTGGGTCGGTTCAAGCCCAAGCTTGTTCGTGTAGCTTCAGAAGTTCCCTCTCTGACCCAGGCTGAGTCCATACTGCCCCTGATCCCAGAAGGAATGCTGACCCCTCGTCGTATGAACTGTGCATAGTCTCCAGAGCTTCAAAGGCAACACAAGCTCGCAACTCTAAGATTTTTTTAAACCACAAAAACCCTGGTTAGCCATCTCATGCTCAGCCTTATCACTTCCCTCCCTTTAGAAACTCTCTCCCTGCTGTATATTAAAGGGAGCAGGTGGAGAGTCATTTTCCTTCGTCCTGCATGTCTCTAACATTAATAGAAGGCATGGCTCCTGCTGCAACCGCTGTGAATGCTGCTGAGAACCTCCCTCTATGGGGATGGCTATTTTATTTTTGAGAAGGAAAAAAAAAGTCATGTATATATACACATAAAGGCATATAGCTATATATAAAGAGATAAGGGTGTTTATGAAATGAGAAAATTATTGGACAATTCAGACTTTACTAAAGCACAGTTAGACCCAAGGCCTATGCTGAGGTCTAAACCTCTGAAAAAAGTATAGTATCGAGTACCCGTTCCCTCCCAGAGGTGGGAGTAACTGCTGGTAGTGCCTTCTTTGGTTGTGTTGCTCAGTGTGTAAGTGTTTGTTTCCAGGATATTTTCTTTTTAAATGTCTTTCTTATATGGGTTTTAAAAAAAAGTAATAAAAGCCTGTTGCA	*	NH:i:1	HI:i:1	NM:i:4	MD:Z:1370C1A209A7C1906	jM:B:c,21,21,21,21	jI:B:i,206557694,206583268,206583380,206584386,206584685,206585179,206585296,206586825

I looked up this transcript in the UCSC genome browser. There is an off-by-one error at the 3' end.

Transcript: 206557165-206589268
Annotation: 206557165-206589267

Hand-checked exons of transcript:

206,557,165 - 206,557,693
206,583,269 - 206,583,379
206,584,387 - 206,584,684
206,585,180 - 206,585,295
206,586,826 - 206589268

Write a test for the compute_transcript_end function

This function finds the end position of a SAM transcript with respect to the reference genome. To write a test, I will identify some SAM examples and look at them in the UCSC genome browser so that I can record the correct end positions. Then, my test will run the function on these examples and check the results against the correct answers.

mortazavilab / talon Goto Github PK

talon's Issues

TODO

Requirements

Anticipated challenges

Tests

Ideas

High-level

Details: How will I assign transcripts to genes in practice?

Tasks

High Level

Details

Thoughts

Tasks

Tasks

Tasks

Solution:

4/5/2018

4/6/2018

4/9/2018

Recommend Projects

Recommend Topics

Recommend Org