mortazavilab / talon Goto Github PK

Technology agnostic long read analysis pipeline for transcriptomes

License: MIT License

Python 95.64% R 4.28% Makefile 0.08%

pacbio oxford-nanopore transcript-quantification transcriptome

talon's Introduction

TALON

TALON is a Python package for identifying and quantifying known and novel genes/isoforms in long-read transcriptome data sets. TALON is technology-agnostic in that it works from mapped SAM files, allowing data from different sequencing platforms (i.e. PacBio and Oxford Nanopore) to be analyzed side by side.

Installation
Running TALON
Working with the TALON results
Citing TALON

Reads must be aligned to the reference genome and oriented in the forward direction (5'->3') prior to using TALON. We recommend the Minimap2 aligner - please see their GitHub page here for recommended long-read parameters by technology. Please note that TALON requires the SAM MD tag, so Minimap2 should be run with the --MD flag enabled. In principle, you can use any other long-read alignment software provided that an MD tag is generated.

We also recommend correcting the aligned reads with TranscriptClean to fix artifactual noncanonical splice junctions, though this is not strictly necessary for TALON to run.

To learn more about how TALON works, please see our preprint in BioRxiv: https://www.biorxiv.org/content/10.1101/672931v1

Installation

Newer versions of TALON (v4.0+) are designed to be run with Python 3.6+.

To install TALON, simply download the files using Github's "Download ZIP" button, then unzip them in the directory where you would like to store the program. Alternately, you can download a specific version of the program from the Releases tab.

Go to the directory and run:

pip install cython
pip install .

This will install TALON. You can now run the commands from anywhere.

NOTE: Talon versions 4.2 and lower are not installable. Check the README of those releases to see how you can run the scripts from the install directory, or visit the wiki here.

How to run

For a small, self-contained example with all necessary files included, see https://github.com/mortazavilab/TALON/tree/master/example

Flagging reads for internal priming

Current long-read platforms that rely on poly-(A) selection are prone to internal priming artifacts. These occur when the oligo-dT primer binds off-target to A-rich sequences inside an RNA transcript rather than at the end. Therefore, we recommend running the talon_label_reads utility on each of your SAM files separately to record the fraction of As in the n-sized window immediately following each read alignment (reference genome sequence). The default n value is 20 bp, but you can adjust this to match the length of the T sequence in your primer if desired. The output of talon_label_reads is a SAM file with the fraction As recorded in the fA:f custom SAM tag. Non-primary alignments are omitted. This SAM file can now be used as your input to the TALON annotator.

Usage: talon_label_reads [options]

Options:
  -h, --help            show this help message and exit
  --f=SAM_FILE          SAM file of transcripts
  --g=GENOME_FILE       Reference genome fasta file
  --t=THREADS           Number of threads to run
  --ar=FRACA_RANGE_SIZE
                        Size of post-transcript interval to compute fraction
                        As on. Default = 20
  --tmpDir=TMP_DIR      Path to directory for tmp files. Default =
                        tmp_label_reads
  --deleteTmp           If this option is set, the temporary directory
                        generated by the program will be removed at the end of
                        the run.
  --o=OUTPREFIX         Prefix for outfiles

Initializing a TALON database

The first step in using TALON is to initialize a SQLite database from the GTF annotation of your choice (i.e. GENCODE). This step is done using talon_initialize_database, and only needs to be performed once for your analysis. Keep track of the build and annotation names you choose, as these will be used downstream when running TALON and its utilities.

NOTE: The GTF file you use must contain genes, transcripts, and exons. If the file does not contain explicit gene and/or transcript entries, key tables of the database will be empty and you will experience problems in the downstream analysis. Please see our GTF troubleshooting section for help.

Usage: talon_initialize_database [options]

Options:
  -h, --help           Show help message and exit
  --f                  GTF annotation file
  --g                  The name of the reference genome build that the annotation describes. Use a short and memorable name since you will need to specify the genome build when you run TALON later.
  --a                  The name of the annotation (for metadata purposes)
  --l                  Minimum required transcript length (default = 0 bp)
  --idprefix           Prefix for naming novel discoveries in eventual TALON runs (default = 'TALON')
  --5p                 Maximum allowable distance (bp) at the 5' end during annotation (default = 500 bp)
  --3p                 Maximum allowable distance (bp) at the 3' end during annotation (default = 300 bp)
  --o                  Output prefix for the database

Running TALON

Now that you've initialized your database and checked your reads for evidence of internal priming, you're ready to annotate them. The input database is modified in place to track and quantify transcripts in the provided dataset(s). In a talon run, each input SAM read is compared to known and previously observed novel transcript models on the basis of its splice junctions. This allows us to not only assign a novel gene or transcript identity where appropriate, but to track new transcript models and characterize how they differ from known ones. The types of novelty assigned are shown in this diagram.

To run the talon annotator, create a comma-delimited configuration file with the following four columns: name, sample description, platform, sam file (full path). There should be one line for each dataset, and dataset names must be unique. If you decide later to add more datasets to an existing analysis, you can do so by creating a new config file for this data and running TALON again on the existing database.

If you're using the --cb option, the dataset names will be pulled from the SAM CB tag, making the first column of the config file unnecessary. Accordingly, TALON expects that when the --cb tag is provided, the config file only includes the following: sample description, platform, sam file (full path).

Please note that TALON versions 4.4+ can be run in multithreaded fashion for a much faster runtime.

usage: talon [-h] [--f CONFIG_FILE] [--cb] [--db FILE,] [--build STRING,]
             [--threads THREADS] [--cov MIN_COVERAGE]
             [--identity MIN_IDENTITY] [--nsg] [--o OUTPREFIX]

optional arguments:
  -h, --help            show this help message and exit  
  --f CONFIG_FILE       Dataset config file: dataset name, sample description,
                        platform, sam file (comma-delimited)  
  --db FILE,            TALON database. Created using
                        talon_initialize_database
  --cb                  Use cell barcode tags to determine dataset. Useful for
                        single-cell data. Requires 3-entry config file.
  --build STRING,       Genome build (i.e. hg38) to use. Must be in the
                        database.
  --threads THREADS, -t THREADS
                        Number of threads to run program with.
  --cov MIN_COVERAGE, -c MIN_COVERAGE
                        Minimum alignment coverage in order to use a SAM
                        entry. Default = 0.9
  --identity MIN_IDENTITY, -i MIN_IDENTITY
                        Minimum alignment identity in order to use a SAM
                        entry. Default = 0.8
  --nsg, --create_novel_spliced_genes
                        Make novel genes with the intergenic novelty label for
                        transcripts that don't share splice junctions with any
                        other models
  --tmpDir
                        Path to directory for tmp files. Default = `talon_tmp/`
  --o OUTPREFIX         Prefix for output files

TALON generates two output files in the course of a run. The QC log (file with suffix 'QC.log') is useful for tracking why a particular read was or was not included in the TALON analysis.

QC log format

Columns:

dataset
read_ID
passed_QC (1/0)
primary_mapped (1/0)
read_length
fraction_aligned
Identity

The second output file (suffix 'read_annot.tsv') appears at the very end of the run and contains a line for every read that was successfully annotated.

Read annotation file format

Columns:

Name of individual read
Name of dataset the read belongs to
Name of genome build used in TALON run
Chromosome
Read start position (1-based). This refers to the 5' end start, so for reads on the - strand, this number will be larger than the read end (col 6).
Read end position (1-based). This refers to the 3' end stop, so for reads on the - strand, this will be smaller than the read start (col 5).
Strand (+ or -)
Number of exons in the transcript
Read length (soft-clipped bases not included)
Gene ID (issued by TALON, integer)
Transcript ID (issued by TALON, integer)
Annotation gene ID
Annotation transcript ID
Annotation gene name (human-readable gene symbol)
Annotation transcript name (human-readable transcript symbol)
Gene novelty: one of "Known", "Antisense", or "Intergenic".
Transcript novelty: one of "Known", "ISM", "NIC", "NNC", "Antisense", "Intergenic", or "Genomic".
ISM subtype. If transcript novelty is not ISM, this field will be 'None'. If the transcript is an ISM, then this field can be 'Prefix', 'Suffix', 'Both', or 'None'.
fraction_As: From the talon_label_reads step. Records the fraction of As present in the n bases after the read alignment.
Custom_label: If the user provided a custom SAM flag (lC:Z), the value will be shown here.
Allelic_label: If the user provided a custom SAM flag (lA:Z) to denote which allele a read came from, the value will be shown here.
Start_support: If the user provided a custom SAM flag (tS:Z) to denote external evidence for the start site of a read, the value will be shown here.
End_support: If the user provided a custom SAM flag (tE:Z to denote external evidence for the end site of a read, the value will be shown here.

It is also possible to obtain this file from a TALON database at any time by running the talon_fetch_reads utility.

Usage: talon_fetch_reads [-h] [--db FILE,] [--build STRING,]
                         [--datasets STRING,] [--o OUTPREFIX]

optional arguments:
  -h, --help          show this help message and exit
  --db FILE,          TALON database
  --build STRING,     Genome build (i.e. hg38) to use. Must be in the
                      database.
  --datasets STRING,  Optional: Comma-delimited list of datasets to include.
                      Default behavior is to include all datasets in the
                      database.
  --o OUTPREFIX       Prefix for output files

Working with the TALON results

Accessing abundance information

The talon_abundance module can be used to extract a raw or filtered transcript count matrix from your TALON database. Each row of this file represents a transcript detected by TALON in one or more of your datasets. To generate a file suitable for gene expression analysis, skip the --whitelist option (i.e. make an unfiltered abundance file). To generate a file for isoform-level analysis, please see the next section to generate a whitelist file to use.

Usage: talon_abundance [options]

Options:
  -h, --help            show this help message and exit
  --db=FILE             TALON database
  -a ANNOT, --annot=ANNOT
                        Which annotation version to use. Will determine which
                        annotation transcripts are considered known or novel
                        relative to. Note: must be in the TALON database.
  --whitelist=FILE      Whitelist file of transcripts to include in the
                        output. First column should be TALON gene ID,
                        second column should be TALON transcript ID
  -b BUILD, --build=BUILD
                        Genome build to use. Note: must be in the TALON
                        database.
  -d FILE, --datasets=FILE
                        Optional: A file indicating which datasets should be
                        included (one dataset name per line). Default is to
                        include                   all datasets.
  --o=FILE              Prefix for output file

Please note to run this utility, you must provide genome build (-b) and annotation (-a) names that match those provided for the talon_initialize_database, otherwise it will not run.

Abundance file format

The columns in the abundance file are as follows:

TALON gene ID
TALON transcript ID
Gene ID from your annotation of choice. If the gene is novel relative to that annotation, this will be the TALON-derived name.
Transcript ID from your annotation of choice. If the transcript is novel relative to that annotation, this will be the TALON-derived name.
Gene name from your annotation of choice (makes the file a bit more human-readable!). If the transcript is novel relative to that annotation, this will be the TALON-derived name.
Transcript name from your annotation of choice. If the transcript is novel relative to that annotation, this will be the TALON-derived name.
Number of exons in the transcript
Length of transcript model (nucleotides). Note: For known transcripts, this will be the length of the model as defined in the annotation. The actual reads that matched it may not be that length. For actual read lengths, see the read_annot output file.
Gene novelty (Known, Antisense, Intergenic)
Transcript status (Known, ISM, NIC, NNC, Antisense, Intergenic)
ISM subtype (Both, Prefix, Suffix, None)
---------------------------- Remaining columns -----------------------------
One column per dataset, with a count indicating how many times the current transcript was observed in that dataset.

Filtering your transcriptome for isoform-level analysis

Before quantifying your results on the isoform level, it is important to filter the novel transcript models because long-read platforms are prone to several forms of artifacts. The most effective experimental design for filtering is to use biological replicates. Some limited filtering is possible even for singlet datasets, but keep in mind that this is likely to be far less effective.

The talon_filter_transcripts module generates a whitelist of transcripts that are either:
a) Known
b) Observed at least n times in each of k datasets.
The default value for n is 5 and the default for k is the total number of datasets you provide for filtering. In addition, the filter requires that all n reads used to support a novel transcript must not have evidence of internal priming (default: internal priming defined as > 0.5 fraction As). If you wish to disregard internal priming, set --maxFracA to 1 (not generally recommended).

Usage: talon_filter_transcripts [options]

Options:
  -h, --help            show this help message and exit
  --db=FILE             TALON database
  -a ANNOT, --annot=ANNOT
                        Which annotation version to use. Will determine which
                        annotation transcripts are considered known or novel
                        relative to. Note: must be in the TALON database.
  --includeAnnot        Include all transcripts from the annotation, regardless
                        of if they were observed in the data.
  --datasets=DATASETS   Datasets to include. Can be provided as a comma-
                        delimited list on the command line, or as a file with
                        one dataset per line. If this option is omitted, all
                        datasets will be included.
  --maxFracA=MAX_FRAC_A
                        Maximum fraction of As to allow in the window located
                        immediately after any read assigned to a novel
                        transcript (helps to filter out internal priming
                        artifacts). Default = 0.5. Use 1 if you prefer to not
                        filter out internal priming.
  --minCount=MIN_COUNT  Number of minimum occurrences required for a novel
                        transcript PER dataset. Default = 5
  --minDatasets=MIN_DATASETS
                        Minimum number of datasets novel transcripts must be
                        found in. Default = all datasets provided
  --allowGenomic        If this option is set, transcripts from the Genomic
                        novelty category will be permitted in the output
                        (provided they pass the thresholds). Default behavior
                        is to filter out genomic transcripts since they are
                        unlikely to be real novel isoforms.
  --o=FILE              Outfile name

The columns in the resulting output file are:

TALON gene ID (an integer). This is the same type of ID found in column 1 of TALON abundance files.
TALON transcript ID (an integer). This is the same type of ID found in column 2 of TALON abundance files.

Obtaining a custom GTF transcriptome annotation from a TALON database

You can use the talon_create_GTF utility to extract a GTF-formatted annotation from the TALON database.

Usage: talon_create_GTF [options]

Options:
  -h, --help            show this help message and exit
  --db=FILE             TALON database
  -b BUILD, --build=BUILD
                        Genome build to use. Note: must be in the TALON
                        database.
  -a ANNOT, --annot=ANNOT
                        Which annotation version to use. Will determine which
                        annotation transcripts are considered known or novel
                        relative to. Note: must be in the TALON database.
  --whitelist=FILE      Whitelist file of transcripts to include in the
                        output. First column should be TALON gene ID,
                        second column should be TALON transcript ID
  --observed            If this option is set, the GTF file will only
                        include transcripts that were observed in at least one
                        dataset (redundant if dataset file provided).
  -d FILE, --datasets=FILE
                        Optional: A file indicating which datasets should be
                        included (one dataset name per line). Default is to
                        include                   all datasets.
  --o=FILE              Prefix for output GTF

Please note to run this utility, you must provide genome build (-b) and annotation (-a) names that match those provided for the talon_initialize_database, otherwise it will not run.

Creating a TALON AnnData object

For users that have single-cell data or that prefer to use the AnnData format to access abundance information, the talon_create_adata utility can be run. This utility produces an AnnData with counts information in sparse matrix format for each transcript, so it is also helpful if the abundance files start to get very large.

Usage: talon_create_adata [options]

Options:
  -h, --help            show this help message and exit
  --db=FILE             TALON database
  -a ANNOT, --annot=ANNOT
                        Which annotation version to use. Will determine which
                        annotation transcripts are considered known or novel
                        relative to. Note: must be in the TALON database.
  --pass_list=FILE      Pass list file of transcripts to include in the
                        output. First column should be TALON gene ID,
                        second column should be TALON transcript ID
  -b BUILD, --build=BUILD
                        Genome build to use. Note: must be in the TALON
                        database.
  --gene                Output AnnData on the gene level rather than the
                        transcript
  -d FILE, --datasets=FILE
                        Optional: A file indicating which datasets should be
                        included (one dataset name per line). Default is to
                        include all datasets.
  --o=FILE              Output .h5ad file name

Citing TALON

Please cite our preprint when using TALON:

A technology-agnostic long-read analysis pipeline for transcriptome discovery and quantification. Dana Wyman, Gabriela Balderrama-Gutierrez, Fairlie Reese, Shan Jiang, Sorena Rahmanian, Weihua Zeng, Brian Williams, Diane Trout, Whitney England, Sophie Chu, Robert C. Spitale, Andrea J. Tenner, Barbara J. Wold, Ali Mortazavi bioRxiv 672931; doi: https://doi.org/10.1101/672931

License

MIT, see LICENSE

talon's People

Contributors

Stargazers

Watchers

talon's Issues

Exact matches to FULL exon string are extremely rare

I decided to use the MatchAnnot output for PB14 to check how many sam_transcripts can be expected to exactly match an annotated transcript exon_string (including the 3' and 5' ends).

grep "result" /bio/dwyman/pacbio_f2016/GM12878/rep3/MatchAnnot/MatchAnnot_out/selection_2-10kb/selection_2-10kb_annotation | grep "sc: 5" | awk '{if($10 == 0 && $11 ==0 ) print $0}' | head

I only found two transcripts in the output:

result:   PB.435.2|chr1:206557165-206589282(+)|c23251/f37p58/3495  RASSF5                RASSF5-004                ex:  5  sc: 5  5-3:     0     0
result:   PB.2089.2|chr9:130579606-130638352(+)|c15546/f2p6/3123  FUBP3                 FUBP3-002                 ex: 19  sc: 5  5-3:     0     0

I was lucky- the first one is on chromosome 1, so it should be in my testing set. I think the transcript cluster ID is different because cluster was re-run at some point between my test set and the MatchAnnot file, but I was able to find the SAM entry in my test set using the start position

c25862/f52p88/3496	0	chr1	206557165	255	529M25575N111M1007N298M495N116M1530N2443M	*	0	0	GAACTGCTTTACGCGAGGGGCAGGAAAGGCGCGGGAGGCGGGGGAGGTGCGGAGATGGCGCTCTGCACGGCGGCGGAGGGAGGGCGCTGGCGCCGGGGACACGAAACCGCAGAGCCCGGACGAGTCAGGGAGTGAGGCGCGAGCCGGGCGCCCGGGGCTCTGCAGGCGCAGGCGGCGCGGGGACAGGAGCAGGTTACCGGGCCGCCCGAGCGCTCGCACCCCGCTGAAAGAAACGCAGGCGGCCCGCCGGCTCTGCCTGGTCCGCTACCCGACCAGCTCCCGGCTCGGGGCTCAGAGCTAGGGGCTTACGCCAAGCGGAGCCCGGGGAGGGGTGCCCACCTCCCTCCGCCGCATCCCAAGCCCGGCCCCCTTGATGCGCTGGCGGCCTCGGCCGGGAACTCCGGGGTAGATGACCGTGGACAGCAGCATGAGCAGTGGGTACTGCAGCCTGGACGAGGAACTGGAAGACTGCTTCTTCACTGCTAAGACTACCTTTTTCAGAAATGCGCAGAGCAAACATCTTTCAAAGAATGTCTGTAAACCTGTGGAGGAGACACAGCGCCCGCCCACACTGCAGGAGATCAAGCAGAAGATCGACAGCTACAACACGCGAGAGAAGAACTGCCTGGGCATGAAACTGAGTGAAGACGGCACCTACACGGGTTTCATCAAAGTGCATCTGAAACTCCGGCGGCCTGTGACGGTGCCTGCTGGGATCCGGCCCCAGTCCATCTATGATGCCATCAAGGAGGTGAACCTGGCGGCTACCACGGACAAGCGGACATCCTTCTACCTGCCCCTAGATGCCATCAAGCAGCTGCACATCAGCAGCACCACCACCGTCAGTGAGGTCATCCAGGGGCTGCTCAAGAAGTTCATGGTTGTGGACAATCCCCAGAAGTTTGCACTTTTTAAGCGGATACACAAGGACGGACAAGTGCTCTTCCAGAAACTCTCCATTGCTGACCGCCCCCTCTACCTGCGCCTGCTTGCTGGGCCTGACACGGAGGTCCTCAGCTTTGTGCTAAAGGAGAATGAAACTGGAGAGGTAGAGTGGGATGCCTTCTCCATCCCTGAACTTCAGAACTTCCTAACAATCCTGGAAAAAGAGGAGCAGGACAAAATCCAACAAGTGCAAAAGAAGTATGACAAGTTTAGGCAGAAACTGGAGGAGGCCTTAAGAGAATCCCAGGGCAAACCTGGGTAACCGGTCCTGCTTCCTCTCCTCCTGGTGCATTCAGATTTATTTGTATTATTAATTATTATTTTGCAACAGACACTTTTTCTCAGGACATCTCTGGCAGGTGCATTTGTGCCTGCCCAGCAGTTCCAGCTGTGGCAAAAGTCTCTTCCATGGACAAGTGTTTGCATGGGGGTTCAGCTGTGCCCGCCCCCAGGCTGTGCCCCACCACAGATTCTGCCAAGGATCAGAACTCATGTGAAACAAACAGCTGACGTCCTCTCTCGATCTGCAAGCCTTTCACCAACCAAATAGTTGCCTCTCTCGTCACCAAACTGGAACCTCACACCAGCCGGCAAAGGAAGGAAGAAAGGTTTTAGAGCTGTGTGTTCTTTCTCTGGCTTTGATTCTTCTTTGAGTTCTCTTACTTGCCACGTACAGGACCATTATTTATGAGTGAAAAGTTGTAGCACATTCCTTTTGCAGGTCTGAGCTAAGCCCCTGAAAGCAGGGTAATGCTCATAAAAGGACTGTTCCCGCGGCCCCAAGGTGCCTGTTGTTCACACTTAAGGGAAGTTTATAAAGCTACTGGCCCCAGATGCTCAGGGTAAGGAGCACCAAAGCTGAGGCTGGCTCAGAGATCTCCAGAGAAGCTGCAGCCTGCCCTGGCCCTGGCTCTGGCCCTGGCCCACATTGCACATGGAAACCCAAAGGCATATATCTGCGTATGTGTGGTACTTAGTCACATCTTTGTCAACAAACTGTTCGTTTTTAAGTTACAAATTTGAATTTAATGTTGTCATCATCGTCATGTGTTTCCCCAAAGGGAAGCCAGTCATTGACCATTTAAAAAGTCTCCTGCTAAGTATGGAAATCAGACAGTAAGAGAAAGCCAAAAAGCAATGCAGAGAAAGGTGTCCAAGCTGTCTTCAGCCTTCCCCAGCTAAAGAGCAGAGGAGGGCCTGGGCTACTTGGGTTCCCCATCGGCCTCCAGCACTGCCTCCCTCCTCCCACTGCGACTCTGGGATCTCCAGGTGCTGCCCAAGGAGTTGCCTTGATTACAGAGAGGGGAGCCTCCAATTCGGCCAACTTGGAGTCCTTTCTGTTTTGAAGCATGGGCCAGACCCGGCACTGCGCTCGGAGAGCCGGTGGGCCTGGCCTCCCCGTCGACCTCAGTGCCTTTTTGTTTTCAGAGAGAAATAGGAGTAGGGCGAGTTTGCCTGAAGCTCTGCTGCTGGCTTCTCCTGCCAGGAAGTGAACAATGGCGGCGGTGTGGGAGACAAGGCCAGGAGAGCCCGCGTTCAGTATGGGTTGAGGGTCACAGACCTCCCTCCCATCTGGGTGCCTGAGTTTTGACTCCAATCAGTGATACCAGACCACATTGACAGGGAGGATCAAATTCCTGACTTACATTTGCACTGGCTTCTTGTTTAGGCTGAATCCTAAAATAAATTAGTCAAAAAATTCCAACAAGTAGCCAGGACTGCAGAGACACTCCAGTGCAGAGGGAGAAGGACTTGTAATTTTCAAAGCAGGGCTGGTTTTCCAACCCAGCCTCTGAGAAACCATTTCTTTGCTATCCTCTGCCTTCCCAAGTCCCTCTTGGGTCGGTTCAAGCCCAAGCTTGTTCGTGTAGCTTCAGAAGTTCCCTCTCTGACCCAGGCTGAGTCCATACTGCCCCTGATCCCAGAAGGAATGCTGACCCCTCGTCGTATGAACTGTGCATAGTCTCCAGAGCTTCAAAGGCAACACAAGCTCGCAACTCTAAGATTTTTTTAAACCACAAAAACCCTGGTTAGCCATCTCATGCTCAGCCTTATCACTTCCCTCCCTTTAGAAACTCTCTCCCTGCTGTATATTAAAGGGAGCAGGTGGAGAGTCATTTTCCTTCGTCCTGCATGTCTCTAACATTAATAGAAGGCATGGCTCCTGCTGCAACCGCTGTGAATGCTGCTGAGAACCTCCCTCTATGGGGATGGCTATTTTATTTTTGAGAAGGAAAAAAAAAGTCATGTATATATACACATAAAGGCATATAGCTATATATAAAGAGATAAGGGTGTTTATGAAATGAGAAAATTATTGGACAATTCAGACTTTACTAAAGCACAGTTAGACCCAAGGCCTATGCTGAGGTCTAAACCTCTGAAAAAAGTATAGTATCGAGTACCCGTTCCCTCCCAGAGGTGGGAGTAACTGCTGGTAGTGCCTTCTTTGGTTGTGTTGCTCAGTGTGTAAGTGTTTGTTTCCAGGATATTTTCTTTTTAAATGTCTTTCTTATATGGGTTTTAAAAAAAAGTAATAAAAGCCTGTTGCA	*	NH:i:1	HI:i:1	NM:i:4	MD:Z:1370C1A209A7C1906	jM:B:c,21,21,21,21	jI:B:i,206557694,206583268,206583380,206584386,206584685,206585179,206585296,206586825

I looked up this transcript in the UCSC genome browser. There is an off-by-one error at the 3' end.

Transcript: 206557165-206589268
Annotation: 206557165-206589267

Hand-checked exons of transcript:

206,557,165 - 206,557,693
206,583,269 - 206,583,379
206,584,387 - 206,584,684
206,585,180 - 206,585,295
206,586,826 - 206589268

Build TALON annotation database

In order to track novel transcripts from run to run, it is crucial to maintain an external structure to store and update the annotations.

TODO

Take existing GTF parsing function from main TALON script and adapt it for creating a new annotation tracker instead (outside of main TALON script). Add filtering to remove irrelevant annotations.
Add dataset-tracking table to database and update TALON input options accordingly
Write a function to output the entire annotation tracker in GTF format for viewing in the UCSC genome browser, etc.
Update main TALON script to parse the annotation instead of a GTF file
Clean up transcript identification code
Write code in main TALON script to appropriately modify the annotation.

Requirements

Allow single runs of talon that combine multiple (1) technical and (2) biological replicates in a single run.
Track gene, transcript, and exon locations
Track how many datasets the transcript has been found in to date, and probably which ones (see point about 'bad' information). We need to identify when/where we first identify a new transcript, but we probably need to identify in a separate file when we observe existing transcripts and the abundance. This is just as interesting for "real" biology as finding new ones.
- This could also be helpful as a way of generating cell-line specific annotations (i.e. filter the annotation to include only transcripts from GM12878 datasets). Yes, for both known and novel transcripts.
Track whether an entry originated in an annotation like GENCODE or if it came from a long read dataset(s). We need to assign novel transcript IDs using a counter. It's ok if we decide to "reject" transcripts later, for example because they have non-canonical junctions.
In terms of our lab, the database/annotation needs to be set up in such a way that Gaby and I can both access it and write to it. Otherwise, we won't be able to compare our datasets. This does create a potential problem though- it means that we can't both run things on it at once. Also, how would this scale if we want to share our annotations with other people? More importantly, this pipeline needs to ultimately be run by the ENCODE DCC based on our raw data. So it's important that it works in a standard environment such as AWS.
Need to protect against addition of 'bad' information, i.e. if we decide that a dataset we ran can't be trusted, then there needs to be a mechanism for removing things that were discovered during that run. Yes, but that could be accomplished by keeping track of previous runs before running additional one (e.g. versioning and storing the previous runs).
- Check for collisions when adding information, including situations where a dataset is accidentally run in duplicate. If you rerun the same sample and add new information, then it's a bug. The only way I could see this as a potential problem is if you did abundance filtering, reran the same sample and now a solo transcript was "double-counted", so watch out for that.
- May want to apply filtering (abundance > 1) before adding a novel transcript to annotation. Not optional, but necessary for new transcripts. But more importantly, only accept novel transcripts seen in biological replicates.
The format should be convenient with respect to the TALON code, i.e. quick to read in. Not only quick, but also filtered from irrelevant Gencode stuff that we should not see.
Be able to output a GTF version
Be able to get a list of every dataset that is represented in the annotation. Ultimately, we want to also track abundances in a separate file that will match what's in this annotation file.

Anticipated challenges

5' and 3' differences: How large should the difference be in order to create a new transcript entry? What if this creates problems with retroactively sub-optimal transcript matches?

Example: Candidate 1 is assigned to the known transcript A.1 because its 5' difference is within some pre-determined threshold. However, when TALON encounters Candidate 2, a new transcript is created. If we see another transcript with the same structure as Candidate 1 (labeled 3 here), will the behavior be the same as for Candidate 1, or will it be assigned to Candidate 2? The concern here is to avoid different outcomes when the order of the input changes. We need TALON to behave deterministically.

Should we ever allow people to merge annotations?
- If we do this, there needs to be a clean mechanism for remapping IDs in datasets that were processed in the past. I am not sure that it's necessary. We can assume that the annotation is run serially.
If we upgrade to a new version of GENCODE, what happens? Also, how do we migrate to a new version of the genome assembly ?

Tests

If we run the same dataset twice, then the second run should not make any changes to the database at all. Yes.
TALON should produce the same results regardless of the order of the input file, except that the assigned new IDs would be different.

Ideas

Transcript and associated information should occupy a single line in order to make parsing as simple as possible. We might want to consider the way UCSC records transcripts in their tables.
- UCSC GENCODE v24 schema: https://genome.ucsc.edu/cgi-bin/hgTables
Right now, when a novel transcript is found, it is immediately added to the data structures of the current run. If I want to screen transcripts (i.e. for min abundance) before adding them to the database, this could be done in a separate iteration over novel transcripts at the end of the run. It's ok to add them to the structure, but we want to filter what's added at the end of the run based on being seen at least once in each of two biological replicates.
When outputting to GTF, include number of datasets each transcript is found in- maybe we could color based on this in the genome browser.

Novel Case: Handling sub-transcript case where all of the query exons match a transcript, but query has fewer exons

Case: c13526/f1p3/2925 (INPP5B)
Because the query transcript is shorter than the annotated ones, it is impossible to tell which isoform it came from. But more importantly than that, we need to make sure that we don't label this as (completely) known.

GENCODE exon is zero basepairs long

ENSE00003758511.1

Change GTF processing to allow exons that don't belong to a gene or transcript

Ali and I agreed that this is a good feature to have, but it does make filtering genes more complicated since a lot of these exons are less than 200 bp. And spike-ins fall into this category.

Match candidate transcripts to gene(s)

High-level

If the candidate transcript overlaps with an annotated gene, then we consider it a transcript of that gene. We could potentially allow more than one match at first and then select the one that is "better". In addition, we may want to think about how much overlap should count.

Details: How will I assign transcripts to genes in practice?

One option would be to use Bedtools intersect. This would be similar in spirit to TranscriptClean. Other ideas might include taking advantage of sorting. I did a brief Google search for "data structures for comparing numeric intervals", and the results suggest to me that there might be good ways of doing this that would not require an outside tool. For instance, there is a data structure called an interval tree that seems appropriate. This would be applicable to other areas of the program as well.
https://en.wikipedia.org/wiki/Interval_tree
This Python library seems like a useful implementation because you can store any object linked to a given interval, for instance, a hypothetical gene class object. And we can safely assume that two genes never have the exact same coordinates.
https://pypi.python.org/pypi/intervaltree

Additional considerations:

Chromosomes need to be handled separately (can't just have one position interval tree)
- Solution: Use a dict: Each chromosome key points to its own interval tree
Strand
- I'm going to ignore this for the moment, but it may need to be its own issue

I decided it would be worthwhile to create a dedicated GeneTree class to manage the dict of interval trees, and create dedicated methods for it in order to keep the code readable. I have now implemented this class.

Tasks

Write the GeneTree class
Write a GeneTree class function: given an interval, get the genes that overlap with it
Write the Gene class
Deal with the strand issue

Need to return both of the best gene matches for a transcript if the distances are equal [FIXED]

Discovered this problem while working with #15

Missing "exon" table in TALON database

I have been trying to figure out the structure of the TALON database created by "initialize_talon_database.py" to fully understand how the program works. However, I have run into a missing table.
Within the "exon_annotation" table a foreign key is created referring to a "exon" table, it seems however this table is never created.
The "exon_annotation" table does get filled at database initialization, is a "add_exon_table" function missing from the script by design?

CREATE TABLE exon_annotations (
ID integer,
annot_name text,
source text,
attribute text,
value text,
/* Keys */
PRIMARY KEY (ID, source, attribute),
/* Foreign keys */
CONSTRAINT Foreign_key01
FOREIGN KEY (ID)
REFERENCES exon(ID)
);

Updating the annotation during TALON run

At the end of a TALON run, it is necessary to update the TALON SQLite database both to add novel genes/transcripts/exons, but also to update existing entries when new information comes to light.

What if we find a novel transcript for a known gene that starts further upstream/ends further downstream?
- Does not matter since genes are not explicitly attached to a particular set of locations.
Transcript entries in the database should be more or less stable.
The gene, transcript, and exon tables all have a "datasets_containing" field. If I want to keep this feature, it will need to be updated every run
- This was scrapped.
Must create and add new database entries for novel genes, transcripts, and exons

Database counter for 'dataset' does not match the number of entries in the table.

Hey,

I am testing out the new TALON version 4.4 with the same testset I used for 4.2 (which passes).
However, I am encountering the message in the title.

Database counter for 'dataset' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 2
counter_value: 1
Traceback (most recent call last):
  File "/home/miniconda/envs/py37/bin/talon", line 8, in <module>
    sys.exit(main())
  File "/home/miniconda/envs/py37/lib/python3.7/site-packages/talon/talon.py", line 2411, in main
    update_database(database, batch_size, run_info.outfiles, dataset_db_entries)
  File "/home/miniconda/envs/py37/lib/python3.7/site-packages/talon/talon.py", line 1786, in update_database
    check_database_integrity(cursor)
  File "/home/miniconda/envs/py37/lib/python3.7/site-packages/talon/talon.py", line 2087, in check_database_integrity
    raise RuntimeError("Discrepancy found in database. " + \
RuntimeError: Discrepancy found in database. Discarding changes to database and exiting...

I checked the database using DB Browser for SQLite to see if the "dataset" table has 2 dataset entries, but it contains just one. My config file also contains just one dataset entry. Any idea how these numbers are not matching up?

Develop the Gene class

Tasks

Write basic infrastructure of the class, i.e. documentation, initialization, print
Choose how to store transcripts belonging to the gene
For now, each gene has a dictionary of transcript ids mapped to transcript objects.

too many values to unpack(expected 2)

Hello,

I used talon_initialize_database, there's error.

cmd : talon_initialize_database --f Mus_musculus.GRCm38.87.gtf --a gencode.v29 --g hg38 --o example_talon 1>x1.log

Traceback (most recent call last):
File "/home/luoluo/.local/bin/talon_initialize_database", line 10, in
sys.exit(main())
File "/home/luoluo/.local/lib/python3.7/site-packages/talon/initialize_talon_database.py", line 996, in main
genes, transcripts, exons = read_gtf_file(gtf_file)
File "/home/luoluo/.local/lib/python3.7/site-packages/talon/initialize_talon_database.py", line 496, in read_gtf_file
transcript = Transcript.get_transcript_from_gtf(tab_fields)
File "/home/luoluo/.local/lib/python3.7/site-packages/talon/transcript.py", line 286, in get_transcript_from_gtf
annotations = extract_transcript_annotations_from_GTF(transcript_info)
File "/home/luoluo/.local/lib/python3.7/site-packages/talon/transcript.py", line 306, in extract_transcript_annotations_from_GTF
key, val = pair.split()
ValueError: too many values to unpack (expected 2)

is there something wrong I did???

Queries we might like to run on TALON database

What is the most commonly utilized 3' end for a given gene? What about 3' exon? Is this different for different datasets?
How many transcripts have 5' CAGE support?
How many genes or isoforms were shared across two experiments?
How often is a particular exon of a gene utilized in its transcripts?

Further downstream,

Given a gene and a cell type, what does the most likely transcript look like?

Comparing query transcripts to annotation on a per-exon basis instead of per-transcript

In the course of comparing my results to MatchAnnot, I have become aware that it is not enough to assign the query to the best gene match and then do transcript comparisons from there. I need to be directly comparing exons. I think it is appropriate to create a new branch for this: exon-wise-comparison.

Create an exon tree data structure instead of a gene structure.
Create an exon class to populate the exon tree
Create method to identify known transcripts
- Show that it is correct (consistent with MatchAnnot)
Implement Match class that will be stored inside MatchTracker
Add smooth reporting of 5' and 3' differences at transcript ends
Write a method to create novel exons from query
Write a test for the 3' and 5' difference method

Write a function to check whether a sam_transcript matches any transcripts on file for a gene

failed at test unit and integration

I am having issues getting python3.7 installed and currently version is 3.6.0 for python3.

make tests works with some warning,

but then make unit and make integration fail for most tests

I attach output.

Is this to be expected without using python3.7?

In any case I can get talon scripts to run and output in same environment
error_output.txt

How do I use the create talon report R scrip

Sorry for all the questions.

I would like to use the R script to generate plots from my db generated through talon scripts.

I am not so familiar and I do not know how to input my files.

Comparing TALON prototype results to Tofu and MatchAnnot

Tofu + STAR + MatchAnnot commands:

# Run Tofu
source activate myenv
collapse_isoforms_by_sam.py --fq --input GM12878_chr1_canonicalOnly.fq  -s sorted.GM12878_chr1_canonicalOnly.sam --dun-merge-5-shorter -o Tofu_STAR_MatchAnnot/Tofu/tofu
source deactivate

# Remap with STAR using bash script. Here is the essential command though
STARlong --runThreadN 4 --genomeDir /bio/dwyman/pacbio_f2016/data/STAR_hg38_ENCODE \
--readFilesIn /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/Tofu/tofu.collapsed.rep.fq \
--sjdbGTFfile /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
--outFileNamePrefix /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/STAR/ \
--outSAMattributes NH HI NM MD jM jI --readNameSeparator space --outFilterMultimapScoreRange 1 --outFilterMismatchNmax 2000 --scoreGapNoncan -20 --scoreGapGCAG -4 --scoreGapATAC -8 --scoreDelOpen -1 --scoreDelBase -1 --scoreInsOpen -1 --scoreInsBase -1 --alignEndsType Local --seedSearchStartLmax 50 --seedPerReadNmax 100000 --seedPerWindowNmax 1000 --alignTranscriptsPerReadNmax 100000 --alignTranscriptsPerWindowNmax 10000 --outSAMtype SAM

samtools sort Aligned.out.sam -o sorted.Aligned.out -O sam


# Run MatchAnnot
# Crashed when I didn't have --format alt. Must be some types in the v24 annotation
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
 /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/STAR/sorted.Aligned.out > \
/bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/match_annot_results.txt

TALON

python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf

Tofu + STAR + MatchAnnot Results:

Total transcripts at MatchAnnot stage: 1526
Number of transcripts assigned a score of 5: 684
- This means that the intron boundaries matched internally, but the leading or trailing 5'/3' ends may have varied
Number of transcripts with perfect 3' and 5' match: 0
Number of transcripts with perfect 3' match but not 5': 8
Number of transcripts with perfect 5' match but not 3': 5

grep "result" match_annot_results.txt | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($10 == 0 && $11 ==0 ) print $0}' | wc -l
ot_results.txt | grep "sc: 5" | awk
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($11 ==0 ) print $0}' | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($10 ==0 ) print $0}' | wc -l

TALON results (as of commit dewyman@568667cc4b1eaf59fc7b7eebd302756d5fafb1e):

Total transcripts processed: 1881
Number of transcripts with internal exon matches (MatchAnnot score 5 equivalent): 954

Unexpected case: Transcript appears to match gene, but is on the wrong strand

I found this case while working on #15. MatchAnnot assigns transcript c33225/f1p0/3407 to gene PHTF1, while TALON does not assign it to a gene at all. The reason for this discrepancy is that TALON requires the transcript and gene strand to match, while MatchAnnot does not. That's all well and good, but in this particular case, the exons line up really well, so I'm surprised that we sequenced this, especially since there are a bunch of similar transcripts on the correct strand. We can see this below. Transcript c33225/f1p0/3407 is the one in blue

Making transcript identity assignments with adaptive exon searching

This is a fairly difficult problem. The first version I implemented was greedy, seeking to choose the most closely matched exon for every query exon in the sam transcript. Here is an example that illustrates why this doesn't work.

Transcript c33255/f1p0/3238: The exon coordinates are as follows:
827670-827775
829003-829104
847654-847806
849484-849602
851927-852110
852671-852766
853391-853424
853474-855121
856449-857242
When we try to match the very first exon, the greedy match is 827669-827775, which is found only in annotated transcript ENST00000609139.5. Visually, we can see in the UCSC genome browser shot below that there were other options. For instance, the first four exons of ENST00000608189.4 (second transcript) match the query.

Handle case where query transcript only overlaps a miRNA gene but clearly does not belong to it

Case observed in #15. This is a pathological case for MatchAnnot when the transcript in question has only one exon- the transcript erroneously gets assigned to the miRNA gene.

At present, TALON and MatchAnnot both assign c14773/f1p0/3486 to gene MIR4425, but TALON does not assign it a transcript match. I need to change this behavior so that c14773/f1p0/3486 gets assigned to a novel gene. This issue is closely related to #12.

Strand bug affecting application of 3' and 5' difference cutoffs

I was recovering only half of expected "score 5" matches because I didn't consider that for negative strand transcripts, the first exon processed is really its last exon. Need to handle this gracefully

Write a GeneTree class function: given an interval, get the genes that overlap with it

Files

ENCODE Spring Retreat, April 20, 2018 (TALON).pdf

UNIQUE constraint failed: dataset.dataset_ID

This is happening at the end of about 1/2 of my runs for talon, with other combinations of .sam files, config files, and db's working just fine.

For example, this command produces the error:

talon --f ../TranscriptClean-master/config.csv --db ds2.2.db --build ds2.2 --threads 4 --o talon_out1

and this does not:

talon --f ../TranscriptClean-master/config_2.csv --db ds2.2.db --build ds2.2 --threads 4 --o talon_out2

The difference of course being which sam file is in the config file. The sam files were all produced by TranscriptClean after mapping by minimap2. I checked for non-unique read headers, there are none.

Any help on tracking down this error would be helpful.

Storing exon coordinates in Transcript object vs exon objects

Right now, I just store the start and end positions of the exons in the transcript object. It may make more sense in the future to change it to storing the exon objects themselves. The issue will be the SAM exons, which don't graduate to objects until the query stage.

Novel cases: All query exons match a transcript, but the annotated transcript contains additional exons

This case is really two cases:

The exon matches are contiguous #23
The exon matches are not contiguous #24

Write a test for the compute_transcript_end function

This function finds the end position of a SAM transcript with respect to the reference genome. To write a test, I will identify some SAM examples and look at them in the UCSC genome browser so that I can record the correct end positions. Then, my test will run the function on these examples and check the results against the correct answers.

FIXED: Novel Case: All of query exons are accounted for in the matched transcript, but the match contains additional exons (not the sub-transcript case)

Related to #23
c13856/f1p1/3056: In this case, TALON returns transcript match set(['ENST00000372597.5', 'ENST00000372596.5']) because it can't tell which of these transcripts is a better match. If we look closely, we can see that ENST00000372597.5 is better because ENST00000372596.5 contains all of the query exons, BUT it also contains one extra exon that is not in the query. Update: After looking more closely at this example, I noticed that neither of the candidates are a perfect match- they each contain a tiny exon at position 3 that the query does not.

I decided to address the spirit of the original problem by checking each match at the end to see if the query and match transcripts each contain the same number of exons. If they do, it's a match, and if not, it isn't.

Process SAM file of transcripts

I will need to read in transcripts from SAM files and create representations of them that allow for comparisons against the annotation.

I think it would make sense for the transcripts to belong to the same class as the annotation-derived transcripts. Or at least be a subclass of them.

Tasks

Superficial SAM reading code that opens the file and goes line by line
Write and test a function that computes the end position of the transcript
Code to get introns from transcript (straight from jI field, or compute them if missing)
Code to get exon structure of a transcript. Similar to TranscriptClean.
Decide how we are going to treat multi-mapping transcripts. I could imagine a version of this where we allow multi-mapping by effectively doing several interval searches against the genes. Then, the gene lists could be combined and processed just like any other gene overlaps would be (i.e. trying to find the best overlap). But it might be simpler at first to exclude them like we do in TranscriptClean. It mostly matters from a data structure-ID point of view.

IndexError

Hi,

I was trying to run talon on my minion data. But I am encountering index error. Bellow is the error. Can you please look into it.

Thanks

[ 2019-11-13 23:44:23 ] Started TALON run
Traceback (most recent call last):
File "/usr/local/bin/talon", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.7/site-packages/talon/talon.py", line 2346, in main
sam_files, dset_metadata = check_inputs(options)
File "/usr/local/lib/python3.7/site-packages/talon/talon.py", line 1411, in check_inputs
curr_sam = line[3]
IndexError: list index out of range

Write function to get the best gene match for an input SAM transcript

It seems to be fairly common for transcripts to overlap more than one gene, but I suspect that the genes in question are not protein-coding. I may run TALON on a gencode file that excludes non-poly-A transcripts, but even so, it may be better to write a function that takes into account the possibility of multiple gene overlaps (with some pretty bad matches). My concern here is mostly for transcripts belonging to a novel gene. We don't want to accidentally assign them to a snoRNA or something just because it's the only overlap there is.

Write tests (At least for the overlap function)
Complete the function

pip install talon doesn't work?

Hello!

When I use PIP install talon, the installation is successful, but the command such as Talon is not displayed.

build_annotation crashes on pseudoautosomal region

Because I added exon validity checks, the build_annotation script now crashes on the pseudoautosomal region because there are two exons on different chromosomes that have the same gencode exon ID (even though their transcript and gene IDs are different). The ID is ENSE00001702291.1. Because of this, I am moving over to using chrome_start_end as an ID instead, with the gencode ID stored as a name instead.

What should I do about GTF fields beyond genes, transcripts, and exons?

In the course of working on GTF processing for the transcripts (#5), I found lines in the GTF file that went beyond the major three groups I work with (genes, transcripts, and exons). Here are some examples:

chr2	HAVANA	CDS	41611	41627	.	-	2	gene_id "ENSG00000184731.5"; transcript_id "ENST00000327669.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; exon_number 2; exon_id "ENSE00001537167.2"; level 2; protein_id "ENSP00000328347.4"; tag "basic"; transcript_support_level "1"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS42645.1"; havana_gene "OTTHUMG00000151321.1"; havana_transcript "OTTHUMT00000322220.1";

chr2	HAVANA	stop_codon	41608	41610	.	-	0	gene_id "ENSG00000184731.5"; transcript_id "ENST00000327669.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; exon_number 2; exon_id "ENSE00001537167.2"; level 2; protein_id "ENSP00000328347.4"; tag "basic"; transcript_support_level "1"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS42645.1"; havana_gene "OTTHUMG00000151321.1"; havana_transcript "OTTHUMT00000322220.1";

chr2	HAVANA	UTR	38814	41610	.	-	.	gene_id "ENSG00000184731.5"; transcript_id "ENST00000327669.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; exon_number 2; exon_id "ENSE00001537167.2"; level 2; protein_id "ENSP00000328347.4"; tag "basic"; transcript_support_level "1"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS42645.1"; havana_gene "OTTHUMG00000151321.1"; havana_transcript "OTTHUMT00000322220.1";

The simplest option of course is to just ignore them. But will I be missing important distinctions if I do? The UTR components may be particularly important.

Transcript that maps well at first, but has a large 3' discrepancy

When I started running TALON on the PB36 ROIs (pre-TranscriptClean), I found the following transcript because it was showing up as a known match with a 3' difference of -14894

m171202_193234_42198_c101417902550000001823303602281843_s1_p0/124708/718_57_CCS

Also this: m171202_130838_42198_c101417902550000001823303602281842_s1_p0/67217/29_1817_CCS
3' difference of -32724

At first I thought they were bugs. But I think what is really happening is that the exons in the query only match the transcript on file that also happens to have a really long 3' UTR. Our query ends much earlier, resulting in the big 3' difference.

Develop the Transcript class

High Level

The Transcript class needs to

Allow fast comparison of transcripts on the level of "are they identical or not"
Be flexible- some users may care about differences at the 5' end and others may not (for example)
Be able to handle data from the GTF annotation or from the SAM file so that transcripts from these two sources can be readily compared

See #1 for additional considerations.

Details

Thoughts

Develop SamTranscript as a subclass of Transcript, with additional attributes for storing SAM information
I think it will be more appropriate to define transcripts in terms of their exons (instead of their introns) because GENCODE is based around exons. The SAM files are more intron-oriented, but it shouldn't be hard to convert over. Could use the CIGAR string
Initialization should not require gene information as a mandatory field, because we won't have this information right away for the SAM transcripts
Start and end relative to assigned gene: At first, I thought I should make it a requirement that the transcript be situated entirely within the gene start and end. But I actually think that assumption may be violated with novel transcripts on occasion.
Start, end, and strand should probably be mandatory initialization fields
It would be really cool if I had a way of computing the 5' UTR and/or 3' UTR length of a transcript object

Tasks

Write basic class structure
Write subclass for SAM transcripts
Write code to create transcript object from GTF entry
Implement string representation function (really simple for now)

talon_generate_report error

Hi there,

I tried to run talon and everything run smoothly, but when I get to the point to run 'talon_generate_report', I got the following issue even though I have already installed all the necessary libraries in R.

$ talon_generate_report
Loading required packages...
One of the required packages is missing.
Here's the original error message:
there is no package called ‘DBI’Error in theme_bw(base_family = "Helvetica", base_size = 18) :
could not find function "theme_bw"
Calls: main ... custom_theme -> suppressMessages -> withCallingHandlers
Execution halted

I also couldn't find the exact arguments for running talon_generate_report. Is there somewhere in the documentation?

Thank you in advance!

some guidance on running; cannot generate novel annotations

I was confused as to how to move through talon scripts.

I downsampled my reads to 10e4, then mapped to hg38 using minimap2 with --MD flag included to generated > aln.sam

I then used TranscriptClean but without -v or -j flags to error correct and output clean.sam

Next I used this to initialize a talon db using the initialize script feeding in gencode.v29.annotation.gtf and --g hg38_build --a hg38_annot

Following this, I used talon.py script with the db, and config.csv pointing to clean.sam file and metadata. This gave and log_QC.txt file and update the hg38.db.

Finally I ran the create_abundance_file_from_database and this gave me a .tsv file.

However when I check the columns for exon and gene, I cannot find any unknown. Even with small number of reads would expect something?

example_head.xlsx

Interesting set of GM12878 transcripts for exon comparison/testing

While selecting test transcripts for the compute_transcript_end function, I stumbled upon this pair in GM12878_chr1_clean.sam: c30124/f1p15/2713 and c2986/f14p15/2748. It's interesting to me because the internal exon boundaries match, but there is a difference at the 3' end. The 5' ends are the same.

Which types of genes should we allow in the GENCODE GTF file?

Should we be filtering out things like snoRNAs, miRNA, etc. At first, my plan was to stick to protein coding genes and lncRNAs, but then I visited the GENCODE page and found out that there is an absurd number of biotypes.

https://www.gencodegenes.org/gencode_biotypes.html

Running slowly on PBS/Torque

I was wondering if there is a faster way to run this on large SAM files (>30gb) - will there be a parallel version of TALON?

Move gene length check to annotation building step instead of TALON proper

How to choose best exon match for query when tot_diffs are the same

I haven't seen this case happen in my data yet, but I bet it will at some point so I should be prepared to deal with it.

Clean up GTF parsing code by delegating sanity checks to a separate function

SAM file lacks an MD tag

Hi, I tried to use TALON to process the alignment file I got from minimap2. It's a sam file.
python talon.py \ --f /private/.../TALON/hg38mini1BD.csv \ --db /private/.../TALON/hg38v27_talon.db \ --build hg38v27 --o /private/.../TALON/hg38tal1BD

The csv file has the alignment file name. however when I ran it, it gave me

Traceback (most recent call last):
  File "talon.py", line 2475, in <module>
    main()
  File "talon.py", line 2452, in main
    struct_collection, outprefix)
  File "talon.py", line 1694, in process_all_sam_files
    exon_annotations, abundance = annotate_sam_transcripts(sam, d_name, cursor, struct_collection, o)
  File "talon.py", line 1919, in annotate_sam_transcripts
    qc_metrics = check_read_quality(line, struct_collection)
  File "talon.py", line 2069, in check_read_quality
    raise ValueError("SAM transcript %s lacks an MD tag" % read_ID)
ValueError: SAM transcript 2d415520-3a14-44b9-af3c-1962ab413e36 lacks an MD tag

I am not sure what is happening here...

Handle case where longer gene covers more of transcript query, but shorter gene is a closer match

While working on #15, I came across the following case c18234/f1p0/3056 (red bar). Visually, we can see that the transcript matches the NAP1L4P1 gene, and indeed, that is how MatchAnnot assigns it. However, TALON assigns the transcript to the CD58 gene, which appears incorrect. This may be a pretty insidious problem to fix because of how I do my gene assignments. Maybe I need to move to a more directly exon-based approach.

I found a second such pathological case: c26372/f1p0/3232

GENCODE has occasional gene duplicates. How should this be handled?

When I added a check in the GeneTree add_gene class function to look for existing genes with the input gene_name before adding a new gene, I unexpectedly got a hit:

python talon.py --f test_files/GM12878_chr1_clean.sam -g test_files/gencode.v24.annotation.chr1.gtf
Traceback (most recent call last):
  File "talon.py", line 92, in <module>
    main()
  File "talon.py", line 84, in main
    read_gtf_file(gtf_file)
  File "talon.py", line 50, in read_gtf_file
    genes.add_gene_from_gtf(tab_fields)
  File "/dfs1/bio/dwyman/pacbio_f2016/TALON/genetree.py", line 103, in add_gene_from_gtf
    self.add_gene(gene_name, chromosome, start, end, strand)
  File "/dfs1/bio/dwyman/pacbio_f2016/TALON/genetree.py", line 71, in add_gene
    " is duplicated.")
KeyError: 'Gene names must be unique. TP73-AS1 is duplicated.'

I decided to look for this gene in the annotation to see what might be going on

awk '{if($3 == "gene") print $0}' test_files/gencode.v24.annotation.chr1.gtf | grep "TP73-AS1"
chr1	HAVANA	gene	3735601	3747336	.	-	.	gene_id "ENSG00000227372.10"; gene_type "antisense"; gene_status "KNOWN"; gene_name "TP73-AS1"; level 2; havana_gene "OTTHUMG00000003414.4";
chr1	ENSEMBL	gene	3736943	3737103	.	+	.	gene_id "ENSG00000276189.1"; gene_type "misc_RNA"; gene_status "NOVEL"; gene_name "TP73-AS1"; level 3;

Sure enough, these genes share the same name. They have different gene_id fields though. This means that one way around the problem is to use the gene_id instead of the gene_name to guard against collisions. I generally don't like using gene_ids because they are so difficult to read, but it is probably the responsible thing to do since they are must more precise than gene names. I can always retain use of the gene_name in the object printing functions.

Solution:

I modified the Gene and GeneTree classes so that they now use the gene_id as the primary gene identifier, with the gene_name included mainly for readability. The collision problem no longer happens.

Developing class structures that make sense for genes, transcripts, and exons

In designing Python classes for genes, transcripts, and exons, it is really important that there be enough complexity to allow for flexible handling of different situations, but I don't want things to become needlessly complicated.

This need is illustrated well by the following example.

Here, we have an annotated gene, A, which has an annotated transcript, A.1. Transcript A.1 has three exons, each of which have defined start and end positions. Using a long read sequencing platform, we detect four transcripts that are candidate isoforms of gene A, and which all happen to resemble transcript A.1. My software needs to be able to efficiently compare each candidate to transcript A.1, and decide whether each one is a match or not. The definition of a match is flexible. Here are some different rules that could be applied alone or in combination:

All exon start and end locations must match exactly (most strict)
- Under this framework, only Candidate 1 matches transcript A.1
Deviation is permitted at the 3' transcript end
- Candidate 1 and 2 are matches
Deviation is permitted at the 5' transcript end
- Candidate 1 and 3 are matches
Small differences are allowed on the internal exon boundaries (i.e. "fuzzy junctions")
- Candidate 1 and 4 are matches

Because of this structure, I think it is necessary to have exon entities be part of the transcript object. Then, I can have a method that creates an "exon string" representation of the transcript according to the rules in operation. If 3' variation is allowed, for instance, then the end_E3 position would be omitted from the exon string. If fuzzy junctions are permitted, then maybe I would need to create multiple permuted exon strings of transcript A.1 for the candidates to be compared to. But this could get really complicated fast for genes that have many exons, so I may need a smarter solution. For situations like that, it might make more sense to split the candidate transcript into exon "substrings" that could be matched adaptively to exons of the genes.

I can see already that I will need test cases to track the matching process under different rule structures.

I just had another thought- once we establish that a novel transcript has been found, the program should create a transcript object linked to the gene that is stored the same way that an annotated transcript would be. That way, every candidate matched to that gene from this point on will be screened against the novel transcript to see if it matches.

4/5/2018

I had a discussion with Ali, and we decided to avoid fuzzy junctions for now. I should probably still try to find out how often novel transcripts are just wobbly versions of annotated ones, but it will not drive the structure of the code. This means I can go forward with my plan to represent transcripts as objects that have a list of exons (or introns). To compare transcripts to each other, we call a class function i.e. get_transcript_string() to create a string representation of each transcript, which can be subject to different rules as outlined in the previous section.

4/6/2018

I need to work out how to deal with exons/introns without losing too much information.

4/9/2018

The basic structure is in place, but the next phase will involve more detailed exon comparisons.

Exon-based comparison tool

Hi, I noticed that there was mention of an exon-based comparison tool but cannot find that option in the talon commands. I am trying to quantify differences in 3' and 5' exon boundaries compared with known isoforms and this would be extremely helpful. Thank you!

mortazavilab / talon Goto Github PK

talon's Introduction

TALON

Table of contents

Installation

How to run

Flagging reads for internal priming

Initializing a TALON database

Running TALON

Working with the TALON results

Accessing abundance information

Filtering your transcriptome for isoform-level analysis

Obtaining a custom GTF transcriptome annotation from a TALON database

Creating a TALON AnnData object

Citing TALON

License

talon's People

Contributors

Stargazers

Watchers

Forkers

talon's Issues

TODO

Requirements

Anticipated challenges

Tests

Ideas

High-level

Details: How will I assign transcripts to genes in practice?

Tasks

Tasks

Tasks

High Level

Details

Thoughts

Tasks

Solution:

4/5/2018

4/6/2018

4/9/2018

Recommend Projects

Recommend Topics

Recommend Org