mortazavilab / talon Goto Github PK
View Code? Open in Web Editor NEWTechnology agnostic long read analysis pipeline for transcriptomes
License: MIT License
Technology agnostic long read analysis pipeline for transcriptomes
License: MIT License
In order to track novel transcripts from run to run, it is crucial to maintain an external structure to store and update the annotations.
Example: Candidate 1 is assigned to the known transcript A.1 because its 5' difference is within some pre-determined threshold. However, when TALON encounters Candidate 2, a new transcript is created. If we see another transcript with the same structure as Candidate 1 (labeled 3 here), will the behavior be the same as for Candidate 1, or will it be assigned to Candidate 2? The concern here is to avoid different outcomes when the order of the input changes. We need TALON to behave deterministically.
I have been trying to figure out the structure of the TALON database created by "initialize_talon_database.py" to fully understand how the program works. However, I have run into a missing table.
Within the "exon_annotation" table a foreign key is created referring to a "exon" table, it seems however this table is never created.
The "exon_annotation" table does get filled at database initialization, is a "add_exon_table" function missing from the script by design?
CREATE TABLE exon_annotations (
ID integer,
annot_name text,
source text,
attribute text,
value text,
/* Keys */
PRIMARY KEY (ID, source, attribute),
/* Foreign keys */
CONSTRAINT Foreign_key01
FOREIGN KEY (ID)
REFERENCES exon(ID)
);
If the candidate transcript overlaps with an annotated gene, then we consider it a transcript of that gene. We could potentially allow more than one match at first and then select the one that is "better". In addition, we may want to think about how much overlap should count.
One option would be to use Bedtools intersect. This would be similar in spirit to TranscriptClean. Other ideas might include taking advantage of sorting. I did a brief Google search for "data structures for comparing numeric intervals", and the results suggest to me that there might be good ways of doing this that would not require an outside tool. For instance, there is a data structure called an interval tree that seems appropriate. This would be applicable to other areas of the program as well.
https://en.wikipedia.org/wiki/Interval_tree
This Python library seems like a useful implementation because you can store any object linked to a given interval, for instance, a hypothetical gene class object. And we can safely assume that two genes never have the exact same coordinates.
https://pypi.python.org/pypi/intervaltree
Additional considerations:
I decided it would be worthwhile to create a dedicated GeneTree class to manage the dict of interval trees, and create dedicated methods for it in order to keep the code readable. I have now implemented this class.
This is happening at the end of about 1/2 of my runs for talon, with other combinations of .sam files, config files, and db's working just fine.
For example, this command produces the error:
talon --f ../TranscriptClean-master/config.csv --db ds2.2.db --build ds2.2 --threads 4 --o talon_out1
and this does not:
talon --f ../TranscriptClean-master/config_2.csv --db ds2.2.db --build ds2.2 --threads 4 --o talon_out2
The difference of course being which sam file is in the config file. The sam files were all produced by TranscriptClean after mapping by minimap2. I checked for non-unique read headers, there are none.
Any help on tracking down this error would be helpful.
Hi, I tried to use TALON to process the alignment file I got from minimap2. It's a sam file.
python talon.py \ --f /private/.../TALON/hg38mini1BD.csv \ --db /private/.../TALON/hg38v27_talon.db \ --build hg38v27 --o /private/.../TALON/hg38tal1BD
The csv file has the alignment file name. however when I ran it, it gave me
Traceback (most recent call last):
File "talon.py", line 2475, in <module>
main()
File "talon.py", line 2452, in main
struct_collection, outprefix)
File "talon.py", line 1694, in process_all_sam_files
exon_annotations, abundance = annotate_sam_transcripts(sam, d_name, cursor, struct_collection, o)
File "talon.py", line 1919, in annotate_sam_transcripts
qc_metrics = check_read_quality(line, struct_collection)
File "talon.py", line 2069, in check_read_quality
raise ValueError("SAM transcript %s lacks an MD tag" % read_ID)
ValueError: SAM transcript 2d415520-3a14-44b9-af3c-1962ab413e36 lacks an MD tag
I am not sure what is happening here...
In the course of comparing my results to MatchAnnot, I have become aware that it is not enough to assign the query to the best gene match and then do transcript comparisons from there. I need to be directly comparing exons. I think it is appropriate to create a new branch for this: exon-wise-comparison.
This is a fairly difficult problem. The first version I implemented was greedy, seeking to choose the most closely matched exon for every query exon in the sam transcript. Here is an example that illustrates why this doesn't work.
Transcript c33255/f1p0/3238: The exon coordinates are as follows:
827670-827775
829003-829104
847654-847806
849484-849602
851927-852110
852671-852766
853391-853424
853474-855121
856449-857242
When we try to match the very first exon, the greedy match is 827669-827775, which is found only in annotated transcript ENST00000609139.5. Visually, we can see in the UCSC genome browser shot below that there were other options. For instance, the first four exons of ENST00000608189.4 (second transcript) match the query.
Should we be filtering out things like snoRNAs, miRNA, etc. At first, my plan was to stick to protein coding genes and lncRNAs, but then I visited the GENCODE page and found out that there is an absurd number of biotypes.
Sorry for all the questions.
I would like to use the R script to generate plots from my db generated through talon scripts.
I am not so familiar and I do not know how to input my files.
The Transcript class needs to
See #1 for additional considerations.
ENSE00003758511.1
Discovered this problem while working with #15
I am having issues getting python3.7 installed and currently version is 3.6.0 for python3.
make tests works with some warning,
but then make unit and make integration fail for most tests
I attach output.
Is this to be expected without using python3.7?
In any case I can get talon scripts to run and output in same environment
error_output.txt
When I started running TALON on the PB36 ROIs (pre-TranscriptClean), I found the following transcript because it was showing up as a known match with a 3' difference of -14894
m171202_193234_42198_c101417902550000001823303602281843_s1_p0/124708/718_57_CCS
Also this: m171202_130838_42198_c101417902550000001823303602281842_s1_p0/67217/29_1817_CCS
3' difference of -32724
At first I thought they were bugs. But I think what is really happening is that the exons in the query only match the transcript on file that also happens to have a really long 3' UTR. Our query ends much earlier, resulting in the big 3' difference.
Hi there,
I tried to run talon and everything run smoothly, but when I get to the point to run 'talon_generate_report', I got the following issue even though I have already installed all the necessary libraries in R.
$ talon_generate_report
Loading required packages...
One of the required packages is missing.
Here's the original error message:
there is no package called ‘DBI’Error in theme_bw(base_family = "Helvetica", base_size = 18) :
could not find function "theme_bw"
Calls: main ... custom_theme -> suppressMessages -> withCallingHandlers
Execution halted
I also couldn't find the exact arguments for running talon_generate_report. Is there somewhere in the documentation?
Thank you in advance!
Ali and I agreed that this is a good feature to have, but it does make filtering genes more complicated since a lot of these exons are less than 200 bp. And spike-ins fall into this category.
Hey,
I am testing out the new TALON version 4.4 with the same testset I used for 4.2 (which passes).
However, I am encountering the message in the title.
Database counter for 'dataset' does not match the number of entries in the table. Discarding changes to database and exiting...
table_count: 2
counter_value: 1
Traceback (most recent call last):
File "/home/miniconda/envs/py37/bin/talon", line 8, in <module>
sys.exit(main())
File "/home/miniconda/envs/py37/lib/python3.7/site-packages/talon/talon.py", line 2411, in main
update_database(database, batch_size, run_info.outfiles, dataset_db_entries)
File "/home/miniconda/envs/py37/lib/python3.7/site-packages/talon/talon.py", line 1786, in update_database
check_database_integrity(cursor)
File "/home/miniconda/envs/py37/lib/python3.7/site-packages/talon/talon.py", line 2087, in check_database_integrity
raise RuntimeError("Discrepancy found in database. " + \
RuntimeError: Discrepancy found in database. Discarding changes to database and exiting...
I checked the database using DB Browser for SQLite to see if the "dataset" table has 2 dataset entries, but it contains just one. My config file also contains just one dataset entry. Any idea how these numbers are not matching up?
I will need to read in transcripts from SAM files and create representations of them that allow for comparisons against the annotation.
I think it would make sense for the transcripts to belong to the same class as the annotation-derived transcripts. Or at least be a subclass of them.
While selecting test transcripts for the compute_transcript_end function, I stumbled upon this pair in GM12878_chr1_clean.sam: c30124/f1p15/2713 and c2986/f14p15/2748. It's interesting to me because the internal exon boundaries match, but there is a difference at the 3' end. The 5' ends are the same.
Because I added exon validity checks, the build_annotation script now crashes on the pseudoautosomal region because there are two exons on different chromosomes that have the same gencode exon ID (even though their transcript and gene IDs are different). The ID is ENSE00001702291.1. Because of this, I am moving over to using chrome_start_end as an ID instead, with the gencode ID stored as a name instead.
It seems to be fairly common for transcripts to overlap more than one gene, but I suspect that the genes in question are not protein-coding. I may run TALON on a gencode file that excludes non-poly-A transcripts, but even so, it may be better to write a function that takes into account the possibility of multiple gene overlaps (with some pretty bad matches). My concern here is mostly for transcripts belonging to a novel gene. We don't want to accidentally assign them to a snoRNA or something just because it's the only overlap there is.
When I added a check in the GeneTree add_gene class function to look for existing genes with the input gene_name before adding a new gene, I unexpectedly got a hit:
python talon.py --f test_files/GM12878_chr1_clean.sam -g test_files/gencode.v24.annotation.chr1.gtf
Traceback (most recent call last):
File "talon.py", line 92, in <module>
main()
File "talon.py", line 84, in main
read_gtf_file(gtf_file)
File "talon.py", line 50, in read_gtf_file
genes.add_gene_from_gtf(tab_fields)
File "/dfs1/bio/dwyman/pacbio_f2016/TALON/genetree.py", line 103, in add_gene_from_gtf
self.add_gene(gene_name, chromosome, start, end, strand)
File "/dfs1/bio/dwyman/pacbio_f2016/TALON/genetree.py", line 71, in add_gene
" is duplicated.")
KeyError: 'Gene names must be unique. TP73-AS1 is duplicated.'
I decided to look for this gene in the annotation to see what might be going on
awk '{if($3 == "gene") print $0}' test_files/gencode.v24.annotation.chr1.gtf | grep "TP73-AS1"
chr1 HAVANA gene 3735601 3747336 . - . gene_id "ENSG00000227372.10"; gene_type "antisense"; gene_status "KNOWN"; gene_name "TP73-AS1"; level 2; havana_gene "OTTHUMG00000003414.4";
chr1 ENSEMBL gene 3736943 3737103 . + . gene_id "ENSG00000276189.1"; gene_type "misc_RNA"; gene_status "NOVEL"; gene_name "TP73-AS1"; level 3;
Sure enough, these genes share the same name. They have different gene_id fields though. This means that one way around the problem is to use the gene_id instead of the gene_name to guard against collisions. I generally don't like using gene_ids because they are so difficult to read, but it is probably the responsible thing to do since they are must more precise than gene names. I can always retain use of the gene_name in the object printing functions.
I modified the Gene and GeneTree classes so that they now use the gene_id as the primary gene identifier, with the gene_name included mainly for readability. The collision problem no longer happens.
I was recovering only half of expected "score 5" matches because I didn't consider that for negative strand transcripts, the first exon processed is really its last exon. Need to handle this gracefully
Case observed in #15. This is a pathological case for MatchAnnot when the transcript in question has only one exon- the transcript erroneously gets assigned to the miRNA gene.
At present, TALON and MatchAnnot both assign c14773/f1p0/3486 to gene MIR4425, but TALON does not assign it a transcript match. I need to change this behavior so that c14773/f1p0/3486 gets assigned to a novel gene. This issue is closely related to #12.
In the course of working on GTF processing for the transcripts (#5), I found lines in the GTF file that went beyond the major three groups I work with (genes, transcripts, and exons). Here are some examples:
chr2 HAVANA CDS 41611 41627 . - 2 gene_id "ENSG00000184731.5"; transcript_id "ENST00000327669.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; exon_number 2; exon_id "ENSE00001537167.2"; level 2; protein_id "ENSP00000328347.4"; tag "basic"; transcript_support_level "1"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS42645.1"; havana_gene "OTTHUMG00000151321.1"; havana_transcript "OTTHUMT00000322220.1";
chr2 HAVANA stop_codon 41608 41610 . - 0 gene_id "ENSG00000184731.5"; transcript_id "ENST00000327669.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; exon_number 2; exon_id "ENSE00001537167.2"; level 2; protein_id "ENSP00000328347.4"; tag "basic"; transcript_support_level "1"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS42645.1"; havana_gene "OTTHUMG00000151321.1"; havana_transcript "OTTHUMT00000322220.1";
chr2 HAVANA UTR 38814 41610 . - . gene_id "ENSG00000184731.5"; transcript_id "ENST00000327669.4"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "FAM110C"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "FAM110C-001"; exon_number 2; exon_id "ENSE00001537167.2"; level 2; protein_id "ENSP00000328347.4"; tag "basic"; transcript_support_level "1"; tag "appris_principal_1"; tag "CCDS"; ccdsid "CCDS42645.1"; havana_gene "OTTHUMG00000151321.1"; havana_transcript "OTTHUMT00000322220.1";
The simplest option of course is to just ignore them. But will I be missing important distinctions if I do? The UTR components may be particularly important.
At the end of a TALON run, it is necessary to update the TALON SQLite database both to add novel genes/transcripts/exons, but also to update existing entries when new information comes to light.
Hello,
I used talon_initialize_database, there's error.
cmd : talon_initialize_database --f Mus_musculus.GRCm38.87.gtf --a gencode.v29 --g hg38 --o example_talon 1>x1.log
Traceback (most recent call last):
File "/home/luoluo/.local/bin/talon_initialize_database", line 10, in
sys.exit(main())
File "/home/luoluo/.local/lib/python3.7/site-packages/talon/initialize_talon_database.py", line 996, in main
genes, transcripts, exons = read_gtf_file(gtf_file)
File "/home/luoluo/.local/lib/python3.7/site-packages/talon/initialize_talon_database.py", line 496, in read_gtf_file
transcript = Transcript.get_transcript_from_gtf(tab_fields)
File "/home/luoluo/.local/lib/python3.7/site-packages/talon/transcript.py", line 286, in get_transcript_from_gtf
annotations = extract_transcript_annotations_from_GTF(transcript_info)
File "/home/luoluo/.local/lib/python3.7/site-packages/talon/transcript.py", line 306, in extract_transcript_annotations_from_GTF
key, val = pair.split()
ValueError: too many values to unpack (expected 2)
is there something wrong I did???
Hi,
I was trying to run talon on my minion data. But I am encountering index error. Bellow is the error. Can you please look into it.
Thanks
[ 2019-11-13 23:44:23 ] Started TALON run
Traceback (most recent call last):
File "/usr/local/bin/talon", line 8, in
sys.exit(main())
File "/usr/local/lib/python3.7/site-packages/talon/talon.py", line 2346, in main
sam_files, dset_metadata = check_inputs(options)
File "/usr/local/lib/python3.7/site-packages/talon/talon.py", line 1411, in check_inputs
curr_sam = line[3]
IndexError: list index out of range
I was confused as to how to move through talon scripts.
I downsampled my reads to 10e4, then mapped to hg38 using minimap2 with --MD flag included to generated > aln.sam
I then used TranscriptClean but without -v or -j flags to error correct and output clean.sam
Next I used this to initialize a talon db using the initialize script feeding in gencode.v29.annotation.gtf and --g hg38_build --a hg38_annot
Following this, I used talon.py script with the db, and config.csv pointing to clean.sam file and metadata. This gave and log_QC.txt file and update the hg38.db.
Finally I ran the create_abundance_file_from_database and this gave me a .tsv file.
However when I check the columns for exon and gene, I cannot find any unknown. Even with small number of reads would expect something?
Related to #23
c13856/f1p1/3056: In this case, TALON returns transcript match set(['ENST00000372597.5', 'ENST00000372596.5']) because it can't tell which of these transcripts is a better match. If we look closely, we can see that ENST00000372597.5 is better because ENST00000372596.5 contains all of the query exons, BUT it also contains one extra exon that is not in the query. Update: After looking more closely at this example, I noticed that neither of the candidates are a perfect match- they each contain a tiny exon at position 3 that the query does not.
I decided to address the spirit of the original problem by checking each match at the end to see if the query and match transcripts each contain the same number of exons. If they do, it's a match, and if not, it isn't.
In designing Python classes for genes, transcripts, and exons, it is really important that there be enough complexity to allow for flexible handling of different situations, but I don't want things to become needlessly complicated.
This need is illustrated well by the following example.
Here, we have an annotated gene, A, which has an annotated transcript, A.1. Transcript A.1 has three exons, each of which have defined start and end positions. Using a long read sequencing platform, we detect four transcripts that are candidate isoforms of gene A, and which all happen to resemble transcript A.1. My software needs to be able to efficiently compare each candidate to transcript A.1, and decide whether each one is a match or not. The definition of a match is flexible. Here are some different rules that could be applied alone or in combination:
Because of this structure, I think it is necessary to have exon entities be part of the transcript object. Then, I can have a method that creates an "exon string" representation of the transcript according to the rules in operation. If 3' variation is allowed, for instance, then the end_E3 position would be omitted from the exon string. If fuzzy junctions are permitted, then maybe I would need to create multiple permuted exon strings of transcript A.1 for the candidates to be compared to. But this could get really complicated fast for genes that have many exons, so I may need a smarter solution. For situations like that, it might make more sense to split the candidate transcript into exon "substrings" that could be matched adaptively to exons of the genes.
I can see already that I will need test cases to track the matching process under different rule structures.
I just had another thought- once we establish that a novel transcript has been found, the program should create a transcript object linked to the gene that is stored the same way that an annotated transcript would be. That way, every candidate matched to that gene from this point on will be screened against the novel transcript to see if it matches.
I had a discussion with Ali, and we decided to avoid fuzzy junctions for now. I should probably still try to find out how often novel transcripts are just wobbly versions of annotated ones, but it will not drive the structure of the code. This means I can go forward with my plan to represent transcripts as objects that have a list of exons (or introns). To compare transcripts to each other, we call a class function i.e. get_transcript_string() to create a string representation of each transcript, which can be subject to different rules as outlined in the previous section.
I need to work out how to deal with exons/introns without losing too much information.
The basic structure is in place, but the next phase will involve more detailed exon comparisons.
I was wondering if there is a faster way to run this on large SAM files (>30gb) - will there be a parallel version of TALON?
I haven't seen this case happen in my data yet, but I bet it will at some point so I should be prepared to deal with it.
Hi, I noticed that there was mention of an exon-based comparison tool but cannot find that option in the talon commands. I am trying to quantify differences in 3' and 5' exon boundaries compared with known isoforms and this would be extremely helpful. Thank you!
Tofu + STAR + MatchAnnot commands:
# Run Tofu
source activate myenv
collapse_isoforms_by_sam.py --fq --input GM12878_chr1_canonicalOnly.fq -s sorted.GM12878_chr1_canonicalOnly.sam --dun-merge-5-shorter -o Tofu_STAR_MatchAnnot/Tofu/tofu
source deactivate
# Remap with STAR using bash script. Here is the essential command though
STARlong --runThreadN 4 --genomeDir /bio/dwyman/pacbio_f2016/data/STAR_hg38_ENCODE \
--readFilesIn /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/Tofu/tofu.collapsed.rep.fq \
--sjdbGTFfile /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
--outFileNamePrefix /bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/STAR/ \
--outSAMattributes NH HI NM MD jM jI --readNameSeparator space --outFilterMultimapScoreRange 1 --outFilterMismatchNmax 2000 --scoreGapNoncan -20 --scoreGapGCAG -4 --scoreGapATAC -8 --scoreDelOpen -1 --scoreDelBase -1 --scoreInsOpen -1 --scoreInsBase -1 --alignEndsType Local --seedSearchStartLmax 50 --seedPerReadNmax 100000 --seedPerWindowNmax 1000 --alignTranscriptsPerReadNmax 100000 --alignTranscriptsPerWindowNmax 10000 --outSAMtype SAM
samtools sort Aligned.out.sam -o sorted.Aligned.out -O sam
# Run MatchAnnot
# Crashed when I didn't have --format alt. Must be some types in the v24 annotation
/bio/dwyman/pacbio_f2016/code/MatchAnnot-master/matchAnnot.py --format alt \
--gtf /bio/dwyman/pacbio_f2016/data/GENCODEv24/gencode.v24.annotation.gtf \
/bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/STAR/sorted.Aligned.out > \
/bio/dwyman/pacbio_f2016/TALON/test_files/Tofu_STAR_MatchAnnot/match_annot_results.txt
TALON
python talon.py --f test_files/sorted.GM12878_chr1_canonicalOnly.sam -g test_files/gencode.v24.annotation.chr1.gtf
Tofu + STAR + MatchAnnot Results:
grep "result" match_annot_results.txt | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($10 == 0 && $11 ==0 ) print $0}' | wc -l
ot_results.txt | grep "sc: 5" | awk
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($11 ==0 ) print $0}' | wc -l
grep "result" match_annot_results.txt | grep "sc: 5" | awk '{if($10 ==0 ) print $0}' | wc -l
TALON results (as of commit dewyman@568667cc4b1eaf59fc7b7eebd302756d5fafb1e):
I found this case while working on #15. MatchAnnot assigns transcript c33225/f1p0/3407 to gene PHTF1, while TALON does not assign it to a gene at all. The reason for this discrepancy is that TALON requires the transcript and gene strand to match, while MatchAnnot does not. That's all well and good, but in this particular case, the exons line up really well, so I'm surprised that we sequenced this, especially since there are a bunch of similar transcripts on the correct strand. We can see this below. Transcript c33225/f1p0/3407 is the one in blue
Further downstream,
Right now, I just store the start and end positions of the exons in the transcript object. It may make more sense in the future to change it to storing the exon objects themselves. The issue will be the SAM exons, which don't graduate to objects until the query stage.
While working on #15, I came across the following case c18234/f1p0/3056 (red bar). Visually, we can see that the transcript matches the NAP1L4P1 gene, and indeed, that is how MatchAnnot assigns it. However, TALON assigns the transcript to the CD58 gene, which appears incorrect. This may be a pretty insidious problem to fix because of how I do my gene assignments. Maybe I need to move to a more directly exon-based approach.
I found a second such pathological case: c26372/f1p0/3232
I decided to use the MatchAnnot output for PB14 to check how many sam_transcripts can be expected to exactly match an annotated transcript exon_string (including the 3' and 5' ends).
grep "result" /bio/dwyman/pacbio_f2016/GM12878/rep3/MatchAnnot/MatchAnnot_out/selection_2-10kb/selection_2-10kb_annotation | grep "sc: 5" | awk '{if($10 == 0 && $11 ==0 ) print $0}' | head
I only found two transcripts in the output:
result: PB.435.2|chr1:206557165-206589282(+)|c23251/f37p58/3495 RASSF5 RASSF5-004 ex: 5 sc: 5 5-3: 0 0
result: PB.2089.2|chr9:130579606-130638352(+)|c15546/f2p6/3123 FUBP3 FUBP3-002 ex: 19 sc: 5 5-3: 0 0
I was lucky- the first one is on chromosome 1, so it should be in my testing set. I think the transcript cluster ID is different because cluster was re-run at some point between my test set and the MatchAnnot file, but I was able to find the SAM entry in my test set using the start position
c25862/f52p88/3496 0 chr1 206557165 255 529M25575N111M1007N298M495N116M1530N2443M * 0 0 GAACTGCTTTACGCGAGGGGCAGGAAAGGCGCGGGAGGCGGGGGAGGTGCGGAGATGGCGCTCTGCACGGCGGCGGAGGGAGGGCGCTGGCGCCGGGGACACGAAACCGCAGAGCCCGGACGAGTCAGGGAGTGAGGCGCGAGCCGGGCGCCCGGGGCTCTGCAGGCGCAGGCGGCGCGGGGACAGGAGCAGGTTACCGGGCCGCCCGAGCGCTCGCACCCCGCTGAAAGAAACGCAGGCGGCCCGCCGGCTCTGCCTGGTCCGCTACCCGACCAGCTCCCGGCTCGGGGCTCAGAGCTAGGGGCTTACGCCAAGCGGAGCCCGGGGAGGGGTGCCCACCTCCCTCCGCCGCATCCCAAGCCCGGCCCCCTTGATGCGCTGGCGGCCTCGGCCGGGAACTCCGGGGTAGATGACCGTGGACAGCAGCATGAGCAGTGGGTACTGCAGCCTGGACGAGGAACTGGAAGACTGCTTCTTCACTGCTAAGACTACCTTTTTCAGAAATGCGCAGAGCAAACATCTTTCAAAGAATGTCTGTAAACCTGTGGAGGAGACACAGCGCCCGCCCACACTGCAGGAGATCAAGCAGAAGATCGACAGCTACAACACGCGAGAGAAGAACTGCCTGGGCATGAAACTGAGTGAAGACGGCACCTACACGGGTTTCATCAAAGTGCATCTGAAACTCCGGCGGCCTGTGACGGTGCCTGCTGGGATCCGGCCCCAGTCCATCTATGATGCCATCAAGGAGGTGAACCTGGCGGCTACCACGGACAAGCGGACATCCTTCTACCTGCCCCTAGATGCCATCAAGCAGCTGCACATCAGCAGCACCACCACCGTCAGTGAGGTCATCCAGGGGCTGCTCAAGAAGTTCATGGTTGTGGACAATCCCCAGAAGTTTGCACTTTTTAAGCGGATACACAAGGACGGACAAGTGCTCTTCCAGAAACTCTCCATTGCTGACCGCCCCCTCTACCTGCGCCTGCTTGCTGGGCCTGACACGGAGGTCCTCAGCTTTGTGCTAAAGGAGAATGAAACTGGAGAGGTAGAGTGGGATGCCTTCTCCATCCCTGAACTTCAGAACTTCCTAACAATCCTGGAAAAAGAGGAGCAGGACAAAATCCAACAAGTGCAAAAGAAGTATGACAAGTTTAGGCAGAAACTGGAGGAGGCCTTAAGAGAATCCCAGGGCAAACCTGGGTAACCGGTCCTGCTTCCTCTCCTCCTGGTGCATTCAGATTTATTTGTATTATTAATTATTATTTTGCAACAGACACTTTTTCTCAGGACATCTCTGGCAGGTGCATTTGTGCCTGCCCAGCAGTTCCAGCTGTGGCAAAAGTCTCTTCCATGGACAAGTGTTTGCATGGGGGTTCAGCTGTGCCCGCCCCCAGGCTGTGCCCCACCACAGATTCTGCCAAGGATCAGAACTCATGTGAAACAAACAGCTGACGTCCTCTCTCGATCTGCAAGCCTTTCACCAACCAAATAGTTGCCTCTCTCGTCACCAAACTGGAACCTCACACCAGCCGGCAAAGGAAGGAAGAAAGGTTTTAGAGCTGTGTGTTCTTTCTCTGGCTTTGATTCTTCTTTGAGTTCTCTTACTTGCCACGTACAGGACCATTATTTATGAGTGAAAAGTTGTAGCACATTCCTTTTGCAGGTCTGAGCTAAGCCCCTGAAAGCAGGGTAATGCTCATAAAAGGACTGTTCCCGCGGCCCCAAGGTGCCTGTTGTTCACACTTAAGGGAAGTTTATAAAGCTACTGGCCCCAGATGCTCAGGGTAAGGAGCACCAAAGCTGAGGCTGGCTCAGAGATCTCCAGAGAAGCTGCAGCCTGCCCTGGCCCTGGCTCTGGCCCTGGCCCACATTGCACATGGAAACCCAAAGGCATATATCTGCGTATGTGTGGTACTTAGTCACATCTTTGTCAACAAACTGTTCGTTTTTAAGTTACAAATTTGAATTTAATGTTGTCATCATCGTCATGTGTTTCCCCAAAGGGAAGCCAGTCATTGACCATTTAAAAAGTCTCCTGCTAAGTATGGAAATCAGACAGTAAGAGAAAGCCAAAAAGCAATGCAGAGAAAGGTGTCCAAGCTGTCTTCAGCCTTCCCCAGCTAAAGAGCAGAGGAGGGCCTGGGCTACTTGGGTTCCCCATCGGCCTCCAGCACTGCCTCCCTCCTCCCACTGCGACTCTGGGATCTCCAGGTGCTGCCCAAGGAGTTGCCTTGATTACAGAGAGGGGAGCCTCCAATTCGGCCAACTTGGAGTCCTTTCTGTTTTGAAGCATGGGCCAGACCCGGCACTGCGCTCGGAGAGCCGGTGGGCCTGGCCTCCCCGTCGACCTCAGTGCCTTTTTGTTTTCAGAGAGAAATAGGAGTAGGGCGAGTTTGCCTGAAGCTCTGCTGCTGGCTTCTCCTGCCAGGAAGTGAACAATGGCGGCGGTGTGGGAGACAAGGCCAGGAGAGCCCGCGTTCAGTATGGGTTGAGGGTCACAGACCTCCCTCCCATCTGGGTGCCTGAGTTTTGACTCCAATCAGTGATACCAGACCACATTGACAGGGAGGATCAAATTCCTGACTTACATTTGCACTGGCTTCTTGTTTAGGCTGAATCCTAAAATAAATTAGTCAAAAAATTCCAACAAGTAGCCAGGACTGCAGAGACACTCCAGTGCAGAGGGAGAAGGACTTGTAATTTTCAAAGCAGGGCTGGTTTTCCAACCCAGCCTCTGAGAAACCATTTCTTTGCTATCCTCTGCCTTCCCAAGTCCCTCTTGGGTCGGTTCAAGCCCAAGCTTGTTCGTGTAGCTTCAGAAGTTCCCTCTCTGACCCAGGCTGAGTCCATACTGCCCCTGATCCCAGAAGGAATGCTGACCCCTCGTCGTATGAACTGTGCATAGTCTCCAGAGCTTCAAAGGCAACACAAGCTCGCAACTCTAAGATTTTTTTAAACCACAAAAACCCTGGTTAGCCATCTCATGCTCAGCCTTATCACTTCCCTCCCTTTAGAAACTCTCTCCCTGCTGTATATTAAAGGGAGCAGGTGGAGAGTCATTTTCCTTCGTCCTGCATGTCTCTAACATTAATAGAAGGCATGGCTCCTGCTGCAACCGCTGTGAATGCTGCTGAGAACCTCCCTCTATGGGGATGGCTATTTTATTTTTGAGAAGGAAAAAAAAAGTCATGTATATATACACATAAAGGCATATAGCTATATATAAAGAGATAAGGGTGTTTATGAAATGAGAAAATTATTGGACAATTCAGACTTTACTAAAGCACAGTTAGACCCAAGGCCTATGCTGAGGTCTAAACCTCTGAAAAAAGTATAGTATCGAGTACCCGTTCCCTCCCAGAGGTGGGAGTAACTGCTGGTAGTGCCTTCTTTGGTTGTGTTGCTCAGTGTGTAAGTGTTTGTTTCCAGGATATTTTCTTTTTAAATGTCTTTCTTATATGGGTTTTAAAAAAAAGTAATAAAAGCCTGTTGCA * NH:i:1 HI:i:1 NM:i:4 MD:Z:1370C1A209A7C1906 jM:B:c,21,21,21,21 jI:B:i,206557694,206583268,206583380,206584386,206584685,206585179,206585296,206586825
I looked up this transcript in the UCSC genome browser. There is an off-by-one error at the 3' end.
Hand-checked exons of transcript:
This function finds the end position of a SAM transcript with respect to the reference genome. To write a test, I will identify some SAM examples and look at them in the UCSC genome browser so that I can record the correct end positions. Then, my test will run the function on these examples and check the results against the correct answers.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.