Git Product home page Git Product logo

gggenomes's Introduction

gggenomes

A grammar of graphics for comparative genomics

gggenomes is a versatile graphics package for comparative genomics. It extends the popular R visualization package ggplot2 by adding dedicated plot functions for genes, syntenic regions, etc. and verbs to manipulate the plot to, for example, quickly zoom in into gene neighborhoods.

A realistic use case comparing six viral genomes

gggenomes makes it easy to combine data and annotations from different sources into one comprehensive and elegant plot. Here we compare the genomic architecture of 6 viral genomes initially described in Hackl et al.: Endogenous virophages populate the genomes of a marine heterotrophic flagellate

library(gggenomes)

# to inspect the example data shipped with gggenomes
data(package="gggenomes")

gggenomes(
  genes = emale_genes, seqs = emale_seqs, links = emale_ava,
  feats = list(emale_tirs, ngaros=emale_ngaros, gc=emale_gc)) |> 
  add_sublinks(emale_prot_ava) |>
  sync() + # synchronize genome directions based on links
  geom_feat(position="identity", size=6) +
  geom_seq() +
  geom_link(data=links(2)) +
  geom_bin_label() +
  geom_gene(aes(fill=name)) +
  geom_gene_tag(aes(label=name), nudge_y=0.1, check_overlap = TRUE) +
  geom_feat(data=feats(ngaros), alpha=.3, size=10, position="identity") +
  geom_feat_note(aes(label="Ngaro-transposon"), data=feats(ngaros),
      nudge_y=.1, vjust=0) +
  geom_wiggle(aes(z=score, linetype="GC-content"), feats(gc),
      fill="lavenderblush4", position=position_nudge(y=-.2), height = .2) +
  scale_fill_brewer("Genes", palette="Dark2", na.value="cornsilk3")
  
ggsave("emales.png", width=8, height=4)

For a reproducible recipe describing the full evolution of an earlier version of this plot with an older version of gggenomes starting from a mere set of contigs, and including the bioinformatics analysis workflow, have a look at From a few sequences to a complex map in minutes.

Motivation & concept

Visualization is a corner stone of both exploratory analysis and science communication. Bioinformatics workflows, unfortunately, tend to generate a plethora of data products often in adventurous formats making it quite difficult to integrate and co-visualize the results. Instead of trying to cater to the all these different formats explicitly, gggenomes embraces the simple tidyverse-inspired credo:

  • Any data set can be transformed into one (or a few) tidy data tables
  • Any data set in a tidy data table can be easily and elegantly visualized

As a result gggenomes helps bridge the gap between data generation, visual exploration, interpretation and communication, thereby accelerating biological research.

Under the hood gggenomes uses a light-weight track system to accommodate a mix of related data sets, essentially implementing ggplot2 with multiple tidy tables instead of just one. The data in the different tables are tied together through a global genome layout that is automatically computed from the input and defines the positions of genomic sequences (chromosome/contigs) and their associated features in the plot.

Inspiration

gggenomes is draws inspiration from some brilliant packages, in particular:

Installation

gggenomes is at this point in an alpha release state, and only available as a developmental package from github.

# if you don't have it
install.packages("devtools") 

# install gggenomes
devtools::install_github("thackl/gggenomes")

# optionally install ggtree to plot genomes next to trees
# https://bioconductor.org/packages/release/bioc/html/ggtree.html
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ggtree")

gggenomes's People

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

gggenomes's Issues

generating a file like emale_prot_ava

Hello again @thackl

I was wondering how the emale_prot_ava file was generated. I created a blastp file with outfmt 6 and read that in using read_blast() as so:

FR_prot_ava <-read_blast("genes.blastp.tsv")

which looks like this:
FR_prot_ava

A tibble: 671 x 12

qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore

1 contig_26_pilon_925301_962441_1 contig_1560_pilon_1 100 147 0 0 1 147 5 151 9.58e-112 306
2 contig_26_pilon_925301_962441_1 contig_26_pilon_925301_962441_1 100 147 0 0 1 147 1 147 1.21e-111 305
3 contig_26_pilon_925301_962441_1 contig_1560_pilon_26 26.2 80 45 3 34 100 141 219 3.8 e- 1 21.9

I seem to missing the file_id feat_id feat_id2 columns. Any tips would be great!

Thanks,
Andy

Add non-R parser scripts to exec/

read_seq_len() and read_gbk() currently are wrappers around non-R based parsers. Add those to /exec. Add warning that they likely only work on unix-like OS.

Check how non-coding geneish features like tRNAs, ncRNAs, are handle by geom_gene()

geom_gene() parses the type column and specifically processes "mRNA" and "CDS" to handle spliced genes. Not sure, though, how it handles geneish features of other types.

I think they should by default be displayed as non-coding genes, in the same way that a mRNA is shown (this might even include the possibility of splicing, as known for some tRNAs)

A filter_track() family might be quite useful after all

Right now, filtering can easily be done at runtime using ... in pull_* functions. But it might make sense to provide this also as an external function. In particular in connection with computations, such as locate, this seems intuitive

read_gff3 makes all features introns of ID tag is missing

Currently, features are grouped by type and feat_id. However, in some gff, ID (which is harmonized to feat_id), is missing. That leads to feat_id=NA, and, thus, a single big group with all features...

Needs some form of fallback ID for is.na(ID) cases.

Successive ribbon and track colors

Hi, Thanks for your amazing package. I am trying to describe an inversion scénario between 2 genomes. Here is my code:

# a minimal seq track
s0 <- tibble(
  seq_id = c("M", "S1", "P"),
  length = c(14011714, 16938810, 16938810)
)

# a minimal gene track
g0 <- tibble(
  seq_id = c("M", "M", "M", "M", "M", "M",
             "S1", "S1", "S1", "S1", "S1", "S1",
             "P", "P", "P", "P", "P", "P"),
  start = c(1, 1957564, 7345956, 7944221, 10852458, 12757211,
            1, 1957564, 10253430, 10851695, 10852458, 12757211,
            1, 7612881, 12493219, 15053107, 15072904, 15671252),
  end = c(1830774, 6805033, 7898617, 10851694, 12679667, 14011714,
          1830774, 6805033, 7345956, 10299033, 12679667, 14011714,
          1832933, 1879467, 7830998, 12534142, 15670067, 16938810)
)

# a simple link track
l0 <- tibble(
  seq_id = c("M", "M", "M", "M", "M"),
  start = c(1, 1957564, 7345956, 10852458, 12757211),
  end = c(1830774, 6805033, 10851694, 12679667, 14011714),
  seq_id2 = c("S1", "S1", "S1", "S1", "S1"),
  start2 = c(1, 1957564, 10851695, 10852458, 12757211),
  end2 = c(1830774, 6805033, 7345956, 12679667, 14011714),  
)

p <- gggenomes(genes=g0, seqs=s0, links=l0)
p + 
  geom_seq() +         # draw contig/chromosome lines
  geom_seq_label() +   # label each sequence 
  geom_gene() +        # draw genes as arrow
  geom_link()          # draw some connections between syntenic regions

Sorry for trivial questions but I am trying to understand how to :

  1. Add a distinct color for each track and the corresponding ribbon.

  2. The "M" is overlapping with a ribbon, how can I put it on the side instead of under the 1st track ?

  3. How can I put ribbons between S1 and P ?

  4. Put a legend for each track (and the corresponding colors)

Thanks for your help !

Synteny plot from several segments to one specific segment in the plot

Hello, first of all I would like to thank you for this package (I find a lack of packages that allow to really plot genomic data in synteny).

I am writing this post to seek some help in the particular use of gggenomes. Indeed in my data I need to realize a rather simple but particular figure.

Here are my data:

> Segments
   Seq_id Name Start  End Strand
1      S1   G1   300  330     -1
2      S1   G2    40   65      1
3      S2   G3  1500 1530     -1
4      S3  A10   170  230     -1
5      S3   G5   600  630      1
6      S3   G6  2600 2670      1
7   Virus   G1    20   50      1
8   Virus   G2    70  110      1
9   Virus   G3   250  270     -1
10  Virus   G4   400  500     -1
11  Virus   G5   800  830      1
12  Virus   G6   900  950      1

With this dataframe I managed with the ggenes package to create the following plot :

Capture d’écran 2021-08-20 à 09 56 37

And then I would need to add synteny links between the Sn scaffolds and their corresponding Name in the Virus scaffold and get something like this :

Capture d’écran 2021-08-20 à 10 00 14

I thing ggenomes cand do that but I did not find a way to do it this way ... Can you help me to better understand the code please ? Thanks a lot for your help.

Here is the dput table format :

structure(list(Seq_id = structure(c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 
4L, 4L, 4L, 4L, 4L), .Label = c("S1", "S2", "S3", "Virus"), class = "factor"), 
    Name = structure(c(2L, 3L, 4L, 1L, 6L, 7L, 2L, 3L, 4L, 5L, 
    6L, 7L), .Label = c("A10", "G1", "G2", "G3", "G4", "G5", 
    "G6"), class = "factor"), Start = c(300L, 40L, 1500L, 170L, 
    600L, 2600L, 20L, 70L, 250L, 400L, 800L, 900L), End = c(330L, 
    65L, 1530L, 230L, 630L, 2670L, 50L, 110L, 270L, 500L, 830L, 
    950L), Strand = c(-1L, 1L, -1L, -1L, 1L, 1L, 1L, 1L, -1L, 
    -1L, 1L, 1L)), class = "data.frame", row.names = c(NA, -12L
))

Unable to import tutorial data

Hi @thackl,

Thank you so much for developing this package! There is a HUGE need for open-source solutions to visualize genomic loci I am super excited to leverage this tool :)

I was trying your tutorial and unfortunately, I am stuck at the first step:

# parse sequence length and some metadata from fasta file
emale_seqs <- read_fai("emales.fna") %>%
  extract(seq_desc, into = c("emale_type", "is_typespecies"), "=(\\S+) \\S+=(\\S+)",
    remove=F, convert=T) %>%
  arrange(emale_type, length)

I tried following your "raw data" link but the link is broken. I was however able to find emales.fna in your repo here: data-raw/emales/emales.fna after unzipping emales.tgz . Is this the correct file to start the tutorial?

When I use the file data-raw/emales/emales.fna this is the error I get:

> # parse sequence length and some metadata from fasta file
> emale_seqs <- read_fai("data-raw/emales/emales.fna") %>%
  extract(seq_desc, into = c("emale_type", "is_typespecies"), "=(\\S+) \\S+=(\\S+)",
          remove=F, convert=T) %>%
  arrange(emale_type, length)

Error: arrange() failed at implicit mutate() step. 
* Problem with `mutate()` input `..2`.
x Input `..2` must be a vector, not a primitive function.
ℹ Input `..2` is `length`.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two. 
2: Too many col_names, ignoring extra ones

Thanks again for making this open-source package!

Cheers,
Matt

Can I cite an alpha release version in my paper

Hi.
Thank you very much for developing useful software.

gggenomes is currently in alpha release (0.0.0.9000) but is it possible to cite it in a paper using the current package?

Thomas Hackl and Markus J. Ankenbrand (2020). gggenomes: A Grammar of Graphics for Comparative Genomics. R package version 0.0.0.9000. https://github.com/thackl/gggenomes

gggenomes make my bacterial genome data very interpretable using the example code.
It is very cool that the protein domain label can be shown in comparative genomics results!

If you do not recommend using it yet, I would consider using another software.

How to handle buggy Augustus gffs

Came across this when working on #64 with Augustus example data. Related to Gaius-Augustus/Augustus#74.

Augustus does not set CDS and exon IDs correctly - one ID per line/range instead of one ID for all lines belonging to the same connected CDS/exons. This messes with the plots, because I assume that every CDS with a unique ID is a complete CDS and thus should have its own mRNA...

Not sure how and to what extent to address this within gggenomes. Some fixes are described here Gaius-Augustus/Augustus#74

f0 <- read_gff3("aten_0.1.m1.802.gff3")
gggenomes(f0) + geom_gene(position="pile")

augustus-cds-issue

Text associated with overlapping gene regions

Hi Thomas,
Thank you for letting me know about gggenomes for dealing with overlapping gene regions. I am having some trouble getting the gene labels to line up with the offset gene regions. I attached an image, and I have my code below. Is there a way to deal with this? It will be perfect if I can just get them lined up!
Thanks for your help!
Julia

library(gggenomes)
colourCount = 30
getPalette = colorRampPalette(brewer.pal(9, "Set1"))
genesDF<-read.table("sars-cov-2_Annotations_gggenomes.txt",sep="\t",header=TRUE)
gggenomes(genes=genesDF) +
  geom_gene(aes(fill=gene), position="pile",
            size = 4, shape = c(0), rna_size = 2, intron_shape = 0, stroke=0) +
  geom_gene_tag(aes(label=gene, check_overlap = TRUE)) +
  scale_fill_manual(values = getPalette(colourCount)) + 
  theme(legend.position = "none")
ggsave("gggenome.png", width=8, height=2)

gggenome

Add read_alitv() to context-specific read_* functions

I've added smart read_* functions (read_seqs/read_feats/read_links, ...) with automatic detection of input formats. These functions determine format-specific read_* functions such as read_alitv based on file extensions and the given context (seqs/feats/links/...). For example read_seqs(.gbk) will interpret .gbk in context "seqs" as format "seq_len" and call read_seq_len(.gbk) to parse sequence length information, while read_feats(.gbk) will interpret .gbk in context "feats" as format "gbk" and call read_gbk(.gbk) to read in genes, etc.

read_alitv currently returns a list of tables already representing the three different contexts the .json file can be read in. It should be quite straight-forward to make alitv files readable via read_seqs/feats/links.

Have a look at gggenomes_global$file_formats (global.R) which holds a list of file extensions, formats and contexts known to gggenomes, and used by the smart read functions. You can just add context-specific alitv-formats to seqs and feats. And you will have to add a new links context. So far, that wasn't necessary because for all other formats "links" and "feats" context had the same parser (blast tab files for example). Note, the name of the format should be identical to the suffix of the read_ function. So you might have to add the format alitv_seqs, alitv_genes and alitv_links (or whatever is appropriate), and could just add quick function aliases ala read_alitv_seqs <- function(...) read_alitv(...)[["seqs"]] to make read_seqs(.json) work.

If you think, .json is not specific to alitv, you can add it to the format "ambiguous". Every extension in this group requires the user to explicitly provide the format...

Oh, and you should put all read_alitv functions etc. in their own file.

Also, this is all still quite experimental, so let me know if you think it is bull or similar ;)

pick() broken - returns NULL

When using pick() on a plot, no updated plot is generated, and nothing is returned. But there aren't any errors either...

Support for short and long read alignments - read _bam / geom_read

Support read mapping visualization. Simple viz should be straight-forward ala geom_feat(position="pile") with some sensible defaults for large numbers of features. For reading .bam could borrow from bioconductor.

Might make sense to add some sort of cigar support to visualize errors - one issue would be, though that mismatches aren't explicitly encoded in bam (unless -X). Would need reference sequence to infer...

Down the line, would be cool to handle complex SVs from long reads - with supplementary alignments in style similar to linked exons.

Bug in focus() when supplying loci

Thanks for writing gggenomes, it's great. There is a bug in focus():

if(!has_name("locus_id")){

I think this should be has_name(loci, "locus_id"). Without it, I am getting the following error when using focus(.loci = loci) with a tibble loci with bin_id, seq_id, start and end columns:

Error : argument "name" is missing, with no default

make pick() work on bins not seqs

Currently, pick_bins() works on bins while pick() works on seqs. But playing around I noticed that the most common case and the more natural API would be to have pick() work on bins and provide an additional method for pick_seqs().

issue with read_seqs()

Hi @thackl

I am very interested in trying out your package (looks amazing!). However i am having trouble importing data:

read_seqs(ex("emales/emales.fna"), parse_desc=FALSE)
Reading 'fasta' with read_seq_len():

  • file_id: emales [C:/Users/leua/Documents/R/win-library/4.0/gggenomes/extdata/emales/emales.fna]

A tibble: 0 x 1

... with 1 variable: file_id

Do you know why that is?

Thanks,
Andy

"#" character in attributes breaks read_gff3()

read_gff3() is based on readr::read_tsv(). I use its comment="#" to skip comment lines. However, read_tsv() treats every occurrence of a "#" as the start of a comment, not just at the start of a line, and therefore, trims lines containing "#"...

Draw all links and not only those between adjacent genomes

By default links are only drawn between adjacent genomes. In most cases, computing all links is slow and drawing all links is slow and usually way too messy to be informative.

adjacent_only=TRUE is implemented as flag for layout_links, however, currently cannot be modified because gggenomes only passes through arguments to layout_seqs. Internally, there is already a args_links parameter, but it is not yet used. Related to #69. Will need to modify gggenomes API for layout args to make this work properly.

Also, geom_link() should also get an adjacent_only flag, which if set to FALSE should throw an error if the layout wasn't computed with adjacent_only=FALSE in the first place.

Test piping through non-R parsers (/exex/{gb2gff,seq-len}) on non-linux systems

read_seqs(".fna/.gbk/.gff") and read_gbk() use non-R parsers to preprocess the data before reading them into R. This is just done by turning the file argument provided to the read_* function ("foo.fna") into a connection ala pipe("/path/to/seq-len foo.fna"). This should work on linux systems, but probably not on Windows, and I'm not sure about MacOS...

flip_nicely() failes on protein links

sha-1bfbdfe

gggenomes(genes=emale_genes) %>%
  add_sublinks(emale_prot_ava) %>%
  flip_nicely() +
  geom_seq() + geom_gene() + geom_link()

Says "All bins appear to be flipped nicely" - which they clearly aren't. Might have something to do with how "strand" information is processed when transforming sublinks to links...

About read_gff3 and intron visualization

Hi!

Thank you for creating this excellent tool.
I have a question about drawing intron using read_gff3 and geom_gene.

I am interested in the gene structure of corals and tried to draw them from the attached gff file (This is data from a public database).
I know that the structure is as follows based on the database.

スクリーンショット 2021-07-22 2 21 06

read_gff3("~/Desktop/aten_0.1.m1.802.gff") %>% 
            gggenomes() + geom_gene(aes(fill=type, group=type), position = "pile")

スクリーンショット 2021-07-22 2 24 43

However, the mRNA structure is different from what I expected.

I noticed that the order of the numeric in vectors in the introns column was wrong, so I did some sorting.

read_gff3("~/Desktop/aten_0.1.m1.802.gff") %>% 
            mutate(introns = map(introns, sort)) %>%
            gggenomes() + geom_gene(aes(fill=type, group=type), position = "pile")

スクリーンショット 2021-07-22 2 25 48

However, the structure is still different.

I really appreciate any help you can provide.

Keigo

aten_0.1.m1.802.gff.txt

Find better way for x-axis than to overwrite `scale_x_continuous`

Currently I export scale_x_continuous(labels=labels_bp()) to get a nice x-axis. It's convenient and works without the warning that I get when I just manually add an x-scale during plot construction. Problem is, the exported function overwrites scale_x_continuous also for ggplots... And that's definitely not a behavior I want.

Maybe it would work if just not exported the function...

Package examples fail with: `Error in set_class(., "gggenomes_layout", "prepend")`

Hi, I'm having an issue with the package with examples failing:

library(gggenomes)

# to inspect the example data shipped with gggenomes
data(package="gggenomes")

gggenomes(emale_genes, emale_seqs, emale_tirs, emale_ava) %>%
  add_feats(ngaros=emale_ngaros, gc=emale_gc) %>%
  add_sublinks(emale_prot_ava) %>%
  flip_by_links() +
  geom_feat(position="identity", size=6) +
  geom_seq() +
  geom_link(data=links(2)) +
  geom_bin_label() +
  geom_gene(aes(fill=name)) +
  geom_gene_tag(aes(label=name), nudge_y=0.1, check_overlap = TRUE) +
  geom_feat(data=feats(ngaros), alpha=.3, size=10, position="identity") +
  geom_feat_note(aes(label="Ngaro-transposon"), feats(ngaros),
      nudge_y=.1, vjust=0) +
  geom_ribbon(aes(x=(x+xend)/2, ymax=y+.24, ymin=y+.38-(.4*score),
      group=seq_id, linetype="GC-content"), feats(gc),
      fill="lavenderblush4", position=position_nudge(y=-.1)) +
  scale_fill_brewer("Genes", palette="Dark2", na.value="cornsilk3")
## Error in set_class(., "gggenomes_layout", "prepend") : 
##  3 arguments passed to 'class<-' which requires 2

This error is minimally reproduced as follows:

gggenomes(emale_genes, emale_seqs, emale_tirs, emale_ava)
## Error in set_class(., "gggenomes_layout", "prepend") : 
##  3 arguments passed to 'class<-' which requires 2

My session:

sessionInfo()
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 33 (Workstation Edition)

## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.17.so

## locale:
##  [1] LC_CTYPE=es_UY.UTF-8       LC_NUMERIC=C               LC_TIME=es_UY.UTF-8        LC_COLLATE=es_UY.UTF-8     LC_MONETARY=es_UY.UTF-8   
##  [6] LC_MESSAGES=es_UY.UTF-8    LC_PAPER=es_UY.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_UY.UTF-8 LC_IDENTIFICATION=C       

## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

## other attached packages:
## [1] profvis_0.3.7        magrittr_2.0.1       S4Vectors_0.28.1     BiocGenerics_0.36.1  gggenomes_0.9.3.9000 snakecase_0.11.0    
##  [7] jsonlite_1.7.2       tibble_3.1.4         thacklr_0.0.0.9000   tidyr_1.1.3          stringr_1.4.0        readr_2.0.1         
## [13] purrr_0.3.4          gggenes_0.4.1        ggplot2_3.3.5        dplyr_1.0.7         

## loaded via a namespace (and not attached):
## [1] pillar_1.6.2      compiler_4.0.1    tools_4.0.1       digest_0.6.27     ggfittext_0.9.1   lifecycle_1.0.0   gtable_0.3.0     
##  [8] pkgconfig_2.0.3   rlang_0.4.11      cli_3.0.1         rstudioapi_0.13   DBI_1.1.1         yaml_2.2.1        fastmap_1.1.0    
## [15] withr_2.4.2       htmlwidgets_1.5.3 generics_0.1.0    vctrs_0.3.8       hms_1.1.0         grid_4.0.1        tidyselect_1.1.1 
## [22] glue_1.4.2        R6_2.5.1          fansi_0.5.0       tzdb_0.1.2        htmltools_0.5.2   scales_1.1.1      ellipsis_0.3.2   
## [29] assertthat_0.2.1  colorspace_2.0-2  utf8_1.2.2        stringi_1.7.4     munsell_0.5.0     crayon_1.4.1     

Great package btw, thanks!

write a help-topic on gggenomes table design and coordinate system

Here a somewhat random thought on coordinates:

There's a bunch of different standards for specifing feature coordinates along biological sequences, i.e. start-end or start-length, 0-offset or 1-offset, end-open or end-included.

https://angus.readthedocs.io/en/2014/_static/2014-lecture4-genomic-intervals.pdf
https://www.biostars.org/p/84686/
https://www.biostars.org/p/6373/
https://books.google.com/books?id=FRERCgAAQBAJ&lpg=PT412&ots=GKmlzvAYey&dq=gff%20zero-width&pg=PT412#v=onepage&q=gff%20zero-width&f=false

1[start,end]   1 2 3 4   length: end-start+1
               | | | | 
Sequence       A T G C
              | | | | |
0[start,end)  0 1 2 3 4  length: end-start

Strand:
all but blast give coords on forward strand + specify strand by +/-

1[start,end] GTF/GFF, SAM, VCF, Wiggle, GenomicRanges, GenBank/EMBL Feature Table, (Blast)

  • start-end - start <= end !!
  • 1-offset
  • end-inclusive
  • zero-length to right of indicated base
  • strand +/-
  • Blast feature are not annotated on forward-strand only
  • http://gmod.org/wiki/GFF3
  • Coluns 4 & 5: "start" and "end"
  • The start and end of the feature, in 1-based integer coordinates, relative to
    the landmark given in column 1. Start is always less than or equal to end.
  • For zero-length features, such as insertion sites, start equals end and the
    implied site is to the right of the indicated base in the direction of the
    landmark. These fields are required.

0[start,end) BED, BAM, BCF

  • 0[start,end)
  • start-end
  • 0-offset
  • end-exclusive [ start,end )
  • strand +/-

reverse_trans() for y-axis has odd side-effects

Currently I use reverse_trans on the y-axis to get an intuitive order of genomes with the first at the top. This, however, has odd side-effects on y-based operations. For example, one need to use negative values to move thing upward, or use y-limits vom bigger to smaller to keep the reversed scale.

I think I do definitely want to keep the order of the bins with the first bin at the top, so just doing bin1: y=1, ... isn't an option. It might, however, work if I do an inverse mapping: bin_n: y=1, bin_n-1: y=2, bin_1: y=n. It's easy to change in the layout. Seems to mostly work. It does, however, break wrapping (which I would just have to reimplement for handed inverted coords).

This change, would, however, also introduce another inconsistency: currently pick(1) can be read as pick the first bin or the bin at y=1. If I do an inverted mapping, pick(1) should probably still reference the first bin, but it would then no longer match y...

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.