thackl / gggenomes Goto Github PK

View Code? Open in Web Editor NEW

554.0 12.0 62.0 96.99 MB

A grammar of graphics for comparative genomics

Home Page: https://thackl.github.io/gggenomes/

License: Other

R 95.09% CSS 0.08% JavaScript 0.43% Perl 4.40%

comparative-genomics ggplot2 biological-data ggplot-extension genomics-visualization

gggenomes's Introduction

gggenomes

A grammar of graphics for comparative genomics

gggenomes is a versatile graphics package for comparative genomics. It extends the popular R visualization package ggplot2 by adding dedicated plot functions for genes, syntenic regions, etc. and verbs to manipulate the plot to, for example, quickly zoom in into gene neighborhoods.

A realistic use case comparing six viral genomes

gggenomes makes it easy to combine data and annotations from different sources into one comprehensive and elegant plot. Here we compare the genomic architecture of 6 viral genomes initially described in Hackl et al.: Endogenous virophages populate the genomes of a marine heterotrophic flagellate

library(gggenomes)

# to inspect the example data shipped with gggenomes
data(package="gggenomes")

gggenomes(
  genes = emale_genes, seqs = emale_seqs, links = emale_ava,
  feats = list(emale_tirs, ngaros=emale_ngaros, gc=emale_gc)) |> 
  add_sublinks(emale_prot_ava) |>
  sync() + # synchronize genome directions based on links
  geom_feat(position="identity", size=6) +
  geom_seq() +
  geom_link(data=links(2)) +
  geom_bin_label() +
  geom_gene(aes(fill=name)) +
  geom_gene_tag(aes(label=name), nudge_y=0.1, check_overlap = TRUE) +
  geom_feat(data=feats(ngaros), alpha=.3, size=10, position="identity") +
  geom_feat_note(aes(label="Ngaro-transposon"), data=feats(ngaros),
      nudge_y=.1, vjust=0) +
  geom_wiggle(aes(z=score, linetype="GC-content"), feats(gc),
      fill="lavenderblush4", position=position_nudge(y=-.2), height = .2) +
  scale_fill_brewer("Genes", palette="Dark2", na.value="cornsilk3")
  
ggsave("emales.png", width=8, height=4)

For a reproducible recipe describing the full evolution of an earlier version of this plot with an older version of gggenomes starting from a mere set of contigs, and including the bioinformatics analysis workflow, have a look at From a few sequences to a complex map in minutes.

Motivation & concept

Visualization is a corner stone of both exploratory analysis and science communication. Bioinformatics workflows, unfortunately, tend to generate a plethora of data products often in adventurous formats making it quite difficult to integrate and co-visualize the results. Instead of trying to cater to the all these different formats explicitly, gggenomes embraces the simple tidyverse-inspired credo:

Any data set can be transformed into one (or a few) tidy data tables
Any data set in a tidy data table can be easily and elegantly visualized

As a result gggenomes helps bridge the gap between data generation, visual exploration, interpretation and communication, thereby accelerating biological research.

Under the hood gggenomes uses a light-weight track system to accommodate a mix of related data sets, essentially implementing ggplot2 with multiple tidy tables instead of just one. The data in the different tables are tied together through a global genome layout that is automatically computed from the input and defines the positions of genomic sequences (chromosome/contigs) and their associated features in the plot.

Inspiration

gggenomes is draws inspiration from some brilliant packages, in particular:

Installation

gggenomes is at this point in an alpha release state, and only available as a developmental package from github.

# if you don't have it
install.packages("devtools") 

# install gggenomes
devtools::install_github("thackl/gggenomes")

# optionally install ggtree to plot genomes next to trees
# https://bioconductor.org/packages/release/bioc/html/ggtree.html
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("ggtree")

gggenomes's People

Stargazers

Watchers

gggenomes's Issues

generating a file like emale_prot_ava

Hello again @thackl

I was wondering how the emale_prot_ava file was generated. I created a blastp file with outfmt 6 and read that in using read_blast() as so:

FR_prot_ava <-read_blast("genes.blastp.tsv")

which looks like this:
FR_prot_ava

A tibble: 671 x 12

qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore

1 contig_26_pilon_925301_962441_1 contig_1560_pilon_1 100 147 0 0 1 147 5 151 9.58e-112 306
2 contig_26_pilon_925301_962441_1 contig_26_pilon_925301_962441_1 100 147 0 0 1 147 1 147 1.21e-111 305
3 contig_26_pilon_925301_962441_1 contig_1560_pilon_26 26.2 80 45 3 34 100 141 219 3.8 e- 1 21.9

I seem to missing the file_id feat_id feat_id2 columns. Any tips would be great!

Thanks,
Andy

Add non-R parser scripts to exec/

read_seq_len() and read_gbk() currently are wrappers around non-R based parsers. Add those to /exec. Add warning that they likely only work on unix-like OS.

Check how non-coding geneish features like tRNAs, ncRNAs, are handle by geom_gene()

geom_gene() parses the type column and specifically processes "mRNA" and "CDS" to handle spliced genes. Not sure, though, how it handles geneish features of other types.

I think they should by default be displayed as non-coding genes, in the same way that a mRNA is shown (this might even include the possibility of splicing, as known for some tRNAs)

Support Methylation and ChIP-Seq data

Add functionality to visualize methylation patterns or ChIPseq results. This needs good example data first.

Last gggenomes example failed with mysterious error

https://thackl.github.io/gggenomes/reference/gggenomes.html

After reading the second file apparently successful, it just says "Error:" but without any error message...

Update landing page example to match breaking changes

order of genes/seqs, flip_nicely, ...

A filter_track() family might be quite useful after all

Right now, filtering can easily be done at runtime using ... in pull_* functions. But it might make sense to provide this also as an external function. In particular in connection with computations, such as locate, this seems intuitive

read_gff3 makes all features introns of ID tag is missing

Currently, features are grouped by type and feat_id. However, in some gff, ID (which is harmonized to feat_id), is missing. That leads to feat_id=NA, and, thus, a single big group with all features...

Needs some form of fallback ID for is.na(ID) cases.

Reading gff3 with ##FASTA section broken

sha-1bfbdfe

Add a relocate()-like analog for pick(), maybe put()?

pick() is for genomes what dplyr::select() is for columns.

put() (or a similar name) could become an analog to dplyr::relocate() for bins and seqs.

Successive ribbon and track colors

Hi, Thanks for your amazing package. I am trying to describe an inversion scénario between 2 genomes. Here is my code:

# a minimal seq track
s0 <- tibble(
  seq_id = c("M", "S1", "P"),
  length = c(14011714, 16938810, 16938810)
)

# a minimal gene track
g0 <- tibble(
  seq_id = c("M", "M", "M", "M", "M", "M",
             "S1", "S1", "S1", "S1", "S1", "S1",
             "P", "P", "P", "P", "P", "P"),
  start = c(1, 1957564, 7345956, 7944221, 10852458, 12757211,
            1, 1957564, 10253430, 10851695, 10852458, 12757211,
            1, 7612881, 12493219, 15053107, 15072904, 15671252),
  end = c(1830774, 6805033, 7898617, 10851694, 12679667, 14011714,
          1830774, 6805033, 7345956, 10299033, 12679667, 14011714,
          1832933, 1879467, 7830998, 12534142, 15670067, 16938810)
)

# a simple link track
l0 <- tibble(
  seq_id = c("M", "M", "M", "M", "M"),
  start = c(1, 1957564, 7345956, 10852458, 12757211),
  end = c(1830774, 6805033, 10851694, 12679667, 14011714),
  seq_id2 = c("S1", "S1", "S1", "S1", "S1"),
  start2 = c(1, 1957564, 10851695, 10852458, 12757211),
  end2 = c(1830774, 6805033, 7345956, 12679667, 14011714),  
)

p <- gggenomes(genes=g0, seqs=s0, links=l0)
p + 
  geom_seq() +         # draw contig/chromosome lines
  geom_seq_label() +   # label each sequence 
  geom_gene() +        # draw genes as arrow
  geom_link()          # draw some connections between syntenic regions

Sorry for trivial questions but I am trying to understand how to :

Add a distinct color for each track and the corresponding ribbon.
The "M" is overlapping with a ribbon, how can I put it on the side instead of under the 1st track ?
How can I put ribbons between S1 and P ?
Put a legend for each track (and the corresponding colors)

Thanks for your help !

Synteny plot from several segments to one specific segment in the plot

Hello, first of all I would like to thank you for this package (I find a lack of packages that allow to really plot genomic data in synteny).

I am writing this post to seek some help in the particular use of gggenomes. Indeed in my data I need to realize a rather simple but particular figure.

Here are my data:

> Segments
   Seq_id Name Start  End Strand
1      S1   G1   300  330     -1
2      S1   G2    40   65      1
3      S2   G3  1500 1530     -1
4      S3  A10   170  230     -1
5      S3   G5   600  630      1
6      S3   G6  2600 2670      1
7   Virus   G1    20   50      1
8   Virus   G2    70  110      1
9   Virus   G3   250  270     -1
10  Virus   G4   400  500     -1
11  Virus   G5   800  830      1
12  Virus   G6   900  950      1

With this dataframe I managed with the ggenes package to create the following plot :

And then I would need to add synteny links between the Sn scaffolds and their corresponding Name in the Virus scaffold and get something like this :

I thing ggenomes cand do that but I did not find a way to do it this way ... Can you help me to better understand the code please ? Thanks a lot for your help.

Here is the dput table format :

structure(list(Seq_id = structure(c(1L, 1L, 2L, 3L, 3L, 3L, 4L, 
4L, 4L, 4L, 4L, 4L), .Label = c("S1", "S2", "S3", "Virus"), class = "factor"), 
    Name = structure(c(2L, 3L, 4L, 1L, 6L, 7L, 2L, 3L, 4L, 5L, 
    6L, 7L), .Label = c("A10", "G1", "G2", "G3", "G4", "G5", 
    "G6"), class = "factor"), Start = c(300L, 40L, 1500L, 170L, 
    600L, 2600L, 20L, 70L, 250L, 400L, 800L, 900L), End = c(330L, 
    65L, 1530L, 230L, 630L, 2670L, 50L, 110L, 270L, 500L, 830L, 
    950L), Strand = c(-1L, 1L, -1L, -1L, 1L, 1L, 1L, 1L, -1L, 
    -1L, 1L, 1L)), class = "data.frame", row.names = c(NA, -12L
))

Unable to import tutorial data

Hi @thackl,

Thank you so much for developing this package! There is a HUGE need for open-source solutions to visualize genomic loci I am super excited to leverage this tool :)

I was trying your tutorial and unfortunately, I am stuck at the first step:

# parse sequence length and some metadata from fasta file
emale_seqs <- read_fai("emales.fna") %>%
  extract(seq_desc, into = c("emale_type", "is_typespecies"), "=(\\S+) \\S+=(\\S+)",
    remove=F, convert=T) %>%
  arrange(emale_type, length)

I tried following your "raw data" link but the link is broken. I was however able to find emales.fna in your repo here: data-raw/emales/emales.fna after unzipping emales.tgz . Is this the correct file to start the tutorial?

When I use the file data-raw/emales/emales.fna this is the error I get:

> # parse sequence length and some metadata from fasta file
> emale_seqs <- read_fai("data-raw/emales/emales.fna") %>%
  extract(seq_desc, into = c("emale_type", "is_typespecies"), "=(\\S+) \\S+=(\\S+)",
          remove=F, convert=T) %>%
  arrange(emale_type, length)

Error: arrange() failed at implicit mutate() step. 
* Problem with `mutate()` input `..2`.
x Input `..2` must be a vector, not a primitive function.
ℹ Input `..2` is `length`.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two. 
2: Too many col_names, ignoring extra ones

Thanks again for making this open-source package!

Cheers,
Matt

add_subfeats .transform="none" broken

https://thackl.github.io/gggenomes/reference/add_tracks.html

Improve docu for add_track functions

Describe all add_feats/links/subfeats/sublinks/clusters in one, improved doc with examples

Can I cite an alpha release version in my paper

Hi.
Thank you very much for developing useful software.

gggenomes is currently in alpha release (0.0.0.9000) but is it possible to cite it in a paper using the current package?

Thomas Hackl and Markus J. Ankenbrand (2020). gggenomes: A Grammar of Graphics for Comparative Genomics. R package version 0.0.0.9000. https://github.com/thackl/gggenomes

gggenomes make my bacterial genome data very interpretable using the example code.
It is very cool that the protein domain label can be shown in comparative genomics results!

If you do not recommend using it yet, I would consider using another software.

How to handle buggy Augustus gffs

Came across this when working on #64 with Augustus example data. Related to Gaius-Augustus/Augustus#74.

Augustus does not set CDS and exon IDs correctly - one ID per line/range instead of one ID for all lines belonging to the same connected CDS/exons. This messes with the plots, because I assume that every CDS with a unique ID is a complete CDS and thus should have its own mRNA...

Not sure how and to what extent to address this within gggenomes. Some fixes are described here Gaius-Augustus/Augustus#74

f0 <- read_gff3("aten_0.1.m1.802.gff3")
gggenomes(f0) + geom_gene(position="pile")

Text associated with overlapping gene regions

Hi Thomas,
Thank you for letting me know about gggenomes for dealing with overlapping gene regions. I am having some trouble getting the gene labels to line up with the offset gene regions. I attached an image, and I have my code below. Is there a way to deal with this? It will be perfect if I can just get them lined up!
Thanks for your help!
Julia

library(gggenomes)
colourCount = 30
getPalette = colorRampPalette(brewer.pal(9, "Set1"))
genesDF<-read.table("sars-cov-2_Annotations_gggenomes.txt",sep="\t",header=TRUE)
gggenomes(genes=genesDF) +
  geom_gene(aes(fill=gene), position="pile",
            size = 4, shape = c(0), rna_size = 2, intron_shape = 0, stroke=0) +
  geom_gene_tag(aes(label=gene, check_overlap = TRUE)) +
  scale_fill_manual(values = getPalette(colourCount)) + 
  theme(legend.position = "none")
ggsave("gggenome.png", width=8, height=2)

Mark internal functions with @keywords internal to hide them from reference index

pkgdown throws errors if documented functions are "missing topics" when building the reference index. Right now we have all functions in the index, also some just have doc and relevance for internal usage, and ideally should not be part of the index.

Apparently @keywords internal can be used to hide those r-lib/pkgdown#1378 (comment)

Add life-cycle badges to function

There still a lot of stuff in flux. Would be very useful to have some sort of idea which functions are stable/maturing/experimental, ...

https://lifecycle.r-lib.org/articles/lifecycle.html

include with r lifecycle::badge("deprecated"), see https://github.com/tidyverse/tidyr/blob/master/R/nest.R

Clean up text geoms and add gggenes fit label support

Add gggenomes() usage doc/examples

Add read_alitv() to context-specific read_* functions

I've added smart read_* functions (read_seqs/read_feats/read_links, ...) with automatic detection of input formats. These functions determine format-specific read_* functions such as read_alitv based on file extensions and the given context (seqs/feats/links/...). For example read_seqs(.gbk) will interpret .gbk in context "seqs" as format "seq_len" and call read_seq_len(.gbk) to parse sequence length information, while read_feats(.gbk) will interpret .gbk in context "feats" as format "gbk" and call read_gbk(.gbk) to read in genes, etc.

read_alitv currently returns a list of tables already representing the three different contexts the .json file can be read in. It should be quite straight-forward to make alitv files readable via read_seqs/feats/links.

Have a look at gggenomes_global$file_formats (global.R) which holds a list of file extensions, formats and contexts known to gggenomes, and used by the smart read functions. You can just add context-specific alitv-formats to seqs and feats. And you will have to add a new links context. So far, that wasn't necessary because for all other formats "links" and "feats" context had the same parser (blast tab files for example). Note, the name of the format should be identical to the suffix of the read_ function. So you might have to add the format alitv_seqs, alitv_genes and alitv_links (or whatever is appropriate), and could just add quick function aliases ala read_alitv_seqs <- function(...) read_alitv(...)[["seqs"]] to make read_seqs(.json) work.

If you think, .json is not specific to alitv, you can add it to the format "ambiguous". Every extension in this group requires the user to explicitly provide the format...

Oh, and you should put all read_alitv functions etc. in their own file.

Also, this is all still quite experimental, so let me know if you think it is bull or similar ;)

pick() broken - returns NULL

When using pick() on a plot, no updated plot is generated, and nothing is returned. But there aren't any errors either...

Support for short and long read alignments - read _bam / geom_read

Support read mapping visualization. Simple viz should be straight-forward ala geom_feat(position="pile") with some sensible defaults for large numbers of features. For reading .bam could borrow from bioconductor.

Might make sense to add some sort of cigar support to visualize errors - one issue would be, though that mismatches aren't explicitly encoded in bam (unless -X). Would need reference sequence to infer...

Down the line, would be cool to handle complex SVs from long reads - with supplementary alignments in style similar to linked exons.

Bug in focus() when supplying loci

Thanks for writing gggenomes, it's great. There is a bug in focus():

gggenomes/R/focus.R

Line 113 in a3b92d8

if(!has_name("locus_id")){

I think this should be has_name(loci, "locus_id"). Without it, I am getting the following error when using focus(.loci = loci) with a tibble loci with bin_id, seq_id, start and end columns:

Error : argument "name" is missing, with no default

Example and maybe support for RNA-Seq data

Use tracks to show expression over different conditions/timepoints. Show, for example, viral infection cycle - link to genomic organization

make pick() work on bins not seqs

Currently, pick_bins() works on bins while pick() works on seqs. But playing around I noticed that the most common case and the more natural API would be to have pick() work on bins and provide an additional method for pick_seqs().

issue with read_seqs()

Hi @thackl

I am very interested in trying out your package (looks amazing!). However i am having trouble importing data:

read_seqs(ex("emales/emales.fna"), parse_desc=FALSE)
Reading 'fasta' with read_seq_len():

file_id: emales [C:/Users/leua/Documents/R/win-library/4.0/gggenomes/extdata/emales/emales.fna]

A tibble: 0 x 1

... with 1 variable: file_id

Do you know why that is?

Thanks,
Andy

Update landing page figure with new style

Use fancy example from gggenomes docs

Add a function similar to focus but that just "hilights" new loci as features (no zoom)

Could be locate. Same semantics as focus, but adds a new track instead with loci as features.

piling isoforms broken in pkgdown examples

This is odd - in the function reference produced with pkgdown the geom_gene() examples with "pile"ed up isoforms are broken. On my local machine, however, they work just fine...

@iimog can you run that example on your end and let me know whether it works for you?

Document all the geom's

Make sure all geom's have proper documentation and examples

"#" character in attributes breaks read_gff3()

read_gff3() is based on readr::read_tsv(). I use its comment="#" to skip comment lines. However, read_tsv() treats every occurrence of a "#" as the start of a comment, not just at the start of a line, and therefore, trims lines containing "#"...

Fix column width in @eval file_formats_rd() and add to def_names()/def_types()

Draw all links and not only those between adjacent genomes

By default links are only drawn between adjacent genomes. In most cases, computing all links is slow and drawing all links is slow and usually way too messy to be informative.

adjacent_only=TRUE is implemented as flag for layout_links, however, currently cannot be modified because gggenomes only passes through arguments to layout_seqs. Internally, there is already a args_links parameter, but it is not yet used. Related to #69. Will need to modify gggenomes API for layout args to make this work properly.

Also, geom_link() should also get an adjacent_only flag, which if set to FALSE should throw an error if the layout wasn't computed with adjacent_only=FALSE in the first place.

Make intron coords consistently [1-based) the same way start/end are

sha-1bfbdfe

f0 <- tibble(
  seq_id="a",
  start=1,
  end = 3,
  introns=list(c(2,2)),
  )
gggenomes(genes=f0) +
  geom_gene(position="pile") + theme_bw()

Currently produces a 0-length intron between 2 and 3. Should, however produce a 1bp intron covering pos 2.

Test piping through non-R parsers (/exex/{gb2gff,seq-len}) on non-linux systems

read_seqs(".fna/.gbk/.gff") and read_gbk() use non-R parsers to preprocess the data before reading them into R. This is just done by turning the file argument provided to the read_* function ("foo.fna") into a connection ala pipe("/path/to/seq-len foo.fna"). This should work on linux systems, but probably not on Windows, and I'm not sure about MacOS...

flip_nicely() failes on protein links

sha-1bfbdfe

gggenomes(genes=emale_genes) %>%
  add_sublinks(emale_prot_ava) %>%
  flip_nicely() +
  geom_seq() + geom_gene() + geom_link()

Says "All bins appear to be flipped nicely" - which they clearly aren't. Might have something to do with how "strand" information is processed when transforming sublinks to links...

Add Hmisc package to CI

For the new geom_wiggle() function I am using some stats methods from Hmisc (which is also used by some ggplot2 stat summary functions). Could you add it to the CI @iimog?

parse key=value tags from fasta headers

# something along the lines of:
extract(seq_desc, into = c("emale_type", "is_typespecies"), "=(\\S+) \\S+=(\\S+)", remove=FALSE, convert=TRUE)

Error in pick_by_tree documentation example

The examples for 'pick_by_tree` are broken in the documentation https://thackl.github.io/gggenomes/reference/pick.html. They do work in local builds. There seems to be an issue with reading either the tree file or with converting it to a ggtree object. @iimog, maybe you can take a look...

About read_gff3 and intron visualization

Hi!

Thank you for creating this excellent tool.
I have a question about drawing intron using read_gff3 and geom_gene.

I am interested in the gene structure of corals and tried to draw them from the attached gff file (This is data from a public database).
I know that the structure is as follows based on the database.

read_gff3("~/Desktop/aten_0.1.m1.802.gff") %>% 
            gggenomes() + geom_gene(aes(fill=type, group=type), position = "pile")

However, the mRNA structure is different from what I expected.

I noticed that the order of the numeric in vectors in the introns column was wrong, so I did some sorting.

read_gff3("~/Desktop/aten_0.1.m1.802.gff") %>% 
            mutate(introns = map(introns, sort)) %>%
            gggenomes() + geom_gene(aes(fill=type, group=type), position = "pile")

However, the structure is still different.

I really appreciate any help you can provide.

Keigo

aten_0.1.m1.802.gff.txt

Find better way for x-axis than to overwrite `scale_x_continuous`

Currently I export scale_x_continuous(labels=labels_bp()) to get a nice x-axis. It's convenient and works without the warning that I get when I just manually add an x-scale during plot construction. Problem is, the exported function overwrites scale_x_continuous also for ggplots... And that's definitely not a behavior I want.

Maybe it would work if just not exported the function...

Default position for geom_feat should be "identity", or pile with "0" offset

It's more often than not irritating to have features "hover" above the sequence, instead of just being plotted right on top. That happens because pile by default has an offset from the 0-line

Package examples fail with: `Error in set_class(., "gggenomes_layout", "prepend")`

Hi, I'm having an issue with the package with examples failing:

library(gggenomes)

# to inspect the example data shipped with gggenomes
data(package="gggenomes")

gggenomes(emale_genes, emale_seqs, emale_tirs, emale_ava) %>%
  add_feats(ngaros=emale_ngaros, gc=emale_gc) %>%
  add_sublinks(emale_prot_ava) %>%
  flip_by_links() +
  geom_feat(position="identity", size=6) +
  geom_seq() +
  geom_link(data=links(2)) +
  geom_bin_label() +
  geom_gene(aes(fill=name)) +
  geom_gene_tag(aes(label=name), nudge_y=0.1, check_overlap = TRUE) +
  geom_feat(data=feats(ngaros), alpha=.3, size=10, position="identity") +
  geom_feat_note(aes(label="Ngaro-transposon"), feats(ngaros),
      nudge_y=.1, vjust=0) +
  geom_ribbon(aes(x=(x+xend)/2, ymax=y+.24, ymin=y+.38-(.4*score),
      group=seq_id, linetype="GC-content"), feats(gc),
      fill="lavenderblush4", position=position_nudge(y=-.1)) +
  scale_fill_brewer("Genes", palette="Dark2", na.value="cornsilk3")

## Error in set_class(., "gggenomes_layout", "prepend") : 
##  3 arguments passed to 'class<-' which requires 2

This error is minimally reproduced as follows:

gggenomes(emale_genes, emale_seqs, emale_tirs, emale_ava)

## Error in set_class(., "gggenomes_layout", "prepend") : 
##  3 arguments passed to 'class<-' which requires 2

My session:

sessionInfo()
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 33 (Workstation Edition)

## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.17.so

## locale:
##  [1] LC_CTYPE=es_UY.UTF-8       LC_NUMERIC=C               LC_TIME=es_UY.UTF-8        LC_COLLATE=es_UY.UTF-8     LC_MONETARY=es_UY.UTF-8   
##  [6] LC_MESSAGES=es_UY.UTF-8    LC_PAPER=es_UY.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=es_UY.UTF-8 LC_IDENTIFICATION=C       

## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

## other attached packages:
## [1] profvis_0.3.7        magrittr_2.0.1       S4Vectors_0.28.1     BiocGenerics_0.36.1  gggenomes_0.9.3.9000 snakecase_0.11.0    
##  [7] jsonlite_1.7.2       tibble_3.1.4         thacklr_0.0.0.9000   tidyr_1.1.3          stringr_1.4.0        readr_2.0.1         
## [13] purrr_0.3.4          gggenes_0.4.1        ggplot2_3.3.5        dplyr_1.0.7         

## loaded via a namespace (and not attached):
## [1] pillar_1.6.2      compiler_4.0.1    tools_4.0.1       digest_0.6.27     ggfittext_0.9.1   lifecycle_1.0.0   gtable_0.3.0     
##  [8] pkgconfig_2.0.3   rlang_0.4.11      cli_3.0.1         rstudioapi_0.13   DBI_1.1.1         yaml_2.2.1        fastmap_1.1.0    
## [15] withr_2.4.2       htmlwidgets_1.5.3 generics_0.1.0    vctrs_0.3.8       hms_1.1.0         grid_4.0.1        tidyselect_1.1.1 
## [22] glue_1.4.2        R6_2.5.1          fansi_0.5.0       tzdb_0.1.2        htmltools_0.5.2   scales_1.1.1      ellipsis_0.3.2   
## [29] assertthat_0.2.1  colorspace_2.0-2  utf8_1.2.2        stringi_1.7.4     munsell_0.5.0     crayon_1.4.1

Great package btw, thanks!

write a help-topic on gggenomes table design and coordinate system

Here a somewhat random thought on coordinates:

There's a bunch of different standards for specifing feature coordinates along biological sequences, i.e. start-end or start-length, 0-offset or 1-offset, end-open or end-included.

https://angus.readthedocs.io/en/2014/_static/2014-lecture4-genomic-intervals.pdf
https://www.biostars.org/p/84686/
https://www.biostars.org/p/6373/
https://books.google.com/books?id=FRERCgAAQBAJ&lpg=PT412&ots=GKmlzvAYey&dq=gff%20zero-width&pg=PT412#v=onepage&q=gff%20zero-width&f=false

1[start,end]   1 2 3 4   length: end-start+1
               | | | | 
Sequence       A T G C
              | | | | |
0[start,end)  0 1 2 3 4  length: end-start

Strand:
all but blast give coords on forward strand + specify strand by +/-

1[start,end] GTF/GFF, SAM, VCF, Wiggle, GenomicRanges, GenBank/EMBL Feature Table, (Blast)

start-end - start <= end !!
1-offset
end-inclusive
zero-length to right of indicated base
strand +/-
Blast feature are not annotated on forward-strand only
http://gmod.org/wiki/GFF3
Coluns 4 & 5: "start" and "end"
The start and end of the feature, in 1-based integer coordinates, relative to
the landmark given in column 1. Start is always less than or equal to end.
For zero-length features, such as insertion sites, start equals end and the
implied site is to the right of the indicated base in the direction of the
landmark. These fields are required.

0[start,end) BED, BAM, BCF

0[start,end)
start-end
0-offset
end-exclusive [ start,end )
strand +/-

harmonize different coordinate systems - gff 1[start,end] vs bed 0[start,end)

gggenomes tables use 1[start,end]-based coordinates with start < end and strand given by "+/-", but some annotation files such as bed are 0[start,end)-based. Need to take care of that on import.

reverse_trans() for y-axis has odd side-effects

Currently I use reverse_trans on the y-axis to get an intuitive order of genomes with the first at the top. This, however, has odd side-effects on y-based operations. For example, one need to use negative values to move thing upward, or use y-limits vom bigger to smaller to keep the reversed scale.

I think I do definitely want to keep the order of the bins with the first bin at the top, so just doing bin1: y=1, ... isn't an option. It might, however, work if I do an inverse mapping: bin_n: y=1, bin_n-1: y=2, bin_1: y=n. It's easy to change in the layout. Seems to mostly work. It does, however, break wrapping (which I would just have to reimplement for handed inverted coords).

This change, would, however, also introduce another inconsistency: currently pick(1) can be read as pick the first bin or the bin at y=1. If I do an inverted mapping, pick(1) should probably still reference the first bin, but it would then no longer match y...

Support for SNP data - read_vcf, geom_snp

Add functionality to visualize GWAS-style data etc. Would need to find good example data first, though.

thackl / gggenomes Goto Github PK

gggenomes's Introduction

gggenomes

A grammar of graphics for comparative genomics

A realistic use case comparing six viral genomes

Motivation & concept

Inspiration

Installation

gggenomes's People

Stargazers

Watchers

Forkers

gggenomes's Issues

A tibble: 671 x 12

A tibble: 0 x 1

... with 1 variable: file_id

1[start,end] GTF/GFF, SAM, VCF, Wiggle, GenomicRanges, GenBank/EMBL Feature Table, (Blast)

0[start,end) BED, BAM, BCF

Recommend Projects

Recommend Topics

Recommend Org