thackl / gggenomes Goto Github PK
View Code? Open in Web Editor NEWA grammar of graphics for comparative genomics
Home Page: https://thackl.github.io/gggenomes/
License: Other
A grammar of graphics for comparative genomics
Home Page: https://thackl.github.io/gggenomes/
License: Other
Right now, filtering can easily be done at runtime using ...
in pull_*
functions. But it might make sense to provide this also as an external function. In particular in connection with computations, such as locate, this seems intuitive
pick()
is for genomes what dplyr::select()
is for columns.
put()
(or a similar name) could become an analog to dplyr::relocate()
for bins and seqs.
Hi!
Thank you for creating this excellent tool.
I have a question about drawing intron using read_gff3 and geom_gene.
I am interested in the gene structure of corals and tried to draw them from the attached gff file (This is data from a public database).
I know that the structure is as follows based on the database.
read_gff3("~/Desktop/aten_0.1.m1.802.gff") %>%
gggenomes() + geom_gene(aes(fill=type, group=type), position = "pile")
However, the mRNA structure is different from what I expected.
I noticed that the order of the numeric in vectors in the introns column was wrong, so I did some sorting.
read_gff3("~/Desktop/aten_0.1.m1.802.gff") %>%
mutate(introns = map(introns, sort)) %>%
gggenomes() + geom_gene(aes(fill=type, group=type), position = "pile")
However, the structure is still different.
I really appreciate any help you can provide.
Keigo
Could be locate
. Same semantics as focus, but adds a new track instead with loci as features.
https://thackl.github.io/gggenomes/reference/gggenomes.html
After reading the second file apparently successful, it just says "Error:" but without any error message...
Hi, I'm having an issue with the package with examples failing:
library(gggenomes)
# to inspect the example data shipped with gggenomes
data(package="gggenomes")
gggenomes(emale_genes, emale_seqs, emale_tirs, emale_ava) %>%
add_feats(ngaros=emale_ngaros, gc=emale_gc) %>%
add_sublinks(emale_prot_ava) %>%
flip_by_links() +
geom_feat(position="identity", size=6) +
geom_seq() +
geom_link(data=links(2)) +
geom_bin_label() +
geom_gene(aes(fill=name)) +
geom_gene_tag(aes(label=name), nudge_y=0.1, check_overlap = TRUE) +
geom_feat(data=feats(ngaros), alpha=.3, size=10, position="identity") +
geom_feat_note(aes(label="Ngaro-transposon"), feats(ngaros),
nudge_y=.1, vjust=0) +
geom_ribbon(aes(x=(x+xend)/2, ymax=y+.24, ymin=y+.38-(.4*score),
group=seq_id, linetype="GC-content"), feats(gc),
fill="lavenderblush4", position=position_nudge(y=-.1)) +
scale_fill_brewer("Genes", palette="Dark2", na.value="cornsilk3")
## Error in set_class(., "gggenomes_layout", "prepend") :
## 3 arguments passed to 'class<-' which requires 2
This error is minimally reproduced as follows:
gggenomes(emale_genes, emale_seqs, emale_tirs, emale_ava)
## Error in set_class(., "gggenomes_layout", "prepend") :
## 3 arguments passed to 'class<-' which requires 2
My session:
sessionInfo()
## R version 4.0.1 (2020-06-06)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Fedora 33 (Workstation Edition)
## Matrix products: default
## BLAS/LAPACK: /usr/lib64/libopenblas-r0.3.17.so
## locale:
## [1] LC_CTYPE=es_UY.UTF-8 LC_NUMERIC=C LC_TIME=es_UY.UTF-8 LC_COLLATE=es_UY.UTF-8 LC_MONETARY=es_UY.UTF-8
## [6] LC_MESSAGES=es_UY.UTF-8 LC_PAPER=es_UY.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_UY.UTF-8 LC_IDENTIFICATION=C
## attached base packages:
## [1] parallel stats4 stats graphics grDevices utils datasets methods base
## other attached packages:
## [1] profvis_0.3.7 magrittr_2.0.1 S4Vectors_0.28.1 BiocGenerics_0.36.1 gggenomes_0.9.3.9000 snakecase_0.11.0
## [7] jsonlite_1.7.2 tibble_3.1.4 thacklr_0.0.0.9000 tidyr_1.1.3 stringr_1.4.0 readr_2.0.1
## [13] purrr_0.3.4 gggenes_0.4.1 ggplot2_3.3.5 dplyr_1.0.7
## loaded via a namespace (and not attached):
## [1] pillar_1.6.2 compiler_4.0.1 tools_4.0.1 digest_0.6.27 ggfittext_0.9.1 lifecycle_1.0.0 gtable_0.3.0
## [8] pkgconfig_2.0.3 rlang_0.4.11 cli_3.0.1 rstudioapi_0.13 DBI_1.1.1 yaml_2.2.1 fastmap_1.1.0
## [15] withr_2.4.2 htmlwidgets_1.5.3 generics_0.1.0 vctrs_0.3.8 hms_1.1.0 grid_4.0.1 tidyselect_1.1.1
## [22] glue_1.4.2 R6_2.5.1 fansi_0.5.0 tzdb_0.1.2 htmltools_0.5.2 scales_1.1.1 ellipsis_0.3.2
## [29] assertthat_0.2.1 colorspace_2.0-2 utf8_1.2.2 stringi_1.7.4 munsell_0.5.0 crayon_1.4.1
Great package btw, thanks!
read_seq_len()
and read_gbk()
currently are wrappers around non-R based parsers. Add those to /exec. Add warning that they likely only work on unix-like OS.
Currently I use reverse_trans
on the y-axis to get an intuitive order of genomes with the first at the top. This, however, has odd side-effects on y-based operations. For example, one need to use negative values to move thing upward, or use y-limits vom bigger to smaller to keep the reversed scale.
I think I do definitely want to keep the order of the bins with the first bin at the top, so just doing bin1: y=1, ... isn't an option. It might, however, work if I do an inverse mapping: bin_n: y=1, bin_n-1: y=2, bin_1: y=n. It's easy to change in the layout. Seems to mostly work. It does, however, break wrapping (which I would just have to reimplement for handed inverted coords).
This change, would, however, also introduce another inconsistency: currently pick(1)
can be read as pick the first bin or the bin at y=1. If I do an inverted mapping, pick(1)
should probably still reference the first bin, but it would then no longer match y...
Make sure all geom's have proper documentation and examples
read_seqs(".fna/.gbk/.gff")
and read_gbk()
use non-R parsers to preprocess the data before reading them into R. This is just done by turning the file argument provided to the read_*
function ("foo.fna") into a connection ala pipe("/path/to/seq-len foo.fna")
. This should work on linux systems, but probably not on Windows, and I'm not sure about MacOS...
The examples for 'pick_by_tree` are broken in the documentation https://thackl.github.io/gggenomes/reference/pick.html. They do work in local builds. There seems to be an issue with reading either the tree file or with converting it to a ggtree object. @iimog, maybe you can take a look...
sha-1bfbdfe
Use fancy example from gggenomes docs
pkgdown throws errors if documented functions are "missing topics" when building the reference index. Right now we have all functions in the index, also some just have doc and relevance for internal usage, and ideally should not be part of the index.
Apparently @keywords internal
can be used to hide those r-lib/pkgdown#1378 (comment)
Here a somewhat random thought on coordinates:
There's a bunch of different standards for specifing feature coordinates along biological sequences, i.e. start-end or start-length, 0-offset or 1-offset, end-open or end-included.
https://angus.readthedocs.io/en/2014/_static/2014-lecture4-genomic-intervals.pdf
https://www.biostars.org/p/84686/
https://www.biostars.org/p/6373/
https://books.google.com/books?id=FRERCgAAQBAJ&lpg=PT412&ots=GKmlzvAYey&dq=gff%20zero-width&pg=PT412#v=onepage&q=gff%20zero-width&f=false
1[start,end] 1 2 3 4 length: end-start+1
| | | |
Sequence A T G C
| | | | |
0[start,end) 0 1 2 3 4 length: end-start
Strand:
all but blast give coords on forward strand + specify strand by +/-
Describe all add_feats/links/subfeats/sublinks/clusters in one, improved doc with examples
Add functionality to visualize methylation patterns or ChIPseq results. This needs good example data first.
By default links are only drawn between adjacent genomes. In most cases, computing all links is slow and drawing all links is slow and usually way too messy to be informative.
adjacent_only=TRUE
is implemented as flag for layout_links
, however, currently cannot be modified because gggenomes
only passes through arguments to layout_seqs
. Internally, there is already a args_links
parameter, but it is not yet used. Related to #69. Will need to modify gggenomes API for layout args to make this work properly.
Also, geom_link()
should also get an adjacent_only
flag, which if set to FALSE should throw an error if the layout wasn't computed with adjacent_only=FALSE
in the first place.
There still a lot of stuff in flux. Would be very useful to have some sort of idea which functions are stable/maturing/experimental, ...
https://lifecycle.r-lib.org/articles/lifecycle.html
include with r lifecycle::badge("deprecated")
, see https://github.com/tidyverse/tidyr/blob/master/R/nest.R
# something along the lines of:
extract(seq_desc, into = c("emale_type", "is_typespecies"), "=(\\S+) \\S+=(\\S+)", remove=FALSE, convert=TRUE)
sha-1bfbdfe
gggenomes(genes=emale_genes) %>%
add_sublinks(emale_prot_ava) %>%
flip_nicely() +
geom_seq() + geom_gene() + geom_link()
Says "All bins appear to be flipped nicely" - which they clearly aren't. Might have something to do with how "strand" information is processed when transforming sublinks to links...
Currently I export scale_x_continuous(labels=labels_bp())
to get a nice x-axis. It's convenient and works without the warning that I get when I just manually add an x-scale during plot construction. Problem is, the exported function overwrites scale_x_continuous
also for ggplots... And that's definitely not a behavior I want.
Maybe it would work if just not exported the function...
Hello again @thackl
I was wondering how the emale_prot_ava file was generated. I created a blastp file with outfmt 6 and read that in using read_blast() as so:
FR_prot_ava <-read_blast("genes.blastp.tsv")
which looks like this:
FR_prot_ava
qaccver saccver pident length mismatch gapopen qstart qend sstart send evalue bitscore
1 contig_26_pilon_925301_962441_1 contig_1560_pilon_1 100 147 0 0 1 147 5 151 9.58e-112 306
2 contig_26_pilon_925301_962441_1 contig_26_pilon_925301_962441_1 100 147 0 0 1 147 1 147 1.21e-111 305
3 contig_26_pilon_925301_962441_1 contig_1560_pilon_26 26.2 80 45 3 34 100 141 219 3.8 e- 1 21.9
I seem to missing the file_id feat_id feat_id2 columns. Any tips would be great!
Thanks,
Andy
Hi @thackl,
Thank you so much for developing this package! There is a HUGE need for open-source solutions to visualize genomic loci I am super excited to leverage this tool :)
I was trying your tutorial and unfortunately, I am stuck at the first step:
# parse sequence length and some metadata from fasta file
emale_seqs <- read_fai("emales.fna") %>%
extract(seq_desc, into = c("emale_type", "is_typespecies"), "=(\\S+) \\S+=(\\S+)",
remove=F, convert=T) %>%
arrange(emale_type, length)
I tried following your "raw data" link but the link is broken. I was however able to find emales.fna
in your repo here: data-raw/emales/emales.fna
after unzipping emales.tgz
. Is this the correct file to start the tutorial?
When I use the file data-raw/emales/emales.fna
this is the error I get:
> # parse sequence length and some metadata from fasta file
> emale_seqs <- read_fai("data-raw/emales/emales.fna") %>%
extract(seq_desc, into = c("emale_type", "is_typespecies"), "=(\\S+) \\S+=(\\S+)",
remove=F, convert=T) %>%
arrange(emale_type, length)
Error: arrange() failed at implicit mutate() step.
* Problem with `mutate()` input `..2`.
x Input `..2` must be a vector, not a primitive function.
ℹ Input `..2` is `length`.
Run `rlang::last_error()` to see where the error occurred.
In addition: Warning messages:
1: Unnamed `col_types` should have the same length as `col_names`. Using smaller of the two.
2: Too many col_names, ignoring extra ones
Thanks again for making this open-source package!
Cheers,
Matt
Came across this when working on #64 with Augustus example data. Related to Gaius-Augustus/Augustus#74.
Augustus does not set CDS and exon IDs correctly - one ID per line/range instead of one ID for all lines belonging to the same connected CDS/exons. This messes with the plots, because I assume that every CDS with a unique ID is a complete CDS and thus should have its own mRNA...
Not sure how and to what extent to address this within gggenomes. Some fixes are described here Gaius-Augustus/Augustus#74
f0 <- read_gff3("aten_0.1.m1.802.gff3")
gggenomes(f0) + geom_gene(position="pile")
This is odd - in the function reference produced with pkgdown the geom_gene() examples with "pile"ed up isoforms are broken. On my local machine, however, they work just fine...
@iimog can you run that example on your end and let me know whether it works for you?
It's more often than not irritating to have features "hover" above the sequence, instead of just being plotted right on top. That happens because pile by default has an offset from the 0-line
Currently, features are grouped by type and feat_id. However, in some gff, ID (which is harmonized to feat_id), is missing. That leads to feat_id=NA, and, thus, a single big group with all features...
Needs some form of fallback ID for is.na(ID)
cases.
sha-1bfbdfe
f0 <- tibble(
seq_id="a",
start=1,
end = 3,
introns=list(c(2,2)),
)
gggenomes(genes=f0) +
geom_gene(position="pile") + theme_bw()
Currently produces a 0-length intron between 2 and 3. Should, however produce a 1bp intron covering pos 2.
I've added smart read_* functions (read_seqs/read_feats/read_links, ...) with automatic detection of input formats. These functions determine format-specific read_* functions such as read_alitv
based on file extensions and the given context (seqs/feats/links/...). For example read_seqs(.gbk)
will interpret .gbk in context "seqs" as format "seq_len" and call read_seq_len(.gbk)
to parse sequence length information, while read_feats(.gbk)
will interpret .gbk in context "feats" as format "gbk" and call read_gbk(.gbk)
to read in genes, etc.
read_alitv currently returns a list of tables already representing the three different contexts the .json file can be read in. It should be quite straight-forward to make alitv files readable via read_seqs/feats/links
.
Have a look at gggenomes_global$file_formats
(global.R) which holds a list of file extensions, formats and contexts known to gggenomes, and used by the smart read functions. You can just add context-specific alitv-formats to seqs
and feats
. And you will have to add a new links
context. So far, that wasn't necessary because for all other formats "links" and "feats" context had the same parser (blast tab files for example). Note, the name of the format should be identical to the suffix of the read_
function. So you might have to add the format alitv_seqs, alitv_genes and alitv_links (or whatever is appropriate), and could just add quick function aliases ala read_alitv_seqs <- function(...) read_alitv(...)[["seqs"]]
to make read_seqs(.json)
work.
If you think, .json is not specific to alitv, you can add it to the format "ambiguous". Every extension in this group requires the user to explicitly provide the format...
Oh, and you should put all read_alitv functions etc. in their own file.
Also, this is all still quite experimental, so let me know if you think it is bull or similar ;)
Hi.
Thank you very much for developing useful software.
gggenomes is currently in alpha release (0.0.0.9000) but is it possible to cite it in a paper using the current package?
Thomas Hackl and Markus J. Ankenbrand (2020). gggenomes: A Grammar of Graphics for Comparative Genomics. R package version 0.0.0.9000. https://github.com/thackl/gggenomes
gggenomes make my bacterial genome data very interpretable using the example code.
It is very cool that the protein domain label can be shown in comparative genomics results!
If you do not recommend using it yet, I would consider using another software.
geom_gene() parses the type column and specifically processes "mRNA" and "CDS" to handle spliced genes. Not sure, though, how it handles geneish features of other types.
I think they should by default be displayed as non-coding genes, in the same way that a mRNA is shown (this might even include the possibility of splicing, as known for some tRNAs)
When using pick()
on a plot, no updated plot is generated, and nothing is returned. But there aren't any errors either...
Hi Thomas,
Thank you for letting me know about gggenomes for dealing with overlapping gene regions. I am having some trouble getting the gene labels to line up with the offset gene regions. I attached an image, and I have my code below. Is there a way to deal with this? It will be perfect if I can just get them lined up!
Thanks for your help!
Julia
library(gggenomes)
colourCount = 30
getPalette = colorRampPalette(brewer.pal(9, "Set1"))
genesDF<-read.table("sars-cov-2_Annotations_gggenomes.txt",sep="\t",header=TRUE)
gggenomes(genes=genesDF) +
geom_gene(aes(fill=gene), position="pile",
size = 4, shape = c(0), rna_size = 2, intron_shape = 0, stroke=0) +
geom_gene_tag(aes(label=gene, check_overlap = TRUE)) +
scale_fill_manual(values = getPalette(colourCount)) +
theme(legend.position = "none")
ggsave("gggenome.png", width=8, height=2)
Currently, pick_bins()
works on bins while pick()
works on seqs. But playing around I noticed that the most common case and the more natural API would be to have pick()
work on bins and provide an additional method for pick_seqs()
.
Hello, first of all I would like to thank you for this package (I find a lack of packages that allow to really plot genomic data in synteny).
I am writing this post to seek some help in the particular use of gggenomes. Indeed in my data I need to realize a rather simple but particular figure.
Here are my data:
> Segments
Seq_id Name Start End Strand
1 S1 G1 300 330 -1
2 S1 G2 40 65 1
3 S2 G3 1500 1530 -1
4 S3 A10 170 230 -1
5 S3 G5 600 630 1
6 S3 G6 2600 2670 1
7 Virus G1 20 50 1
8 Virus G2 70 110 1
9 Virus G3 250 270 -1
10 Virus G4 400 500 -1
11 Virus G5 800 830 1
12 Virus G6 900 950 1
With this dataframe I managed with the ggenes package to create the following plot :
And then I would need to add synteny links between the Sn scaffolds and their corresponding Name in the Virus scaffold and get something like this :
I thing ggenomes cand do that but I did not find a way to do it this way ... Can you help me to better understand the code please ? Thanks a lot for your help.
Here is the dput table format :
structure(list(Seq_id = structure(c(1L, 1L, 2L, 3L, 3L, 3L, 4L,
4L, 4L, 4L, 4L, 4L), .Label = c("S1", "S2", "S3", "Virus"), class = "factor"),
Name = structure(c(2L, 3L, 4L, 1L, 6L, 7L, 2L, 3L, 4L, 5L,
6L, 7L), .Label = c("A10", "G1", "G2", "G3", "G4", "G5",
"G6"), class = "factor"), Start = c(300L, 40L, 1500L, 170L,
600L, 2600L, 20L, 70L, 250L, 400L, 800L, 900L), End = c(330L,
65L, 1530L, 230L, 630L, 2670L, 50L, 110L, 270L, 500L, 830L,
950L), Strand = c(-1L, 1L, -1L, -1L, 1L, 1L, 1L, 1L, -1L,
-1L, 1L, 1L)), class = "data.frame", row.names = c(NA, -12L
))
Support read mapping visualization. Simple viz should be straight-forward ala geom_feat(position="pile")
with some sensible defaults for large numbers of features. For reading .bam could borrow from bioconductor.
Might make sense to add some sort of cigar support to visualize errors - one issue would be, though that mismatches aren't explicitly encoded in bam (unless -X). Would need reference sequence to infer...
Down the line, would be cool to handle complex SVs from long reads - with supplementary alignments in style similar to linked exons.
Hi @thackl
I am very interested in trying out your package (looks amazing!). However i am having trouble importing data:
read_seqs(ex("emales/emales.fna"), parse_desc=FALSE)
Reading 'fasta' withread_seq_len()
:
Do you know why that is?
Thanks,
Andy
read_gff3()
is based on readr::read_tsv()
. I use its comment="#"
to skip comment lines. However, read_tsv()
treats every occurrence of a "#" as the start of a comment, not just at the start of a line, and therefore, trims lines containing "#"...
gggenomes tables use 1[start,end]
-based coordinates with start < end
and strand given by "+/-", but some annotation files such as bed are 0[start,end)
-based. Need to take care of that on import.
Add functionality to visualize GWAS-style data etc. Would need to find good example data first, though.
Thanks for writing gggenomes, it's great. There is a bug in focus():
Line 113 in a3b92d8
I think this should be has_name(loci, "locus_id")
. Without it, I am getting the following error when using focus(.loci = loci)
with a tibble loci
with bin_id, seq_id, start and end columns:
Error : argument "name" is missing, with no default
Hi, Thanks for your amazing package. I am trying to describe an inversion scénario between 2 genomes. Here is my code:
# a minimal seq track
s0 <- tibble(
seq_id = c("M", "S1", "P"),
length = c(14011714, 16938810, 16938810)
)
# a minimal gene track
g0 <- tibble(
seq_id = c("M", "M", "M", "M", "M", "M",
"S1", "S1", "S1", "S1", "S1", "S1",
"P", "P", "P", "P", "P", "P"),
start = c(1, 1957564, 7345956, 7944221, 10852458, 12757211,
1, 1957564, 10253430, 10851695, 10852458, 12757211,
1, 7612881, 12493219, 15053107, 15072904, 15671252),
end = c(1830774, 6805033, 7898617, 10851694, 12679667, 14011714,
1830774, 6805033, 7345956, 10299033, 12679667, 14011714,
1832933, 1879467, 7830998, 12534142, 15670067, 16938810)
)
# a simple link track
l0 <- tibble(
seq_id = c("M", "M", "M", "M", "M"),
start = c(1, 1957564, 7345956, 10852458, 12757211),
end = c(1830774, 6805033, 10851694, 12679667, 14011714),
seq_id2 = c("S1", "S1", "S1", "S1", "S1"),
start2 = c(1, 1957564, 10851695, 10852458, 12757211),
end2 = c(1830774, 6805033, 7345956, 12679667, 14011714),
)
p <- gggenomes(genes=g0, seqs=s0, links=l0)
p +
geom_seq() + # draw contig/chromosome lines
geom_seq_label() + # label each sequence
geom_gene() + # draw genes as arrow
geom_link() # draw some connections between syntenic regions
Sorry for trivial questions but I am trying to understand how to :
Add a distinct color for each track and the corresponding ribbon.
The "M" is overlapping with a ribbon, how can I put it on the side instead of under the 1st track ?
How can I put ribbons between S1
and P
?
Put a legend for each track (and the corresponding colors)
Thanks for your help !
For the new geom_wiggle() function I am using some stats methods from Hmisc
(which is also used by some ggplot2 stat summary functions). Could you add it to the CI @iimog?
Use tracks to show expression over different conditions/timepoints. Show, for example, viral infection cycle - link to genomic organization
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.