oist / genomicbreaks Goto Github PK

View Code? Open in Web Editor NEW

4.0 3.0 2.0 17.26 MB

Pairwise Alignment Breakpoint Analysis

Home Page: https://oist.github.io/GenomicBreaks/

License: Other

Dockerfile 0.13% R 94.88% C++ 4.99%

comparative-genomics r

genomicbreaks's Introduction

GenomicBreaks

Purpose

GenomicBreaks is a R package using Bioconductor libraries to analyse pairwise alignments of whole genomes in which the gene order has been scrambled by evolution, like in the picture below that represents the comparison of homologous chromosomes in two distantly related molds, N. crassa (chrIII) and P. comata (chr7).

This package is especially designed to parse and process the alignment files produced by the our pairwise genome alignment pipeline, but should be capable to import output of other pipelines as well.

Installation

Install the package.

If you have Bioconductor and the remotes already installed, the following should work.

Rscript -e 'remotes::install_github("oist/GenomicBreaks", repos=BiocManager::repositories())'

Add dependencies=TRUE if you would like to install the packages needed to build the vignettes.

How to create a Singularity container with GenomicBreaks installed.

See the Singularity reciepe file

singularity build --fakeroot GenomicBreaks.sif Singularity.def

GenomicBreaks in brief:

A pairwise alignment of two genomes is loaded in GBreaks objects wrapping the GRanges class. Here is an example:

GBreaks object with 505 ranges and 2 metadata columns:
        seqnames      ranges strand |     score                   query
           <Rle>   <IRanges>  <Rle> | <numeric>               <GRanges>
    [1]     chrI  5860-10010      + |     10609  NC_047487.1:8723-12716
    [2]     chrI 11157-11782      + |      1253 NC_047487.1:13035-13680
    [3]     chrI 25371-26528      + |      3651 NC_047487.1:15380-16537
    [4]     chrI 26849-29699      + |      6130 NC_047487.1:17263-20175
    [5]     chrI 29937-30607      + |       767 NC_047487.1:20203-20866
    ...      ...         ...    ... .       ...                     ...
  [501]     chrM 67039-67870      + |      1441   NC_018044.1:7758-8439
  [502]     chrM 68205-68580      + |       830   NC_018044.1:8783-9180
  [503]     chrM 69178-76168      + |     14528  NC_018044.1:9650-16261
  [504]     chrM 77356-80022      + |      8066 NC_018044.1:53206-55865
  [505]     chrM 80919-85779      + |      6712 NC_018044.1:57885-61592
  -------
  seqinfo: 17 sequences (1 circular) from sacCer3 genome

See “Get started” on https://oist.github.io/GenomicBreaks for further details.

genomicbreaks's People

Contributors

Stargazers

Watchers

Forkers

yuzhenpeng johannesnicolaus

genomicbreaks's Issues

A function to return "double gaps"

Pilot below. The final function should better workaround zero-width ranges. See also Issue #12

filterMappedUnaligned <- function(gb) {
  colinearRegions <- filterColinearRegions(flagColinearAlignments(gb), rename = FALSE)
  # Turning the sequence of [(TRUE)n, FALSE]n into indices.
  idx <- c(0, head(cumsum(!colinearRegions$colinear), -1))
  gbl <- split(colinearRegions, idx)
  unalMap <- endoapply(gbl, \(gb) {
    # Check strand
    onMinus <- all(strand(gb) == "-")
    # Need to subtract 1 because some ranges are adjacent (no gap)
    GBreaks(target = cleanGaps(gb      - 1),
            query  = cleanGaps(gb$query -1) |> sort(decreasing = onMinus))
  }) |> unlist()
  names(unalMap) <- NULL
  unalMap
}

plotApairOfChrs(exampleTranslocation)

For instance the output of plotApairOfChrs(exampleTranslocation) (below) should show the chrA:201-300 target range.

> exampleTranslocation
GBreaks object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |        query
         <Rle> <IRanges>  <Rle> |    <GRanges>
  [1]     chrA   100-200      + | chrB:100-200
  [2]     chrA   201-300      + | chrC:201-300
  [3]     chrA   301-400      + | chrB:301-400
  -------
  seqinfo: 1 sequence from an unspecified genome

Add a new function to fill inversions.

Is that useful enough to be added ?

fillInversions <- function(gb) {
  Invs <- which(flagInversions(gb)$inv) + 1
  strand(gb)[Invs] <- ifelse(strand(gb)[Invs] == "+", "-", "+")
  coalesce_contigs(gb)
}

`flagColinearAlignments` fails to detect colinearity of strandless ranges

> b
GRanges object with 3 ranges and 1 metadata column:
      seqnames    ranges strand |      query
         <Rle> <IRanges>  <Rle> |  <GRanges>
  [1]      XSR   101-180      * | S1:101-200
  [2]      XSR   201-300      * | S1:201-300
  [3]      XSR   320-400      * | S1:301-400
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> flagColinearAlignments(b)
GRanges object with 3 ranges and 2 metadata columns:
      seqnames    ranges strand |      query colinear
         <Rle> <IRanges>  <Rle> |  <GRanges>    <Rle>
  [1]      XSR   101-180      * | S1:101-200    FALSE
  [2]      XSR   201-300      * | S1:201-300    FALSE
  [3]      XSR   320-400      * | S1:301-400    FALSE
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> coalesce_contigs(b)
GBreaks object with 1 range and 2 metadata columns:
      seqnames    ranges strand |      query     score
         <Rle> <IRanges>  <Rle> |  <GRanges> <integer>
  [1]      XSR   101-400      * | S1:101-400       300
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Add functions to compute percent identity etc.

bridgeRegions crashes when there are no colinear regions.

bridgeRegions() crashes with a cryptic error message when there are no colinear regions.

> bridgeRegions(exampleInsertion)
 Error in stop_if_wrong_length("'seqnames'", ans_len) : 
'seqnames' must have the length of the object to construct (1) or length 1

A typical object with no colinear regions is an object that was already coalesced. In that case, it would be better to either stop with a clear error message, or to return an empty GBreaks object.

`flagInversions` misclassifies non-inversions

> flagInversions(inv4)
GRanges object with 3 ranges and 2 metadata columns:
      seqnames    ranges strand |      query   inv
         <Rle> <IRanges>  <Rle> |  <GRanges> <Rle>
  [1]      XSR   101-180      - | S1:101-200  TRUE
  [2]      XSR 1201-1300      + | S1:201-300 FALSE
  [3]      XSR   320-400      - | S1:301-400 FALSE
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

isReduced()

identical(granges(gb), reduce(gb)|>sort(i=T))

`flagColinearAlignments` does not detect colinearity on strandless objects.

> exampleColinear |> plyranges::mutate(strand = "*") |> flagColinearAlignments()
GBreaks object with 2 ranges and 2 metadata columns:
      seqnames    ranges strand |        query colinear
         <Rle> <IRanges>  <Rle> |    <GRanges>    <Rle>
  [1]     chrA   100-200      * | chrB:100-200    FALSE
  [2]     chrA   201-300      * | chrB:201-300    FALSE
  -------
  seqinfo: 1 sequence from an unspecified genome

Implement ρ and λ measures of synteny from Housworth and Postlethwait 2002

https://doi.org/10.1093/genetics/162.1.441

`cleanGaps`: support colinear `GBreaks` objects and zero-width ranges.

Some GBreak objects contain adjacent ranges, that return nothing when passed through cleanGaps.

> gr <- GRanges(c("a:1-10:+", "a:11-20:+"))
> gr
GRanges object with 2 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        a      1-10      +
  [2]        a     11-20      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> cleanGaps(gr)
GRanges object with 0 ranges and 0 metadata columns:
   seqnames    ranges strand
      <Rle> <IRanges>  <Rle>
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

The function needs to be modified so that it returns a zero-width range. Here is an idea:

> cleanGaps(gr - 1)
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        a     10-11      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> cleanGaps(gr - 1) -1
GRanges object with 1 range and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]        a     11-10      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> width(cleanGaps(gr - 1) -1)
[1] 0

Once this is done, it should be possible to pass a GBreaks object containing only colinear ranges, and to turn it into colinear double-gap regions.

guessSeqLengths should return a vector, not an array

.guessSeqLengths <- function(gb) {
   zz <- guessSeqLengths(gb)
   zzz <- as.vector(zz)
   names(zzz) <- dimnames(zz)
   zzz
 }

Some functions, like tileGenome, strictly request vector input…

Add a converter from _CNEr_'s `Axt` and `GPairs` format to `GBreaks`

Something like:

GPairs2GBreaks <- function(GPairs, tGenome = NULL, qGenome = NULL, fix = TRUE) {
  if(isTRUE(fix))
    GPairs <- fixCoordinates(GPairs)
  target <- granges(first(GPairs))
  if(! is.null(tGenome))
    target <- GRanges(target, seqinfo = seqinfo(tGenome))
  query <- granges(second(GPairs))
  if(! is.null(qGenome))
    query <- GRanges(query, seqinfo = seqinfo(qGenome))
  gb <- GBreaks(target = target, query = query)
  ### Remove this once Issue 9 is resolved
  strand(gb) <- strand(gb$query)
  strand(gb$query) <- "*"
  ###
  mcols(gb) <- cbind(mcols(gb), mcols(GPairs))
  sort(gb, ignore.strand = TRUE)
}

Add L50 to summaries related to widths

L50width <- function(gb) weighted.mean(width(gb), as.numeric(width(gb))) # as.num to avoid integer overflow
summaryWidth <- function(gb) {
  s <- summary(width(gb))
  s["L50"] <- L50width(gb)
  s
}

A `stringDist()` method for `GBreaks`

Proof of concept:

stringDist_GBreaks <- function(x, method = "levenshtein", ignoreCase = FALSE, diag = FALSE, upper = FALSE, ...) {
  targetS <- getSeq(x)
  queryS  <- getSeq(x$query)
  seqs <- Pairs(targetS, queryS)
  sapply(seqs, \(pair){
    stringDist(c(first(pair), second(pair)))
  })
}

Implement a r distance index

Tissue-specificity index called τ in https://academic.oup.com/bioinformatics/article/21/5/650/220059 but used as synteny index called r in https://royalsocietypublishing.org/doi/full/10.1098/rspb.2021.2183.

flagInversions(GBreaks()) fails

Dockerfile fails to build

The current version of the Dockerfile fails to build (below).

I have written an alternative Dockerfile that goes through the process of installing the appropriate dependencies via apt, CRAN, and so on manually.

Note also that the dependencies libbz2-dev, liblzma-dev, and libfftw3-dev are currently missing from the instructions.

I have made these changes in the following branch and will submit a PR if desired:

https://github.com/oist/GenomicBreaks/tree/dockerfile_manual_install

> docker build  .

Sending build context to Docker daemon  10.65MB
Step 1/3 : FROM bioconductor/bioconductor_docker:RELEASE_3_13
RELEASE_3_13: Pulling from bioconductor/bioconductor_docker
35807b77a593: Pull complete 
765af7b305b2: Pull complete 
47299a17438e: Pull complete 
02782102bff3: Pull complete 
1cc728779689: Pull complete 
77e79bd7a547: Pull complete 
02dc39cd2d4e: Pull complete 
be310f0bcd62: Pull complete 
6e3f6a2d645d: Pull complete 
fb912e359468: Pull complete 
a49649d56f66: Pull complete 
d8ff0af25b9b: Pull complete 
9387799dfa7e: Pull complete 
65c9bbda106d: Pull complete 
23c068cb3e38: Pull complete 
Digest: sha256:f6b9fc2c7fa85cfaffeb36b64188d7986a807978285f55ad7ba317d7c02bdebb
Status: Downloaded newer image for bioconductor/bioconductor_docker:RELEASE_3_13
 ---> 39cb1274f9a5
Step 2/3 : LABEL authors="[email protected]"       description="GenomicBreaks package in the Bioconductor Docker image"
 ---> Running in 6e11e0ad03e2
Removing intermediate container 6e11e0ad03e2
 ---> 63a2dd4269c6
Step 3/3 : RUN R CMD INSTALL .
 ---> Running in 4c2f499f9289
Warning: invalid package ‘.’
Error: ERROR: no packages specified
The command '/bin/sh -c R CMD INSTALL .' returned a non-zero code: 1

flagLongShort outputs wrong flages on non-sorted GBreaks objects.

Also, flagLongShort does not really belong to GenomicBreaks after all.

`plotApairOfChrs` should not use arrows.

In GBreaks objects, the strand() data represents a property of the alignment, not of the genomic regions, therefore they should not be represented by arrows in plotApairOfChrs.

`+` operator should extend both the target and the query ranges

(and of course - should also operate on both)

`plotApairOfChrs` should display seqnames instead of just "target" and "query"

`?GBreaks` shouild open the help of the constructor

At the moment it opens the help of the class...

Functions to align flanks of breakpoing regions.

                   TSD
query       …──────────━━━━━─────────━━━━━─────────────…
alignment    ...............              .............
target      …──────────━━━━━              ─────────────…

Service functions

Align breakpoints

pat          ++++
range  --------==========----------
sub                    ++++             (or ---- if rev = F)

alignBreakPoints <- function(gr, n = 20, reverseComplement = FALSE) {
  if (isFALSE(reverseComplement)) reverseComplement <- identity
  pairwiseAlignment(getSeq(gr |> flank(n, both = T, start = T)),
                    getSeq(gr |> flank(n, both = T, start = F)) |> reverseComplement(),
                    type="global")
}

Align ends

pat            ++++
range  --------==========----------
sub                  ++++             (or ---- if rev = F)

alignEnds <- function(gr, n = 20, reverseComplement = FALSE) {
  resize_ <- function(...) resize(...) |> trim() |> suppressWarnings()
  if (isFALSE(reverseComplement)) reverseComplement <- identity
  pairwiseAlignment(getSeq(gr |> resize_(n, fix = "start")),
                    getSeq(gr |> resize_(n, fix = "end")) |> reverseComplement(),
                    type="global")
}

Search for TSDs

pat            +
range  --------==========----------
sub                      +

searchForTSDs <- function(gr, n = 8) {
  pairwiseAlignment(getSeq(gr |> resize(n, fix = "start")),
                    getSeq(gr |>  flank(n, start = F)),
                    type="global")
}

Align indels

                          ++++          ++++
query       …──────────────────────────────────────────…
target      …───────────────              ─────────────…
                          ++..............++

alignInDel <- function(gb) {
  list(
    q_l = getSeq(gb$query[1] |> flank(20, both = T, start = T))[[1]],
    trg = getSeq(gb[1]       |> flank(20, both = T, start = F))[[1]],
    q_r = getSeq(gb$query[2] |> flank(20, both = T, start = F))[[1]]
  ) |> as("DNAStringSet") |> msa::msaClustalW(order = "input")
}

Align to self

alignToSelf <- function(gr) {
  pairwiseAlignment(getSeq(gr),
                    getSeq(gr) |> reverseComplement(),
                    type="global")
}

+CHANGES IN VERSION 1.54.0
+-------------------------
+
+NEW FEATURES
+
+    o Add 'ignore.strand' arg to gaps() method for GenomicRanges objects.

The `GBreaks` constructor should better handle the strand of the target and query GRanges.

In the example below, the minus sign should be on the target range.

> GBreaks(target = GRanges("chr1:1-100:+"), query = GRanges("chr1:1-100:-"))
GBreaks object with 1 range and 1 metadata column:
      seqnames    ranges strand |        query
         <Rle> <IRanges>  <Rle> |    <GRanges>
  [1]     chr1     1-100      + | chr1:1-100:-
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths