bioconductor / summarizedexperiment Goto Github PK

SummarizedExperiment container

Home Page: https://bioconductor.org/packages/SummarizedExperiment

R 100.00%

summarizedexperiment's Introduction

SummarizedExperiment is an R/Bioconductor package that implements the SummarizedExperiment container.

See https://bioconductor.org/packages/SummarizedExperiment for more information including how to install the release version of the package (please refrain from installing directly from GitHub).

summarizedexperiment's People

Contributors

Stargazers

Watchers

Forkers

liubuntu ltla matthieurouland jiefei-wang ddyoyo98 vjcitn kevinrue shuostat

summarizedexperiment's Issues

Adding checkDimnames=FALSE to the SE constructor

Don't know when this happened, but it looks like the SE constructor now checks for consistency in the dimnames between the rowData and colData and assays. This is fine as a default, but I'd like an option to avoid this, in much the same way that I can set withDimnames=FALSE in assays<- and assay<- to force in an assay without manual renaming.

This is motivated by two use cases:

I have a few assays and a data frame of column data. The data frame has the correct sample names but the assays do not (different casing). It is convenient to construct the SE and rely on the getters to override the incorrect column names.
My assays are DelayedMatrices and need to be renamed during SE construction. However, this would add an extra delayed layer, which persists when extracting each assay with withDimnames=FALSE. This in turn causes some downstream code to decide that the DM is not pristine and re-save the data into a new file. (The downstream functions can't avoid this with ignore.dimnames= because there are other cases where renaming is intentional and should trigger a save.) Ideally I would preserve the input state of my DelayedMatrices so that the downstreams can correctly detect duplicates.

Remove all dimnames in `combine_assays_by` to avoid constructor check failures

I can't remember why I decided to be so selective at:

SummarizedExperiment/R/combine-methods.R

Lines 143 to 149 in 71872cc

 # Prevent the SummarizedExperiment() constructor function from choking 

 # on the rownames or colnames of the combined assays. 

 if (by.row) { 

 endoapply(combined, `colnames<-`, NULL) 

 } else { 

 endoapply(combined, `rownames<-`, NULL) 

 }

but it's probably better to just wipe both sets of dimnames in all scenarios, rather than just the dimnames along the non-combined dimension. There's no guarantee that either of the dimnames of the combined object will be the same as the corresponding dimnames of the SE, given that (i) we call the assays getter with withDimnames=FALSE and (ii) we may add any number of unnamed ConstantArrays into the mix.

My proposal is to just replace the code chunk above with:

    for (x in seq_along(combined)) {
        dimnames(combined[[x]]) <- NULL
    }
    combined

or anything that wipes the names off the combined assays. Local testing indicates that unit tests should be happy with this.

`all.equal(x, y)` vs `all(x == y)`

Hi Hervé, @hpages

When there are missing (NA) row names, the dim names check in assays_have_expected_dimnames returns NA whereas identical returns TRUE when row names are missing in both comparisons.
Perhaps it should use all.equal? It less strict than identical.

a_rownames <- c("XKRX", "XKRY2", "XKRY", NA, "XPA", "XPC", "XPNPEPL", "XPNPEP2", 
"XPNPEP3", "XPO1", "XPO4", "XPO5", "XPO6", "XPO7", "XPOT", "XPR1", 
"XRCC1", "XRCC2", "XRCC3", "XRCC4", "XRCC5")
expected_dimnames <- list(a_rownames)
all(a_rownames == expected_dimnames[[1L]])
#' [1] NA
identical(a_rownames, expected_dimnames[[1L]])
#' [1] TRUE
all.equal(a_rownames, expected_dimnames[[1L]])
#' [1] TRUE

SummarizedExperiment/R/SummarizedExperiment-class.R

Lines 267 to 274 in dd615ee

 ok1 <- is.null(a_rownames) || 

 expected_rownames_is_NULL || 

 (length(a_rownames) == length(expected_dimnames[[1L]]) 

 && all(a_rownames == expected_dimnames[[1L]])) 

 ok2 <- is.null(a_colnames) || 

 expected_colnames_is_NULL || 

 (length(a_colnames) == length(expected_dimnames[[2L]]) 

 && all(a_colnames == expected_dimnames[[2L]]))

Reproducible example:

m <- matrix(c(1L, 2L, 3L, 5L, 6L, 7L, 9L, 10L, 11L), nrow = 3, dimnames = list(c("a", "b", NA), c("a", "b", "c")))
m
#>      a b  c
#> a    1 5  9
#> b    2 6 10
#> <NA> 3 7 11
SummarizedExperiment::SummarizedExperiment(m)
#> Error in if (!ok) {: missing value where TRUE/FALSE needed

^{Created on 2022-05-13 by the reprex package (v2.0.1)}

Should the show() method for SummarizedExperiment objects suggest saveHDF5SummarizedExperiment()?

Vince (@vjcitn) suggested:

One could imagine the show method for SummarizedExperiment checking to see if there is evidence of HDF5 in the assay and if it finds it, it adds a line 'To serialize, use saveHDF5SummarizedExperiment.' It may not be completely foolproof but it might help.

Motivated by https://community-bioc.slack.com/archives/C6KJHH0M9/p1633536980007800

Another approach that was suggested is to add save and saveRDS in BiocGenerics, and making them fail (advising use of special method) if handed an HDF5SummarizedExperiment derivate.

However there are several complications with this:

There's no HDF5SummarizedExperiment class. These objects are just SummarizedExperiment objects or derivatives and dispatch cannot be used to distinguish between those that have on-disk data from those that have in-memory data.
On-disk data could be present in any object, not just SummarizedExperiment objects. For example a GRanges object could have a TileDBMatrix object in its metadata columns.
save() cannot easily be turned into a generic.
There are situations where it's ok to call save() or saveRDS() on an object with on-disk data.

Feedback and suggestions are welcome.

colData assignment should fail but does not

library(SummarizedExperiment)
mymat = matrix((1:6)*1.0, nc=3)
dimnames(mymat) = list(c("A", "B"), letters[1:3])
se1 = SummarizedExperiment(mymat)
validObject(se1)
library(S4Vectors)
ndf = DataFrame(v1=10:12)
rownames(ndf) = letters[3:1]
ndf
colData(se1) = ndf 
newAa = assay(se1["A", "a"])
newAa
newAa == oldAa

The content of the sample denoted a in the assay has changed through colData assignment. This seems incompatible with

297     \item{\code{colData(x)}, \code{colData(x) <- value}:}{Get or set the
298       column data. \code{value} is a \link{DataFrame} object. Row
299       names of \code{value} must be NULL or consistent with the existing
300       column names of \code{x}.}

in SummarizedExperiment-class.Rd. I propose that we tighten the "or consistent with" with "identical to" and throw an error in the code if that condition does not hold.

Add rownames check for cbind()

I've recently realised that if two objects have different rownames, cbind() does not seem to throw an error. Whereas rbind() does throws an error if colnames are different (the behaviour I would expect, for safety).

So, if we take these:

library(SummarizedExperiment)

se1 <- SummarizedExperiment(assays = matrix(c(1, 2, 1, 2), ncol = 2))
colnames(se1) <- c("sample1", "sample2")
rownames(se1) <- c("gene1", "gene2")

se2 <- SummarizedExperiment(assays = matrix(c(1, 2, 1, 2), ncol = 2))
colnames(se2) <- c("sample2", "sample1")
rownames(se2) <- c("gene2", "gene1")

Then rbind(se1, se2) throws an error.
Whereas cbind(se1, se2), does not.

I can see that it does check for identical rowRanges, but perhaps there should also be the analogous check from rbind(), something like:

    if (!.compare(lapply(args, rownames)))
            stop("'...' objects must have the same rownames")
    if (!.compare(lapply(args, nrow)))
            stop("'...' objects must have the same number of features")

Or at least issue a warning.
Happy to send a pull request if this makes sense.

rowData does not contain the rownames

library(SummarizedExperiment)
example(SummarizedExperiment)
rownames(se) <- paste0("Gene", seq_len(nrow(se)))
rownames(se)[1:10]
##  [1] "Gene1"  "Gene2"  "Gene3"  "Gene4"  "Gene5"  "Gene6"  "Gene7"  "Gene8" 
##  [9] "Gene9"  "Gene10"
rowData(se)
##      feature_id
##     <character>
## 1         ID001
## 2         ID002
## 3         ID003
## 4         ID004
## 5         ID005
## ...         ...
## 196       ID196
## 197       ID197
## 198       ID198
## 199       ID199
## 200       ID200

One would expect that the rownames of the rowData would be equal to rownames(se), given that this is the case for rownames(colData(se)) and colnames(se).

Session info:

R version 3.4.2 Patched (2017-10-30 r73642)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/R-3-4-branch_release/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/R-3-4-branch_release/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] SummarizedExperiment_1.8.0 DelayedArray_0.4.1        
[3] matrixStats_0.52.2         Biobase_2.38.0            
[5] GenomicRanges_1.30.0       GenomeInfoDb_1.14.0       
[7] IRanges_2.12.0             S4Vectors_0.16.0          
[9] BiocGenerics_0.24.0       

loaded via a namespace (and not attached):
 [1] lattice_0.20-35         bitops_1.0-6            grid_3.4.2             
 [4] zlibbioc_1.24.0         XVector_0.18.0          Matrix_1.2-12          
 [7] tools_3.4.2             RCurl_1.95-4.8          compiler_3.4.2         
[10] GenomeInfoDbData_0.99.1

Allow clearing of colData

There doesn't seem to be an easy way to empty out the colData in a SE object. I would have liked to do:

rowData(se) <- NULL # works
colData(se) <- NULL # fails, no method for "NULL"

This is occasionally necessary when trying to cbind different SE objects of different origins, where we want to clear out any differences in the column metadata. Perhaps the following bit of code would suffice:

setReplaceMethod("colData", c("SummarizedExperiment", "NULL"),
    function(x, ..., value)
{
    value <- new("DataFrame", nrows=nrow(x))
    BiocGenerics:::replaceSlots(x, colData=value, check=FALSE)
})

Happy to put in a PR if people think that this is a good idea.

Slightly relaxing the rowData checks for cbind

Inspired by https://support.bioconductor.org/p/126722/.

The problem is what to do when a rowData field is not identical across cbind'd elements. Currently cbind throws an error, which is understandable (and safest) but inconvenient.

A more convenient approach would be to simply drop non-identical fields with a warning. This would allow people to proceed with cbinded results - most of the time, they won't even care that the fields are lost, given that it wouldn't make sense for them to be there anyway.

cbind already accommodates situations where the elements have non-identical rowData with non-overlapping names, so it doesn't seem much of a stretch to just drop fields instead of erroring.

Difficult bug when colnames are arrays

This bug was hell to track down:

a <- matrix(10, 10, 26)
dimnames(a)[[2]] <- array(LETTERS, 26) 
se <- SummarizedExperiment(list(counts=a))
## Error in FUN(X[[i]], ...) : 
##  assay colnames() must be NULL or equal colData rownames()

colnames(a) is an array but gets converted to a character vector in the SummarizedExperiment() constructor. This results in the error as the array is not identical to the vector, presumably. However, it was very difficult to identify, not least because an array with 1 dimension looks identical to a vector (only by checking dim() did I figure it out). This is especially dangerous if anyone is manually pulling out dimension names from a HDF5 file, as rhdf5 automatically seems to pull things out as arrays.

I'm testing this on BioC-release (1.8.1) as I'm not at my other computer right now, but I guess this would be the same for the current devel?

Conversion to SingleCellExpriment object

Is there a way to convert SummarizedExperiment object to SingleCellExperiment object? My final goal is to create a Seurat object from the SummarizedExperiment object, but the conversion functions available can only be applied to SingleCellExperiment objects

all.equal does not work for SE objects

This was a rather unpleasant little surprise:

library(SummarizedExperiment)
A <- SummarizedExperiment(matrix(1, 10, 10))
B <- SummarizedExperiment(matrix(2, 10, 10))
all.equal(A, B) # TRUE

I assume that this is due to some voodoo with environments inside assays. But this is worrying as it puts into question all of my unit tests across packages that work on SummarizedExperiments or their subclasses.

It seems that all.equal is a S3 generic, so perhaps a special method could be defined for the relevant inner class of the assays field (probably ShallowSimpleListAssays) to achieve the correct behaviour?

Session information

R Under development (unstable) (2018-11-02 r75535)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] SummarizedExperiment_1.13.0 DelayedArray_0.9.0         
 [3] BiocParallel_1.17.0         matrixStats_0.54.0         
 [5] Biobase_2.43.0              GenomicRanges_1.35.0       
 [7] GenomeInfoDb_1.19.0         IRanges_2.17.0             
 [9] S4Vectors_0.21.0            BiocGenerics_0.29.1        

loaded via a namespace (and not attached):
 [1] lattice_0.20-38        bitops_1.0-6           grid_3.6.0            
 [4] zlibbioc_1.29.0        XVector_0.23.0         Matrix_1.2-15         
 [7] tools_3.6.0            RCurl_1.95-4.11        compiler_3.6.0        
[10] GenomeInfoDbData_1.2.0

rowData rownames issue explored at BioC2018

Here is a reproducing example for a potential issue in the devel branch version of SummarizedExperiment. It seems an inconsistency in rownames of rowData and rownames of the SE, which would prevent creation of an SE, is nevertheless allowed during rowData re-assignment.

https://gist.github.com/mikelove/8fb94c610b588d4a6a64848d3fcbdf76

We tried with SummarizedExperiment v1.11.5 (so use.names=TRUE is the default).

Thanks

Stop wrapping the user-supplied assays in a ref object when storing them in the "assays" slot of a SummarizedExperiment object

Looks like reference classes as implemented in the methods package are a bit problematic e.g. false negatives returned by all.equal() after a serialization/deserialization cycle on a ref object as reported here https://stat.ethz.ch/pipermail/bioc-devel/2019-May/015112.html

Their use in the assays slot of a SummarizedExperiment object is probably not needed anymore now that R is "shallow copy by default" (according to Michael's comment here). However there are hundreds (if not thousands) of serialized instances around so we need to be careful when changing this.

The good news is that the assays slot only needs to be an Assays object and Assays is a virtual class with no slot:

showClass("Assays")
# Virtual Class "Assays" [package "SummarizedExperiment"]
#
# No Slots, prototype of class "SimpleListAssays"
#
# Known Subclasses: "SimpleListAssays", "ShallowSimpleListAssays", "AssaysInEnv"

ShallowSimpleListAssays is only 1 concrete subclass among others and is the one currently used by the SummarizedExperiment() constructor and assays() setter to wrap the assays passed to them. ShallowSimpleListAssays is a reference class.

So we should consider changing this to wrap the user-supplied assays in a SimpleListAssays object instead. SimpleListAssays is a direct SimpleList extension. Serialized instances that use ShallowSimpleListAssays should keep working.

SummarizedExperiment::cbind

Hello,

this is just a suggestion. If a have to SE, with the same data, but one has a metadata column that the other does not have.

The package gives the following error:

Error in map_x_colnames_to_object_colnames(colnames(object)) : 
  the DataFrame objects to rbind must have the same number of columns

But instead of using a rbind wouldn't it be better to use a rbind.fill (https://www.rdocumentation.org/packages/plyr/versions/1.8.4/topics/rbind.fill)
in this part code

SummarizedExperiment/R/SummarizedExperiment-class.R

Line 711 in 01aa07a

colData <- do.call(rbind, lapply(args, colData))

In that way, it will populate the metadata with NA and shows show a warning instead.
I am not sure if it was discussed before. Otherwise, it would be nice to specify more clearly that the problem is within the metadata.

Here is an example of the problem if we add year_to_death column.

nrows <- 200; ncols <- 6
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)
rowRanges <- GRanges(rep(c("chr1", "chr2"), c(50, 150)),
                     IRanges(floor(runif(200, 1e5, 1e6)), width=100),
                     strand=sample(c("+", "-"), 200, TRUE),
                     feature_id=sprintf("ID%03d", 1:200))
colData <- DataFrame(Treatment=rep(c("ChIP", "Input"), 3),
                     row.names=LETTERS[1:6])
rse <- SummarizedExperiment(assays=SimpleList(counts=counts),
                            rowRanges=rowRanges, colData=colData)

colData <- DataFrame(Treatment=rep(c("ChIP", "Input"), 3),
                     years_to_death = 1:6,
                     row.names=LETTERS[1:6])
rse2 <- SummarizedExperiment(assays=SimpleList(counts=counts),
                            rowRanges=rowRanges, colData=colData)

SummarizedExperiment::cbind(rse2, rse)

Enforce unique assay names

This started as a more general discussion about empty strings in List names but the real concern seems to be more specifically about the names of the assays. It comes down to these basic questions:

Should we enforce names on the assays? Right now assay names are optional:

library(SummarizedExperiment)

m1 <- matrix(1:12, ncol=3)
m2 <- m1 + 100.5
se <- SummarizedExperiment(list(m1, m2))

assayNames(se)
# NULL

## Note that the show() method is misleading here, suggesting that the names are empty strings:
se
# class: SummarizedExperiment 
# dim: 4 3 
# metadata(0):
# assays(2): '' ''
# rownames: NULL
# rowData names(0):
# colnames: NULL
# colData names(0):

If the user does not supply assay names, should we make automatic names? (the other option would be to complain in an error message)

Should we enforce their uniqueness? Right now they can have duplicates:

se <- SummarizedExperiment(list(A=m1, A=m2))
assayNames(se)
# [1] "A" "A"

Should we also forbid empty or NA names? Right now they are allowed:

se <- SummarizedExperiment(setNames(list(m1, m2), c("", NA)))
assayNames(se)
# [1] "" NA

My answer would be "yes" to all 4 questions.

Note that the situation is very similar to what data.frame() and DataFrame() do with column names (when check.names=TRUE). So the last question is:

Should we just use make.names(., unique=TRUE) like data.frame() and DataFrame() do to fix the user-supplied names?

@LTLA @vjcitn @lawremi Comments? Suggesttions?

assay<- is too pedantic when the dimensions are named

Identified in drisso/SingleCellExperiment#59:

mat <- matrix(rnorm(1000), ncol=20)
colnames(mat) <- LETTERS[1:20]

library(SummarizedExperiment)
se <- SummarizedExperiment(list(thingy=mat))

names(dim(mat)) <- c("Gene", "Cell")
assay(se) <- mat
## Error in `assays<-`(`*tmp*`, withDimnames = withDimnames, ..., value = `*vtmp*`) : 
##   please use 'assay(x, withDimnames=FALSE)) <- value' or 'assays(x,
##   withDimnames=FALSE)) <- value' when the dimnames on the supplied
##   assay(s) are not identical to the dimnames on SummarizedExperiment
##   object 'x'

I'll put aside the question of whether it's a good idea to rename the dim vector, because someone or something is clearly doing it, so assay<- should be robust to it. I'm guessing there's an identical() call on the old and new dimnames() that is getting tripped by the new names on the dimnames().

Session information

R version 4.1.0 Patched (2021-05-20 r80347)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.5 LTS

Matrix products: default
BLAS:   /home/luna/Software/R/R-4-1-branch-dev/lib/libRblas.so
LAPACK: /home/luna/Software/R/R-4-1-branch-dev/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] SingleCellExperiment_1.15.0 SummarizedExperiment_1.23.0
 [3] Biobase_2.53.0              GenomicRanges_1.45.0       
 [5] GenomeInfoDb_1.29.0         IRanges_2.27.0             
 [7] S4Vectors_0.31.0            BiocGenerics_0.39.0        
 [9] MatrixGenerics_1.5.0        matrixStats_0.58.0         
[11] BiocFileCache_2.1.0         dbplyr_2.1.1               

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.6             XVector_0.33.0         pillar_1.6.1          
 [4] compiler_4.1.0         zlibbioc_1.39.0        bitops_1.0-7          
 [7] tools_4.1.0            bit_4.0.4              lattice_0.20-44       
[10] RSQLite_2.2.7          memoise_2.0.0          lifecycle_1.0.0       
[13] tibble_3.1.2           pkgconfig_2.0.3        rlang_0.4.11          
[16] Matrix_1.3-3           DelayedArray_0.19.0    DBI_1.1.1             
[19] filelock_1.0.2         curl_4.3.1             fastmap_1.1.0         
[22] GenomeInfoDbData_1.2.6 withr_2.4.2            dplyr_1.0.6           
[25] httr_1.4.2             generics_0.1.0         vctrs_0.3.8           
[28] rappdirs_0.3.3         grid_4.1.0             bit64_4.0.5           
[31] tidyselect_1.1.1       glue_1.4.2             R6_2.5.0              
[34] fansi_0.4.2            purrr_0.3.4            blob_1.2.1            
[37] magrittr_2.0.1         ellipsis_0.3.2         assertthat_0.2.1      
[40] utf8_1.2.1             RCurl_1.98-1.3         cachem_1.0.5          
[43] crayon_1.4.1

Error in coercion SummarizedExperiment <-> ExpressionSet

> library(SummarizedExperiment)
> data(ALL, package="ALL")
> ALL
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12625 features, 128 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: 01005 01010 ... LAL4 (128 total)
  varLabels: cod diagnosis ... date last seen (21 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 14684422 16243790 
Annotation: hgu95av2

Coercion to SummarizedExperiment: works.

> se <- as(ALL, "SummarizedExperiment")
> se
class: SummarizedExperiment 
dim: 12625 128 
metadata(3): experimentData annotation protocolData
assays(1): exprs
rownames(12625): 1000_at 1001_at ... AFFX-YEL021w/URA3_at
  AFFX-YEL024w/RIP1_at
rowData names(0):
colnames(128): 01005 01010 ... 83001 LAL4
colData names(21): cod diagnosis ... f.u date.last.seen

Coercion back to ExpressionSet: breaks.

> eset <- as(se, "ExpressionSet")
Error in `rownames<-`(`*tmp*`, value = c("1000_at", "1001_at", "1002_f_at",  : 
  invalid rownames length

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] SummarizedExperiment_1.11.6 DelayedArray_0.7.3         
 [3] BiocParallel_1.15.8         matrixStats_0.54.0         
 [5] Biobase_2.41.2              GenomicRanges_1.33.11      
 [7] GenomeInfoDb_1.17.1         IRanges_2.15.13            
 [9] S4Vectors_0.19.19           BiocGenerics_0.27.1        

loaded via a namespace (and not attached):
 [1] lattice_0.20-35        bitops_1.0-6           grid_3.5.1            
 [4] zlibbioc_1.27.0        XVector_0.21.1         Matrix_1.2-14         
 [7] tools_3.5.1            RCurl_1.95-4.11        compiler_3.5.1        
[10] GenomeInfoDbData_1.1.0

Consider assayData DataFrame for assays

I asked on Twitter then Bioc Stack, then @vjcitn suggested I ask here. :)

“using SummarizedExperiment:
I want something like assayData() to hold tabular data about each matrix in assays(), one row per assay.
When storing more than one assay matrix, I encode too much into the assay name.

Has this idea been discussed?”

I basically want an empty DataFrame with slot “assayData” with one row per entry in the assays slot. (I see your post asking about assay name constraints, that could be useful or necessary here as well.)

I can add some driving use cases in the next post.

Two basic utilities:

mechanism to annotate each entry in assays, using column-based info, not just assay name
method to subset SE assays which also subsets this assayData

Vignette Doesn't Specify Dimensionality Rules

One important item of information which I feel should be included in SummarizedExperiment vignette is that all of the assays need to have the same dimensions. For example, if the user had a gene expression microarray for which each gene was measured by two or three probes, the row number would be smaller for preprocessed data compared to the raw data and a MultiAssayExperiment might be used instead to store both varieties.

`combineCols()` and `combineRows()` not listed in 'Usage' section

These are mentioned in the 'Combining' section but not up the top in the 'Usage'.
Want me to add them around here?

SummarizedExperiment/man/SummarizedExperiment-class.Rd

Lines 133 to 136 in 7aff324

 ## Combining 

 \S4method{cbind}{SummarizedExperiment}(..., deparse.level=1) 

 \S4method{rbind}{SummarizedExperiment}(..., deparse.level=1)

c() does not work on SummarizedExperiment

I would have assumed that c() would be equivalent to rbind() on SummarizedExperiment. However, it fails, because bindROWS,Vector() does not handle the assays. Perhaps rbind,SummarizedExperiment() should be refactored into a bindROWS() method?

Constructor assigns wrong assay values to wrong samples and features

Hi!

I found out a possible bug from SE constructor.

assay data's sample and feature names are ordered based on colData and rowData, respectively. However, values are not ordered correspondingly --> wrong values goes to wrong samples and features

Here is an example that shows the behavior:

library(SummarizedExperiment)

# assay data
counts <- rbind(rep(0, 3), matrix(1:9, nrow = 3))
colnames(counts) <- paste0("sample", 1:3)
rownames(counts) <- paste0("OTU", seq_len(4))
# row data
rowData <- data.frame(Species = rep(c("S1", "S2"), each = 2),
                      row.names = rownames(counts),
                      stringsAsFactors = FALSE)
# col data
colData <- DataFrame(Treatment=c("X", "X", "Y"),
                     row.names=colnames(counts))

# Modify the order of samples in assay data
counts_sample_mod <- counts[, c("sample2", "sample3", "sample1")]
# Modify the order of features in assay data
counts_feat_mod <- counts[c( "OTU2", "OTU4", "OTU1", "OTU3"), ]

# Create three different SE objects, which have differently ordered assay data
se <- SummarizedExperiment(assays=SimpleList(counts=counts),
                           colData=colData,
                           rowData = rowData)
se_sample_mod <- SummarizedExperiment(assays=SimpleList(counts=counts_sample_mod),
                                      colData=colData,
                                      rowData = rowData)
se_feat_mod <- SummarizedExperiment(assays=SimpleList(counts=counts_feat_mod),
                                    colData=colData,
                                    rowData = rowData)

# Sample and feature names are ordered based on colData and rowData. 
# However, values are not ordered correspondingly.
# Original
counts
assay(se, "counts")
# Samples modified
counts_sample_mod
assay(se_sample_mod, "counts")
# Features modified
counts_feat_mod
assay(se_feat_mod, "counts")

Outputs of those assay tables:

> # Original
> counts
     sample1 sample2 sample3
OTU1       0       0       0
OTU2       1       4       7
OTU3       2       5       8
OTU4       3       6       9
> assay(se, "counts")
     sample1 sample2 sample3
OTU1       0       0       0
OTU2       1       4       7
OTU3       2       5       8
OTU4       3       6       9
> # Samples modified
> counts_sample_mod
     sample2 sample3 sample1
OTU1       0       0       0
OTU2       4       7       1
OTU3       5       8       2
OTU4       6       9       3
> assay(se_sample_mod, "counts")
     sample1 sample2 sample3
OTU1       0       0       0
OTU2       4       7       1
OTU3       5       8       2
OTU4       6       9       3
> # Features modified
> counts_feat_mod
     sample1 sample2 sample3
OTU2       1       4       7
OTU4       3       6       9
OTU1       0       0       0
OTU3       2       5       8
> assay(se_feat_mod, "counts")
     sample1 sample2 sample3
OTU1       1       4       7
OTU2       3       6       9
OTU3       0       0       0
OTU4       2       5       8

BR,
Tuomas Borman

Add a showAsCell method for SummarizedExperiment objects

Consider the following:

library(SummarizedExperiment)
se <- SummarizedExperiment(list(counts=cbind(1:10)))
y <- DataFrame(SE=I(se))
## DataFrame with 10 rows and 1 column
## Error: C stack usage  7969716 is too close to the limit
## In addition: There were 50 or more warnings (use warnings() to see the first 50)

One quick solution would be to just admit that we can't do anything sensible here:

setMethod("showAsCell", "SummarizedExperiment", function(object) rep.int("####", NROW(object)))
y
## DataFrame with 10 rows and 1 column
##                        SE
##    <SummarizedExperiment>
## 1                    ####
## 2                    ####
## 3                    ####
## 4                    ####
## 5                    ####
## 6                    ####
## 7                    ####
## 8                    ####
## 9                    ####
## 10                   ####

This is relevant to LTLA/TrajectoryUtils#4; tagging @kstreet13.

Should `rowData` be merged with `mcols(gr)`?

Hi Hervé, @hpages
I have an example here:

library(GenomicRanges)
library(SummarizedExperiment)
example(GRanges)
example(SummarizedExperiment)
ss <- se[1:10, ]
rowData(ss)
rowRanges(ss) <- gr
rowData(ss)
## equivalently
mcols(ss)

Shouldn't rowData(ss) be a combination of the previous object's rowData and the mcols of the merged GRanges, i.e., DataFrame(rowData(ss), mcols(gr))?

Thank you,
Marcel

fail to create `rbind` for SE-derived class due to the validity check

Hi,

I am creating rbind for TreeSummarizedExperiment (a SE-derived class). In TreeSummarizedExperiment, it has a slot rowLinks that has the same row dimension as assays of SE.

What I did is:

use callNextMethod() to rbind slots inherited from SE
write my own functions to work on new slots, and finally use BiocGenerics:::replaceSlots() to update new slots

The difficulty I have now is that the output of step 1 can't pass the validity check of TreeSummarizedExperiment(the row dimension of rowLinks doesn't match with assays of SE before step 2) to run step 2.

The same situation should also occur in the subset [, which however has no problem due to the use of check=FALSE in BiocGenerics:::replaceSlots (e.g., https://github.com/Bioconductor/SummarizedExperiment/blob/master/R/SummarizedExperiment-class.R#L454).

So, I wonder whether it would be also reasonable to set check=FALSE in BiocGenerics:::replaceSlots for rbind & cbind (https://github.com/Bioconductor/SummarizedExperiment/blob/master/R/SummarizedExperiment-class.R#L733
https://github.com/Bioconductor/SummarizedExperiment/blob/master/R/SummarizedExperiment-class.R#L766).

Or could you please kindly help how to get around this issue here? The Backtrace is shown below

Error: 
rowLinks: 20 rows are expected
Backtrace:
     █
  1. ├─testthat::expect_s4_class(rbind(tse_b, tse_e), "TreeSummarizedExperiment") test_combine.R:69:4
  2. │ └─testthat::quasi_label(enquo(object), arg = "object")
  3. │   └─rlang::eval_bare(expr, quo_get_env(quo))
  4. └─BiocGenerics::rbind(tse_b, tse_e)
  5.   ├─BiocGenerics:::standardGeneric("rbind")
  6.   │ ├─BiocGenerics::eval(mc, env)
  7.   │ └─base::eval(mc, env)
  8.   │   └─base::eval(mc, env)
  9.   └─TreeSummarizedExperiment::rbind(...)
 10.     ├─methods::callNextMethod()
 11.     └─SingleCellExperiment:::.nextMethod(... = ...)
 12.       ├─methods::callNextMethod()
 13.       └─SummarizedExperiment:::.nextMethod(... = ...)
 14.         └─SummarizedExperiment:::.rbind.SummarizedExperiment(args)
 15.           └─BiocGenerics:::replaceSlots(...)
 16.             └─methods::validObject(object)
 17.               ├─methods:::anyStrings(validityMethod(object))
 18.               │ └─base::isTRUE(x)
 19.               └─TreeSummarizedExperiment:::validityMethod(object)

Thank you!

Extraction isn't working with ShallowSimpleListAssays on Bioc Devel 3.10

I'm seeing a potential issue with ShallowSimpleListAssays pop up on Bioc Devel. Extraction of an assay (i.e. from a SummarizedExperiment object) doesn't seem to be working as expected, and returns:

Error in assays[[1]] : wrong arguments for subsetting an environment

Here's an example on a RangedSummarizedExperiment:

assays <- slot(object, "assays")

showClass("ShallowSimpleListAssays")
## Class "ShallowSimpleListAssays" [package "SummarizedExperiment"]
##
## Slots:
##
## Name:       .xData
## Class: environment
##
## Extends:
## Class "ShallowData", directly
## Class "Assays", directly
## Class "envRefClass", by class "ShallowData", distance 2
## Class ".environment", by class "ShallowData", distance 3
## Class "refClass", by class "ShallowData", distance 3
## Class "environment", by class "ShallowData", distance 4, with explicit
##   coerce
## Class "refObject", by class "ShallowData", distance 4
## Class "AssayData", by class "ShallowData", distance 5, with explicit
##   coerce
##

is.environment(assays)
## [1] TRUE

lapply(
    X = seq_along(assays),
    FUN = function(a) {
        assays[[a]]
    }
)
## Error in assays[[a]] : wrong arguments for subsetting an environment

See Travis CI job log for an example:
https://travis-ci.com/acidgenomics/bcbioRNASeq/jobs/219620463

SummarizedExperiment doesn't work with data.frame input

I would have expected the following to work:

a <- data.frame(A=1:5, B=2:6)
library(SummarizedExperiment)
SummarizedExperiment(a)
## Error in seq_len(ncol(assay)) : 
##   argument must be coercible to non-negative integer
## In addition: Warning message:
## In seq_len(ncol(assay)) : first element used of 'length.out' argument

This is attributable to the fact that a data.frame is treated as a list of column vectors and ends up being processed as such. The only way I get the desired result is to do:

SummarizedExperiment(list(list(B=a)))

One could argue that I shouldn't be storing a data.frame as an SE assay anyway, and that would be fair enough. But in that case, I would expect a more useful error message (e.g., instructing the user to coerce to an ordinary matrix) than what we currently get, which is pretty uninterpretable.

Session information

R Under development (unstable) (2018-11-02 r75535)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRblas.so
LAPACK: /home/cri.camres.org/lun01/Software/R/trunk/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] SummarizedExperiment_1.13.0 DelayedArray_0.9.0         
 [3] BiocParallel_1.17.1         matrixStats_0.54.0         
 [5] Biobase_2.43.0              GenomicRanges_1.35.1       
 [7] GenomeInfoDb_1.19.1         IRanges_2.17.1             
 [9] S4Vectors_0.21.4            BiocGenerics_0.29.1        

loaded via a namespace (and not attached):
 [1] lattice_0.20-38        bitops_1.0-6           grid_3.6.0            
 [4] zlibbioc_1.29.0        XVector_0.23.0         Matrix_1.2-15         
 [7] tools_3.6.0            RCurl_1.95-4.11        compiler_3.6.0        
[10] GenomeInfoDbData_1.2.0

Improving the performance when calling assays accessor on a SummarizedExperiment object?

Hi,

I found an unnecessary memory duplication when operates on the object that is returned by the assays function. I think it is an unintentional behavior so I would like to discuss it here. Here is a simple example.

nrows <- 10; ncols <- 10
counts <- matrix(0.0, nrows,ncols)
rse <- SummarizedExperiment(assays=SimpleList(counts=counts))
rse

We can call assays to get the assay data back and compare it with the slot data.

asys <- assays(rse)
identical(asys, rse@assays@data)

It is not surprising to see that they are identical, and I will expect the is no duplication of the data. However, when I check their internal data structure, it turns out that a duplication happened here(truncated output below)

> .Internal(inspect(rse@assays@data@listData$counts))
@0x000001c00e41aaf8 14 REALSXP g0c7 [REF(17),ATT] (len=100, tl=0) 0,0,0,0,0,...
> .Internal(inspect(asys@listData$counts))
@0x000001c00a1cb0a0 14 REALSXP g0c0 [REF(4),ATT]  wrapper [srt=-2147483648,no_na=0]
  @0x000001c018e1ad48 14 REALSXP g0c7 [REF(1),ATT] (len=100, tl=0) 0,0,0,0,0,...

Surprisingly, the duplication is not triggered by the assays accessor, but the identical function. The root cause is that the assays function will put a wrapper on the object it returns(asys@listData$counts in this example), and the wrapper duplicates its data when a data pointer is required at C level. This unnecessary duplication can bring a significant cost when the data is large. By looking into the assays function(signature "SummarizedExperiment"), I found where the wrapper is created

assays <- endoapply(assays, function(a) {
            a_dimnames <- dimnames(a)
            a_dimnames[1:2] <- x_dimnames
            dimnames(a) <- DelayedArray:::simplify_NULL_dimnames(a_dimnames)
            a
        })

The key function that creates the wrapper is dimnames<-. No matter whether the dimnames of an object will be changed, it always creates a wrapper on the object. Since there are many functions that rely on the data pointer to do the calculation(That's why I find it when I was working on the shared object: a shared object becomes unshared unexpectedly.), I suggest making the following change to prevent the creation of the wrapper if not necessary.

assays <- endoapply(assays, function(a) {
    a_dimnames <- dimnames(a)
    a_dimnames[1:2] <- x_dimnames
    a_dimnames <- DelayedArray:::simplify_NULL_dimnames(a_dimnames)
    if(!identical(dimnames(a), a_dimnames))
        dimnames(a) <- a_dimnames
    a
})

This will solve the wrapper problem and can provide some performance benefits for some functions.

Best,
Jiefei

SummarizedExperiment() constructor with SummarizedExperiment given as argument should return the same SummarizedExperiment

I need to make a suggestion for a slight change in SummarizedExperiment. The constructor for a SummarizedExperiment or RangedSummarizedExperiment is SummarizedExperiment(). Given an existing RangedSummarizedExperiment object, rse, I would expect running the constructor on an existing object, it should simply return the same object. This is not the case, as it instead returns a new object with the existing object stored as an assay.

> example(SummarizedExperiment)
> rse
class: RangedSummarizedExperiment
dim: 200 6
metadata(0):
assays(1): counts
rownames: NULL
rowData names(1): feature_id
colnames(6): A B ... E F
colData names(1): Treatment

> rse2 <- SummarizedExperiment(rse)
> rse2
class: SummarizedExperiment
dim: 200 6
metadata(0):
assays(1): ''
rownames: NULL
rowData names(0):
colnames(6): A B ... E F
colData names(0):

> assays(rse2)[[1]]
class: RangedSummarizedExperiment
dim: 200 6
metadata(0):
assays(1): counts
rownames: NULL
rowData names(1): feature_id
colnames(6): A B ... E F
colData names(1): Treatment

The constructor should return the same object if used on the constructor, correct? If so, this change should be implemented.

Build an assay element to work from my own normalized count matrix

Hello, I have used your scripts and work very well, thanks a lot! However I would like to build a summarizexperiment (se) from my own normalized matrix, I mean I want to store my matrix in an assay object to run it with summarizexperiment and create a summarise element. Sorry I completely got lost doing it, is there a parameter to use in summarizexperiment to tell the program what data use? I have the features=ebg that has worked with the bam files, so I want to use it with a matrix of FPKM´s I built using other software. I have just found tutorial to count from scratch from the bam files....

I appreciate a lot your help...

import / export as loom

@hpages I implemented an .export_loom() function on the loom-export branch to complement makeSummarizedExperimentFromLoom(), but I was wondering where it really belongs? In some ways it could go in HDF5Array, with functionality like saveHDF5SummarizedExperimet(). And ultimately it could well end up in a loom package (see also loomR and hdf5r). Suggestions?

There are some pretty severe limitations on functionality, e.g., not supporting GRangesList, any of the *List or other column classes, or even factor levels.

error object 'normalize_names_replacement_value' not found

Hi,
I encountered the problem with using SummarizedExperiment version 1.16.1.
The error message is
"Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
object 'normalize_names_replacement_value' not found".

I am using R 3.6.1 on my mac system Catalina 10.15.1. The SummarizedExperiment version is 1.16.1

Thank you in advance for your support!

subset fails when subsetting a SummarizedExperiment object with no samples

Environment:

Code to reproduce:

library(MultiAssayExperiment)
no_samples <- MultiAssayExperiment::subsetByColData(miniACC, y = miniACC$gender %in% c())
subset(no_samples[["gistict"]])

output:

I expect it to behave the same way as subsetting an empty data.frame:

subset(data.frame())

This is an issue for us because in our use case we don't want to check if there are no samples in the SummarizedExperiment object.

loading package taking too much memory

SummarizedExperiment is using too much memory and time to load. The code bellow and results indicate that it is taking 328MB of memory. For a single use this is not a big issue. However using library(SummarizedExperiment) on high performance computing is causing the code to crash due this high memory use.

library(profvis)			
profvis({
  library(SummarizedExperiment)
})

SummarizedExperiment R package not allowing installation.

H, I have been attempting to install SummarizedExperiment on R 3.6.1 with no success. I have all other dependencies installed. Error is shown below:

** using staged installation
** R
** inst
** byte-compile and prepare package for lazy loading
Warning messages:
1: multiple methods tables found for ‘splitAsList’
2: replacing previous import ‘S4Vectors::splitAsList’ by ‘IRanges::splitAsList’ when loading ‘GenomeInfoDb’
3: replacing previous import ‘S4Vectors::splitAsList’ by ‘IRanges::splitAsList’ when loading ‘GenomicRanges’
4: replacing previous import ‘S4Vectors::splitAsList’ by ‘IRanges::splitAsList’ when loading ‘XVector’
Warning: replacing previous import ‘S4Vectors::splitAsList’ by ‘IRanges::splitAsList’ when loading ‘SummarizedExperiment’
Warning: undefined slot classes in definition of "SummarizedExperiment": NAMES(class "characterORNULL")
Warning: undefined slot classes in definition of "SummarizedExperiment0": NAMES(class "characterORNULL")
Warning: undefined slot classes in definition of "SummarizedExperiment": NAMES(class "characterORNULL")
Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
object 'normalize_names_replacement_value' not found
Error: unable to load R code in package ‘SummarizedExperiment’
Execution halted
ERROR: lazy loading failed for package ‘SummarizedExperiment’

Constructor could check that assay is in matric format

Hi,

assay in assays needs to be in matrix format. However, constructor does not check that the assay is in right format

--> you can store DataFrame object as an assay --> because many functions expects assay in matrix format this leads to an error

We have noticed that many beginners make this same mistake. This is quite hard to solve especially when some function uses internal function that expects the data in matrix format, e.g. MatrixGenerics::rowSums2.

I was wondering if constructor could check that assay is in right format. If not, the constructor could throw an error, convert DataFrame into matrix (maybe not), or print a message (maybe the most preferable).

Below is an example that illustrates this issue

library(SummarizedExperiment)
library(MatrixGenerics)

nrows <- 20; ncols <- 6
counts <- matrix(runif(nrows * ncols, 1, 1e4), nrows)

# Correctly constructed object
rse <- SummarizedExperiment(assays=SimpleList(counts=counts))
# No error
rowSums2(assay(rse))

# Convert into df
counts_df <- DataFrame(counts)
colnames(counts_df) <- rownames(colData)

# Constructor works with df as an assay
rse_df <- SummarizedExperiment(assays=SimpleList(counts=counts_df))

# Fails because requires matrix
rowSums2(assay(rse_df))

Error in MatrixGenerics:::.load_next_suggested_package_to_search(x) : 
    Failed to find a rowSums2() method for DFrame objects.

Session info

R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=fi_FI.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=fi_FI.UTF-8
[6] LC_MESSAGES=en_US.UTF-8 LC_PAPER=fi_FI.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fi_FI.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4 grid stats graphics grDevices utils datasets methods base

other attached packages:
[1] viridis_0.6.2 viridisLite_0.4.0 vegan_2.6-2 permute_0.9-7
[5] stringr_1.4.0 scater_1.22.0 scuttle_1.4.0 scales_1.1.1
[9] plotROC_2.2.1 patchwork_1.1.1 mia_1.3.22 MultiAssayExperiment_1.20.0
[13] TreeSummarizedExperiment_2.2.0 Biostrings_2.62.0 XVector_0.34.0 SingleCellExperiment_1.16.0
[17] SummarizedExperiment_1.24.0 Biobase_2.54.0 GenomicRanges_1.46.1 GenomeInfoDb_1.30.1
[21] IRanges_2.28.0 S4Vectors_0.32.4 BiocGenerics_0.40.0 MatrixGenerics_1.6.0
[25] matrixStats_0.61.0 knitr_1.38 ggsignif_0.6.3 ggrepel_0.9.1
[29] ggplotify_0.1.0 ggord_1.1.6 ggdendro_0.1.23 dplyr_1.0.8
[33] dendextend_1.15.2 cutpointr_1.1.1 ComplexHeatmap_2.11.1 circlize_0.4.14
[37] caret_6.0-90 lattice_0.20-45 ggplot2_3.3.5 BiocManager_1.30.16
[41] rmarkdown_2.13 phyloseq_1.38.0

loaded via a namespace (and not attached):
[1] utf8_1.2.2 tidyselect_1.1.2 RSQLite_2.2.11 BiocParallel_1.28.3
[5] Rtsne_0.15 pROC_1.18.0 munsell_0.5.0 ScaledMatrix_1.2.0
[9] codetools_0.2-18 xgboost_1.5.2.1 future_1.24.0 withr_2.5.0
[13] colorspace_2.0-3 highr_0.9 rstudioapi_0.13 listenv_0.8.0
[17] Rdpack_2.3 labeling_0.4.2 GenomeInfoDbData_1.2.7 bit64_4.0.5
[21] farver_2.1.0 rhdf5_2.38.1 parallelly_1.30.0 vctrs_0.3.8
[25] treeio_1.18.1 generics_0.1.2 ipred_0.9-12 xfun_0.30
[29] R6_2.5.1 doParallel_1.0.17 ggbeeswarm_0.6.0 clue_0.3-60
[33] rsvd_1.0.5 bitops_1.0-7 rhdf5filters_1.6.0 microbiome_1.16.0
[37] cachem_1.0.6 gridGraphics_0.5-1 DelayedArray_0.20.0 assertthat_0.2.1
[41] nnet_7.3-17 beeswarm_0.4.0 gtable_0.3.0 beachmat_2.10.0
[45] Cairo_1.5-15 globals_0.14.0 timeDate_3043.102 rlang_1.0.2
[49] GlobalOptions_0.1.2 splines_4.1.2 lazyeval_0.2.2 ModelMetrics_1.2.2.2
[53] yaml_2.3.5 reshape2_1.4.4 tools_4.1.2 lava_1.6.10
[57] ellipsis_0.3.2 decontam_1.14.0 jquerylib_0.1.4 biomformat_1.22.0
[61] RColorBrewer_1.1-2 proxy_0.4-26 Rcpp_1.0.8.3 plyr_1.8.7
[65] sparseMatrixStats_1.6.0 zlibbioc_1.40.0 purrr_0.3.4 RCurl_1.98-1.6
[69] rpart_4.1-15 GetoptLong_1.0.5 cowplot_1.1.1 cluster_2.1.2
[73] DECIPHER_2.22.0 magrittr_2.0.2 data.table_1.14.2 evaluate_0.15
[77] gridExtra_2.3 shape_1.4.6 compiler_4.1.2 tibble_3.1.6
[81] crayon_1.5.1 htmltools_0.5.2 mgcv_1.8-38 tidyr_1.2.0
[85] ANCOMBC_1.4.0 lubridate_1.8.0 DBI_1.1.2 MASS_7.3-56
[89] Matrix_1.4-0 ade4_1.7-18 cli_3.2.0 rbibutils_2.2.7
[93] parallel_4.1.2 gower_1.0.0 igraph_1.2.11 pkgconfig_2.0.3
[97] recipes_0.1.17 foreach_1.5.2 vipor_0.4.5 bslib_0.3.1
[101] DirichletMultinomial_1.36.0 multtest_2.50.0 prodlim_2019.11.13 yulab.utils_0.0.4
[105] digest_0.6.29 tidytree_0.3.7 DelayedMatrixStats_1.16.0 rjson_0.2.21
[109] nloptr_1.2.2.3 lifecycle_1.0.1 nlme_3.1-157 jsonlite_1.8.0
[113] Rhdf5lib_1.16.0 BiocNeighbors_1.12.0 fansi_1.0.3 pillar_1.7.0
[117] fastmap_1.1.0 survival_3.2-13 glue_1.6.2 png_0.1-7
[121] iterators_1.0.14 bit_4.0.4 class_7.3-20 stringi_1.7.6
[125] sass_0.4.1 blob_1.2.2 BiocSingular_1.10.0 memoise_2.0.1
[129] e1071_1.7-9 irlba_2.3.5 future.apply_1.8.1 ape_5.6-2

-Tuomas

Subsetting

This may be too much asked but I am curious to check it out.

Let us first load some example SE data.

# Example data 
library(mia)
data(dmn_se)
se <- dmn_se

Subsetting based on colData requires explicit definition of the colData:

se[, colData(se)$pheno=="Obese"]

It would quite handy if one could subset this simply with:

se[, pheno=="Obese"]

It might not be too much assumed that, by default, the subsetting in such case would apply specifically to colData fields. Moreover, such change would not exclude the possibility of using more complex subsetting commands.

Similar points apply to rowData as well.

Perhaps this is against some design principles but might come handy anyway.

colnames-related concerns


# let se denote a SummarizedExperiment instance
# this code shows that
# 1) colnames(assay(se))<- is a no-op and should probably throw an error
# 2) colData(se)<- modifies assay colnames in a way I found surprising.  This
# operation COULD attempt to reorder rows of the DataFrame supplied to conform
# with available assay colnames ... should it?  This is with 1.17.1
# These were identified by Sara Stankiewicz, a new member of our group

library(SummarizedExperiment)
mymat = matrix((1:6)*1.0, nc=3)
dimnames(mymat) = list(c("A", "B"), letters[1:3])
se1 = SummarizedExperiment(mymat)
validObject(se1)
oldAa = assay(se1["A", "a"])
oldAa
oldcn = colnames(assay(se1))
colnames(assay(se1)) <- letters[4:6] # silent, should fail?
any(colnames(assay(se1)) == oldcn)  
library(S4Vectors)
ndf = DataFrame(v1=10:12)
rownames(ndf) = letters[3:1]
ndf
colData(se1) = ndf 
newAa = assay(se1["A", "a"])
newAa
newAa == oldAa # would work if colData rows reordered to conform

derived RSE to SE coercion discards names

Coercing a subclass of an RSE to an SE discards the row names:

library(SummarizedExperiment)
example(SummarizedExperiment, echo=FALSE)

setClass("MyExperiment", contains="RangedSummarizedExperiment")
me <- new("MyExperiment", se)

rownames(me) <- paste0("GENE_", seq_len(nrow(me)))
se2 <- as(me, 'SummarizedExperiment')
rownames(se2)
## NULL

This doesn't seem to happen with just the RSE to SE coercion, oddly enough.

rownames(se) <- paste0("GENE_", seq_len(nrow(se)))
se3 <- as(se, 'SummarizedExperiment')
rownames(se3)
## Stuff comes out

Session information

R version 3.6.0 Patched (2019-05-10 r76483)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Users/luna/Software/R/R-3-6-branch-dev/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-3-6-branch-dev/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] SummarizedExperiment_1.15.5 DelayedArray_0.11.4
 [3] BiocParallel_1.19.0         matrixStats_0.54.0
 [5] Biobase_2.45.0              GenomicRanges_1.37.14
 [7] GenomeInfoDb_1.21.1         IRanges_2.19.10
 [9] S4Vectors_0.23.17           BiocGenerics_0.31.5

loaded via a namespace (and not attached):
 [1] lattice_0.20-38        bitops_1.0-6           grid_3.6.0
 [4] zlibbioc_1.31.0        XVector_0.25.0         Matrix_1.2-17
 [7] tools_3.6.0            RCurl_1.95-4.12        compiler_3.6.0
[10] GenomeInfoDbData_1.2.1

Coercion from SummarizedExperiment to ExpressionSet creates an unlocked environment

I ran into some unexpected pass-by-reference behavior with ExpressionSets that were coerced from SummarizedExperiments today.

I believe this can be traced to the SummarizedExperiment -> ExpressionSet coersion implemented in the setAs("RangedSummarizedExperiment", "ExpressionSet") method, which returns an ExpressionSet with an unlocked environment in the assayData slot, instead of the default lockedEnvironment storage mode.

(See page 7 of the BiobaseDevelopment vignette for a discussion of the storage modes.)

I believe the storageMode of the ExpressionSet object returned by the setAs("RangedSummarizedExperiment", "ExpressionSet") method needs to be locked explicitly here.

eset <- ExpressionSet(assayData,
                  phenoData = phenoData,
                  featureData = featureData,
                  experimentData = experimentData,
                  annotation = annotation,
                  protocolData = protocolData
                  )
storageMode(eset) <- "lockedEnvironment"
return(eset)

Reproducible example:

library(SummarizedExperiment)
library(airway)
data("airway")

eset <- as(airway, "ExpressionSet")
storageMode(eset)
# [1] "environment"

This can lead to unintended pass-by-reference behavior downstream, e.g.

exprs(eset)[1:3, 1:3]

                SRR1039508 SRR1039509 SRR1039512
ENSG00000000003        679        448        873
ENSG00000000005          0          0          0
ENSG00000000419        467        515        621

When I copy the ExpressionSet into a new object

eset2 <- eset
exprs(eset2)[1:3, 1:3]

the data is same as above, as expected:

                SRR1039508 SRR1039509 SRR1039512
ENSG00000000003        679        448        873
ENSG00000000005          0          0          0
ENSG00000000419        467        515        621

But when I modify the assayData in the new object:

exprs(eset2) <- log2(exprs(eset2) + 1)
exprs(eset2)[1:3, 1:3]

                SRR1039508 SRR1039509 SRR1039512
ENSG00000000003   9.409391   8.810572   9.771489
ENSG00000000005   0.000000   0.000000   0.000000
ENSG00000000419   8.870365   9.011227   9.280771

unexpectedly (at least to me), the assayData in the original ExpressionSet (eset) also gets modified:

exprs(eset)[1:3, 1:3]

                SRR1039508 SRR1039509 SRR1039512
ENSG00000000003   9.409391   8.810572   9.771489
ENSG00000000005   0.000000   0.000000   0.000000
ENSG00000000419   8.870365   9.011227   9.280771

because they both point to the same (unlocked) environment.

SessionInfo

R version 4.0.2 (2020-06-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Mojave 10.14.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] airway_1.8.0                SummarizedExperiment_1.18.1 DelayedArray_0.14.0         matrixStats_0.56.0         
 [5] Biobase_2.48.0              GenomicRanges_1.40.0        GenomeInfoDb_1.24.2         IRanges_2.22.2             
 [9] S4Vectors_0.26.1            BiocGenerics_0.34.0         BiocManager_1.30.10         devtools_2.3.0             
[13] usethis_1.6.1              

loaded via a namespace (and not attached):
 [1] compiler_4.0.2         XVector_0.28.0         prettyunits_1.1.1      bitops_1.0-6          
 [5] remotes_2.1.1          tools_4.0.2            zlibbioc_1.34.0        packrat_0.5.0         
 [9] testthat_2.3.2         digest_0.6.25          pkgbuild_1.0.8         pkgload_1.1.0         
[13] lattice_0.20-41        memoise_1.1.0          rlang_0.4.6            Matrix_1.2-18         
[17] cli_2.0.2              rstudioapi_0.11        GenomeInfoDbData_1.2.3 withr_2.2.0           
[21] desc_1.2.0             fs_1.4.2               grid_4.0.2             rprojroot_1.3-2       
[25] glue_1.4.1             R6_2.4.1               processx_3.4.2         fansi_0.4.1           
[29] sessioninfo_1.1.1      callr_3.4.3            magrittr_1.5           backports_1.1.8       
[33] ps_1.3.3               fortunes_1.5-4         ellipsis_0.3.1         assertthat_0.2.1      
[37] RCurl_1.98-1.2         crayon_1.3.4

Incorrect document page

Hi,

When typing ?SummarizedExperiment in R it brings the document of RangedSummarizedExperiment. I originally thought it was made on purpose, but from the description of RangedSummarizedExperiment,

RangedSummarizedExperiment is a subclass of SummarizedExperiment and, as such, all the methods documented in ?SummarizedExperiment also work on a RangedSummarizedExperiment object. The methods documented below are additional methods that are specific to RangedSummarizedExperiment

I think this is not intentional. There is a separate help page of SummarizedExperiment and can be accessed via ?"SummarizedExperiment-class". Perhaps ?SummarizedExperiment can be redirected to ?"SummarizedExperiment-class"?

Best,
Jiefei

assayNames<- is reluctant to work

This seems to occur if the dimnames of the underlying matrix differ from that of the SE:

library(SummarizedExperiment)
mat <- matrix(1,1,1)
colnames(mat) <- "B"
rownames(mat) <- "GENE2"

se <- SummarizedExperiment(list(counts=mat))
colnames(se) <- "A"
rownames(se) <- "GENE1"

assayNames(se) <- "whee"
## Error in `assays<-`(`*tmp*`, withDimnames = FALSE, value = `*vtmp*`) :
##   current and replacement dimnames() differ

Session information

R version 3.6.0 Patched (2019-05-10 r76483)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /Users/luna/Software/R/R-3-6-branch-dev/lib/libRblas.dylib
LAPACK: /Users/luna/Software/R/R-3-6-branch-dev/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel  stats4    stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] SummarizedExperiment_1.15.3 DelayedArray_0.11.1
 [3] BiocParallel_1.19.0         matrixStats_0.54.0
 [5] Biobase_2.45.0              GenomicRanges_1.37.11
 [7] GenomeInfoDb_1.21.1         IRanges_2.19.10
 [9] S4Vectors_0.23.12           BiocGenerics_0.31.4

loaded via a namespace (and not attached):
 [1] lattice_0.20-38        bitops_1.0-6           grid_3.6.0
 [4] zlibbioc_1.31.0        XVector_0.25.0         Matrix_1.2-17
 [7] tools_3.6.0            RCurl_1.95-4.12        compiler_3.6.0
[10] BiocManager_1.30.4     GenomeInfoDbData_1.2.1

a = SimpleList(setNames(nm = list())); assays(SummarizedExperiment(a)) is no longer endomorphic

@hpages

SummarizedExperiment/R/SummarizedExperiment-class.R

Line 84 in 81d41a1

assays <- NULL

changes

x = SimpleList(setNames(nm = list()))  # named simple list
assays(SummarizedExperiment(assays = x))      # NULL

Previously this was an endomorphism. Is this intentional?

Why is SummarizedExperiment graphic not directly included in package?

Is there a reason the graphic describing the geometry of the SummarizedExperiment class is not included in in the package?
Instead, it's downloaded from a Google Doc and then included in the vignette

SummarizedExperiment/vignettes/SummarizedExperiment.Rmd

Line 75 in 01aa07a

 download.file("https://docs.google.com/feeds/download/drawings/Export?id=1kiC8Qlo1mhSnLDqkGiRNPSo6GWn3C2duBszCFbJCB-g&exportFormat=svg", "SE.svg") 

I'm including this figure in a workflow and it felt odd to have to download a file from a random looking URL rather than include it via system.file() from the package.

seqinfo<-() fails when @rowData is a GRangesList

It's not clear why seqinfo,RangedSummarizedExperiment() calls update(x@rowRanges, ...) instead of calling seqinfo(x@rowRanges) <- value. But the end result, when x@rowRanges is a GRangesList, is

Error in methods::slot(object, name) : 
  no slot of name "call" for this object of class "CompressedGRangesList"

since there is no update() method for a GRangesList.

Convert ShallowSimpleListAssays into an S3 list

Is there any simple way to convert a ShallowSimpleListAssays into a S3 list? Methods like as.list or unlist do not work. The current workaround that I use is as follows:

 assaysList <- lapply(1:length(assays(x)), function(i){
  assays(x)[[i]]
})

I think a more simple approach should exist.

More informative error message for mismatching numbers of rows

Consider the following example:

library(SummarizedExperiment)
se <- SummarizedExperiment(assays=list(X=cbind(1:10)))
cbind(se, se[1:5,])
## Error in cbind(...) : number of rows of matrices must match (see arg 2)

Okay, that's pretty helpful. But the following is more obscure:

se <- SummarizedExperiment(rowData=DataFrame(X=1:10))
cbind(se, se[1:5,])
## Error in FUN(X[[i]], ...) :
##   column(s) 'X' in ‘mcols’ are duplicated and the data do not match

Of course, I know why this happens, but it would be a lot nicer if it could just say "number of rows don't match". This is especially important because I use SE's cbind internally in some SCE methods; currently, the user ends up with a message about some column they never actually created themselves, while the real error has to do with a mismatching number of rows.

“SimpleAssays” is not a defined class

Hi,

I have a problem opening the SummarizedExperiment file

library("SummarizedExperiment")
data <- readRDS("~/data.rds")
data

class: SummarizedExperiment
dim: 10736 100
metadata(0):
Error in getClass(element.type) : “SimpleAssays” is not a defined class
Error during wrapup: cannot get a slot ("slots") from an object of type "NULL"
Thank you

sessionInfo()
R version 3.6.1 (2019-07-05)

attached base packages:
[1] parallel stats4 stats graphics grDevices utils datasets
[8] methods base

other attached packages:
[1] tibble_3.0.1 readr_1.3.1
[3] SummarizedExperiment_1.14.1 DelayedArray_0.10.0
[5] BiocParallel_1.18.1 matrixStats_0.56.0
[7] Biobase_2.44.0 GenomicRanges_1.36.1
[9] GenomeInfoDb_1.20.0 IRanges_2.18.3
[11] S4Vectors_0.22.1 BiocGenerics_0.30.0

loaded via a namespace (and not attached):
[1] Rcpp_1.0.2 rstudioapi_0.11 knitr_1.28
[4] XVector_0.24.0 hms_0.5.3 zlibbioc_1.30.0
[7] lattice_0.20-41 R6_2.4.1 rlang_0.4.6
[10] tools_3.6.1 grid_3.6.1 xfun_0.14
[13] crayon_1.3.4 Matrix_1.2-18 GenomeInfoDbData_1.2.1
[16] BiocManager_1.30.10 vctrs_0.3.1 bitops_1.0-6
[19] RCurl_1.98-1.2 pillar_1.4.4 compiler_3.6.1
[22] pkgconfig_2.0.3

Implement an all.equal() method for SummarizedExperiment objects that performs a "deep comparison"

all.equal() on SummarizedExperiment objects needs to perform a "deep comparison" in order to avoid false positives or false negatives. In particular this means that it must perform a deep comparison between the assays slots of the 2 objects. This deep comparison will be expensive when the assay data is on disk but hey, the end user should realize that s/he is asking for a deep comparison so should be ready to pay the cost.

	# Prevent the SummarizedExperiment() constructor function from choking
	# on the rownames or colnames of the combined assays.
	if (by.row) {
	endoapply(combined, `colnames<-`, NULL)
	} else {
	endoapply(combined, `rownames<-`, NULL)
	}

	ok1 <- is.null(a_rownames) \|\|
	expected_rownames_is_NULL \|\|
	(length(a_rownames) == length(expected_dimnames[[1L]])
	&& all(a_rownames == expected_dimnames[[1L]]))
	ok2 <- is.null(a_colnames) \|\|
	expected_colnames_is_NULL \|\|
	(length(a_colnames) == length(expected_dimnames[[2L]])
	&& all(a_colnames == expected_dimnames[[2L]]))

	## Combining

	\S4method{cbind}{SummarizedExperiment}(..., deparse.level=1)
	\S4method{rbind}{SummarizedExperiment}(..., deparse.level=1)

bioconductor / summarizedexperiment Goto Github PK

summarizedexperiment's Introduction

summarizedexperiment's People

Contributors

Stargazers

Watchers

Forkers

summarizedexperiment's Issues

Reproducible example:

SessionInfo

Recommend Projects

Recommend Topics

Recommend Org