ebits's Introduction

Hi there 👋

I'm Michael, a computational biologist. This means I use computers and code to analyze biological data, often in collaboration with experimental scientists. I work on how cancer cells overcome their growth limitations and how we can use that to help patients. Here you can find some of the work I did, as well as general-purpose tools I wrote for anyone to use. Please reach out if you want to collaborate!

This is what I do for research 👨‍🔬 (full list 🔗)

  • Developed a reliable method for estimating cell signaling pathway activity from gene expression (articles: Nat Comm, Cell; code: github :octocat:, bioc 📦)
  • Showed how gene coexpression networks often reflect cell mixtures instead of regulation (article: BBA-GRM; code: github :octocat:)
  • Found a way cancer cells can tolerate abnormal DNA content (aneuploidy, chromosomal instability) and a potential treatment, in collaboration with experimental scientists (articles: bioRxiv, Nature; code: transposon :octocat:, cgas_ko :octocat:)
  • Working on estimating DNA copy number from single-cell RNA sequencing (coming soon)
Here are some of my open source contributions 🔠 (full list 🔗)

  • clustermq: R package for efficient high-performance computing (article: Bioinformatics; code: github :octocat:, cran 📦, testing ⚙️)
  • narray: R package for simplifying array operations (code: github :octocat:, cran 📦)
  • ebits: R bioinformatics toolkit incubator and data API (code: ebits :octocat:, data :octocat:)
  • Software build scripts for the Arch Linux User Repository 🔗 and as a Gentoo overlay 🔗 (code: pkgbuilds :octocat:, overlay :octocat:)

ebits's People

Contributors

barzine, klmr, mschubert


ebits's Issues

override of list (with a python-like list)

The following function was created by Gabor Grothendieck, who graciously provided it on the r-help list in June 2004:
https://stat.ethz.ch/pipermail/r-help/2004-June/053343.html
It allows a single assignment to populate multiple output variables.

He originally named the function list (thereby overriding the built-in function). However, I have preferred to rename it to avoid confusion.

pyList <- structure(NA, class = "result")
"[<-.result" <- function (x, ..., value) {
    args <- as.list(match.call())
    args <- args[-c(1:2, length(args))]
    length(value) <- length(args)
    for (i in seq_along(args)) {
        a <- args[[i]]
        if (!missing(a))
            eval.parent(substitute(a <- v, list(a = a, v = value[[i]])))
    }
    x
}
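A usage sketch (self-contained, so it repeats the definition above): unpack several values in one assignment, similar to Python's tuple unpacking.

```r
# Multiple-assignment helper (definition repeated from above).
pyList <- structure(NA, class = "result")
"[<-.result" <- function (x, ..., value) {
    args <- as.list(match.call())
    args <- args[-c(1:2, length(args))]
    length(value) <- length(args)
    for (i in seq_along(args)) {
        a <- args[[i]]
        if (!missing(a))
            eval.parent(substitute(a <- v, list(a = a, v = value[[i]])))
    }
    x
}

# Unpack two values at once, like Python's `q, r = divmod(7, 3)`.
pyList[quotient, remainder] <- list(7 %/% 3, 7 %% 3)
quotient   # 2
remainder  # 1
```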

Clean up modules

Below is a (still to be extended) list of what we should clean up:

  • general things
    • import submodules using .module = ...?
    • referencing package::method instead of attaching namespaces in module?
  • base
    • %.% - do we need it?
    • %|% - remove this in favour of %>%?
    • vector/count - isn't this the same as sum?
  • system
    • should this go into base?

hpc: user config should override module config

At the moment, the module-level config for hpc overrides user settings in ~/.BatchJobs.R. This should not happen; the module-level config should only be a default that the user can change.

Not sure how to solve this yet.
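One way this could be resolved (a sketch with hypothetical option names, using base R's modifyList): treat the module-level config as defaults and merge the user's settings over it, so user values always win.

```r
# Hypothetical config lists; the real option names may differ.
module_config <- list(cluster = "lsf", queue = "short")
user_config <- list(queue = "long")

# modifyList keeps module defaults and overrides them with user values.
config <- modifyList(module_config, user_config)
config$queue    # "long": the user setting overrides the module default
config$cluster  # "lsf": module default kept where the user is silent
```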

plot module

Task list before merge into develop

Consider what's implemented in the ggfortify, GGally, ggdendro, ggRandomForests, and ggmcmc packages.

  • color
    • gradients
      • generating from color map
      • plotting of gradients
    • quantization
    • categorical
    • categorical + continuous
  • label
  • matrix
    • full
    • upper/lower triangular
  • heatmap
    • pheatmap interface that respects df conventions
  • venn
    • start from list of vectors
    • size-proportional
    • tree plots?
  • linear_fit
    • linear associations
    • p-value reporting
    • rsquared reporting
  • volcano
    • coloring according to effect direction and p-value

Strict equality comparison

A tweet has been making its rounds lately:

@dylan_childs:

Apparently, in R…

"1" == 1
[1] TRUE
Dirty

Should we try advocating a strict equality comparison operator, akin to JavaScript's/PHP's ===? Its implementation is straightforward; the question is whether anybody would ever use it.

`%==%` = function (a, b)
    class(a) == class(b) & mode(a) == mode(b) & a == b …
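A possible full implementation (a sketch; the committed version may differ): require identical class and mode before comparing values at all.

```r
# Strict equality: only compare values when the types already match.
`%==%` <- function (a, b) {
    if (!identical(class(a), class(b)) || !identical(mode(a), mode(b)))
        FALSE
    else
        a == b
}

"1" %==% 1  # FALSE: character vs numeric
1 %==% 1    # TRUE
"1" == 1    # TRUE: the coercing behaviour complained about above
```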

Find a good solution for stringsAsFactors

Generally, we don't want stringsAsFactors to be TRUE, because it basically messes up everything.

However, different packages and prior code may rely on the option being TRUE, making it difficult to just replace the default value globally, e.g. in ~/.Rprofile.

Combining these two issues: is there a better solution to make modules use stringsAsFactors=FALSE than supplying this default value in all functions we call or overwrite?
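One alternative (a sketch with a hypothetical wrapper name): instead of touching the global option, wrap the readers the modules use, so the default only applies inside our own calls and other packages remain unaffected.

```r
# Wrapper that injects stringsAsFactors = FALSE unless the caller
# supplied a value explicitly; everything else passes through.
read_table_default <- function (...) {
    args <- list(...)
    if (! "stringsAsFactors" %in% names(args))
        args$stringsAsFactors <- FALSE
    do.call(utils::read.table, args)
}

# Demonstration on a throwaway file.
tf <- tempfile()
writeLines(c("x", "a", "b"), tf)
d <- read_table_default(tf, header = TRUE)
is.character(d$x)  # TRUE: strings stay strings
```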

`io$load` should not extract columns from a `data.frame`

io$load checks if there is one or more elements in the .RData file, and it assigns either the object or a list of objects to the lhs.

If loading a data.frame, it treats it as a list and extracts its columns.

This should not happen.

Do not export names of imported modules

Right now, if in module a:

b = import('b')

then

a = import('a')
ls(a)

lists b, even though you would not want to access the module that way.

A way around this would be to (in module a) use:

.b = import('b')

Should we use this for all module-internal imports?

Structure in higher-level functionality

Add branches for the below points, add to them and merge them via pull requests

  • base (extending basic R functions, adding python-like basic functions)
  • text (regex, ...)
  • matrix, array, list (separately?)
  • seq
  • functional
  • hpc
  • stats
  • plots

formula-based df/call's do not work with hpc_args

Standard calls work both with and without specifying hpc_args (i.e., serial or HPC processing by the module).

For formula-based indices, this does not work, because the index references variables in args.

This most likely needs subsetting of the IndexedFormula@args variable back to IndexedFormula@index.

> st$lm(Sepal.Width ~ Sepal.Length, data=iris)
  Sepal.Width Sepal.Length         term   estimate  std.error statistic
1           1            1 Sepal.Length -0.0618848 0.04296699 -1.440287
    p.value size
1 0.1518983  150
> st$lm(Sepal.Width ~ Sepal.Length, data=iris, hpc_args=list())
# Error in (function (` fun`, ..., more.args = list(), export = list(),  :
#   Argument required but not provided: data ...

Does it make sense to add a generalized `names` function to base?

names = function(X, expand.NULL=TRUE) {
    if (is.data.frame(X))
        list(rownames(X), colnames(X))
    else if (is.vector(X))
        base::names(X)  # call the base version to avoid recursing into this override
    else if (is.null(dimnames(X)) && expand.NULL)
        rep(list(NULL), length(dim(X)))
    else
        dimnames(X)
}

`names<-` = function(X, value) {
    if (is.data.frame(X)) {
        rownames(X) = value[[1]]
        colnames(X) = value[[2]]
    } else if (is.vector(X))
        base::names(X) = value[[1]]
    else
        dimnames(X) = value
    X  # a replacement function must return the modified object
}

setNames = function(X, value) {
    names(X) = value
    X
}

Maybe along with a generalized dim.

io$read cannot read files without an extension

The problem is that the extension variable is NA when no extension is specified, and the line

call[[1]] = if (extension == 'xlsx')

produces NA instead of TRUE or FALSE.

The solution is to use identical() here.
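To make the failure mode concrete: == propagates NA, which if () rejects as a condition, while identical() always returns a single TRUE or FALSE.

```r
# No extension was found, so the variable is NA.
extension <- NA_character_

NA_character_ == 'xlsx'       # NA, so `if (extension == 'xlsx')` errors
identical(extension, 'xlsx')  # FALSE, safe to use inside if ()
```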

Module structure

Currently, the base/__init__.r file looks like this (local revision, adjusted to work with the most recent modules version):

import('./operators', attach=T)
import('./override', attach=T)
import('./util', attach=T)
import('./functional', attach=T)
import('./lambda', attach=T)
omit = import('./omit')

However, this does not actually work:

b = import('base')
ls(b)
# [1] "omit"

This is by design: file-level imports inside a module are not visible to the outside, because modules copies the environment layout of packages in this regard (I need to describe this in the documentation).

At any rate, I don’t think this is particularly clean module design: having one module which imports and exposes lots of others seems the opposite of modularisation. What do you think?


… incidentally, we could export the symbols manually. For instance, if we inserted the following line into base/__init__.r:

let = let

Then functional/let would be usable outside base:

base$let(x = 3, x * 2)
# [1] 6

Iterated model building using formulas

Consider the following setup of A, B, and C, where you want to apply a formula (e.g. lm(...), rlm(...), etc.; lm() supports matrix-valued responses, but most others do not). The proposed syntax would add support for iterated calculation of a model where this is not implicitly supported, and a common output format.

A = matrix(1:4, nrow=2, ncol=2, dimnames=list(c('a','b'),c('x','y')))
B = matrix(5:6, nrow=2, ncol=1, dimnames=list(c('b','a'),'z'))
C = matrix(4:5, nrow=2, ncol=2)
# (A)   x y    (B)   z    (C)     [,1] [,2]
#     a 1 3        b 5      [1,]    4    4
#     b 2 4        a 6      [2,]    5    5

With those, you should be able to write (using lm here as an example):

# multiple outcomes - this works as expected
lm(A ~ B) # calculates lm(A[,1] ~ B), lm(A[,2] ~ B), etc.
# multiple inputs - this doesn't work
lm(B ~ A) # calculates lm(B ~ A[,1]), lm(B ~ A[,2]) + the general case
# each should return effect size, p-value, and other metrics

Now, if I want a generalized syntax to specify matrix iterations in models that only need to work on vectors, which one would be the best option?

  • implicitly assume all matrices are iterated, specify grouping with additional arguments
  • specify grouping using interaction syntax
# example 1: iterate A and C through columns. don't iterate B
x1 = create_formula_index(A ~ B + C) # option 1
x1 = create_formula_index(A ~ B:0 + C:0) # option 2
#   A B C
#1 x z 1
#2 y z 1
#3 x z 2
#4 y z 2

# example 2: iterate A and C together, don't iterate B
x2 = create_formula_index(A ~ B + C, group=c("A", "C"), atomic="B") # option 1
x2 = create_formula_index(A:1 ~ B:0 + C:1) # option 2
#   A B C
#1 x z x
#2 y z y

The advantage of option 1 is that this is what lm(...) does, but it can be a bit verbose if you want groups and atomic variables. Option 2 is more verbose when iterating, less when not.

Which option should be preferred?
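For reference, here is what option 1 amounts to mechanically (a sketch; create_formula_index itself is hypothetical here): cross the iterated columns of A and C with expand.grid and keep B fixed at its single column.

```r
# Iterated columns of A ("x", "y") crossed with columns of C (1, 2);
# B contributes its single column "z" to every row.
idx <- expand.grid(A = c("x", "y"), C = 1:2,
                   KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
idx$B <- "z"
idx <- idx[c("A", "B", "C")]
idx
#   A B C
# 1 x z 1
# 2 y z 1
# 3 x z 2
# 4 y z 2
```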

integrated tests

potentially for (wrapped, exported) st$ml:

st = import('stats')
x = matrix(rnorm(40), ncol=2)
y = as.matrix(1:20)
subsets=c(rep("a", 10), rep("b", 10))

re11 = st$ml(y ~ x, train_args=list("regr.glmnet"), atomic="x")
re12 = st$ml(y ~ x, train_args=list("regr.glmnet"), atomic="x", subsets=subsets)

re21 = st$ml(y ~ x, train_args=list("regr.glmnet"), atomic="x", hpc_args=list())
re22 = st$ml(y ~ x, train_args=list("regr.glmnet"), atomic="x", subsets=subsets, hpc_args=list())

x = list(A=x, B=x)
re31 = st$ml(y ~ x, train_args=list("regr.glmnet"))
re32 = st$ml(y ~ x, train_args=list("regr.glmnet"), subsets=subsets, hpc_args=list(n.chunks=3))

Alternatives for IO

At the moment the io module exclusively uses R base IO plus xlsx.

There are two or three broad alternatives:

  • data.table::fread. This is a thoroughly terrible idea and I strongly oppose it, because their code base and their API are a mess, and it’s been in development for years without a stable version. Performance is phenomenal, but we need correctness first, performance last.
  • readr. Also still in development. Uses sane defaults (somewhat similar to what we’re already doing with stringsAsFactors), produces dplyr-compatible tbl_df without row names. Not as fast as fread but apparently still an order of magnitude faster than base R.
  • rio. I honestly don’t know exactly what this is; it seems to be higher level, i.e. read data from different format using the same interface (same as io, but way more formats supported). Uses (or rather, will use in the future) readr under the hood.

`data` module

We don't have a module to facilitate accessing public data sets, even though it's quite common to work with those. For some, a dedicated R package may exist, but not for others.

For example, some that I'm working with are the GDSC, LINCS, and TCGA data.

There should be a module for providing access functions to those, with a link to where to get the actual data. Obviously, this repository shouldn't contain the data itself.

call default args

In io/text: the .set_defaults function should be moved to be generally accessible, as matching default args in a call is a task that will occur more often (I'm working on something now, for instance).

.set_defaults = function (call, .formals = formals(sys.function(sys.parent()))) {
    for (n in names(.formals))
        if (n != '...' && ! (n %in% names(call)))
            call[[n]] = .formals[[n]]

    call
}

The method also fails for functions with NULL default arguments, like the one below.

x = function(y=NULL) {
    args = .set_defaults(match.call())
    as.list(args)
} 
# > x()
# Error in call[[n]] = .formals[[n]] : subscript out of bounds
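A possible fix (a sketch, not tested against the module): build the call from a plain list, where single-bracket assignment preserves NULL defaults instead of erroring.

```r
# Variant of .set_defaults that survives NULL defaults: convert the
# call to a list, fill in missing defaults via [<- (which keeps NULL
# elements), and convert back to a call.
.set_defaults <- function (call, .formals = formals(sys.function(sys.parent()))) {
    args <- as.list(call)
    for (n in names(.formals))
        if (n != '...' && ! (n %in% names(args)))
            args[n] <- list(.formals[[n]])  # [<- preserves NULL values
    as.call(args)
}

x <- function (y = NULL) {
    args <- .set_defaults(match.call())
    as.list(args)
}
x()  # now succeeds, with y = NULL filled in
```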

Lambda shortcut symbol

The goal is to define a shortcut for defining functions that would encourage the use of anonymous functions, since the default syntax in R is rather verbose (function (args) body).

The initial choice fell to .(args -> body), which works fairly well. However, . is a private, non-exported name in the modules convention, and changing this would require some unintuitive modifications, which I want to avoid.

Here are some alternative naming suggestions, keeping in mind that valid R identifiers consist of letters, digits, the dot and the underscore, and that they cannot start with an underscore or digit, nor with a dot followed by a digit, and that furthermore the definition of “letter” is locale-dependent.

function (x, y) x + y
fun(x, y = x + y)
fun(x, y -> x + y)
f(x, y -> x + y)
λ(x, y -> x + y)
ƒ(x, y -> x + y)
F(x, y -> x + y)

Of these, the Unicode variants λ and ƒ appeal the most to me but are impractical for obvious reasons. I dislike F because it uses a capital letter, and because it needlessly redefines a predefined variable (aliasing FALSE). I dislike fun because it saves very little compared to the full-blown function …. And finally, I dislike f because it’s a single, ordinary letter.

Man, I really liked .

Should we add high-level descriptions as READMEs to individual modules?

Having access to the ? help and function-level documentation is great for already using a module, but how do you quickly find out if a certain module is useful for a certain task?

Option 1: Reading the code and technical documentation

Option 2: An (optional) top-level README.md that makes browsing modules for functionality easier and showcases their abilities.

Should we add this to each structured (i.e. in directory) module in ebits?

Add `previous_definition`

When overriding existing functions from base packages, use the following helper function to dispatch to existing functions without hard-coding their exact namespace:

previous_definition = function (name) {
    # Skip the parent environment of the calling frame.
    get(name, envir = parent.env(parent.env(parent.frame())))
}

This allows multiple redefinitions layered on top of each other (unlike e.g. devtools::help, which does this “wrong”). For an example of usage, see this Stack Overflow answer.
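To make the lookup concrete, here is a self-contained sketch (repeating the helper, with a hypothetical override defined at the top level) that wraps base::message without naming base:: in the override:

```r
previous_definition <- function (name) {
    # Skip the parent environment of the calling frame, so the lookup
    # starts one layer above the override's own definition environment.
    get(name, envir = parent.env(parent.env(parent.frame())))
}

# Hypothetical layered override: prepend a tag, then delegate to the
# previous (here: base) definition.
message <- function (...) {
    prev <- previous_definition("message")
    prev("[wrapped] ", ...)
}
message("hello")  # emits "[wrapped] hello" via the base definition
```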

behaviour of array$stack, bind

Stacking/binding vectors doesn't work as expected:

bind(vectorList, along=1) should be the same as do.call(rbind, vectorList)
bind(vectorList, along=2) should be the same as do.call(cbind, vectorList)

and analogously for stack.

For now, it doesn't.
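The expected semantics, written out with base R (assuming a list of equally long named vectors):

```r
vectorList <- list(a = 1:3, b = 4:6)

r1 <- do.call(rbind, vectorList)  # what bind(vectorList, along=1) should give
r2 <- do.call(cbind, vectorList)  # what bind(vectorList, along=2) should give
dim(r1)  # 2 3: one row per vector, named "a" and "b"
dim(r2)  # 3 2: one column per vector
```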

array/intersect does not work with data argument

> ar = import('array')
> ll = list(a=setNames(1:5, letters[1:5]), b=setNames(2:4, letters[2:4]))
> ll
$a
a b c d e 
1 2 3 4 5 

$b
b c d 
2 3 4 

> ar$intersect(a,b,data=ll)
Error in setNames(list(...), unlist(match.call(expand.dots = FALSE)$...)) :
  object 'a' not found

ll should be modified in place, with values as below:

> ar$intersect_list(ll)
$a
b c d 
2 3 4 

$b
b c d 
2 3 4

Rewrite read_table as s3 generic+implementations

@klmr

I would actually replace read.table completely by a generic read_table (with read_table.csv etc.), and build that on top of fread (orders of magnitude faster than read.table).

As for reading Excel tables, that’s arguably a different method, read_tables (note plural).

READMEs using .Rmd/.ipynb?

The couple of READMEs we have got now highlight some functionality of the modules they are in.

Does it make sense to have .Rmd files instead that generate those READMEs to showcase functionality?

Advantages

  • functionality can be checked by re-generating the .mds
  • we'd have some high-level docs for "type x and y comes out"
  • could also incorporate plots

Disadvantages

  • we version-control an automatically generated file (this is minor) and possibly images

%or% for length(lhs) > 1

In base:%or%:

else if (length(a) > 1)
    mapply(cmp, a, b)

Is that what we intend to do?

I know I committed this, but shouldn't it rather be:

else if (length(a) > 1)
    a

array$map reorders subsets

array$map reorders subsets according to the order of levels(subset), which is not what we want - it should keep the same order in the result as in the input array.

ar$stack fails without names along the stacked dimension

> A = matrix(1:4, nrow=2, ncol=2, dimnames=list(c('a','b'),c('x','y')))
> B = matrix(5:6, nrow=2, ncol=1, dimnames=list(c('b','a'),'z'))
> C = ar$stack(list(A, B), along=2)
  x y z
a 1 3 6
b 2 4 5

> colnames(A) = NULL
> colnames(B) = NULL
> ar$stack(list(A, B), along=2)
  [,1] [,2]
a    6    3
b    5    4

Apart from the colnames of the resulting matrix, results should be identical.

array$construct maps length by default

construct(DF, value ~ axis1 + axis2)

This will assign 1 as the value, because the default aggregation function is length.

A better default: the function should map single values through unchanged and throw an error if aggregation is needed.
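A sketch of such a default (hypothetical helper name): pass single values through, and fail loudly where aggregation would actually be needed.

```r
# Identity on single values; error instead of silently aggregating.
map_single <- function (x) {
    if (length(x) != 1)
        stop("cell has ", length(x), " values; supply an aggregation function")
    x
}

map_single(42)    # 42
# map_single(1:2) # would raise an error instead of silently using length()
```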
