ebits's Introduction

Hi there 👋

I'm Michael, a computational biologist. This means I use computers and code to analyze biological data, often in collaboration with experimental scientists. I work on how cancer cells overcome their growth limitations and how we can use that to help patients. Here you can find some of the work I did, as well as general-purpose tools I wrote for anyone to use. Please reach out if you want to collaborate!

This is what I do for research 👨‍🔬 (full list 🔗)

  • Developed a reliable method for estimating cell signaling pathway activity from gene expression (articles: Nat Comm, Cell; code: github :octocat:, bioc 📦)
  • Showed how gene coexpression networks often reflect cell mixtures instead of regulation (article: BBA-GRM; code: github :octocat:)
  • Found a way cancer cells can tolerate abnormal DNA content (aneuploidy, chromosomal instability) and a potential treatment, in collaboration with experimental scientists (articles: bioRxiv, Nature; code: transposon :octocat:, cgas_ko :octocat:)
  • Working on estimating DNA copy number from single-cell RNA sequencing (coming soon)
Here are some of my open source contributions 🔠 (full list 🔗)

  • clustermq: R package for efficient high-performance computing (article: Bioinformatics; code: github :octocat:, cran 📦, testing ⚙️)
  • narray: R package for simplifying array operations (code: github :octocat:, cran 📦)
  • ebits: R bioinformatics toolkit incubator and data API (code: ebits :octocat:, data :octocat:)
  • Software build scripts for the Arch Linux User Repository 🔗 and as a Gentoo overlay 🔗 (code: pkgbuilds :octocat:, overlay :octocat:)

ebits's People

Contributors

barzine, klmr, mschubert


ebits's Issues

override of list (with a python-like list)

The following function was created by Gabor Grothendieck, who graciously provided it on the r-help list in June 2004:
https://stat.ethz.ch/pipermail/r-help/2004-June/053343.html
It allows a single assignment to populate multiple output variables.

He originally named the function list (thereby overriding the built-in function). However, I have preferred to rename it to avoid confusion.

pyList <- structure(NA, class = "result")
"[<-.result" <- function (x, ..., value) {
    args <- as.list(match.call())
    args <- args[-c(1:2, length(args))]
    length(value) <- length(args)
    for (i in seq_along(args)) {
        a <- args[[i]]
        if (!missing(a))
            eval.parent(substitute(a <- v, list(a = a, v = value[[i]])))
    }
    x
}
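A usage sketch (self-contained, so it repeats the definition above): unpack several values in one assignment, similar to Python's tuple unpacking.

```r
# Multiple-assignment helper (definition repeated from above).
pyList <- structure(NA, class = "result")
"[<-.result" <- function (x, ..., value) {
    args <- as.list(match.call())
    args <- args[-c(1:2, length(args))]
    length(value) <- length(args)
    for (i in seq_along(args)) {
        a <- args[[i]]
        if (!missing(a))
            eval.parent(substitute(a <- v, list(a = a, v = value[[i]])))
    }
    x
}

# Unpack two values at once, like Python's `q, r = divmod(7, 3)`.
pyList[quotient, remainder] <- list(7 %/% 3, 7 %% 3)
quotient   # 2
remainder  # 1
```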

Clean up modules

Below is a (still to be extended) list of what we should clean up:

  • general things
    • import submodules using .module = ...?
    • referencing package::method instead of attaching namespaces in module?
  • base
    • %.% - do we need it?
    • %|% - remove this in favour of %>%?
    • vector/count - isn't this the same as sum?
  • system
    • should this go into base?

hpc: user config should override module config

At the moment, the module-level config for hpc overrides user settings in ~/.BatchJobs.R. This should not happen; the module-level config should only be a default that the user can change.

Not sure how to solve this yet.
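One way this could be resolved (a sketch with hypothetical option names, using base R's modifyList): treat the module-level config as defaults and merge the user's settings over it, so user values always win.

```r
# Hypothetical config lists; the real option names may differ.
module_config <- list(cluster = "lsf", queue = "short")
user_config <- list(queue = "long")

# modifyList keeps module defaults and overrides them with user values.
config <- modifyList(module_config, user_config)
config$queue    # "long": the user setting overrides the module default
config$cluster  # "lsf": module default kept where the user is silent
```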

plot module

Task list before merge into develop

Consider what's implemented in the ggfortify, GGally, ggdendro, ggRandomForests, and ggmcmc packages.

  • color
    • gradients
      • generating from color map
      • plotting of gradients
    • quantization
    • categorical
    • categorical + continuous
  • label
  • matrix
    • full
    • upper/lower triangular
  • heatmap
    • pheatmap interface that respects df conventions
  • venn
    • start from list of vectors
    • size-proportional
    • tree plots?
  • linear_fit
    • linear associations
    • p-value reporting
    • rsquared reporting
  • volcano
    • coloring according to effect direction and p-value

Strict equality comparison

A tweet has been making its rounds lately:

@dylan_childs:

Apparently, in R…

"1" == 1
[1] TRUE
Dirty

Should we try advocating a strict equality comparison operator, akin to JavaScript's/PHP's ===? Its implementation is straightforward; the question is whether anybody would ever use it.

`%==%` = function (a, b)
    class(a) == class(b) & mode(a) == mode(b) & a == b …
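A possible full implementation (a sketch; the committed version may differ): require identical class and mode before comparing values at all.

```r
# Strict equality: only compare values when the types already match.
`%==%` <- function (a, b) {
    if (!identical(class(a), class(b)) || !identical(mode(a), mode(b)))
        FALSE
    else
        a == b
}

"1" %==% 1  # FALSE: character vs numeric
1 %==% 1    # TRUE
"1" == 1    # TRUE: the coercing behaviour complained about above
```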

Find a good solution for stringsAsFactors

Generally, we don't want stringsAsFactors to be TRUE, because it basically messes up everything.

However, different packages and prior code may rely on the option being TRUE, making it difficult to just replace the default value globally, e.g. in ~/.Rprofile.

Combining these two issues: is there a better solution to make modules use stringsAsFactors=FALSE than supplying this default value in all functions we call or overwrite?
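One alternative (a sketch with a hypothetical wrapper name): instead of touching the global option, wrap the readers the modules use, so the default only applies inside our own calls and other packages remain unaffected.

```r
# Wrapper that injects stringsAsFactors = FALSE unless the caller
# supplied a value explicitly; everything else passes through.
read_table_default <- function (...) {
    args <- list(...)
    if (! "stringsAsFactors" %in% names(args))
        args$stringsAsFactors <- FALSE
    do.call(utils::read.table, args)
}

# Demonstration on a throwaway file.
tf <- tempfile()
writeLines(c("x", "a", "b"), tf)
d <- read_table_default(tf, header = TRUE)
is.character(d$x)  # TRUE: strings stay strings
```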

`io$load` should not extract columns from a `data.frame`

io$load checks if there is one or more elements in the .RData file, and it assigns either the object or a list of objects to the lhs.

If loading a data.frame, it treats it as a list and extracts its columns.

This should not happen.

Do not export names of imported modules

Right now, if in module a:

b = import('b')

then

a = import('a')
ls(a)

lists b, even though you would not want to access the module that way.

A way around this would be to (in module a) use:

.b = import('b')

Should we use this for all module-internal imports?

Structure in higher-level functionality

Add branches for the below points, add to them and merge them via pull requests

  • base (extending basic R functions, adding python-like basic functions)
  • text (regex, ...)
  • matrix, array, list (separately?)
  • seq
  • functional
  • hpc
  • stats
  • plots

formula-based df/call's do not work with hpc_args

Standard calls work both with and without specifying hpc_args (i.e., serial or HPC processing by the module).

For formula-based indices, this does not work, because the index references variables in args.

This most likely needs subsetting of the IndexedFormula@args variable back to IndexedFormula@index.

> st$lm(Sepal.Width ~ Sepal.Length, data=iris)
  Sepal.Width Sepal.Length         term   estimate  std.error statistic
1           1            1 Sepal.Length -0.0618848 0.04296699 -1.440287
    p.value size
1 0.1518983  150
> st$lm(Sepal.Width ~ Sepal.Length, data=iris, hpc_args=list())
# Error in (function (` fun`, ..., more.args = list(), export = list(),  :
#   Argument required but not provided: data ...

Does it make sense to add a generalized `names` function to base?

names = function(X, expand.NULL=TRUE) {
    if (is.data.frame(X))
        list(rownames(X), colnames(X))
    else if (is.vector(X))
        base::names(X)  # call the base version to avoid recursing into this override
    else if (is.null(dimnames(X)) && expand.NULL)
        rep(list(NULL), length(dim(X)))
    else
        dimnames(X)
}

`names<-` = function(X, value) {
    if (is.data.frame(X)) {
        rownames(X) = value[[1]]
        colnames(X) = value[[2]]
    } else if (is.vector(X))
        base::names(X) = value[[1]]
    else
        dimnames(X) = value
    X  # a replacement function must return the modified object
}

setNames = function(X, value) {
    names(X) = value
    X
}

Maybe along with a generalized dim.

io$read cannot read files without an extension

The problem is that the extension variable is NA when no extension is specified, and the line

call[[1]] = if (extension == 'xlsx')

produces NA instead of TRUE or FALSE.

The solution is to use identical() here.
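To make the failure mode concrete: == propagates NA, which if () rejects as a condition, while identical() always returns a single TRUE or FALSE.

```r
# No extension was found, so the variable is NA.
extension <- NA_character_

NA_character_ == 'xlsx'       # NA, so `if (extension == 'xlsx')` errors
identical(extension, 'xlsx')  # FALSE, safe to use inside if ()
```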

Module structure

Currently, the base/__init__.r file looks like this (local revision, adjusted to work with the most recent modules version):

import('./operators', attach=T)
import('./override', attach=T)
import('./util', attach=T)
import('./functional', attach=T)
import('./lambda', attach=T)
omit = import('./omit')

However, this does not actually work:

b = import('base')
ls(b)
# [1] "omit"

This is by design: file-level imports inside a module are not visible to the outside, because modules copies the environment layout of packages in this regard (I need to describe this in the documentation).

At any rate, I don’t think this is particularly clean module design: having one module which imports and exposes lots of others seems the opposite of modularisation. What do you think?


… incidentally, we could export the symbols manually. For instance, if we inserted the following line into base/__init__.r:

let = let

Then functional/let would be usable outside base:

base$let(x = 3, x * 2)
# [1] 6

Iterated model building using formulas

Consider the following setup of A, B, and C, where you want to apply a formula (e.g. lm(...), rlm(...), etc.; lm() supports matrix-valued responses, but most others do not). The proposed syntax would add support for iterated calculation of a model where this is not implicitly supported, and a common output format.

A = matrix(1:4, nrow=2, ncol=2, dimnames=list(c('a','b'),c('x','y')))
B = matrix(5:6, nrow=2, ncol=1, dimnames=list(c('b','a'),'z'))
C = matrix(4:5, nrow=2, ncol=2)
# (A)   x y    (B)   z    (C)     [,1] [,2]
#     a 1 3        b 5      [1,]    4    4
#     b 2 4        a 6      [2,]    5    5

With those, you should be able to write (using lm here as an example):

# multiple outcomes - this works as expected
lm(A ~ B) # calculates lm(A[,1] ~ B), lm(A[,2] ~ B), etc.
# multiple inputs - this doesn't work
lm(B ~ A) # calculates lm(B ~ A[,1]), lm(B ~ A[,2]) + the general case
# each should return effect size, p-value, and other metrics

Now, if I want a generalized syntax to specify matrix iterations in models that only need to work on vectors, which one would be the best option?

  • implicitly assume all matrices are iterated, specify grouping with additional arguments
  • specify grouping using interaction syntax
# example 1: iterate A and C through columns. don't iterate B
x1 = create_formula_index(A ~ B + C) # option 1
x1 = create_formula_index(A ~ B:0 + C:0) # option 2
#   A B C
#1 x z 1
#2 y z 1
#3 x z 2
#4 y z 2

# example 2: iterate A and C together, don't iterate B
x2 = create_formula_index(A ~ B + C, group=c("A", "C"), atomic="B") # option 1
x2 = create_formula_index(A:1 ~ B:0 + C:1) # option 2
#   A B C
#1 x z x
#2 y z y

The advantage of option 1 is that this is what lm(...) does, but it can be a bit verbose if you want groups and atomic variables. Option 2 is more verbose when iterating, less when not.

Which option should be preferred?
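For reference, here is what option 1 amounts to mechanically (a sketch; create_formula_index itself is hypothetical here): cross the iterated columns of A and C with expand.grid and keep B fixed at its single column.

```r
# Iterated columns of A ("x", "y") crossed with columns of C (1, 2);
# B contributes its single column "z" to every row.
idx <- expand.grid(A = c("x", "y"), C = 1:2,
                   KEEP.OUT.ATTRS = FALSE, stringsAsFactors = FALSE)
idx$B <- "z"
idx <- idx[c("A", "B", "C")]
idx
#   A B C
# 1 x z 1
# 2 y z 1
# 3 x z 2
# 4 y z 2
```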

integrated tests

potentially for (wrapped, exported) st$ml:

st = import('stats')
x = matrix(rnorm(40), ncol=2)
y = as.matrix(1:20)
subsets=c(rep("a", 10), rep("b", 10))

re11 = st$ml(y ~ x, train_args=list("regr.glmnet"), atomic="x")
re12 = st$ml(y ~ x, train_args=list("regr.glmnet"), atomic="x", subsets=subsets)

re21 = st$ml(y ~ x, train_args=list("regr.glmnet"), atomic="x", hpc_args=list())
re22 = st$ml(y ~ x, train_args=list("regr.glmnet"), atomic="x", subsets=subsets, hpc_args=list())

x = list(A=x, B=x)
re31 = st$ml(y ~ x, train_args=list("regr.glmnet"))
re32 = st$ml(y ~ x, train_args=list("regr.glmnet"), subsets=subsets, hpc_args=list(n.chunks=3))

Alternatives for IO

At the moment the io module exclusively uses R base IO plus xlsx.

There are two or three broad alternatives:

  • data.table::fread. This is a thoroughly terrible idea and I strongly oppose it, because their code base and their API are a mess, and it’s been in development for years without a stable version. Performance is phenomenal, but we need correctness first, performance last.
  • readr. Also still in development. Uses sane defaults (somewhat similar to what we’re already doing with stringsAsFactors), produces dplyr-compatible tbl_df without row names. Not as fast as fread but apparently still an order of magnitude faster than base R.
  • rio. I honestly don’t know exactly what this is; it seems to be higher level, i.e. read data from different format using the same interface (same as io, but way more formats supported). Uses (or rather, will use in the future) readr under the hood.

`data` module

We don't have a module to facilitate accessing public data sets, even though it's quite common to work with those. For some, a dedicated R package may exist, but not for others.

For example, some that I'm working with are the GDSC, LINCS, and TCGA data.

There should be a module for providing access functions to those, with a link to where to get the actual data. Obviously, this repository shouldn't contain the data itself.

call default args

In io/text: the .set_defaults function should be moved to be generally accessible, as matching default args in a call is a task that will occur more often (I'm working on something now, for instance).

.set_defaults = function (call, .formals = formals(sys.function(sys.parent()))) {
    for (n in names(.formals))
        if (n != '...' && ! (n %in% names(call)))
            call[[n]] = .formals[[n]]

    call
}

The method also fails for functions with NULL default arguments, like the one below.

x = function(y=NULL) {
    args = .set_defaults(match.call())
    as.list(args)
} 
# > x()
# Error in call[[n]] = .formals[[n]] : subscript out of bounds
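A possible fix (a sketch, not tested against the module): build the call from a plain list, where single-bracket assignment preserves NULL defaults instead of erroring.

```r
# Variant of .set_defaults that survives NULL defaults: convert the
# call to a list, fill in missing defaults via [<- (which keeps NULL
# elements), and convert back to a call.
.set_defaults <- function (call, .formals = formals(sys.function(sys.parent()))) {
    args <- as.list(call)
    for (n in names(.formals))
        if (n != '...' && ! (n %in% names(args)))
            args[n] <- list(.formals[[n]])  # [<- preserves NULL values
    as.call(args)
}

x <- function (y = NULL) {
    args <- .set_defaults(match.call())
    as.list(args)
}
x()  # now succeeds, with y = NULL filled in
```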

Lambda shortcut symbol

The goal is to define a shortcut for defining functions that would encourage the use of anonymous functions, since the default syntax in R is rather verbose (function (args) body).

The initial choice fell to .(args -> body), which works fairly well. However, . is a private, non-exported name in the modules convention, and changing this would require some unintuitive modifications, which I want to avoid.

Here are some alternative naming suggestions, keeping in mind that valid R identifiers consist of letters, digits, the dot and the underscore, and that they cannot start with an underscore or digit, nor with a dot followed by a digit, and that furthermore the definition of “letter” is locale-dependent.

function (x, y) x + y
fun(x, y = x + y)
fun(x, y -> x + y)
f(x, y -> x + y)
λ(x, y -> x + y)
ƒ(x, y -> x + y)
F(x, y -> x + y)

Of these, the Unicode variants λ and ƒ appeal the most to me but are impractical for obvious reasons. I dislike F because it uses a capital letter, and because it needlessly redefines a predefined variable (aliasing FALSE). I dislike fun because it saves very little compared to the full-blown function …. And finally, I dislike f because it’s a single, ordinary letter.

Man, I really liked .

Should we add high-level descriptions as READMEs to individual modules?

Having access to the ? help and function-level documentation is great for already using a module, but how do you quickly find out if a certain module is useful for a certain task?

Option 1: Reading the code and technical documentation

Option 2: An (optional) top-level README.md that makes browsing modules for functionality easier and showcases their abilities.

Should we add this to each structured (i.e. in directory) module in ebits?

Add `previous_definition`

When overriding existing functions from base packages, use the following helper function to dispatch to existing functions without hard-coding their exact namespace:

previous_definition = function (name) {
    # Skip the parent environment of the calling frame.
    get(name, envir = parent.env(parent.env(parent.frame())))
}

This allows multiple redefinitions layered on top of each other (unlike e.g. devtools::help, which does this “wrong”). For an example of usage, see this Stack Overflow answer.
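To make the lookup concrete, here is a self-contained sketch (repeating the helper, with a hypothetical override defined at the top level) that wraps base::message without naming base:: in the override:

```r
previous_definition <- function (name) {
    # Skip the parent environment of the calling frame, so the lookup
    # starts one layer above the override's own definition environment.
    get(name, envir = parent.env(parent.env(parent.frame())))
}

# Hypothetical layered override: prepend a tag, then delegate to the
# previous (here: base) definition.
message <- function (...) {
    prev <- previous_definition("message")
    prev("[wrapped] ", ...)
}
message("hello")  # emits "[wrapped] hello" via the base definition
```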

behaviour of array$stack, bind

Stacking/binding vectors doesn't work as expected:

bind(vectorList, along=1) should be the same as do.call(rbind, vectorList)
bind(vectorList, along=2) should be the same as do.call(cbind, vectorList)

and analogously for stack.

For now, it doesn't.
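The expected semantics, written out with base R (assuming a list of equally long named vectors):

```r
vectorList <- list(a = 1:3, b = 4:6)

r1 <- do.call(rbind, vectorList)  # what bind(vectorList, along=1) should give
r2 <- do.call(cbind, vectorList)  # what bind(vectorList, along=2) should give
dim(r1)  # 2 3: one row per vector, named "a" and "b"
dim(r2)  # 3 2: one column per vector
```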

array/intersect does not work with data argument

> ar = import('array')
> ll = list(a=setNames(1:5, letters[1:5]), b=setNames(2:4, letters[2:4]))
> ll
$a
a b c d e 
1 2 3 4 5 

$b
b c d 
2 3 4 

> ar$intersect(a,b,data=ll)
Error in setNames(list(...), unlist(match.call(expand.dots = FALSE)$...)) :
  object 'a' not found

ll should be modified in place, with values as below:

> ar$intersect_list(ll)
$a
b c d 
2 3 4 

$b
b c d 
2 3 4

Rewrite read_table as s3 generic+implementations

@klmr

I would actually replace read.table completely by a generic read_table (with read_table.csv etc.), and build that on top of fread (orders of magnitude faster than read.table).

As for reading Excel tables, that’s arguably a different method, read_tables (note plural).

READMEs using .Rmd/.ipynb?

The couple of READMEs we have got now highlight some functionality of the modules they are in.

Does it make sense to have .Rmd files instead that generate those READMEs to showcase functionality?

Advantages

  • functionality can be checked by re-generating the .mds
  • we'd have some high-level docs for "type x and y comes out"
  • could also incorporate plots

Disadvantages

  • we version-control an automatically generated file (this is minor) and possibly images

%or% for length(lhs) > 1

In base:%or%:

else if (length(a) > 1)
    mapply(cmp, a, b)

Is that what we intend to do?

I know I committed this, but shouldn't it rather be:

else if (length(a) > 1)
    a

array$map reorders subsets

array$map reorders subsets according to the order of levels(subset), which is not what we want - it should keep the same order in the result as in the input array.

ar$stack fails without names along the stacked dimension

> A = matrix(1:4, nrow=2, ncol=2, dimnames=list(c('a','b'),c('x','y')))
> B = matrix(5:6, nrow=2, ncol=1, dimnames=list(c('b','a'),'z'))
> C = ar$stack(list(A, B), along=2)
  x y z
a 1 3 6
b 2 4 5

> colnames(A) = NULL
> colnames(B) = NULL
> ar$stack(list(A, B), along=2)
  [,1] [,2]
a    6    3
b    5    4

Apart from the colnames of the resulting matrix, results should be identical.

array$construct maps length by default

construct(DF, value ~ axis1 + axis2)

This will assign 1 as the value, because the default aggregation function is length.

A better default: the function should map single values through unchanged and throw an error if aggregation is needed.
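A sketch of such a default (hypothetical helper name): pass single values through, and fail loudly where aggregation would actually be needed.

```r
# Identity on single values; error instead of silently aggregating.
map_single <- function (x) {
    if (length(x) != 1)
        stop("cell has ", length(x), " values; supply an aggregation function")
    x
}

map_single(42)    # 42
# map_single(1:2) # would raise an error instead of silently using length()
```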
