tdsmith / arrgh

A newcomer's (angry) guide to data types in R

License: Other

Ruby 0.52% HTML 15.14% CSS 84.33%

arrgh's Issues

String manipulation

I don't understand how it was a good idea to ship this language without any ability to concatenate strings other than paste and its counterparts like paste0. It means that normal code in another language like

return "apple says, " + logMessage;

now has to be wrapped inside a function call. Inside the return call, which inexplicably also needs parentheses, giving return(paste("apple says,", log_message)). Note the omitted trailing space: paste isn't just a concatenation function, it's also a joining one, so it silently inserts a separator space between its arguments.
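A minimal sketch of the difference, assuming a made-up log_message value: paste joins its arguments with sep = " " by default, while paste0 joins them with nothing in between.

```r
log_message <- "hello"  # hypothetical value, for illustration only

# paste() inserts its default separator (a single space) between arguments
with_paste <- paste("apple says,", log_message)

# paste0() inserts nothing, so the space has to be in the string itself
with_paste0 <- paste0("apple says, ", log_message)

# an explicit empty separator makes paste() behave like paste0()
with_sep <- paste("apple says, ", log_message, sep = "")

print(with_paste)
```

All three variables end up holding the same string.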

But let's say you want to manipulate a string. Strings can have all sorts of things in them. Let's say ours has backslashes that I want to turn into forward slashes. In Java and Python, respectively:

"C:\\blah\\blah\\blah".replace("\\", "/");
r'C:\blah\blah\blah'.replace('\\', '/')

And even if no such method existed, a sensible argument order would be something like string_replace(string, pattern, replacement). But no, in R, we don't use normal orders. You instead use gsub(pattern, replacement, string). Why is the order messed up? Who knows.
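For comparison, here is roughly what the same replacement looks like in R. This is a sketch; the variable names are mine. Passing fixed = TRUE makes gsub treat the pattern as a literal string instead of a regular expression, which avoids a second layer of backslash escaping.

```r
windows_path <- "C:\\blah\\blah\\blah"

# literal match: the pattern is one backslash, written "\\" in R source
literal_version <- gsub("\\", "/", windows_path, fixed = TRUE)

# regex match: the backslash must be escaped for the regex engine as well
regex_version <- gsub("\\\\", "/", windows_path)

print(literal_version)  # [1] "C:/blah/blah/blah"
```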

Next, you want to find out whether your string has your extra special characters in it? You want to match only files which end in, say, csv? Java can just call the String#contains method. Python can just ask whether pattern in string. R? grepl(pattern, chars).
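A short sketch of the R side, using made-up file names. grepl does the "contains" check, a regex anchored with $ does the "ends in" check, and endsWith (base R 3.3.0 and later) does the latter without a regex.

```r
files <- c("data.csv", "notes.txt", "csv_readme.md")  # hypothetical names

# "contains": literal substring match anywhere in the string
contains_csv <- grepl("csv", files, fixed = TRUE)   # TRUE FALSE TRUE

# "ends with": regex anchored at the end of the string
ends_regex <- grepl("\\.csv$", files)               # TRUE FALSE FALSE

# "ends with", no regex involved
ends_plain <- endsWith(files, ".csv")               # TRUE FALSE FALSE
```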

Why are these names so unhelpful? Probably because they're all derived from grep. But fortunately, with R, these useful tools stay hidden from you unless you look them up or already have perfect knowledge of the library. PRogRess.

Dot names

Dots in identifier names are just part of the identifier. They are not scope operators. They are not operators at all. They are just a legal character to use in the names of things. They are often used where a normal human being would use underscores, since underscores were assignment operators in S, which I promise you don’t even want to think about.

I'm not sure if you're still updating, or what your understanding is at this point, but this bit is slightly off. Dots are used in function names for method overloading (by class), so print(x) dispatches differently depending on the class of x.

For example, if df is a data frame, print(df) actually calls print.data.frame(df), but you may want to call print.default() explicitly.
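To make the dispatch visible, here is a small sketch with a hypothetical generic called describe (not part of base R). The dot in describe.data.frame is what ties the method to the class, while the dot in my.variable is just part of the name.

```r
# a hypothetical S3 generic, defined only for this illustration
describe <- function(x, ...) UseMethod("describe")
describe.default <- function(x, ...) "something else"
describe.data.frame <- function(x, ...) "a data frame"

my.variable <- data.frame(n = 1:3)  # the dot here means nothing special

via_dispatch <- describe(my.variable)          # picks describe.data.frame
via_fallback <- describe(1:3)                  # no integer method, so default
called_direct <- describe.default(my.variable) # bypasses dispatch entirely
```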

Mention books by Michael J. Crawley

My experience is only with Statistical Computing: An Introduction to Data Analysis Using S-Plus, but it was the best introduction to statistical computing, period (despite the name, it worked with R quite well). However, that experience is a bit dated, and I can see now that he has written more books on the theme, some of them specifically about R. And yes, the price of the books is quite unnerving. Hopefully, your university has them in the library.

no ambiguity in assignments

Some parsing ambiguities are resolved by considering whitespace around operators. See and despair: x<-y (assignment) is parsed differently than x < -y (comparison)!

but that’s not ambiguous. <- is simply a symbol, so if it occurs together, it’s always the assignment operator. (afaik)
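One way to check how the parser reads each form is to quote the expression and look at the function it calls. This is only a sketch of the inspection technique:

```r
comparison <- quote(x < -y)  # whitespace splits the characters into < and unary -
assignment <- quote(x<-y)    # adjacent characters are read as the single token <-

comparison_op <- as.character(comparison[[1]])  # "<"
assignment_op <- as.character(assignment[[1]])  # "<-"
```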

discuss dimension dropping

m <- matrix(1:9, nrow=3)

class( m[1:2, ] ) ## matrix!
class( m[1, ] ) ## integer!

Useful when you're expecting it, a cause of difficult-to-trace bugs when you're not. The terribly wonderful syntax

m[1, , drop=FALSE]

allows you to avoid dimension dropping explicitly.
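A quick way to see what is being dropped is to look at dim() rather than class(). A sketch using the same matrix:

```r
m <- matrix(1:9, nrow = 3)

dropped <- m[1, ]               # plain integer vector, dim attribute is gone
kept <- m[1, , drop = FALSE]    # still a matrix, with one row

dropped_has_dim <- !is.null(dim(dropped))  # FALSE
kept_dim <- dim(kept)                      # 1 3
```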

No way to append a list in place? Wtf

In Java, I have a list (let's say the implementation is ArrayList). It's a happy list. And you can add things to that list, because lists are not arrays and you might some time want to add something to it.

List<Integer> ints = new ArrayList<>();
ints.add(1); // works

In Python, I too can have a list. And it's a happy list. You too can add things to this list, because you might want to do that some time. So in our world, where we have a number of data files, the number of which might not be pre-established (a reasonable and common use case), we can do this:

dfs = []
for file in files:
    df = df_supplier(file)  # eg -- pandas.read_csv(file)
    dfs.append(df)

You want to do that in R? You're out of luck. The easiest way I can find to do this is to get yourself the package rlist and then do this:

library(rlist)
dfs = list()
for (i in seq_along(files)) {
    file = files[[i]]
    # read_XYZ stands in for whatever reader you use, e.g. read.csv
    dfs = list.append(dfs, read_XYZ(file, arguments...))
}

This way, you can have the privilege of wasting your computer time and hard-earnt electricity copying data you already have to a location in which it already is. This is pRogRess! But if you don't want to install a whole library just to do this, do not be afraid!

Just write yourself a function,

append = function(li, obj) {
    name <- deparse(substitute(li))
    li[[length(li) + 1]] = obj
    assign(name, li, envir=parent.frame())
    return()
}

Now, with the power of R, you can pass your list, have the whole thing deparsed, add a single thing to the end of the list, and then manually assign that whole thing back to the parent environment's entry of the list. Fortunately, you still have the glorious featuRe of wasting your computer time and hard-earnt electricity copying data you already have to a location in which it already is! Efficiency! (The word efficiency has no Rs in it, that's how you know it's efficient.)
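For what it's worth, base R can also grow a list by assigning one position past its end, and lapply builds the whole list in one call when the inputs are known up front. The sketch below uses made-up numeric inputs in place of files so that it runs on its own. Whether R copies the list internally is not something this example demonstrates.

```r
inputs <- c(10, 20, 30)  # stand-ins for file names, purely illustrative

# grow the list one element at a time, no extra package needed
grown <- list()
for (value in inputs) {
  grown[[length(grown) + 1]] <- value * 2
}

# build the same list in one go
mapped <- lapply(inputs, function(value) value * 2)

same_result <- identical(grown, mapped)  # TRUE
```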

Fix description of variables having Javascript-like scope

Like PHP and Javascript, variables have function (not block) scope.

While it's true that variable scope is not per-block, it's not the same as in JavaScript, where local variables are hoisted to top of the function:

Example from Mozilla.org:

(function() {
  console.log(myvar); // undefined (but not an error)
  var myvar = "local value";
})();

Try the same in R and you'll get "Error in print(myvar) : object 'myvar' not found":

(function() {
  print(myvar) # error
  myvar <- "local value"
})()

But yes, there is no block scope:

(function() {
  if (TRUE) {
    myvar <- "local variable"
  }
  print(myvar)
})()

will print "local variable". For block-scope, you can use local:

(function() {
  if (TRUE) {
    local({
      myvar <- "local variable"
    })
  }
  print(myvar)
})()

will throw: "Error in print(myvar) : object 'myvar' not found"

length(foo)

Do not use length(foo), which will for reasons unexplained tell you how many columns you have;

Because dataframes, as you note above, are stored as lists of column vectors. The length of that list is the number of columns.
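A small sketch with a made-up data frame shows the list-of-columns view directly:

```r
df <- data.frame(a = 1:4, b = letters[1:4])  # 4 rows, 2 columns

n_as_list <- length(df)  # 2: the number of columns
n_cols <- ncol(df)       # 2
n_rows <- nrow(df)       # 4
stored_as_list <- is.list(df)  # TRUE
```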

Sneaky data frame coercion with binary operators

See here for an example where ^ coerces a data frame to a matrix but ** does not. The post (and answer) offers some reasoning as to why as well as a link to some more edge cases.

If you think this merits inclusion I'll make a PR.

Update data frame dereferencing to include partial matching?

 * If you squint, `$` acts kind of like the `.` scope operator in C-like languages, at least for data frames. If you'd write `struct.instance_variable` in C, you'd maybe write `frame$column.variable` in R. But be careful: `$` tries to do partial matching of names if you supply a name it cannot find, but returns `NULL` if multiple elements match. E.g.: `x <- data.frame(var1="a", var123="b"); x$var12` will return the `var123` column. Somewhat thankfully, `x$var1` will still return the `var1` column.

from cdrv
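The partial matching described above can be sketched as follows. stringsAsFactors = FALSE is set explicitly so the columns are plain character vectors regardless of R version, and `[[` is shown as the exact-matching alternative.

```r
x <- data.frame(var1 = "a", var123 = "b", stringsAsFactors = FALSE)

partial_hit <- x$var12      # unique partial match: the var123 column, "b"
exact_hit <- x$var1         # exact match wins over the partial one: "a"
ambiguous <- x$va           # matches both columns, so NULL
strict <- x[["var12"]]      # [[ matches exactly by default, so NULL
```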

apply coerces to matrix, inane design decision

Let's be honest. Apply is just broken for data frames. Defending it by saying that the user just doesn't understand the language, that the language is just fine, and the function is functioning correctly is like saying that your toolbox of misshapen tools where the hammer is just the curved end on both sides is 'just fine'.

The 'correct' way to do this in R apparently is just to write out a for loop. Fortunately for you, you can't just make a for loop iterate over rows, like for row in df.iterrows() in Pandas, you have to explicitly index them.

And fortunately for you, you can't just make a range like 1:nrow(df) (also, who made the stupid choice to call it nrow when nrows makes more sense, there being more than one row...) because if nrow(df) == 0 then it returns the sequence (1, 0), which breaks your code when you try to run it. R is just built for robustness!
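The usual workaround is seq_len (or seq_along), which returns an empty sequence for zero. A sketch with a made-up empty data frame:

```r
empty <- data.frame(x = numeric(0))  # zero rows

colon_range <- 1:nrow(empty)         # counts down: 1 0
safe_range <- seq_len(nrow(empty))   # integer(0)

iterations <- 0
for (i in safe_range) iterations <- iterations + 1  # body never runs
```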

But if you're doing lots of manipulation with lists, so you're familiar with sapply, you can probably fix that issue by using apply with the proper functions, right? Wrong.

a = c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)
b = c('a', 'b', 'c', 'de', 'f', 'g')
c = c(1, 2, 3, 4, 5, 6)
d = c(0, 0, 0, 0, 0, 1)

wtf = data.frame(a, b, c, d)
wtf$huh = apply(wtf, 1, function(row) {
    if (row['a'] == T) { return('we win') }
    if (row['c'] < 5) { return('hooray') }
    if (row['d'] == 1) { return('a thing') }
    return('huh?')
})

You get this. Because R inexplicably decides that the best way to deal with data frames is to turn them all into data matrices first. So, here, the a column turns into ' TRUE' and 'FALSE'. Silently. Fantastic behaviour.

> wtf
      a  b c d     huh
1  TRUE  a 1 0  hooray
2 FALSE  b 2 0  hooray
3  TRUE  c 3 0  hooray
4 FALSE de 4 0  hooray
5  TRUE  f 5 0    huh?
6  TRUE  g 6 1 a thing
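One way to get the intended row-wise logic in R without the matrix coercion is to index rows explicitly, so that each row stays a one-row data frame with its column types intact. This is a sketch on a smaller made-up data frame, not the only way to do it:

```r
small <- data.frame(
  a = c(TRUE, FALSE, FALSE, FALSE),
  b = c("a", "b", "c", "de"),
  c = c(1, 2, 5, 6),
  d = c(0, 0, 1, 0),
  stringsAsFactors = FALSE
)

# apply() would first run as.matrix(), which turns everything into character
coerced_to_character <- is.character(as.matrix(small))  # TRUE

# explicit row indexing keeps a logical, c numeric, and so on
small$huh <- vapply(seq_len(nrow(small)), function(i) {
  row <- small[i, ]
  if (row$a) return("we win")
  if (row$c < 5) return("hooray")
  if (row$d == 1) return("a thing")
  "huh?"
}, character(1))
```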

But in a reasonable and sensibly constructed system like Pandas, you can run the exact same thing, like this:

import pandas as pd
df = pd.DataFrame({
    'a': [True, False, True, False, True, True],
    'b': ['a', 'b', 'c', 'de', 'f', 'g'],
    'c': [1, 2, 3, 4, 5, 6],
    'd': [0, 0, 0, 0, 0, 1]
})
def funct(row):
    if row['a']: return 'we win'
    if row['c'] < 5: return 'horray'
    if row['d'] == 1: return 'a thing'  # compare by value; `is` tests identity
    return 'huh?'

df['huh'] = df.apply(funct, axis=1)
print(df)

And get reasonable answers like these that follow. Look what is possible when you don't make stupid design decisions!

       a   b  c  d     huh
0   True   a  1  0  we win
1  False   b  2  0  horray
2   True   c  3  0  we win
3  False  de  4  0  horray
4   True   f  5  0  we win
5   True   g  6  1  we win

Arrays

Pat Burns writes:

An array is something with a 'dim' attribute.
The length of 'dim' determines the dimensionality.
A matrix is a two-dimensional array. There are
one-dimensional arrays (but they are not common).
There are also higher-dimensional arrays.
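The quote can be checked directly by setting and inspecting the dim attribute. A minimal sketch:

```r
v <- 1:6
dim(v) <- c(2, 3)            # giving a vector a dim attribute makes it an array
became_matrix <- is.matrix(v)  # TRUE: two dimensions

cube <- array(1:24, dim = c(2, 3, 4))
cube_dims <- length(dim(cube))   # 3

one_d <- array(1:3, dim = 3)     # a one-dimensional array
one_d_is_array <- is.array(one_d)    # TRUE
one_d_is_matrix <- is.matrix(one_d)  # FALSE
```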

*apply functions

There are a variety of apply-type functions available in R. Here's what I've figured out so far:

lapply and sapply both loop over a list; for each element in the list, they call the function with the element as a parameter.

R: sapply(list(c(1, 3, 5), c(2, 4, 2)), sum)
Python: map(sum, [[1, 3, 5], [2, 4, 2]])

lapply will always return a list, while sapply attempts to simplify the result to a more concise object (since lists are not as concise as I'm used to in other languages).
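A sketch of the difference in return types, with vapply added as the variant where you declare the expected result type yourself:

```r
inputs <- list(c(1, 3, 5), c(2, 4, 2))

as_list <- lapply(inputs, sum)                # list(9, 8)
simplified <- sapply(inputs, sum)             # numeric vector: 9 8
declared <- vapply(inputs, sum, numeric(1))   # numeric vector, type is checked
```

vapply stops with an error if the function returns something other than a single number, which sapply would silently accept.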

mapply is sapply with multiple arguments passed to the function.

R: mapply(sum, list(c(1, 3, 5), c(2, 4, 2)), list(10, 100))
Python: map(sum, [[1, 3, 5], [2, 4, 2]], [10, 100])

This means that mapply with only the function and one argument can be used as a replacement for sapply:

> sapply(list(c(1, 3, 5), c(2, 4, 2)), sum)
[1] 9 8
> mapply(sum, list(c(1, 3, 5), c(2, 4, 2)))
[1] 9 8

Note the order of the arguments is changed. As far as I can tell, this is intended to make the functions maximally confusing.

Next, apply applies a function to what R refers to as an array, which looks much more like a matrix to me. It can either apply the function to entire rows or entire columns; for some obtuse reason, applying to rows requires MARGIN to be set to 1, while applying to columns requires it to be 2.

Given

library(datasets)
data(mtcars)

R: apply(mtcars, 2, mean)['mpg']
SQL: select avg(mpg) from mtcars;

Finally, tapply groups one thing by another, then applies a function to the groups.

Again using mtcars,

R: tapply(mtcars[['mpg']], mtcars[['cyl']], mean)
SQL: select avg(mpg) from mtcars group by cyl;

Truth value of vectors should be explained

issue raised in #8:

Furthermore, you forget dealing with vectors in the check:

> if (1:3) print('yes') else print('no')
[1] "yes"
Warning message:
In if (1:3) print("yes") else print("no") :
  the condition has length > 1 and only the first element will be used
> if (identical(TRUE, as.logical(1:3))) print('yes') else print('no')
[1] "no"

So invoking as.logical is not going to help you to test for the truth value of an "arbitrary not-necessarily-already-boolean value", only if it's already a scalar.
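If the value may be a vector, the condition has to be reduced to a single logical first, typically with any() or all(). As far as I know, R 4.2.0 and later turn the warning above into an error, which makes the reduction mandatory. A sketch:

```r
values <- 1:3
mixed <- c(1, 0, 2)

any_nonzero <- any(as.logical(values))    # TRUE
all_nonzero <- all(as.logical(mixed))     # FALSE, because of the 0

# a length-one condition is safe to hand to if()
answer <- if (all_nonzero) "yes" else "no"
```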

<<- assigns to closest parent environment symbol

E.g.

x <- 1

f <- function() {
  x <- 3
  g <- function() {
    x <<- 2
    return(NULL)
  }
  g()
  return(x)
}

print( c(f(), x) ) ## [1] 2 1
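The same lookup rule is what makes closures with private state possible. In this sketch (the names are mine), `<<-` finds count in the enclosing function's environment and updates it there:

```r
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1  # updates count in make_counter's environment
    count
  }
}

next_id <- make_counter()
first <- next_id()    # 1
second <- next_id()   # 2

other <- make_counter()     # a separate environment with its own count
other_first <- other()      # 1
```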

Why assigning to names works

You write that: "You can see a list of columns with names(frame). You rename columns by, spookily, assigning into names(frame). Do you know how and why this works? Please educate me."

It works because 'names' is an attribute of the data.frame that is being accessed by, e.g. names(iris). This yields the same value as attr(iris,"names")... and you can use either to retrieve or assign names to the columns in a data.frame.
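The other half of the answer is that names(x) <- value is syntax for calling a replacement function named `names<-` and assigning its result back to x. A sketch showing the forms side by side, on a made-up data frame:

```r
df <- data.frame(a = 1:2, b = 3:4)
names(df) <- c("x", "y")              # the usual spelling

explicit <- data.frame(a = 1:2, b = 3:4)
explicit <- `names<-`(explicit, c("x", "y"))  # what the line above amounts to

via_attr <- data.frame(a = 1:2, b = 3:4)
attr(via_attr, "names") <- c("x", "y")        # setting the attribute directly

stored_names <- attr(df, "names")     # "x" "y"
```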

describe namespaces

from cdrv:

Check out what's in your search path with search(). Whenever you call a variable reference, R searches through these 'environments', in order from first to last.
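A small sketch of inspecting the search path. The global environment comes first and the base package last, and pkg::name skips the search by naming the package explicitly:

```r
path <- search()

first_entry <- path[1]             # ".GlobalEnv"
last_entry <- path[length(path)]   # "package:base"

# the :: operator looks in one package only, whatever the search path holds
middle <- stats::median(c(1, 5, 9))  # 5
```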

Subsetting can convert matrix to vector

I would add that indexing a matrix or array automatically drops dimensions of size 1. This can occasionally be convenient, as with a multidimensional array with a lot of useless dimensions, but in my experience it causes problems more often than not. If you subset only one row/column of a matrix, the result is just a vector, and functions that require a matrix (e.g. rowMeans) will obviously fail:

> a <- matrix(seq(12), nrow=3, ncol=4)
> rowMeans(a[1:2,])
[1] 5.5 6.5
> rowMeans(a[1,])
Error in rowMeans(a[1, ]) :
  'x' must be an array of at least two dimensions

Even if you explicitly create a 1-row matrix, subsetting will convert to vector:

> a <- matrix(seq(5), nrow=1)
> a
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
> a[,]
[1] 1 2 3 4 5

This is intended behavior, but I find that I often want to explicitly keep all dimensions. For example, if you write a function that usually processes matrices with 2+ rows, but then one day pass it a one-row matrix, the first subsetting operation will convert it to a vector and will cause subsequent matrix operations to fail. Therefore you either need to explicitly handle operations on the one-row/one-column matrix separately, or you can make a habit of adding a "drop" parameter to the index, as I have done:

> a <- matrix(seq(12), nrow=3, ncol=4)
> rowMeans(a[1,,drop=FALSE])
[1] 5.5

describe magical nature of functions?

 * Deep down, everything in R is a function, though. R is pretty scheme-y and LISP-y under the hood. One can write `"+"(2, 3)` to call the `+` operator on numbers 2 and 3, for example! Even crazier, `"for"( i, 1:10, print(i^2) )` shows that the keywords are functions themselves.

from cdrv
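A few quick checks of that claim, using backtick quoting to refer to operators and keywords as ordinary function objects. This is a sketch:

```r
sum_via_call <- `+`(2, 3)               # same as 2 + 3
negated <- sapply(1:3, `-`)             # unary minus used as a plain function
chosen <- `if`(TRUE, "yes", "no")       # if is callable like any other function
for_is_function <- is.function(`for`)   # TRUE
```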

indexing gotcha

This burned me a few months back.

f = function(x){
x
}

f() # Throws error: x is missing

g = function(x){
letters[x]
}

g(1) # Returns the first element of letters
g() # Doesn't throw error, returns ALL elements of letters

Hadley Wickham pointed out that g() is helpfully doing letters[], which is different from letters[NULL].

ಠ_ಠ
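If you want the missing argument to be an error, one option is to check for it with missing() before indexing. In this sketch, g_strict is a hypothetical variant of g:

```r
g_strict <- function(x) {
  if (missing(x)) stop("argument \"x\" is missing, with no default")
  letters[x]
}

empty_index <- length(letters[])      # 26: an empty index selects everything
null_index <- length(letters[NULL])   # 0: NULL selects nothing

first_letter <- g_strict(1)           # "a"
outcome <- tryCatch(g_strict(), error = function(e) "error raised")
```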

Testing for TRUE values

If you want to test a value and you're not sure it's a single logical TRUE or FALSE (such as in an if condition), use isTRUE(). It does the right thing even with NA, NULL, and vectors:

> isTRUE(NULL)
[1] FALSE
> isTRUE(NA)
[1] FALSE
> isTRUE(c(TRUE, FALSE, NA))
[1] FALSE
> isTRUE(c(TRUE, TRUE))
[1] FALSE

No need for identical(TRUE, as.logical(x)).