fchuffar / intro_r Goto Github PK
View Code? Open in Web Editor NEWAn introduction to R
License: GNU General Public License v3.0
An introduction to R
License: GNU General Public License v3.0
--- title: "R introduction" author: "Florent Chuffart" date: "`r Sys.Date()`" output: rmarkdown::html_document: toc: true toc_float: true toc_depth: 3 number_sections: true --- This R session is fully available on https://github.com/fchuffar/intro_r # Purpose This first R working session will be an introduction to your favourite statistical programming language: R. After a brief introduction to basics (user interface paradigms, free software and GNU), we will go deeply into the details of data structures, data indexing and data visualisation. The main purpose of this session is to offer a not only simple but also powerful alternative to your classical spreadsheet software. This introduction gives you the keys to perform large scale statistical data analysis. This introduction follows three main goals: * deploy a fonctionnal working environment for statistical analysis on personnal laptop of participants * design a first script allowing participants to import into R environment data from classical spreadsheet * define common ground of a discussion between life scientists and data analysts # Prerequisistes Chaque participant apportera son ordinateur portable sur lequel aura été installé les logiciels R et Rstudio : * Installer R (macosx https://cran.r-project.org/bin/macosx/ ) (windows https://cran.r-project.org/bin/windows/base/ ) * Installer RStudio Desktop Open Source License https://www.rstudio.com/products/rstudio/download/ Le contenu de la session est disponible ici : https://github.com/fchuffar/intro_R Il doit être téléchargé en cliquant sur le bouton vert en hautr à droite "Clone or Download", puis "Download ZIP". La première séance se déroulera intégralement dans cette espace de travail. # R Software ## GNU Public Licence R is a free software environment for statistical computing and graphics. More details about GNU, GPL and free software here: - https://www.gnu.org/philosophy/free-sw.html - https://en.wikipedia.org/wiki/GNU_General_Public_License - https://en.wikipedia.org/wiki/Free_software_movement - https://en.wikipedia.org/wiki/CeCILL ## R and biology R is widely used in the life science community: - https://www.ncbi.nlm.nih.gov/pubmed/?term=r+software ## R Community Many packages are freely available on: - CRAN https://cran.r-project.org - Bioconductor https://www.bioconductor.org - GitHub https://github.com R users are organised as a community with many help topics available on the internet. Many tutorial are available online: - http://revue.sesamath.net/IMG/pdf/RCarte_Commandes-R.pdf - https://cran.r-project.org/doc/contrib/Kauffmann_aide_memoire_R.pdf - https://cran.r-project.org/doc/contrib/Paradis-rdebuts_fr.pdf # R CLI vs. GUI ## CLI R could be compared to an electronic calculator. It provides not only basic operations but also advanced features allowing to manipulate distributions and visualize data. ```{r} 1+1 2+2 pi rnorm(100) plot(density(rnorm(100))) lines(density(rnorm(100, sd=2)), col=2) lines(density(rnorm(100, sd=0.3, mean=1)), col=3) ``` ## R GUI (RStudio) *** https://www.rstudio.com # Let's start with R! ## Working directory ```{r, eval=FALSE} getwd() # this command give the help on... # ?getwd() # ?rnorm # ?setwd() ``` ## Organisation of your workspace - data - results - src - doc ## From a spreadsheet to R data.frame *** ```{r} #data_sh = read.table("data/seahorse_dataset2.csv", sep=";", header=TRUE, dec=".") data_sh = read.table("data/sp_exp_grp.csv", sep=" ", header=TRUE, dec=".") data_sh["GSE21749_P_WT_1",] data_sh[,1] head(data_sh) tail(data_sh) dim(data_sh) summary(data_sh) d = read.table("data/data_nutri.csv", header=TRUE, sep=",", row.names = 1) head(d) tail(d) dim(d) summary(d) data_sh[1,] data_sh[,1] ``` # Data ## Affectation ```{r} foo <- 1 foo bar = "a" bar = 'a' bar # bar = a a = 4 a bar = a bar "b" -> baz baz quz <<- baz ``` ## Common types - numeric - character - logical (boolean) - factor ```{r} 1 is.numeric(1) a=1 is.numeric(a) a="nom" a is.numeric(a) 0.5 is.numeric(0.5) is.numeric("a") foo = 1 is.numeric(foo) is.integer(1) a = as.integer(1) a is.integer(a) as.integer(0.3) as.integer(TRUE) as.integer(FALSE) as.numeric(TRUE) as.numeric(FALSE) "iab" 'toto' is.numeric("a") is.character("a") is.character(1) as.character(1) as.numeric("a") conc = c(0.1, 0.2, 0.3, "probleme de manip", 0.1) conc = c(0.1, 0.2, 0.3, "probleme de manip", 0.1) conc as.numeric(conc) conc2 = c(0.1, 0.2, 0.3, NA, 0.1) conc2 mean(conc2) mean(conc2, na.rm=TRUE) NaN log(0) 1/0 # logicals TRUE FALSE T F T = 3 T = FALSE T # TRUE = 3 # NA = 3 is.numeric(TRUE) as.numeric(TRUE) as.numeric(FALSE) as.logical(1) as.logical(0) as.logical(2) as.logical(-2) as.logical(-2.2) is.logical("true") is.character("true") other_var = as.logical("true") other_var as.logical("True") # factors col_y = c("bleu", "bleu", "vert", "marron", "vert", "marron", "blue", " blue") col_y f = as.factor(col_y) f col_y = c("bleu", "bleu", "vert", "marron", "vert", "marron", "bleu", "bleu") f = as.factor(col_y) f as.numeric(f) ``` ## Set of data | | Homogeneous | Heterogeneous | |----|-------------|---------------| | 1d | vector | list | | 2d | matrix | data.frame | a data.frame is a list of same size vectors (colums). ```{r} 1:5 c(1,2,10) c(1,"a",10) c(TRUE,0) v = rnorm(100) v v[c(5,18,39)] v[98:100] v[c(1,2,3)] v[1:3] h = v v v[c(1,2,3)] list(1,"a",10) l = list(chr="8", tx_start=124330091, tx_end=124410705, gs="ATAD2", ref="hg19", strand="-") l l$chr c(tx_start=124330091, tx_end=124410705) # matrix m = matrix(runif(120),nrow=20) dim(m) m m+1 m[20,6] m[19:20,5:6] m[c(17,20),] m[,c(1,2)] m[,c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)] data = data.frame(size=rnorm(25)+175, gender=sample(c("M", "F"), 25, replace=TRUE)) data[1:5,] head(data) data$gender data[1,2] = "m" data[1,2] data[1,2] = "M" data[data$gender=="M",] data[data$gender=="M" & data$size > 170,] factor(sample(c("10", "J", "Q", "K", "A"), 25, replace=TRUE)) factor(sample(c("10", "J", "Q", "K", "A"), 25, replace=TRUE), levels=c("10", "J", "Q", "K", "A"), ordered=TRUE) ``` ## Selecting lines and columns of a dataset *** ```{r} v = 100:130 v v[1] v[-1] rev(v)[1] v[1:4] # index by the rank v[c(1, 4, 10, 15)] # index by mask col_y col_y[c(TRUE,FALSE, TRUE, TRUE, FALSE, TRUE)] ## select a column # by name data_sh$id # by rank data_sh[,2] data_sh[,2:6] data_sh[,c(FALSE,TRUE,TRUE,TRUE,TRUE,TRUE,FALSE,FALSE, FALSE,FALSE,FALSE)] ``` # Visualization ```{r, eval=FALSE} hist(data$size) abline(v=mean(data$mean)) boxplot(size~gender, data) qnorm(data$size) qqline(data$size) table(data$gender) # scatter plot barplot(data$gender) # b, l, t, r par(mar=c(10,1,1,1)) ``` # Tests ## Dataset ```{r} # dataset nutri data_nutri = read.table("data/data_nutri.csv", sep=",", header=TRUE, row.names=1) head(data_nutri) tail(data_nutri) dim(data_nutri) # Correlation between taille and poids plot(data_nutri$taille, data_nutri$poids, col=data_nutri$sexe) # create index for categ. idx_h = data_nutri$sexe == "Homme" idx_f = data_nutri$sexe == "Femme" # plot density of each categ. plot(density(data_nutri$taille[idx_h]), col=2, xlim=range(data_nutri$taille)) lines(density(data_nutri$taille[idx_f])) boxplot(taille~sexe, data=data_nutri) library(beanplot) beanplot(taille~sexe, data=data_nutri) library(beeswarm) beeswarm(taille~sexe, data=data_nutri) ``` ## Parametric ```{r} # test de Shapiro-Wilks # H0: normality # H1: non normality # alpha: 5% norm_pop = rnorm(100) plot(density(norm_pop)) shapiro.test(norm_pop) # pval = 0.439 > alpha = 5% # on conserve H0 au risque beta à calculer... unif_pop = runif(100) plot(density(unif_pop)) test = shapiro.test(unif_pop) test$p.value # pval = 0.0001 < alpha = 5% # on rejette H0, on accepte H1 au risque de première espèce alpha = 5% taille_h = data_nutri$taille[idx_h] shapiro.test(taille_h) # pval = 9.7% > alpha = 5% # on conserve H0 au risque beta à calculer... beta contrôlé dans le cas de shapiro.test. taille_f = data_nutri$taille[idx_f] shapiro.test(taille_f) # pval = 11% > alpha = 5% # on conserve H0 au risque beta à calculer... beta contrôlé dans le cas de shapiro.test. # Homoscedasticity # H0: homoscedasticity # H1: non homoscedasticity # alpha: 5% pop1 = rnorm(100, 0, 1) pop2 = rnorm(100, 10, 1) plot(density(pop1), xlim=c(-5,15)) lines(density(pop2), col=2) # homogénéité des variances bartlett.test( c(pop1, pop2), c(rep("a", 100), rep("b", 100)) ) # pval > 5% donc je conserve H0 au rosque beta à calculer pop1 = rnorm(100, 0, 0.1) pop2 = rnorm(100, 10, 1) plot(density(pop1), xlim=c(-5,15)) lines(density(pop2), col=2) # non homogénéité des variances bartlett.test( c(pop1, pop2), c(rep("a", 100), rep("b", 100)) ) # pval < 5% donc je rejette H0 et j'accepte H1 au risque alpha de 5%. bartlett.test(data_nutri$taille, data_nutri$sexe) # pval = 16% donc je conserve H0 au seuil beta à calculer. # Student test, t test, # H0: t_h == t_f # H1: t_h != t_f # alpha: 5% t.test(taille_h, taille_f, conf.level = 0.95) # pval = 2.2e-16 << 5% on rejette H0 et on accepte H1 au seuil alpha de 5%. # H0: t_h == t_f # H1: t_h > t_f # alpha: 5% t.test(taille_h, taille_f, conf.level = 0.95, alternative="greater") t.test(taille_h[1:10], taille_f[1:10], conf.level = 0.95, alternative="greater") t.test(taille_h[1:10], taille_f[1:10], conf.level = 0.95, alternative="two.sided") # pval = 2.2e-16 << 5% on rejette H0 et on accepte H1 au seuil alpha de 5%. # H0: t_h == t_f # H1: t_h < t_f # alpha: 5% t.test(taille_h, taille_f, conf.level = 0.95, alternative="less") # pval = 1 >> 5% on conserve H0 au risque beta, à calculer... ``` # Non parametric ```{r} # H0: t_h == t_f # H1: t_h != t_f # alpha: 5% wilcox.test(taille_h[1:10], taille_f[1:10], alternative = "two.side") # Mann Withney # pval = 0.0001 < alpha = 5% donc je rejette H0 et j'accepte H1 au seuil alpha de 5% # H0: t_h == t_f # H1: t_h > t_f # alpha: 5% test = wilcox.test(taille_h[1:10], taille_f[1:10], alternative = "greater") # Mann Withney # pval = 0.00008 < alpha = 5% donc je rejette H0 et j'accepte H1 au seuil alpha de 5% if (test$p.value < 0.001) { label = paste(signif(test$p.value,2), "***") } else if (test$p.value < 0.01) { label = paste(signif(test$p.value,2), "**") } else if (test$p.value < 0.05) { label = paste(signif(test$p.value,2), "*") } else { label = signif(test$p.value,2) } plot(density(data_nutri$taille[idx_h]), col=2, xlim=range(data_nutri$taille)) lines(density(data_nutri$taille[idx_f])) text(165,0.04, labels=label) boxplot(data_nutri$taille~data_nutri$sex, las=2) text(1.5,180, labels=label) ``` ## Empirical p-value ```{r} stat_obs = wilcox.test(unique(taille_h)[1:5], unique(taille_f)[1:5], alternative = "two.side")$statistic # Mann Withney stat_rnds = sapply(1:10000, function(i) { rnd_pop = sample(c(unique(taille_h)[1:5], unique(taille_f)[1:5])) stat_rnd = wilcox.test(rnd_pop[1:5], rnd_pop[6:10], alternative = "two.side")$statistic # Mann Withney return(stat_rnd) }) pval_emp = (sum(stat_rnds > stat_obs) + 1)/length(stat_rnds) print(pval_emp) plot(density(stat_rnds)) abline(v=stat_obs) ``` # Statistical test ```{r, eval=FALSE} plot_null = function(n, mean=0, sd=1, thresh=TRUE, ...){ foo = sapply(1:10000, function(i) { bar = rnorm(n, mean, sd) m = mean(bar) s = sd(bar) c(mean=m, sd = s) }) foo = data.frame(t(foo)) t = foo$mean / (foo$sd / sqrt(10)) lines(density(foo$mean), ...) if (thresh) abline(v=quantile(foo$mean, c(0.025,0.975)), lty=2, ...) # lines(density(t), lty=2, ...) # lines(density(rt(10000, n-1)), col="grey", lty=3) } layout(matrix(1:2, 1), respect=TRUE) plot(0,0,col=0,xlim=c(-1,1), ylim=c(0,2)) plot_null(10) plot_null(10,mean=0.25, thresh=FALSE, col=4) # plot_null(20, col=2) plot(0,0,col=0,xlim=c(-1,1), ylim=c(0,2)) plot_null(30) plot_null(30,mean=0.25, thresh=FALSE, col=4) # plot_null(20, col=2) # H0: m == 0 # H1: m != 0 # alpha = P(reject H0 | H0 is true) # beta = P(accept H0 | H0 is false) ``` # To go further with R ## Scripts ```{r, eval=FALSE} source("intro_r.R") ``` ## Functions ### Common functions ```{r} x = c(1,2,3,4,5) y = c(2,4,6,8,10) length(x) plot(x,y) rnd = runif(100000) dices = round(rnd * 6) table(dices) barplot(table(dices)) dices = floor(rnd * 6) barplot(table(dices)) dices = ceiling(rnd * 6) barplot(table(dices)) pie(table(dices)) mat_5_dices = matrix(dices,5) dim(mat_5_dices) mat_5_dices[,1:10] sum_5_dices = apply(mat_5_dices, 2, sum) table(sum_5_dices) barplot(table(sum_5_dices)) prop.table(table(sum_5_dices)) prop.table(table(sum_5_dices)) sum(prop.table(table(sum_5_dices))) ``` ```{r, eval=FALSE} ?ls() ?paste() ?sum() ?cumsum() ?sqrt() ?rnorm() ?qnorm() ?pnorm() ?cbind() ?rbind() ?sort() ?order() ?grep() ?read.table() ?write.table() ``` ### Custom functions - signature (parameter) - return - scope and global variables ```{r} hello_world = function(name="world") { ret = paste("Hello ", name, "!", sep="") return(ret) } hello_world() ``` # Practical session ## SRSF2 and SOX2 signatures in squamous cell lung carcinomas (data set suggested by Beatrice Eymin) We dispose of 5 Affymetrix Human Genome U133A Array probe values for 130 samples These 5 probes correspond to 2 genes: - 200753_x_at (SRSF) - 200754_x_at (SRSF) - 214882_s_at (SRSF) - 213721_at (SOX2) - 213722_at (SOX2) We also dispose of clinical data for the 130 samples. i) We want to check the correlation between probes ii) We want to analyse the expressin of these genes according cancer stage ```{r} # loading data data = read.csv("data/GSE4573.csv", row.names=1) head(data) dim(data) # correlation between probes pairs((data[,1:5])) pairs2 = function (...) { panel.cor = function(x, y, digits = 2, prefix = "", cex.cor, ...) { usr = par("usr") on.exit(par(usr)) par(usr = c(0, 1, 0, 1)) r = abs(cor(x, y)) txt = format(c(r, 0.123456789), digits = digits)[1] txt = paste(prefix, txt, sep = "") if (missing(cex.cor)) cex = 0.8/strwidth(txt) test = cor.test(x, y) Signif = symnum(test$p.value, corr = FALSE, na = FALSE, cutpoints = c(0, 0.001, 0.01, 0.05, 0.1, 1), symbols = c("***", "**", "*", ".", " ")) text(0.5, 0.5, txt, cex = cex * r) text(0.8, 0.8, Signif, cex = cex, col = 2) } pairs(..., upper.panel = panel.cor) } pairs2((data[,1:5])) # distribution layout(matrix(1:6, 2), respect=TRUE) plot(density(data[,"X200753_x_at"])) plot(density(data[,"X200754_x_at"])) plot(density(data[,"X214882_s_at"])) plot(density(data[,"X213721_at" ])) plot(density(data[,"X213722_at" ])) # stages data$STAGE idx_grp1 = data$STAGE %in% c("Ia", "Ib", "IIa", "IIb") idx_grp2 = data$STAGE %in% c("IIIa", "IIIb") data$stage_grps = NA data[idx_grp1,]$stage_grps = "grp1" data[idx_grp2,]$stage_grps = "grp2" layout(matrix(1:6, 2), respect=TRUE) beanplot::beanplot(X200753_x_at~stage_grps, data) beanplot::beanplot(X200754_x_at~stage_grps, data) beanplot::beanplot(X214882_s_at~stage_grps, data) beanplot::beanplot(X213721_at ~stage_grps, data) beanplot::beanplot(X213722_at ~stage_grps, data) t.test(data[idx_grp1, "X200753_x_at"], data[idx_grp2, "X200753_x_at"]) t.test(data[idx_grp1, "X200754_x_at"], data[idx_grp2, "X200754_x_at"]) t.test(data[idx_grp1, "X214882_s_at"], data[idx_grp2, "X214882_s_at"]) t.test(data[idx_grp1, "X213721_at" ], data[idx_grp2, "X213721_at" ]) t.test(data[idx_grp1, "X213722_at" ], data[idx_grp2, "X213722_at" ]) wilcox.test(data[idx_grp1, "X200753_x_at"], data[idx_grp2, "X200753_x_at"]) wilcox.test(data[idx_grp1, "X200754_x_at"], data[idx_grp2, "X200754_x_at"]) wilcox.test(data[idx_grp1, "X214882_s_at"], data[idx_grp2, "X214882_s_at"]) wilcox.test(data[idx_grp1, "X213721_at" ], data[idx_grp2, "X213721_at" ]) wilcox.test(data[idx_grp1, "X213722_at" ], data[idx_grp2, "X213722_at" ]) layout(matrix(1:6, 2), respect=TRUE) beanplot::beanplot(X200753_x_at~SEX, data) beanplot::beanplot(X200754_x_at~SEX, data) beanplot::beanplot(X214882_s_at~SEX, data) beanplot::beanplot(X213721_at ~SEX, data) beanplot::beanplot(X213722_at ~SEX, data) layout(matrix(1:6, 2), respect=TRUE) plot(data[,"X200753_x_at"], data$AGE) plot(data[,"X200754_x_at"], data$AGE) plot(data[,"X214882_s_at"], data$AGE) plot(data[,"X213721_at" ], data$AGE) plot(data[,"X213722_at" ], data$AGE) layout(matrix(1:6, 2), respect=TRUE) beanplot::beanplot(X200753_x_at~T, data) beanplot::beanplot(X200754_x_at~T, data) beanplot::beanplot(X214882_s_at~T, data) beanplot::beanplot(X213721_at ~T, data) beanplot::beanplot(X213722_at ~T, data) ``` ## Control structure A control structure is a block of programming that analyzes variables and chooses a direction in which to go based on given parameters. ## Algorithm An algorithm is a self-contained sequence of actions to be performed. An algorithm is an effective method that can be expressed within a finite amount of space and time. # References http://adv-r.had.co.nz http://r-pkgs.had.co.nz # Exercices http://graal.ens-lyon.fr/~fchuffart/files/data/exercices.pdf # RegLog ```{r} d = data_nutri layout(matrix(1:2, 1), respect=TRUE) plot(d$taille, d$poids) # Y~B # E(Y) = b.X # E(Y) = b_0 + b_1.X # Y_i = b_0 + b_1.X_i + e_i m = lm(d$poids~d$taille) m = lm(poids~taille, d) m$coefficients m$residuals abline(a=m$coefficients[[1]], b=m$coefficients[[2]], col=2) # arrows(d$taille, d$poids, d$taille+m$residuals, d$poids, col=adjustcolor(1, alpha.f=0.5)) arrows(d$taille, d$poids, d$taille, d$poids-m$residuals, col=adjustcolor(1, alpha.f=0.5)) layout(matrix(1:2, 1), respect=TRUE) plot(d$taille, d$poids) # Y~B # E(Y) = b.X # E(Y) = b_0 + b_1.X # Y_i = b_0 + b_1.X_i + e_i m = lm(d$poids~d$taille) m$coefficients abline(a=m$coefficients[[1]], b=m$coefficients[[2]], col=2) plot(d$taille, d$poids-m$residuals, col=1) arrows(d$taille, d$poids, d$taille, d$poids-m$residuals, col=adjustcolor(1, alpha.f=0.5)) d = data_nutri layout(matrix(1:2, 1), respect=TRUE) plot(d$sex, d$poids) # Y~B # E(Y) = b.X # E(Y) = b_0 + b_1.X # Y_i = b_0 + b_1.X_i + e_i m = lm(d$poids~d$sex) m$coefficients abline(a=m$coefficients[[1]], b=m$coefficients[[2]], col=2) plot(1:nrow(d), d$poids-m$residuals, col=1) arrows(1:nrow(d), d$poids, 1:nrow(d), d$poids-m$residuals, col=adjustcolor(1, alpha.f=0.5)) abline(a=m$coefficients[[1]], b=m$coefficients[[2]], col=2) arrows(d$taille, d$poids, d$taille + 0.5*m$coefficients[[2]]*m$residuals, d$poids - sqrt(m$coefficients[[2]]*m$residuals), col=adjustcolor(1, alpha.f=0.5)) arrows(d$taille, d$poids, d$taille + 0.5*m$coefficients[[2]]*m$residuals, d$poids - sqrt(m$coefficients[[2]]*m$residuals), col=adjustcolor(1, alpha.f=0.5)) # plot( d$poids, d$taille) # m = lm(d$taille~d$poids) # m = lm(taille~poids, d) # m$coefficients # m$residuals # arrows(d$poids, d$taille, d$poids+m$residuals, d$taille, col=adjustcolor(1, alpha.f=0.5)) # abline(a=m$coefficients[[1]], b=m$coefficients[[2]], col=2) # plot(density(m$residuals)) plot(data_nutri$sexe, data_nutri$taille) ``` # Session Information ```{r, results="verbatim"} sessionInfo() ```
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.