
D-Lab's 12-hour introduction to R Fundamentals. Learn how to create variables and functions, manipulate data frames, make visualizations, use control flow structures, and more, using R in RStudio.

License: Other

R 100.00%

r data-science data-wrangling data-visualization automation

r-fundamentals-legacy's Introduction

D-Lab R Fundamentals Workshop

This repository contains the materials for the D-Lab R Fundamentals workshop. No prior experience with R is required.

Workshop Goals

In this workshop, we provide a broad overview of the fundamentals of using R, a programming language geared toward statistical analysis and data science. The workshop is divided into four parts, which cover the following topics:

Part 1: Introduction to R, navigating RStudio, variable assignment, data types and coercion, and data structures.
Part 2: Working with data frames: importing, subsetting, filtering, and merging.
Part 3: Data visualization using R and ggplot2.
Part 4: Functions, for loops, and if-else statements.

Installation Instructions

RStudio is an application commonly used by R practitioners to develop R code. We will use RStudio to go through the workshop materials, which requires installing both the R language and the RStudio software. If you would like to run R on your own computer, complete the following steps prior to the workshop:

Download R: Follow the links according to the operating system you are running. You will first need to click on a link corresponding to your operating system, and then an additional link to select a specific version of R. Download the package, and install R onto your computer. You should install the most recent version (at least version 4.1).
- If you are using a Mac, click "Download R for macOS" and then select the right version of R. You will need to select the version corresponding to your specific version of macOS, as well as whether you have an Intel or Apple Silicon Mac.
- If you are using Windows, click "Download R for Windows", then click "base", and click the download link.
- If you are using Linux, click on the link corresponding to your Linux distribution, and then follow the instructions.
Download RStudio: Install RStudio Desktop. This should be free. Do this after you have already installed R. The D-Lab strongly recommends an RStudio edition of 2022.02.0+443 "Prairie Trillium" or higher.
- Some individuals with older operating systems may run into odd issues. If you are running into issues with the installation of RStudio, you may need to install a specific version of RStudio. Please check this link if this applies to you.
Download these R Fundamentals workshop materials:
- Click the green "Code" button in the top right of the repository information.
- Click "Download Zip".
- Extract this file to a folder on your computer where you can easily access it (we recommend Desktop).
Optional: if you're familiar with git, you can instead clone this repository by opening a terminal and entering git clone git@github.com:dlab-berkeley/R-Fundamentals.git.

Is R not working on your laptop?

If you do not have R installed and the materials loaded on your computer by the time the workshop starts, we strongly recommend using the UC Berkeley DataHub to run the materials for these lessons. You can access the DataHub by clicking the following button:

Some users may have to click the link twice if the materials do not load initially.

The DataHub downloads this repository, along with any necessary packages, and allows you to run the materials in an RStudio instance on UC Berkeley's servers. No installation is needed from your end - you only need an internet browser and a CalNet ID to log in. By using the DataHub, you can save your work and come back to it at any time. When you want to return to your saved work, go straight to DataHub, sign in, and click on the R-Fundamentals folder.

If you don't have a Berkeley CalNet ID, you can still run these lessons in the cloud, by clicking this button:

If you are loading Binder with this repository for the first time, it may take a few minutes to set up. Binder operates similarly to the D-Lab DataHub, but on a different set of servers. By using Binder, however, you cannot save your work.

Run the Code

Now that you have all the required software and materials, you need to run the code.

Launch the RStudio software.
Use the file navigator to find the R-Fundamentals folder you downloaded from GitHub. Open R-Fundamentals.Rproj by double-clicking the file.
Open up the file corresponding to the part of the workshop you're attending (Part1.R, Part2.R, Part3.R, Part4.R) via the Files panel in RStudio.
Place your cursor on a given line and press Command + Enter (Mac) or Control + Enter (PC) to run an individual line of code.
The solutions folder contains the solutions to the challenge problems.

Additional Resources

Check out the following online resources to learn more about R:

as well as the following books:

Bookdown Featured Books
Introduction to Probability and Statistics in R by G. Jay Kearns.
Advanced R by Hadley Wickham.
R for Data Science by Hadley Wickham and Garrett Grolemund.
R for Everyone by Jared Lander.
Art of R Programming by Norman Matloff.

About the UC Berkeley D-Lab

D-Lab works with Berkeley faculty, research staff, and students to advance data-intensive social science and humanities research. Our goal at D-Lab is to provide practical training, staff support, resources, and space to enable you to use R for your own research applications. Our services cater to all skill levels and no programming, statistical, or computer science backgrounds are necessary. We offer these services in the form of workshops, one-to-one consulting, and working groups that cover a variety of research topics, digital tools, and programming languages.

Visit the D-Lab homepage to learn more about us. You can view our calendar for upcoming events, learn about how to utilize our consulting and data services, and check out upcoming workshops. Subscribe to our newsletter to stay up to date on D-Lab events, services, and opportunities.

Other D-Lab R workshops

D-Lab offers a variety of R workshops, tailored to different levels of expertise.

Introductory Workshops

Intermediate and Advanced Workshops

Contributors

Pratik Sachdeva
Alex Stephenson
Evan Muzzall
Aniket Kesari
Jae Yeon Kim
Sam Abdel-Ghaffar
Avery Richards
Guadalupe Tuñón
Shinhye Choi
Patty Frontiera
Rochelle Terman
Dillon Niederhut

r-fundamentals-legacy's Issues

ln. 114 code break (solution inside!)

Taking plot() to the limit one more time on line 114, the code breaks (no output). Here's a clunky solution: create a new factor column with as.factor() and pass its integer codes to the col and pch arguments.

gap$plot_cat <- as.factor(gap$continent)

scatter <- plot(x = x, y = y,
                main = "Life expectancy versus gdpPercap",
                xlab = "Life expectancy (years)",
                ylab = "gdpPercap (USD)",
                # Change point colors to correspond to continents
                col = as.integer(gap$plot_cat),
                # Change point symbols to correspond to continents
                pch = as.integer(gap$plot_cat),
                # Change point size
                cex = 2,
                las = 1)

dplyr for summarization

I don't know whether we should recommend the psych package or dplyr for summarization in PART 3-2. I understand that we need to teach enough base R, but psych is not base R, its syntax is quite similar to dplyr's, and we introduce ggplot2 soon (PART 3-4) anyway.

Part2.R - need to quote dplyr when installing package.

Part2.R - Line 161
install.packages(dplyr)

Needs to be changed to
install.packages("dplyr")

Part 1 - move top of Part 1 to more appropriate locations

Move

install.packages()
library()
?
mean()

to more appropriate sections.

Instead, let's lead with a summary of what the four window panes do in RStudio.

Part 2: Change challenge 4

Currently, challenge 4 has the users develop plots based on the data frames. This is problematic since we haven't covered any plotting. It seems a bit out of place - maybe this should be a challenge about merging?

Part 4 - ordering of sections

Currently the ordering is:
functions > for loops > if-else statements > functions again

Proposed new order:
if-else statements > for loops > functions

Notes:
If statements are very intuitive, and many people have used them in programs like Excel. For loops are very important and the second-most intuitive. Functions can then combine all of this together.

replace iris dataset with penguins

Here's the penguins dataset: https://github.com/allisonhorst/palmerpenguins

Also, the iris dataset was published by Ronald Fisher in the Annals of Eugenics in 1936. It's time to reconsider its usage in intro stat and data science courses.

Solutions manual for Part 2 needs updating

There were wonderful updates to the content for Part 2.R but the corresponding solutions file does not appear to be fully updated (see merge section, maybe others)

Part 1: Make a note about assignment operators (= and <-)

There should be a short example demonstrating that = can be used for assignment in R, but that <- is preferred as a matter of style. The example should also note that the two operators do behave differently in some specific cases, which don't need to be detailed at that point.
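A minimal sketch of the kind of example this issue describes; it is not taken from Part1.R. The names `demo`, `m1`, `m2`, `x_exists`, and `y_exists` are hypothetical and exist only for this illustration.

```r
# At the top level, `=` and `<-` both assign a value to a name.
a = 5
b <- 5
identical(a, b)

# Inside a function call they differ. The calls run in a scratch
# environment (`demo`, a hypothetical name) so we can inspect afterwards
# which variables each call created.
demo <- new.env()
local({
  m1 <- mean(x = c(1, 2, 3))   # `=` only names the argument `x`
  m2 <- mean(y <- c(1, 2, 3))  # `<-` creates `y`, then passes its value on
}, envir = demo)

x_exists <- exists("x", envir = demo, inherits = FALSE)
y_exists <- exists("y", envir = demo, inherits = FALSE)
print(c(x_exists = x_exists, y_exists = y_exists))
```

Only `y` is left behind: within a call, `=` binds an argument name, while `<-` performs a real assignment in the calling environment.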

Bring README up to standard

The README needs a few updates to bring it up to style. It should follow the template set by R-Data-Visualization.

Part 2: Information about datasets

We should include some information about the datasets, either in the README or in the script directly. For example, it's not totally clear what the sleep_VIM dataset is, and what the columns refer to. It would make the computations we perform more meaningful if students knew what the dataset consisted of, what the features were, and what units the columns were in.

Part 1: TAB autocomplete is already completed

Lines 55 and 57 need to be shortened so that they can be tab autocompleted by the users.

Part 1: Move introduction of factors

Factors are mentioned near the end of the lesson, after the introduction of data frames. This is likely to explain the stringsAsFactors parameter for data frames.

This is likely not necessary. First, stringsAsFactors can simply be omitted here: the default argument is FALSE, so it doesn't need to be discussed at all. It can be brought up in Part 2 if necessary.

The introduction of factors should occur with the other data types. It's a little odd that it's mentioned in the beginning of that section, but not referenced till much later in the lesson.

Part 1: Remove scientific notation function

Line 220, defining the scientific notation, comes somewhat abruptly. It is likely not necessary to demonstrate usage of this function. Its use case is not motivated and it is not referenced again in this lesson, so it makes the momentum of the lesson feel a bit awkward.

Untrack .Rproj.user directory

.Rproj.user shouldn't be tracked in git as it's a user-specific temporary directory - it would be good to remove.

Part 4 - Loops section update

There are instances where we define a vector X = 1:5, then write a for loop: "for (X in 1:length(X))". This is very confusing for people, since the iterator has the same name (and values!) as the vector. We should use a vector X and an iterator i to avoid confusion. Ideally, the vector would also be X = 2:6, so that the value of X[i] is not the same as i (easier to explain this way).
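A short sketch of the proposed pattern, assuming a vector `X = 2:6` and an iterator `i` as suggested. The variable `doubled` is a hypothetical name added for illustration, and `seq_along(X)` is used here in place of `1:length(X)`, which is a stylistic choice rather than something the issue specifies.

```r
X <- 2:6                        # values differ from their positions
doubled <- numeric(length(X))   # pre-allocate a vector for the results

for (i in seq_along(X)) {       # i takes the positions 1, 2, 3, 4, 5
  doubled[i] <- X[i] * 2        # X[i] is the value stored at position i
  print(paste0("i = ", i, ", X[i] = ", X[i]))
}

doubled
```

Because `X[i]` and `i` never coincide, it is easier to point at the output and distinguish the position from the value.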

Workshop Title

Workshop title should be "R-Fundamentals:-Parts-1-4"

Part 3: improve consistency with as.factor() / as.character() conversion

At some points in the lesson, it appears that conversions from characters to factors are done, or vice versa. These are not done consistently. The default import does not convert the characters in the dataframe to factors, but later code blocks assume that these columns were already factors. They should be made consistent.

It is probably best to import the dataframes with stringsAsFactors = TRUE, and then make sure every conversion thereafter makes sense.
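To make the difference concrete, here is a small sketch using a hand-built data frame rather than the workshop's CSV files. The names `countries`, `df_chr`, and `df_fct` are hypothetical.

```r
countries <- c("Kenya", "Peru", "Kenya")

# Keep strings as characters
df_chr <- data.frame(country = countries, stringsAsFactors = FALSE)
# Convert strings to factors on creation
df_fct <- data.frame(country = countries, stringsAsFactors = TRUE)

class(df_chr$country)    # "character"
class(df_fct$country)    # "factor"
levels(df_fct$country)   # "Kenya" "Peru"

# An explicit conversion makes later code independent of the import setting
df_chr$country <- as.factor(df_chr$country)
class(df_chr$country)    # "factor"
```

`read.csv()` accepts the same `stringsAsFactors` argument, so whichever setting is chosen can be applied at import time and the later conversions checked against it.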

Part 4: Come up with a simpler example combining the core ideas

Currently, the final example is rather involved: it requires writing a function for the birthday problem. This is somewhat advanced, and difficult to squeeze into what is already a full lesson.

A simpler example could involve writing a naive function that checks whether a number is prime - this involves a function, a for loop, and an if statement, and simpler mathematical ideas.
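A possible sketch of such a function, offered as one naive approach rather than the version the workshop adopted. The names `is_prime` and `primes` are hypothetical. It uses trial division up to the square root of `n`.

```r
# Naive primality check: a function, a for loop, and if statements.
is_prime <- function(n) {
  if (n < 2) {
    return(FALSE)          # 0 and 1 are not prime
  }
  if (n < 4) {
    return(TRUE)           # 2 and 3 are prime
  }
  for (d in 2:floor(sqrt(n))) {
    if (n %% d == 0) {
      return(FALSE)        # found a divisor, so n is not prime
    }
  }
  TRUE                     # no divisor found
}

# Collect the primes up to 20
primes <- c()
for (n in 1:20) {
  if (is_prime(n)) {
    primes <- c(primes, n)
  }
}
primes
```

The early return for `n < 4` also ensures `2:floor(sqrt(n))` never counts downward, which would otherwise happen for `n = 2` or `n = 3`.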

Part 1: Move Section 8 to Part 2

Section 8 discusses getwd() and saving a csv file. This comes a little bit out of nowhere, and would fit naturally in Part 2 when importing csv files from a specific path is discussed.
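For reference, a minimal sketch of the workflow in question, which pairs naturally with importing. The data frame `scores` and the path `out_path` are hypothetical, and the file is written to a temporary directory so the example does not touch the project folder.

```r
getwd()   # the folder R currently reads from and writes to

# A small hand-made data frame (illustrative values only)
scores <- data.frame(name = c("Ana", "Ben"), score = c(90, 85))

# Save to a CSV file, then import it again from the same path
out_path <- file.path(tempdir(), "scores.csv")
write.csv(scores, out_path, row.names = FALSE)
scores_back <- read.csv(out_path)
scores_back
```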

add `droplevels` for subsetting
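As context for this request, a brief sketch of why `droplevels()` matters after subsetting. The vectors `continent`, `asia_only`, and `asia_clean` are hypothetical.

```r
continent <- factor(c("Asia", "Europe", "Asia", "Africa"))

# Subsetting keeps every original level, even those no longer present
asia_only <- continent[continent == "Asia"]
levels(asia_only)   # "Africa" "Asia" "Europe"
table(asia_only)    # Africa and Europe appear with a count of 0

# droplevels() removes the unused levels
asia_clean <- droplevels(asia_only)
levels(asia_clean)  # "Asia"
```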

Incorporate here package

Ideally, this repository would have a lessons folder in which each Part would be placed. A barrier to this is the importing of data files: right now, their paths are hard-coded, starting in Part2.R.

A better approach would be to place the files within the lessons folders, and use the here package to properly create the import. This would be a good instruction point, as here is used in basically all other R workshops.

The best place to do this is in Part2.R. However, these changes would require going through all scripts and making the appropriate changes to the filepaths.

Part 3: remove deprecation error

Using guides(fill = FALSE) is deprecated in ggplot2 and produces a warning. It should be replaced with guides(fill = "none").

Move materials from Part 1 to Part 2

There is a bit of a bottleneck in Part1.R: installation troubleshooting alone can take anywhere from 30 minutes to 1 hour.

A possible solution is to move some materials from Part1 to Part2 (especially anything related to data frames, since it becomes redundant). This would let Part1 focus on troubleshooting and getting started with R rather than going deep into data structures. Will update this issue after observing Part2 today.

@Averysaurus

Part 1: Remove random number generation

It is probably not necessary to introduce random number generation in the very first lesson. It can be confusing for first time users and is not necessary to create dataframes - we can just create the dataframes by hand.

Furthermore, this concept is also brought up in Part 4. It could be moved to that lesson entirely.

Part 3: removing statistical testing?

This issue is for discussion. Is it worth keeping the statistical testing section after the introduction to visualization? It feels a bit odd because we're not explaining any of the details of the statistical testing - just demonstrating that it can be done. Furthermore, it feels only loosely connected to the previous part of the lesson.

I think it's totally fine to keep geom_smooth in there as a way to demonstrate that linear models can be fitted and plotted in one fell swoop. However, demonstrating the other functions may not be that effective, since the lesson does not really get into the weeds.

Part 1: use variable number

We define the variable "number" and assign 5 to it but don't use it to demonstrate how to use a variable. Might be good to add one line like "number + 8" to show that we can use the variable like a number (which it is). I would add that right after the line "class(number)"
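The suggested addition would look roughly like this. The surrounding lines of Part1.R are assumed from the description above, not quoted from the script.

```r
number <- 5
class(number)   # "numeric"

# Use the variable just as we would use the value it holds
number + 8      # 13

# The variable itself is unchanged unless we reassign it
number          # 5
```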

Add binder link

Update Read Me link

There is a misdirecting link under Installation instructions. The 3rd bullet "Download these workshop materials:" links to the R Data Visualization workshop. Instead it should link to the R-Fundamentals workshop (https://github.com/dlab-berkeley/R-Fundamentals).

Part 1: Error on accessing country column

The current Part1.R script errors on lines 354 - 368 because of a typo in the country column (it doesn't capitalize "Country").
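A small illustration of this class of error, using a hypothetical data frame `df` rather than the one in Part1.R. Column names are case-sensitive, so `country` and `Country` are different names.

```r
# Illustrative data frame; the values are made up for this sketch
df <- data.frame(Country = c("Chile", "Ghana"),
                 value = c(1, 2),
                 stringsAsFactors = FALSE)

df$country                 # NULL: there is no column with this exact name
"country" %in% names(df)   # FALSE
df$Country                 # "Chile" "Ghana"
```

Note that `df$country` returns `NULL` silently, so the failure tends to surface a few lines later, when the `NULL` is passed to another function.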

Explain named arguments in more detail in Part1-4

Part 1: Remove or adjust sentence formation block in Section 5

In lines 163-195, several functions operating on character objects are demonstrated and used. This is probably not necessary for the very first lesson in R. Furthermore, they don't fit the section, which aims to demonstrate data structures (vectors, lists, and data frames).

They should be removed from this section. They could potentially be excised entirely, or placed in the section which introduces built-in functions.

add challenges to scripts

Part 3: change solution to Challenge 4

The solution to Challenge 4 requires using dplyr, which is not discussed thus far in this series. It would be better to use another, more approachable example.

Part 4 - examples & challenges notes

Cylinders challenge was very well done and boosted everyone's intuition a bunch. Similarly the Monte Carlo dice rolls seemed to be useful. The Birthday problem was a great one to end on because it did something that actually demonstrated the power of the technology, rather than giving us intuitive answers that we didn't actually need R for.

The "Lock" example was overkill in its complexity; I did not even attempt it with my group. I suggest removing it.

Overall, there were perhaps too many examples in this section to be completed in the 3 hour time limit.

not mentioning factors in Part1-3

I'm reviewing Part1-3 and don't think that we need to mention factor type. As a matter of fact, we didn't discuss it in that particular section.

Introduce `%in%` function

Useful for checking if a string appears in a vector of strings (maybe part 1 or part 2 with boolean stuff?)
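A quick sketch of how `%in%` could be introduced. The vectors `continents` and `visited` are hypothetical examples.

```r
continents <- c("Asia", "Europe", "Africa")

# Is a single string present in the vector?
"Asia" %in% continents                 # TRUE
"Oceania" %in% continents              # FALSE

# It is vectorized over the left-hand side: one result per element
c("Asia", "Oceania") %in% continents   # TRUE FALSE

# This makes it handy for filtering
visited <- c("Europe", "Oceania", "Asia")
visited[visited %in% continents]       # "Europe" "Asia"
```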

Part 4 - Collection of notes and suggestions from Connor and Jose

Introduce %in% somewhere - it's really useful for checking if a string appears in a vector of strings (maybe in part 1 or part 2?)
Part 4 is overall too long and complicated (a few specific suggestions below)
Start with if statements, then for loops, then functions (currently it's functions > for loops > if statements > more functions)
In the loops section, there are instances where we define a vector X = 1:5, then we do a for loop: "for (X in 1:length(X))". This is super confusing for people since the iterator is the same name (and values!) as the vector. We should use a vector X and an iterator i to avoid confusion. Also ideally the vector X would be X = 2:6 so that the value of X[i] is not the same as i (easier to explain this way)
Cylinders challenge is great!
The "Lock" example is overkill, suggested to remove (I did not even attempt it)
Monte carlo dice rolls & birthday problem were really well received

Thanks!

README - Github link broken for Datahub

zero is not a negative number

It was pointed out on Day 4 that 0 is not a negative number. Two of the example if-else statements (lines 208-219 in Part4.R) produce output that says it is ("0 is negative."). The code is below. To address this technicality, we could change lines 216 and 226 to, for example, "x is less than 1", or add another else if statement for "x is equal to zero."

Lastly, if we want to check for multiple conditions in a row, we can use else-if statements:

x <- 2
if (x > 3) {
  print("x is greater than 3.")
} else if (x > 0) {
  print("x is greater than 0.")
} else {
  print("x is negative.")
}

Let's go ahead and put this if-else statement inside its own function, and run it inside a for-loop:

if_checker <- function(x) {
  if (x > 3) {
    print(paste0("x = ", x, ", is greater than 3."))
  } else if (x > 0) {
    print(paste0("x = ", x, ", is greater than 0."))
  } else {
    print(paste0("x = ", x, ", is negative."))
  }
}
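One way to address this, sketched here under the assumption that an extra else-if branch is preferred over rewording the final message. The intermediate variable `msg` is a hypothetical addition; the rest mirrors the function above.

```r
if_checker <- function(x) {
  if (x > 3) {
    msg <- paste0("x = ", x, ", is greater than 3.")
  } else if (x > 0) {
    msg <- paste0("x = ", x, ", is greater than 0.")
  } else if (x == 0) {
    msg <- paste0("x = ", x, ", is equal to zero.")   # zero gets its own branch
  } else {
    msg <- paste0("x = ", x, ", is negative.")        # only reached when x < 0
  }
  print(msg)
}

# Run the checker inside a for loop, including the zero case
for (x in c(5, 2, 0, -1)) {
  if_checker(x)
}
```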

fix broken datahub link

The link to datahub is currently broken:

Part 2: Use dplyr instead of base R

This is up for discussion. However, I feel that functions like subset and merge, from base R, are not really used by practitioners these days. Instead, students would benefit most greatly from introduction to modern tools based on tidyverse packages such as dplyr. We can mention that these functions exist in base R, but we should instead do subsetting and merging with dplyr functions. This mimics how ggplot is introduced in Part 3.
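For reference while discussing, here is a small sketch of the base R functions mentioned, with the corresponding dplyr calls noted in comments only (dplyr is not loaded or run here). The data frames `people` and `cities` are hypothetical and contain made-up values.

```r
people <- data.frame(id = c(1, 2, 3), age = c(25, 40, 31))
cities <- data.frame(id = c(1, 2, 4),
                     city = c("Oakland", "Berkeley", "Davis"),
                     stringsAsFactors = FALSE)

# Base R: keep rows matching a condition
over_30 <- subset(people, age > 30)
# dplyr counterpart: dplyr::filter(people, age > 30)

# Base R: combine two data frames on a shared key (an inner join by default)
joined <- merge(people, cities, by = "id")
# dplyr counterpart: dplyr::inner_join(people, cities, by = "id")

over_30
joined
```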

Variable assignment in Part1-2

I elaborated on why we should encourage students to use <- instead of = for variable assignment. Admittedly, = could be more intuitive, as it's the way a variable is assigned in Python and some other languages. However, = and <- work differently in terms of their scopes. I referenced Google's and Hadley Wickham's R style guidelines, as well as a minimal reproducible example that demonstrates the point.