b-rodrigues / rap4all

License: Other

HTML 98.18% TeX 0.80% Emacs Lisp 0.01% R 1.00% CSS 0.01%

rap4all's Introduction

Building reproducible analytical pipelines with R

This is the repository containing the source code to my free ebook called "Building reproducible analytical pipelines with R".

Read it online here https://raps-with-r.dev. On that page, you can click the download button in the sidebar on the left to download a PDF or EPUB version of the book. You can also buy (the same) DRM-free PDF or EPUB from Leanpub if you want to support this initiative.

A print version through Amazon Kindle Direct Publishing is also planned.

If you spot a typo, feel free to open an issue or pull request.

NEWS

2023-10-03: Several typos were corrected (thanks to the many contributors) and the Wikipedia tables that are scraped at the beginning of the book are now re-hosted on GitHub Pages for reproducibility purposes.

2023-07-10: Two formatting issues were fixed on pages 338 and 419 of the PDF.

2023-06-29: A typo was spotted in the PDF version of the book, at the top of page 468. The command after docker push: was missing. This is now fixed.

rap4all's Issues

Before we start -- split into subsections / subheadings

Hi, I enjoyed reading your blog (such as about running R code using older versions of R), so I couldn't miss your book.

After reading the first "real" chapter, I feel it is a bit rambly: it mixes several concepts in a single paragraph, and several ideas are spread across multiple paragraphs. I think that using more subheadings would improve the clarity and the organisation of the text.

I see some points that could be highlighted and their current position in the text:

- What is R (first, but also third and fourth paragraphs)
- R is not RStudio (first and second paragraphs)
- The strengths of R (second, third, and fourth paragraphs)
- Best Practices for Paths (from fifth paragraph onwards)
- The Learning Loop (the last chunk)

The current text could be reorganised to follow these highlights, which would make it easier to read.

Associate Acronyms with their definition

In the preface, you use some acronyms like ONS, RAP, PI, etc. The text would be easier to read if each acronym were paired with its definition, at least on its first appearance, e.g. Office for National Statistics (ONS).
Some of them are not defined, e.g. Principal Investigator (PI). I don't know if this one is well known outside of R&D circles.

Mention RStudio projects (.Rproj) for reproducibility

Thanks for making this OS. I have read some parts of the book, attended one of the workshops held online, and found everything in your talk helpful! One minor thing I noticed is that, as far as I am concerned, RStudio projects should be mentioned to improve reproducibility. At least for those using the RStudio IDE (maybe the majority, nowadays), this is very handy.

For those using RStudio, the .Rproj can be committed and not git-ignored
- If you are using a different IDE, you can act as if it did not exist in the repository
Advantages I can think of:
- No need to set the path in your scripts or change it in the console
- Make R workflows portable and OS-agnostic
- Use just a relative path from your current project folder
- Clone the repo and click on .Rproj, and you are ready to go
- Use in combination with the {here} package
- bonus: Robust global search within the project

what are the packages doing in the first example?

Sorry, I am not really an R developer, so this is probably a stupid question.
On this page, could you please briefly explain what those packages are doing?

library(dplyr)
library(purrr)
library(readxl)
library(stringr)
library(janitor)

I was expecting some explanation, but it did not come ;) Or is it even useful to mention loading packages here?

minor typo

minor typo in chapter 13 right above section 13.3 (I think)

The pipeline is nothing but a list (told you lists ~~where~~ were a very important object) of targets.

Why do we need to use an anonymous function?

Hi Bruno,

In 6.3.4 Data frames, the code:

nested_unemp %>% mutate(nrows = map(data, \(x)filter(x, year == 2015)))

Why do we need to use an anonymous function? Would it not work fine as:

nested_unemp %>% mutate(nrows = map(data, function(x){filter(x, year == 2015)}))
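For what it's worth, both forms in this question are anonymous functions: `\(x)` is the shorthand for `function(x)` that arrived in R 4.1, so the two calls should behave the same. Below is a minimal sketch of that equivalence in base R. It assumes a hypothetical `nested` list standing in for the `data` list-column, and it swaps in `lapply()` and `subset()` for `map()` and `filter()` so that it runs without extra packages.

```r
# Hypothetical stand-in for the nested `data` column: a list of data frames.
nested <- list(
  a = data.frame(year = c(2014, 2015), rate = c(5.1, 4.8)),
  b = data.frame(year = c(2015, 2016), rate = c(6.0, 5.7))
)

# Shorthand lambda syntax (needs R >= 4.1).
short_form <- lapply(nested, \(x) subset(x, year == 2015))

# Classic anonymous function syntax.
long_form <- lapply(nested, function(x) { subset(x, year == 2015) })

# Both calls produce the same list of filtered data frames.
same_result <- identical(short_form, long_form)
print(same_result)
```

The choice between the two is a matter of style; the shorthand only saves typing.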

Consider adding Github links in book

Similar to R4DS 2e.

'Edit this page' link broken

Regardless of which chapter of the book I am viewing at https://raps-with-r.dev/, the 'Edit this page' link returns a '404 - page not found' error.

apt-get vs apt

On the page about git, I would recommend using apt instead of apt-get. Using apt-get is not wrong per se, but why not use the more modern and potentially more user-friendly command for beginners?
More info here: https://itsfoss.com/apt-vs-apt-get-difference/

sudo apt update
sudo apt install git

Thousands marker

Hello Bruno, I took up your challenge from Mastodon to read a bit of your draft. One thing that stands out is how you use the . separator rather than , separator for 000s. e.g. 400.000 rather than 400,000. As a native English reader, I usually expect to see the , and the . is used as a decimal point. If I do a PR with typos, do you want me to mark these for change also?

Renv not working in targets-minimal repo

Hi Bruno,

I have cloned your repo - targets-minimal, activated renv using renv::activate() and then ran renv::restore(). The packages download but do not install. I get the following error message from trying to install the MASS package:

Installing MASS ...                           FAILED
Error: Error installing package 'MASS':
================================

* installing *source* package ‘MASS’ ...
** package ‘MASS’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
using C compiler: ‘Apple clang version 14.0.0 (clang-1400.0.29.202)’
using SDK: ‘MacOSX13.1.sdk’
clang -arch x86_64 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I/opt/R/x86_64/include    -fPIC  -falign-functions=64 -Wall -g -O2  -c MASS.c -o MASS.o
MASS.c:37:23: error: unknown type name 'Sint'; did you mean 'int'?

With a load more error output. I am wondering whether this is because I am using a newer version of R (R 4.3.1) and renv (renv 1.0.2). Or should this still work?

Thanks
Jake

Indirection and tidyselect

Hi Bruno,

Have you thought about updating your scripts which are prepared for inflation by fusen so that they are consistent with the syntax used to deal with indirection and tidyselect? Inside a function, instead of accessing variables directly within the pipe, you refer to them through the .data pronoun, i.e., to filter on the column locality, you need to call it using .data$locality. An example from your code is:

make_commune_level_data <- function(flat_data){
  flat_data |>
    filter(!grepl("nationale|offres", .data$locality),
           !is.na(.data$locality))
}

Without this, when inflating, there are many warnings which appear telling you either:

 make_country_level_data: no visible binding for global variable
    ‘locality’
  Undefined global functions or variables:
    locality

Or, in a tidyselect you cannot use .data$ and instead just enclose the variable in "", or you get the warning:

Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0. Please use

I am slightly unsure whether this is best practice, and it doesn't seem to be too clear online. But I thought I would share it as food for thought, since doing so removes all the warnings. If you are interested, I am happy to share my code to save you time.

Here are some links:
https://dplyr.tidyverse.org/articles/programming.html#indirection
https://community.rstudio.com/t/use-of-data-in-tidyselect-expressions-is-now-deprecated/150092
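The two cases described in this issue can be sketched side by side as follows. This is only an illustration under stated assumptions: the toy `flat_data` values and the `price` column are hypothetical, and only the `locality` column and the function name come from the code quoted above. Inside a package, `.data` would typically also need to be imported from rlang (for example with `@importFrom rlang .data`) for the check note to go away.

```r
library(dplyr)

# Hypothetical toy data; only the `locality` column name comes from the issue.
flat_data <- data.frame(
  locality = c("Luxembourg", "Moyenne nationale", NA, "Esch-sur-Alzette"),
  price = c(100, 90, 80, 70)
)

make_commune_level_data <- function(flat_data) {
  flat_data |>
    # Data masking (filter, mutate, ...): refer to columns through `.data`.
    filter(!grepl("nationale|offres", .data$locality),
           !is.na(.data$locality)) |>
    # Tidy selection (select, ...): pass column names as strings instead.
    select(all_of(c("locality", "price")))
}

commune_data <- make_commune_level_data(flat_data)
print(commune_data)
```

Here the national-average row and the missing locality are dropped by `filter()`, while `select()` receives plain strings through `all_of()` rather than `.data$`.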

`pak`

Would pak help simplify the installation of system-level dependencies in the Docker chapter?

PS Love the book.

Data as plural rather than singular

Formally, "data" is considered a plural noun.
For example, see here: https://www.britannica.com/dictionary/eb/qa/Is-Data-Singular-or-Plural-

Nevertheless, it is common for people to use it as a singular noun. I think this is a decision that needs to be taken by you, and to be implemented consistently throughout the book, so I won't open PRs on this matter.

various issues in "14 Reproducible analytical pipelines with Docker"

"ls also works on the Windows command prompt, and in Powershell as well)" AFAIK this is not the case. If you are using the default command prompt cmd.exe, ls is not there. It is only available in PowerShell, at least on Windows 10; no idea about Windows 11 though.
"What sets the Linux kernel apart from the one used for Windows or macOS, is that the Linux kernel is open-source and free software." -> the kernel in macOS is called XNU and it is open source and free software, see https://github.com/apple-oss-distributions/xnu
"Docker recently announced that they would abandon their Docker Free Team subscription plans that some open-source organizations use, and that they should upgrade to a paid subscription within 30 days." -> I think this decision has been rolled back for now, but I generally agree with your point. There are alternatives, for example GitLab provides its own registry https://docs.gitlab.com/ee/user/packages/container_registry/

notes

just some things I must not forget

typo in basic reproducibility

a space is missing in the command cat.Rprofile.
I suppose it should be cat .Rprofile.

functional programming part

Just a few comments reading the page about functional programming, not an issue per se:

When speaking about recursive functions, I am wondering if it could be interesting to mention the special case of tail recursion. I have found the tailr package to manage the tail call optimization of these functions, but it does not seem to be maintained anymore.
What about curried functions in R?
As I understood, most data structures in R are immutable. Might be interesting to mention it?
When you mention the maybe monad, I would have been interested to see how you manage this in the context of a pipe, how you can substitute a default value for Nothing(), etc.

Build the book using Nix on GA

title

some issues on the page about git

On the page about git, you tell people to use git add . then git commit -am. But git commit -am already roughly corresponds to git add . followed by git commit -m: -a = all (stage every modified or deleted tracked file, though not new untracked files) and -m = message.
There is also a confusion between git am, which applies a series of patches coming from a mailbox, and git commit -am.

For more info you can use git help am and git help commit.

PRs preferences

Do you prefer a single PR for every single change, or one PR for all changes I make to a single file?

draft outline

Purpose of the book: teach practitioners (in research or industry, doesn’t matter), how to make workflows reproducible. Do we agree on that?

As for an outline:
I think that we could skip any intro to R, and state that readers need to be familiar with R already. I would say at least comfortable with writing functions already?

Chapter 1: Functional programming primer
Chapter 2: Git (should we keep this, or state that readers need to be familiar with it already?)
Chapter 3: Literate programming with Quarto (I guess we need a separate chapter for this)
Chapter 4: Package dev (with fusen, question to Sébastien: does fusen work with qmd files? since we’re teaching quarto it would be nice to stay in quarto, if possible. what do you think?)
Chapter 5: Unit testing (in fusen it means writing meaningful examples)
Chapter 6: Targets (including renv, or should renv be a separate chapter?)
Chapter 7: Make it all reproducible (using Docker, and PROPRE? PROPRE inside Docker?)
Chapter 8: CI/CD with GitHub Actions

What do you think?

Typo - Section 1.5

rap4all/intro.qmd

Line 353 in 4a4f7cd

 and depending on the constraints you face your project can not very reproducible 

Thank you for the book! I'm loving the message so far. I'm always trying to explain this concept to newer R-programmers reluctant to learn git or document code, and now I have a much better resource :)

I know you are still working on it, but I figured I'd point out anything I see to save you some time. I don't mean to nitpick, just trying to help.

Anyway, the line above is missing a "be".

So what does this all mean? This means that reproducibility is on a continuum,
and depending on the constraints you face your project can be not very reproducible
to totally reproducible.
