rap4all's Introduction

Building reproducible analytical pipelines with R

This is the repository containing the source code to my free ebook called "Building reproducible analytical pipelines with R".

Read it online here https://raps-with-r.dev. On that page, you can click the download button in the sidebar on the left to download a PDF or EPUB version of the book. You can also buy (the same) DRM-free PDF or EPUB from Leanpub if you want to support this initiative.

A print version through Amazon Kindle Direct Publishing is also planned.

If you spot a typo, feel free to open an issue or pull request.

NEWS

2023-10-03: Several typos were corrected (thanks to the many contributors) and the Wikipedia tables that are scraped at the beginning of the book are now re-hosted on GitHub Pages for reproducibility purposes.

2023-07-10: Two formatting issues were fixed on pages 338 and 419 from the PDF.

2023-06-29: A typo was spotted on the PDF version of the book, at the top of page 468. The command after docker push: was missing. This is now fixed.

rap4all's People

Contributors

b-rodrigues, matanhakim, jonathandmoore, statnmap, a-s-russo, jaketufts, pstennant, skolenik, karla-desouza, olivroy, asadow, mkienzle, pmassicotte, shitao5

rap4all's Issues

Before we start -- split into subsections / subheadings

Hi, I enjoyed reading your blog (such as about running R code using older versions of R), so I couldn't miss your book.

After reading the first "real" chapter, I feel it is a bit rambly: it mixes several concepts in a single paragraph, and several ideas are spread across multiple paragraphs. I think that using more subheadings would improve the clarity and the organisation of the text.

I see some points that could be highlighted and their current position in the text:

  • What is R (first, but also third and fourth paragraphs)
  • R is not RStudio (first and second paragraphs)
  • The strengths of R (second, third, and fourth paragraphs)
  • Best Practices for Paths (from fifth paragraph onwards)
  • The Learning Loop (the last chunk)

The current text could be reorganised to follow these points, which would make it easier to read.

Associate Acronyms with their definition

In the preface, you use some acronyms like ONS, RAP, PI, etc. The text would be easier to read if each were paired with its definition, at least on first appearance, e.g. Office for National Statistics (ONS).
Some of them are not defined, e.g. Principal Investigator (PI). I don't know if this one is well known outside of R&D circles.

Mention RStudio projects (.Rproj) for reproducibility

Thanks for making this OS. I have read some parts of the book, attended one of the workshops held online, and found everything in your talk helpful! One minor thing I noticed is that, as far as I am concerned, RStudio projects should be mentioned to improve reproducibility. At least for those using the RStudio IDE (maybe the majority, nowadays), this is very handy.

  • For those using RStudio, the .Rproj can be committed and not git-ignored
    • If you are using a different IDE, you can act as if it did not exist in the repository
  • Advantages I can think of:
    • No need to set the path in your scripts or change it in the console
    • Makes R workflows portable and OS-agnostic
    • Use just a relative path from your current project folder
    • Clone the repo and click on .Rproj, and you are ready to go
    • Use in combination with the {here} package
    • bonus: Robust global search within the project
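
To illustrate the relative-path point, here is a minimal sketch. The folder and file names are hypothetical, and it assumes the R session was started from the project root, which is what opening the .Rproj file does in RStudio. The {here} call only runs if that package happens to be installed.

```r
# Minimal sketch; "data" and "unemployment.csv" are hypothetical names.
# Assumption: the session was started from the project root, which is what
# opening the .Rproj file does in RStudio.

# A relative path built with file.path() carries no user- or OS-specific prefix.
data_path <- file.path("data", "unemployment.csv")

# With {here} installed, here::here() resolves the same file from the project
# root, even when the working directory is a subfolder of the project.
if (requireNamespace("here", quietly = TRUE)) {
  data_path_here <- here::here("data", "unemployment.csv")
}
```

Neither version hard-codes a home directory or a drive letter, so the script should run unchanged after cloning the repository on another machine.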

Reference:

what are the packages doing in the first example?

Sorry, I am not really an R developer, so this is probably a stupid question.
On this page, could you please explain quickly what those packages are doing?

library(dplyr)
library(purrr)
library(readxl)
library(stringr)
library(janitor)

I was expecting some explanation, but it did not come ;) Or is it useful to mention loading packages here?

minor typo

minor typo in chapter 13 right above section 13.3 (I think)

The pipeline is nothing but a list (told you lists where were a very important object) of targets.

Why do we need to use an anonymous function?

Hi Bruno,

In 6.3.4 Data frames, the code:

nested_unemp %>% mutate(nrows = map(data, \(x)filter(x, year == 2015)))

Why do we need to use an anonymous function? Would it not work fine as:

nested_unemp %>% mutate(nrows = map(data, function(x){filter(x, year == 2015)}))
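
For what it is worth, both spellings define an anonymous function, so they should behave the same. Below is a small base R sketch of that, assuming R 4.1 or later (where the \(x) shorthand was introduced). The toy list of data frames is made up for illustration and is not the book's nested_unemp object.

```r
# Assumption: R >= 4.1, where the \(x) shorthand for function(x) was introduced.
# `nested` is a hypothetical stand-in for a list column of data frames.
nested <- list(
  data.frame(year = c(2014, 2015), value = c(1, 2)),
  data.frame(year = c(2015, 2016), value = c(3, 4))
)

# Shorthand lambda syntax
with_shorthand <- lapply(nested, \(x) x[x$year == 2015, ])

# Classic function keyword: still an anonymous function, just spelled out
with_keyword <- lapply(nested, function(x) { x[x$year == 2015, ] })

# Both calls return the same filtered data frames
same_result <- identical(with_shorthand, with_keyword)
```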

apt-get vs apt

On the page about git, I would recommend using apt instead of apt-get. Using apt-get is not wrong per se, but why not use the more modern and potentially more user-friendly command for beginners?
More info here: https://itsfoss.com/apt-vs-apt-get-difference/

sudo apt update
sudo apt install git

Thousands marker

Hello Bruno, I took up your challenge from Mastodon to read a bit of your draft. One thing that stands out is that you use . rather than , as the thousands separator, e.g. 400.000 rather than 400,000. As a native English reader, I usually expect to see the , for thousands, with the . used as a decimal point. If I do a PR with typos, do you want me to mark these for change as well?

Renv not working in targets-minimal repo

Hi Bruno,

I have cloned your repo - targets-minimal, activated renv using renv::activate() and then ran renv::restore(). The packages download but do not install. I get the following error message from trying to install the MASS package:

Installing MASS ...                           FAILED
Error: Error installing package 'MASS':
================================

* installing *source* package ‘MASS’ ...
** package ‘MASS’ successfully unpacked and MD5 sums checked
** using staged installation
** libs
using C compiler: ‘Apple clang version 14.0.0 (clang-1400.0.29.202)’
using SDK: ‘MacOSX13.1.sdk’
clang -arch x86_64 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG   -I/opt/R/x86_64/include    -fPIC  -falign-functions=64 -Wall -g -O2  -c MASS.c -o MASS.o
MASS.c:37:23: error: unknown type name 'Sint'; did you mean 'int'?

With a load more error output. I am wondering whether this is because I am using a newer version of R (R 4.3.1) and renv (renv 1.0.2). Or should this still work?

Thanks
Jake

Indirection and tidyselect

Hi Bruno,

Have you thought about updating your scripts that are prepared for inflation by fusen so that they are consistent with the syntax used to deal with indirection and tidyselect? This allows you to access the variables directly within the pipe, i.e., to filter on the column locality, you call it using .data$locality. An example from your code is:

make_commune_level_data <- function(flat_data){
  flat_data |> 
    filter(!grepl("nationale|offres", .data$locality),
           !is.na(.data$locality))
}

Without this, when inflating, there are many warnings which appear telling you either:

 make_country_level_data: no visible binding for global variable
    ‘locality’
  Undefined global functions or variables:
    locality

Or, in a tidyselect context, you cannot use .data$ and should instead just enclose the variable in "", otherwise you get the warning:

Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0. Please use

I am slightly unsure whether this is best practice, and it doesn't seem to be too clear online. But I thought I would share it as food for thought, since doing so removes all the warnings. If you are interested, I am happy to share my code to save you time.

Here are some links:
https://dplyr.tidyverse.org/articles/programming.html#indirection
https://community.rstudio.com/t/use-of-data-in-tidyselect-expressions-is-now-deprecated/150092
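
As a rough sketch of the two cases described above, assuming {dplyr} is installed: the toy flat_data values below are invented for illustration, and only the function body mirrors the snippet quoted earlier.

```r
library(dplyr)

# Hypothetical toy data; only the column name `locality` comes from the issue.
flat_data <- data.frame(
  locality = c("Luxembourg", "Moyenne nationale", NA, "Esch-sur-Alzette"),
  price = c(1, 2, 3, 4)
)

# Data-masking verbs such as filter(): refer to columns through the .data
# pronoun, which avoids the "no visible binding" note during package checks.
make_commune_level_data <- function(flat_data) {
  flat_data |>
    filter(!grepl("nationale|offres", .data$locality),
           !is.na(.data$locality))
}

commune_data <- make_commune_level_data(flat_data)

# Tidyselect verbs such as select(): pass the column name as a string instead
# of using .data$.
selected <- commune_data |>
  select("locality")
```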

various issues in "14 Reproducible analytical pipelines with Docker"

  • "ls also works on the Windows command prompt, and in Powershell as well)" AFAIK this is not the case. If you are using the default command prompt cmd.exe, ls is not there. It is only available in Powershell, at least on Windows 10; no idea about Windows 11 though.
  • "What sets the Linux kernel apart from the one used for Windows or macOS, is that the Linux kernel is open-source and free software." -> the kernel in macOS is called XNU and it is open source and free software, see https://github.com/apple-oss-distributions/xnu
  • "Docker recently announced that they would abandon their Docker Free Team subscription plans that some open-source organizations use, and that they should upgrade to a paid subscription within 30 days." -> I think this decision has been rolled back for now, but I agree generally with your point. There are alternatives, for example Gitlab provides its own registry https://docs.gitlab.com/ee/user/packages/container_registry/

notes

just some things I must not forget

functional programming part

Just a few comments reading the page about functional programming, not an issue per se:

  • When speaking about recursive functions, I am wondering if it could be interesting to mention the special case of tail recursion. I have found the tailr package to manage the tail call optimization of these functions, but it does not seem to be maintained anymore.
  • What about curried functions in R?
  • As I understood, most data structures in R are immutable. Might be interesting to mention it?
  • When you mention the maybe monad, I would have been interested to see how you manage this in the frame of a pipe, how you can substitute a default value to Nothing(), etc.
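
To make two of these points concrete, here is a small base R sketch, not taken from the book. The function names are hypothetical. It shows a hand-written curried function and the copy-on-modify behaviour that makes most R objects act as if they were immutable when passed to a function.

```r
# Currying by hand: a function of x that returns a function of y.
# `add` and `add_five` are hypothetical names used only for this sketch.
add <- function(x) function(y) x + y
add_five <- add(5)
sum_result <- add_five(3)

# Copy-on-modify: changing an argument inside a function works on a copy,
# so the caller's object is left untouched.
set_first_to_100 <- function(v) {
  v[1] <- 100
  v
}

original <- c(1, 2, 3)
changed <- set_first_to_100(original)
```

After running this, original still holds 1, 2, 3 while changed starts with 100, which is the behaviour the immutability remark refers to.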

some issues on the page about git

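On the page about git, you tell people to use git add . then git commit -am. But git commit -am already stages files by itself: -a = all (stage every modified or deleted tracked file, though not new untracked ones) and -m = message. So for tracked files it corresponds to git add -u followed by git commit -m, and the preceding git add . is only needed to pick up new files.
There is also a confusion between git am, which applies a series of patches coming from a mailbox, and git commit -am.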

For more info you can use git help am and git help commit.

PRs preferences

Do you prefer a single PR for every single change, or one PR for all changes I make to a single file?

draft outline

Purpose of the book: teach practitioners (in research or industry, doesn’t matter), how to make workflows reproducible. Do we agree on that?

As for an outline:
I think that we could skip any intro to R, and state that readers need to be familiar with R already. I would say at least comfortable with writing functions already?

  • Chapter 1: Functional programming primer
  • Chapter 2: Git (should we keep this, or state that readers need to be familiar with it already?)
  • Chapter 3: Literate programming with Quarto (I guess we need a separate chapter for this)
  • Chapter 4: Package dev (with fusen, question to Sébastien: does fusen work with qmd files? since we’re teaching quarto it would be nice to stay in quarto, if possible. what do you think?)
  • Chapter 5: Unit testing (in fusen it means writing meaningful examples)
  • Chapter 6: Targets (including renv, or should renv be a separate chapter?)
  • Chapter 7: Make it all reproducible (using Docker, and PROPRE? PROPRE inside Docker?)
  • Chapter 8: CI/CD with GitHub Actions

What do you think?

Typo - Section 1.5

and depending on the constraints you face your project can not very reproducible

Thank you for the book! I'm loving the message so far. I'm always trying to explain this concept to newer R-programmers reluctant to learn git or document code, and now I have a much better resource :)

I know you are still working on it, but I figured I'd point out anything I see to save you some time. I don't mean to nitpick, just trying to help.

Anyway, the line above is missing a "be".

So what does this all mean? This means that reproducibility is on a continuum,
and depending on the constraints you face your project can be not very reproducible
to totally reproducible.
