
ds-pipelines-targets-1's Introduction

This is a template repo for coursework on pipelines using the targets package


ds-pipelines-targets-1's Issues

Create a branch in your code repository

You'll be revising files in this repository shortly. To follow our team's standard git workflow, you should first clone this training repository to your local machine so that you can make file changes and commits there.

Open a git bash shell (Windows) or a terminal window (Mac) and change (cd) into the directory you work in for projects in R (for me, this is ~/Documents/R). There, clone the repository and set your working directory to the new project folder that was created:

git clone git@github.com:jread-usgs/ds-pipelines-targets-1.git
cd ds-pipelines-targets-1

Now you should create a local branch called "structure" and push that branch up to the "remote" location (which is the github host of your repository). We're naming this branch "structure" to represent concepts in this section of the lab. In the future you'll probably choose branch names according to the type of work they contain - for example, "pull-oxygen-data" or "fix-issue-17".

git checkout -b structure
git push -u origin structure

By using checkout, you have switched your local branch from "main" to "structure", and any changes you make from here on out to tracked files will not show up on the main branch. To take a look back at "main", you can always use git checkout main and return to "structure" with git checkout structure. We needed the -b flag initially because we wanted to combine two operations - creating a new branch (-b) and switching to that new branch (checkout).

While you are at it, this is a good time to invite a few collaborators to your repository, which will make it easier to assign them as reviewers in the future. In the ⚙️ Settings widget at the top of your repo, select "Manage access". Go ahead and invite your course contact(s). It should look something like this:
[Screenshot: "add some friends" - inviting collaborators under Manage access]

Close this issue when you've successfully pushed your branch to the remote and added some collaborators. (A successful push of the branch will produce a message like: "Branch 'structure' set up to track remote branch 'structure' from 'origin'.")


I'll send you to the next issue once you've closed this one.

Get started with USGS Data Science pipelines

Data analyses are often complex. Data pipelines are ways of managing that complexity. Our data pipelines have two foundational pieces:

  • Good organization of code scripts helps you quickly find the file you need, whether you or a teammate created it.

  • Dependency managers such as remake, scipiper, snakemake, drake, and targets formalize the relationships among the datasets and functions to ensure reproducibility while minimizing unnecessary runtime as you create or modify parts of the pipeline.

⌨️ Activity: Assign yourself to this issue to get started.

💡 Tip: Throughout this course, I, the Learning Lab Bot, will reply and direct you to the next step each time you complete an activity. But sometimes I'm too fast when I ⏳ give you a reply, and occasionally you'll need to refresh the current GitHub page to see it. Please be patient, and let your course contact know if I seem to have become completely stuck.


I'll sit patiently until you've assigned yourself to this one.

The anatomy of a makefile

Our targets pipelines in R use a makefile to orchestrate the connections among files, functions, and phases. In this issue, we're going to develop a basic understanding of how these files work, starting with the anatomy of the _targets.R file.

Setting up a targets data science pipeline

In addition to phases (which we covered in #3 (comment)), it is important to decompose high-level concepts (or existing scripts) into thoughtful functions and "targets" that form the building blocks of data processing pipelines. A target is a noun we use to describe a tangible output of a function, which is often a file or an R object. Targets can be used as an end-product (like a summary map) or as input into another function to create another target.

To set up a targets pipeline, you will need to create the base makefile named _targets.R that will declare and orchestrate the rest of the pipeline connections.


A simple version of _targets.R might look something like this:

library(targets)
source("code.R")
tar_option_set(packages = c("tidyverse", "stringr", "sbtools", "whisker"))

list(
  tar_target(
    model_RMSEs_csv,
    download_data(out_filepath = "model_RMSEs.csv"),
    format = "file"
  ), 
  tar_target(
    eval_data,
    process_data(in_filepath = model_RMSEs_csv)
  ),
  tar_target(
    figure_1_png,
    make_plot(out_filepath = "figure_1.png", data = eval_data), 
    format = "file"
  )
)

This file defines the relationships between different "targets" (see how the target model_RMSEs_csv is an input to the command that creates the target eval_data?), tells us where to find any functions that are used to build targets (see the source call that points you to code.R), and declares the package dependencies needed to build the different targets (see the tar_option_set() command that passes in a vector of packages).

We'll briefly explain some of the functions and conventions used here. For more extensive explanations, visit the targets documentation.

  • As you would with normal R scripts, put any source commands for loading R files and library commands for loading packages at the top of the file. The packages loaded here should be only those needed to build the targets plan; packages needed to build specific targets can be loaded later.
  • Declare each target by using the function tar_target() and passing in a target name (name arg) and the expression to run to build the target (command arg).
  • There are two types of targets - objects and files. If your target is a file, you need to add format = "file" to your tar_target() call and the command needs to return the filename of the new file.
  • Set up the full pipeline by combining all targets into a single list object.
  • There are two ways to define the packages used to build targets: 1) declare them with the packages argument of tar_option_set() in your makefile to specify packages used by all targets, or 2) use the packages argument of individual tar_target() calls for packages that are specific to those targets.
  • model_RMSEs_csv shows up two times - why? model_RMSEs_csv is the name of a target that creates the file model_RMSEs.csv when the command download_data() is run. When passed as input to other functions (unquoted), it represents the filename of the file that was created when the target was built. So when model_RMSEs_csv shows up as an argument to another function, process_data(), it is really passing in the filename. The process_data() function then reads the file and changes the data (or "processes" it) in some way. (A hypothetical sketch of these two functions appears below.)
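
To make the file-target convention concrete, here is a minimal, hypothetical sketch of what code.R could contain for those first two targets. The function names mirror the pipeline above, but the ScienceBase item ID is a placeholder and the course's actual functions may differ:

library(sbtools)
library(readr)

# File target: download the CSV and return its path, because a
# format = "file" target's command must return the filename it created.
download_data <- function(out_filepath) {
  item_file_download(
    sb_id = "YOUR_SCIENCEBASE_ITEM_ID",  # placeholder, not a real item ID
    names = basename(out_filepath),
    destinations = out_filepath,
    overwrite_file = TRUE
  )
  return(out_filepath)
}

# Object target: read the file produced by download_data() and return a
# data frame for downstream targets to use.
process_data <- function(in_filepath) {
  read_csv(in_filepath)
}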

We're going to start with this simple example and modify it to match our pipeline structure. This will involve creating a new branch, creating a new file, adding that file to git tracking, and opening a new pull request that includes the file:

⌨️ Activity: get your code plugged into a makefile

First things first: We're going to want a new branch. You can delete your previous one, since that pull request was merged.

git checkout main
git pull
git branch -d structure
git checkout -b makefile
git push -u origin makefile 

Next, create the file with the contents we've given you by entering the following from your repo directory in terminal/command line:

cat > _targets.R
library(targets)
source("code.R")
tar_option_set(packages = c("tidyverse", "stringr", "sbtools", "whisker"))

list(
  # Get the data from ScienceBase
  tar_target(
    model_RMSEs_csv,
    download_data(out_filepath = "model_RMSEs.csv"),
    format = "file"
  ), 
  # Prepare the data for plotting
  tar_target(
    eval_data,
    process_data(in_filepath = model_RMSEs_csv)
  ),
  # Create a plot
  tar_target(
    figure_1_png,
    make_plot(out_filepath = "figure_1.png", data = eval_data), 
    format = "file"
  ),
  # Save the processed data
  tar_target(
    model_summary_results_csv,
    write_csv(eval_data, file = "model_summary_results.csv"), 
    format = "file"
  ),
  # Save the model diagnostics
  tar_target(
    model_diagnostic_text_txt,
    generate_model_diagnostics(out_filepath = "model_diagnostic_text.txt", data = eval_data), 
    format = "file"
  )
)

Then use Ctrl+D to exit the file-creation mode and return to the prompt.


Finally, create a pull request that includes this new file (the file should be called _targets.R).


When I see your pull request, I'll make some in-line suggestions for next steps.

What's next

You are doing a great job, @jread-usgs! 🌟 💥 🐠

But you may be asking why we asked you to go through all of the hard work of connecting functions, files, and targets together using a makefile. We don't blame you for wondering...


The real power of dependency management shows up when something changes - that's the EUREKA! moment - but we haven't yet put you in a situation where it would appear. That will come further down the road in later training activities and in the project work you'll be exposed to.

In the meantime, here are a few nice tricks you can try now that you have a functional pipeline.

  • run tar_make() again. What happens? Hopefully not much. I see this:
    make all is fresh
    which means everything is up to date, so all targets are :OK:

  • now try making a change to one of your functions in your code. What happens after running tar_make() then?

  • access the eval_data target by using tar_load(eval_data). (You may or may not have an R-object target named eval_data in your own repo at this point, so go ahead and try it with some target that you do have.) In this example, we have passed the unquoted target name eval_data to tar_load(), which creates a data.frame object in our environment called eval_data, because that's what our example function process_data() creates. If you load a file target, like tar_load(model_RMSEs_csv), the resulting object in your environment is a character vector with the path to the target's file. (See the short usage sketch after this list.)

  • now try making a change to the template_1 variable in your function that creates the .txt file. What happens after running tar_make() then? Which targets get rebuilt and which do not?
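
For reference, here is a hypothetical interactive session that exercises the tips above. It assumes the example target names used earlier in this course; substitute targets that actually exist in your own pipeline:

library(targets)

tar_make()                  # builds (or skips) everything declared in _targets.R

tar_load(eval_data)         # object target: creates a data.frame named eval_data
head(eval_data)

tar_load(model_RMSEs_csv)   # file target: creates a character vector of the path
model_RMSEs_csv
# [1] "model_RMSEs.csv"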


Lastly, imagine the following comment appeared on your pull request.

Oh shoot @jread-usgs, I am using your results for FANCY BIG PROJECT and I have coded everything to assume your outputs use a character for the experiment number (the exper_n column), of the form "01", "02", etc. It looks like you are using numbers. Can you update your code accordingly?

Would your code be easy to adjust to satisfy this request? Would you need to re-run any steps that aren't associated with this naming choice? Did the use of a dependency management solution allow you to both make the change efficiently (i.e., by avoiding rebuilding any unnecessary parts of the pipeline) and increase your confidence in delivering the results?
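
For what it's worth, one way to satisfy a request like that - assuming, hypothetically, that exper_n is a numeric column created inside process_data() - is a one-line change such as:

library(dplyr)
library(readr)

# Hypothetical revision of process_data(): store the experiment number as a
# zero-padded character ("01", "02", ...) instead of a numeric value.
process_data <- function(in_filepath) {
  read_csv(in_filepath) %>%
    mutate(exper_n = sprintf("%02d", exper_n))
}

Because only process_data() changes, tar_make() should rebuild eval_data and the targets downstream of it while skipping the download step entirely.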


You have completed Introduction to Pipelines I. Great work!

Below you will find some quick links that can help you review the content covered here.

Organize your project files

You should organize your code into functions, targets, and conceptual "phases" of work.

Often we create temporary code or are sent scripts that look like my_work_R/my_happy_script.R in this repository. Take a minute to look through that file now.

This code has some major issues: it uses a directory that is specific to one user, it writes a plot to a location outside the project, and its structure makes it hard to figure out what is happening. This simple example is a starting point for understanding the investments we make to move toward code that is more reproducible, shareable, and understandable. Additionally, we want to structure our code and our projects in a way that lets us build on top of them as the projects progress.
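
As a rough, hypothetical illustration of the direction this course heads (not the course's actual solution), compare a hard-coded, user-specific style with a small function that takes project-relative paths; the column names below are placeholders:

# Before (typical of a one-off script): user-specific paths and no functions.
#   setwd("C:/Users/jdoe/my_work_R")
#   png("C:/Users/jdoe/Desktop/figure_1.png")
#   plot(...)
#   dev.off()

# After: a small function that takes a project-relative output path, returns
# the file it created, and can therefore serve as a file target in a pipeline.
make_plot <- function(out_filepath, data) {
  png(out_filepath)
  plot(data$experiment, data$rmse)  # placeholder column names
  dev.off()
  return(out_filepath)
}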

Assign this issue to yourself and then we'll get started on code and project structures.


I'll sit patiently until you've assigned yourself to this one.

Why use a dependency manager?

We're asking everyone to invest in the concepts of reproducibility and efficiency of reproducibility, both of which are enabled via dependency management systems such as remake, scipiper, drake, and targets.

Background

We hope that the case for reproducibility is clear - we work for a science agency, and science that can't be reproduced does little to advance knowledge or trust.

But the investment in efficiency of reproducibility is harder to boil down into a zingy one-liner. Many of us have embraced this need because we have been bitten by issues in our real-world collaborations, and found that data science practices and a reproducibility culture offer great solutions. Karl Broman is an advocate for reproducibility in science and is faculty at UW-Madison. He has given many talks on the subject, and we're going to ask you to watch part of one of them so you can be exposed to some of Karl's science challenges and solutions. Karl will be talking about GNU make, which is the inspiration for almost every modern dependency tool that we can think of. Click on the image to kick off the video.

[Video: reproducible workflows with make]

💻 Activity: Watch the above video on make and reproducible workflows up to the 11-minute mark (you are welcome to watch more).

Use a GitHub comment on this issue to let us know what you thought was interesting about these pipeline concepts using no more than 300 words.


I'll respond once I spot your comment (refresh if you don't hear from me right away).
