ds-pipelines-targets-2's Issues

Local setup

Set up your local environment before continuing

During the course, we will ask you to build the pipeline, explore how to troubleshoot, and implement some of the best practices you are learning. To do this, you will work with the pipeline locally and commit/push your changes to GitHub for review.

See details below for how to get started working with code and files that exist within the course repository:


Open a git bash shell (Windows 💠) or a terminal window (Mac 🍏) and change (cd) into the directory you work in for projects in R (for me, this is ~/Documents/R). There, clone the repository and set your working directory to the new project folder that was created:

git clone git@github.com:jread-usgs/ds-pipelines-targets-2.git
cd ds-pipelines-targets-2

You can also open this project in RStudio by double-clicking the .Rproj file in the ds-pipelines-targets-2 directory.

⭐ Now you have the repository locally! Follow along with the commands introduced and make changes to the code as requested throughout the remainder of the course.


⌨️ close this issue to continue!

Overview of data science pipelines II

Welcome to the second installment of "introduction to data pipelines" at USGS, @jread-usgs!! ✨

We're assuming you were able to navigate through the intro-to-targets-pipelines course and that you learned a few things about organizing your code for readability, re-use, and collaboration. You were also introduced to two key concepts via the makefile: a way to program connections between functions, files, and phases, and a dependency manager that skips parts of the workflow that don't need to be re-run.


Recap of pipelines I

First, a recap of key concepts that came from intro-to-targets-pipelines 👇

  • Data science work should be organized thoughtfully. As Jenny Bryan notes, "File organization and naming are powerful weapons against chaos".
  • Capture all of the critical phases of project work with descriptive directories and function names, including how you "got" the data.
  • Turn your scripts into a collection of functions, and modify your thinking to connect outputs from these functions ("targets") to generate your final product.
  • "Skip the work you don't need" by taking advantage of a dependency manager. There was a video that covered a bit of make, and you were asked to experiment with targets.
  • Invest in efficient reproducibility to scale up projects with confidence.

This last concept was not addressed directly, but we hope that the small exercise of seeing rebuilds in action got you thinking about projects with much lengthier steps (e.g., several downloads or geo-processing tasks that take hours instead of seconds).
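
To make that concrete, here is a minimal sketch of a _targets.R file (the fetch_data() and plot_data() functions and the file name are made up for illustration, not part of the course repo). After a first tar_make() builds everything, running tar_make() again skips both targets, and editing only the plotting function causes only the plot target to rebuild:

library(targets)

# Assumed to define fetch_data() and plot_data() for this sketch
source("R/example_functions.R")

list(
  # Slow step: re-runs only if fetch_data() or its inputs change
  tar_target(raw_data, fetch_data()),
  # Fast step: re-runs when raw_data or plot_data() changes
  tar_target(data_plot, plot_data(raw_data))
)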

What's ahead in pipelines II

In this training, the focus will be on conventions and best practices for making better, smarter pipelines for USGS Data Science projects. You'll learn new things here that will help you refine your knowledge from the first class and put it into practice. Let's get started!

⌨️ Activity: Add collaborators and close this issue to get started.

As with pipelines I, please invite a few collaborators to your repository so they can easily comment and review in the future. In the ⚙️ Settings widget at the top of your repo, select "Manage access" (or use this shortcut link). Go ahead and invite your course instructor. It should look something like this:
[screenshot: add some friends]

💡 Tip: Throughout this course, I, the Learning Lab Bot, will reply and direct you to the next step each time you complete an activity. But sometimes I'm ⏳ slow to reply, and occasionally you'll need to refresh the current GitHub page to see my response. Please be patient, and let my human (your designated course instructor) know if I seem to have become completely stuck.


I'll sit patiently until you've closed the issue.

Strategies for defining targets in data pipelines

How to make decisions on how many targets to use and how targets are defined

We've covered a lot of content about the rules of writing good pipelines, but pipelines are also very flexible! Pipelines can have as many or as few targets as you would like, and targets can be as big or as small as you would like. The key theme for all pipelines is that they are reproducible codebases to document your data analysis process for both humans and machines. In this next section, we will learn about how to make decisions related to the number and types of targets you add to a pipeline.

Background

Isn't it satisfying to work through a fairly lengthy data workflow, return to the project later, and have it all just work? For the past few years, we have been capturing the steps that go into creating results, figures, or tables appearing in data visualizations or research papers. There are recipes for reproducibility used in complex, collaborative modeling projects, such as this reservoir temperature modeling pipeline and this pipeline to manage downloads of forecasted meteorological driver data. Note that you need access to internal USGS websites to see these examples, and because they were developed early in Data Science's adoption of targets, they may not showcase all of our current best practices.


Here is a much simpler example that was used to generate Figure 1 from Water quality data for national‐scale aquatic research: The Water Quality Portal (published in 2017):

library(targets)

## All R files that are used must be listed here:
source("R/get_mutate_HUC8s.R")
source("R/get_wqp_data.R")
source("R/plot_huc_panel.R")

tar_option_set(packages = c("dataRetrieval", "dplyr", "httr", "lubridate", "maps",
                            "maptools", "RColorBrewer", "rgeos", "rgdal", "sp", "yaml"))

# Load configuration files
p0_targets_list <- list(
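  # format = "file" tells targets to track the file's contents, so edits to the YAML rebuild everything downstream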
  tar_target(map_config_yml, "configs/mapping.yml", format = "file"),
  tar_target(map_config, yaml.load_file(map_config_yml)),
  tar_target(wqp_config_yml, "configs/wqp_params.yml", format = "file"),
  tar_target(wqp_config, yaml.load_file(wqp_config_yml))
)

# Fetch data
p1_targets_list <- list(
  tar_target(huc_map, get_mutate_HUC8s(map_config)),
  tar_target(phosphorus_lakes, get_wqp_data("phosphorus_lakes", wqp_config, map_config)),
  tar_target(phosphorus_all, get_wqp_data("phosphorus_all", wqp_config, map_config)),
  tar_target(nitrogen_lakes, get_wqp_data("nitrogen_lakes", wqp_config, map_config)),
  tar_target(nitrogen_all, get_wqp_data("nitrogen_all", wqp_config, map_config)),
  tar_target(arsenic_lakes, get_wqp_data("arsenic_lakes", wqp_config, map_config)),
  tar_target(arsenic_all, get_wqp_data("arsenic_all", wqp_config, map_config)),
  tar_target(temperature_lakes, get_wqp_data("temperature_lakes", wqp_config, map_config)),
  tar_target(temperature_all, get_wqp_data("temperature_all", wqp_config, map_config)),
  tar_target(secchi_lakes, get_wqp_data("secchi_lakes", wqp_config, map_config)),
  tar_target(secchi_all, get_wqp_data("secchi_all", wqp_config, map_config))
)

# Summarize the data in a plot
p2_targets_list <- list(
  tar_target(
    multi_panel_constituents_png,
    plot_huc_panel(
      "figures/multi_panel_constituents.png", huc_map, map_config, 
      arsenic_lakes, arsenic_all, nitrogen_lakes, nitrogen_all, 
      phosphorus_lakes, phosphorus_all, secchi_lakes, secchi_all, 
      temperature_lakes, temperature_all
    ),
    format = "file")
)

# Combine all targets into a single list
c(p0_targets_list, p1_targets_list, p2_targets_list)
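
To build the figure from this recipe you would run the pipeline from the project root. This is a sketch of standard targets usage rather than commands recorded from the original project:

library(targets)
tar_make()                               # builds only what is out of date
tar_visnetwork()                         # inspect the dependency graph and each target's status
tar_read(multi_panel_constituents_png)   # returns the path to the finished figure file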

This makefile recipe generates a multipanel map, which colors HUC8 watersheds according to how many sites within the watershed have data for various water quality constituents:
[figure: multi_panel_constituents]


The "figures/multi_panel_constituents.png" figure takes a while to plot, so it is a somewhat "expensive" target to iterate on when it comes to style, size, colors, and layout (it takes 3 minutes to plot for me). But the plotting expense is dwarfed by the amount of time it takes to build each water quality data "object target", since get_wqp_data uses a web service that queries a large database and returns a result; the process of fetching the data can sometimes take over thirty minutes (nitrogen_all is a target that contains the locations of all of the sites that have nitrogen water quality data samples).

In contrast, the map_config object above builds in a fraction of a second. It contains simple information that is used to fetch and process the proper boundaries with the get_mutate_HUC8s function, as well as some plotting details for the final map (such as color divisions).

This example, although dated, represents a real project that caused us to think carefully about how many targets we use in a recipe and how complex their underlying functions are. Decisions related to targets are often motivated by the intent of the pipeline. In the case above, our intent at the time was to capture the data and processing behind the plot in the paper in order to satisfy our desire for reproducibility.
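
As a quick sketch of why this granularity pays off (assuming the pipeline above has already been built once): after editing plot_huc_panel(), only the figure target is out of date, so a rebuild redraws the map in a few minutes without repeating the half-hour Water Quality Portal downloads.

library(targets)
tar_outdated()   # after editing plot_huc_panel(), expect only multi_panel_constituents_png here
tar_make()       # the slow get_wqp_data targets are skipped; only the figure is redrawn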


⌨️ Activity: Assign yourself to this issue to get started.


I'll sit patiently until you've assigned yourself to this one.

How to get past the gotchas without getting gotten again

In this section, we're going to go one by one through a series of tips that will help you avoid common pitfalls in pipelines. These tips will help you in the next sections and in future work. A quick list of what's to come:

  • 🔍 How to debug in a pipeline
  • 👀 Visualizing and understanding the status of dependencies in a pipeline
  • 💬 tar_visnetwork() and tar_outdated() to further interrogate the status of pipeline targets
  • 🔃 What is a cyclical dependency and how do I avoid it?
  • ⚠️ Undocumented file output from a function
  • 📂 Using a directory as a dependency
  • 📋 How do I know when to use an object vs a file target or even use a target at all?
  • ⚙️ USGS Data Science naming conventions
  • 🔓 Final tips for smart pipelining

⌨️ add a comment to this issue and the bot will respond with the next topic


I'll sit patiently until you comment
