ds-pipelines-2's Issues

Learn the differences between types of targets

remake is the R package that underlies many of scipiper's functions. Here we've borrowed some text from the remake GitHub repo (credit to richfitz, although we've lightly edited the original text) to explain the differences between types of targets.

Targets

"Targets" are the main things that remake interacts with. They represent things that are made (they're also the vertices of the dependency graph). If you want to make a plot called plot.pdf, then that's a target. If you depend on a dataset called data.csv, that's a target (even if it already exists).

There are several types of targets:

  • files: The name of a file target is the same as its path. Something is actually stored in the file, and it's possible for the file contents to be modified outside of remake (files are the main types of targets that make deals with, since it is language agnostic). Within files, there are two sub-types:
    • implicit: these are file targets that are depended on somewhere in your process, but for which no rule to build them exists (i.e., there is no command in a remakefile). You can't build these, of course, but remake creates an implicit target for each one so that it can internally monitor changes to that file.
    • explicit: these are the file targets that are built by rules that were defined within your pipeline (i.e., command-to-target recipe exists in a remakefile).
  • objects: These are R objects that represent intermediate objects in an analysis. However, these objects are transparently stored to disk so that they persist across R sessions. Unlike actual R objects, though, they won't appear in your workspace, and a little extra work is required to get at them.
  • fake: Fake targets are simply pointers to other targets (in make these are "phony" targets). The all target, which depends on all the "end points" of your analysis, is a "fake" target. Running scmake("all") will build all of your targets, or verify that they are up to date. (A minimal remakefile sketch showing all three flavors follows this list.)
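Here's a minimal sketch of our own (clean and make_plot are hypothetical functions, not from the remake docs) that puts all three flavors in one remakefile:

sources:
  - R/functions.R

targets:
  all:                       # fake target: just a pointer to the analysis end points
    depends:
      - figures/plot.pdf

  clean_data:                # object target: the command's return value, stored to disk by remake
    command: clean("data.csv")    # "data.csv" is an implicit file target (no rule builds it)

  figures/plot.pdf:          # explicit file target: its name is its path
    command: make_plot(clean_data)
    plot: true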

⌨️ Activity: Assign yourself to this issue to get started.


I'll sit patiently until you've assigned yourself to this one.

Strategies for defining targets in data pipelines

How to decide how many targets to use and how to define them

Background

Isn't it satisfying to work through a fairly lengthy data workflow and then return to the project later to find that it just works? For the past few years, we have been capturing the steps that go into creating results, figures, or tables that appear in data visualizations or research papers. There are recipes for reproducibility behind complex, interactive data visualizations, such as
this water use data viz


Here is a much simpler example that was used to generate Figure 1 from Water quality data for national-scale aquatic research: The Water Quality Portal (published in 2017):

packages:
  - rgeos
  - dplyr
  - rgdal
  - httr
  - yaml
  - RColorBrewer
  - dataRetrieval
  - lubridate
  - maptools
  - maps
  - sp
  
## All R files that are used must be listed here:
sources:
  - R/wqp_mapping_functions.R
  - R/readWQPdataPaged.R

targets:
  all:
    depends: 
      - figures/multi_panel_constituents.png
      
  map.config:
    command: yaml.load_file("configs/mapping.yml")
    
  wqp.config:
    command: yaml.load_file("configs/wqp_params.yml")
  
  huc.map:
    command: get_mutate_HUC8s(map.config)

  phosphorus_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)

  phosphorus_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
    
  nitrogen_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)
    
  nitrogen_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
    
  arsenic_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)
    
  arsenic_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
  
  chlorophyll_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)
    
  chlorophyll_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
  
  temperature_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)
    
  temperature_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
  
  doc_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)
    
  doc_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
    
  secchi_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
    
  secchi_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)
    
  glyphosate_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
    
  figures/multi_panel_constituents.png:
    command: plot_huc_panel(huc.map, map.config, target_name, arsenic_lakes, 
      arsenic_all, nitrogen_lakes, nitrogen_all, phosphorus_lakes, phosphorus_all, 
      secchi_lakes, secchi_all, temperature_lakes, temperature_all)
    plot: true

This remakefile recipe generates a multipanel map, which colors HUC8 watersheds according to how many sites within the watershed have data for various water quality constituents:
[Figure: multi_panel_constituents.png, the multi-panel map described above]


The "figures/multi_panel_constituents.png" figure takes a while to plot, so it is a somewhat "expensive" target to iterate on when it comes to style, size, colors, and layout (it takes 3 minutes to plot for me). But the plotting expense is dwarfed by the amount of time it takes to build each water quality data "object target", since get_wqp_data uses a web service that queries a large database and returns a result; the process of fetching the data can sometimes take over thirty minutes (nitrogen_all is a target that contains the locations of all of the sites that have nitrogen water quality data samples).

In contrast, the map.config* object above builds in a fraction of a second; it contains some simple information that is used to fetch and process the proper boundaries with the get_mutate_HUC8s function, and it includes some plotting details for the final map (such as the plotting color divisions specified by countBins):

[Screenshot: console output from building map.config]
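By the way, if you want to poke at an object target like map.config interactively (the "little extra work" mentioned earlier), remake, which scipiper builds on, provides a fetch helper. A minimal sketch, assuming your working directory is the project folder with its remake.yml:

# Build (or verify) just this one target, then read the stored object back
# into the session; object targets never appear in your workspace on their own.
scipiper::scmake("map.config")
map_config <- remake::fetch("map.config")
str(map_config)   # e.g., inspect countBins and the other plotting settings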

This example, although dated, represents a real project that caused us to think carefully about how many targets we use in a recipe and how complex their underlying functions are. Decisions related to targets are often motivated by the intent of the pipeline. In the case above, our intent at the time was to capture the data and processing behind the plot in the paper in order to satisfy our desire for reproducibility.


*Disclaimer: the code above was written before we'd completely transitioned away from naming variables like.this.

⌨️ Activity: Assign yourself to this issue to get started.


I'll sit patiently until you've assigned yourself to this one.

Exchange object and file targets in your pipelines

You should now have a working pipeline that can run with scmake(). Your current pipeline likely only has one file target, which is the final plot.

We want you to get used to exchanging objects for files and vice versa, in order to expose some of the important differences that show up in the remakefile and also in the way the functions are put together.
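To make the mechanics concrete before you start, here's a hedged sketch (summarize_model, its file-writing twin, and fitted_model are made-up names, not targets in your pipeline). As an object target, the command's return value is the target; as a file target, the target is the path, and the function has to write to it, with the remakefile supplying that path via target_name:

  model_summary:
    command: summarize_model(fitted_model)

  out/model_summary.csv:
    command: summarize_model_file(fitted_model, target_name)

And the corresponding R functions:

# Object-target version: just return the value; remake stores it for you
summarize_model <- function(fitted_model) {
  summary(fitted_model)
}

# File-target version: write to the path remake passes in; the file on disk is the target
summarize_model_file <- function(fitted_model, out_file) {
  smry <- as.data.frame(coef(summary(fitted_model)))
  write.csv(smry, out_file)
}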

⌨️ Activity: Open a PR where you swap two object targets to be file targets and change one file target to be an object target. Run scmake(), paste your build status as a comment on the PR, and assign Jordan or Alison as a reviewer.


I'll sit patiently until you open a new pull request

What's next

You are awesome, @RAtshan! 🌟 💥 🐠


We hope you've learned a lot in intro to pipelines II. We don't have additional exercises in this module, but we'd love to have a discussion if you have questions.

As a resource for later, here are links to the content you just completed

⌨️ Activity: If you have comments or questions, add them below and then assign a course lead this issue to engage in dialogue. When you are satisfied with the conversation, close this issue.

Refactor the existing pipeline to use more effective targets

⌨️ Activity: Make modifications to the working, but less than ideal, pipeline that exists within your course repository

Within the course repo you should see only a remake.yml and directories with code or placeholder files for each phase. You should be able to run scmake() and build the pipeline, although it may take numerous tries, since some parts of this new workflow are brittle. Some hints to get you started:

  • The site_data target is too big; consider splitting it into a target for each site, perhaps using the download_nwis_site_data() function directly to write a file (see the sketch after this list).
  • Several of the site_data_ targets are too small, and it might make sense to combine them.
  • Lastly, if it makes sense to use target_name, try using that in the remake.yml file too to simplify the formatting.
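For the first hint, here's a rough sketch of the shape such a split might take. The site numbers, output paths, and the combine_site_files helper are all hypothetical, and we're assuming download_nwis_site_data() will accept the output path it's handed via target_name:

  1_fetch/out/nwis_01427207_data.csv:
    command: download_nwis_site_data(target_name)

  1_fetch/out/nwis_01432160_data.csv:
    command: download_nwis_site_data(target_name)

  site_data:
    command: combine_site_files(
      "1_fetch/out/nwis_01427207_data.csv",
      "1_fetch/out/nwis_01432160_data.csv")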


When you are happy with your newer, better workflow, create a pull request with your changes and assign Jordan or Alison as reviewers. Add a comment to your own PR with thoughts on how you approached the task, as well as key decisions you made. See details below for some reminders of how to get started working with code and files that exist within the course repository:


Open a git bash shell (Windows💠) or a terminal window (Mac🍏) and change (cd) into the directory you work in for projects in R (for me, this is ~/Documents/R). There, clone the repository and set your working directory to the new project folder that was created:

git clone git@github.com:RAtshan/ds-pipelines-2.git
cd ds-pipelines-2

Now you should create a local branch called "targets" and push that branch up to the "remote" location (which is the GitHub host of your repository). We're naming this branch "targets" to represent concepts in this section of the lab. In the future you'll probably choose branch names according to the type of work they contain - for example, "pull-oxygen-data" or "fix-issue-17".

git checkout -b targets
git push -u origin targets

A human will interact with your pull request once you assign them as a reviewer

Overview of data science pipelines II

Welcome to the second installment of "introduction to data pipelines" at USGS, @RAtshan!! ✨

We're assuming you were able to navigate through the intro-to-pipelines course and that you learned a few things about organizing your code for readability, re-use, and collaboration. You were also introduced to two key things through the remake.yml: a way to program connections between functions and files, and the concept of a dependency manager that skips parts of the workflow that don't need to be re-run.


Recap of pipelines I

First, a recap of key concepts that came from intro-to-pipelines 👇

  • Data science work should be organized thoughtfully. As Jenny Bryan notes, "File organization and naming are powerful weapons against chaos".
  • Capture all of the critical phases of project work with descriptive directories and function names, including how you "got" the data (in practice, we often use fetch for this phase).
  • Turn your scripts into a collection of functions, and modify your thinking to connect deliberate outputs from these functions ("targets") to generate your final product.
  • "Skip the work you don't need" by taking advantage of a dependency manager. There were some videos that covered a bit of make and drake, and you were asked to experiment with scipiper.
  • Investing in efficient reproducibility helps projects scale up with confidence.

This last concept was not addressed directly, but we hope that the small exercise of seeing rebuilds in action got you thinking about projects that might have much more lengthy steps (e.g., several downloads or geo-processing tasks that take hours instead of seconds).

What's ahead in pipelines II

In this training, the focus will be on tricks and tips for making better, smarter pipelines. You'll learn new things here that will help you refine your knowledge from the first class and put it into practice. Let's get started!

⌨️ Activity: Add collaborators and close this issue to get started.

As with pipelines I, please invite a few collaborators to your repository so they can easily comment and review in the future. In the ⚙️ Settings widget at the top of your repo, select "Manage access" (or use this shortcut link). Go ahead and invite aappling-usgs and jread-usgs. It should look something like this:
[Screenshot: "Manage access" settings page with aappling-usgs and jread-usgs added as collaborators]

💡 Tip: Throughout this course, I, the Learning Lab Bot, will reply and direct you to the next step each time you complete an activity. But sometimes I'm too fast when I ⏳ give you a reply, and occasionally you'll need to refresh the current GitHub page to see it. Please be patient, and let my humans know (jread-usgs or aappling-usgs) if I seem to have become completely stuck.


I'll sit patiently until you've closed the issue.

How to get past the gotchas without getting gotten again

In this section, we're going to go one by one through a series of tips that will help you avoid common pitfalls (or gotchas!) in pipelines. These tips will help you in the next sections and in future work. A quick list of what's to come:

  • 🔍 How to debug in a pipeline
  • 👀 Visualizing and understanding the status of dependencies in a pipeline
  • 💬 which_dirty() and why_dirty() to further interrogate the status of pipeline targets
  • 🔃 What is a cyclical dependency and how do I avoid it?
  • ⚠️ Undocumented file output from a function
  • 📂 Using a directory as a dependency
  • 📋 Can I really only use filenames or object targets as arguments in pipeline functions? Understanding the I() helper
  • 🔓 The target_name special variable. Simplifying target↔️command relationships and reducing duplication (a quick sketch of these last two items follows this list)
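As a quick preview of those last two items (download_file and the URL below are made up for illustration): remake reads bare words in a command as target dependencies and quoted strings as file names, so the I() helper is how you pass a literal value through untouched, while target_name stands in for the name of the target being built:

  1_fetch/out/raw_data.csv:
    # I("...") passes the URL as a plain string rather than a file-target name;
    # target_name expands to "1_fetch/out/raw_data.csv"
    command: download_file(I("https://example.com/data.csv"), target_name)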

⌨️ Activity: Add a comment to this issue, and the bot will respond with the next topic.


I'll sit patiently until you comment
