ds-pipelines-3's Issues

Recognize the unique demands of data-rich pipelines

In this course you'll learn about tools for data-intensive pipelines: how to download many datasets, run many models, or make many plots. Whereas a for loop would work for many quick tasks, here our focus is on tools for managing sets of larger tasks that each take a long time and/or are subject to occasional failure.

A recurring theme in this activity will be the split-apply-combine paradigm, which has many implementations in many software languages because it is so darn useful for data analyses. It works like this:

  1. Split a large dataset or list of tasks into logical chunks, e.g., one data chunk per lake in an analysis of many lakes.
  2. Apply an analysis to each chunk, e.g., fit a model to the data for each lake.
  3. Combine the results into a single orderly bundle, e.g., a table of fitted model coefficients for all the lakes.
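
A minimal sketch of the paradigm in R using dplyr, with a made-up lake temperature dataset (the data and column names are invented purely for illustration):

library(dplyr)

# invented example data: daily temperature observations for several lakes
lake_temps <- tibble(
  lake = rep(c('Mendota', 'Monona', 'Wingra'), each = 100),
  doy = rep(1:100, times = 3),
  temp_C = rnorm(300, mean = 15, sd = 3)
)

lake_coefs <- lake_temps %>%
  group_by(lake) %>%                                 # Split: one chunk per lake
  summarize(
    warming_rate = coef(lm(temp_C ~ doy))[['doy']],  # Apply: fit a model per lake
    .groups = 'drop'
  )                                                  # Combine: one row per lake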

There can be variations on this basic paradigm, especially for larger projects:

  1. The choice of how to Split an analysis can vary - for example, it might be fastest to download data for chunks of 100 sites rather than downloading all 10000 sites at once or downloading each site independently.
  2. Sometimes we have several Apply steps - for example, for each site you might want to munge the data, fit a model, extract the model parameters, and make a diagnostic plot specific to that site.
  3. The Combine step isn't always necessary to the analysis - for example, we may prefer to publish a collection of plot .png files, one per site, rather than combining all the site plots into a single unwieldy report file. That said, we may still find it useful to also create a table summarizing which plots were created successfully and which were not.

Assign yourself to this issue to explore the split-apply-combine paradigm further.


I, the Learning Lab Bot, will sit patiently until you've assigned yourself to this issue.

Please remember to be patient with me, too - sometimes I need a few seconds and/or a refresh before you'll see my response.

Scale up

Your pipeline is looking great, @wdwatkins! It's time to put it through its paces and experience the benefits of a well-plumbed pipeline. The larger your pipeline becomes, the more useful the tools you've learned in this course will be.

In this issue you will:

  • Expand the pipeline to include all of the U.S. states and some territories
  • Learn one more scipiper tool, the loop_tasks() function
  • Modify the pipeline to describe temperature sites instead of discharge sites

⌨️ Activity: Check for scipiper updates

Before you get started, make sure you have the most up-to-date version of scipiper:

packageVersion('scipiper')
## [1] ‘0.0.20’

You should have package version >= 0.0.20. If you don't, reinstall with:

remotes::install_github('USGS-R/scipiper')

Appliers

Your pipeline is looking pretty good! Now it's time to add complexity. I've just added these two files to the repository:

  • 2_process/src/tally_site_obs.R
  • 3_visualize/src/plot_site_data.R

In this issue you'll add these functions to the task table in the form of two new steps.

Background

The goal of this issue is to expose you to multi-step task tables. They're not hugely different from the single-step task table we've already implemented, but there are a few details that you haven't seen yet. The changes you'll make for this issue will also set us up to touch on some miscellaneous pipeline topics. Briefly, we'll cover:

  • The syntax for adding two steps to a task table
  • How to declare dependencies among steps within a task table
  • A quick look at / review of split-apply-combine with the lightweight dplyr syntax
  • Uses and abuses of the force = TRUE argument to scmake() in the task-table context

Digging deeper into create_task_step()

The create_task_step() function offers a lot of flexibility if you're willing to figure out the syntax. Specifically, you might have noticed the ... in the arguments list of each function you pass to create_task_step(). You don't get to decide what's available in the ...s, but you do get to decide whether you use that information.

So far, the only named argument you've likely used is task_name, but other named arguments are passed in to the target_name and command functions when the task step is turned into a task plan with create_task_plan(). These other arguments allow you to look up other information about this task-step or about previous steps within the task.

  • If you're writing a target_name, you have access to the evaluated step_name for this task-step. Therefore, your function definition can be function(...) or function(step_name, ...) depending on whether you plan to use the step_name information or not. It'll get passed in regardless of whether you include that argument in the definition.
  • If you're writing a depends function, you have access to the evaluated target_name and step_name. Therefore, your function definition can be either of these: function(target_name, ...), function(target_name, step_name, ...)
  • If you're writing a command function, you have access to the evaluated target_name, step_name, and depends. Therefore, your function definition can be any of these: function(target_name, ...), function(target_name, step_name, ...), or function(target_name, step_name, depends, ...)
  • You can also get the evaluated values of step_name, target_name, depends, or command for any of the preceding steps for this task; all of these are available elements of the steps argument, which is a list named steps. Therefore, your function definition can be any of these: function(steps, ...), function(target_name, steps, ...), function(target_name, step_name, steps, ...), or function(target_name, step_name, depends, steps, ...). This is most useful if you're defining the 2nd, 3rd, etc. step, because then there's information to look back at. For example, if step 2 uses the target from step 1 as input, and if step 1 is named "prepare", then your definition for step 2 might look something like this:
step2 <- create_task_step(
  step_name = 'plot',
  target_name = function(task_name, ...) {
    sprintf("out/%s.png", task_name)
  },
  command = function(steps, ...) {
    sprintf('visualize(\'%s\', target_name)', steps[['prepare']]$target_name)
  }
)
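
The depends function follows the same pattern. Here's a hedged variation on the example above (the 'plot_config' object is a hypothetical extra input, not something in the course repository), which declares a dependency that never appears in the command itself:

step2_with_config <- create_task_step(
  step_name = 'plot',
  target_name = function(task_name, ...) {
    sprintf("out/%s.png", task_name)
  },
  depends = function(target_name, step_name, ...) {
    # an extra dependency beyond the arguments already named in the command
    'plot_config'
  },
  command = function(steps, ...) {
    sprintf('visualize(\'%s\', target_name)', steps[['prepare']]$target_name)
  }
)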

You can always reference this information in the task steps vignette when you're working on real-world pipelines.

Meet the example problem

It's time to meet the data analysis challenge for this course! Over the next series of issues, you'll connect with the USGS National Water Information System (NWIS) web service to learn about some of the longest-running monitoring stations in USGS streamgaging history.

The repository is already set up with a basic scipiper data pipeline that:

  • Queries NWIS to find the oldest discharge gage in each of three Upper Midwest states
  • Maps the state-winner gages

Combiners

So far we've implemented split and apply operations; now it's time to explore combine operations in scipiper pipelines.

In this issue you'll add two combiners to serve different purposes - the first will combine all of the annual observation tallies into one giant table, and the second will summarize the set of state-specific timeseries plots generated by the task table. We'll use the second combiner in a target within the main remake.yml, downstream of the whole task table, to illustrate how task tables can fit into the flow of a longer pipeline.

Background

Approach

Combiners in scipiper are functions that accept a set of task-step targets as arguments and produce a single file or R object as output. We define their corresponding targets within the task remakefile (as opposed to within the main remake.yml) for several reasons:

  1. Only the task remakefile actually knows the identities and locations of the task steps. This information is unavailable to the main pipeline, which you can confirm by running remake::diagram(remake_file='remake.yml') and searching in vain for targets such as WI_tally:

[Figure: dependency diagram from remake::diagram(remake_file='remake.yml'), showing only the main pipeline's targets]

  2. For the same reasons that purrr and dplyr allow you to implement split, apply, and combine operations all in a single expression, it's conceptually tidier to think of those three operations as a bundle within scipiper pipelines, and therefore to code them all in the same place.

  3. The number of inputs to a combiner should change if the number of tasks changes. If we hand-coded a combiner target with a command function that accepted a set of inputs (e.g., command: combine_tallies(WI_tally, MI_tally, [etc])), we'd need to manually edit the inputs to that function anytime we changed the states vector. That would be a pain and would make our pipeline susceptible to human error if we forgot or made a mistake in that editing. It's safer if we can automatically generate that list of inputs along with the rest of the task remakefile.

Implementation

The scipiper way to use combiners is to work with the finalize_funs, final_targets, as_promises, and tickquote_combinee_objects arguments to create_task_makefile(). Once you've set up these arguments properly, create_task_makefile() will write combiner targets into your task remakefile for you, autopopulating the arguments to match the final step of each task.

  • The finalize_funs argument is a vector of one or more function names, each one of which will be called as the command to create a combiner target. Note that you don't get to specify anything else about your combiners here except their names. You can write your own combiner function or you can use the built-in combine_to_ind() combiner for a common type of combining (you'll see when we try it out). If you write your own combiner function, it should be defined within one of the sources or packages specified in create_task_makefile().

  • You don't get choices about the list of arguments that a combiner function will accept: If the output of your combiner will be a file, your combiner function must accept arguments that are a summary filename, the output from task 1, the output from task 2, and so on, in that order, and your combiner must write the output to the filename given by the summary filename argument. If the output of your combiner will be an R object, your combiner function should skip straight to the outputs from the tasks (and not accept an initial filename argument) and should return an R object. The declaration of the combiner function should therefore either be function(out_file, ...) or function(...) (where the name of the outfile argument is up to you).

  • The final_targets argument is a vector of target names for your one or more combiner targets. This argument's length needs to exactly match that of finalize_funs, and the two will be mapped 1:1. If a finalize_fun (combiner) will return a file, the corresponding final_targets value should be a filename. If the combiner will return an R object, the final_targets value should be a valid object target name.

  • The utility of as_promises=TRUE will become clearer with an illustration shortly. For now you can use as_promises=FALSE.

  • We've been working with tickquote_combinee_objects=FALSE largely because you haven't had any combiners yet. Once you add them, you should pretty much always use tickquote_combinee_objects=TRUE. This argument helps create_task_makefile() format the remakefile correctly across a range of possible apply and combine operations.
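
To make those arguments concrete, here is a hedged sketch of a pair of combiners and the corresponding create_task_makefile() call. The combiner names, file paths, and the pre-existing task_plan object are assumptions for illustration, not the course's official solution:

# Object combiner: no file argument; returns an R object built from the
# per-task outputs passed in via ...
combine_obs_tallies <- function(...) {
  dplyr::bind_rows(...)
}

# File combiner: the first argument is the summary file this function must write
summarize_plot_files <- function(out_file, ...) {
  plot_files <- c(...)
  readr::write_csv(tibble::tibble(plot_file = plot_files), out_file)
}

# create_task_makefile() writes one combiner target per finalize_fun,
# autopopulating its inputs to match the final step of each task. In a real
# pipeline the combiner definitions above would live in one of the 'sources' files.
create_task_makefile(
  task_plan = task_plan,
  makefile = '123_state_tasks.yml',
  sources = c('2_process/src/tally_site_obs.R', '3_visualize/src/plot_site_data.R',
              '3_visualize/src/combiners.R'),
  packages = c('dplyr', 'readr', 'tibble'),
  finalize_funs = c('combine_obs_tallies', 'summarize_plot_files'),
  final_targets = c('obs_tallies', '3_visualize/out/plot_summary.csv'),
  as_promises = FALSE,
  tickquote_combinee_objects = TRUE
)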

What's next

Congratulations, @wdwatkins, you've finished the course! ✨

Now that you're on the USGS Data Science team, you'll likely be using data pipelines like the one you explored here in your visualization and analysis projects. More questions will surely come up - when they do, you have some options:

  1. Check out the scipiper vignettes at vignette(package='scipiper')
  2. Visit the remake and drake documentation pages
  3. Ask questions and share ideas in our Function - Data Pipelines Teams channel

You rock, @wdwatkins! Time for a well-deserved break.

Splitters

In the last issue you noted a lingering inefficiency: When you added Illinois to the states vector, your task table pipeline built WI_data, MN_data, and MI_data again even though there was no need to download those files again. This happened because those three targets each depend on oldest_active_sites, the inventory object, and that object changed to include information about a gage in Illinois. As I noted in that issue, it would be ideal if each task target only depended on exactly the values that determine whether the data need to be downloaded again. But we need a new tool to get there: a splitter.

The splitter we're going to create in this issue will split oldest_active_sites into a separate table for each state. In this case each table will be just one row long, but there are plenty of situations where the inputs to a set of tasks will be larger, even after splitting into task-size chunks. Some splitters will be quick to run and others will take a while, but either way, we'll be saving ourselves time in the overall pipeline!

Background

The slow way to split

In general, scipiper best practices require that each command creates exactly one output, either an R object or a file. To follow this policy, we could write a function that would take the full inventory and one state name and return a one-row table.

get_state_inventory <- function(sites_info, state) {
  dplyr::filter(sites_info, state_cd == state)
}

And then we could insert an initial task table step where we pulled out that state's information before passing it to the next step, such that each state's resulting targets in 123_state_tasks.yml would follow this pattern:

  WI_inventory:
    command: get_state_inventory(sites_info=oldest_active_sites, I('WI'))
  WI_data:
    command: get_site_data(sites_info=WI_inventory, state=I('WI'), parameter=parameter)

...but ugh, what a lot of text to add for such a simple operation! Also, suppose that sites_info was a file that took a long time to read in - we've encountered cases like this for large spatial data files, for example - you'd have to re-open the file for each and every call to get_state_inventory(), which would be excruciatingly slow for a many-state pipeline.

Fortunately, there's a better way.

The fast way to split

Instead of calling get_state_inventory() once for each state, we'll go ahead and write a single splitter function that accepts oldest_active_sites and creates a single-row table for each state. It will be faster both to type (fewer targets) and to run (no redundant reloading of the data to split). The key is that instead of just creating as many tables as there are states, we'll also create a single summary file that concisely describes each of those outputs all in one place. In this way we only sort of break the "one-function, one-output" rule - we are creating many outputs, true, but one of them is an output that we can use as a scipiper target because it reflects the overall behavior of the function.

Hashes

Ideally, we should design our summary file so that it changes anytime any of the split-up outputs change. We'll get more into the rationale in a later issue (very briefly, this approach makes it possible to use the summary file as an effective dependency of other pipeline targets). For now, I'm asking you to accept on faith that summary files should reflect the split-up outputs in a meaningful way.

To meet this need, our favorite format for summary files is a YAML of key-value pairs, where each key is a filename and each value is a hash of that file. A hash is a shortish character string (usually 16-32 characters) that describes the contents of a larger file or object. There are a number of algorithms for creating hashes (e.g., MD5, SHA-2, and SHA-3), but the cool feature that unites them all is that hashes are intended to be unique - if the contents of two files are even just slightly different, their hashes will always (OK, almost always) be different.

Explore the hash idea yourself using the digest::digest() function on R objects or the tools::md5sum() function on files. For example, you can hash your remake.yml file and get something like this:

> tools::md5sum('remake.yml')
                        remake.yml 
"96b30271e93f4729ba4cbc6271c694df" 

and then if you change it just a tiny bit - add a newline, for example - you are more or less guaranteed to get a different hash:

> tools::md5sum('remake.yml')
                        remake.yml 
"9268e621b96ec9d35e9d9c0c2d58c19d"

Hash algorithms come from cryptography, which makes them extra cool 🕵️

Your mission

In this issue you'll create a splitter to make your task table more efficient in the face of a changing inventory in oldest_active_sites. Your splitter function will generate a separate one-row inventory file for each state, plus a YAML file that summarizes all of the single-state inventories with filenames and hashes roughly like this:

1_fetch/tmp/inventory_IL.tsv: 4c59d14d16b3af05dff6e6d6dfc8aac9
1_fetch/tmp/inventory_MI.tsv: fed321c051ee99e2c7b163c5c4c10320
1_fetch/tmp/inventory_MN.tsv: d2bff76a0631abf055421b86d033d80c
1_fetch/tmp/inventory_WI.tsv: b6167db818f91d792ec639b6ec4faf68
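
Here's a hedged sketch of what such a splitter might look like; the function name, the summary-file path, and the use of readr and yaml are assumptions rather than the course's official solution:

split_inventory <- function(summary_yml = '1_fetch/tmp/state_splits.yml', sites_info) {
  dir.create(dirname(summary_yml), showWarnings = FALSE, recursive = TRUE)

  # write one single-state inventory file per state
  states <- sort(unique(sites_info$state_cd))
  split_files <- vapply(states, function(state) {
    out_file <- file.path(dirname(summary_yml), sprintf('inventory_%s.tsv', state))
    readr::write_tsv(dplyr::filter(sites_info, state_cd == state), out_file)
    out_file
  }, character(1), USE.NAMES = FALSE)

  # summarize the split files as filename: hash pairs, so the summary file
  # changes whenever any single-state inventory changes
  yaml::write_yaml(as.list(tools::md5sum(split_files)), summary_yml)
}

With a summary file like this as the splitter's target, each state's download step can depend on just its own single-state inventory file rather than on the whole oldest_active_sites object.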

Ready?

Task tables

In the last issue you noted some inefficiencies with writing out many nearly-identical targets within a remake.yml:

  1. It's a pain (more typing and potentially a very long remake.yml file) to add new sites.
  2. You have to run and re-run scmake() if any target breaks along the way.

In this issue we'll fix those inefficiencies by adopting the task table approach supported by scipiper.

Definitions

A task table is a concept. The idea is that you can think of a split-apply-combine operation in terms of a table: each row of the table is a split (a task) and each column is an apply activity (a step). In the example analysis for this course, each row is a state and the first column contains calls to get_site_data() for that state's oldest monitoring site. Later we'll create additional steps for tallying and plotting observations for each state's site. A task step is a cell within this conceptual table.

[Figure: the conceptual task table, with one row per state (task) and one column per step]

A task table is "conceptual" because it doesn't exist as a table in R. We implement it in two ways: as a task plan, which is a nested R list defining all the tasks and steps, and as a task remakefile, which is an automatically generated scipiper YAML file much like the remake.yml you're already working with.

In this issue you'll create a task plan and task remakefile for this analysis of USGS's oldest gages.
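
As a preview of that workflow, here is a hedged sketch (the state vector, step definition, source file path, package choice, and exact argument choices are illustrative assumptions, not the course's official solution):

states <- c('WI', 'MN', 'MI')

# one step per column of the conceptual table; more steps get added later
download_step <- create_task_step(
  step_name = 'download',
  target_name = function(task_name, ...) sprintf('%s_data', task_name),
  command = function(task_name, ...) {
    sprintf("get_site_data(sites_info = oldest_active_sites, state = I('%s'), parameter = parameter)",
            task_name)
  }
)

# the task plan: a nested R list with one task per state
task_plan <- create_task_plan(
  task_names = states,
  task_steps = list(download_step),
  add_complete = FALSE  # assumed argument: skip per-task 'complete' indicator targets
)

# the task remakefile: an auto-generated scipiper YAML much like remake.yml
create_task_makefile(
  task_plan = task_plan,
  makefile = '123_state_tasks.yml',
  sources = '1_fetch/src/get_site_data.R',
  packages = 'dataRetrieval'
)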
