
Comments (15)

github-learning-lab commented on June 21, 2024

⌨️ Activity: Switch to a new branch

Before you edit any code, create a local branch called "combiners" and push that branch up to the remote location "origin" (which is the GitHub host of your repository).

git checkout master
git pull origin master
git checkout -b combiners
git push -u origin combiners

The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to master and sync with "origin" whenever you're transitioning between branches and/or PRs.


Comment on this issue once you've created and pushed the "combiners" branch.

from ds-pipelines-3.

wdwatkins commented on June 21, 2024

a


github-learning-lab commented on June 21, 2024

⌨️ Activity: Add a data combiner

Write combine_obs_tallies()

  • Add a new function called combine_obs_tallies somewhere in 123_state_tasks.R. The function declaration should be function(...); when the function is actually called, you can anticipate that the arguments will be a bunch of tallies tibbles (tidyverse data frames). Your function should return the concatenation of these tibbles into one very tall tibble.

  • Test your combine_obs_tallies() function. Run

    source('123_state_tasks.R') # load `combine_obs_tallies()`
    WI_tally <- remake::fetch('WI_tally', remake_file='123_state_tasks.yml')
    MN_tally <- remake::fetch('MN_tally', remake_file='123_state_tasks.yml')
    IA_tally <- remake::fetch('IA_tally', remake_file='123_state_tasks.yml')
    combine_obs_tallies(WI_tally, MN_tally, IA_tally)

    The result should be a tibble with four columns and as many rows as the sum of the number of rows in WI_tally, MN_tally, and IA_tally. If you don't have it right yet, keep fiddling and/or ask for help.
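For comparison, here is one minimal sketch of such a combiner (a sketch only, assuming the tidyverse is installed; `dplyr::bind_rows()` is just one of several ways to concatenate tibbles):

    # Concatenate any number of tally tibbles into one tall tibble.
    combine_obs_tallies <- function(...) {
      # bind_rows() stacks all of its data-frame arguments row-wise,
      # matching columns by name
      dplyr::bind_rows(...)
    }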

Prepare the task plan and task makefile to use combine_obs_tallies()

  • Set the final_steps argument of your call to create_task_plan() to 'tally' (which should be the step_name name of your tally step) - this tells scipiper to pass [only] the results of the "tally" task-steps into your combiner.

  • Set as_promises=FALSE and tickquote_combinee_objects=TRUE in your call to create_task_makefile() within do_state_tasks().

  • Add/edit the values of the final_targets and finalize_funs arguments in the create_task_makefile() call to specify that you want one combiner target that runs the function combine_obs_tallies() and produces an R object target named obs_tallies.

Connect the pipeline

  • Add '123_state_tasks.R' as yet another unnamed argument in the recipe for state_tasks in remake.yml. This exercise should be familiar; you need to specify your code sources here (and propagate them through to create_task_makefile() using the ... argument to do_state_tasks()) so that both 123_state_tasks.yml and remake.yml can see them.

  • Edit the # Build the tasks code chunk within do_state_tasks so that the target that gets built is obs_tallies and the output is assigned to a local variable also named obs_tallies.

  • Return obs_tallies as the output of do_state_tasks(). Change the # Return nothing... comment to match what you're now doing.

Test

Run state_tasks <- scmake('state_tasks'), then answer these questions:

  1. Inspect the console output. Which task steps (download, tally, and/or plot) are no longer getting built or checked? Inspect 123_state_tasks.yml to see if you can figure out why.

  2. Inspect the value of state_tasks. Is it what you expected?

When you're feeling confident, add a comment to this issue with your answers to the two questions above.


I'll respond when I see your comment.


wdwatkins commented on June 21, 2024

The download and tally steps are still getting checked, but plot is not. The plot steps aren't tied into the final combiner target, which is what we just built:
[image]

State tasks contains the summarized tally of observations for each state. It might make more sense for it to contain some representation that all the tasks were completed (including the plots), not just the intermediate tally of observations.


github-learning-lab commented on June 21, 2024

Check your progress

Some answers to compare to your own:

1. Inspect the console output. Which task steps (download, tally, and/or plot) are no longer getting built or checked? Inspect 123_state_tasks.yml to see if you can figure out why.

The plot task-steps are no longer getting built or checked. They're still there in 123_state_tasks.yml, but we're now only building the obs_tallies target, which depends on the download and tally steps but not on the plot steps. Also, the plots are no longer listed even as dependencies of the default target (123_state_tasks). The download and tally steps also got removed from the default target dependencies, but the default target does depend on the tallies combiner, which depends on each of the tally steps, which in turn depend on the download steps, so that's why the tally and download steps still get considered.

2. Inspect the value of state_tasks. Is it what you expected?

Here's what my state_tasks looks like. Your number of rows might vary slightly if you build this at a time when the available data have changed substantially, but the column structure and approximate number of rows ought to be about the same. If it looks like this, then it meets my expectations and hopefully also yours.

> state_tasks
# A tibble: 738 x 4
# Groups:   Site, State [6]
   Site     State  Year NumObs
   <chr>    <chr> <dbl>  <int>
 1 04073500 WI     1898    365
 2 04073500 WI     1899    365
 3 04073500 WI     1900    365
 4 04073500 WI     1901    365
 5 04073500 WI     1902    365
 6 04073500 WI     1903    365
 7 04073500 WI     1904    366
 8 04073500 WI     1905    365
 9 04073500 WI     1906    365
10 04073500 WI     1907    365
# … with 728 more rows

⌨️ Activity: Explore as_promises

We stuck with the name state_tasks in the main pipeline, but this target would now be more aptly named obs_tallies.

  • Try changing the target name from state_tasks to obs_tallies in remake.yml (do a whole-word find-replace to change it everywhere it occurs in that file).

  • Run scmake() again. What happens? Identify the line in 123_state_tasks.yml that defines a target of the same name.

Hmm. It would be nice if we could use the same name to refer to the same information (a table of observation tallies) in both remake.yml and the task table, but it appears that scipiper won't let us. This is where the as_promises argument to create_task_makefile() comes in.

  • Change as_promises from FALSE to TRUE.

  • Leave the final_targets argument alone (set to obs_tallies).

  • Change obs_tallies <- scmake('obs_tallies', remake_file='123_state_tasks.yml') to obs_tallies <- scmake('obs_tallies_promise', remake_file='123_state_tasks.yml') (a few lines down from the call to create_task_makefile()).

  • Rebuild obs_tallies from the main remake.yml. Now scipiper lets you do it, right? Check that line you identified in 123_state_tasks.yml to see what changed.

This as_promises=TRUE technique is a pattern we've adopted to accommodate the fact that scipiper doesn't allow duplicate target names, but we kinda want them to keep our code clear. It's not perfect, but it does the trick.

Comment on this issue when you're ready to proceed.


I'll respond when I see your comment.


wdwatkins commented on June 21, 2024

a


github-learning-lab commented on June 21, 2024

⌨️ Activity: Use the combiner target downstream

It's time to reap the rewards from your first combiner.

  • Create a new target in remake.yml that takes advantage of your new combined tallies. Use the plot_data_coverage() function already defined for you (find it by searching or browsing the repository - remember Ctrl-.), and pass in state_tasks as the oldest_site_tallies argument. Set up your target to create a file named "3_visualize/out/data_coverage.png". Remember to add the source file to the sources list in remake.yml, and set up your pipeline to build this new target as part of the default build.

  • Test your new target by running scmake(), then checking out 3_visualize/out/data_coverage.png.

  • Test your new pipeline by removing a state from states and running scmake() once more. Did 3_visualize/out/data_coverage.png get revised? If not, see if you can figure out how to make it so. Ask for help if you need it.
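In remake.yml, the new target from the first bullet might look roughly like this (a sketch; `out_file` is a hypothetical name for plot_data_coverage()'s file argument - check the function definition for the real parameter names):

    3_visualize/out/data_coverage.png:
      command: plot_data_coverage(out_file = target_name, oldest_site_tallies = state_tasks)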

When you've got it, share the image in 3_visualize/out/data_coverage.png as a comment.


I'll respond when I see your comment.


wdwatkins commented on June 21, 2024

[image]


github-learning-lab commented on June 21, 2024

Great, you have a combiner hooked up from start to finish, and you probably learned some things along the way! It's time to add a second combiner that serves a different purpose - here, rather than produce a target that contains the data of interest, we'll produce a combiner target that summarizes the outputs of interest (in this case the state-specific .png files we've already created).

⌨️ Activity: Add a summary combiner

Don't write another combiner

Last time, you wrote your own combiner. This time you just need to check out combine_to_ind(), a function provided by scipiper.

  • Check out the documentation at ?combine_to_ind.

  • Test it out with a command such as

    combine_to_ind('test.yml', '3_visualize/out/timeseries_IA.png', '3_visualize/out/timeseries_MN.png')

    Check out the contents of test.yml. Then when you're feeling clear on what happened, delete test.yml.

Prepare the task plan and task makefile to use combine_to_ind()

  • Add/edit the values of the final_targets and finalize_funs arguments in the create_task_makefile() call to specify that you want a second combiner target that runs the function combine_to_ind() and produces a file target named 3_visualize/out/timeseries_plots.yml. Keep the tallies combiner in place.

  • Add another line just below obs_tallies <- scmake('obs_tallies_promise', remake_file='123_state_tasks.yml') to build this second combiner. The new line should be:

    scmake('timeseries_plots.yml_promise', remake_file='123_state_tasks.yml')

    Note how the target name for this combiner differs from the target you provided in final_targets: it's the filename without the directories, and there's _promise at the end. This is the work of as_promises=TRUE again, this time as applied to a file target.

  • Run scmake(). It breaks. Check out the combiner targets at the end of 123_state_tasks.yml to see if you can figure out why before you read the instructions in the next paragraph.
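For reference, the first bullet above amounts to supplying parallel vectors to create_task_makefile(), roughly like this (a sketch: it assumes scipiper pairs final_targets and finalize_funs element-wise, as this activity implies, and omits whatever other arguments you're already passing):

    task_makefile <- create_task_makefile(
      task_plan,  # plus your other existing arguments
      makefile = '123_state_tasks.yml',
      final_targets = c('obs_tallies', '3_visualize/out/timeseries_plots.yml'),
      finalize_funs = c('combine_obs_tallies', 'combine_to_ind'),
      as_promises = TRUE,
      tickquote_combinee_objects = TRUE)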

Test and revise final_steps

Hmm, you probably just discovered that 123_state_tasks.yml is trying to apply combine_to_ind() to your tally step instead of your plot step:

  timeseries_plots.yml_promise:
    command: combine_to_ind(I('3_visualize/out/timeseries_plots.yml'),
      `WI_tally`,
      `MN_tally`,
      `MI_tally`,
      `IL_tally`,
      `IN_tally`,
      `IA_tally`)

In hindsight, that probably makes sense, but it makes the next step a bit tricky. You've already set final_steps='tally' in create_task_plan(), and that's still useful for the tally combiner. But in order to pass the plot files into combine_to_ind(), which is what we need for this new combiner, we'd really like final_steps='plot'.

  • Set the final_steps argument of your call to create_task_plan() to c('tally', 'plot'), call scmake() again, and check out 123_state_tasks.yml once more. How did the combiner functions change?

Hmm, that's an improvement because now both combiners are getting the arguments they need, but it's also a step backward because now neither combiner is getting only the arguments it needs - they're each getting both the tally and the plot outputs.

Revise the combiners

The solution for this multi-combiner pipeline is to filter the arguments in each combiner. For this particular pipeline, we can distinguish between the two final steps based on their type: the tally outputs are tibble types, and the plot outputs get passed to the combiner as character filenames.

  • For combine_obs_tallies(), add these two lines to the beginning of the function:

    # filter to just those arguments that are tibbles (because the only step
    # outputs that are tibbles are the tallies)
    dots <- list(...)
    tally_dots <- dots[purrr::map_lgl(dots, is_tibble)]

    and then proceed with whatever code you were using to combine the tibbles, this time using tally_dots rather than .... Depending on the function you used for the combining, you may need to revise that code slightly to take a single argument that's a list of tibbles, rather than a sequence of individual tibble arguments.
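For example, if you had been using bind_rows() to do the combining, the revised function could look like this (a sketch; bind_rows() happens to accept a single list of data frames, so no do.call() gymnastics are needed):

    combine_obs_tallies <- function(...) {
      # filter to just those arguments that are tibbles (because the only step
      # outputs that are tibbles are the tallies)
      dots <- list(...)
      tally_dots <- dots[purrr::map_lgl(dots, tibble::is_tibble)]
      # bind_rows() accepts a list of tibbles directly
      dplyr::bind_rows(tally_dots)
    }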

  • For combine_to_ind(), it turns out you will need to write your own custom function after all so that you can add in this filtering. Try adding this function to 123_state_tasks.R:

    summarize_timeseries_plots <- function(ind_file, ...) {
      # filter to just those arguments that are character strings (because the only
      # step outputs that are characters are the plot filenames)
      dots <- list(...)
      plot_dots <- dots[purrr::map_lgl(dots, is.character)]
      do.call(combine_to_ind, c(list(ind_file), plot_dots))
    }

    Then replace 'combine_to_ind' with 'summarize_timeseries_plots' in the finalize_funs argument to create_task_makefile().

  • Run scmake() again and then check the contents of 3_visualize/out/data_coverage.png and 3_visualize/out/timeseries_plots.yml to make sure you've succeeded in hooking up both combiners.

When you're feeling confident, add a comment to this issue with the contents of 3_visualize/out/data_coverage.png and 3_visualize/out/timeseries_plots.yml.


I'll respond when I see your comment.


wdwatkins commented on June 21, 2024
3_visualize/out/timeseries_WI.png: cacd873105e2bb4a951a8ab2277c920d
3_visualize/out/timeseries_MN.png: a1347b9e25a16278c7eb5aa4019d1d0a
3_visualize/out/timeseries_MI.png: 94a4faf9ce43aac5298270f0dd997649
3_visualize/out/timeseries_IL.png: 39aa0cfb865494b22764cae370592d45
3_visualize/out/timeseries_IA.png: 05348827dd54c4722c9e01ae9a9adba1

[image]


github-learning-lab commented on June 21, 2024

You're getting close! The last step for this second combiner is to connect it to the main pipeline. But this isn't trivial, because right now your code in do_state_tasks creates the obs_tallies target in the main pipeline, and we'd like to keep that obs_tallies information. How do we get the results of both combiners into the main pipeline all at once?

One function, two outputs?

To connect both combiners to the main pipeline - and more broadly to follow pipelining best practices, ensuring that our pipeline's reproducibility is robust to modification - we need do_state_tasks() to create a single target that represents all the effects of the task table that we want to be visible to the pipeline.

Let's take a moment to decide which effects of the task table we want to be visible. For this we need to check our project plans, because what we want does differ by project...ahh, here they are: In this course project we won't ever need to revisit the state-specific data tables again, so we don't need to carry those WI_data, WI_tally, etc. objects back to the main pipeline. The obs_tallies object will be sufficient to store the state tallies, and the timeseries_plots.yml file is sufficient to represent the status of the plot .png files.

Great! So we only have two outputs that need to be represented by state_tasks: the big tallies table and the plot summary file. Unfortunately, two outputs is still one too many. How can we tell the main pipeline about these two objects using just one output?

This challenge should be ringing bells for you, because we've actually solved it twice already.

  • The first time was with the inventory splitter, where we split the inventory but also created a summary file of the split-up inventory files.
  • The second time was with the plot file combiner. Our apply operation had created one plot per state, but that's not easy to use downstream, so we then summarized those plot files into 3_visualize/out/timeseries_plots.yml.

In both cases, we had one function and many outputs...and we saved the day by creating a single summary output. So let's do that once more!

There are actually a few ways to implement this general strategy. So far we've created summary files, but in this case, the output of do_state_tasks() could be...

  1. A faithful representation of the combiner targets as they were produced by 123_state_tasks.yml: A list that contains (1) the contents of the tallies table and (2) a filename and hash describing the plot summary file (yes, that's a summary of a summary file).
  2. A concise representation of the combiner targets: A list that contains a filename and hash for a tallies table file (in this case we'd write out that table to file) and for the plot summary file.
  3. A ready-to-go translation of the combiner targets into R objects: A list that contains (1) the contents of the tallies table and (2) the contents of the plot summary file (in this case we'd read the plot summary file into an R object with a YAML parser).
  4. A file that could be shared with others: A file, perhaps in RDS format, that contains any of the above three options.

⌨️ Activity: Make a multi-output target

For this course, let's go with option 3 from the list above.

  • Add a new expression in do_state_tasks() right after

    scmake('timeseries_plots.yml_promise', remake_file='123_state_tasks.yml')

    to read timeseries_plots.yml into a tibble format:

    timeseries_plots_info <- yaml::yaml.load_file('3_visualize/out/timeseries_plots.yml') %>%
      tibble::enframe(name = 'filename', value = 'hash') %>%
      mutate(hash = purrr::map_chr(hash, `[[`, 1))
  • Change the return value of do_state_tasks() to be a list of both the tallies table and the plot summary tibble:

    # Return the combiner targets to the parent remake file
    return(list(obs_tallies=obs_tallies, timeseries_plots_info=timeseries_plots_info))
  • In remake.yml, change the target name for the result of do_state_tasks() from obs_tallies to state_combiners.

  • Add these two unpacker targets right after the state_combiners target (pluck() comes from purrr, which is loaded along with the already-declared tidyverse package):

    obs_tallies:
      command: pluck(state_combiners, target_name)
    timeseries_plots_info:
      command: pluck(state_combiners, target_name)

Test

  • Run obs_tallies <- scmake('obs_tallies') and check the value of obs_tallies. Look good?

  • Run timeseries_plots_info <- scmake('timeseries_plots_info') and check the value of timeseries_plots_info. Look good?

Add any comments, questions, or revelations to a comment on this issue.


I'll respond when I see your comment.


wdwatkins commented on June 21, 2024

I wonder if the combiner target filtering could be more built-in? Perhaps something using a named vector in final_steps would allow referencing only the targets for a particular step in final_targets?


github-learning-lab commented on June 21, 2024

You're down to the last task for this issue! I hope you'll find this one rewarding. After all your hard work, you're now in a position to create a leaflet map that will give you interactive access to the locations, identities, and timeseries plots of the Upper Midwest's oldest gages, all in one .html map. Ready?

Use the plots downstream

  • Add another target to remake.yml that uses the function map_timeseries() (defined for you in 3_visualize). site_info should be the inventory of oldest sites, plot_info should be timeseries_plots_info, and the output should be written to 3_visualize/out/timeseries_map.html.

  • Add the three packages that map_timeseries() requires to the declaration at the top of remake.yml: leaflet, leafpop, and htmlwidgets.

  • Edit remake.yml as needed to ensure that 3_visualize/out/timeseries_map.html will get built on a call to scmake() without arguments.
    (You should already have 3_visualize/out/data_coverage.png set up for this. Also, by declaring both 3_visualize/out/timeseries_map.html and 3_visualize/out/data_coverage.png as elements of the default target, you will have ensured that obs_tallies and timeseries_plots_info get built, so you don't need to declare those directly.)
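Sketch of what the map target might look like in remake.yml (hypothetical: `oldest_active_sites` stands in for whatever your inventory-of-oldest-sites target is actually called, and `out_file` for map_timeseries()'s file argument - check the function definition in 3_visualize for the real names):

    3_visualize/out/timeseries_map.html:
      command: map_timeseries(site_info = oldest_active_sites, plot_info = timeseries_plots_info, out_file = target_name)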

Test

  • Run scmake(). Any surprises?

  • Check out the results of your new map by opening 3_visualize/out/timeseries_map.html in the browser. You should be able to hover and click on each marker.

  • Add or subtract a state from the states vector and rerun scmake(). Did you see the rebuilds and non-rebuilds that you expected? Did the html file change as expected?

Make a pull request

It's finally time to submit your work.

  • Commit your code changes for this issue and make sure you're adding the new analysis products (the .png and .html files) to .gitignore rather than committing them. Push your changes to the GitHub repo.

  • Create a PR to merge the "combiners" branch into "master". Share a screenshot of 3_visualize/out/timeseries_map.html and any thoughts you want to share in the PR description.


I'll respond when I see your PR.


jordansread commented on June 21, 2024

> I wonder if the combiner target filtering could be more built-in? Perhaps something using a named vector in final_steps would allow referencing only the targets for a particular step in final_targets?

Agreed. This pattern does make you want more customization in the combiners for sure.


aappling-usgs commented on June 21, 2024

> I wonder if the combiner target filtering could be more built-in? Perhaps something using a named vector in final_steps would allow referencing only the targets for a particular step in final_targets?
>
> Agreed. This pattern does make you want more customization in the combiners for sure.

Yep. That was a pain point as I was working on this course. Also noted in this issue: DOI-USGS/scipiper#113

