Your pipeline is looking pretty good! Now it's time to add complexity. I've just added

Appliers (wdwatkins/ds-pipelines-3)

Comments (9)

github-learning-lab commented on July 20, 2024

⌨️ Activity: Switch to a new branch

Before you edit any code, create a local branch called "appliers" and push that branch up to the remote location "origin" (which is the github host of your repository).

git checkout master
git pull origin master
git checkout -b appliers
git push -u origin appliers

The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to master and sync with "origin" whenever you're transitioning between branches and/or PRs.

Comment on this issue once you've created and pushed the "appliers" branch.

from ds-pipelines-3.

wdwatkins commented on July 20, 2024


github-learning-lab commented on July 20, 2024

⌨️ Activity: Add two new appliers

Code

In 123_state_tasks.R:

* Add a new step right after download_step. This step object should be called plot_step, should have step name 'plot', should create targets called 3_visualize/out/timeseries_WI.png, 3_visualize/out/timeseries_MN.png, etc., should call the plot_site_data() function (defined in 3_visualize/src/plot_site_data.R), and should make use of the targets created in download_step.
  (Hint: It's fine to link backward to the downloading targets using sprintf() or another string manipulation function, but if you want to get really fancy, try out the steps argument to your command function.)
* Add a third step called tally_step. This step should have step name 'tally', should create R object targets called WI_tally, MI_tally, etc., should call the tally_site_obs() function (also already defined for you), and should make use of the targets created in download_step (no need to link to the plot_step targets).
* Add plot_step and tally_step to the call to create_task_plan().
* Add the two new function files (where plot_site_data() and tally_site_obs() are defined) to the sources argument in your create_task_makefile() call.
* Add the lubridate package to the packages argument in your create_task_makefile() call (it's used in tally_site_obs()).
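
To make the sprintf() hint a little more concrete, here is a minimal base R sketch of how the target names and command strings for the two new steps could be assembled. It is not the course solution. The helper names, the "WI_data" pattern for the download targets, and the arguments passed to plot_site_data() and tally_site_obs() are all assumptions made for illustration; in 123_state_tasks.R, functions like these would be handed to create_task_step() rather than called directly.

```r
# Hypothetical helpers for illustration only; none of these names come from
# the course code.
states <- c("WI", "MN", "MI")

# ASSUMPTION: download_step names its targets like "WI_data".
download_target_name <- function(task_name) {
  sprintf("%s_data", task_name)
}

# File targets for the plot step, one PNG per state.
plot_target_name <- function(task_name) {
  sprintf("3_visualize/out/timeseries_%s.png", task_name)
}

# ASSUMPTION: plot_site_data() takes the output file first and the downloaded
# data second. "target_name" is written literally on the assumption that
# remake substitutes the target's own name inside a command.
plot_command <- function(task_name) {
  sprintf("plot_site_data(target_name, %s)", download_target_name(task_name))
}

# Object targets for the tally step, e.g. "WI_tally".
tally_target_name <- function(task_name) {
  sprintf("%s_tally", task_name)
}

# ASSUMPTION: tally_site_obs() takes the downloaded data as its only argument.
tally_command <- function(task_name) {
  sprintf("tally_site_obs(%s)", download_target_name(task_name))
}

# Preview what each step would generate for every state.
plan_preview <- data.frame(
  state = states,
  plot_target = plot_target_name(states),
  plot_command = plot_command(states),
  tally_target = tally_target_name(states),
  tally_command = tally_command(states),
  stringsAsFactors = FALSE
)
print(plan_preview)
```

Previewing the strings this way before wiring them into the task steps can make typos in paths or target names easier to spot than reading the generated makefile.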

Test

* Run scmake('state_tasks'). Is it building a timeseries plot and a tally object for each state? If not, keep fiddling with your code until you get it to work.
* Check the contents of the 3_visualize/out directory and inspect at least one of the plots. How do they look?
* Assign the value of IN_tally to a variable of the same name in your global environment. You can use the scipiper::scmake() function or the remake::fetch() function. Either function will require a bit of special syntax - review ?scmake or ?remake::fetch for clues and ask if you get stuck.

When you're feeling confident, add a comment to this issue with:

* an image from one of the new plots in 3_visualize/out, and
* a printout of the first 10 lines of IN_tally.

I'll respond when I see your comment.


wdwatkins commented on July 20, 2024

> IN_tally
# A tibble: 110 x 4
# Groups:   Site, State [1]
   Site     State  Year NumObs
   <chr>    <chr> <dbl>  <int>
 1 03373500 IN     1903     92
 2 03373500 IN     1904    366
 3 03373500 IN     1905    365
 4 03373500 IN     1906     90
 5 03373500 IN     1909    275
 6 03373500 IN     1910    365
 7 03373500 IN     1911    365
 8 03373500 IN     1912    366
 9 03373500 IN     1913    365
10 03373500 IN     1914    365
# … with 100 more rows


github-learning-lab commented on July 20, 2024

Check your progress

To help you assess your pipeline, here's what I would have put in that comment:

* an image from one of the new plots in 3_visualize/out, and

* a printout of the first 10 lines of IN_tally

> IN_tally
# A tibble: 110 x 4
# Groups:   Site, State [1]
   Site     State  Year NumObs
   <chr>    <chr> <dbl>  <int>
 1 03373500 IN     1903     92
 2 03373500 IN     1904    366
 3 03373500 IN     1905    365
 4 03373500 IN     1906     90
 5 03373500 IN     1909    275
 6 03373500 IN     1910    365
 7 03373500 IN     1911    365
 8 03373500 IN     1912    366
 9 03373500 IN     1913    365
10 03373500 IN     1914    365
# … with 100 more rows

⌨️ Activity: Spot the split-apply-combine (again)

Check out the code for tally_site_obs(). To strengthen your familiarity with the split-apply-combine paradigm, can you isolate the split, apply, and combine operations within this tidyverse expression?

site_data %>%
  mutate(Year = lubridate::year(Date)) %>%
  # group by Site and State just to retain those columns, since we're already only looking at just one site worth of data
  group_by(Site, State, Year) %>%
  summarize(NumObs = length(which(!is.na(Value))))

Give your answer to the activity in a comment on this issue.

I'll respond when I see your comment.


wdwatkins commented on July 20, 2024

group_by effectively splits, then summarize applies the functions to generate NumObs for each group, which is then combined back into one data frame


github-learning-lab commented on July 20, 2024

Check your progress

Here's where I think the split-apply-combine paradigm is manifested in tidyverse:

The split is decided here:

group_by(Site, State, Year) %>%

The apply is the expression

length(which(!is.na(Value)))

And both apply and combine are orchestrated by

summarize()

It's amazing how concise these actions can be in tidyverse, don't you think? The scipiper version would require a lot more code to do the exact same operation, but it brings the special benefit of only (re)building those elements that aren't already up to date.
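
For comparison, the same three stages can be spelled out one at a time in base R. This is only a sketch: the toy site_data values are invented for illustration, and it imitates rather than reproduces what group_by() and summarize() do internally.

```r
# Toy stand-in for one site's data (values invented for illustration only).
site_data <- data.frame(
  Site = "03373500",
  State = "IN",
  Date = as.Date(c("1903-10-01", "1903-10-02",
                   "1904-01-01", "1904-01-02", "1904-01-03")),
  Value = c(10, NA, 12, 13, NA),
  stringsAsFactors = FALSE
)
site_data$Year <- as.integer(format(site_data$Date, "%Y"))

# Split: one data frame per year (Site and State are constant here).
pieces <- split(site_data, site_data$Year)

# Apply: count the non-missing values within each piece.
tallies <- lapply(pieces, function(piece) {
  data.frame(
    Site = piece$Site[1],
    State = piece$State[1],
    Year = piece$Year[1],
    NumObs = sum(!is.na(piece$Value)),
    stringsAsFactors = FALSE
  )
})

# Combine: stack the per-year results back into a single data frame.
tally <- do.call(rbind, tallies)
rownames(tally) <- NULL
print(tally)
```

Here split() plays the role of group_by(), the anonymous function is the apply step, and do.call(rbind, ...) is the combine step that summarize() otherwise handles for you.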

⌨️ Activity: Revise and rebuild a step

The timeseries plots aren't meant to be publication quality, but it would be nice to touch them up just a bit.

* Revise the title to include the State value from the first row of the site_data object.
* Run scmake() to build the plots again. What happens? Do you know why?
* Run scmake('state_tasks', force=TRUE) to force the issue. What happens now? Why should you be uncomfortable with this solution?

Add your answers to a new comment on this issue.

I'll respond when I see your comment.


wdwatkins commented on July 20, 2024

Nothing rebuilt with scmake because the plot_site_data.R file isn't a dependency directly in the main remake file. Using force=TRUE defeats the purpose of a pipeline tool, since it 'dumbly' forces a rebuild.


github-learning-lab commented on July 20, 2024

Check your progress

Run scmake() to build the plots again. What happens? Do you know why?

Nothing gets built! This is because the outer call to scmake() (from remake.yml) doesn't know that changes to plot_site_data() should trigger a rebuild of state_tasks.

Run scmake('state_tasks', force=TRUE) to force the issue. What happens now? Why should you be uncomfortable with this solution?

This approach does build the plots again. Once we persuade scipiper to rebuild state_tasks, the inner call to scmake() (within do_state_tasks()) is able to detect the change in plot_site_data() and recognizes the need to rebuild those plots. So we have a temporary solution. The problem is that this solution is not baked into the pipeline code: you need to remember to run scmake(..., force=TRUE) just to achieve the primary assurance that a pipeline is supposed to provide, namely that the outputs reflect the current code and data.

When is it OK to force a build?

There's an option in scmake() to force the rebuild of one or more targets...but we're also discouraging its use. You saw a similar recommendation in the "Pipelines tips and tricks" course regarding depending on a directory: it's possible to use force=TRUE but better to use a dummy argument in that case. So when is it OK to use force=TRUE? Well, think of force as a first-aid kit - when you discover that something is not rebuilding when it should, you may want to try force=TRUE to ensure that you understand the situation and maybe even to produce the downstream targets quickly in a pinch. But the bandaid is not lasting, and you really ought to get that pipeline to an operating room as soon as you can to see if there's a more long-term solution such as declaring another dependency or adding a dummy argument.
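
The toy model below may help show why an undeclared dependency goes unnoticed. It is a deliberately simplified sketch, not how remake or scipiper is actually implemented: the is_stale() and hash_files() helpers and the temporary files are hypothetical, and the only idea illustrated is that a target can be flagged as out of date only when a file it has declared changes.

```r
# Toy model only: a target is "stale" when the hash of a DECLARED dependency
# differs from the hash recorded at the last build.
src_dir <- tempfile("toy_pipeline_")
dir.create(src_dir)
fetch_src <- file.path(src_dir, "get_site_data.R")
plot_src <- file.path(src_dir, "plot_site_data.R")
writeLines("get_site_data <- function() 1", fetch_src)
writeLines("plot_site_data <- function() 'old title'", plot_src)

# Hash a set of files (tools ships with base R).
hash_files <- function(files) unname(tools::md5sum(files))

# Compare current hashes of the declared files against the stored ones.
is_stale <- function(declared_deps, stored_hashes) {
  !identical(hash_files(declared_deps), stored_hashes)
}

# Case 1: the outer recipe declares only the fetch source.
declared <- fetch_src
stored <- hash_files(declared)
writeLines("plot_site_data <- function() 'new title'", plot_src)
stale_without_declaration <- is_stale(declared, stored)

# Case 2: the plotting source is declared too, then edited again.
declared_fixed <- c(fetch_src, plot_src)
stored_fixed <- hash_files(declared_fixed)
writeLines("plot_site_data <- function() 'newer title'", plot_src)
stale_with_declaration <- is_stale(declared_fixed, stored_fixed)

print(c(without = stale_without_declaration, with = stale_with_declaration))
```

In the first case the edit to the plotting file is invisible to the check, which is the situation that tempts you toward force=TRUE; in the second case the same kind of edit is detected because the file was declared.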

⌨️ Activity: Avoid `force=TRUE`

You can avoid the need for scmake(..., force=TRUE) here by declaring the dependency on plot_site_data.R at the top level.

* Add '3_visualize/src/plot_site_data.R' as another unnamed argument in the recipe for state_tasks in remake.yml.
* While you're in there, better do the same for '2_process/src/tally_site_obs.R' for the same reason!
* Does adding source files like this feel familiar? You did it for 1_fetch/src/get_site_data.R a few issues ago, but it's an easy step to forget. You can actually save yourself the headache altogether if you set up your code a bit differently. First, see how the contents of ... in do_state_tasks() exactly match the contents of the sources argument to create_task_makefile()? This will be a consistent pattern, so rewrite that create_task_makefile() argument to use the ..., like so: sources = c(...). Now you only need to remember to add new source files for 123_state_tasks.yml in one place - remake.yml - so it should be easier to get right.
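
If the sources = c(...) idea is unfamiliar, this small sketch shows the mechanism in isolation. The function do_state_tasks_sketch() and its sites argument are hypothetical stand-ins; the real do_state_tasks() also builds the task plan and runs the task makefile, none of which is reproduced here.

```r
# Hypothetical stand-in for do_state_tasks(), reduced to the part that
# handles the unnamed arguments.
do_state_tasks_sketch <- function(sites, ...) {
  # Every unnamed argument supplied by the caller (in the pipeline, by the
  # state_tasks recipe in remake.yml) arrives in `...`. Collecting it once
  # means the list of source files is written in only one place.
  sources <- c(...)
  # In the real function this vector would be passed on as the sources
  # argument of create_task_makefile().
  sources
}

sources_seen <- do_state_tasks_sketch(
  sites = NULL,
  "1_fetch/src/get_site_data.R",
  "3_visualize/src/plot_site_data.R",
  "2_process/src/tally_site_obs.R"
)
print(sources_seen)
```

Because the vector is built from whatever the caller passes, adding a fourth source file later would require no change inside the function.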

Now that we've fixed that last issue, your code is ready for a pull request. Go for it!

I'll respond when I see your PR.

