⌨️ Activity: Switch to a new branch
Before you edit any code, create a local branch called "task-table" and push that branch up to the remote location "origin" (which is the GitHub host of your repository).
```
git checkout master
git pull origin master
git checkout -b task-table
git push -u origin task-table
```
The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to master
and sync with "origin" whenever you're transitioning between branches and/or PRs.
Comment on this issue once you've created and pushed the "task-table" branch.
from ds-pipelines-3.
To get you started with coding, I've added a new file to the master branch (which you should have already pulled into your "task-table" branch) called 123_state_tasks.R. This is the file where you'll write code to split a body of work into tasks, apply operations to each task, and combine the results back into end products.
Design decisions
Before we get to editing, let's briefly discuss this file:
- Why an R script? Well, we've experimented with several ways of coding up task tables in past projects, and we've settled on a pattern where we wrap everything needed for a task table into a single R function (in this script) that can be called to generate a single target in the top-level remake.yml. The alternative would be to make separate scipiper targets for the task plan, the task makefile, and the task table output, but in practice this has turned out to be inconvenient, especially for team projects where we only want one person to run the tasks within a task table. You'll learn more about this in the "Shared-cache pipelines" course.
- Why "123", and why isn't this file in a phase-specific src folder? Well, by the end of this course, we'll be doing data fetching, processing, and visualization steps for each state (the download, tally, and plot steps in the diagram at the top of this issue). Because we made the design decision to separate our workflow phases into 1_fetch, 2_process, and 3_visualize, this task-table script crosses those three phases. Hence the "123", and hence the decision to keep this file at the top level rather than including it within just one of the three phases. Note that this isn't the only way we could have gone - we could have defined phases 1_inventory, 2_statewise, and 3_summarize instead, then moved all the state-by-state code (including this file) into the second phase - but I didn't think "statewise" would be sufficiently clear to newcomers, and so here we are. This is the kind of pipelining decision we are frequently confronted with - choose wisely, but also enjoy the chance to be creative!
With that intro out of the way, let's get going on this task table already!
⌨️ Activity: Define your rows and columns
Connect to remake.yml
Connect this starter function to the remake.yml file. The function has well-formed (albeit boring) outputs already.
- Remember how last issue you added three targets beneath the line that said `# TODO: PULL SITE DATA HERE`? Well, now you should delete those targets and replace them with a recipe that calls the `do_state_tasks()` function.

```yaml
  # TODO: PULL SITE DATA HERE
  state_tasks:
    command: do_state_tasks(oldest_active_sites)
```
- Remove those three `_data` targets from the `depends` list of the `main` target and replace them with `state_tasks`.
- Add "123_state_tasks.R" to the `sources` section of `remake.yml`.
- Add scipiper to the `packages` section of `remake.yml`, because shortly we'll be calling scipiper functions within pipeline recipes, including the recipe for `state_tasks`.
- Make sure the connection works by calling `print(scmake('state_tasks'))`. You should see

```
$example_target_name
[1] "WI_download"

$example_command
[1] "download(I('WI'))"
```

You can call this same command as you're revising code in the next couple of steps to check your progress.
Define the rows
Now modify 123_state_tasks.R to define the rows of your task table.
- Define the rows by creating a vector of 2-digit state codes where it says `# TODO: DEFINE A VECTOR OF TASK NAMES HERE`. Use information from `oldest_active_sites`, which is already an argument to the `do_state_tasks()` function. You won't need much code.
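For reference, the row definition can be as simple as pulling one column out of the inventory. Here's a minimal sketch - the column name `state_cd` is an assumption, so inspect `oldest_active_sites` to see what your inventory actually calls it:

```r
# Inside do_state_tasks(): define one task per state in the inventory.
# NOTE: 'state_cd' is an assumed column name; check names(oldest_active_sites).
task_names <- oldest_active_sites$state_cd
```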
Define the columns
Still in 123_state_tasks.R, modify the existing column definition for `download_step` so that it pulls the data from NWIS for each state's oldest monitoring site, referring to `?create_task_step` for help on the syntax.
- Modify the `target_name` argument to `create_task_step()` so that each target (task-step) within this column will get a name like `WI_data`. The `target_name` argument should be a function of the form `function(task_name, step_name, ...) {}` where the body of the function constructs and returns a string for each combination of `task_name` (e.g., 'WI') and `step_name` (where we've already defined this step name to be 'download'). You can ignore the `step_name` this time. When it comes time to create the task plan (the R list), this function will get applied to each value of `task_name` in a vector of `task_names`.
- Modify the `command` argument to `create_task_step()` so that each command within this column will look like the commands you wrote for `wi_data`, `mn_data`, and `mi_data` in `remake.yml` in the previous issue.
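Put together, the download step might look something like this sketch. The command string here matches the expected output shown later in this issue; treat the exact formatting as illustrative:

```r
download_step <- create_task_step(
  step_name = 'download',
  # Each task-step target gets a name like 'WI_data'
  target_name = function(task_name, step_name, ...) {
    sprintf('%s_data', task_name)
  },
  # Each command mirrors the wi_data/mn_data/mi_data recipes from last issue
  command = function(task_name, ...) {
    sprintf("get_site_data(sites_info=oldest_active_sites, state=I('%s'), parameter=parameter)", task_name)
  }
)
```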
Test
When you're ready, call `print(scmake('state_tasks'))` and paste the output into a new comment on this issue.
I'll respond when I see your comment.
```
> print(scmake('state_tasks'))
Starting build at 2020-07-07 16:03:24
<  MAKE > state_tasks
[    OK ] states
[    OK ] parameter
[    OK ] oldest_active_sites
[ BUILD ] state_tasks |  state_tasks <- do_state_tas...
[  READ ]             |  # loading packages
Retrieving data for WI-04073500
Finished build at 2020-07-07 16:03:30
Build completed in 0.11 minutes

$example_target_name
[1] "WI_data"

$example_command
# A tibble: 44,744 x 6
   State Site     Date       Value Quality Parameter
   <chr> <chr>    <date>     <dbl> <chr>   <chr>
 1 WI    04073500 1898-01-01   500 A       Flow
 2 WI    04073500 1898-01-02   500 A       Flow
 3 WI    04073500 1898-01-03   500 A       Flow
 4 WI    04073500 1898-01-04   500 A       Flow
 5 WI    04073500 1898-01-05   500 A       Flow
 6 WI    04073500 1898-01-06   475 A       Flow
 7 WI    04073500 1898-01-07   500 A       Flow
 8 WI    04073500 1898-01-08   500 A       Flow
 9 WI    04073500 1898-01-09   500 A       Flow
10 WI    04073500 1898-01-10   500 A       Flow
# … with 44,734 more rows
```
Check your progress
You should have seen this output when you ran `print(scmake('state_tasks'))`:

```
$example_target_name
[1] "WI_data"

$example_command
[1] "get_site_data(sites_info=oldest_active_sites, state=I('WI'), parameter=parameter)"
```
If you're not there yet, keep trying until your output matches mine. Then proceed:
⌨️ Activity: Create the task plan
Sketch the plan
`create_task_plan()` generates an R list that defines your plan. To use this function in 123_state_tasks.R,

- Replace this code chunk

```r
# Return test results to the parent remake file
return(list(
  example_target_name = download_step$target_name(task_name='WI'),
  example_command = download_step$command(task_name='WI')
))
```

with this one:

```r
# Create the task plan
task_plan <- create_task_plan(
  task_names = YOUR_CODE_HERE,
  task_steps = YOUR_CODE_HERE,
  add_complete = FALSE)

# Return test results to the parent remake file
return(yaml::as.yaml(task_plan))
```
Flesh out the plan
Now modify the new block:
- Assign the task names you defined above to the `task_names` argument.
- Assign a `list` of steps to the `task_steps` argument. In this case there will just be one step in the list.
- Leave `add_complete = FALSE` as it is. Feel free to experiment later with changing this argument to `TRUE`, but it's not relevant to the current exercise. You can learn more about this and other arguments with a call to `?create_task_plan` if and when you're ready.
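Filled in, the new block might look like this sketch (assuming you stored the state codes in a vector called `task_names` and defined `download_step` as in the earlier steps of this issue):

```r
# Create the task plan
task_plan <- create_task_plan(
  task_names = task_names,          # e.g., c('WI', 'MN', 'MI')
  task_steps = list(download_step), # just the one step for now
  add_complete = FALSE)
```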
Test
Note that we're now returning `yaml::as.yaml(task_plan)` from this function. This is still a temporary return value but gives you a way to inspect what you've created. It's also possible to print out the raw value of `task_plan` - it's just an R list, after all - but converting it to YAML makes it more concise and human-readable. The `cat` call suggested next makes the YAML text print nicely to the console.
When you're ready, call `cat(scmake('state_tasks'))` and paste the output into a new comment on this issue.
I'll respond when I see your comment.
```yaml
WI:
  task_name: WI
  steps:
    download:
      step_name: download
      target_name: WI_data
      depends: []
      command: |-
        get_site_data(sites_info = oldest_active_sites,
          state = I('WI'),
          parameter = parameter)
MN:
  task_name: MN
  steps:
    download:
      step_name: download
      target_name: MN_data
      depends: []
      command: |-
        get_site_data(sites_info = oldest_active_sites,
          state = I('MN'),
          parameter = parameter)
MI:
  task_name: MI
  steps:
    download:
      step_name: download
      target_name: MI_data
      depends: []
      command: |-
        get_site_data(sites_info = oldest_active_sites,
          state = I('MI'),
          parameter = parameter)
```
Check your progress
You should have seen this output when you ran `cat(scmake('state_tasks'))`:

```yaml
WI:
  task_name: WI
  steps:
    download:
      step_name: download
      target_name: WI_data
      depends: []
      command: get_site_data(sites_info=oldest_active_sites, state=I('WI'), parameter=parameter)
MN:
  task_name: MN
  steps:
    download:
      step_name: download
      target_name: MN_data
      depends: []
      command: get_site_data(sites_info=oldest_active_sites, state=I('MN'), parameter=parameter)
MI:
  task_name: MI
  steps:
    download:
      step_name: download
      target_name: MI_data
      depends: []
      command: get_site_data(sites_info=oldest_active_sites, state=I('MI'), parameter=parameter)
```
If you're not there yet, keep trying until your output matches mine. Then proceed:
⌨️ Activity: Create the task remakefile
The final step in creating a task plan is to convert the R list task plan into a YAML file that scipiper can understand. To use the `create_task_makefile()` function in 123_state_tasks.R,

- Replace this code chunk

```r
# Return test results to the parent remake file
return(yaml::as.yaml(task_plan))
```

with this one:

```r
# Create the task remakefile
create_task_makefile(
  # TODO: ADD ARGUMENTS HERE
  tickquote_combinee_objects = FALSE,
  finalize_funs = c())

# Return nothing to the parent remake file
return()
```
Refine the makefile
- Now modify the new block. Refer to the `?create_task_makefile` documentation to identify and use the right arguments to:
  - Pass in the `task_plan`.
  - Write out the remakefile to 123_state_tasks.yml.
  - Tell scipiper to connect the dependencies of targets in this remakefile to the targets in the main remake.yml file when executing the remakefile. Use the `include` argument for this purpose.
  - Tell scipiper to load the R script that defines the `get_site_data()` function when executing the remakefile.
  - Tell scipiper to load the packages needed to execute the `get_site_data()` function when executing the remakefile.
  - Leave `tickquote_combinee_objects = FALSE` and `finalize_funs = c()`. We'll explore these arguments later.
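One plausible shape for the finished call is sketched below. Confirm the argument names against `?create_task_makefile` before copying anything, and note that the `packages` vector here is purely an assumption about what `get_site_data()` might need:

```r
# Sketch only - verify argument names with ?create_task_makefile
create_task_makefile(
  task_plan = task_plan,
  makefile = '123_state_tasks.yml',
  include = 'remake.yml',
  sources = '1_fetch/src/get_site_data.R',
  packages = c('dataRetrieval', 'dplyr'),  # assumed; list what get_site_data() uses
  tickquote_combinee_objects = FALSE,
  finalize_funs = c())
```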
Test
Now we're `return`ing nothing from this function, because the current effect of this function is to create a file. If this were our end goal for this function, we would change the target in remake.yml to 123_state_tasks.yml...but since we'll shortly change the file again, we won't bother. Just run `scmake('state_tasks')` to create the file.
And then check out your file! You should now see 123_state_tasks.yml in the top-level directory. Open it; aside from some header comments and extra sections, you should recognize the format of the file as being similar to that of remake.yml. Refer to this remake documentation page for explanations of any sections you don't yet understand. There will also be one target in the new file that you did not define as a task step - do you understand the definition and utility of that target?
Explore
You now have a new remakefile. Put it through its paces to make sure it's working as expected and you understand why. Some things to try:
- Run `remake::diagram(remake_file='remake.yml')` and `remake::diagram(remake_file='123_state_tasks.yml')`. Do you understand the relationship between the two diagrams?
- Set the `include` argument to `c()` in `create_task_makefile()`, build the remakefile again with `scmake('state_tasks')`, and run `remake::diagram(remake_file='123_state_tasks.yml')`. Do you understand the resulting error? Set the `include` argument back to its original value once you're done experimenting.
- Run `scmake('WI_data', remake_file='123_state_tasks.yml')`, potentially revising your call to `create_task_makefile()` if needed, until you can get the target to build successfully. (Note that you can't just edit `123_state_tasks.R` and see the changes immediately reflected in `scmake('WI_data', remake_file='123_state_tasks.yml')` - you need to call `scmake('state_tasks')` after editing. This problem will go away once our task table function is fully connected to the main pipeline.)
- Run `scmake('123_state_tasks', remake_file='123_state_tasks.yml')` until you've downloaded data for all three states' gages.
When you're done exploring, paste the output of a successful call to `scmake('123_state_tasks', remake_file='123_state_tasks.yml')` into a new comment on this issue.
I'll respond when I see your comment.
```
> scmake('123_state_tasks', remake_file='123_state_tasks.yml')
Starting build at 2020-07-07 17:05:42
<  MAKE > 123_state_tasks
[    OK ] states
[    OK ] parameter
[    OK ] oldest_active_sites
[    OK ] WI_data
[ BUILD ] MN_data |  MN_data <- get_site_data(si...
Retrieving data for MN-05211000
[ BUILD ] MI_data |  MI_data <- get_site_data(si...
Retrieving data for MI-04063522
[ ----- ] 123_state_tasks
Finished build at 2020-07-07 17:05:52
Build completed in 0.17 minutes
>
```
Check your progress
You may have had to call `scmake('123_state_tasks', remake_file='123_state_tasks.yml')` a few times to get through any [pretend] failures in the data pulls, but ultimately you should have seen something like this output:

```
> scmake('123_state_tasks', remake_file='123_state_tasks.yml')
Starting build at 2020-05-20 20:58:42
<  MAKE > 123_state_tasks
[    OK ] states
[    OK ] parameter
[    OK ] oldest_active_sites
[    OK ] WI_data
[    OK ] MN_data
[ BUILD ] MI_data |  MI_data <- get_site_data(sites_info = oldest_active_sites, state = "MI", ...
Retrieving data for site 04063522
[ ----- ] 123_state_tasks
Finished build at 2020-05-20 20:58:46
Build completed in 0.07 minutes
```
If you're not there yet, keep trying until your output matches mine. Then proceed:
⌨️ Activity: Connect the task remakefile to remake.yml
Now that your function creates a complete and functional task remakefile, the remaining step is to revise the connection between the main remake.yml and the `do_state_tasks()` function: edit `do_state_tasks()` so that it not only creates but also builds the task remakefile.

- Add these lines toward the end of `do_state_tasks()`, before the `return()` statement:

```r
# Build the tasks
scmake('123_state_tasks', remake_file='123_state_tasks.yml')
```
Wait, what?? You can call `scmake()` within a function that we're calling from a `remake.yml` target? Yep, sure can! It just works. (OK, mostly works - there's a gotcha we'll get into in the "Shared-cache pipelines" course, but it doesn't apply here.)
- Add `state_tasks` to the `depends` list for the `main` target in remake.yml.
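The amended `main` target would look roughly like this sketch (the comment stands in for whatever your depends list already contains):

```yaml
main:
  depends:
    - state_tasks
    # ...plus the other depends entries already in your remake.yml
```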
That's it, you did it! For now, anyway.
Test
- Call `scmake()` (with no arguments) until all data files have been downloaded.
- Add `'IL'` to the `states` target. Then call `scmake()` again. It builds `IL_data` for you, right? Cool! But there's something inefficient happening here, too - what is it? Can you guess why this is happening?
- Make a small change to the `get_site_data()` function: change `Sys.sleep(2)` to `Sys.sleep(0.5)`. Then call `scmake()` again. What's wrong with the output you see? Can you guess why this is happening?
Answer the questions from 2 and 3 above in a new comment on this issue.
I'll respond when I see your comment.
- Adding IL caused data for all states to be re-pulled. All the `state_data` targets depend on `oldest_active_sites`, so any time there is a change to `oldest_active_sites` they all have to be rebuilt.
- Nothing rebuilds with `scmake` after this change, because `get_site_data` is only used in `123_state_tasks.yml` and not the main remake file. Unless the `123_state_tasks.yml` file is included in the main remake file, the main remake file won't watch for changes to that function.
Check your progress
Here are my answers to the above questions:
Q: 2. Add `'IL'` to the `states` target. Then call `scmake()` again. It builds `IL_data` for you, right? Cool! But there's something inefficient happening here, too - what is it?

A: It built `WI_data`, `MN_data`, and `MI_data` again even though there was no need to download those files again. This happened because those three targets each depend on `oldest_active_sites`, the inventory object, and that object changed to include information about a gage in Illinois. It would be ideal if each task target only depended on exactly the values that determine whether the data need to be downloaded again.

Q: 3. Make a small change to the `get_site_data()` function: change `Sys.sleep(2)` to `Sys.sleep(0.5)`. Then call `scmake()` again. What's wrong with the output you see? Can you guess why this is happening?

A: It didn't rebuild anything, even though the `get_site_data()` function changed. The change we made doesn't actually change the output files from this function, but scipiper doesn't know that; it should have rebuilt all of the `_data` targets. This happened because `scmake()` looks to remake.yml by default, and at that level, there's no indication that the `state_tasks` target depends on the definition of the `get_site_data()` function.
We'll solve the problem with (3) here and will deal with (2) in the next issue.
⌨️ Activity: Declare all the dependencies
To ensure that a task-table target like `state_tasks` always rebuilds when there are changes to the dependencies of the targets in 123_state_tasks.yml, we need to declare all of those dependencies within the `command` or `depends` field of the `state_tasks` target. Currently `oldest_active_sites` is the only dependency of the `XX_data` targets that is already declared (because it happens to also be needed to construct the task remakefile). Let's declare the rest.
- The undeclared dependency we've already identified is `get_site_data()`. We can't directly declare functions as dependencies, but we can get pretty close by declaring the source code file (1_fetch/src/get_site_data.R) as a dependency. Specifically, we can identify the code file as a dependency by including the filename as an argument to the `do_state_tasks()` function. Amend the declaration of the `do_state_tasks()` function to have a `...` argument: `do_state_tasks <- function(oldest_active_sites, ...)`. Then in remake.yml, add `'1_fetch/src/get_site_data.R'` as an unnamed argument in the call to `do_state_tasks()`.
- It would be ideal to also declare `parameter` as a dependency of `state_tasks`, because `parameter` is needed by each call to `get_site_data()`. It's not strictly necessary in this example because `oldest_active_sites` will almost certainly change if `parameter` changes...but oh, heck, go ahead and add it anyway to build good habits. Rather than including `parameter` as an argument to the `do_state_tasks()` call, add it to the (new) `depends` field for the `state_tasks` target.
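Combining both bullets, the `state_tasks` target in remake.yml would end up looking something like this sketch:

```yaml
state_tasks:
  command: do_state_tasks(oldest_active_sites, '1_fetch/src/get_site_data.R')
  depends:
    - parameter
```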
When you're satisfied with your code, commit your changes to existing R and YAML files, and also commit the new 123_state_tasks.yml file.
We've gone both ways on committing XYZ_tasks.yml files (such as 123_state_tasks.yml) in the past.
- Con to committing: These files are automatically generated, so it's not technically necessary to commit them.
- Pro: It's often convenient to commit them so they're visible to all teammates for discussion and debugging.
- Con: For large numbers of tasks, these files get to be so long that they're no longer easy to inspect using git or GitHub, so the benefits of committing them diminish.
- Pro: Even for large numbers of tasks, these YAML files are seldom too big to create actual problems with git.
Therefore, our current team policy is to usually commit these auto-generated YAML files, with optional exceptions for really large task tables.
Next, create a pull request with the final results of all the changes you've made for this issue.
I'll respond when I see your PR.