⌨️ Activity: Switch to a new branch
Before you edit any code, create a local branch called "task-table" and push that branch up to the remote location "origin" (which is the GitHub host of your repository).
```
git checkout master
git pull origin master
git checkout -b task-table
git push -u origin task-table
```
The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to master
and sync with "origin" whenever you're transitioning between branches and/or PRs.
Comment on this issue once you've created and pushed the "task-table" branch.
from ds-pipelines-3.
To get you started with coding, I've added a new file to the master branch (which you should have already pulled into your "task-table" branch) called 123_state_tasks.R. This is the file where you'll write code to split a body of work into tasks, apply operations to each task, and combine the results back into end products.
Design decisions
Before we get to editing, let's briefly discuss this file:
- Why an R script? Well, we've experimented with several ways of coding up task tables in past projects, and we've settled on a pattern where we wrap everything needed for a task table into a single R function (in this script) that can be called to generate a single target in the top-level remake.yml. The alternative would be to make separate scipiper targets for the task plan, the task makefile, and the task table output, but in practice this has turned out to be inconvenient, especially for team projects where we only want one person to run the tasks within a task table. You'll learn more about this in the "Shared-cache pipelines" course.
- Why "123", and why isn't this file in a phase-specific src folder? Well, by the end of this course, we'll be doing data fetching, processing, and visualization steps for each state (the download, tally, and plot steps in the diagram at the top of this issue). Because we made the design decision to separate our workflow phases into 1_fetch, 2_process, and 3_visualize, this task-table script crosses those three phases. Hence the "123", and hence the decision to keep this file at the top level rather than including it within just one of the three phases. Note that this isn't the only way we could have gone - we could have defined phases 1_inventory, 2_statewise, and 3_summarize instead, then moved all the state-by-state code (including this file) into the second phase - but I didn't think "statewise" would be sufficiently clear to newcomers, and so here we are. This is the kind of pipelining decision we are frequently confronted with - choose wisely, but also enjoy the chance to be creative!
With that intro out of the way, let's get going on this task table already!
⌨️ Activity: Define your rows and columns
Connect to remake.yml
Connect this starter function to the remake.yml file. The function has well-formed (albeit boring) outputs already.
- Remember how last issue you added three targets beneath the line that said `# TODO: PULL SITE DATA HERE`? Well, now you should delete those targets and replace them with a recipe that calls the `do_state_tasks()` function.

```yaml
  # TODO: PULL SITE DATA HERE
  state_tasks:
    command: do_state_tasks(oldest_active_sites)
```
- Remove those three `_data` targets from the `depends` list of the `main` target and replace them with `state_tasks`.
- Add "123_state_tasks.R" to the `sources` section of `remake.yml`.
- Add scipiper to the `packages` section of `remake.yml`, because shortly we'll be calling scipiper functions within pipeline recipes, including the recipe for `state_tasks`.
- Make sure the connection works by calling `print(scmake('state_tasks'))`. You should see

```
$example_target_name
[1] "WI_download"

$example_command
[1] "download(I('WI'))"
```

You can call this same command as you're revising code in the next couple of steps to check your progress.
Define the rows
Now modify 123_state_tasks.R to define the rows of your task table.
- Define the rows by creating a vector of 2-digit state codes where it says `# TODO: DEFINE A VECTOR OF TASK NAMES HERE`. Use information from `oldest_active_sites`, which is already an argument to the `do_state_tasks()` function. You won't need much code.
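For reference, the row definition can be as simple as pulling one column out of the inventory. Here's a minimal sketch - the column name `state_cd` is an assumption, so inspect `oldest_active_sites` to see what your inventory actually calls it:

```r
# Inside do_state_tasks(): define one task per state in the inventory.
# NOTE: 'state_cd' is an assumed column name; check names(oldest_active_sites).
task_names <- oldest_active_sites$state_cd
```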
Define the columns
Still in 123_state_tasks.R, modify the existing column definition for `download_step` so that it pulls the data from NWIS for each state's oldest monitoring site, referring to `?create_task_step` for help on the syntax.
- Modify the `target_name` argument to `create_task_step()` so that each target (task-step) within this column will get a name like `WI_data`. The `target_name` argument should be a function of the form `function(task_name, step_name, ...) {}` where the body of the function constructs and returns a string for each combination of `task_name` (e.g., 'WI') and `step_name` (where we've already defined this step name to be 'download'). You can ignore the `step_name` this time. When it comes time to create the task plan (the R list), this function will get applied to each value of `task_name` in a vector of `task_names`.
- Modify the `command` argument to `create_task_step()` so that each command within this column will look like the commands you wrote for `wi_data`, `mn_data`, and `mi_data` in `remake.yml` in the previous issue.
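Put together, the download step might look something like this sketch. The command string here matches the expected output shown later in this issue; treat the exact formatting as illustrative:

```r
download_step <- create_task_step(
  step_name = 'download',
  # Each task-step target gets a name like 'WI_data'
  target_name = function(task_name, step_name, ...) {
    sprintf('%s_data', task_name)
  },
  # Each command mirrors the wi_data/mn_data/mi_data recipes from last issue
  command = function(task_name, ...) {
    sprintf("get_site_data(sites_info=oldest_active_sites, state=I('%s'), parameter=parameter)", task_name)
  }
)
```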
Test
When you're ready, call `print(scmake('state_tasks'))` and paste the output into a new comment on this issue.
I'll respond when I see your comment.
```
> print(scmake('state_tasks'))
Starting build at 2020-07-07 16:03:24
<  MAKE > state_tasks
[    OK ] states
[    OK ] parameter
[    OK ] oldest_active_sites
[ BUILD ] state_tasks |  state_tasks <- do_state_tas...
[  READ ]             |  # loading packages
Retrieving data for WI-04073500
Finished build at 2020-07-07 16:03:30
Build completed in 0.11 minutes

$example_target_name
[1] "WI_data"

$example_command
# A tibble: 44,744 x 6
   State Site     Date       Value Quality Parameter
   <chr> <chr>    <date>     <dbl> <chr>   <chr>
 1 WI    04073500 1898-01-01   500 A       Flow
 2 WI    04073500 1898-01-02   500 A       Flow
 3 WI    04073500 1898-01-03   500 A       Flow
 4 WI    04073500 1898-01-04   500 A       Flow
 5 WI    04073500 1898-01-05   500 A       Flow
 6 WI    04073500 1898-01-06   475 A       Flow
 7 WI    04073500 1898-01-07   500 A       Flow
 8 WI    04073500 1898-01-08   500 A       Flow
 9 WI    04073500 1898-01-09   500 A       Flow
10 WI    04073500 1898-01-10   500 A       Flow
# … with 44,734 more rows
```
Check your progress
You should have seen this output when you ran `print(scmake('state_tasks'))`:

```
$example_target_name
[1] "WI_data"

$example_command
[1] "get_site_data(sites_info=oldest_active_sites, state=I('WI'), parameter=parameter)"
```
If you're not there yet, keep trying until your output matches mine. Then proceed:
⌨️ Activity: Create the task plan
Sketch the plan
`create_task_plan()` generates an R list that defines your plan. To use this function in 123_state_tasks.R,

- Replace this code chunk

```r
# Return test results to the parent remake file
return(list(
  example_target_name = download_step$target_name(task_name='WI'),
  example_command = download_step$command(task_name='WI')
))
```

with this one:

```r
# Create the task plan
task_plan <- create_task_plan(
  task_names = YOUR_CODE_HERE,
  task_steps = YOUR_CODE_HERE,
  add_complete = FALSE)

# Return test results to the parent remake file
return(yaml::as.yaml(task_plan))
```
Flesh out the plan
Now modify the new block:
- Assign the task names you defined above to the `task_names` argument.
- Assign a `list` of steps to the `task_steps` argument. In this case there will just be one step in the list.
- Leave `add_complete = FALSE` as it is. Feel free to experiment later with changing this argument to `TRUE`, but it's not relevant to the current exercise. You can learn more about this and other arguments with a call to `?create_task_plan` if and when you're ready.
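Filled in, the new block might look like this sketch (assuming you stored the state codes in a vector called `task_names` and defined `download_step` as in the earlier steps of this issue):

```r
# Create the task plan
task_plan <- create_task_plan(
  task_names = task_names,          # e.g., c('WI', 'MN', 'MI')
  task_steps = list(download_step), # just the one step for now
  add_complete = FALSE)
```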
Test
Note that we're now returning `yaml::as.yaml(task_plan)` from this function. This is still a temporary return value but gives you a way to inspect what you've created. It's also possible to print out the raw value of `task_plan` - it's just an R list, after all - but converting it to YAML makes it more concise and human-readable. The `cat` call suggested next makes the YAML text print nicely to the console.
When you're ready, call `cat(scmake('state_tasks'))` and paste the output into a new comment on this issue.
I'll respond when I see your comment.
```yaml
WI:
  task_name: WI
  steps:
    download:
      step_name: download
      target_name: WI_data
      depends: []
      command: |-
        get_site_data(sites_info = oldest_active_sites,
          state = I('WI'),
          parameter = parameter)
MN:
  task_name: MN
  steps:
    download:
      step_name: download
      target_name: MN_data
      depends: []
      command: |-
        get_site_data(sites_info = oldest_active_sites,
          state = I('MN'),
          parameter = parameter)
MI:
  task_name: MI
  steps:
    download:
      step_name: download
      target_name: MI_data
      depends: []
      command: |-
        get_site_data(sites_info = oldest_active_sites,
          state = I('MI'),
          parameter = parameter)
```
Check your progress
You should have seen this output when you ran `cat(scmake('state_tasks'))`:

```yaml
WI:
  task_name: WI
  steps:
    download:
      step_name: download
      target_name: WI_data
      depends: []
      command: get_site_data(sites_info=oldest_active_sites, state=I('WI'), parameter=parameter)
MN:
  task_name: MN
  steps:
    download:
      step_name: download
      target_name: MN_data
      depends: []
      command: get_site_data(sites_info=oldest_active_sites, state=I('MN'), parameter=parameter)
MI:
  task_name: MI
  steps:
    download:
      step_name: download
      target_name: MI_data
      depends: []
      command: get_site_data(sites_info=oldest_active_sites, state=I('MI'), parameter=parameter)
```
If you're not there yet, keep trying until your output matches mine. Then proceed:
⌨️ Activity: Create the task remakefile
The final step in creating a task plan is to convert the R list task plan into a YAML file that scipiper can understand. To use the `create_task_makefile()` function in 123_state_tasks.R,

- Replace this code chunk

```r
# Return test results to the parent remake file
return(yaml::as.yaml(task_plan))
```

with this one:

```r
# Create the task remakefile
create_task_makefile(
  # TODO: ADD ARGUMENTS HERE
  tickquote_combinee_objects = FALSE,
  finalize_funs = c())

# Return nothing to the parent remake file
return()
```
Refine the makefile
- Now modify the new block. Refer to the `?create_task_makefile` documentation to identify and use the right arguments to:
  - Pass in the `task_plan`.
  - Write out the remakefile to 123_state_tasks.yml.
  - Tell scipiper to connect the dependencies of targets in this remakefile to the targets in the main remake.yml file when executing the remakefile. Use the `include` argument for this purpose.
  - Tell scipiper to load the R script that defines the `get_site_data()` function when executing the remakefile.
  - Tell scipiper to load the packages needed to execute the `get_site_data()` function when executing the remakefile.
  - Leave `tickquote_combinee_objects = FALSE` and `finalize_funs = c()`. We'll explore these arguments later.
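One plausible shape for the finished call is sketched below. Confirm the argument names against `?create_task_makefile` before copying anything, and note that the `packages` vector here is purely an assumption about what `get_site_data()` might need:

```r
# Sketch only - verify argument names with ?create_task_makefile
create_task_makefile(
  task_plan = task_plan,
  makefile = '123_state_tasks.yml',
  include = 'remake.yml',
  sources = '1_fetch/src/get_site_data.R',
  packages = c('dataRetrieval', 'dplyr'),  # assumed; list what get_site_data() uses
  tickquote_combinee_objects = FALSE,
  finalize_funs = c())
```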
Test
Now we're `return`ing nothing from this function, because the current effect of this function is to create a file. If this were our end goal for this function, we would change the target in remake.yml to 123_state_tasks.yml...but since we'll shortly change the file again, we won't bother. Just run `scmake('state_tasks')` to create the file.
And then check out your file! You should now see 123_state_tasks.yml in the top-level directory. Open it; aside from some header comments and extra sections, you should recognize the format of the file as being similar to that of remake.yml. Refer to this remake documentation page for explanations of any sections you don't yet understand. There will also be one target in the new file that you did not define as a task step - do you understand the definition and utility of that target?
Explore
You now have a new remakefile. Put it through its paces to make sure it's working as expected and you understand why. Some things to try:
- Run `remake::diagram(remake_file='remake.yml')` and `remake::diagram(remake_file='123_state_tasks.yml')`. Do you understand the relationship between the two diagrams?
- Set the `include` argument to `c()` in `create_task_makefile()`, build the remakefile again with `scmake('state_tasks')`, and run `remake::diagram(remake_file='123_state_tasks.yml')`. Do you understand the resulting error? Set the `include` argument back to its original value once you're done experimenting.
- Run `scmake('WI_data', remake_file='123_state_tasks.yml')`, potentially revising your call to `create_task_makefile()` if needed, until you can get the target to build successfully. (Note that you can't just edit `123_state_tasks.R` and see the changes immediately reflected in `scmake('WI_data', remake_file='123_state_tasks.yml')` - you need to call `scmake('state_tasks')` after editing. This problem will go away once our task table function is fully connected to the main pipeline.)
- Run `scmake('123_state_tasks', remake_file='123_state_tasks.yml')` until you've downloaded data for all three states' gages.
When you're done exploring, paste the output of a successful call to `scmake('123_state_tasks', remake_file='123_state_tasks.yml')` into a new comment on this issue.
I'll respond when I see your comment.
```
> scmake('123_state_tasks', remake_file='123_state_tasks.yml')
Starting build at 2020-07-07 17:05:42
<  MAKE > 123_state_tasks
[    OK ] states
[    OK ] parameter
[    OK ] oldest_active_sites
[    OK ] WI_data
[ BUILD ] MN_data |  MN_data <- get_site_data(si...
Retrieving data for MN-05211000
[ BUILD ] MI_data |  MI_data <- get_site_data(si...
Retrieving data for MI-04063522
[ ----- ] 123_state_tasks
Finished build at 2020-07-07 17:05:52
Build completed in 0.17 minutes
>
```
Check your progress
You may have had to call `scmake('123_state_tasks', remake_file='123_state_tasks.yml')` a few times to get through any [pretend] failures in the data pulls, but ultimately you should have seen something like this output:

```
> scmake('123_state_tasks', remake_file='123_state_tasks.yml')
Starting build at 2020-05-20 20:58:42
<  MAKE > 123_state_tasks
[    OK ] states
[    OK ] parameter
[    OK ] oldest_active_sites
[    OK ] WI_data
[    OK ] MN_data
[ BUILD ] MI_data |  MI_data <- get_site_data(sites_info = oldest_active_sites, state = "MI", ...
Retrieving data for site 04063522
[ ----- ] 123_state_tasks
Finished build at 2020-05-20 20:58:46
Build completed in 0.07 minutes
```
If you're not there yet, keep trying until your output matches mine. Then proceed:
⌨️ Activity: Connect the task remakefile to remake.yml
Now that your function creates a complete and functional task remakefile, the remaining step is to revise the connection between the main remake.yml and the `do_state_tasks()` function: edit `do_state_tasks()` so that it not only creates but also builds the task remakefile.

- Add these lines toward the end of `do_state_tasks()`, before the `return()` statement:

```r
# Build the tasks
scmake('123_state_tasks', remake_file='123_state_tasks.yml')
```
Wait, what?? You can call `scmake()` within a function that we're calling from a `remake.yml` target? Yep, sure can! It just works. (OK, mostly works - there's a gotcha we'll get into in the "Shared-cache pipelines" course, but it doesn't apply here.)
- Add `state_tasks` to the `depends` list for the `main` target in remake.yml.
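The amended `main` target would look roughly like this sketch (the comment stands in for whatever your depends list already contains):

```yaml
main:
  depends:
    - state_tasks
    # ...plus the other depends entries already in your remake.yml
```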
That's it, you did it! For now, anyway.
Test
- Call `scmake()` (with no arguments) until all data files have been downloaded.
- Add `'IL'` to the `states` target. Then call `scmake()` again. It builds `IL_data` for you, right? Cool! But there's something inefficient happening here, too - what is it? Can you guess why this is happening?
- Make a small change to the `get_site_data()` function: change `Sys.sleep(2)` to `Sys.sleep(0.5)`. Then call `scmake()` again. What's wrong with the output you see? Can you guess why this is happening?
Answer the questions from 2 and 3 above in a new comment on this issue.
I'll respond when I see your comment.
- Adding IL caused data for all states to be re-pulled. All the `state_data` targets depend on `oldest_active_sites`, so any time there is a change to `oldest_active_sites` they all have to be rebuilt.
- Nothing rebuilds with `scmake` after this change, because `get_site_data` is only used in `123_state_tasks.yml` and not the main remake file. Unless the `123_state_tasks.yml` file is included in the main remake file, the main remake file won't watch for changes to that function.
Check your progress
Here are my answers to the above questions:
Q: 2. Add `'IL'` to the `states` target. Then call `scmake()` again. It builds `IL_data` for you, right? Cool! But there's something inefficient happening here, too - what is it?

A: It built `WI_data`, `MN_data`, and `MI_data` again even though there was no need to download those files again. This happened because those three targets each depend on `oldest_active_sites`, the inventory object, and that object changed to include information about a gage in Illinois. It would be ideal if each task target only depended on exactly the values that determine whether the data need to be downloaded again.

Q: 3. Make a small change to the `get_site_data()` function: change `Sys.sleep(2)` to `Sys.sleep(0.5)`. Then call `scmake()` again. What's wrong with the output you see? Can you guess why this is happening?

A: It didn't rebuild anything, even though the `get_site_data()` function changed. The change we made doesn't actually change the output files from this function, but scipiper doesn't know that; it should have rebuilt all of the `_data` targets. This happened because `scmake()` looks to remake.yml by default, and at that level, there's no indication that the `state_tasks` target depends on the definition of the `get_site_data()` function.
We'll solve the problem with (3) here and will deal with (2) in the next issue.
⌨️ Activity: Declare all the dependencies
To ensure that a task-table target like `state_tasks` always rebuilds when there are changes to the dependencies of the targets in 123_state_tasks.yml, we need to declare all of those dependencies within the `command` or `depends` field of the `state_tasks` target. Currently `oldest_active_sites` is the only dependency of the `XX_data` targets that is already declared (because it happens to also be needed to construct the task remakefile). Let's declare the rest.
- The undeclared dependency we've already identified is `get_site_data()`. We can't directly declare functions as dependencies, but we can get pretty close by declaring the source code file (1_fetch/src/get_site_data.R) as a dependency. Specifically, we can identify the code file as a dependency by including the filename as an argument to the `do_state_tasks()` function. Amend the declaration of the `do_state_tasks()` function to have a `...` argument: `do_state_tasks <- function(oldest_active_sites, ...)`. Then in remake.yml, add `'1_fetch/src/get_site_data.R'` as an unnamed argument in the call to `do_state_tasks()`.
- It would be ideal to also declare `parameter` as a dependency of `state_tasks`, because `parameter` is needed by each call to `get_site_data()`. It's not strictly necessary in this example because `oldest_active_sites` will almost certainly change if `parameter` changes...but oh, heck, go ahead and add it anyway to build good habits. Rather than including `parameter` as an argument to the `do_state_tasks()` call, add it to the (new) `depends` field for the `state_tasks` target.
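Combining both bullets, the `state_tasks` target in remake.yml would end up looking something like this sketch:

```yaml
state_tasks:
  command: do_state_tasks(oldest_active_sites, '1_fetch/src/get_site_data.R')
  depends:
    - parameter
```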
When you're satisfied with your code, commit your changes to existing R and YAML files, and also commit the new 123_state_tasks.yml file.
We've gone both ways on committing XYZ_tasks.yml files (such as 123_state_tasks.yml) in the past.
- Con to committing: These files are automatically generated, so it's not technically necessary to commit them.
- Pro: It's often convenient to commit them so they're visible to all teammates for discussion and debugging.
- Con: For large numbers of tasks, these files get to be so long that they're no longer easy to inspect using git or GitHub, so the benefits of committing them diminish.
- Pro: Even for large numbers of tasks, these YAML files are seldom too big to create actual problems with git.
Therefore, our current team policy is to usually commit these auto-generated YAML files, with optional exceptions for really large task tables.
Next, create a pull request with the final results of all the changes you've made for this issue.
I'll respond when I see your PR.