⌨️ Activity: Switch to a new branch
Before you edit any code, create a local branch called "combiners" and push that branch up to the remote location "origin" (which is the GitHub host of your repository).

```
git checkout master
git pull origin master
git checkout -b combiners
git push -u origin combiners
```

The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to master and sync with "origin" whenever you're transitioning between branches and/or PRs.
Comment on this issue once you've created and pushed the "combiners" branch.
from ds-pipelines-3.
a
⌨️ Activity: Add a data combiner
Write `combine_obs_tallies()`

- Add a new function called `combine_obs_tallies` somewhere in 123_state_tasks.R. The function declaration should be `function(...)`; when the function is actually called, you can anticipate that the arguments will be a bunch of tallies tibbles (tidyverse data frames). Your function should return the concatenation of these tibbles into one very tall tibble.
- Test your `combine_obs_tallies()` function. Run

  ```r
  source('123_state_tasks.R') # load `combine_obs_tallies()`
  WI_tally <- remake::fetch('WI_tally', remake_file='123_state_tasks.yml')
  MN_tally <- remake::fetch('MN_tally', remake_file='123_state_tasks.yml')
  IA_tally <- remake::fetch('IA_tally', remake_file='123_state_tasks.yml')
  combine_obs_tallies(WI_tally, MN_tally, IA_tally)
  ```

  The result should be a tibble with four columns and as many rows as the sum of the number of rows in `WI_tally`, `MN_tally`, and `IA_tally`. If you don't have it right yet, keep fiddling and/or ask for help.
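If you're stuck, here's one minimal way to write the combiner, assuming dplyr is available (this project already declares the tidyverse); your own version may differ:

```r
library(dplyr)

# Concatenate any number of tally tibbles into one tall tibble.
# bind_rows() aligns the columns (Site, State, Year, NumObs) by name,
# so the output has the same four columns and the summed row count.
combine_obs_tallies <- function(...) {
  bind_rows(...)
}
```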
Prepare the task plan and task makefile to use `combine_obs_tallies()`

- Set the `final_steps` argument of your call to `create_task_plan()` to `'tally'` (which should be the `step_name` of your tally step) - this tells scipiper to pass [only] the results of the "tally" task-steps into your combiner.
- Set `as_promises=FALSE` and `tickquote_combinee_objects=TRUE` in your call to `create_task_makefile()` within `do_state_tasks()`.
- Add/edit the values of the `final_targets` and `finalize_funs` arguments in the `create_task_makefile()` call to specify that you want one combiner target that runs the function `combine_obs_tallies()` and produces an R object target named `obs_tallies`.
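Taken together, the bullets above imply a call shaped something like the sketch below. Only `final_steps`, `final_targets`, `finalize_funs`, `as_promises`, and `tickquote_combinee_objects` come from the instructions; the `task_plan`, `makefile`, and `sources` values are assumptions about how your `do_state_tasks()` is already set up:

```r
# Hypothetical sketch; adapt names to your existing do_state_tasks() code
create_task_makefile(
  task_plan = task_plan,
  makefile = '123_state_tasks.yml',
  sources = '123_state_tasks.R',
  final_targets = 'obs_tallies',            # name of the combiner target
  finalize_funs = 'combine_obs_tallies',    # function that builds it
  as_promises = FALSE,
  tickquote_combinee_objects = TRUE)
```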
Connect the pipeline
- Add `'123_state_tasks.R'` as yet another unnamed argument in the recipe for `state_tasks` in remake.yml. This exercise should be familiar; you need to specify your code sources here (and propagate them through to `create_task_makefile()` using the `...` argument to `do_state_tasks()`) so that both 123_state_tasks.yml and remake.yml can see them.
- Edit the `# Build the tasks` code chunk within `do_state_tasks()` so that the target that gets built is `obs_tallies` and the output is assigned to a local variable also named `obs_tallies`.
- Return `obs_tallies` as the output of `do_state_tasks()`. Change the `# Return nothing...` comment to match what you're now doing.
Test
Run `state_tasks <- scmake('state_tasks')`, then answer these questions:

- Inspect the console output. Which task steps (`download`, `tally`, and/or `plot`) are no longer getting built or checked? Inspect 123_state_tasks.yml to see if you can figure out why.
- Inspect the value of `state_tasks`. Is it what you expected?
When you're feeling confident, add a comment to this issue with your answers to the two questions above.
I'll respond when I see your comment.
The `download` and `tally` steps are checked; `plot` is not. The plot steps are not tied into the final combiner target, which is what we just built:
State tasks contains the summarized tally of observations for each state. It might make more sense for it to contain some representation that all the tasks were completed (including the plots), not just the intermediate tally of observations.
Check your progress
Some answers to compare to your own:
1. Inspect the console output. Which task steps (`download`, `tally`, and/or `plot`) are no longer getting built or checked? Inspect 123_state_tasks.yml to see if you can figure out why.

The `plot` task-steps are no longer getting built or checked. They're still there in 123_state_tasks.yml, but we're now only building the `obs_tallies` target, which depends on the `download` and `tally` steps but not on the `plot` steps. Also, the plots are no longer listed even as dependencies of the default target (`123_state_tasks`). The `download` and `tally` steps also got removed from the default target dependencies, but the default target does depend on the tallies combiner, which depends on each of the `tally` steps, which in turn depend on the `download` steps, so that's why the `tally` and `download` steps still get considered.
2. Inspect the value of `state_tasks`. Is it what you expected?

Here's what my `state_tasks` looks like. Your number of rows might vary slightly if you build this at a time when the available data have changed substantially, but the column structure and approximate number of rows ought to be about the same. If it looks like this, then it meets my expectations and hopefully also yours.
```r
> state_tasks
# A tibble: 738 x 4
# Groups:   Site, State [6]
   Site     State  Year NumObs
   <chr>    <chr> <dbl>  <int>
 1 04073500 WI     1898    365
 2 04073500 WI     1899    365
 3 04073500 WI     1900    365
 4 04073500 WI     1901    365
 5 04073500 WI     1902    365
 6 04073500 WI     1903    365
 7 04073500 WI     1904    366
 8 04073500 WI     1905    365
 9 04073500 WI     1906    365
10 04073500 WI     1907    365
# … with 728 more rows
```
⌨️ Activity: Explore as_promises
We stuck with the name `state_tasks` in the main pipeline, but this target would now be more aptly named `obs_tallies`.

- Try changing the target name from `state_tasks` to `obs_tallies` in remake.yml (do a whole-word find-replace to change it everywhere it occurs in that file).
- Run `scmake()` again. What happens? Identify the line in 123_state_tasks.yml that defines a target of the same name.
Hmm. It would be nice if we could use the same name to refer to the same information (a table of observation tallies) in both remake.yml and the task table, but it appears that scipiper won't let us. This is where the `as_promises` argument to `create_task_makefile()` comes in.
- Change `as_promises` from `FALSE` to `TRUE`.
- Leave the `final_targets` argument alone (set to `obs_tallies`).
- Change `obs_tallies <- scmake('obs_tallies', remake_file='123_state_tasks.yml')` to `obs_tallies <- scmake('obs_tallies_promise', remake_file='123_state_tasks.yml')` (a few lines down from the call to `create_task_makefile()`).
- Rebuild `obs_tallies` from the main remake.yml. Now scipiper lets you do it, right? Check that line you identified in 123_state_tasks.yml to see what changed.
This `as_promises=TRUE` technique is a pattern we've adopted to accommodate the fact that scipiper doesn't allow duplicate target names, but we kinda want them to keep our code clear. It's not perfect, but it does the trick.
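Concretely, the change to look for in 123_state_tasks.yml is a `_promise` suffix on the combiner target, which is what removes the name collision with remake.yml. The generated target should look roughly like this (exact formatting and the set of state arguments are up to scipiper and your `states` vector):

```yaml
# In 123_state_tasks.yml (generated): the promise target no longer
# collides with the obs_tallies target defined in remake.yml
obs_tallies_promise:
  command: combine_obs_tallies(`WI_tally`, `MN_tally`, `IA_tally`)
```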
Comment on this issue when you're ready to proceed.
I'll respond when I see your comment.
a
⌨️ Activity: Use the combiner target downstream
It's time to reap the rewards from your first combiner.
- Create a new target in remake.yml that takes advantage of your new combined tallies. Use the `plot_data_coverage()` function already defined for you (find it by searching or browsing the repository - remember `Ctrl-.`), and pass in `state_tasks` as the `oldest_site_tallies` argument. Set up your target to create a file named "3_visualize/out/data_coverage.png". Remember to add the source file to the `sources` list in remake.yml, and set up your pipeline to build this new target as part of the default build.
- Test your new target by running `scmake()`, then checking out 3_visualize/out/data_coverage.png.
- Test your new pipeline by removing a state from `states` and running `scmake()` once more. Did 3_visualize/out/data_coverage.png get revised? If not, see if you can figure out how to make it so. Ask for help if you need it.
When you've got it, share the image in 3_visualize/out/data_coverage.png as a comment.
I'll respond when I see your comment.
Great, you have a combiner hooked up from start to finish, and you probably learned some things along the way! It's time to add a second combiner that serves a different purpose - here, rather than produce a target that contains the data of interest, we'll produce a combiner target that summarizes the outputs of interest (in this case the state-specific .png files we've already created).
⌨️ Activity: Add a summary combiner
Don't write another combiner
Last time, you wrote your own combiner. This time you just need to check out `combine_to_ind()`, a function provided by scipiper.

- Check out the documentation at `?combine_to_ind`.
- Test it out with a command such as

  ```r
  combine_to_ind('test.yml', '3_visualize/out/timeseries_IA.png', '3_visualize/out/timeseries_MN.png')
  ```

  Check out the contents of test.yml. Then when you're feeling clear on what happened, delete test.yml.
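For orientation, the file written by `combine_to_ind()` is a YAML of filename: hash pairs, one per input file. With the example command above, test.yml should look something like this (your md5 hashes will differ):

```yaml
3_visualize/out/timeseries_IA.png: 05348827dd54c4722c9e01ae9a9adba1
3_visualize/out/timeseries_MN.png: a1347b9e25a16278c7eb5aa4019d1d0a
```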
Prepare the task plan and task makefile to use `combine_to_ind()`

- Add/edit the values of the `final_targets` and `finalize_funs` arguments in the `create_task_makefile()` call to specify that you want a second combiner target that runs the function `combine_to_ind()` and produces a file target named `3_visualize/out/timeseries_plots.yml`. Keep the tallies combiner in place.
- Add another line just below `obs_tallies <- scmake('obs_tallies_promise', remake_file='123_state_tasks.yml')` to build this second combiner. The new line should be:

  ```r
  scmake('timeseries_plots.yml_promise', remake_file='123_state_tasks.yml')
  ```

  Note how the target name for this combiner differs from the target you provided in `final_targets`: it's the filename without the directories, and there's `_promise` at the end. This is the work of `as_promises=TRUE` again, this time as applied to a file target.
- Run `scmake()`. It breaks. Check out the combiner targets at the end of 123_state_tasks.yml to see if you can figure out why before you read the instructions in the next paragraph.
Test and revise `final_steps`

Hmm, you probably just discovered that 123_state_tasks.yml is trying to apply `combine_to_ind()` to your `tally` steps instead of your `plot` steps:

```yaml
timeseries_plots.yml_promise:
  command: combine_to_ind(I('3_visualize/out/timeseries_plots.yml'),
    `WI_tally`,
    `MN_tally`,
    `MI_tally`,
    `IL_tally`,
    `IN_tally`,
    `IA_tally`)
```
In hindsight, that probably makes sense, but it makes the next step a bit tricky. You've already set `final_steps='tally'` in `create_task_plan()`, and that's still useful for the tally combiner. But in order to pass the plot files into `combine_to_ind()`, which is what we need for this new combiner, we'd really like `final_steps='plot'`.
- Set the `final_steps` argument of your call to `create_task_plan()` to `c('tally', 'plot')`, call `scmake()` again, and check out 123_state_tasks.yml once more. How did the combiner functions change?
Hmm, that's an improvement because now both combiners are getting the arguments they need, but it's also a step backward because now neither combiner is getting only the arguments it needs - they're each getting both the `tally` and the `plot` outputs.
Revise the combiners
The solution for this multi-combiner pipeline is to filter the arguments in each combiner. For this particular pipeline, we can distinguish between the two final steps based on their type: the `tally` outputs are `tibble`s, and the `plot` outputs get passed to the combiner as `character` filenames.
- For `combine_obs_tallies()`, add these two lines to the beginning of the function:

  ```r
  # filter to just those arguments that are tibbles (because the only step
  # outputs that are tibbles are the tallies)
  dots <- list(...)
  tally_dots <- dots[purrr::map_lgl(dots, is_tibble)]
  ```

  and then proceed with whatever code you were using to combine the tibbles, this time using `tally_dots` rather than `...`. Depending on the function you used for the combining, you may need to revise that code slightly to take a single argument that's a list of tibbles, rather than a sequence of individual tibble arguments.
For
combine_to_ind()
, it turns out you will need to write your own custom function after all so that you can add in this filtering. Try adding this function to 123_state_tasks.R:summarize_timeseries_plots <- function(ind_file, ...) { # filter to just those arguments that are character strings (because the only # step outputs that are characters are the plot filenames) dots <- list(...) plot_dots <- dots[purrr::map_lgl(dots, is.character)] do.call(combine_to_ind, c(list(ind_file), plot_dots)) }
Then replace
'combine_to_ind'
with'summarize_timeseries_plots'
in thefinalize_funs
argument tocreate_task_makefile()
. -
Run
scmake()
again and then check the contents of 3_visualize/out/data_coverage.png and 3_visualize/out/timeseries_plots.yml to make sure you've succeeded in hooking up both combiners.
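Putting the pieces together, a filtered version of the tallies combiner could look like this sketch (assuming, as in the earlier activity, that `bind_rows()` is doing the combining; note that `bind_rows()` happily accepts a list of tibbles, so no `do.call()` is needed here):

```r
library(dplyr)

combine_obs_tallies <- function(...) {
  # filter to just those arguments that are tibbles (because the only step
  # outputs that are tibbles are the tallies)
  dots <- list(...)
  tally_dots <- dots[purrr::map_lgl(dots, tibble::is_tibble)]
  # bind_rows() accepts a list of data frames directly
  bind_rows(tally_dots)
}
```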
When you're feeling confident, add a comment to this issue with the contents of 3_visualize/out/data_coverage.png and 3_visualize/out/timeseries_plots.yml.
I'll respond when I see your comment.
```yaml
3_visualize/out/timeseries_WI.png: cacd873105e2bb4a951a8ab2277c920d
3_visualize/out/timeseries_MN.png: a1347b9e25a16278c7eb5aa4019d1d0a
3_visualize/out/timeseries_MI.png: 94a4faf9ce43aac5298270f0dd997649
3_visualize/out/timeseries_IL.png: 39aa0cfb865494b22764cae370592d45
3_visualize/out/timeseries_IA.png: 05348827dd54c4722c9e01ae9a9adba1
```
You're getting close! The last step for this second combiner is to connect it to the main pipeline. But this isn't trivial, because right now your code in `do_state_tasks()` creates the `obs_tallies` target in the main pipeline, and we'd like to keep that `obs_tallies` information. How do we get the results of both combiners into the main pipeline all at once?
One function, two outputs?
To connect both combiners to the main pipeline - and more broadly to follow pipelining best practices, ensuring that our pipeline's reproducibility is robust to modification - we need `do_state_tasks()` to create a single target that represents all the effects of the task table that we want to be visible to the pipeline.

Let's take a moment to decide which effects of the task table we want to be visible. For this we need to check our project plans, because what we want does differ by project...ahh, here they are: In this course project we won't ever need to revisit the state-specific data tables again, so we don't need to carry those `WI_data`, `WI_tally`, etc. objects back to the main pipeline. The `obs_tallies` object will be sufficient to store the state tallies, and the timeseries_plots.yml file is sufficient to represent the status of the plot .png files.

Great! So we only have two outputs that need to be represented by `state_tasks`: the big tallies table and the plot summary file. Unfortunately, two outputs is still one too many. How can we tell the main pipeline about these two objects using just one output?
This challenge should be ringing bells for you, because we've actually solved it twice already.
- The first time was with the inventory splitter, where we split the inventory but also created a summary file of the split-up inventory files.
- The second time was with the plot file combiner. Our apply operation had created one plot per state, but that's not easy to use downstream, so we then summarized those plot files into 3_visualize/out/timeseries_plots.yml.
In both cases, we had one function and many outputs...and we saved the day by creating a single summary output. So let's do that once more!
There are actually a few ways to implement this general strategy. So far we've created summary files, but in this case, the output of `do_state_tasks()` could be...
- A faithful representation of the combiner targets as they were produced by 123_state_tasks.yml: A list that contains (1) the contents of the tallies table and (2) a filename and hash describing the plot summary file (yes, that's a summary of a summary file).
- A concise representation of the combiner targets: A list that contains a filename and hash for a tallies table file (in this case we'd write out that table to file) and for the plot summary file.
- A ready-to-go translation of the combiner targets into R objects: A list that contains (1) the contents of the tallies table and (2) the contents of the plot summary file (in this case we'd read in the plot summary file as an R object).
- A file that could be shared with others: A file, perhaps in RDS format, that contains any of the above three options.
⌨️ Activity: Make a multi-output target
For this course, let's go with option 3 from the list above.
- Add a new expression in `do_state_tasks()` right after `scmake('timeseries_plots.yml_promise', remake_file='123_state_tasks.yml')` to read timeseries_plots.yml into a tibble format:

  ```r
  timeseries_plots_info <- yaml::yaml.load_file('3_visualize/out/timeseries_plots.yml') %>%
    tibble::enframe(name = 'filename', value = 'hash') %>%
    mutate(hash = purrr::map_chr(hash, `[[`, 1))
  ```
- Change the return value of `do_state_tasks()` to be a list of both the tallies table and the plot summary tibble:

  ```r
  # Return the combiner targets to the parent remake file
  return(list(obs_tallies=obs_tallies, timeseries_plots_info=timeseries_plots_info))
  ```
- In remake.yml, change the target name for the result of `do_state_tasks()` from `obs_tallies` to `state_combiners`.
. -
Add these two unpacker targets right after the
state_combiners
target (pluck()
is from purrr, which is loaded when you install the already-declared tidyverse package):obs_tallies: command: pluck(state_combiners, target_name) timeseries_plots_info: command: pluck(state_combiners, target_name)
Test
- Run `obs_tallies <- scmake('obs_tallies')` and check the value of `obs_tallies`. Look good?
- Run `timeseries_plots_info <- scmake('timeseries_plots_info')` and check the value of `timeseries_plots_info`. Look good?
Add any comments, questions, or revelations to a comment on this issue.
I'll respond when I see your comment.
I wonder if the combiner target filtering could be more built-in? Perhaps something using a named vector in `final_steps` to allow referencing only targets for a particular step in `final_targets`?
You're down to the last task for this issue! I hope you'll find this one rewarding. After all your hard work, you're now in a position to create a leaflet map that will give you interactive access to the locations, identities, and timeseries plots of the Upper Midwest's oldest gages, all in one .html map. Ready?
Use the plots downstream
- Add another target to remake.yml that uses the function `map_timeseries()` (defined for you in `3_visualize`). `site_info` should be the inventory of oldest sites, `plot_info` should be `timeseries_plots_info`, and the output should be written to `3_visualize/out/timeseries_map.html`.
- Add the three packages that `map_timeseries()` requires to the declaration at the top of remake.yml: `leaflet`, `leafpop`, and `htmlwidgets`.
- Edit remake.yml as needed to ensure that `3_visualize/out/timeseries_map.html` will get built on a call to `scmake()` without arguments. (You should already have `3_visualize/out/data_coverage.png` set up for this. Also, by declaring both `3_visualize/out/timeseries_map.html` and `3_visualize/out/data_coverage.png` as elements of the default target, you will have ensured that `obs_tallies` and `timeseries_plots_info` will get built, so you don't need to declare those directly.)
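The first bullet might translate into a remake.yml target shaped like the sketch below. Treat it as an approximation: `oldest_sites` is a stand-in for whatever your inventory target is actually named, and `out_file` is an assumed name for `map_timeseries()`'s output-path argument (check the function's definition in `3_visualize` for the real one):

```yaml
3_visualize/out/timeseries_map.html:
  command: map_timeseries(site_info = oldest_sites, plot_info = timeseries_plots_info, out_file = target_name)
```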
Test
- Run `scmake()`. Any surprises?
- Check out the results of your new map by opening 3_visualize/out/timeseries_map.html in the browser. You should be able to hover and click on each marker.
- Add or subtract a state from the `states` vector and rerun `scmake()`. Did you see the rebuilds and non-rebuilds that you expected? Did the html file change as expected?
Make a pull request
It's finally time to submit your work.
- Commit your code changes for this issue and make sure you're `.gitignore`-ing the new analysis products (the .png and .html files). Push your changes to the GitHub repo.
- Create a PR to merge the "combiners" branch into "master". Share a screenshot of 3_visualize/out/timeseries_map.html and any thoughts you want to share in the PR description.
I'll respond when I see your PR.
> I wonder if the combiner target filtering could be more built-in? Perhaps something using a named vector in final_steps to allow referencing only targets for a particular step in final_target?

Agreed. This pattern does make you want more customization in the combiners for sure.
> I wonder if the combiner target filtering could be more built-in? Perhaps something using a named vector in final_steps to allow referencing only targets for a particular step in final_target?
>
> Agreed. This pattern does make you want more customization in the combiners for sure.

Yep. That was a pain point as I was working on this course. Also noted in this issue: DOI-USGS/scipiper#113