⌨️ Activity: Switch to a new branch
Before you edit any code, create a local branch called "appliers" and push that branch up to the remote location "origin" (which is the GitHub host of your repository).
```shell
git checkout master
git pull origin master
git checkout -b appliers
git push -u origin appliers
```
The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to master
and sync with "origin" whenever you're transitioning between branches and/or PRs.
Comment on this issue once you've created and pushed the "appliers" branch.
from ds-pipelines-3.
⌨️ Activity: Add two new appliers

Code

In 123_state_tasks.R:

- Add a new step right after `download_step`. This step object should be called `plot_step`, should have step name `'plot'`, should create targets called 3_visualize/out/timeseries_WI.png, 3_visualize/out/timeseries_MN.png, etc., should call the `plot_site_data()` function (defined in 3_visualize/src/plot_site_data.R), and should make use of the targets created in `download_step`. (Hint: It's fine to link backward to the downloading targets using `sprintf()` or another string manipulation function, but if you want to get really fancy, try out the `steps` argument to your `command` function.)
- Add a third step called `tally_step`. This step should have step name `'tally'`, should create R object targets called `WI_tally`, `MN_tally`, etc., should call the `tally_site_obs()` function (also already defined for you), and should make use of the targets created in `download_step` (no need to link to the `plot_step` targets).
- Add `plot_step` and `tally_step` to the call to `create_task_plan()`.
- Add the two new function files (where `plot_site_data()` and `tally_site_obs()` are defined) to the `sources` argument in your `create_task_makefile()` call.
- Add the lubridate package to the `packages` argument in your `create_task_makefile()` call (it's used in `tally_site_obs()`).
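To give a feel for the shape of these step definitions, here's a sketch built on scipiper's `create_task_step()`. Treat the details as assumptions: in particular, the downloading targets are assumed to be named like `WI_data` (one per task/state), and the exact arguments to `plot_site_data()` and `tally_site_obs()` may differ in your repository.

```r
library(scipiper)

# Hypothetical sketch of the two new steps. The '%s_data' target names and
# the function arguments are illustrative assumptions, not the course's
# exact code.
plot_step <- create_task_step(
  step_name = 'plot',
  # one png target per state, e.g. 3_visualize/out/timeseries_WI.png
  target_name = function(task_name, step_name, ...) {
    sprintf('3_visualize/out/timeseries_%s.png', task_name)
  },
  # link backward to the download step's target with sprintf()
  command = function(task_name, ...) {
    sprintf("plot_site_data(out_file = target_name, site_data = %s_data)",
            task_name)
  }
)

tally_step <- create_task_step(
  step_name = 'tally',
  # one R-object target per state, e.g. WI_tally
  target_name = function(task_name, step_name, ...) {
    sprintf('%s_tally', task_name)
  },
  command = function(task_name, ...) {
    sprintf("tally_site_obs(site_data = %s_data)", task_name)
  }
)
```

Both steps would then be passed alongside `download_step` in the `steps` list given to `create_task_plan()`.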
Test

- Run `scmake('state_tasks')`. Is it building a timeseries plot and a `tally` object for each state? If not, keep fiddling with your code until you get it to work.
- Check the contents of the 3_visualize/out directory and inspect at least one of the plots. How do they look?
- Assign the value of `IN_tally` to a variable of the same name in your global environment. You can use the `scipiper::scmake()` function or the `remake::fetch()` function. Either function will require a bit of special syntax - review `?scmake` or `?remake::fetch` for clues and ask if you get stuck.
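As a nudge for that last bullet, one plausible approach uses `remake::fetch()` pointed at the task makefile; the remake-file name here is an assumption based on this course's conventions, so check `?remake::fetch` for the exact syntax:

```r
# Hypothetical sketch: fetch the built R-object target from the task
# makefile (assuming it is named 123_state_tasks.yml) into the global
# environment under the same name.
IN_tally <- remake::fetch('IN_tally', remake_file = '123_state_tasks.yml')
```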
When you're feeling confident, add a comment to this issue with:

- an image from one of the new plots in 3_visualize/out, and
- a printout of the first 10 lines of `IN_tally`.
I'll respond when I see your comment.
```r
> IN_tally
# A tibble: 110 x 4
# Groups:   Site, State [1]
   Site     State  Year NumObs
   <chr>    <chr> <dbl>  <int>
 1 03373500 IN     1903     92
 2 03373500 IN     1904    366
 3 03373500 IN     1905    365
 4 03373500 IN     1906     90
 5 03373500 IN     1909    275
 6 03373500 IN     1910    365
 7 03373500 IN     1911    365
 8 03373500 IN     1912    366
 9 03373500 IN     1913    365
10 03373500 IN     1914    365
# … with 100 more rows
```
Check your progress
To help you assess your pipeline, here's what I would have put in that comment:
* an image from one of the new plots in 3_visualize/out, and
* a printout of the first 10 lines of `IN_tally`

```r
> IN_tally
# A tibble: 110 x 4
# Groups:   Site, State [1]
   Site     State  Year NumObs
   <chr>    <chr> <dbl>  <int>
 1 03373500 IN     1903     92
 2 03373500 IN     1904    366
 3 03373500 IN     1905    365
 4 03373500 IN     1906     90
 5 03373500 IN     1909    275
 6 03373500 IN     1910    365
 7 03373500 IN     1911    365
 8 03373500 IN     1912    366
 9 03373500 IN     1913    365
10 03373500 IN     1914    365
# … with 100 more rows
```
⌨️ Activity: Spot the split-apply-combine (again)
- Check out the code for `tally_site_obs()`. To strengthen your familiarity with the split-apply-combine paradigm, can you isolate the split, apply, and combine operations within this tidyverse expression?

```r
site_data %>%
  mutate(Year = lubridate::year(Date)) %>%
  # group by Site and State as well as Year to retain those columns, even
  # though we're only looking at one site's worth of data
  group_by(Site, State, Year) %>%
  summarize(NumObs = length(which(!is.na(Value))))
```

Give your answer to the activity in a comment on this issue.
I'll respond when I see your comment.
`group_by` effectively splits, then `summarize` applies the function to generate NumObs for each group, and the results are then combined back into one data frame.
Check your progress
Here's where I think the split-apply-combine paradigm is manifested in the tidyverse:

The split is decided here:

```r
group_by(Site, State, Year) %>%
```

The apply is the expression

```r
length(which(!is.na(Value)))
```

And both apply and combine are orchestrated by `summarize()`.

It's amazing how concise these actions can be in the tidyverse, don't you think? The scipiper version would require a lot more code to do the exact same operation, but it brings the special benefit of only (re)building those elements that aren't already up to date.
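For contrast, the same three phases can be written out explicitly in base R, where each one is a separate call. This is just an illustration with a tiny made-up `site_data`; it is not code from the course:

```r
# Tiny illustrative data frame standing in for one site's data
site_data <- data.frame(
  Site = '03373500', State = 'IN',
  Date = as.Date(c('1903-01-01', '1903-01-02', '1904-06-01')),
  Value = c(1.2, NA, 3.4)
)
site_data$Year <- as.integer(format(site_data$Date, '%Y'))

# Split: one data frame per Site-State-Year combination
groups <- split(site_data,
                interaction(site_data$Site, site_data$State,
                            site_data$Year, drop = TRUE))

# Apply: count the non-NA observations within each group
tallies <- lapply(groups, function(g) {
  data.frame(Site = g$Site[1], State = g$State[1], Year = g$Year[1],
             NumObs = length(which(!is.na(g$Value))))
})

# Combine: stack the per-group results back into one data frame
do.call(rbind, tallies)
```

Seeing the phases spelled out this way makes it clearer how much work `group_by()` + `summarize()` are doing in a single pipeline.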
⌨️ Activity: Revise and rebuild a step
The timeseries plots aren't meant to be publication quality, but it would be nice to touch them up just a bit.
- Revise the title to include the `State` value from the first row of the `site_data` object.
- Run `scmake()` to build the plots again. What happens? Do you know why?
- Run `scmake('state_tasks', force=TRUE)` to force the issue. What happens now? Why should you be uncomfortable with this solution?
Add your answers to a new comment on this issue.
I'll respond when I see your comment.
Nothing rebuilt with `scmake()` because the plot_site_data.R file isn't directly declared as a dependency in the main remake file. Using `force=TRUE` defeats the purpose of a pipeline tool, since it 'dumbly' forces a rebuild.
Check your progress
Run `scmake()` to build the plots again. What happens? Do you know why?

Nothing gets built! This is because the outer call to `scmake()` (from remake.yml) doesn't know that changes to `plot_site_data()` should trigger a rebuild of `state_tasks`.

Run `scmake('state_tasks', force=TRUE)` to force the issue. What happens now? Why should you be uncomfortable with this solution?

This approach does build the plots again. Once we persuade scipiper to rebuild `state_tasks`, the inner call to `scmake()` (within `do_state_tasks()`) is able to detect the change in `plot_site_data()` and recognizes the need to rebuild those plots. So we have a temporary solution... the problem is that this solution is not baked into the pipeline code, and you need to remember to run `scmake(..., force=TRUE)` just to achieve the primary assurance that a pipeline is supposed to provide - namely, that the outputs reflect the current code and data.
When is it OK to force a build?
There's an option in `scmake()` to force the rebuild of one or more targets... but we're also discouraging its use. You saw a similar recommendation in the "Pipelines tips and tricks" course regarding depending on a directory: it's possible to use `force=TRUE`, but better to use a `dummy` argument in that case. So when is it OK to use `force=TRUE`? Well, think of `force` as a first-aid kit - when you discover that something is not rebuilding when it should, you may want to try `force=TRUE` to make sure you understand the situation, and maybe even to produce the downstream targets quickly in a pinch. But the bandage is not lasting, and you really ought to get that pipeline to an operating room as soon as you can to see if there's a more long-term solution, such as declaring another dependency or adding a `dummy` argument.
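For reference, the `dummy`-argument pattern mentioned above usually looks something like the following remake recipe. The target and function names here are made up for illustration; the point is only that the dummy value is part of the command string, so editing it changes the command and triggers a rebuild:

```yaml
targets:
  1_fetch/out/file_inventory.csv:
    # bump the dummy date whenever the directory contents change out-of-band
    command: inventory_files(dir = I('1_fetch/in'), dummy = I('2021-05-14'))
```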
⌨️ Activity: Avoid force=TRUE
You can avoid the need for `scmake(..., force=TRUE)` here by declaring the dependency on plot_site_data.R at the top level.

- Add `'3_visualize/src/plot_site_data.R'` as another unnamed argument in the recipe for `state_tasks` in remake.yml.
- While you're in there, better do the same for `'2_process/src/tally_site_obs.R'`, for the same reason!
- Does this adding of source files feel familiar? You did it for 1_fetch/src/get_site_data.R a few issues ago, but it's an easy step to forget. You can actually save yourself the headache altogether if you set up your code a bit differently. First, see how the contents of `...` in `do_state_tasks()` exactly match the contents of the `sources` argument to `create_task_makefile()`? This will be a consistent pattern, so rewrite that `create_task_makefile()` argument to use the `...`, like so: `sources = c(...)`. Now you only need to remember to add new source files for 123_state_tasks.yml in one place - remake.yml - so it should be easier to get right.
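Putting those pieces together, the arrangement might look roughly like this. The file paths follow the course, but the surrounding function body and the `create_task_makefile()` arguments are a sketch, not the course's exact code:

```r
# In remake.yml, the recipe for state_tasks passes the source files as
# extra unnamed arguments, which arrive in do_state_tasks() as `...`:
#
#   state_tasks:
#     command: do_state_tasks(oldest_active_sites,
#       '1_fetch/src/get_site_data.R',
#       '3_visualize/src/plot_site_data.R',
#       '2_process/src/tally_site_obs.R')

do_state_tasks <- function(oldest_active_sites, ...) {
  # ... build the task plan with download_step, plot_step, tally_step ...

  # Forward the source files declared in remake.yml, so new files only
  # ever need to be added in one place:
  create_task_makefile(
    task_plan, makefile = '123_state_tasks.yml',
    packages = c('dplyr', 'readr', 'lubridate'),
    sources = c(...))

  # ... then scmake() the task makefile ...
}
```

With this pattern, remake.yml is the single source of truth for which script files the task table depends on.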
Now that we've fixed that last issue, your code is ready for a pull request. Go for it!
I'll respond when I see your PR.