⌨️ Activity: Switch to a new branch
Before you edit any code, create a local branch called "splitter" and push that branch up to the remote location "origin" (which is the GitHub host of your repository).

```shell
git checkout master
git pull origin master
git checkout -b splitter
git push -u origin splitter
```

The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to master and sync with "origin" whenever you're transitioning between branches and/or PRs.
Comment on this issue once you've created and pushed the "splitter" branch.
from ds-pipelines-3.
⌨️ Activity: Create a separate inventory for each state
- Scroll to the bottom of `123_state_tasks.R` and define a new function with this boilerplate:

  ```r
  split_inventory <- function(
    summary_file='1_fetch/tmp/state_splits.yml',
    sites_info=oldest_active_sites) {
  }
  ```
- Add a line at the start of `split_inventory` to create a `1_fetch/tmp` directory where we'll store the files created by this function. We often create `tmp` directories within phases to house data that's only important within that step. Since we already have `oldest_active_sites` as a top-level target, and the files we're about to create are redundant with that information, there's no need to store it twice in a higher-profile location like `out`. The choice between `tmp` and `out` is generally a judgment call; try `tmp` here and see how you like it. The line to add is:

  ```r
  if(!dir.exists('1_fetch/tmp')) dir.create('1_fetch/tmp')
  ```
- Add code to `split_inventory` to loop over the rows in `oldest_active_sites` and save each row to a file in `1_fetch/tmp`. Each filename should have the form `inventory_[state].tsv`, where `[state]` is the state abbreviation, e.g., `inventory_WI.tsv`. Based on the file suffix, you've probably already guessed that I'm suggesting you create files in tab-separated format. You can use the `readr::write_tsv()` function for this purpose.

  (Hey, we had a perfectly fine R object target with `oldest_active_sites`, and now we have to create a multitude of pesky little files just to support this whole splitting thing? Couldn't we just stick with R objects? Well...there probably is a way to split to objects, but we don't yet have a simple pattern established for it. If you develop an approach for this someday, your teammates will thank you for sharing it!)
- Collect the filenames of the site-specific inventories in a vector within your function. Sort them alphabetically in preparation for writing the summary file. Sorting is a good habit, especially in projects where the list of split-out files changes over time, because it makes it easier to visually scan the summary file in git/GitHub to see what has changed.
- Write a summary file to the path given by the `summary_file` argument. The `scipiper::sc_indicate()` function will do this for you - just pass in the desired summary filename as the `ind_file` argument and your vector of filenames as the `data_file` argument.
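Putting the steps above together, here's one possible sketch of `split_inventory()` - not the official solution, and it assumes the inventory's state-abbreviation column is named `state_cd` (as in the inventory file shown later in this thread):

```r
split_inventory <- function(
  summary_file='1_fetch/tmp/state_splits.yml',
  sites_info=oldest_active_sites) {

  # create the tmp directory if it doesn't already exist
  if(!dir.exists('1_fetch/tmp')) dir.create('1_fetch/tmp')

  # write one single-row inventory file per state
  inventory_files <- c()
  for(i in seq_len(nrow(sites_info))) {
    state_info <- sites_info[i, ]
    state_file <- sprintf('1_fetch/tmp/inventory_%s.tsv', state_info$state_cd)
    readr::write_tsv(state_info, state_file)
    inventory_files <- c(inventory_files, state_file)
  }

  # sort alphabetically so the summary file is easy to scan in git diffs
  inventory_files <- sort(inventory_files)

  # record the filenames (and their hashes) in the summary file
  scipiper::sc_indicate(ind_file=summary_file, data_file=inventory_files)
}
```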
Test
When you think you've got it right, run your new function in isolation:
```r
source('123_state_tasks.R')
split_inventory(summary_file = '1_fetch/tmp/state_splits.yml', sites_info = scmake('oldest_active_sites'))
```
You should now see five files in the 1_fetch/tmp folder:
```
> dir('1_fetch/tmp')
[1] "inventory_IL.tsv" "inventory_MI.tsv" "inventory_MN.tsv" "inventory_WI.tsv"
[5] "state_splits.yml"
```
And the state_splits.yml file should look like this:
```yaml
1_fetch/tmp/inventory_IL.tsv: 4c59d14d16b3af05dff6e6d6dfc8aac9
1_fetch/tmp/inventory_MI.tsv: fed321c051ee99e2c7b163c5c4c10320
1_fetch/tmp/inventory_MN.tsv: d2bff76a0631abf055421b86d033d80c
1_fetch/tmp/inventory_WI.tsv: b6167db818f91d792ec639b6ec4faf68
```
Your hashes probably won't match mine because the number of available site observations changes daily, but the overall YAML format should be the same.
If you're not quite there, keep editing until you have it. When you've got it, copy and paste the contents of 1_fetch/tmp/inventory_MN.tsv into a comment on this issue.
I'll respond when I see your comment.
```
state_cd site_no station_nm dec_lat_va dec_long_va dec_coord_datum_cd begin_date end_date count_nu
MN 05211000 MISSISSIPPI RIVER AT GRAND RAPIDS, MN 47.23216599 -93.5302144 NAD83 1883-09-17 2020-07-07 49964
```
⌨️ Activity: Connect your splitter to the pipeline
You have a fancy new splitter, but you still need to connect it to the pipeline.
- Insert a call to `split_inventory()` as the very first code line within your `do_state_tasks()` function. Fill in the arguments, hard-coding the `summary_file` and making use of the presence of `oldest_active_sites` within the local environment (it's already passed as an argument to `do_state_tasks()`).

  Note: In real pipelines, there might be some occasions when it makes more sense to define the splitter and its output as a separate target in the main pipeline (remake.yml in our case). This can be useful if the splitter takes a long time to run and you don't want to rerun it every time you need to build or rebuild any of the task-steps within your task table. An extra target means a little more complexity to the pipeline, which is why we're not taking this path in this course example...but in some future pipeline you may well find it worth the complexity.
- Edit the `get_site_data()` function to expect a file rather than the all-states inventory and a state name. This will involve changing the argument list to `function(state_info_file, parameter)` and changing the first line of `get_site_data()` from a `filter()` call to a `readr::read_tsv()` call. To avoid a bunch of unnecessary messages gumming up your console output, include `col_types='cccddcDDi'` as the second argument in your call to `read_tsv()`.
- Back in 123_state_tasks.R, edit the `command` for `download_step` so that it calls `get_site_data()` with the new arguments. Use `sprintf()` or another string-manipulation function to build the `state_info_file` argument.
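As a rough sketch of those last two edits (your existing function bodies will differ, and the `task_name` variable is an assumption about how your task table supplies the state abbreviation):

```r
get_site_data <- function(state_info_file, parameter) {
  # read the single-state inventory written by split_inventory();
  # col_types suppresses readr's column-specification messages
  state_info <- readr::read_tsv(state_info_file, col_types='cccddcDDi')
  # ...the rest of the download logic stays as it was...
}

# within download_step in 123_state_tasks.R, build the command string, e.g.:
command <- sprintf(
  "get_site_data(state_info_file='1_fetch/tmp/inventory_%s.tsv', parameter=parameter)",
  task_name)
```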
Test
How did you do?
- Call `scmake()` and see what happens. Did you see the rebuilds and non-rebuilds that you expected?
- Add Indiana (`IN`) and Iowa (`IA`) to the vector of `states` in remake.yml. Rebuild. Did you see the rebuilds and non-rebuilds that you expected?
(If you're not sure what you should have expected, check with Alison, Jordan, or another teammate.)
Commit and PR
Comfortable with your pipeline's behavior? Time for a PR!
- Add `1_fetch/tmp/*` to your .gitignore file - no need to commit all those teeny state inventory files.
- Add `!1_fetch/tmp/state_splits.yml` to your .gitignore file to tell git that it should commit this one file even though it's in 1_fetch/tmp. (You could have actually dealt with all the files in 1_fetch/tmp with just one .gitignore line, `1_fetch/tmp/*.tsv`...but I wanted you to know about `!` in .gitignore in case that's new to you. Neat, right?)
- Commit 1_fetch/tmp/state_splits.yml and your changes to 123_state_tasks.R, 1_fetch/src/get_site_data.R, 123_state_tasks.yml, remake.yml, and .gitignore. Use `git push` to push your changes up to the "splitter" branch on GitHub.
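Taken together, the two .gitignore additions described above would look like this:

```
# ignore the per-state inventory files in 1_fetch/tmp...
1_fetch/tmp/*
# ...but do track the summary file
!1_fetch/tmp/state_splits.yml
```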
When everything is committed and pushed, create a pull request on GitHub. In your PR description, note which files got built when you added `IN` and `IA` to `states`.
I'll respond on your new PR once I spot it.