⌨️ Activity: Switch to a new branch
Before you edit any code, create a local branch called "splitter" and push that branch up to the remote location "origin" (which is the GitHub host of your repository).

```shell
git checkout master
git pull origin master
git checkout -b splitter
git push -u origin splitter
```

The first two lines aren't strictly necessary when you don't have any new branches, but it's a good habit to head back to master and sync with "origin" whenever you're transitioning between branches and/or PRs.
Comment on this issue once you've created and pushed the "splitter" branch.
from ds-pipelines-3.
⌨️ Activity: Create a separate inventory for each state
- Scroll to the bottom of `123_state_tasks.R` and define a new function with this boilerplate:

  ```r
  split_inventory <- function(
    summary_file='1_fetch/tmp/state_splits.yml',
    sites_info=oldest_active_sites) {
  }
  ```
- Add a line at the start of `split_inventory` to create a `1_fetch/tmp` directory where we'll store the files created by this function. We often create `tmp` directories within phases to house data that's only important within that step. Since we already have `oldest_active_sites` as a top-level target, and the files we're about to create are redundant with that information, there's no need to store it twice in a higher-profile location like `out`. The choice between `tmp` and `out` is generally a judgment call; try `tmp` here and see how you like it. The line to add is:

  ```r
  if(!dir.exists('1_fetch/tmp')) dir.create('1_fetch/tmp')
  ```
- Add code to `split_inventory` to loop over the rows in `oldest_active_sites` and save each row to a file in `1_fetch/tmp`. Each filename should have the form `inventory_[state].tsv`, where `[state]` is the state abbreviation, e.g., `inventory_WI.tsv`. Based on the file suffix, you've probably already guessed that I'm suggesting you create files in tab-separated format. You can use the `readr::write_tsv()` function for this purpose.

  (Hey, we had a perfectly fine R object target with `oldest_active_sites`, and now we have to create a multitude of pesky little files just to support this whole splitting thing? Couldn't we just stick with R objects? Well...there probably is a way to split to objects, but we don't yet have a simple pattern established for it. If you develop an approach for this someday, your teammates will thank you for sharing it!)
- Collect the filenames of the site-specific inventories in a vector within your function. Sort them alphabetically in preparation for writing the summary file. Sorting is a good habit, especially in projects where the list of split-out files changes over time, because it makes it easier to visually scan the summary file in git/GitHub to see what has changed.
- Write a summary file to the path given by the `summary_file` argument. The `scipiper::sc_indicate()` function will do this for you - just pass in the desired summary filename as the `ind_file` argument and your vector of filenames as the `data_file` argument.
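Putting the steps above together, here's one possible sketch of `split_inventory()` - not the official solution, and it assumes the inventory's state-abbreviation column is named `state_cd` (as in the inventory file shown later in this thread):

```r
split_inventory <- function(
  summary_file='1_fetch/tmp/state_splits.yml',
  sites_info=oldest_active_sites) {

  # create the tmp directory if it doesn't already exist
  if(!dir.exists('1_fetch/tmp')) dir.create('1_fetch/tmp')

  # write one single-row inventory file per state
  inventory_files <- c()
  for(i in seq_len(nrow(sites_info))) {
    state_info <- sites_info[i, ]
    state_file <- sprintf('1_fetch/tmp/inventory_%s.tsv', state_info$state_cd)
    readr::write_tsv(state_info, state_file)
    inventory_files <- c(inventory_files, state_file)
  }

  # sort alphabetically so the summary file is easy to scan in git diffs
  inventory_files <- sort(inventory_files)

  # record the filenames (and their hashes) in the summary file
  scipiper::sc_indicate(ind_file=summary_file, data_file=inventory_files)
}
```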
Test
When you think you've got it right, run your new function in isolation:
```r
source('123_state_tasks.R')
split_inventory(summary_file = '1_fetch/tmp/state_splits.yml', sites_info = scmake('oldest_active_sites'))
```
You should now see five files in the 1_fetch/tmp folder:
```
> dir('1_fetch/tmp')
[1] "inventory_IL.tsv" "inventory_MI.tsv" "inventory_MN.tsv" "inventory_WI.tsv"
[5] "state_splits.yml"
```
And the state_splits.yml file should look like this:
```yaml
1_fetch/tmp/inventory_IL.tsv: 4c59d14d16b3af05dff6e6d6dfc8aac9
1_fetch/tmp/inventory_MI.tsv: fed321c051ee99e2c7b163c5c4c10320
1_fetch/tmp/inventory_MN.tsv: d2bff76a0631abf055421b86d033d80c
1_fetch/tmp/inventory_WI.tsv: b6167db818f91d792ec639b6ec4faf68
```
Your hashes probably won't match mine because the number of available site observations changes daily, but the overall YAML format should be the same.
If you're not quite there, keep editing until you have it. When you've got it, copy and paste the contents of 1_fetch/tmp/inventory_MN.tsv into a comment on this issue.
I'll respond when I see your comment.
```
state_cd site_no station_nm dec_lat_va dec_long_va dec_coord_datum_cd begin_date end_date count_nu
MN 05211000 MISSISSIPPI RIVER AT GRAND RAPIDS, MN 47.23216599 -93.5302144 NAD83 1883-09-17 2020-07-07 49964
```
⌨️ Activity: Connect your splitter to the pipeline
You have a fancy new splitter, but you still need to connect it to the pipeline.
- Insert a call to `split_inventory()` as the very first code line within your `do_state_tasks()` function. Fill in the arguments, hard-coding the `summary_file` and making use of the presence of `oldest_active_sites` within the local environment (it's already passed as an argument to `do_state_tasks()`).

  Note: In real pipelines, there might be some occasions when it makes more sense to define the splitter and its output as a separate target in the main pipeline (remake.yml in our case). This can be useful if the splitter takes a long time to run and you don't want to rerun it every time you need to build or rebuild any of the task-steps within your task table. An extra target means a little more complexity to the pipeline, which is why we're not taking this path in this course example...but in some future pipeline you may well find it worth the complexity.
- Edit the `get_site_data()` function to expect a file rather than the all-states inventory and a state name. This will involve changing the argument list to `function(state_info_file, parameter)` and changing the first line of `get_site_data()` from a `filter()` call to a `readr::read_tsv()` call. To avoid a bunch of unnecessary messages gumming up your console output, include `col_types='cccddcDDi'` as the second argument in your call to `read_tsv()`.
- Back in 123_state_tasks.R, edit the `command` for `download_step` so that it calls `get_site_data()` with the new arguments. Use `sprintf()` or another string-manipulation function to build the `state_info_file` argument.
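As a rough sketch of those last two edits (your existing function bodies will differ, and the `task_name` variable is an assumption about how your task table supplies the state abbreviation):

```r
get_site_data <- function(state_info_file, parameter) {
  # read the single-state inventory written by split_inventory();
  # col_types suppresses readr's column-specification messages
  state_info <- readr::read_tsv(state_info_file, col_types='cccddcDDi')
  # ...the rest of the download logic stays as it was...
}

# within download_step in 123_state_tasks.R, build the command string, e.g.:
command <- sprintf(
  "get_site_data(state_info_file='1_fetch/tmp/inventory_%s.tsv', parameter=parameter)",
  task_name)
```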
Test
How did you do?
- Call `scmake()` and see what happens. Did you see the rebuilds and non-rebuilds that you expected?
- Add Indiana (`IN`) and Iowa (`IA`) to the vector of `states` in remake.yml. Rebuild. Did you see the rebuilds and non-rebuilds that you expected?
(If you're not sure what you should have expected, check with Alison, Jordan, or another teammate.)
Commit and PR
Comfortable with your pipeline's behavior? Time for a PR!
- Add `1_fetch/tmp/*` to your .gitignore file - no need to commit all those teeny state inventory files.
- Add `!1_fetch/tmp/state_splits.yml` to your .gitignore file to tell git that it should commit this one file even though it's in 1_fetch/tmp. (You could have actually dealt with all the files in 1_fetch/tmp with just one .gitignore line, `1_fetch/tmp/*.tsv`...but I wanted you to know about `!` in .gitignore in case that's new to you. Neat, right?)
- Commit 1_fetch/tmp/state_splits.yml and your changes to 123_state_tasks.R, 1_fetch/src/get_site_data.R, 123_state_tasks.yml, remake.yml, and .gitignore. Use `git push` to push your changes up to the "splitter" branch on GitHub.
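Taken together, the two .gitignore additions described above would look like this:

```
# ignore the per-state inventory files in 1_fetch/tmp...
1_fetch/tmp/*
# ...but do track the summary file
!1_fetch/tmp/state_splits.yml
```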
When everything is committed and pushed, create a pull request on GitHub. In your PR description, note which files got built when you added `IN` and `IA` to `states`.
I'll respond on your new PR once I spot it.