ratshan / ds-pipelines-2
Home Page: https://lab.github.com/USGS-R/scipiper-tips-and-tricks
remake is the R package that underlies many of scipiper's functions. Here we've borrowed some text from the remake GitHub repo (credit to richfitz, although we've lightly edited the original text) to explain differences between targets.
"Targets" are the main things that remake interacts with. They represent things that are made (they're also the vertices of the dependency graph). If you want to make a plot called plot.pdf, then that's a target. If you depend on a dataset called data.csv, that's a target (even if it already exists).
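As a rough sketch, the two files mentioned above could appear in a remakefile along the following lines. The source file R/functions.R, the functions read_data() and make_plot(), and the intermediate target dat are hypothetical names used only for illustration; they do not come from the remake documentation or this course.

```yaml
# Hypothetical remakefile sketch; R/functions.R, read_data(), make_plot(),
# and the target `dat` are illustrative assumptions.
sources:
  - R/functions.R

targets:
  # A target with no command: it only points at the end point of the analysis
  all:
    depends:
      - plot.pdf

  # data.csv has no entry of its own. It already exists and is only depended on,
  # which is why it appears as a quoted file path inside the command.
  dat:
    command: read_data("data.csv")

  # plot.pdf is built by a command; target_name stands in for "plot.pdf"
  plot.pdf:
    command: make_plot(target_name, dat)
```

With a file like this, building plot.pdf would first build dat, and a change to data.csv would mark both as out of date.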
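There are several types of targets:
- files: something is stored in a file, and the file contents could be modified outside of remake (files are the main types of targets that make deals with, since it is language agnostic). Within files, there are two sub-types, including:
  - implicit: file targets that are depended on somewhere in the process, but for which no rule exists (i.e., there is no command in a remakefile). You can't build these of course. However, remake will build an implicit file target for them so it can internally monitor changes to that file.
- objects: R objects that represent intermediate results in an analysis.
- fake: targets that simply point to other targets (in make these are "phoney" targets). The all target, which depends on all the "end points" of your analysis, is a "fake" target. Running scmake("all") will build all of your targets, or verify that they are up to date.
⌨️ Activity: Assign yourself to this issue to get started.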
Isn't it satisfying to work through a fairly lengthy data workflow and then return to the project and it just works? For the past few years, we have been capturing the steps that go into creating results, figures, or tables appearing in data visualizations or research papers. There are recipes for reproducibility used to create complex, interactive data visualizations, such as
Here is a much simpler example that was used to generate Figure 1 from Water quality data for national‐scale aquatic research: The Water Quality Portal (published in 2017):
packages:
  - rgeos
  - dplyr
  - rgdal
  - httr
  - yaml
  - RColorBrewer
  - dataRetrieval
  - lubridate
  - maptools
  - rgeos
  - maps
  - sp
## All R files that are used must be listed here:
sources:
  - R/wqp_mapping_functions.R
  - R/readWQPdataPaged.R
targets:
  all:
    depends:
      - figures/multi_panel_constituents.png
  map.config:
    command: yaml.load_file("configs/mapping.yml")
  wqp.config:
    command: yaml.load_file("configs/wqp_params.yml")
  huc.map:
    command: get_mutate_HUC8s(map.config)
  phosphorus_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)
  phosphorus_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
  nitrogen_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)
  nitrogen_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
  arsenic_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)
  arsenic_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
  chlorophyll_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)
  chlorophyll_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
  temperature_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)
  temperature_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
  doc_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)
  doc_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
  secchi_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
  secchi_lakes:
    command: get_wqp_data(target_name, wqp.config, map.config)
  glyphosate_all:
    command: get_wqp_data(target_name, wqp.config, map.config)
  figures/multi_panel_constituents.png:
    command: plot_huc_panel(huc.map, map.config, target_name, arsenic_lakes,
      arsenic_all, nitrogen_lakes, nitrogen_all, phosphorus_lakes, phosphorus_all,
      secchi_lakes, secchi_all, temperature_lakes, temperature_all)
    plot: true
This remakefile recipe generates a multipanel map, which colors HUC8 watersheds according to how many sites within the watershed have data for various water quality constituents:
The "figures/multi_panel_constituents.png" figure takes a while to plot, so it is a somewhat "expensive" target to iterate on when it comes to style, size, colors, and layout (it takes 3 minutes to plot for me). But the plotting expense is dwarfed by the amount of time it takes to build each water quality data "object target", since get_wqp_data uses a web service that queries a large database and returns a result; the process of fetching the data can sometimes take over thirty minutes (nitrogen_all is a target that contains the locations of all of the sites that have nitrogen water quality data samples).
In contrast, the map.config* object above builds in a fraction of a second. It contains some simple information that is used to fetch and process the proper boundaries with the get_mutate_HUC8s function, and it includes some plotting details for the final map (such as plotting color divisions as specified by countBins):
This example, although dated, represents a real project that caused us to think carefully about how many targets we use in a recipe and how complex their underlying functions are. Decisions related to targets are often motivated by the intent of the pipeline. In the case above, our intent at the time was to capture the data and processing behind the plot in the paper in order to satisfy our desire for reproducibility.
*disclaimer, the code above was written at a time before we'd completely transitioned away from naming variables like.this
⌨️ Activity: Assign yourself to this issue to get started.
You should now have a working pipeline that can run with scmake(). Your current pipeline likely only has one file target, which is the final plot.
We want you to get used to exchanging objects for files and vice versa, in order to expose some of the important differences that show up in the remakefile and also in the way the functions are put together.
⌨️ Activity: Open a PR where you swap two object targets to be file targets, and change one file target to be an object target. Run scmake and open a pull request. Paste your build status as a comment to the PR and assign Jordan or Alison as a reviewer.
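To illustrate the kind of difference this swap exposes in a remakefile, here is a minimal sketch of the same step written first as an object target and then as a file target. Every name in it (raw_data, out/raw_data.csv, fetch_raw_data(), fetch_raw_data_to_file(), summarize_raw_data(), summarize_raw_data_file()) is hypothetical and is not taken from the course pipeline.

```yaml
# Hypothetical sketch; all target and function names are illustrative assumptions.
targets:
  # Object version: the function returns an R object, and downstream
  # commands refer to the target by its bare, unquoted name.
  raw_data:
    command: fetch_raw_data()
  raw_summary:
    command: summarize_raw_data(raw_data)

  # File version of the same step: the target name is a file path, the function
  # is assumed to write a file at that path, and downstream commands refer to
  # the target as a quoted path.
  out/raw_data.csv:
    command: fetch_raw_data_to_file(target_name)
  raw_summary_from_file:
    command: summarize_raw_data_file("out/raw_data.csv")
```

The functions usually have to change along with the remakefile: a function behind an object target returns its result, while a function behind a file target takes the output path as an argument and writes to it.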
You are awesome, @RAtshan! 🌟 💥 🐠
We hope you've learned a lot in intro to pipelines II. We don't have additional exercises in this module, but we'd love to have a discussion if you have questions.
As a resource for later, here are links to the content you just completed:
- which_dirty() and why_dirty() to further interrogate the status of pipeline targets
- the I() helper
- the target_name special variable
⌨️ Activity: If you have comments or questions, add them below and then assign a course lead to this issue to engage in dialogue. When you are satisfied with the conversation, close this issue.
⌨️ Activity: Make modifications to the working, but less than ideal, pipeline that exists within your course repository
Within the course repo you should see only a remake.yml and directories with code or placeholder files for each phase. You should be able to run scmake() and build the pipeline, although it may take numerous tries, since some parts of this new workflow are brittle. Some hints to get you started: the site_data target is too big, and you should consider splitting it into a target for each site, perhaps using the download_nwis_site_data() function directly to write a file. Several of the site_data_ targets are too small and it might make sense to combine them. Lastly, if it makes sense to use target_name, try using that in the "remake.yml" file too to simplify the formatting.
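One possible shape for the first hint is sketched below: one file target per site, each built with the same command. This assumes that download_nwis_site_data() accepts the output file path as its first argument and can work out which site to download from that path; check the function in your repository before relying on this. The directory 1_fetch/out and the placeholders SITE_A and SITE_B are hypothetical.

```yaml
# Hypothetical sketch; the paths and site placeholders are illustrative assumptions,
# as is the idea that the function takes the output path as its first argument.
targets:
  1_fetch/out/nwis_SITE_A_data.csv:
    command: download_nwis_site_data(target_name)
  1_fetch/out/nwis_SITE_B_data.csv:
    command: download_nwis_site_data(target_name)
```

Because target_name stands in for each target's own path, the command line is identical for every site, in the same way that get_wqp_data(target_name, ...) is repeated in the water quality example. With one target per site, a failed download only requires rebuilding that one file.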
When you are happy with your newer, better workflow, create a pull request with your changes and assign Jordan or Alison as reviewers. Add a comment to your own PR with thoughts on how you approached the task, as well as key decisions you made. See details below for some reminders of how to get started working with code and files that exist within the course repository:
Open a git bash shell (Windows💠) or a terminal window (Mac🍏) and change (cd) into the directory you work in for projects in R (for me, this is ~/Documents/R). There, clone the repository and set your working directory to the new project folder that was created:
git clone git@github.com:RAtshan/ds-pipelines-2.git
cd ds-pipelines-2
Now you should create a local branch called "targets" and push that branch up to the "remote" location (which is the github host of your repository). We're naming this branch "targets" to represent concepts in this section of the lab. In the future you'll probably choose branch names according to the type of work they contain - for example, "pull-oxygen-data" or "fix-issue-17".
git checkout -b targets
git push -u origin targets
Welcome to the second installment of "introduction to data pipelines" at USGS, @RAtshan!! ✨
We're assuming you were able to navigate through the intro-to-pipelines course and that you learned a few things about organizing your code for readability, re-use, and collaboration. You were also introduced to two key things through the remake.yml: a way to program connections between functions and files, and the concept of a dependency manager that skips parts of the workflow that don't need to be re-run.
First, a recap of key concepts that came from intro-to-pipelines 👇
- ... fetch for this phase).
- ... make and drake, and you were asked to experiment with scipiper.
This last concept was not addressed directly but we hope that the small exercise of seeing rebuilds in action got you thinking about projects that might have much more lengthy steps (e.g., several downloads or geo-processing tasks that take hours instead of seconds).
In this training, the focus will be on tricks and tips for making better, smarter pipelines. You'll learn new things here that will help you refine your knowledge from the first class and put it into practice. Let's get started!
⌨️ Activity: Add collaborators and close this issue to get started.
As with pipelines I, please invite a few collaborators to your repository so they can easily comment and review in the future. In the ⚙️ Settings widget at the top of your repo, select "Manage access" (or use this shortcut link). Go ahead and invite aappling-usgs and jread-usgs. It should look something like this:
💡 Tip: Throughout this course, I, the Learning Lab Bot, will reply and direct you to the next step each time you complete an activity. But sometimes I'm too fast when I ⏳ give you a reply, and occasionally you'll need to refresh the current GitHub page to see it. Please be patient, and let my humans know (jread-usgs or aappling-usgs) if I seem to have become completely stuck.
In this section, we're going to go one by one through a series of tips that will help you avoid common pitfalls (or gotchas!) in pipelines. These tips will help you in the next sections and in future work. A quick list of what's to come:
- which_dirty() and why_dirty() to further interrogate the status of pipeline targets
- the I() helper
- the target_name special variable. Simplifying target command relationships and reducing duplication
⌨️ Add a comment to this issue and the bot will respond with the next topic.