
edah's Introduction

Environmental Data Harmonization Framework

Code forked from the soilHarmonization package.

Extends work completed by the SOils DAta Harmonization (SoDaH) & Synthesis project.

VERSION: alpha v1.0, pre-release.

Core functionality complete. Further development and testing required.

Development Jamboard


Description

Paraphrased from soilHarmonization 0.1.0

Environmental data vary vastly in their structure, units of measure, granularity and other details. To facilitate their use in models and synthesis research, the data must be homogenized to a sufficient degree such that cross-site, -project, -time comparisons are feasible.

To facilitate data harmonization, a key file must be generated for each raw data file. The key file serves as a guide to translate the user-provided data into a common, database-wide structure and format. For each data set provided, the key file should contain general details about the data provider, the project from which the data were generated, and generalized details that apply to the data broadly (e.g., mean annual precipitation at the study site). Such generalized information is referred to as location or locational data in this project. At a finer resolution, the key file should contain mappings between the provided data and the common terminology and units employed by the project for that data type. For example, the project-designated term for the standing stock of soil organic matter is soc_stock in units of g/m2. If the provided data included information about the standing stock of soil organic matter in a column titled soil C with units of %, that translation will be noted by the data provider on the Profile_data tab of the key file. When run, the script will rename the column titled soil C to soc_stock and apply the appropriate units conversion.
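
As a rough illustration of the mapping described above, the sketch below renames a provided column and applies a multiplicative conversion from a key-like lookup table. The key column names (`header_name`, `var`, `conversion_factor`) and the factor value are hypothetical placeholders and are not taken from the edah key file.

```r
# Hypothetical provided data with a column titled "soil C"
raw <- data.frame(site = c("A", "B"), `soil C` = c(1.2, 3.4),
                  check.names = FALSE, stringsAsFactors = FALSE)

# Hypothetical key rows: provided header, project term, placeholder factor
key <- data.frame(
  header_name = "soil C",
  var = "soc_stock",
  conversion_factor = 10,  # placeholder value for illustration only
  stringsAsFactors = FALSE
)

# Rename each mapped column and scale its values by the key's factor
apply_key <- function(data, key) {
  for (i in seq_len(nrow(key))) {
    col <- match(key$header_name[i], names(data))
    if (is.na(col)) next  # header not present in this data set
    data[[col]] <- data[[col]] * key$conversion_factor[i]
    names(data)[col] <- key$var[i]
  }
  data
}

harmonized <- apply_key(raw, key)
```

Columns without a matching key row are left untouched, which keeps the sketch safe to run on data sets that only partly overlap the key.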

Example code below pulls from 'Example' directory with data and working key file.

Note: The terms harmonization and homogenization are often used synonymously across text and code in the repository.

Major changes from soilHarmonization v0.1.0:

  • Simplified adding new variables to the key file.

  • Homogenization code is now solely dependent on information provided in the key, including variable properties and constraints, unit conversions, and homogenization settings.

  • Removed dependency on Google Drive.

  • Split code up into discrete functions to aid debugging and improvements.

Current state of code is pre-alpha (as of 12/22/2022). Please contact Derek Pierson if interested in using this code prior to the initial release. While functional, the code requires more robust testing and additional core functionality. Official alpha 1.0 release expected Spring 2023.

Further notes:

  • Adding a variable to the key is as simple as adding a new row and filling in the required information. A key_check and update feature will be added in the future to allow for QC of such key changes, updating of the master key, and key field requirements/locks.

  • Key file has been updated ('V3'). Future update will likely remove many of the less common var fields. Note: Key file 'V2' no longer works. Key file versioning will now be complicated by data dependent var additions. Additional updates will be needed to handle this. See comment above.

  • Additional unit conversions may be added via a sheet in the key. Unit conversions are now only variable specific if specified.

  • No longer dependent on Google Drive. Current version requires local directory input. Source/sync from Google Drive directory (or Box, etc.) may be added in the future.

  • By compartmentalizing the code into discrete functions, it's now easier to debug errors and update code. We also hope that this will ease the process of adding further functionality.

  • Quality control checks are performed and notes are created in the code, but they are not yet exported to a 'notes.PDF.' Coming soon! (...also hoping to add variable specific QC plots)

  • With further functionalization and removal of dependencies on Google Drive, the homogenization code runs much faster. Thus, re-homogenizing a large set of files is a breeze.

  • Key file now contains three alignment variables. A future code update is planned to test data alignment across 'HMGZD' files. Such functionality is expected to work well for data from a single project (e.g. align by IGSN and sample date for the CZN GeoMicrobiology project), but will require more thought, code functionality and scrutiny to be implemented across separate projects (e.g. separate datasets from an LTER site, ...original SoDaH database).
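
The planned alignment across harmonized files might look something like the sketch below, which joins two tables on a shared sample identifier and date. Both data frames and the column names (`IGSN`, `sample_date`, `enzyme_activity`, `light_fraction`) are assumptions for illustration; the actual alignment variables are set in the key file.

```r
# Two hypothetical harmonized tables from the same project
enzymes <- data.frame(
  IGSN = c("S1", "S2", "S3"),
  sample_date = as.Date(c("2022-06-01", "2022-06-01", "2022-07-15")),
  enzyme_activity = c(0.8, 1.1, 0.5),
  stringsAsFactors = FALSE
)
fractions <- data.frame(
  IGSN = c("S1", "S3", "S4"),
  sample_date = as.Date(c("2022-06-01", "2022-07-15", "2022-07-15")),
  light_fraction = c(12, 9, 7),
  stringsAsFactors = FALSE
)

# Keep every sample from either table; unmatched measurements become NA
aligned <- merge(enzymes, fractions, by = c("IGSN", "sample_date"), all = TRUE)
```

Using `all = TRUE` keeps unmatched samples visible, which makes gaps in alignment easy to inspect before deciding how strict the join should be.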

Acknowledgements

Work made possible by contributions and support from Steven Earl and Will Wieder (co-authors of the soilHarmonization package and the SoDaH database), the National Center for Ecological Analysis and Synthesis (NCEAS grant supported initial soilHarmonization code development), Kate Lajtha and Kathleen Lohse (project and funding support), and the Critical Zone Network Geomicrobiology project and the USDA Forest Service Rocky Mountain Research Station (project support).


Code Overview

Data harmonization is performed through a series of functions, designed to be performed in sequence (i.e., dependent on output from previous function calls). The use of stepwise discrete functions facilitates adjusting or adding to the framework functionality, while also improving debugging and code comprehension.

Key file

[ADD INSTRUCTIONS FOR FILLING OUT A KEY FILE]

Example of data harmonization with the compiled homog() function

source("Homog_ftns.R")
data_dir <- "C:\\GitHub\\CZnetGM_SoDaH\\Example_dir"
homog_data <- homog(data_dir)

Homogenized output files will be saved to the input data directory ('data_dir' in code above).

Example usage of the sequential functions

#-----------------------------------
### STEP BY STEP CODE
#-----------------------------------

# Load sheets from key file
#-----------------------------------------------------------------------
key_path <- find_key_path(data_dir)
locationData <- read_key_location(key_path)
profileData <- read_key_profile(key_path)
notes <- build_key_notes(key_path, locationData, profileData)
unitConversions <- read_key_units(key_path)


# Location data unit conversion
#-----------------------------------------------------------------------
unitsConversions <- get_unit_conversions(key_path) 
conversionNotes <- build_unitConv_notes() 
LDU_UCL <- locationData_to_convert(locationData, unitsConversions)
unitConv_locationOutput <- apply_locData_UnitConv(locationData, LDU_UCL, conversionNotes, print_msg = FALSE)
unitConv_locationData <- as.data.frame(unitConv_locationOutput[[1]])
loc_conversion_Notes <- as.data.frame(unitConv_locationOutput[[2]]) #output is notes


# Location data QC
#-----------------------------------------------------------------------
locationDataQC_Notes <- locationData_QC(unitConv_locationData) #output is notes


# Standardize profile data
#-----------------------------------------------------------------------
data_to_homog <- collect_data_to_homog(data_dir, locationData)
data_to_homog_w_lvls <- add_exp_trt_levels(data_to_homog, profileData)
stdzd_data <- standardize_col_names(data_to_homog_w_lvls, profileData)


# Profile data unit conversion
#-----------------------------------------------------------------------
stdzd_unitConv_profileOutput <- profileUnitConversion(stdzd_data, profileData, unitConversions, print_msg = FALSE)
stdzd_unitConv_profileData <- as.data.frame(stdzd_unitConv_profileOutput[[1]])
prof_conversion_Notes <- as.data.frame(stdzd_unitConv_profileOutput[[2]]) #output is notes


# Profile data QC
#-----------------------------------------------------------------------
profileData_QC_Notes <- profileData_QC(profileData, stdzd_unitConv_profileData) #output is notes


# Combine location and profile data, export data (completes data homogenization)
#----------------------------------------------------------------------------------
output_path <- getwd()
homog_data <- hmgz(unitConv_locationData, stdzd_unitConv_profileData, output_path, out_csv = TRUE, out_rds = TRUE)


Homogenization functions


find_key_path()

Collects the file path to the key file. Searches the folder for a filename containing the word "key".

Input: Folder path

Output: Full path to key file
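
A minimal sketch of how such a search could be written in base R is shown below. The function name `find_key_path_sketch`, the case-insensitive match, and the handling of zero or multiple matches are assumptions and may differ from the actual find_key_path().

```r
# Hypothetical stand-in for find_key_path(); not the edah implementation
find_key_path_sketch <- function(data_dir) {
  # Match file names (not the directory path) that contain "key"
  hits <- list.files(data_dir, pattern = "key", ignore.case = TRUE,
                     full.names = TRUE)
  if (length(hits) == 0) stop("No key file found in: ", data_dir)
  if (length(hits) > 1) warning("Multiple key files found; using the first.")
  hits[1]
}

# Usage with a temporary directory holding one data file and one key file
demo_dir <- file.path(tempdir(), "edah_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(file.path(demo_dir, "site_data.csv")))
invisible(file.create(file.path(demo_dir, "site_KEY_V3.xlsx")))
key_path <- find_key_path_sketch(demo_dir)
```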


read_key_location(req_fields="default")

Loads the location data tab from the key file. Checks for a set of required fields and returns an error message if any are missing. The required-field check can be skipped by setting req_fields = FALSE.

Input: Key file path

Output: Dataframe with location tab data
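
The required-field check could be sketched as follows, operating on an already loaded location dataframe. The function name, the `var` and `value` columns, and the example field names are hypothetical; the real default field list lives in the package code.

```r
# Hypothetical required-field check; field names are illustrative only
check_req_fields <- function(location_df,
                             req_fields = c("site_code", "lat", "long")) {
  # Passing req_fields = FALSE disables the check
  if (identical(req_fields, FALSE)) return(invisible(TRUE))
  absent <- setdiff(req_fields, location_df$var)
  if (length(absent) > 0) {
    stop("Location tab is missing required fields: ",
         paste(absent, collapse = ", "))
  }
  invisible(TRUE)
}

location_df <- data.frame(var = c("site_code", "lat"),
                          value = c("HJA", "44.2"),
                          stringsAsFactors = FALSE)

# Capture the error message instead of halting the script
result <- tryCatch(check_req_fields(location_df),
                   error = function(e) conditionMessage(e))
```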


read_key_profile()

Loads the profile tab data.

Input: Key file path

Output: Dataframe with profile tab data


build_key_notes()

Collects all notes/comments in both the location and profile tabs of the key file.

Input: Key file path, location dataframe, profile dataframe

Output: Dataframe with combined notes
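
One way to combine notes from the two tabs is sketched below. The column names `var` and `var_notes`, the helper `collect_notes`, and the output layout are assumptions; the real key tabs may store notes differently.

```r
# Hypothetical helper: keep only rows that carry a non-empty note
collect_notes <- function(df, source_tab) {
  has_note <- !is.na(df$var_notes) & nzchar(df$var_notes)
  data.frame(
    source = rep(source_tab, sum(has_note)),
    var = df$var[has_note],
    note = df$var_notes[has_note],
    stringsAsFactors = FALSE
  )
}

location_df <- data.frame(var = c("lat", "map"),
                          var_notes = c(NA, "30-year mean"),
                          stringsAsFactors = FALSE)
profile_df <- data.frame(var = c("soc_stock", "ph"),
                         var_notes = c("0-10 cm only", ""),
                         stringsAsFactors = FALSE)

# Stack the notes from both tabs into one dataframe
notes <- rbind(collect_notes(location_df, "location"),
               collect_notes(profile_df, "profile"))
```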


get_unit_conversions()

Pulls unit conversions from tab in the key file. If required unit conversions are missing, they can be added to the key file tab.

Input: Key file path

Output: Dataframe of unit conversions by row


build_unitConv_notes()

Creates empty dataframe for logging unit conversions

Input: none

Output: Empty dataframe with specified column names
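
For illustration, an empty typed log dataframe can be built as in the sketch below. The column names are assumptions and not the package's actual log schema.

```r
# Hypothetical log schema; a zero-row dataframe with typed columns
build_notes_sketch <- function() {
  data.frame(
    source = character(),
    var = character(),
    from_unit = character(),
    factor = numeric(),
    stringsAsFactors = FALSE
  )
}

conversion_log <- build_notes_sketch()

# Later steps can append one row per conversion applied
conversion_log <- rbind(
  conversion_log,
  data.frame(source = "location", var = "map", from_unit = "cm",
             factor = 10, stringsAsFactors = FALSE)
)
```

Declaring the column types up front keeps later `rbind()` calls from silently changing a column's type.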


locationData_to_convert()

Collects all location data with specified units, then applies unit conversion if specified unit does not equal "givenUnit" specified in unitConversions tab

Input: Location dataframe, unit conversions dataframe

Output: Dataframe with only the location vars that need to be converted


apply_locData_UnitConv()

Applies conversion factors to location data vars specified by locationData_to_convert()

Input: Location dataframe, output from locationData_to_convert(), conversion notes dataframe from build_unitConv_notes()

Output: List containing two dataframes: 1) Location dataframe with converted units, 2) Updated conversionNotes dataframe.

DEBUG: Needs to send error message if unit missing or cannot be converted.
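
The sketch below shows one possible shape for this step, including the error on a missing unit that the DEBUG note asks for. All column names (`var`, `value`, `unit`, `givenUnit`, `factor`) and the function name are hypothetical; this is not the edah implementation.

```r
# Hypothetical conversion step returning list(converted data, updated notes)
apply_conv_sketch <- function(loc, conversions, notes) {
  for (i in seq_len(nrow(loc))) {
    if (is.na(loc$unit[i])) next  # variable has no unit to convert
    hit <- match(loc$unit[i], conversions$givenUnit)
    if (is.na(hit)) {
      stop("No conversion found for unit '", loc$unit[i],
           "' (var: ", loc$var[i], ")")
    }
    factor_i <- conversions$factor[hit]
    if (factor_i != 1) {
      loc$value[i] <- loc$value[i] * factor_i
      notes <- rbind(notes, data.frame(var = loc$var[i],
                                       from_unit = loc$unit[i],
                                       factor = factor_i,
                                       stringsAsFactors = FALSE))
    }
  }
  list(loc, notes)
}

loc <- data.frame(var = c("map", "elevation"), value = c(120, 1500),
                  unit = c("cm", "m"), stringsAsFactors = FALSE)
conversions <- data.frame(givenUnit = c("cm", "m"), factor = c(10, 1),
                          stringsAsFactors = FALSE)
empty_notes <- data.frame(var = character(), from_unit = character(),
                          factor = numeric(), stringsAsFactors = FALSE)

out <- apply_conv_sketch(loc, conversions, empty_notes)
converted <- out[[1]]  # location data with converted values
log_notes <- out[[2]]  # one row per conversion applied
```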


Sheet icons created by Freepik - Flaticon

edah's Issues

Logging

Add logging statement to each function.

Goal is to catch common errors.
If error, print to console and log report.
If no error, log progress/data manipulation.

Not meant to replace QC report, but perhaps logs could be imported into the QC report.
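
A possible starting point is a small wrapper that runs a step, prints any error to the console, and appends a line to a log file either way. The function name `run_logged` and the log line format are assumptions made for this sketch.

```r
# Hypothetical logging wrapper; names and log format are illustrative
run_logged <- function(step_name, step_fun, log_file) {
  stamp <- format(Sys.time(), "%Y-%m-%d %H:%M:%S")
  tryCatch({
    result <- step_fun()
    cat(stamp, "OK", step_name, "\n", file = log_file, append = TRUE)
    result
  }, error = function(e) {
    msg <- conditionMessage(e)
    message("Error in ", step_name, ": ", msg)  # print to console
    cat(stamp, "ERROR", step_name, "-", msg, "\n",
        file = log_file, append = TRUE)
    NULL  # signal failure to the caller without halting
  })
}

log_file <- tempfile(fileext = ".log")
ok <- run_logged("add numbers", function() 1 + 1, log_file)
bad <- run_logged("failing step", function() stop("unit missing"), log_file)
log_lines <- readLines(log_file)
```

Because the log is plain text with one line per step, it could later be read back in and attached to the QC report.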

Data alignment

How to align data from the same soil layer?
How to align profile level information with soil layer data?
How to generalize site information and include with profile data?

  1. Create dummy data files and build up scripted construct to make these three things work.
  2. Incorporate construct into SoDaH harmonization scripts
    • Perhaps data alignment should be optional?

Update readme

  1. Add the rest of the code functions.

  2. Explain what the code does in the intro blurb.

  3. Add infographic of the process + example input & output screen caps

  4. Perhaps add a flow diagram of the harmonization.

Handling sensor data?

A key file could import it, but it would swamp the data table. Perhaps create a separate out class? e.g. HMGZD_SENSOR, such that the non-sensor and sensor data could be compiled into separate databases.

Converting dates

Add functionality to use date unit given in key (e.g. YYYYMMDD) to convert value or column data into standard date format.
Perhaps best to implement as a standalone function...?
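
A standalone converter might look like the sketch below, which maps a key-style date unit to an R format string. The function name and the list of supported date units are assumptions; input is expected as character strings to avoid numeric formatting surprises.

```r
# Hypothetical helper: translate a key date unit into an R date format
key_date_to_Date <- function(x, date_unit) {
  formats <- c(
    "YYYYMMDD" = "%Y%m%d",
    "YYYY-MM-DD" = "%Y-%m-%d",
    "MM/DD/YYYY" = "%m/%d/%Y"
  )
  if (!date_unit %in% names(formats)) {
    stop("Unsupported date unit: ", date_unit)
  }
  as.Date(as.character(x), format = formats[[date_unit]])
}

dates <- key_date_to_Date(c("20221222", "20230301"), "YYYYMMDD")
us_dates <- key_date_to_Date("12/22/2022", "MM/DD/YYYY")
```

Values that do not match the stated format come back as NA, which a QC step could then flag.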

Add key file output naming option

To be added in place of HMGZD. Allows for using such tags to compile harmonized files into different databases.
Useful change for sensor data files.

variable specific QC plots

For example, if 'var' in HMGZD and 'var_plot.R' exists, source file and run varplot() function, include in QC output file.

Tools directory

A place to store code tools for working with the HMGZD data.

Ideas:

  • alignment/aggregation code
  • analyte average by depth
  • example plots
  • simple shiny app for data visualization

Handling key file updates

Need key file handler capable of:

  • Finding/preventing new var additions in key file
  • QC of new var additions
  • Updating master key file with new var additions. May be useful to keep a separate log file of all vars used previously.

App features

Overview

  • Improve or remove for now

Map

  • Add info to popups...Pit name, depth, analytes, sensor Y/N, etc.
  • Can we show Yang's aerial view GIFs?
  • Site icon @ high lvl zoom and pit icon when zooming in
  • Add loading animation

Plot

  • Filter by site, depth
  • Color by pit level differences, depth
  • Facet wrap option
  • Add loading animation

Datatable

  • Filters: Depth, site
  • Option to use pit level aligned or sample-date separated info
  • Option to hide metadata
  • Dataset selector: Enzymes, Fractions, Roots EOC

Sensors

  • Option to use daily avg data (hopefully increases plot speed)
  • Fix sensor selector issue
  • Add loading animation
