
harmonize-wq

Standardize, clean, and wrangle Water Quality Portal data into more analytic-ready formats

US EPA’s Water Quality Portal (WQP) aggregates water quality, biological, and physical data provided by many organizations and has become an essential resource, with tools to query and retrieve data using Python or R. Given the variety of data and of data originators, using the data in analysis often requires cleaning to ensure it meets the required quality standards and wrangling to get it into a more analysis-ready format. Recognizing that the definition of analysis-ready varies depending on the analysis, the harmonize_wq package is intended to be a flexible, water-quality-specific framework to help:

  • Identify differences in data units (including speciation and basis)
  • Identify differences in sampling or analytic methods
  • Resolve data errors using transparent assumptions
  • Transform data from long to wide format

Domain experts must decide what data meets their quality standards for data comparability and any thresholds for acceptance or rejection.

For more complete tutorial information, see the demos.

Quick Start

harmonize_wq can be installed using pip:

$ python3 -m pip install harmonize-wq

To install the latest development version of harmonize_wq using pip:

$ python3 -m pip install git+https://github.com/USEPA/harmonize-wq.git

Example Workflow

dataretrieval query for a GeoJSON

import dataretrieval.wqp as wqp
from harmonize_wq import wrangle

# File for area of interest
aoi_url = r'https://raw.githubusercontent.com/USEPA/harmonize-wq/main/harmonize_wq/tests/data/PPBays_NCCA.geojson'

# Build query
query = {'characteristicName': ['Temperature, water',
                                'Depth, Secchi disk depth',
                                ]}
query['bBox'] = wrangle.get_bounding_box(aoi_url)
query['dataProfile'] = 'narrowResult'

# Run query
res_narrow, md_narrow = wqp.get_results(**query)

# dataframe of downloaded results
res_narrow

Harmonize results

from harmonize_wq import harmonize

# Harmonize all results
df_harmonized = harmonize.harmonize_all(res_narrow, errors='raise')
df_harmonized

Clean results

from harmonize_wq import clean

# Clean up other columns of data
df_cleaned = clean.datetime(df_harmonized)  # datetime
df_cleaned = clean.harmonize_depth(df_cleaned)  # Sample depth
df_cleaned

Transform results from long to wide format

There are many columns in the dataframe that are characteristic-specific; that is, they have different values for the same sample depending on the characteristic. To ensure one row per sample after the transformation, these columns must either be split, generating a new column for each characteristic with values, or moved out of the table if not being used.

from harmonize_wq import wrangle

# Split QA column into multiple characteristic specific QA columns
df_full = wrangle.split_col(df_cleaned)

# Divide table into columns of interest (main_df) and characteristic specific metadata (chars_df)
main_df, chars_df = wrangle.split_table(df_full)

# Combine rows with the same sample organization, activity, location, and datetime
df_wide = wrangle.collapse_results(main_df)

The number of columns in the resulting table is greatly reduced:

| Output Column | Type | Source | Changes |
| --- | --- | --- | --- |
| MonitoringLocationIdentifier | Defines row | MonitoringLocationIdentifier | NA |
| Activity_datetime | Defines row | ActivityStartDate, ActivityStartTime/Time, ActivityStartTime/TimeZoneCode | Combined and converted to UTC |
| ActivityIdentifier | Defines row | ActivityIdentifier | NA |
| OrganizationIdentifier | Defines row | OrganizationIdentifier | NA |
| OrganizationFormalName | Metadata | OrganizationFormalName | NA |
| ProviderName | Metadata | ProviderName | NA |
| StartDate | Metadata | ActivityStartDate | Preserves date where time is NaT |
| Depth | Metadata | ResultDepthHeightMeasure/MeasureValue, ResultDepthHeightMeasure/MeasureUnitCode | Standardized to meters |
| Secchi | Result | ResultMeasureValue, ResultMeasure/MeasureUnitCode | Standardized to meters |
| QA_Secchi | QA | NA | Harmonization processing quality issues |
| Temperature | Result | ResultMeasureValue, ResultMeasure/MeasureUnitCode | Standardized to degrees Celsius |
| QA_Temperature | QA | NA | Harmonization processing quality issues |

Issue Tracker

harmonize_wq is under development. Please report any bugs and enhancement ideas using the issue tracker:

https://github.com/USEPA/harmonize-wq/issues

Disclaimer

The United States Environmental Protection Agency (EPA) GitHub project code is provided on an "as is" basis and the user assumes responsibility for its use. EPA has relinquished control of the information and no longer has responsibility to protect the integrity, confidentiality, or availability of the information. Any reference to specific commercial products, processes, or services by service mark, trademark, manufacturer, or otherwise, does not constitute or imply their endorsement, recommendation or favoring by EPA. The EPA seal and logo shall not be used in any manner to imply endorsement of any commercial product or activity by EPA or the United States Government.

Contributors

cristinamullin, jbousquin

harmonize-wq's Issues

Add handling of additional characteristicNames that are the same constituent

Some variables have different names but represent the same constituent (e.g., "Total Phosphorus, mixed forms" and "Phosphorus"). Add characteristicNames that can be converted in the same way as an existing one. This may require:

  • re-structuring of domains.out_col_lookup()
  • filtering of results to limit qualifiers to those that are the same
  • further development of domains.accepted_methods()

u_reg used for convert decorators

Currently one standard unit registry is used by the conversion decorators. This needs to either be passed in (potentially by another decorator?) or u_reg needs additions for all expected units. Currently the code just has 'TODO: find more elegant way to do this with all definitions'.

OffsetUnitCalculusError

Error reported by Cristina Mullins on wrangle.collapse_results(). The error from pint:

OffsetUnitCalculusError: Ambiguous operation with offset unit (degree_Celsius). See https://pint.readthedocs.io/en/latest/nonmult.html for guidance.

Suspect this occurs when multiple rows with the same index are averaged (adding temperatures can give different results depending on their units). The ambiguity is negated when it is an average, but it needs to be handled with care.
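The offset-unit issue can be sketched without pint: averaging Celsius values gives the same number as averaging in Kelvin, which is why an average negates the ambiguity, yet pint still refuses the intermediate sum of degree_Celsius quantities. A minimal workaround sketch; the `celsius_mean` helper is hypothetical, not harmonize_wq API:

```python
# Hypothetical sketch, not harmonize_wq code: average temperatures on an
# absolute scale (Kelvin) so the intermediate sum is unambiguous, then
# convert back. pint raises OffsetUnitCalculusError on a direct sum of
# degree_Celsius quantities because offsets make addition ill-defined.

def celsius_mean(temps_c):
    """Average Celsius temperatures via Kelvin to sidestep offset ambiguity."""
    temps_k = [t + 273.15 for t in temps_c]   # shift to absolute scale
    mean_k = sum(temps_k) / len(temps_k)      # summing Kelvin is well-defined
    return mean_k - 273.15                    # shift back to Celsius

print(round(celsius_mean([10.0, 20.0]), 6))  # → 15.0
```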

Make forward compatible with Shapely 2.0

Warning message regarding the array interface being deprecated and dropped in Shapely 2.0, and conversion to a numpy array, when running location.harmonize_locations().

Expand methods filtering

The goal is to accept a dict of 'accepted' methods. Currently there is a static dict of accepted methods for select characteristicNames. These can also be read from domains and/or NEMI. Some of these methods may also help define e.g., sample fraction, when unreported.

Generalize wet_dry functions in clean module

clean.wet_dry_checks() uses a combination of column values to identify questionable field values. It was constructed to capture a suspicious dry sediment measure with water as its 'ActivityMediaName'. Given a dict of columns and criteria it could be used to mask and then update to a given value.

wet_dry_drop() drops rows based on a filter. It too could be generalized, or updated to instead add a QA flag for later removal (which is what everything else has transitioned to).

Capitalization in units

e.g., mg/l vs mg/L is handled fine by pint, but any pre-processing that updates the unit string before pint works with it has to recognize/fix both. The planned solution is either (A) upper()/lower(), making sure it doesn't change the meaning to pint, OR (B) using upper/lower to build out the dict/list (not preferable, because the string may be longer than the problematic character and could require many combinations). TADA uses all caps for this and other fields.
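A minimal sketch of option (B) above, including its acknowledged drawback that every case variant needs an entry. The mapping and `normalize_unit_case` helper are illustrative, not harmonize_wq API:

```python
# Illustrative sketch (option B above): map known case variants of unit
# strings to one canonical spelling before string-based pre-processing.
# A blanket lower()/upper() (option A) is risky because case can change
# meaning to pint (e.g., 'M' molar vs 'm' meter); a lookup only touches
# known-safe variants, at the cost of listing each combination.

SAFE_CASE_FIXES = {
    'MG/L': 'mg/L',
    'mg/l': 'mg/L',
    'UG/L': 'ug/L',
    'ug/l': 'ug/L',
}

def normalize_unit_case(unit_str):
    """Return the canonical spelling for known variants; pass others through."""
    return SAFE_CASE_FIXES.get(unit_str, unit_str)

print(normalize_unit_case('MG/L'))   # → mg/L
print(normalize_unit_case('deg C'))  # unchanged → deg C
```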

doc_test stability

When testing the examples in the harmonize module there is inconsistency in the order of rows:

harmonize.convert_unit_series(quantity_series, unit_series, units = 'mg/l')

and the order of columns being added:

df_result_all

Suspect the issue may be due to a non-ordered dictionary being used somewhere it shouldn't be.
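If an unordered collection is indeed the culprit, sorting at the point where rows or columns are emitted would make doctest output deterministic. A hypothetical illustration, not the module's actual code:

```python
# Hypothetical illustration of the suspected fix: iteration order over a
# set varies between runs (hash randomization), so doctest output built
# from it is unstable; sorting yields a deterministic, reproducible order.

new_cols = {'Temperature', 'Secchi', 'Depth'}  # set: order not guaranteed

ordered = sorted(new_cols)  # deterministic order for doctest output
print(ordered)  # → ['Depth', 'Secchi', 'Temperature']
```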

Review edits

List to track progress on reviewer comments. Will add/edit as we go.

  • nitpick (domain.py): there is no need for a raw string for TADA_DATA_URL
  • discussion (convert.py): About the TODO - both points of view (regrouping constants in a single place, or having them defined near their place of use to avoid jumping around the code base) are valid. I am usually in favor of the former.
  • suggestion (domain.py): In harmonize_TADA_dict, we could use a groupby operation to avoid looping through the dataframe using python. TOCHECK
  • suggestion (domain.py): replace x in list(set(pandas_series)) pattern w/ .unique method
  • suggestion (domain.py, basic.py): replace one dict functions (e.g., out_col_lookup) w/ module-level dicts. Move their docstrings (e.g., sources) to module docstring.
  • suggestion (convert.py): We could add "references" sections in the docstrings so that the sources are present in the website and not only in the source code.
  • suggestion (basis.py, general): use pandas' methods, e.g., assign w/ np.where() for masked replace. Do compare on timeit, consistency and any other pros/cons.
  • todo (init.py): importlib.metadata was added in python 3.8, which is the minimal version supported by the package according to its pyproject.toml. The try .. except block should not be needed, even more so considering that importlib_metadata is not listed in the project requirements.
  • todo (basis.py): group the conditions branches in update_result_basis
  • todo (contributing.rst): Add a section describing how to setup development environment (e.g. installing the test and docs dependencies).
  • issue (domain.py): list sub-dependencies (e.g., requests)
  • issue (domain.py): specify exception expected in re_case.

Note: several items not on the list have been or are being resolved.
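The list(set(pandas_series)) suggestion above can be shown directly; besides being shorter, pandas' Series.unique() preserves order of first appearance, which list(set(...)) does not. Example data here is made up for illustration:

```python
# Sketch of the .unique suggestion above. Series.unique() returns values
# in order of first appearance, while list(set(...)) has arbitrary order.
import pandas as pd

s = pd.Series(['mg/L', 'ug/L', 'mg/L', 'deg C'])

stable = list(s.unique())  # order of first appearance, no set detour
print(stable)  # → ['mg/L', 'ug/L', 'deg C']
```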

Catch error for unrecognized units with 'NoneType'

Within harmonize.WQCharData.check_units(), the try/except catches pint.UndefinedUnitError, but pint has to be able to evaluate the unit to get to that point; e.g., a special-character unit like '#' in '#/100mL' first throws an AttributeError: 'NoneType' object has no attribute 'evaluate'. Commit a18a70d added a built-in fix for now, but ideally that error needs to be caught/handled better.
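One way to sketch a more general pre-fix is substituting problem characters before pint ever parses the string. The `PROBLEM_CHARS` mapping and `pre_sanitize` helper are hypothetical, not the actual change in commit a18a70d:

```python
# Hypothetical sketch: replace characters pint's tokenizer cannot handle
# before parsing, so pint.UndefinedUnitError remains the only error left
# to catch in check_units(). Not the actual fix from commit a18a70d.

PROBLEM_CHARS = {'#': 'count'}

def pre_sanitize(unit_str):
    """Swap known problem characters for pint-parseable unit names."""
    for bad, good in PROBLEM_CHARS.items():
        unit_str = unit_str.replace(bad, good)
    return unit_str

print(pre_sanitize('#/100mL'))  # → count/100mL
print(pre_sanitize('mg/L'))     # unchanged → mg/L
```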

pip install

Update toml to work w/ pypi and add info for pip installing to readme/docs.

Releases

Create the first release when it is ready.

dataretrieval v1.0.1

Update requirements to at least v1.0 and leverage new functionality (e.g., replace wrangle.what_activities).

Check for duplicate axis label when coercing measure

Specifically the error is:
ValueError: cannot reindex on an axis with duplicate labels

And comes up for GOM Phosphorus in _coerce_measure():
df_out[self.out_col] = meas_s

It occurs where geopandas is trying to re-index, this geopandas version has a relevant deprecation warning:
~lib\site-packages\geopandas\geodataframe.py:1443: FutureWarning: reindexing with a non-unique Index is deprecated and will raise in a future version.
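The error is reproducible in plain pandas whenever a Series whose index contains duplicate labels must be aligned during assignment. A minimal sketch with made-up data, not the actual GOM Phosphorus case:

```python
# Minimal reproduction sketch (made-up data, not the GOM Phosphorus case):
# assigning a Series with duplicate index labels forces pandas to
# reindex/align it against the frame's index, raising the ValueError
# quoted above.
import pandas as pd

df = pd.DataFrame({'a': [1, 2]}, index=[0, 1])
meas_s = pd.Series([10, 20, 30], index=[0, 0, 1])  # duplicate label 0

try:
    df['b'] = meas_s  # alignment must reindex a duplicate axis
    err = None
except ValueError as e:
    err = type(e).__name__

print(err)  # → ValueError
```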

Use period of record to check for data updates

When downloading large datasets it may be useful to update only what changes from a prior download. Period of record service should be part of summary webservices (not from profile). This may be better as part of dataRetrieval (at least on top, may already be in there).

Demos/tests: leverage NCCA sample frames?

Sample frames are now being published as FeatureService, can these be easily leveraged by demos instead of having a local copy in tests? Could a larger test be run with actions to ensure there is no data in these that would break? (to run periodically rather than on PRs)
