openaq / openaq-averaging
A repo focused on determining longer-term averages at varying geospatial scales from data accessed from the OpenAQ Platform.
NOTE: Please see the README of this repo for the motivation for this work.
The purpose of this issue is to provoke discussion on ways individuals may approach creating these annual averages, to flag potential data issues they may (or may not) want to address, and hopefully to stimulate more open-source projects in the community that derive these values.
The content of this issue focuses primarily on PM2.5 for reasons of a) simplicity, b) PM2.5's ubiquity on the OpenAQ platform, and c) PM2.5's large public health impact, compared to other commonly measured air pollutants, in most regions.
You are encouraged to give feedback, present ideas, or otherwise comment on any of these three sections.
This issue is part of OpenAQ's 2019 Goals and directly emerged after OpenAQ's participation in a DataKindDC DataDive event in Apr 2019.
At the station level, in the city of interest, remove all negative values from the raw dataset over a given, specified year (1).
At the station level, remove all measurements that report precisely 905 ug/m^3 or 985 ug/m^3 (2).
Average all station-level values for a given day in a given city.
Average all daily values in a given year for a given city.
We averaged this way - creating a daily average of values from reporting stations in a given city, then averaging those daily values - rather than averaging all station-level raw values for an entire year, since the latter would over-weight days when more stations happen to be reporting.
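As a rough illustration, the four steps above can be sketched in Python (the data layout and function name are illustrative, not the OpenAQ schema):

```python
from collections import defaultdict
from statistics import mean

def annual_city_average(measurements):
    """Annual PM2.5 average for one city from (station, day, value) tuples."""
    # Steps 1-2: drop negative values and the 905/985 ug/m^3 BAM error sentinels
    cleaned = [(s, d, v) for s, d, v in measurements
               if v >= 0 and v not in (905, 985)]
    # Step 3: average all station-level values for each day
    by_day = defaultdict(list)
    for _station, day, value in cleaned:
        by_day[day].append(value)
    daily_means = [mean(vals) for vals in by_day.values()]
    # Step 4: average the daily means, so days with more reporting
    # stations are not over-weighted
    return mean(daily_means) if daily_means else None
```

Note that each day contributes equally to the annual value regardless of how many stations reported that day, which is the point of the two-stage average.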
Notes from above:
1 - Removing all negative values may get rid of data that is essentially below the level of detection of the instrument (e.g. values should be represented as '0') and/or data that are valid but poorly calibrated. This may bias measurements toward higher or lower values, depending upon the issue. Without a priori knowledge of the source, it is difficult to account for this potential issue.
2 - Data points that equal '985 ug/m^3' or '905 ug/m^3' are removed since these values are sometimes used by operators of Beta Attenuation Monitors (BAMs) to indicate an output error (Source). We are unclear how uniformly this is done globally, and in most cases we do not have a priori knowledge that the PM2.5 monitors are in fact BAMs. However, we still chose to remove these values since they are specific values and, even if removed in error, would likely affect only a small subset of measurements over a year-long period.
Related to Point 2: The default measurement range of BAMs does not go above 985 ug/m^3 (Source). We don't have knowledge of how operators have configured their instruments (nor, again, whether they are using BAMs in the first place), so we have chosen not to remove data at 986 ug/m^3 and above. One could at least flag these values.
Other notes:
This method does not remove statistically significant outliers which may be indicative of a hyper-local issue, nor does it remove positive, repeating values that may indicate an issue with a monitor's reporting. There is a discussion to be had on whether these should be removed at all or at least flagged.
This method does not calculate the percentage of possible raw data points that a given station actually reported toward the yearly value, nor the percentage of days reporting in a given year. For instance, the WHO Outdoor Air Pollution Database includes information on temporal coverage, with the aim of reporting values derived from raw data reported >75% of the time. Such a calculation is entirely possible with the information available from the OpenAQ platform.
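A minimal sketch of such a coverage calculation, assuming hourly data over a full year and a WHO-style >75% threshold (the function names are ours, not an OpenAQ API):

```python
def percent_coverage(num_measurements, expected=24 * 365):
    """Share of possible hourly data points actually reported in a year."""
    return 100.0 * num_measurements / expected

def meets_who_coverage(num_measurements, threshold=75.0):
    """True when coverage clears a WHO-style >75% reporting threshold."""
    return percent_coverage(num_measurements) > threshold
```

For example, a station reporting 7,000 of 8,760 possible hours would clear the threshold, while one reporting 6,000 hours would not.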
Thanks @jflasher.
This code calculates 2018 annual average values for all cities reporting data in the OpenAQ Platform.
We ran this code in Athena (here is an explainer on how you can do that too).
-- Outer query: average the daily city values across each year
SELECT date_format(date, '%Y') AS year, country, city, sum(count) AS count, avg(average) AS average
FROM
(
  -- Middle query: average station-level daily values into one value per city per day
  SELECT date,
    sum(count) AS count,
    avg(average) AS average,
    city,
    country
  FROM
  (
    -- Inner query: daily average per station, with negative values and
    -- the 905/985 BAM error sentinels removed
    SELECT date_trunc('day', from_iso8601_timestamp(date.utc)) AS date, location, count(*) AS count, avg(value) AS average, city, country
    FROM fetches
    WHERE date.utc
    BETWEEN '2018-01-01'
    AND '2018-12-31'
    AND parameter = 'pm25'
    AND value >= 0
    AND value <> 985
    AND value <> 905
    GROUP BY date_trunc('day', from_iso8601_timestamp(date.utc)), location, city, country
  )
  GROUP BY date, city, country
)
GROUP BY country, city, date_format(date, '%Y')
ORDER BY date_format(date, '%Y') ASC, country ASC, city ASC;
This post was informed by conversations with @AreteY @jflasher and the DataKindDC Team.
Create a table of information for PM2.5, PM10, and NO2 in our system:
Updated code to generate this request below:
(H/T to @jflasher)
Important notes on this information:
Places without 1-hour averaging periods are not included.
Several data values were removed for both PM2.5 and NO2, because these values have been reported to be supplier-specific flags indicating instrument issues, not actual measured values. These values are visible in the code below.
The % reporting indicates the percentage of time, over the entire time interval and given the reported averaging interval, that data are present in the OpenAQ system. It is important to realize that this percentage is affected both by the source's ability to generate and share data and by OpenAQ's ability to access the data. For example, if an adapter in the OpenAQ system breaks, data could be generated and publicly shared by the source, yet not accounted for in the OpenAQ system.
No guarantees can be made about the accuracy of coordinates, but we require at least 4 values after the decimal to attach to a location. This equates to about 11 m resolution at the equator. To the best of our ability, we have checked the underlying sources to make sure they are reporting station-level data at a station location, not reporting from the city center.
For PM2.5 specifically, data below 4 ug/m^3 were removed due to typical detection limits of suppliers' instruments.
For NO2 specifically, data below 2 ug/m^3 were removed due to typical detection limits of suppliers' instruments.
Since the request was specifically to exclude US, CN, and EU data, those have been removed. Note that if they are added back in, some of these places have multiple sources for single locations (e.g. the US EPA AirNow program and State Air report data from the same location), which skews the '% of time reporting' information.
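A sketch of the value filters described in these notes, plus a back-of-the-envelope check of the "4 decimal places equates to about 11 m at the equator" claim. The sentinel values and detection limits come from the notes above; the names are illustrative:

```python
import math

SUPPLIER_FLAGS = {985, 999, 1985, 915, 515}   # supplier-specific error codes
DETECTION_LIMIT = {"pm25": 4.0, "no2": 2.0}   # ug/m^3, typical supplier limits

def keep_value(parameter, value):
    """Drop supplier error codes and values below the detection limit."""
    if value in SUPPLIER_FLAGS:
        return False
    return value >= DETECTION_LIMIT.get(parameter, 0.0)

# 1e-4 degrees of longitude at the equator, using Earth's mean radius:
resolution_m = 1e-4 * math.radians(1) * 6_371_000  # roughly 11.1 m
```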
PM2.5 code with comments:
SELECT *
FROM
(
  SELECT country,
    array_agg(distinct(city)) AS city,
    array_agg(distinct(location)) AS locations,
    date_format(date_trunc('year', from_iso8601_timestamp(date_local)), '%Y') AS year,
    avg(value) AS average,
    round(((count(value) / (24.0 * 365.0)) * 100.0)) AS percent_reporting, -- calculates % reporting for hourly-averaged data
    count(value) AS num_measurements,
    round(coordinates.longitude, 5) AS longitude,
    round(coordinates.latitude, 5) AS latitude,
    array_agg(distinct(cast(attribution AS JSON))) AS source
  FROM measurement
  WHERE parameter = 'pm2.5'
  AND unit = 'µg/m³'
  AND date_local
  BETWEEN '2017-01-01'
  AND '2019-12-31'
  AND value >= 0 -- Other suggestion: remove data below 4 ug/m^3, the typical detection limit of PM monitors
  AND value <> 985 -- Removes a value reported as a supplier-flagged value
  AND value <> 999 -- Removes a value reported as a supplier-flagged value
  AND value <> 1985 -- Removes a value reported as a supplier-flagged value
  AND value <> 915 -- Removes a value reported as a supplier-flagged value
  AND value <> 515 -- Removes a value reported as a supplier-flagged value
  AND averagingperiod.value = 1.0 -- In hours
  AND coordinates.longitude BETWEEN -180 AND 180
  AND coordinates.latitude BETWEEN -90 AND 90
  AND coordinates.latitude <> 0.0
  AND coordinates.longitude <> 0.0
  AND country <> 'CN' -- Removes Chinese data since not requested and large
  AND country <> 'US' -- Removes US data since not requested and large
  AND country <> 'ES' -- Spanish data not caught in previous filters and not needed
  AND sourcename <> 'AirNow' -- AirNow data is duplicative in our system for Embassy and Canadian data
  AND sourcename <> 'GIOS' -- Removes a chunk of EU data since not requested and large
  AND sourcename NOT LIKE 'EEA%' -- Removes a chunk of EU data since not requested and large
  GROUP BY round(coordinates.longitude, 5), round(coordinates.latitude, 5), country, date_format(date_trunc('year', from_iso8601_timestamp(date_local)), '%Y')
  ORDER BY country, city, year
)
WHERE
percent_reporting >= 0 -- Can be adjusted to impose a station-level % of time reporting threshold
NO2 + PM10 code - the same query as above; just substitute the parameter in:
WHERE parameter = 'XX'
To make the work of generating averages and other statistics at various temporal and spatial resolutions easier, OpenAQ could add a new API endpoint that returns air quality measurement averages given a set of parameters. With such an endpoint, users would no longer have to query or download the data and then parse and clean it to get the values of interest for reporting.
This issue proposes to prototype such an endpoint using a separate AWS account running Athena against the OpenAQ S3 bucket fetches_realtime_gzipped, and then to proceed as follows:
Create an API endpoint, /stats or /averages, which queries Athena using a set of parameters provided by the user and returns an S3 location (following Athena's asynchronous request/response cycle).
The API endpoint will produce one or more averages for a variety of parameters:
- location=, city=, or country= (V2: coordinates= and radius=)
- temporal_resolution= of daily, weekly, monthly, or yearly
- start_date= and end_date=
- parameters[]= to return averages for (see https://docs.openaq.org/#api-Parameters)
In addition to the functionality of the /averages endpoint, additional work could be done:
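To make the proposal concrete, here is a hypothetical sketch of how a client might assemble a request to the proposed endpoint; the base URL and parameter names follow the proposal above but are not a finalized API:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openaq.org/beta/averages"  # hypothetical endpoint

def build_averages_url(**params):
    """Assemble a query URL from the proposal's parameters."""
    return f"{BASE_URL}?{urlencode(params, doseq=True)}"

url = build_averages_url(country="IN", temporal_resolution="yearly",
                         start_date="2018-01-01", end_date="2018-12-31")
```

Under the asynchronous design described above, the response to such a request would point at an S3 location rather than return the averages inline.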
I'm attaching some whiteboarding from the last DataKind DC datajam in case it is helpful.
Statistics Without Borders can provide free assistance in the area of data science and statistical applications in determining longer-term averaging. I would be happy to talk with anyone who thinks this might be of use.
Below is a quick diagram of what a simple averaging tool, accessible via an API endpoint, could look like:
@jflasher - maybe you could share the SQL query you used to play around with this?
cc: @sruti
The averages call returns the longitude coordinate for both the latitude and longitude fields.
API request:
https://api.openaq.org/beta/averages?country=CN&temporal=year&limit=10000
First few lines of the response:
Any chance this could be fixed?
Thanks