openaq / openaq-averaging
A repo focused on determining longer-term averages at varying geospatial scales from data accessed from the OpenAQ Platform.
NOTE: Please see the README of this repo for the motivation for this work.
The purpose of this issue is to provoke discussion on ways individuals may approach creating these annual averages, to flag potential data issues they may (or may not) want to address, and hopefully to stimulate more open-source projects in the community that derive these values.
The content of this issue focuses primarily on PM2.5 for reasons of a) simplicity, b) PM2.5's ubiquity on the OpenAQ platform, and c) PM2.5's large public health impact, compared to other commonly measured air pollutants, in most regions.
You are encouraged to give feedback, present ideas, or otherwise comment on any of these three sections.
This issue is part of OpenAQ's 2019 Goals and directly emerged after OpenAQ's participation in a DataKindDC DataDive event in Apr 2019.
At the station level, in the city of interest, remove all negative values from the raw dataset over a given, specified year (1).
At the station level, remove all measurements that report precisely 905 ug/m^3 or 985 ug/m^3 (2).
Average all station-level values for a given day in a given city.
Average all daily values in a given year for a given city.
We averaged this way - creating a daily average of values from reporting stations in a given city, then averaging those daily values - rather than averaging all station-level raw values for an entire year, since the latter would over-weight days when more stations happen to be reporting.
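As a rough illustration, the four steps above can be sketched in Python (the data layout and function name are illustrative, not the OpenAQ schema):

```python
from collections import defaultdict
from statistics import mean

def annual_city_average(measurements):
    """Annual PM2.5 average for one city from (station, day, value) tuples."""
    # Steps 1-2: drop negative values and the 905/985 ug/m^3 BAM error sentinels
    cleaned = [(s, d, v) for s, d, v in measurements
               if v >= 0 and v not in (905, 985)]
    # Step 3: average all station-level values for each day
    by_day = defaultdict(list)
    for _station, day, value in cleaned:
        by_day[day].append(value)
    daily_means = [mean(vals) for vals in by_day.values()]
    # Step 4: average the daily means, so days with more reporting
    # stations are not over-weighted
    return mean(daily_means) if daily_means else None
```

Note that each day contributes equally to the annual value regardless of how many stations reported that day, which is the point of the two-stage average.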
Notes from above:
1 - Removing all negative values may get rid of data that is essentially below the level of detection of the instrument (e.g. values should be represented as '0') and/or data that are valid but poorly calibrated. This may bias measurements toward higher or lower values, depending upon the issue. Without a priori knowledge of the source, it is difficult to account for this potential issue.
2 - Data points that equal '985 ug/m^3' or '905 ug/m^3' are removed since these values are sometimes used by operators of Beta Attenuation Monitors (BAMs) to indicate an output error (Source). We are unclear how uniformly this is done globally, and in most cases we do not have a priori knowledge that the PM2.5 monitors are in fact BAMs. However, we still chose to remove these values since they are specific values and, even if removed in error, would likely affect only a small subset of measurements over a year-long period.
Related to Point 2: The default measurement range of BAMs does not go above 985 ug/m^3 (Source). We don't have knowledge of how operators have configured their instruments (nor, again, whether they are using BAMs in the first place), so we have chosen not to remove data at 986 ug/m^3 and above. One could at least flag these values.
Other notes:
This method does not remove statistically significant outliers which may be indicative of a hyper-local issue, nor does it remove positive, repeating values that may indicate an issue with a monitor's reporting. There is a discussion to be had on whether these should be removed at all or at least flagged.
This method does not calculate the percentage of possible raw data points that a given station actually reported toward the yearly value, nor the percentage of days reporting in a given year. For instance, the WHO Outdoor Air Pollution Database includes information on temporal coverage, with the aim of reporting values derived from raw data reported >75% of the time. Such a calculation is entirely possible with the information available from the OpenAQ platform.
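A minimal sketch of such a coverage calculation, assuming hourly data over a full year and a WHO-style >75% threshold (the function names are ours, not an OpenAQ API):

```python
def percent_coverage(num_measurements, expected=24 * 365):
    """Share of possible hourly data points actually reported in a year."""
    return 100.0 * num_measurements / expected

def meets_who_coverage(num_measurements, threshold=75.0):
    """True when coverage clears a WHO-style >75% reporting threshold."""
    return percent_coverage(num_measurements) > threshold
```

For example, a station reporting 7,000 of 8,760 possible hours would clear the threshold, while one reporting 6,000 hours would not.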
Thanks @jflasher.
This code calculates 2018 annual average values for all cities reporting data in the OpenAQ Platform.
We ran this code in Athena (here is an explainer on how you can do that too).
-- Outer query: average the daily city values across each year
SELECT date_format(date, '%Y') AS year, country, city, sum(count) AS count, avg(average) AS average
FROM
(
  -- Middle query: average station-level daily values into one value per city per day
  SELECT date,
    sum(count) AS count,
    avg(average) AS average,
    city,
    country
  FROM
  (
    -- Inner query: daily average per station, with negative values and
    -- the 905/985 BAM error sentinels removed
    SELECT date_trunc('day', from_iso8601_timestamp(date.utc)) AS date, location, count(*) AS count, avg(value) AS average, city, country
    FROM fetches
    WHERE date.utc
    BETWEEN '2018-01-01'
    AND '2018-12-31'
    AND parameter = 'pm25'
    AND value >= 0
    AND value <> 985
    AND value <> 905
    GROUP BY date_trunc('day', from_iso8601_timestamp(date.utc)), location, city, country
  )
  GROUP BY date, city, country
)
GROUP BY country, city, date_format(date, '%Y')
ORDER BY date_format(date, '%Y') ASC, country ASC, city ASC;
This post was informed by conversations with @AreteY @jflasher and the DataKindDC Team.
Create a table of information for PM2.5, PM10, and NO2 in our system:
Updated code to generate this request below:
(H/T to @jflasher)
Important notes on this information:
Places without 1-hour averaging periods are not included.
Several data values were removed for both PM2.5 and NO2, because these values have been reported to be supplier-specific flags indicating instrument issues, not actual measured values. These values are visible in the code below.
The % reporting indicates the percentage of time, over the entire time interval and given the reported averaging interval, that data are present in the OpenAQ system. It is important to realize that this percentage is affected both by the source's ability to generate and share data and by OpenAQ's ability to access the data. For example, if an adapter in the OpenAQ system breaks, data could be generated and publicly shared by the source, yet not accounted for in the OpenAQ system.
No guarantees can be made about the accuracy of coordinates, but we require at least 4 values after the decimal to attach to a location. This equates to about 11 m resolution at the equator. To the best of our ability, we have checked the underlying sources to make sure they are reporting station-level data at a station location, not reporting from the city center.
For PM2.5 specifically, data below 4 ug/m^3 were removed due to typical detection limits of suppliers' instruments.
For NO2 specifically, data below 2 ug/m^3 were removed due to typical detection limits of suppliers' instruments.
Since the request was specifically to exclude US, CN, and EU data, those have been removed. Note that if they are added back in, some of these places have multiple sources for single locations (e.g. the US EPA AirNow program and State Air report data from the same location), which skews the '% of time reporting' information.
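A sketch of the value filters described in these notes, plus a back-of-the-envelope check of the "4 decimal places equates to about 11 m at the equator" claim. The sentinel values and detection limits come from the notes above; the names are illustrative:

```python
import math

SUPPLIER_FLAGS = {985, 999, 1985, 915, 515}   # supplier-specific error codes
DETECTION_LIMIT = {"pm25": 4.0, "no2": 2.0}   # ug/m^3, typical supplier limits

def keep_value(parameter, value):
    """Drop supplier error codes and values below the detection limit."""
    if value in SUPPLIER_FLAGS:
        return False
    return value >= DETECTION_LIMIT.get(parameter, 0.0)

# 1e-4 degrees of longitude at the equator, using Earth's mean radius:
resolution_m = 1e-4 * math.radians(1) * 6_371_000  # roughly 11.1 m
```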
PM2.5 code with comments:
SELECT *
FROM
(
  SELECT country,
    array_agg(distinct(city)) AS city,
    array_agg(distinct(location)) AS locations,
    date_format(date_trunc('year', from_iso8601_timestamp(date_local)), '%Y') AS year,
    avg(value) AS average,
    round(((count(value) / (24.0 * 365.0)) * 100.0)) AS percent_reporting, -- calculates % reporting for hourly-averaged data
    count(value) AS num_measurements,
    round(coordinates.longitude, 5) AS longitude,
    round(coordinates.latitude, 5) AS latitude,
    array_agg(distinct(cast(attribution AS JSON))) AS source
  FROM measurement
  WHERE parameter = 'pm2.5'
  AND unit = 'µg/m³'
  AND date_local
  BETWEEN '2017-01-01'
  AND '2019-12-31'
  AND value >= 0 -- Other suggestion: remove data below 4 ug/m^3, the typical detection limit of PM monitors
  AND value <> 985 -- Removes a value reported as a supplier-flagged value
  AND value <> 999 -- Removes a value reported as a supplier-flagged value
  AND value <> 1985 -- Removes a value reported as a supplier-flagged value
  AND value <> 915 -- Removes a value reported as a supplier-flagged value
  AND value <> 515 -- Removes a value reported as a supplier-flagged value
  AND averagingperiod.value = 1.0 -- In hours
  AND coordinates.longitude BETWEEN -180 AND 180
  AND coordinates.latitude BETWEEN -90 AND 90
  AND coordinates.latitude <> 0.0
  AND coordinates.longitude <> 0.0
  AND country <> 'CN' -- Removes Chinese data since not requested and large
  AND country <> 'US' -- Removes US data since not requested and large
  AND country <> 'ES' -- Spanish data not caught in previous filters and not needed
  AND sourcename <> 'AirNow' -- AirNow data is duplicative in our system for Embassy and Canadian data
  AND sourcename <> 'GIOS' -- Removes a chunk of EU data since not requested and large
  AND sourcename NOT LIKE 'EEA%' -- Removes a chunk of EU data since not requested and large
  GROUP BY round(coordinates.longitude, 5), round(coordinates.latitude, 5), country, date_format(date_trunc('year', from_iso8601_timestamp(date_local)), '%Y')
  ORDER BY country, city, year
)
WHERE
percent_reporting >= 0 -- Can be adjusted to impose a station-level % of time reporting threshold
NO2 + PM10 code - the same query as above; just substitute the parameter in:
WHERE parameter = 'XX'
To make the work of generating averages and other statistics at various temporal and spatial resolutions easier, OpenAQ could add a new API endpoint that returns air quality measurement averages given a set of parameters. With such an endpoint, users would no longer have to query or download the data and then parse and clean it to get the values of interest for reporting.
This issue proposes to prototype such an endpoint using a separate AWS account running Athena against the OpenAQ S3 bucket fetches_realtime_gzipped, and then to proceed as follows:
Create an API endpoint, /stats or /averages, which queries Athena using a set of parameters provided by the user and returns an S3 location (following Athena's asynchronous request/response cycle).
The API endpoint will produce one or more averages for a variety of parameters:
- location=, city=, or country= (V2: coordinates= and radius=)
- temporal_resolution= of daily, weekly, monthly, or yearly
- start_date= and end_date=
- parameters[]= to return averages for (see https://docs.openaq.org/#api-Parameters)
In addition to the functionality of the /averages endpoint, additional work could be done:
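To make the proposal concrete, here is a hypothetical sketch of how a client might assemble a request to the proposed endpoint; the base URL and parameter names follow the proposal above but are not a finalized API:

```python
from urllib.parse import urlencode

BASE_URL = "https://api.openaq.org/beta/averages"  # hypothetical endpoint

def build_averages_url(**params):
    """Assemble a query URL from the proposal's parameters."""
    return f"{BASE_URL}?{urlencode(params, doseq=True)}"

url = build_averages_url(country="IN", temporal_resolution="yearly",
                         start_date="2018-01-01", end_date="2018-12-31")
```

Under the asynchronous design described above, the response to such a request would point at an S3 location rather than return the averages inline.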
I'm attaching some whiteboarding from the last DataKind DC datajam in case it is helpful.
Statistics Without Borders can provide free assistance in the area of data science and statistical applications in determining longer-term averaging. I would be happy to talk with anyone who thinks this might be of use.
Below is a quick diagram of what a simple averaging tool, accessible via an API endpoint, could look like:
@jflasher - maybe you could share the SQL query you used to play around with this?
cc: @sruti
The averages call returns the longitude coordinate for both the latitude and longitude fields.
API request:
https://api.openaq.org/beta/averages?country=CN&temporal=year&limit=10000
First few lines of the response:
Any chance this could be fixed?
Thanks