
openaq-data-format's People

Contributors

dqgorelick, jflasher, necoline, olafveerman, sethvincent, sruti


openaq-data-format's Issues

Adding an insert date/time stamp

Suggest adding a field that indicates when data have been added to the system.

The motivation for this suggested change:
This will make it clearer to the user when information was added to the system, especially when it has been back-filled from an RT source or from a researcher. It also seems more transparent. Noticed that the EU EEA API does this.

Downside:
I know a downside to this is that we won't be able to assign an insert time to data already collected into the system, and that is not ideal when using the information from a software development perspective.

Feedback from Multitude

Nick Masson of Multitude gave us this feedback and these questions on our data format and API. Putting this up as one big issue for now; will revisit in a week or so. Thought it might be of interest to others.

@jflasher - please give your feedback on these answers before I email Nick back.

Answering inline:

1) Do you have a list showing, for example, all of the "attribution" sources, the "adapter" associated with that source, and a layman description of the source?

We have a short description by name (e.g. 'US EPA/AirNow') and also the URL for the originating source - all under the 'attribution' field. At this stage, all sources on our platform originate from governmental bodies. You can find more about that here: https://docs.openaq.org/#api-Sources

If you mean more info on the instrument types, calibration procedures etc.: We know this would be valuable information for many folks, but we have no way to systematically or reliably get that from nearly any available governmental source (whether it is currently in our system or not).

2) Correct me if I'm wrong, but I assume the attribution field describes where the data is coming from? What is the "sourceName"?

Yes, that's correct. The sourceName is just a way to refer to the specific source files here: https://github.com/openaq/openaq-fetch/tree/develop/sources

3) Our immediate use case would be to be able to quickly pull data from all of the US regulatory stations, for any or all of the parameters they measure. It is important that we can identify whether the data is from a US regulatory station or another source. I assume that the "attribution" field and "url" are consistent, so, for example, they might be "AirNow" and "www.airnow.gov" for all of the US regulatory data that is acquired in real time?

Yes, the attribution field is consistent, but our system is not designed to let you search by attribution.
Two thoughts though on this:

(a) Currently, all real-time US data aggregated into our system is from AirNow. (And it should be noted that all data currently on our platform is from governmental sources.) The exception to this is a few months' worth of data that we aggregated from Houston, TX before adding in the AirNow sources (this data was collected by the local EPA, but it may or may not have been used for regulation); we no longer aggregate from this source, though. So currently, if you used the API to filter on the country field in real-time (or historically, with the exception of Houston, TX), you would only get AirNow data.

(b) But of course we plan to add other data source types (e.g. research-grade and low-cost sensors), and our system will need to indicate this. Currently, we're sketching out a very simple system to differentiate these types; see Issue #8. This will help show whether a data source is a governmental body versus a researcher, but it won't tell you whether the source is specifically used for regulatory purposes. We would have trouble distinguishing this for many, if not most, countries.

Similarly, if you were to pull in data from the same regulatory stations, but have it be post-QA/QC, could we discern between the two datasets by sorting on the "attribution" field?

At this stage, frankly, we don't have plans to pull in post-QA/QC data. Our main goal is to capture data that would otherwise be lost for the record. This is not the case for US EPA data, obviously, but it is a useful data set for people to build complementary tools on and to compare with. That said, if the community says this is a 'must', we'll see if we can make it happen, and we would need a tag in our data format indicating whether data is pre- or post-QA/QC.

4) Can the date_from and date_to accept datetimes formatted to the second (e.g. "2016-05-07T12:44:22.556Z")? I am aware that most of the data is fairly low frequency (hourly averages), but our system would be windowing on time intervals to make sure we get the entire time series (we do batch processing on consecutive time intervals of data, and can be sure not to miss anything if we bracket down to accurate time intervals).

Yup, it should accept down to the second. It should accept anything in the ISO 8601 standard.
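
For illustration, here is a minimal sketch of windowing a query on second-resolution ISO 8601 timestamps; it assumes the /v1/measurements endpoint and the date_from/date_to parameters documented at docs.openaq.org, and is not an official example:

```python
# Sketch only: window a query on second-resolution ISO 8601 timestamps.
# Assumes the /v1/measurements endpoint and its date_from/date_to parameters.
import requests

resp = requests.get(
    "https://api.openaq.org/v1/measurements",
    params={
        "country": "US",
        "date_from": "2016-05-07T12:44:22.556Z",  # second (and sub-second) precision
        "date_to": "2016-05-07T13:44:22.556Z",
        "limit": 1000,
    },
)
resp.raise_for_status()
data = resp.json()
print(data["meta"], len(data["results"]))
```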

5) It would be very useful if you include a field that tells how the time is averaged. Different organizations average differently. For example, is the one hour average centered on the timestamp, or forward or backward looking (e.g. forward looking would have 12:00 represent data averaged between 12:00 and 12:59). Not sure if you have this info, or would be willing to go through and contact your various sources to find out. For us, it's crucial when cross-comparing different data.

Hear you on this. At a minimum, we need to add another field that differentiates reporting frequency from averaging period. I think we will have trouble - from a sheer communication standpoint with governmental agencies - getting down to forward- versus backward-looking averaging for many sources, but it could be something we do at least for the larger sources; I've created a separate issue on it. Also, do you know which way it is done for the US EPA AirNow data? I believe it is timestamped with the ending time (e.g. data taken between 3pm and 4pm is marked 4pm).

Addendum: As I apparently forgot, we do have a protocol in place for how we define the timestamp for an average: a measurement is timestamped with the ending time of its averaging period. For example, a measurement taken between 3pm and 4pm is given a timestamp of 4pm. That said, it is probably the case that we access data from some sites that only provide a single timestamp, where it is not readily apparent whether this is a beginning, middle, or end timestamp. More on our format here: https://github.com/openaq/openaq-api/wiki/4.-Writing-an-adapter#dealing-with-dates-and-date-ranges
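
As a rough illustration of that convention (assuming the measurement carries a date.utc timestamp and an averagingPeriod of the form {"value": 1, "unit": "hours"}), the averaging window can be recovered like this:

```python
# Sketch only: recover the averaging window from the end-of-period timestamp convention.
from datetime import datetime, timedelta

def averaging_window(measurement):
    end = datetime.fromisoformat(measurement["date"]["utc"].replace("Z", "+00:00"))
    period = measurement["averagingPeriod"]
    assert period["unit"] == "hours"  # only hourly periods handled in this sketch
    return end - timedelta(hours=period["value"]), end

m = {"date": {"utc": "2016-06-25T16:00:00Z"}, "averagingPeriod": {"value": 1, "unit": "hours"}}
print(averaging_window(m))  # data averaged 15:00-16:00 UTC is stamped 16:00
```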

6) Can you include a "totalPages" field in the returned JSON? This would help with the logic in pulling data down on our side -- we'll know a priori how many times we need to loop over the pagination. Otherwise we would have to do some more ad hoc coding to infer it ourselves.

I think you should be able to get this by dividing 'found' by 'limit' in the metadata returned (see image below):
[screenshot: API response meta object showing the 'found' and 'limit' fields]
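
For example, a rough sketch of deriving the page count from those two meta fields (the numbers here are made up):

```python
# Sketch only: derive the number of pages from the returned meta object.
import math

def total_pages(meta):
    return math.ceil(meta["found"] / meta["limit"])

print(total_pages({"found": 7034, "limit": 100}))  # -> 71
```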

If I am misunderstanding that q, let me know.

7) It would be useful to be able to query on the "attribution" or "url" for attribution. I can definitely see us cross-referencing a list as in "1)" and wanting to query data for just one source.

This is good feedback; I've made an issue on that. openaq/openaq-api#256

I also do wonder if this will, in some sense, be solved by this issue: #8

Perhaps not completely though...

8) Do you also have a list of all possible pollutants/parameters that are in your schema?

These are currently the ones we capture: PM2.5, PM10, CO, O3, NO2, SO2 and BC (though BC data is the rarest we find).

We don't have immediate plans to expand or truncate this list, but the most current listing should always be here: https://docs.openaq.org/#api-Measurements

If you have feedback on any pollutants you would find useful to include, let us know. We tend to default to the parameters most commonly collected globally rather than the rarer types some places measure, like benzene.

9) it would be great if all the units were standardized -- we do this in our system, and it's fairly painful, but worth doing at the base level of data ingestion. i.e. only deal in ppb or ppm for certain pollutants, etc... Otherwise we need to write logic on our side that x-references the measurement with the units, and if the units aren't our standard unit, then convert the units against a mapping for different unit conversion.

So, one thing we stick to very closely in our system is the precise way the data is shared on the originating site; we think it's important to always save the 'raw' data to our system as it appears at the source. We do make conversions between ppb and ppm for volume concentrations (so we share all measurements made in ppb as ppm - you can see the preferred units here: https://github.com/openaq/openaq-data-format).

BUT, as you noticed, we don't convert when an ozone measurement, for example, is reported in ug/m^3. We don't do this because of the assumptions about pressure and temperature we would have to make at each location globally. We find this to be a bit of a pain in the butt, too. :) But again, we prioritize sharing the dataset openly and transparently from its originating sources, with no assumptions applied on our part.

However, this is something that could be added on top of our system, and it clearly would be useful, even if done for a region rather than globally. I'm making an issue on it. If you/your team were to dig into that piece with some open-source code, we'd advertise the tool you generate widely through our network, write a blog post about it or work on one together, and do anything else we could to call out such awesomeness.
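
To make the discussion concrete, here is a minimal sketch (not part of OpenAQ) of what such an add-on conversion layer could look like; the 25 degC / 1 atm conditions and the molar masses are assumptions made purely for illustration:

```python
# Sketch only: convert mass concentrations (ug/m^3) to mixing ratios (ppb)
# under *assumed* temperature and pressure, via the ideal gas law.

R = 8.31446  # universal gas constant, J/(mol*K)

# Assumed molar masses (g/mol) for the gaseous parameters discussed above.
MOLAR_MASS = {"o3": 48.00, "no2": 46.01, "so2": 64.07, "co": 28.01}

def ugm3_to_ppb(value_ugm3, parameter, temp_k=298.15, pressure_pa=101325.0):
    """Defaults assume 25 degC and 1 atm - exactly the assumption OpenAQ avoids making."""
    molar_volume_l = R * temp_k / pressure_pa * 1000.0  # ~24.5 L/mol at these conditions
    return value_ugm3 * molar_volume_l / MOLAR_MASS[parameter]

print(round(ugm3_to_ppb(100, "o3"), 1))  # 100 ug/m^3 of ozone is roughly 51 ppb
```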

Adding in field(s) that reflect RT vs historical/backfilled + QA/QC (for non-RT data)

Suggest a 'Data Type' Field with four categories, such as:

  1. Real-Time: Any data we currently ingest into the system, which by definition is not QA/QC'ed
  2. Historical/QA-QC: Backfilled historical data that has gone through QA/QC (e.g. EEA or EPA non-RT data, possibly from researchers)
  3. Historical/No QA-QC: 'Raw data'
  4. Historical/Unknown

Or perhaps this is too complicated, and it should be broken down into two fields: one RT vs Historical, and the other QA/QC: Yes/No/Unknown?
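
To make the two options concrete, a measurement could carry either shape roughly as sketched below (the field names and values are hypothetical, not part of the current format):

```python
# Hypothetical sketches only - neither field exists in the current data format.

# Option 1: a single field with four categories
measurement_option_1 = {
    "parameter": "pm25",
    "value": 42.0,
    "unit": "ug/m3",
    "dataType": "historical/qa-qc",  # or "real-time", "historical/no-qa-qc", "historical/unknown"
}

# Option 2: two separate fields
measurement_option_2 = {
    "parameter": "pm25",
    "value": 42.0,
    "unit": "ug/m3",
    "realTime": False,
    "qaQc": "yes",  # "yes" / "no" / "unknown"
}
```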

The motivation for this suggested change:
Eventually, we will want to be able to backfill data from sources, fill in holes, or take data from sources (e.g. gov't agencies, researchers) that would rather only share QA/QC'ed data. Data that is not real-time, especially from gov't sources, will typically be QA/QC'ed, unlike the real-time data we are collecting. For this reason, it would be good to have a field that reflects these differences in known data quality. We have gotten requests for this feature.

We have also gotten a related request to provide info on the exact QA/QC procedures of a given place. That'd be awesome, but I think it will be difficult to precisely parse the QA/QC controls used by each place, and it is unreasonable for us to do that now or in the near future. Plus, a user can find the data source agency and contact them for more information.

cc: @olafveerman @dolugen @jflasher - I'll be making a series of these for discussion (and using a new label, dark blue 'v2'.) Will be interested in your thoughts on these and other possible changes to the format for v2.

Data format to differentiate real-time captured data from QA/QC'ed data

Below, @masalmon mentioned differences she noticed between real-time captured US Embassy data from Delhi and data reported on the Embassy site later. OpenAQ currently only captures real-time data for government sources, and does not allow the insertion of a QA/QC layer.

From @masalmon:

Sarath asked me to do a graph comparing Beijing and Delhi, and for that I decided to use data from the embassy that is not in OpenAQ, in order to have a whole year for Delhi. I use my usaqmindia GitHub repo for that. It was a great occasion to compare both datasets for the common times: OpenAQ vs embassy data. It's interesting to notice that there are differences. They seem to be: sometimes OpenAQ has -999 while the dataset from the embassy website has more credible values.

[plot: OpenAQ vs. embassy website values, showing the negative (-999) measurements]

Here you see the negative measures; I'll make a summary of some sort for the repeated measures.

So, a small summary of the issue: I'm looking at 5463 non-missing measures between 2015-12-12 03:00:00 and 2016-08-01 00:00:00. 145 are different between the embassy website and OpenAQ. So not a lot, but still interesting. Do you want to add the "right" value with a corresponding flag as suggested by @RocketD0g (real time vs not)? (edited)

I've just looked at the dates at which non concordant values appear, and it's not at a given period, last time was in July this year.
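
(For reference, the kind of comparison described above can be sketched roughly as follows; the file and column names are hypothetical.)

```python
# Sketch only: align the OpenAQ and embassy series on timestamp, flag -999
# sentinel values, and count measurements that differ between the two sources.
import pandas as pd

openaq = pd.read_csv("openaq_delhi_pm25.csv", parse_dates=["datetime"])
embassy = pd.read_csv("embassy_delhi_pm25.csv", parse_dates=["datetime"])

merged = openaq.merge(embassy, on="datetime", suffixes=("_openaq", "_embassy"))
merged["openaq_sentinel"] = merged["value_openaq"] == -999
merged["differs"] = merged["value_openaq"] != merged["value_embassy"]

print(merged["differs"].sum(), "of", len(merged), "common measurements differ")
```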

Discussion on data reqs for 'research' and 'other' sourceTypes

Moving over a discussion from @masalmon on Slack to here:

"Besides these R package questions, I am wondering how much metadata there will be for accompanying research data? Moreover, when monitoring AQ in rural locations research studies also measure Temperature and RH and wind direction&speed that are really important, and they could not be published in OpenAQ at the same time because OpenAQ does not have T and RH. Another variable that would be interesting to have is the device used. I am writing this from a data manager point of view: I have no idea whether the data of my research project will be made public (it's not for me to decide) but I can say that we have AQ and meteorological variables in rural locations. The AQ data without weather data would be poorer and it'd be very hard to get weather data from elsewhere for these locations since they're rural. And in the metadata+in articles we'll give references of the devices. With official sources the information about devices is not always there but for research it should be available anyway. Just my two cents."

include 'official government AQ data'

Adjust data format to add sub data type: 'official government AQ data'.

In order to add a few data sources shared directly by governments, we will need to add this sub-data type (as opposed to real-time government data).

Better system for parsing location levels

I don't have a clear suggestion at the moment for solving this issue, so I'll just describe the problem:

Currently, we assign a city to each measurement. But some measurements aren't associated with cities in their originating source (e.g. several data points in the EPA system; also an issue with the DEFRA data for GB). Currently, we assign the non-city-associated EPA data its county instead.

But it's a larger issue for other places we add: how do we handle locations that aren't associated with cities and are truly rural sites? I suggest leaving them blank, but I know this is a bad idea! (@jflasher)

Adding in two properties to the data format - feedback sought

Because the SPARTAN data is research-grade and not created by/on behalf of a governmental body - unlike all other data in our system to date - adding new properties to our data format will be necessary before we add the SPARTAN data to the system (PR here).

Thoughts/feedback on this structure? cc: @dolugen @olafveerman @jflasher

Additional Tag # 1:
sourceType - This delineates the type of source from which the data originates:

  • Type 1 - Sensors that are deployed by citizen scientists, often low-cost sensors.
  • Type 2 - Sensors that are deployed by researchers affiliated with universities and/or research organizations.
  • Type 3 - Sensors that are deployed by or on behalf of governmental bodies.

Additional Tag # 2:
mobile - This is either T/F, and indicates whether the source is stationary or mobile.
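
For illustration, a measurement carrying the two proposed properties might look roughly like this (the value strings are placeholders, not a settled vocabulary):

```python
# Hypothetical illustration of the two proposed properties on a measurement.
measurement = {
    "parameter": "pm25",
    "value": 17.0,
    "unit": "ug/m3",
    "sourceType": "research",  # Type 1 = citizen science, Type 2 = research, Type 3 = government
    "mobile": False,           # stationary station
}
```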

Another nuance to describing averaging + reporting interval: a request to describe how the field is averaged

This is of lower priority than first addressing issues brought up in: #4 (e.g. being clear on averaging period and reporting frequency), but an interesting point brought up by Multitude (here):

They'd like a "field that tells how the time is averaged. Different organizations average differently. For example, is the one hour average centered on the timestamp, or forward or backward looking (e.g. forward looking would have 12:00 represent data averaged between 12:00 and 12:59). Not sure if you have this info, or would be willing to go through and contact your various sources to find out. For us, it's crucial when cross-comparing different data."

Logistically, I don't see this happening for us for all sources any time soon. We may be able to determine this for EPA and EEA, but think it will be very hit or miss elsewhere.

Changing name of "averaging_Period" and adding field that truly indicates time resolution

Suggest changing the currently defined 'averaging_Period' to 'reporting_Interval' and adding a true 'averaging_Period'.

The motivation for this suggested change:
As I understand it, 'averaging_Period' currently captures the interval at which a data point on a site is updated, e.g. UB data is updated on an hourly basis on the website. But this is not the same as the temporal resolution of the measurement itself. In many cases the two happen to coincide, but they are not necessarily the same.
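
A hypothetical example of why the two fields need to be separate: a source that publishes a new value every hour, where each value is a 24-hour running average (the field names follow the proposal above and are not part of the current format):

```python
# Hypothetical: reporting interval and averaging period differ for this source.
measurement = {
    "parameter": "pm25",
    "value": 63.0,
    "unit": "ug/m3",
    "reporting_Interval": {"value": 1, "unit": "hours"},  # how often the site updates
    "averaging_Period": {"value": 24, "unit": "hours"},   # what each value averages over
}
```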

2015 GBD/WHO template for including data in their global databases

Originally from @RocketD0g in openaq/openaq-api#52


No Immediate Action Intended - Background Info

FYI, a useful template of the type of information collected for the upcoming 2015 WHO and GBD global databases of annual average PM2.5 and PM10 pollution is below (they are currently primarily searching for 2014 data). I can't find the issue, but I think @olafveerman brought up the categorization of sites before (e.g. what are the criteria for residential, urban, industrial, etc.?). It has been indicated there are no strict criteria for this currently, and countries are directed to fill out the template using their best judgement.

http://www.who.int/entity/phe/health_topics/outdoorair/databases/PHE-Template-OAP-database-entries-June2015.xls?ua=1
