spark-xarray's Introduction

spark-xarray

spark-xarray is an open-source project and Python package that integrates PySpark and xarray for climate data analysis. It is built on top of PySpark (the Spark Python API) and xarray.

spark-xarray was originally conceived during the summer of 2017 as part of PySpark for "Big" Atmospheric & Oceanic Data Analysis, a CISL/SIParCS research project.

It is currently maintained by Anderson Banihirwe.

Documentation is available at https://andersy005.github.io/spark-xarray/.

Installation

This guide walks you through installing spark-xarray. We assume that an Apache Spark installation is already available.

Requirements

At a minimum, spark-xarray requires the packages it is built on: PySpark and xarray.

Install

Clone the repository directly from GitHub and install it afterwards with python setup.py install. This will also resolve possible missing dependencies.

$ git clone https://github.com/andersy005/spark-xarray.git
$ cd spark-xarray
$ python setup.py install
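
To confirm the install, importing the reader entry point used in the examples below should succeed:

$ python -c "from sparkxarray.reader import ncread"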

Development

We welcome new contributors of all experience levels.

Examples

Single file

>>> from sparkxarray.reader import ncread
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('spark-rdd').getOrCreate()
>>> sc = spark.sparkContext
>>> filepath='spark-xarray/sparkxarray/tests/data/air.sig995.2012.nc'
>>> # Create an RDD
>>> rdd = ncread(sc, filepath, mode='single', partition_on=['time'], partitions=100)
>>> rdd.first()  # Get the first element
<xarray.Dataset>
Dimensions:  (lat: 73, lon: 144, time: 1)
Coordinates:
  * lat      (lat) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 70.0 67.5 ...
  * lon      (lon) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 22.5 ...
  * time     (time) datetime64[ns] 2012-01-01
Data variables:
    air      (time, lat, lon) float64 234.5 234.5 234.5 234.5 234.5 234.5 ...
Attributes:
    Conventions:  COARDS
    title:        mean daily NMC reanalysis (2012)
    history:      created 2011/12 by Hoop (netCDF2.3)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
>>> rdd.count()   # Get a count of elements in the rdd
366
>>> # The count above corresponds to the number of timesteps in the netCDF file
>>> rdd.getNumPartitions()  # Get the number of partitions
100
>>> # Compute the daily average for each day (element) in RDD
>>> daily_average = rdd.map(lambda x: x.mean(dim=['lat', 'lon']))
>>> daily_average.take(3)
[<xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2012-01-01
Data variables:
    air      (time) float64 277.0, <xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2012-01-02
Data variables:
    air      (time) float64 276.8, <xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2012-01-03
Data variables:
    air      (time) float64 ...]
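
To pull the reduced results back together, the per-day means are small enough to collect and stitch with plain xarray (ordinary xarray usage, not a spark-xarray API):

>>> import xarray as xr
>>> # Collect the 366 one-timestep means and concatenate them back
>>> # into a single time series on the driver.
>>> combined = xr.concat(daily_average.collect(), dim='time')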

Multiple files

>>> from sparkxarray.reader import ncread
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('spark-rdd').getOrCreate()
>>> sc = spark.sparkContext
>>> paths='spark-xarray/sparkxarray/tests/data/NCEP/*.nc'
>>> multi_rdd = ncread(sc, paths, mode='multi', partition_on=['lat', 'lon'], partitions=300)
>>> multi_rdd.count()
16020
>>> multi_rdd.first()
<xarray.Dataset>
Dimensions:   (lat: 1, lon: 1, nv: 2, time: 4, zlev: 1)
Coordinates:
  * zlev      (zlev) float32 0.0
  * lat       (lat) float32 -88.0
  * lon       (lon) float32 0.0
  * time      (time) datetime64[ns] 1854-01-15 1854-02-15 1854-03-15 1854-04-15
Dimensions without coordinates: nv
Data variables:
    lat_bnds  (time, lat, nv) float32 -89.0 -87.0 -89.0 -87.0 -89.0 -87.0 ...
    lon_bnds  (time, lon, nv) float32 -1.0 1.0 -1.0 1.0 -1.0 1.0 -1.0 1.0
    sst       (time, zlev, lat, lon) float64 nan nan nan nan
    anom      (time, zlev, lat, lon) float64 nan nan nan nan
Attributes:
    Conventions:                CF-1.6
    Metadata_Conventions:       CF-1.6, Unidata Dataset Discovery v1.0
    metadata_link:              C00884
    id:                         ersst.v4.185401
    naming_authority:           gov.noaa.ncdc
    title:                      NOAA Extended Reconstructed Sea Surface Tempe...
    summary:                    ERSST.v4 is developped based on v3b after rev...
    institution:                NOAA/NESDIS/NCDC
    creator_name:               Boyin Huang
    creator_email:              boyin.huang@noaa.gov
    date_created:               2014-10-24
    production_version:         Beta Version 4
    history:                    Version 4 based on Version 3b
    publisher_name:             Boyin Huang
    publisher_email:            boyin.huang@noaa.gov
    publisher_url:              http://www.ncdc.noaa.gov
    creator_url:                http://www.ncdc.noaa.gov
    license:                    No constraints on data access or use
    time_coverage_start:        1854-01-15T000000Z
    time_coverage_end:          1854-01-15T000000Z
    geospatial_lon_min:         -1.0f
    geospatial_lon_max:         359.0f
    geospatial_lat_min:         -89.0f
    geospatial_lat_max:         89.0f
    geospatial_lat_units:       degrees_north
    geospatial_lat_resolution:  2.0
    geospatial_lon_units:       degrees_east
    geospatial_lon_resolution:  2.0
    spatial_resolution:         2.0 degree grid
    cdm_data_type:              Grid
    processing_level:           L4
    standard_name_vocabulary:   CF Standard Name Table v27
    keywords:                   Earth Science > Oceans > Ocean Temperat...
    keywords_vocabulary:        NASA Global Change Master Directory (GCMD) Sc...
    project:                    NOAA Extended Reconstructed Sea Surface Tempe...
    platform:                   Ship and Buoy SSTs from ICOADS R2.5 and NCEP GTS
    instrument:                 Conventional thermometers
    source:                     ICOADS R2.5 SST, NCEP GTS SST, HadISST ice, N...
    comment:                    SSTs were observed by conventional thermomete...
    references:                 Huang et al, 2014: Extended Reconstructed Sea...
    climatology:                Climatology is based on 1971-2000 SST, Xue, Y...
    description:                In situ data: ICOADS2.5 before 2007 and NCEP ...
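
Because each element here is a single grid point carrying its full time series, per-point reductions map naturally over the RDD. A sketch using ordinary xarray calls:

>>> # Long-term mean SST at each grid point; skipna handles the
>>> # all-NaN land/ice points seen above.
>>> point_means = multi_rdd.map(lambda ds: ds.sst.mean(dim='time', skipna=True))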

spark-xarray's People

Contributors

andersy005, gmaze


spark-xarray's Issues

Bias Correction Module

Bias Correction Process

  • Moving Window

    • to handle seasonality
  • Normalize data

    • recenter (mean 0)
    • rescale ( sd 1)
    • detrend
    • transform to deskew (e.g., log precip)
  • Distribution Mapping (DM) (sketched after this list)

    • reshape distribution
  • Denormalize

    • reapply trend
    • reverse transform
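
A minimal sketch of the Distribution Mapping step, using empirical quantile matching with NumPy (the helper below is illustrative; the module itself is only proposed here):

import numpy as np

def quantile_map(model, obs, n_quantiles=100):
    """Map model values onto the observed distribution by matching
    empirical quantiles (a standard distribution-mapping scheme)."""
    q = np.linspace(0, 100, n_quantiles)
    model_q = np.percentile(model, q)  # model empirical quantiles
    obs_q = np.percentile(obs, q)      # observed empirical quantiles
    # Locate each model value on the model CDF, then read off the
    # observed value at the same quantile.
    return np.interp(model, model_q, obs_q)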

Refactoring read_nc_single() and read_nc_multi()

  • Create a separate function for each of the single and multi modes, calling xarray.open_dataset() and xarray.open_mfdataset() respectively.

  • After this, have the single and multi modes use the same function for further dataset selection and indexing (sketched below).
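
A sketch of what that could look like (the function names and the _partition helper are illustrative, not the project's final API):

import xarray as xr

def _open_single(paths, **kwargs):
    # 'single' mode: one file, opened eagerly.
    return xr.open_dataset(paths, **kwargs)

def _open_multi(paths, **kwargs):
    # 'multi' mode: a glob or list of files, opened as one dataset.
    return xr.open_mfdataset(paths, **kwargs)

def _partition(sc, dset, partition_on, partitions):
    # Shared selection/indexing path (simplified to one dimension):
    # one RDD element per index along the first partition dimension.
    dim = partition_on[0]
    return (sc.parallelize(range(dset[dim].size), partitions)
              .map(lambda i: dset.isel(**{dim: slice(i, i + 1)})))

def ncread(sc, paths, mode='single', partition_on=('time',), partitions=None):
    opener = {'single': _open_single, 'multi': _open_multi}[mode]
    return _partition(sc, opener(paths), partition_on, partitions)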

Make a release

@andersy005 @kmpaul
Can you please make a release so we can officially install this on Cheyenne? CISL can't install unreleased software. If you haven't worked with releases in git, it's simply:

git tag -n1                          # show existing releases, with message
git tag v0.3 -m "release v0.3"       # the message is shown by the previous command (and elsewhere)
git push --follow-tags               # otherwise tags aren’t pushed

If you make a mistake and want to delete a release:

git tag -d v0.3.1                   # delete a local tag (if it was already pushed, also run the following)
git push origin :refs/tags/v0.3.1   # delete the remote tag -- rewriting published tags is bad practice, but possible if you must

Proposed Bias Correction Module Workflow

The bias-correction workflow will follow these steps, adapted from the Downscale Paper doc:

  • Prepare data

    • Gather tavg, tmin, tmax, pr data:

      • Observed historical data at target resolution

      • GCM-simulated historical conditions at MODEL resolution

      • GCM-simulated future climate conditions at MODEL resolution

    • regrid GCM historical and GCM future projections to REGRID resolution
      xmap.XMap(da).remap_to([('lat', 1), ('lon', 1)], how='bilinear')

    • Coarsen historical data to REGRID resolution (same as the regrid step, via remap_like)

  • Bias Correct

    • Remove trend from CMIP tasmin/tasmax projections

    • Bias correct each variable, location, and calendar day:

      • Produce paired CDFs/quantile map from GCM → OBS-HIST

      • Replace values in GCM with OBS-HIST for corresponding portion of distribution

    • Reintroduce trend

  • Spatially disaggregate

    • Compute spatial climatology for OBS-HIST

    • Compute the GCM-ADJ “factor” (for temperature, a difference) relative to OBS-HIST (sketched after this list)

    • Interpolate factors to target resolution using “SYMAP algorithm (Shepard, 1984), which is basically a modified inverse-distance-squared interpolation.”

    • Merge interpolated factors with target-resolution spatial climatology
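
A sketch of the spatial-disaggregation step for temperature (all names are illustrative, and xarray's built-in interpolation stands in for the SYMAP scheme):

def spatially_disaggregate(gcm_adj, obs_clim_coarse, obs_clim_fine):
    # GCM-ADJ "factor": for temperature, the additive departure of the
    # bias-corrected GCM field from the coarse observed climatology.
    factor = gcm_adj - obs_clim_coarse
    # Interpolate the factors to the target grid (the paper calls for
    # the SYMAP inverse-distance scheme).
    factor_fine = factor.interp(lat=obs_clim_fine.lat, lon=obs_clim_fine.lon)
    # Merge the interpolated factors with the target-resolution
    # spatial climatology.
    return obs_clim_fine + factor_fine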

Generating a cartesian product from a dictionary's values

@kmpaul, This is the code I have so far. I need some help though.

import xarray as xr

def ncread(sc, filename, mode='single', **kwargs):

    if 'partitions' not in kwargs:
        kwargs['partitions'] = None
    if 'partition_on' not in kwargs:
        kwargs['partition_on'] = ['time']

    if mode == 'single':
        print('Calling ... read_nc_single(sc, filename, **kwargs)\n')
        print('*******************************')
        return read_nc_single(sc, filename, **kwargs)
    elif mode == 'multi':
        print('Calling ... read_nc_multi(sc, filename, **kwargs)')
        return read_nc_multi(sc, filename, **kwargs)
    else:
        raise NotImplementedError("You specified a mode that is not implemented.")

def read_nc_single(sc, filename, **kwargs):
    print(kwargs)
    print('*******************************')
    dset = xr.open_dataset(filename)
    partition_on = kwargs.get('partition_on')
    partitions = kwargs.get('partitions')
    D = {dset[dimension].name: dset[dimension].size for dimension in partition_on}
    print("D = ", D)

Calling it prints:

>>> ncread(sc, filename, mode='single', partition_on=['time', 'lat', 'lon'])
Calling ... read_nc_single(sc, filename, **kwargs)

*******************************
{'partition_on': ['time', 'lat', 'lon'], 'partitions': None}
*******************************
D = {'time': 540, 'lat': 128, 'lon': 256}

Given the dictionary

D = {'time': 540, 'lat': 128, 'lon': 256}

how could you generate a new list that looks like

new_list = [element for element in itertools.product(range(540), range(128), range(256))]

This might be easier than I think, but I couldn't figure it out. I appreciate your help.
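
For reference, one way to build that list without hard-coding the sizes is to unpack one range() per dimension into itertools.product (a suggestion, not necessarily the project's final code):

import itertools

D = {'time': 540, 'lat': 128, 'lon': 256}

# One range() per dimension size; on Pythons without ordered dicts,
# build the ranges from partition_on to guarantee the order.
new_list = list(itertools.product(*(range(size) for size in D.values())))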

Create RDD with indices

  • When partitioned across time, the RDD elements will be of the form (time, xarray.Dataset)
  • When partitioned across grid points, the RDD elements will be of the form (grid_point, xarray.Dataset); a sketch follows below
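
A sketch of how the time-keyed form could be produced from the current RDD (the helper name is hypothetical):

def _with_time_key(ds):
    # Key each single-timestep element by its timestamp.
    return (ds.time.values[0], ds)

keyed_rdd = rdd.map(_with_time_key)  # elements: (time, xarray.Dataset)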
