spark-xarray's Introduction

spark-xarray

spark-xarray is an open-source project and Python package that integrates PySpark and xarray for climate data analysis. It is built on top of PySpark (the Spark Python API) and xarray.

spark-xarray was originally conceived during the summer of 2017 as part of PySpark for "Big" Atmospheric & Oceanic Data Analysis, a CISL/SIParCS research project.

It is currently maintained by Anderson Banihirwe.

Documentation is available at https://andersy005.github.io/spark-xarray/.

Installation

This guide walks you through installing spark-xarray. We assume that an Apache Spark installation is already available.

Requirements

At a minimum, spark-xarray requires the packages it is built on: PySpark and xarray.

Install

Clone the repository directly from GitHub and install it afterwards with python setup.py install. This will also resolve possible missing dependencies.

$ git clone https://github.com/andersy005/spark-xarray.git
$ cd spark-xarray
$ python setup.py install
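
To confirm the install, importing the reader entry point used in the examples below should succeed:

$ python -c "from sparkxarray.reader import ncread"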

Development

We welcome new contributors of all experience levels.

Examples

Single file

>>> from sparkxarray.reader import ncread
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('spark-rdd').getOrCreate()
>>> sc = spark.sparkContext
>>> filepath='spark-xarray/sparkxarray/tests/data/air.sig995.2012.nc'
>>> # Create an RDD
>>> rdd = ncread(sc, filepath, mode='single', partition_on=['time'], partitions=100)
>>> rdd.first()  # Get the first element
<xarray.Dataset>
Dimensions:  (lat: 73, lon: 144, time: 1)
Coordinates:
  * lat      (lat) float32 90.0 87.5 85.0 82.5 80.0 77.5 75.0 72.5 70.0 67.5 ...
  * lon      (lon) float32 0.0 2.5 5.0 7.5 10.0 12.5 15.0 17.5 20.0 22.5 ...
  * time     (time) datetime64[ns] 2012-01-01
Data variables:
    air      (time, lat, lon) float64 234.5 234.5 234.5 234.5 234.5 234.5 ...
Attributes:
    Conventions:  COARDS
    title:        mean daily NMC reanalysis (2012)
    history:      created 2011/12 by Hoop (netCDF2.3)
    description:  Data is from NMC initialized reanalysis\n(4x/day).  These a...
    platform:     Model
    references:   http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanaly...
>>> rdd.count()   # Get a count of elements in the rdd
366
>>> # The count above corresponds to the number of timesteps in the netCDF file
>>> rdd.getNumPartitions()  # Get the number of partitions
100
>>> # Compute the daily average for each day (element) in RDD
>>> daily_average = rdd.map(lambda x: x.mean(dim=['lat', 'lon']))
>>> daily_average.take(3)
[<xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2012-01-01
Data variables:
    air      (time) float64 277.0, <xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2012-01-02
Data variables:
    air      (time) float64 276.8, <xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2012-01-03
Data variables:
    air      (time) float64 ...]
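
To pull the reduced results back together, the per-day means are small enough to collect and stitch with plain xarray (ordinary xarray usage, not a spark-xarray API):

>>> import xarray as xr
>>> # Collect the 366 one-timestep means and concatenate them back
>>> # into a single time series on the driver.
>>> combined = xr.concat(daily_average.collect(), dim='time')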

Multiple files

>>> from sparkxarray.reader import ncread
>>> from pyspark.sql import SparkSession
>>> spark = SparkSession.builder.appName('spark-rdd').getOrCreate()
>>> sc = spark.sparkContext
>>> paths='spark-xarray/sparkxarray/tests/data/NCEP/*.nc'
>>> multi_rdd = ncread(sc, paths, mode='multi', partition_on=['lat', 'lon'], partitions=300)
>>> multi_rdd.count()
16020
>>> multi_rdd.first()
<xarray.Dataset>
Dimensions:   (lat: 1, lon: 1, nv: 2, time: 4, zlev: 1)
Coordinates:
  * zlev      (zlev) float32 0.0
  * lat       (lat) float32 -88.0
  * lon       (lon) float32 0.0
  * time      (time) datetime64[ns] 1854-01-15 1854-02-15 1854-03-15 1854-04-15
Dimensions without coordinates: nv
Data variables:
    lat_bnds  (time, lat, nv) float32 -89.0 -87.0 -89.0 -87.0 -89.0 -87.0 ...
    lon_bnds  (time, lon, nv) float32 -1.0 1.0 -1.0 1.0 -1.0 1.0 -1.0 1.0
    sst       (time, zlev, lat, lon) float64 nan nan nan nan
    anom      (time, zlev, lat, lon) float64 nan nan nan nan
Attributes:
    Conventions:                CF-1.6
    Metadata_Conventions:       CF-1.6, Unidata Dataset Discovery v1.0
    metadata_link:              C00884
    id:                         ersst.v4.185401
    naming_authority:           gov.noaa.ncdc
    title:                      NOAA Extended Reconstructed Sea Surface Tempe...
    summary:                    ERSST.v4 is developped based on v3b after rev...
    institution:                NOAA/NESDIS/NCDC
    creator_name:               Boyin Huang
    creator_email:              boyin.huang@noaa.gov
    date_created:               2014-10-24
    production_version:         Beta Version 4
    history:                    Version 4 based on Version 3b
    publisher_name:             Boyin Huang
    publisher_email:            boyin.huang@noaa.gov
    publisher_url:              http://www.ncdc.noaa.gov
    creator_url:                http://www.ncdc.noaa.gov
    license:                    No constraints on data access or use
    time_coverage_start:        1854-01-15T000000Z
    time_coverage_end:          1854-01-15T000000Z
    geospatial_lon_min:         -1.0f
    geospatial_lon_max:         359.0f
    geospatial_lat_min:         -89.0f
    geospatial_lat_max:         89.0f
    geospatial_lat_units:       degrees_north
    geospatial_lat_resolution:  2.0
    geospatial_lon_units:       degrees_east
    geospatial_lon_resolution:  2.0
    spatial_resolution:         2.0 degree grid
    cdm_data_type:              Grid
    processing_level:           L4
    standard_name_vocabulary:   CF Standard Name Table v27
    keywords:                   Earth Science > Oceans > Ocean Temperat...
    keywords_vocabulary:        NASA Global Change Master Directory (GCMD) Sc...
    project:                    NOAA Extended Reconstructed Sea Surface Tempe...
    platform:                   Ship and Buoy SSTs from ICOADS R2.5 and NCEP GTS
    instrument:                 Conventional thermometers
    source:                     ICOADS R2.5 SST, NCEP GTS SST, HadISST ice, N...
    comment:                    SSTs were observed by conventional thermomete...
    references:                 Huang et al, 2014: Extended Reconstructed Sea...
    climatology:                Climatology is based on 1971-2000 SST, Xue, Y...
    description:                In situ data: ICOADS2.5 before 2007 and NCEP ...
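
Because each element here is a single grid point carrying its full time series, per-point reductions map naturally over the RDD. A sketch using ordinary xarray calls:

>>> # Long-term mean SST at each grid point; skipna handles the
>>> # all-NaN land/ice points seen above.
>>> point_means = multi_rdd.map(lambda ds: ds.sst.mean(dim='time', skipna=True))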

spark-xarray's People

Contributors

andersy005, gmaze


spark-xarray's Issues

Bias Correction Module

Bias Correction Process

  • Moving Window

    • to handle seasonality
  • Normalize data

    • recenter (mean 0)
    • rescale ( sd 1)
    • detrend
    • transform to deskew (e.g., log precip)
  • Distribution Mapping (DM) (sketched after this list)

    • reshape distribution
  • Denormalize

    • reapply trend
    • reverse transform
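
A minimal sketch of the Distribution Mapping step, using empirical quantile matching with NumPy (the helper below is illustrative; the module itself is only proposed here):

import numpy as np

def quantile_map(model, obs, n_quantiles=100):
    """Map model values onto the observed distribution by matching
    empirical quantiles (a standard distribution-mapping scheme)."""
    q = np.linspace(0, 100, n_quantiles)
    model_q = np.percentile(model, q)  # model empirical quantiles
    obs_q = np.percentile(obs, q)      # observed empirical quantiles
    # Locate each model value on the model CDF, then read off the
    # observed value at the same quantile.
    return np.interp(model, model_q, obs_q)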

Refactoring read_nc_single() and read_nc_multi()

  • Create a separate function for each of the single and multi modes, calling xarray.open_dataset() and xarray.open_mfdataset() respectively.

  • After this, have the single and multi modes use the same function for further dataset selection and indexing (sketched below).
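
A sketch of what that could look like (the function names and the _partition helper are illustrative, not the project's final API):

import xarray as xr

def _open_single(paths, **kwargs):
    # 'single' mode: one file, opened eagerly.
    return xr.open_dataset(paths, **kwargs)

def _open_multi(paths, **kwargs):
    # 'multi' mode: a glob or list of files, opened as one dataset.
    return xr.open_mfdataset(paths, **kwargs)

def _partition(sc, dset, partition_on, partitions):
    # Shared selection/indexing path (simplified to one dimension):
    # one RDD element per index along the first partition dimension.
    dim = partition_on[0]
    return (sc.parallelize(range(dset[dim].size), partitions)
              .map(lambda i: dset.isel(**{dim: slice(i, i + 1)})))

def ncread(sc, paths, mode='single', partition_on=('time',), partitions=None):
    opener = {'single': _open_single, 'multi': _open_multi}[mode]
    return _partition(sc, opener(paths), partition_on, partitions)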

Make a release

@andersy005 @kmpaul
Can you please make a release so we can officially install this on Cheyenne? CISL can't install unreleased software. If you haven't worked with releases in git, it's simply:

git tag -n1                          # show existing releases, with message
git tag v0.3 -m "release v0.3"       # the message is shown by the previous command (and elsewhere)
git push --follow-tags               # otherwise tags aren’t pushed

If you make a mistake and want to delete a release:

git tag -d v0.3.1                   # delete a local tag (if it was already pushed, also run the following)
git push origin :refs/tags/v0.3.1   # delete the remote tag -- rewriting published tags is bad practice, but possible if you must

Proposed Bias Correction Module Workflow

The bias-correction workflow will follow these steps, adapted from the Downscale Paper doc:

  • Prepare data

    • Gather tavg, tmin, tmax, pr data:

      • Observed historical data at target resolution

      • GCM-simulated historical conditions at MODEL resolution

      • GCM-simulated future climate conditions at MODEL resolution

    • regrid GCM historical and GCM future projections to REGRID resolution
      xmap.XMap(da).remap_to([('lat', 1), ('lon', 1)], how='bilinear')

    • Coarsen historical data to REGRID resolution (same as the regrid step, via remap_like)

  • Bias Correct

    • Remove trend from CMIP tasmin/tasmax projections

    • Bias correct each variable, location, and calendar day:

      • Produce paired CDFs/quantile map from GCM → OBS-HIST

      • Replace values in GCM with OBS-HIST for corresponding portion of distribution

    • Reintroduce trend

  • Spatially disaggregate

    • Compute spatial climatology for OBS-HIST

    • Compute the GCM-ADJ “factor” (for temperature, a difference) relative to OBS-HIST (sketched after this list)

    • Interpolate factors to target resolution using “SYMAP algorithm (Shepard, 1984), which is basically a modified inverse-distance-squared interpolation.”

    • Merge interpolated factors with target-resolution spatial climatology
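
A sketch of the spatial-disaggregation step for temperature (all names are illustrative, and xarray's built-in interpolation stands in for the SYMAP scheme):

def spatially_disaggregate(gcm_adj, obs_clim_coarse, obs_clim_fine):
    # GCM-ADJ "factor": for temperature, the additive departure of the
    # bias-corrected GCM field from the coarse observed climatology.
    factor = gcm_adj - obs_clim_coarse
    # Interpolate the factors to the target grid (the paper calls for
    # the SYMAP inverse-distance scheme).
    factor_fine = factor.interp(lat=obs_clim_fine.lat, lon=obs_clim_fine.lon)
    # Merge the interpolated factors with the target-resolution
    # spatial climatology.
    return obs_clim_fine + factor_fine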

Generating a cartesian product from a dictionary's values

@kmpaul, This is the code I have so far. I need some help though.

import xarray as xr

def ncread(sc, filename, mode='single', **kwargs):

    if 'partitions' not in kwargs:
        kwargs['partitions'] = None
    if 'partition_on' not in kwargs:
        kwargs['partition_on'] = ['time']

    if mode == 'single':
        print('Calling ... read_nc_single(sc, filename, **kwargs)\n')
        print('*******************************')
        return read_nc_single(sc, filename, **kwargs)
    elif mode == 'multi':
        print('Calling ... read_nc_multi(sc, filename, **kwargs)')
        return read_nc_multi(sc, filename, **kwargs)
    else:
        raise NotImplementedError("You specified a mode that is not implemented.")

def read_nc_single(sc, filename, **kwargs):
    print(kwargs)
    print('*******************************')
    dset = xr.open_dataset(filename)
    partition_on = kwargs.get('partition_on')
    partitions = kwargs.get('partitions')
    D = {dset[dimension].name: dset[dimension].size for dimension in partition_on}
    print("D = ", D)

Calling it prints:

>>> ncread(sc, filename, mode='single', partition_on=['time', 'lat', 'lon'])
Calling ... read_nc_single(sc, filename, **kwargs)

*******************************
{'partition_on': ['time', 'lat', 'lon'], 'partitions': None}
*******************************
D = {'time': 540, 'lat': 128, 'lon': 256}

Given the dictionary

D = {'time': 540, 'lat': 128, 'lon': 256}

how could you generate a new list that looks like

new_list = [element for element in itertools.product(range(540), range(128), range(256))]

This might be easier than I think, but I couldn't figure it out. I appreciate your help.
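
For reference, one way to build that list without hard-coding the sizes is to unpack one range() per dimension into itertools.product (a suggestion, not necessarily the project's final code):

import itertools

D = {'time': 540, 'lat': 128, 'lon': 256}

# One range() per dimension size; on Pythons without ordered dicts,
# build the ranges from partition_on to guarantee the order.
new_list = list(itertools.product(*(range(size) for size in D.values())))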

Create RDD with indices

  • When partitioned across time, the RDD elements will be of the form (time, xarray.Dataset)
  • When partitioned across grid points, the RDD elements will be of the form (grid_point, xarray.Dataset); a sketch follows below
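
A sketch of how the time-keyed form could be produced from the current RDD (the helper name is hypothetical):

def _with_time_key(ds):
    # Key each single-timestep element by its timestamp.
    return (ds.time.values[0], ds)

keyed_rdd = rdd.map(_with_time_key)  # elements: (time, xarray.Dataset)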
