
Comments (3)

alxmrs commented on July 27, 2024

Right now, we have a two-path templating system for our config language. The default way to template files is to use {} as placeholders in the target_path for the values specified in the partition_keys section. See this config, for example. In addition, we have a system to create files in a date hierarchy via a boolean flag. The process for doing this is documented here.

There are a few problems with the date hierarchy approach. Mainly, it confuses users and introduces more places for code to go wrong. For example, #127 is an issue that has cropped up due to this more complex implementation. Instead, it would be preferable if the values referenced by the partition_keys could be used in the target_path via Python's format string syntax. Then, users could format file patterns in arbitrary ways without us having to individually support corner cases. Beyond date hierarchies, users currently cannot express that an integer string has more than one digit. For example, if I had a config like:

[parameters]
target_path=gs://my-bucket/{}-data.nc
partition_keys=
   days
[selection]
days=1/2/3/4

I have no way of expressing paths like gs://my-bucket/01-data.nc, gs://my-bucket/02-data.nc, etc. These require that I use something like gs://my-bucket/{days:02d}-data.nc in the template.
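As a minimal sketch of the desired behavior, assuming the day values have already been parsed into integers (which is not the case today), Python's format spec produces the zero-padded paths directly:

```python
# Zero-padded day values via Python's format-spec mini-language.
template = "gs://my-bucket/{days:02d}-data.nc"

# Assumption: the values are ints, so the `d` presentation type applies.
days = [1, 2, 3, 4]
paths = [template.format(days=day) for day in days]

for path in paths:
    print(path)
```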

To support this, somewhere, we basically need to run:

target_template.format(*parameter_keys.values(), **parameter_keys)
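A rough, self-contained sketch of that call follows. The helper name format_target and the sample values are hypothetical, not part of the existing code. Passing the values both positionally and by keyword lets a template use either bare {} placeholders (filled in partition_keys order) or named ones:

```python
from collections import OrderedDict
from typing import Mapping


def format_target(template: str, partition_values: Mapping[str, object]) -> str:
    """Fill a target template from ordered partition-key values (hypothetical helper)."""
    # Positional args serve bare `{}` placeholders in partition_keys order;
    # keyword args serve named placeholders such as `{days}`.
    return template.format(*partition_values.values(), **partition_values)


# Illustrative values only; assumed to be already parsed into ints.
values = OrderedDict([("year", 2017), ("days", 3)])

positional = format_target("gs://my-bucket/{}/{}-data.nc", values)
named = format_target("gs://my-bucket/{year}/{days:02d}-data.nc", values)
```

Here positional evaluates to gs://my-bucket/2017/3-data.nc and named to gs://my-bucket/2017/03-data.nc.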

Today, this is approximately done here:

def prepare_target_name(config: Config) -> str:

In fact, a naive implementation of this issue would involve:

  • Deleting the append_date_dir code (or raising an error if a user tries to use it)
  • Structuring partition_key_values as an ordered dictionary
  • Calling format
  • Updating the process_config function to encourage correct usage of the parser
  • Documenting the usage everywhere

A problem you would run into is that pretty much all of the values in partition_key_values are strings! You can't format a string like you would an int, so formatting options like {days:02d}.nc would not work. Thus, a prerequisite ticket is required: #5.
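To make the failure concrete, the sketch below shows that an integer format spec raises ValueError on a string value. The coerce helper is purely hypothetical, one possible shape of the type parsing that the prerequisite ticket would provide:

```python
def coerce(value: str):
    """Hypothetical helper: turn an integer-like string into an int, else leave it."""
    try:
        return int(value)
    except ValueError:
        return value


template = "gs://my-bucket/{days:02d}-data.nc"

try:
    template.format(days="1")
    failed = False
except ValueError:
    # str does not support the integer presentation type `d`.
    failed = True

# After coercion, the same template formats as intended.
path = template.format(days=coerce("1"))
```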

from weather-tools.

alxmrs commented on July 27, 2024

Note: Ideally, a side effect of implementing this change (and its siblings) is that the following method:

def process_config(file: t.IO) -> Config:

does not fundamentally alter the data in the config. Specifically, the code around this block:

if use_date_as_directory(config):

should no longer be needed, and can be removed.


alxmrs commented on July 27, 2024

There's a tricky case here to handle: partition_keys that have a date. These can't be parsed like integers, and it's hard to specify all of the fields in the target template (e.g., I want to give day and month two digits, and year four digits).

I think the best / simplest solution here would be to parse the date fields as Python datetime.date objects. Then, we can encourage config writers to use Python's date string formatting (supported natively by the call to format) to incorporate date information into the config target path.

See this SO post: https://stackoverflow.com/a/22842734
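As a small illustration of that mechanism: datetime.date implements __format__, so strftime directives can appear directly in the format spec. The template below is made up for the example:

```python
import datetime

# strftime directives go after the colon: four-digit year, two-digit month and day.
template = "era5/{date:%Y}/{date:%m}/{date:%d}.nc"

date = datetime.date(2017, 1, 2)
path = template.format(date=date)
print(path)
```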

For example, after the change, the MARS config string with append_date_dir could look like:

[parameters]
client=cds
dataset=reanalysis-era5-pressure-levels
# This config creates a date-based directory hierarchy.
# In this case, the two files that will be created are
# gs://ecmwf-output-test/era5/2017/01/01-pressure-500.nc
# gs://ecmwf-output-test/era5/2017/01/02-pressure-500.nc
# gs://ecmwf-output-test/era5/2017/01/01-pressure-1000.nc
# gs://ecmwf-output-test/era5/2017/01/02-pressure-1000.nc
target_filename=
target_path=gs://ecmwf-output-test/era5/{:%Y/%m/%d}-pressure-{}.nc
partition_keys=
     date
     pressure_level
[selection]
product_type=reanalysis
format=netcdf
variable=
    divergence
    fraction_of_cloud_cover
    geopotential
pressure_level=
    500
    1000
date=2017-01-01/to/2017-01-02
time=
    00:00
    06:00
    12:00
    18:00
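A hedged sketch of how the target_path above would expand. It assumes the date range has already been parsed into datetime.date objects and the pressure levels into ints, and that one path is produced per combination of partition-key values; none of this parsing exists yet:

```python
import datetime
import itertools

template = "gs://ecmwf-output-test/era5/{:%Y/%m/%d}-pressure-{}.nc"

# Assumed expansion of `date=2017-01-01/to/2017-01-02` into date objects.
start = datetime.date(2017, 1, 1)
dates = [start + datetime.timedelta(days=offset) for offset in range(2)]
pressure_levels = [500, 1000]

# One target path per combination, with arguments in partition_keys order.
paths = [
    template.format(date, level)
    for date, level in itertools.product(dates, pressure_levels)
]

for path in paths:
    print(path)
```

This yields the four paths listed in the config comments, though not necessarily in the same order.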
