
Comments (21)

znicholls commented on June 15, 2024

Haha no it should also throw a warning, something like 'your metadata is nonsense', if it can't work out what it's reading and then it should be put in a big string
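
A minimal sketch of that fallback behaviour, using only the standard library. The names read_metadata and strict_parser are hypothetical helpers for illustration and are not part of pymagicc; the toy parser stands in for whatever real metadata parser is eventually written.

```python
import warnings


def read_metadata(raw, parser):
    """Return parser(raw); on failure, warn and keep the raw text.

    Hypothetical helper: `parser` is any callable that raises ValueError
    when it cannot make sense of its input.
    """
    try:
        return parser(raw)
    except ValueError:
        warnings.warn("your metadata is nonsense; keeping it as one big string")
        return {"raw": raw}


def strict_parser(raw):
    """Toy parser: accepts only a single 'key=value' pair."""
    key, sep, value = raw.partition("=")
    if not sep:
        raise ValueError("expected key=value")
    return {key.strip(): value.strip()}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    parsed = read_metadata("units = GtC", strict_parser)
    fallback = read_metadata("???", strict_parser)
```

Here the first call parses cleanly, while the second emits one warning and returns the untouched text under a single key.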

from pymagicc.

lewisjared commented on June 15, 2024

I had a quick play around with a possible implementation and found a few discussion points:

  • The MAGICC6 IN files only contain a single gas so the resulting dataframe would have a single index. Do we want to make the data frames have the same dimensionality between 6 and 7? I assume yes.
  • Can we make the assumption that a given IN file will only contain a single species?
  • Have you looked at xarray (http://xarray.pydata.org/)? Xarray tracks data and the associated metadata for climate variables using a DataArray (http://xarray.pydata.org/en/stable/data-structures.html#dataarray) data structure. Rather than using a MultiIndex pandas DataFrame to hold all of the data with units information on the side, perhaps MAGICCInput could hold a number of DataArray objects (one per Region/Gas), each containing all the data applicable to that variable. The MAGICCInput class would still have a metadata dict containing the file-level metadata extracted from the namelist. It is a slightly different abstraction, but worth thinking about. On the plus side, data and units live together; the downside is that you can't do cool (albeit slightly confusing) cross-variable pandas indexing.
  • There should be another attribute, header, which contains the multiline string that precedes the namelist containing the metadata. This string would be impossible to parse, but should still be made available.

I like the hierarchical indexing (MData.df['Gas']['Region']) interface as it is the easiest to understand. This allows for things like mdata.df['CO2I']['R5ASIA'][2000] to get the value for R5Asia emissions for the year 2000. By exposing df as an attribute, all the other fancy indexing functionality can be accessed if need be.
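
A small sketch of what that lookup could look like with a pandas column MultiIndex of (Gas, Region) and years as the row index. The values are made up and the layout is an assumption for illustration, not the final pymagicc structure.

```python
import pandas as pd

# Column MultiIndex of (Gas, Region); the row index is the year.
columns = pd.MultiIndex.from_tuples(
    [("CO2B", "WORLD"), ("CO2I", "R5ASIA"), ("CO2I", "WORLD")],
    names=["Gas", "Region"],
)
df = pd.DataFrame(
    [[0.5, 0.25, 1.0], [0.25, 0.75, 2.0]],  # made-up values
    index=pd.Index([2000, 2010], name="Year"),
    columns=columns,
)

# Chained lookup: gas, then region, then year.
value = df["CO2I"]["R5ASIA"][2000]

# The same cell via a single .loc call, which avoids chained indexing.
same_value = df.loc[2000, ("CO2I", "R5ASIA")]
```

The chained form is the easiest to read, while the .loc form is the one pandas generally recommends once assignments are involved.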

Also watch the naming of your variables. MData suggests that you are referring to a class, not an object.

@znicholls Would you like me to propose an implementation or are you going to implement this?

rgieseke commented on June 15, 2024

I think that's a clear yes for the first two points (most gases will be split in industrial and landuse related emissions, so two files for each gas), and yes to a header attribute.

As for xarray -- this would be a great option, since there is no install requirement for anything we don't yet have (Numpy and Pandas):

https://github.com/pydata/xarray/blob/master/setup.py#L29

It's also no additional install burden (NetCDF libs can be tricky to install sometimes), but provides another output/interop variant.

There is great interop with Pandas so this is probably preferable to inventing our own structure for keeping metadata.

lewisjared commented on June 15, 2024

@znicholls Do the SCEN7 files also have the same file structure, except that each file contains many more species?

rgieseke commented on June 15, 2024

@lewisjared As far as I remember, yes, they have the same structure, header block of variable length, namelist block, data starting from position defined in namelist block.
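
As a rough sketch of that three-part layout (free-form header, namelist block, then data from a row given in the namelist), assuming a Fortran-style namelist opened by a line starting with '&' and closed by a lone '/'. The block name SPECIFICATIONS and the key FIRSTDATAROW are hypothetical placeholders; real MAGICC files may name them differently.

```python
def split_sections(text):
    """Split file text into (header, namelist dict, data lines).

    Assumes a Fortran-style namelist opened by a line starting with '&'
    and closed by a line containing only '/'. The FIRSTDATAROW key is
    hypothetical and is taken to be a 1-based line number.
    """
    lines = text.splitlines()
    start = next(i for i, line in enumerate(lines) if line.lstrip().startswith("&"))
    end = next(i for i in range(start, len(lines)) if lines[i].strip() == "/")

    # Everything before the namelist is the variable-length header.
    header = "\n".join(lines[:start]).strip()
    namelist = {}
    for line in lines[start + 1:end]:
        key, _, value = line.partition("=")
        namelist[key.strip()] = value.strip().rstrip(",")
    first_row = int(namelist["FIRSTDATAROW"])
    return header, namelist, lines[first_row - 1:]


example = """Some free-form header
spanning two lines
&SPECIFICATIONS
  FIRSTYEAR = 2000,
  FIRSTDATAROW = 8,
/
YEARS  WORLD
2000   1.0
2010   2.0
"""
header, namelist, data = split_sections(example)
```

Because the data start is read from the namelist rather than inferred, the header can be any length without affecting the parse.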

znicholls commented on June 15, 2024

In brief:

  • xarray seems like a great option
  • I can't do an implementation this week, if one of you wants to have a go that would be great
  • SCEN7 files have the same structure as IN files
    • SCEN files are the exception but the underlying information is the same so we'd just need a custom read in function to get them in the same format as everything else
  • what is the difference between an object and a class and which sort of casing should I use for objects? All lower (I clearly did not understand PEP8 when I read it)

znicholls commented on June 15, 2024

A few other thoughts:

  • The current philosophy is to only have data with common metadata in a given input file. Hence we should be able to get away with just having one metadata or header attribute per object

rgieseke commented on June 15, 2024

what is the difference between an object and a class and which sort of casing should I use for objects? All lower (I clearly did not understand PEP8 when I read it)

Usually an object is an instantiation of a class, and the docs recommend (not sure if there is something clear in PEP8):

Class instantiation uses function notation. Just pretend that the class object is a parameterless function that returns a new instance of the class. For example (assuming the above class):

x = MyClass()

creates a new instance of the class and assigns this object to the local variable x.

https://docs.python.org/3/tutorial/classes.html
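
On the casing question, PEP 8's "Naming Conventions" section does cover it: class names use CapWords, while variables (including instances) use lowercase with underscores. A short illustration follows; the class name MagiccData is hypothetical.

```python
class MagiccData:
    """Classes use CapWords (CamelCase) names, per PEP 8."""

    def __init__(self, filename):
        self.filename = filename


# Instances (objects) use lowercase names, with underscores if needed.
mdata = MagiccData("example.IN")
other_mdata = MagiccData("another.IN")
```

Both mdata and other_mdata are separate objects created from the one MagiccData class.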

rgieseke commented on June 15, 2024

The current philosophy is to only have data with common metadata in a given input file. Hence we should be able to get away with just having one metadata or header attribute per object

The whole big header string could be part of a metadata dictionary.

znicholls commented on June 15, 2024

Ok cool.

I think in general the header string is not one big string but rather gives us some fairly important info (source, date etc.). I'd advocate reading the info into well named attributes e.g. date rather than one big string

rgieseke commented on June 15, 2024

Sorry - didn't intend to belittle the metadata :-)

lewisjared commented on June 15, 2024

I think in general the header string is not one big string but rather gives us some fairly important info (source, date etc.). I'd advocate reading the info into well named attributes e.g. date rather than one big string

That is a pretty gnarly problem as they don't all follow the same conventions. I'll give it a go, but this might be a stretch goal.

lewisjared commented on June 15, 2024

I pushed a basic implementation of the majority of the features discussed here. After some experimentation, xarray made everything more complicated, so it wasn't included in this implementation. Instead, the data is loaded into a pd.DataFrame with a pd.MultiIndex of Gas and Region for the column names. The units are included in the metadata dictionary.

If everyone is happy with the proposed implementation I can extend to handle SCEN files and writing. The MAGICCInput files look and act like pandas DataFrames so calls like mdata['CO2'].plot() are possible.

It would also be great to replace the reading/writing of scenarios in the top level of pymagicc with MAGICCInput files. These files need to be lazy loaded as we don't know the target MAGICC run directory at import time. To allow for lazy loading, the proposed implementation has a slightly different signature on the __init__ and read methods. The __init__ method takes an optional filename and the read method takes a directory parameter and an optional filename parameter. This allows the target filename to be defined at import time, while the actual loading of the file occurs later. The actual implementation of this will happen in a different PR once the MAGICCInput API has been defined.

# In pymagicc/__init__.py
rcp26 = MAGICCInput('RCP26.SCEN')

# In user land
from pymagicc import MAGICC6, rcp26
with MAGICC6() as magicc:
    rcp26.load(magicc.run_dir) # But will likely happen in a MAGICC class method i.e. (magicc.set_scenario(rcp26))
    magicc.run()
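
The lazy-loading signature described above could be sketched roughly as follows. LazyInput is a hypothetical stand-in for MAGICCInput, and a temporary directory plays the role of the MAGICC run directory; the real class would parse the file rather than store its text.

```python
import os
import tempfile


class LazyInput:
    """Hypothetical stand-in for MAGICCInput showing deferred loading."""

    def __init__(self, filename=None):
        # Only the name is stored here; no file access at import time.
        self.filename = filename
        self.text = None

    def read(self, directory, filename=None):
        # The directory is only known once a run directory exists.
        if filename is not None:
            self.filename = filename
        if self.filename is None:
            raise ValueError("no filename given")
        with open(os.path.join(directory, self.filename)) as handle:
            self.text = handle.read()
        return self


scenario = LazyInput("RCP26.SCEN")  # safe at import time
loaded_before = scenario.text is not None

with tempfile.TemporaryDirectory() as run_dir:
    with open(os.path.join(run_dir, "RCP26.SCEN"), "w") as handle:
        handle.write("dummy scenario contents")
    scenario.read(run_dir)
```

Keeping file access out of __init__ is what makes a module-level object such as rcp26 safe to create before any run directory exists.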

rgieseke commented on June 15, 2024

That looks great, I'll review and test this afternoon!

As for not using xarray, that's good to know ... For my Data Package reading tool (https://pypi.org/project/pandas-datapackage-reader/) I ended up just returning metadata as a _metadata attribute along with the DataFrame. It might not survive all filtering or copying operations in Pandas but it seems like a simple solution (and one can always store the metadata in a separate variable).

As for the scenarios, they should be independent of the underlying MAGICC version, so pymagicc.rcp26, loaded from MAGICC6, should still work when switching the MAGICC engine to MAGICC7.

znicholls commented on June 15, 2024

Here's why I think we should add 'units' to the index: it prevents you from adding things which you shouldn't. For example, this script (with output below):

import pandas as pd

fake_data = [0, 1, 2]
year = [2000, 2010, 2020]
gases = ['CO2']*len(fake_data)
regions = ['WORLD']*len(fake_data)
gtc_units = ['GtC']*len(fake_data)
gtco2_units = ['GtCO2']*len(fake_data)

fake_index_GtC = pd.MultiIndex.from_arrays(
    [gases, regions, gtc_units, year],
    names=['GAS', 'REGION', 'UNITS', 'YEAR'],
)
fake_index_GtCO2 = pd.MultiIndex.from_arrays(
    [gases, regions, gtco2_units, year],
    names=['GAS', 'REGION', 'UNITS', 'YEAR'],
)

GtC_df = pd.DataFrame(
    data=fake_data,
    index=fake_index_GtC,
    columns=['Value'],
)
GtCO2_df = pd.DataFrame(
    data=fake_data,
    index=fake_index_GtCO2,
    columns=['Value'],
)

print(GtC_df.head())
print(GtCO2_df.head())

## This does not work (the units differ, so every value comes out as NaN), which is super useful
print((GtC_df + GtCO2_df).head(10))

## This works
print((GtC_df + GtC_df).head(10))

$ python units-alignment-example.py
                       Value
GAS REGION UNITS YEAR       
CO2 WORLD  GtC   2000      0
                 2010      1
                 2020      2
                       Value
GAS REGION UNITS YEAR       
CO2 WORLD  GtCO2 2000      0
                 2010      1
                 2020      2
                       Value
GAS REGION UNITS YEAR       
CO2 WORLD  GtC   2000    NaN
                 2010    NaN
                 2020    NaN
           GtCO2 2000    NaN
                 2010    NaN
                 2020    NaN
                       Value
GAS REGION UNITS YEAR       
CO2 WORLD  GtC   2000      0
                 2010      2
                 2020      4

For now, the header I think we should be able to read in will look like

Data: Average emissions per year
Date: 12-Jun-2018 23:48:01
Description: Emissions data for the SSP5-T2-OS-V25 scenario as quantified by the REMIND-MAGPIE modelling team
Source: Data available from https://db1.ene.iiasa.ac.at/CEDSDB/dsd?Action=htmlpage&page=about. Description paper Riahi et al. 2017 Global Environmental Change (https://doi.org/10.1016/j.gloenvcha.2016.05.009)
Contact: <scenario-contact>
Compiled by: Zebedee Nicholls, Australian-German Climate & Energy College

i.e. we're just looking for the keywords before the colons (and I'd just replace whitespace with '_' in the metadata attributes). I've got rid of the GAS: .... line which we see in some files, it's now redundant. I also wish we could get rid of the data line but given our variable names are so unhelpful right now (e.g. CO2 could be CO2 emissions or CO2 concentrations) I think we have to keep it. I hope to get this sorted in future but it won't happen yet.
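
A minimal sketch of that keyword extraction, using a shortened version of the header above. The helper name parse_header is hypothetical, and lowercasing the keys is an extra assumption on top of the whitespace-to-underscore rule.

```python
def parse_header(header):
    """Map 'Keyword: value' lines to a dict with underscore-joined keys.

    Hypothetical helper. Splits on the first colon only, so values that
    contain colons (times, URLs) are preserved.
    """
    metadata = {}
    for line in header.strip().splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a 'Keyword: value' line; skipped in this sketch
        metadata["_".join(key.split()).lower()] = value.strip()
    return metadata


header = """
Data: Average emissions per year
Date: 12-Jun-2018 23:48:01
Compiled by: Zebedee Nicholls, Australian-German Climate & Energy College
"""
metadata = parse_header(header)
```

Splitting on the first colon matters for lines like Date and Source, whose values contain colons of their own.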

I will add a reader for all the notes stuff in another pull request once I have made such notes write out using a yaml writer rather than my custom writer.

znicholls commented on June 15, 2024

It might not survive all filtering or copying operations in Pandas but it seems like a simple solution (and one can always store the metadata in a separate variable).

I think that's probably a good thing actually as the metadata should change if you start doing this sort of stuff and should only be carried if you explicitly choose to carry it.

lewisjared commented on June 15, 2024

With #42 merged we still have the following functionality to implement:

  • Units
  • Extracting metadata from header

znicholls commented on June 15, 2024

znicholls commented on June 15, 2024

I am also wondering how we should store variables in our dataframes. Just storing e.g. CO2 seems inadequate as that tells you nothing about whether it's RF, EMIS or CONC. At the moment this is also ambiguous in magicc's source code but we could at least provide a more usable interface in pymagicc, abstracting away some of the difficulty. Thoughts? This will affect how #49 is implemented

znicholls commented on June 15, 2024

I've tried to split this into smaller issues: #69 #74 #75

znicholls commented on June 15, 2024

Closed by #49. Solution to units still to come, pending hgrecco/pint#684
