torchspatiotemporal / tsl

tsl: a PyTorch library for processing spatiotemporal data.

Home Page: https://torch-spatiotemporal.readthedocs.io/

License: MIT License

Python 100.00%
deep-learning gnn graph-neural-networks pytorch spatio-temporal spatio-temporal-analysis spatio-temporal-data spatio-temporal-graph spatio-temporal-prediction spatiotemporal

tsl's People

Contributors

abusagit, andreacini, ascarrambad, dzambon, javiersgjavi, lucabutera, marshka, steve3nto, torch-spatiotemporal

tsl's Issues

Is the definition of connectivity in the AirQuality dataset wrong?

As far as I understand reading Table 5 of Appendix B from the original GRIN article, this dataset has the structure of an undirected graph with 2699 edges.

However, I have executed the following code in a Jupyter Notebook:

from tsl.datasets import AirQuality

x = AirQuality()
index, weight = x.get_connectivity()
print(index.shape, weight.shape)

And I saw this output:
(2, 66661) (66661,)

I don't understand how the edge index can have this shape. At first I thought that you might have implemented this dataset as a directed graph, but in that case shouldn't the number of edges be at most 2 * undirected_edges?
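One way to compare the returned edge index with the edge count reported in the paper is to collapse the directed entries into unordered node pairs. The sketch below does this on a toy edge list; `count_undirected_edges` and the toy values are illustrative assumptions, not part of tsl or of the AirQuality data.

```python
def count_undirected_edges(edge_index):
    """Count unique unordered node pairs in a (2, E) directed edge list.

    Self-loops are ignored, and (i, j) / (j, i) count as one edge.
    """
    sources, targets = edge_index
    pairs = {
        frozenset((s, t))
        for s, t in zip(sources, targets)
        if s != t
    }
    return len(pairs)


# Toy directed edge index (made-up values, not AirQuality data):
# 0<->1 is stored in both directions, 1->2 in one direction, plus a self-loop.
toy_edge_index = [
    [0, 1, 1, 2],
    [1, 0, 2, 2],
]

n_directed = len(toy_edge_index[0])
n_undirected = count_undirected_edges(toy_edge_index)
print(n_directed, n_undirected)
```

With the real output, passing `index.tolist()` to the helper would give a number that can be compared directly with the undirected edge count from the paper.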

Trouble with Hydra, perhaps other way to run the examples?

Hi TSL team,

I have been playing around with your package a lot lately, and it works great!
However, I do experience issues when trying to alter your code to my specific needs.
For example, I would like to run a loop through several different settings (e.g., with a .yaml file).
However, after the first loop is done, and a new iteration is started, I get the following error:

Could not override 'dataset.name'.
To append to your config use +dataset.name=bay
Key 'dataset' is not in struct
    full_key: dataset
    object_type=dict

I think it has something to do with the get_hydra_cli_arg function

def get_hydra_cli_arg(key: str, delete: bool = False):
    try:
        key_idx = [arg.split("=")[0] for arg in sys.argv].index(key)
        arg = sys.argv[key_idx].split("=")[1]
        if delete:
            del sys.argv[key_idx]
        return arg
    except ValueError:
        return None

which seems to remove some of the sys.argv arguments.
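If the mutation of sys.argv is indeed the cause, one possible workaround is to snapshot sys.argv before each run and restore it afterwards. The sketch below copies the function quoted above and wraps each iteration in a hypothetical `preserved_argv` context manager; the dataset names are placeholders, and whether this resolves the Hydra error itself has not been verified.

```python
import sys
from contextlib import contextmanager


def get_hydra_cli_arg(key, delete=False):
    """Same logic as the function quoted above."""
    try:
        key_idx = [arg.split("=")[0] for arg in sys.argv].index(key)
        arg = sys.argv[key_idx].split("=")[1]
        if delete:
            del sys.argv[key_idx]
        return arg
    except ValueError:
        return None


@contextmanager
def preserved_argv(argv):
    """Run a block with a temporary sys.argv and restore the old one after."""
    saved = list(sys.argv)
    sys.argv[:] = list(argv)
    try:
        yield
    finally:
        sys.argv[:] = saved


original_argv = list(sys.argv)
seen = []
for name in ["la", "bay"]:
    # Each iteration gets a fresh argument list, so deletions do not accumulate.
    with preserved_argv(["run.py", f"dataset={name}"]):
        seen.append(get_hydra_cli_arg("dataset", delete=True))

print(seen)
```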

My question therefore is, could I prevent this behavior easily? If not, could I then circumvent using Hydra entirely?
I have seen the "A Gentle Introduction to tsl" Jupyter Notebook with a different way of running it, but there I do not get all the important settings of the experiment anymore, and I want to be sure that the settings are correct.

In an ideal scenario, it would be possible to just have a script that runs from line 0 to n in sequential order (such as in the A gentle introduction to tsl notebook) but with the information of the run_traffic_experiment.py and their configs. This information is there in the notebook for the timethenspace model:

model_kwargs = {
    'input_size': dm.n_channels,  # 1 channel
    'horizon': dm.horizon,  # 12, the number of steps ahead to forecast
    'hidden_size': 16,
    'rnn_layers': 1,
    'gcn_layers': 2
}

But not for the other models, right?

I hope I explained my problems clearly; otherwise, please tell me if something is not clear! If you could guide me in the right direction, that would be really great!

Thanks in advance!

Doubt about Masked Metric's init

In tsl.metric.torch.metric_base.MaskedMetric.__init__ we have the following snippet

if metric_fn is None:
    self.metric_fn = None
else:
    self.metric_fn = partial(metric_fn, **metric_fn_kwargs)

Wouldn't this mean that passing metric_fn=None raises an error only when the metric is computed, instead of at class construction?
Is this a desired behavior?
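For comparison, a fail-fast variant would validate metric_fn in the constructor. The class below is a hypothetical stand-in (`EagerMetric` and `abs_error` are made-up names), not the tsl implementation; the library may keep None on purpose, for example so that subclasses can override the compute methods.

```python
from functools import partial


class EagerMetric:
    """Hypothetical metric wrapper that validates metric_fn at construction."""

    def __init__(self, metric_fn, **metric_fn_kwargs):
        if not callable(metric_fn):
            raise TypeError(
                f"metric_fn must be callable, got {type(metric_fn).__name__}"
            )
        self.metric_fn = partial(metric_fn, **metric_fn_kwargs)

    def __call__(self, y_hat, y):
        return self.metric_fn(y_hat, y)


def abs_error(y_hat, y, scale=1.0):
    """Toy metric function used only for this illustration."""
    return abs(y_hat - y) * scale


metric = EagerMetric(abs_error, scale=2.0)
print(metric(3.0, 1.0))

# Passing None now fails at construction rather than at compute time.
try:
    EagerMetric(None)
    failed_early = False
except TypeError:
    failed_early = True
print(failed_early)
```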

BUG in static_graph_collate function

Hi, I tried to make a StaticBatch from a list of Data objects and it crashed with:

     72 out = out.stores_as(elem)
     74 pattern = elem.pattern
---> 76 for key in elem.keys:
     77     if key == 'transform':
     78         out[key] = static_scaler_collate([data[key] for data in data_list])

TypeError: 'method' object is not iterable

Putting () after elem.keys fixes the problem.
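If the code has to work both with objects that expose keys as a property and with objects that expose it as a method, a small compatibility helper is one option. This is only a sketch: `iter_keys` and the two stand-in classes are hypothetical and do not come from tsl or PyTorch Geometric.

```python
def iter_keys(obj):
    """Return the keys of `obj` whether `keys` is a property or a method."""
    keys = obj.keys
    return list(keys() if callable(keys) else keys)


class KeysAsProperty:
    """Stand-in for an object exposing `keys` as a property."""

    @property
    def keys(self):
        return ["x", "transform"]


class KeysAsMethod:
    """Stand-in for an object exposing `keys` as a method."""

    def keys(self):
        return ["x", "transform"]


print(iter_keys(KeysAsProperty()), iter_keys(KeysAsMethod()))
```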

`libmamba` cannot resolve conda file

Attempted to install following directions on the quickstart page. libmamba reported it was unable to resolve the environment due to pytorch-cuda>=11.7. Relaxed restriction to pytorch-cuda>=11.6 and was able to install the environment and execute the gentle introduction notebook (I did have to pip install tensorboard as well).

I'm not sure if this is particular to my machine (Windows 10, Version 22H2), processor (NVIDIA RTX A2000) and driver (537.58) or not, but the modified env spec is:

name: tsl
channels:
  - pytorch
  - pyg
  - nvidia                    # remove for cpu installation
  - conda-forge
  - defaults
dependencies:
  - python=3.10
  - pytorch
  - pytorch-cuda>=11.6        # remove for cpu installation
  - pyg
  - pytorch-scatter
  - pytorch-sparse
  - lightning
  - pip
  - pip:
      - einops
      - hydra-core
      - numpy>1.20.3
      - omegaconf
      - pandas>=1.4
      - PyYAML
      - scikit-learn
      - scipy
      - tables
      - tensorboard
      - torchmetrics>=0.7
      - tqdm

The explanation of training mask and eval mask

Hi Ivan,

Sorry to bother you. I am confused about training_mask and eval_mask. Is it correct that training_mask is the mask of the input, marking the missing values in the input, while eval_mask is the mask of the target, marking the visible ground truth? If I want to run a forecasting task, is it enough to change the horizon and delay parameters in Imputation_dataset.py? Otherwise, could you please give some advice on how to build a forecasting dataset? Thank you for your help!
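A small NumPy sketch of one common convention may help frame the question. It assumes that the mask marks observed values, that eval_mask marks observed values deliberately hidden for scoring, and that training_mask is what remains visible to the model; this reading has not been checked against the tsl source, and the toy values are made up.

```python
import numpy as np

# Toy data: 4 time steps, 3 nodes (values are made up).
x = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, np.nan],
    [7.0, 8.0, 9.0],
    [np.nan, 11.0, 12.0],
])

# mask: True where a value was actually observed.
mask = ~np.isnan(x)

# eval_mask: observed entries deliberately hidden from the model and used
# only to score the imputation (chosen by hand here).
eval_mask = np.zeros_like(mask)
eval_mask[2, 0] = True
eval_mask[1, 1] = True
eval_mask &= mask  # only observed values can serve as ground truth

# training_mask: entries the model is allowed to see as input.
training_mask = mask & ~eval_mask

print(mask.sum(), eval_mask.sum(), training_mask.sum())
```

Under this convention the two masks never overlap, so a value used for evaluation is never fed to the model.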

Add future covariates

Hi everyone :)
I am having difficulty understanding whether it is possible and if so how to include future covariates, such as weather forecasts.
I would need to include this information in the training, referring to the time horizon of the target

What I am doing is something like this:

dataset = AirQuality(small=True, impute_nans=True)

df = dataset.dataframe()

fake_past_cov_df = pd.DataFrame(np.random.randn(*df.shape), columns=df.columns, index=df.index)
fake_fut_cov_df = pd.DataFrame(np.random.randn(*df.shape), columns=df.columns, index=df.index)
day_sin_cos = dataset.datetime_encoded('day').values
weekdays = dataset.datetime_onehot('weekday').values

torch_dataset = SpatioTemporalDataset(target=dataset.dataframe(),
                                      mask=dataset.mask,
                                      horizon=12,
                                      window=12,
                                      stride=1,
                                      precision=16,
                                      name='AQI')

torch_dataset.add_covariate(name='fake_past_cov', value=fake_past_cov_df, add_to_input_map=True, synch_mode=SynchMode.WINDOW, pattern='tnf')
torch_dataset.add_covariate(name='fake_fut_cov', value=fake_fut_cov_df, add_to_input_map=False, synch_mode=SynchMode.HORIZON, pattern='tnf')
torch_dataset.add_exogenous(name='global_u', value=np.concatenate([day_sin_cos, weekdays], axis=-1), add_to_input_map=True, synch_mode=SynchMode.WINDOW)

But when training, for example, an AGCRNModel, I receive this:
Sanity Checking DataLoader 0: 0%| | 0/2 [00:00<?, ?it/s]
Arguments ['fake_past_cov'] are filtered out. Only args ['u', 'x'] are forwarded to the model (AGCRNModel).
Does anyone know how to overcome this problem, for both past and future covariates?
Unfortunately, I couldn't find anything about it in the documentation.
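The message suggests that batch inputs are matched against the arguments of the model's forward method and that unmatched keys are dropped. The sketch below reproduces that kind of filtering with inspect.signature; `filter_forward_kwargs` and `toy_forward` are hypothetical names, and this is not the actual tsl code.

```python
import inspect


def filter_forward_kwargs(forward_fn, batch_inputs):
    """Split batch inputs into those accepted by `forward_fn` and the rest."""
    accepted = set(inspect.signature(forward_fn).parameters)
    kept = {k: v for k, v in batch_inputs.items() if k in accepted}
    dropped = sorted(set(batch_inputs) - accepted)
    return kept, dropped


def toy_forward(x, u=None):
    """Stand-in for a model whose forward only accepts `x` and `u`."""
    return x if u is None else x + u


batch_inputs = {"x": 1.0, "u": 0.5, "fake_past_cov": 2.0}
kept, dropped = filter_forward_kwargs(toy_forward, batch_inputs)
print(sorted(kept), dropped)
```

If this reading is right, the extra covariate would have to be merged into an input the model already accepts (for instance concatenated into u), or the model would need a forward method that takes the new key.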

Any help is appreciated ;) Thanks

Pandas version

Hi,

I'm very interested in the library you've created (kudos on your documentation!), but I am working on a platform (Dataiku) that won't let me install an arbitrary pandas version (the restriction is pandas>=1.3,<1.4).
Therefore I can't test your library.
Is there an important incompatibility with earlier versions of pandas that forced you to specify pandas>=1.4 in the requirements?

Differences of args in ImputationDataset between tsl-0.1.1 and the latest

Hello,

Thanks for your great contribution to the neural spatiotemporal data processing community!

I found that some works use an old version of the tsl package (v0.1.1), whose arguments differ from those of the latest version. For example, in tsl-0.1.1 the ImputationDataset class takes both training_mask and eval_mask, but the latest version drops training_mask. Could you please explain what changed, or what mechanism makes training_mask unnecessary?

Looking forward to your reply, thanks!

[Improving Documentation] Contributing inspectable notebook for imputation on custom dataset

Thank you for this amazing resource! Like others have raised in other issues, it seems:

  • documentation is either incomplete or not up-to-date
  • users express difficulty adapting own dataset to tsl

In addition to those two points, I've also noticed that:

  • the example scripts are not inspectable (ie. objects are not easily unpacked to check dimensions, properties, etc.)
  • the examples sometimes do not illustrate how to accommodate multivariate data, only multiple sensors

As a complete outsider to GNNs, I am wondering if I could get the authors' help in getting feedback on creating an example for beginners. In this way, I am hoping to contribute to the documentation, such that even a complete novice (such as I) can get started using tsl.

For instance, I have been thinking: say there is a dataset of car trajectories, collected over time. How can we go from a dataframe (shown below) to training a model in tsl to predict the missing positions x, y, z?

import numpy as np
import pandas as pd

# Define number of trajectories and time points
num_traj = 5
num_timepoints = 10

# Generate random trajectories
data = pd.DataFrame(np.random.randn(num_traj*num_timepoints, 4), columns=['x', 'y', 'z', 't'])

# Assign trajectory ID for each time point
data['trajectory'] = np.repeat(np.arange(num_traj), num_timepoints)

# Set some values to NaN to represent missing positions
data.iloc[np.random.choice(data.index, size=10, replace=False), :3] = np.nan

# Set time points to 0..num_timepoints-1, identical for every trajectory.
# Rows are already grouped by trajectory, so tiling the range is enough and
# avoids assigning to a copy of a DataFrame slice.
data['t'] = np.tile(np.arange(num_timepoints), num_traj)

Method to specify save location of a dataset

Hello TSL team,

Is it possible to add the functionality of specifying the save directory of a dataset?
Such as:

dataset = MetrLA('current_path or folder')

I ask because I am trying to run TSL on an external GPU cluster where I do not have root access or full permissions. I suspect this is why the following script fails:

import tsl
import torch
import numpy as np
np.set_printoptions(suppress=True)
print(f"tsl version  : {tsl.__version__}")
print(f"torch version: {torch.__version__}")
from tsl.datasets import MetrLA
dataset = MetrLA(root='data')

gives the error:

tsl version  : 0.9.0
torch version: 1.10.1+cu111
Segmentation fault (core dumped)

As far as I understand, a "Segmentation fault" means that the program tried to access memory that it does not have access to.

Thanks in advance!

Add parameter to specify which is the time dimension in MaskedMetric

def update(self, y_hat, y, mask=None):
    y_hat = y_hat[:, self.at]
    y = y[:, self.at]
    if mask is not None:
        mask = mask[:, self.at]
    if self.is_masked(mask):
        val, numel = self._compute_masked(y_hat, y, mask)
    else:
        val, numel = self._compute_std(y_hat, y)
    self.value += val
    self.numel += numel

In the referenced snippet, MaskedMetric's update function assumes the time dimension is the second.
However, this leads to adding an unnecessary dummy batch dimension when we represent a batch of graphs as a single big graph.

I suggest two possible solutions to avoid this:

  1. Add a t_dim parameter and use x.select(t_dim, self.at) instead of x[:, self.at].
  2. Add a pattern string and use it to identify the time dimension.

In both cases, this can be either a class attribute or a method parameter, depending on whether we prefer to set it once or to allow changing it on every metric update.

The first solution is the easiest to implement; however, the second may make it easier to implement further aggregations that depend on dimension semantics down the road.

If this is deemed useful I can implement this behavior with an agreed solution.
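Both options can be sketched together. The example below uses NumPy as a stand-in for torch (np.take plays the role of x.select), and `time_dim_from_pattern` and `select_step` are hypothetical helpers, not existing tsl functions.

```python
import numpy as np


def time_dim_from_pattern(pattern):
    """Return the index of the time axis 't' in a pattern such as 'b t n f'."""
    axes = pattern.split() if " " in pattern else list(pattern)
    return axes.index("t")


def select_step(x, at, t_dim=1):
    """Pick time step `at` along axis `t_dim` (NumPy analogue of x.select)."""
    return np.take(x, at, axis=t_dim)


batched = np.arange(2 * 3 * 4).reshape(2, 3, 4)  # pattern 'b t n'
flat = np.arange(3 * 8).reshape(3, 8)            # pattern 't n', one big graph

a = select_step(batched, at=1, t_dim=time_dim_from_pattern("b t n"))
b = select_step(flat, at=1, t_dim=time_dim_from_pattern("tn"))
print(a.shape, b.shape)
```

With the time axis resolved from the pattern, the single-big-graph case needs no dummy batch dimension.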

Imputation: change number of nodes after training

Hello,

Is it possible to change the number of nodes after training?
For example, I have trained an Imputer with 180 channels and 4 nodes.
Can I perform inference on data that have 180 channels and 5 nodes?

ScalerModule masks Scaler transform

The issue:
The current ScalerModule implementation inherits the bias and scale parameters from the given Scaler and then uses its own transform implementation. This is not transparent to the user: even if one overrides the base implementation of Scaler.transform, the ScalerModule will fall back to its own way of scaling the input.
This may lead to unexpected results, in particular because the SpatioTemporalDataset wraps every given Scaler into a ScalerModule.

Proposed Solution:
The ScalerModule should also inherit the way in which the original Scaler implements the transform.
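The difference between the two behaviors can be shown with plain Python classes. Everything below is a hypothetical stand-in (none of these classes are the tsl Scaler or ScalerModule); it only contrasts a wrapper that copies parameters with one that delegates to the wrapped object.

```python
class Scaler:
    """Minimal stand-in for a scaler with bias/scale parameters."""

    def __init__(self, bias=0.0, scale=1.0):
        self.bias, self.scale = bias, scale

    def transform(self, x):
        return (x - self.bias) / self.scale


class ClippedScaler(Scaler):
    """User subclass with a custom transform (clips the scaled value)."""

    def transform(self, x):
        return max(-1.0, min(1.0, super().transform(x)))


class CopyingWrapper:
    """Copies bias/scale and re-implements transform (behavior in the issue)."""

    def __init__(self, scaler):
        self.bias, self.scale = scaler.bias, scaler.scale

    def transform(self, x):
        return (x - self.bias) / self.scale


class DelegatingWrapper:
    """Keeps the original scaler and delegates transform to it (proposal)."""

    def __init__(self, scaler):
        self.scaler = scaler

    def transform(self, x):
        return self.scaler.transform(x)


scaler = ClippedScaler(bias=0.0, scale=2.0)
print(CopyingWrapper(scaler).transform(10.0))     # custom clipping is lost
print(DelegatingWrapper(scaler).transform(10.0))  # custom clipping is kept
```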

error at test : test_example_imputation

The error message:
FAILED test_example_imputation.py::test_example_imputation - TypeError: on_train_batch_start() takes 3 positional arguments but 4 were given

Could you please fix the error?

Many thanks

Shapes of input data

Hi,

I just installed tsl, having been really interested in your code and documentation.

However, I realize that I don't know where to start to create my own dataset. The only example code I see uses samples already included in your library.

If I have these three pandas DataFrames (details of their columns below), how would I go about creating my SpatioTemporalDataset object?

  • features: DAY, NODE, NODE_FEATURE_1, NODE_FEATURE_2, NODE_FEATURE_3
  • targets: DAY, NODE, TARGET
  • edges: DAY, ORIGIN, DESTINATION, EDGE_FEATURE
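As a possible starting point, a long-format table like features can be turned into an array of shape (time, nodes, features) together with an availability mask. The sketch below uses made-up values with the column names from the question; whether this array can be passed as-is to SpatioTemporalDataset is an assumption to check against the documentation.

```python
import numpy as np
import pandas as pd

# Toy long-format table with the columns from the question (made-up values).
# The row for (DAY=1, NODE='b') is missing on purpose.
features = pd.DataFrame({
    "DAY": [0, 0, 1],
    "NODE": ["a", "b", "a"],
    "NODE_FEATURE_1": [1.0, 4.0, 7.0],
    "NODE_FEATURE_2": [2.0, 5.0, 8.0],
    "NODE_FEATURE_3": [3.0, 6.0, 9.0],
})

days = sorted(features["DAY"].unique())
nodes = sorted(features["NODE"].unique())

# Reindex on the full (day, node) grid so that missing rows become NaN.
grid = pd.MultiIndex.from_product([days, nodes], names=["DAY", "NODE"])
wide = features.set_index(["DAY", "NODE"]).reindex(grid)

# Rows are ordered day-major, node-minor, so a reshape yields
# an array of shape (time, nodes, features).
x = wide.to_numpy(dtype=float).reshape(len(days), len(nodes), -1)
mask = ~np.isnan(x)

print(x.shape, mask[1, 1].any())
```

The targets table can be reshaped the same way, while the edges table would need separate handling because it varies by day.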

`get_connectivity` returns different adjacencies depending on layout

Hello!

Could you please clarify why the adjacency matrix is transposed in the get_connectivity method with the layout="edge_index" option?

For directed graphs, this method returns reflected edges:

import tsl
import torch
import numpy as np
import pandas as pd
from tsl.datasets import LargeST

dataset = LargeST(root='./data/')


connectivity_dense = dataset.get_connectivity(layout="dense")


edges, weights = dataset.get_connectivity()

connectivity_sparse = np.zeros_like(connectivity_dense)
connectivity_sparse[edges[0], edges[1]] = weights


print(np.allclose(connectivity_sparse, connectivity_dense)) # returns False
print(np.allclose(connectivity_sparse, connectivity_dense.T)) # returns True

Also, connectivity_dense is equal to the adjacency matrix from the original work. This behaviour of layout="edge_index" seems misleading to me.
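Until the intended convention is confirmed, a dense matrix can be rebuilt from the edge list by swapping the two endpoint rows. The sketch below mimics the reported behavior on a made-up 3-node graph; `edges_to_dense` is a hypothetical helper, and which orientation tsl intends is not established here.

```python
import numpy as np

# Toy directed adjacency (made-up values).
dense = np.array([
    [0.0, 0.5, 0.0],
    [0.0, 0.0, 0.2],
    [0.9, 0.0, 0.0],
])

# Edge list that mimics the behaviour reported above: it is built from the
# transpose of the dense matrix.
rows, cols = np.nonzero(dense.T)
edges = np.stack([rows, cols])
weights = dense.T[rows, cols]


def edges_to_dense(edges, weights, n_nodes, transpose=False):
    """Scatter an edge list into a dense matrix, optionally swapping endpoints."""
    out = np.zeros((n_nodes, n_nodes))
    src, dst = (edges[1], edges[0]) if transpose else (edges[0], edges[1])
    out[src, dst] = weights
    return out


print(np.allclose(edges_to_dense(edges, weights, 3), dense))
print(np.allclose(edges_to_dense(edges, weights, 3, transpose=True), dense))
```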

Looking forward to your answer, thank you!

The examples are not working, perhaps some info is missing

Hi,

Thanks for your work on this package, it really helps a lot!

I am trying to run the examples/forecasting/run_traffic_experiment.py file, but I get the following error:

C:\Users\sdblo\miniconda3\envs\tsl\lib\site-packages\pytorch_lightning\utilities\seed.py:55: UserWarning: No seed found, seed set to 1866288922
  rank_zero_warn(f"No seed found, seed set to {seed}")
Global seed set to 1866288922
[2022-12-02 12:10:51,148][tsl][INFO] -
**** Experiment config ****
run:
  seed: 1866288922
  dir: C:\....\experiments\tsl\examples\forecasting\outputs\2022-12-02\12-10-51
  name: 2022-12-02_12-10-51_1866288922

Error executing job with overrides: []
Traceback (most recent call last):
  File "C:\Users\sdblo\miniconda3\envs\tsl\lib\site-packages\tsl\experiment\experiment.py", line 156, in decorated_run_fn
    self.run_output = func(cfg)
  File "run_traffic_experiment.py", line 73, in run_traffic
    dataset = get_dataset(cfg.dataset.name)
omegaconf.errors.ConfigAttributeError: Key 'dataset' is not in struct
    full_key: dataset
    object_type=dict

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

So I thought maybe I have to change line 215 from:

    exp = Experiment(run_fn=run_traffic, config_path='config/traffic')

to

    exp = Experiment(run_fn=run_traffic, config_path='config/traffic/stcn.yaml')

but that gave the following error:

(tsl) sdblo@DLMACHINE C:\Users\sdblo\...\experiments\tsl\examples\forecasting>python run_traffic_experiment.py
Traceback (most recent call last):
  File "run_traffic_experiment.py", line 216, in <module>
    res = exp.run()
  File "C:\Users\sdblo\miniconda3\envs\tsl\lib\site-packages\tsl\experiment\experiment.py", line 189, in run
    self.run_fn()
  File "C:\Users\sdblo\miniconda3\envs\tsl\lib\site-packages\hydra\main.py", line 90, in decorated_main
    _run_hydra(
  File "C:\Users\sdblo\miniconda3\envs\tsl\lib\site-packages\hydra\_internal\utils.py", line 330, in _run_hydra
    validate_config_path(config_path)
  File "C:\Users\sdblo\miniconda3\envs\tsl\lib\site-packages\hydra\core\utils.py", line 293, in validate_config_path
    raise ValueError(msg)
ValueError: Using config_path to specify the config name is not supported, specify the config name via config_name.
See https://hydra.cc/docs/next/upgrades/0.11_to_1.0/config_path_changes

So it seems there is some information missing in the yaml files. How could this be fixed?

Kind regards

Questions about masking and mask dependencies during train/val phases

Hello everyone,

I am currently using TorchSpatiotemporal to conduct experiments for my Master's thesis in Data Science and Engineering under the supervision of Professor Paolo Garza.

The dataset I am working with is the SDPWF dataset, which was the main subject of the Baidu KDD competition in 2022. This dataset comprises data from over 100 sensors (wind turbines), recording approximately 10 different channels every ten minutes for 245 days. My task involves performing forecasting on this data. The objective is to compare various spatial-temporal deep learning architectures to understand how incorporating spatial information can improve prediction accuracy.

I have set up the necessary features and initialized the SpatioTemporalDataset and SpatioTemporalDataModule classes. Additionally, I have configured the Predictor and Trainer environment (see my Colab notebook here). I successfully trained a GraphWaveNetModel on this data by creating an SDPWFDataset class extending DatetimeDataset. The input dataframes are formatted correctly, with a datetime Pandas index representing the temporal dimension and a multi-column index mapping each wind turbine to its recorded channels. I also generate a dataset mask, a boolean dataframe indicating data availability for specific timeslots and wind turbines.

I am seeking clarification on the dataset mask, as I couldn't find much information in the documentation or GitHub repository. My specific questions are:

  • Can you explain more clearly what is the purpose of the mask?
  • To include the mask as an input to my neural network (referred to as x), should I move it to the covariates, or is it automatically appended to x by the DataModule class?
  • How should I use this mask to filter out some ground truth values with corresponding predictions and adjust the output loss accordingly for each training pass? My dataset contains some missing target values, and I need to mask them out to maintain consistency in loss evaluation. (Refer to Section 4.1 of the SDPWF paper: "In some cases, the wind turbines are stopped for reasons such as renovation or to avoid overloading the grid. In these instances, the actual generated power is unknown and should not be used for model evaluation.")
  • As I understand it, you have some metrics called MaskedSomething, but they seem to mask out only NaN values, and I don't see how to use them at training time to mask out missing ground-truth values based on the mask. This is both because I don't know how to retrieve the mask (GraphWaveNetModel only accepts x, u, edges as input) and because the masked metrics do not seem to allow it.
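The loss-related question in the list above comes down to averaging the error only over entries where the target is known. The NumPy sketch below shows that computation in isolation; `masked_mae` is a hypothetical helper with made-up data, not one of the tsl masked metrics.

```python
import numpy as np


def masked_mae(y_hat, y, mask):
    """Mean absolute error over entries where `mask` is True."""
    mask = mask.astype(bool)
    n_valid = mask.sum()
    if n_valid == 0:
        return 0.0
    # Replace invalid targets before subtracting so NaNs cannot leak through.
    err = np.abs(y_hat - np.where(mask, y, 0.0))
    return float((err * mask).sum() / n_valid)


y = np.array([[1.0, np.nan], [3.0, 4.0]])    # NaN: unknown ground truth
y_hat = np.array([[1.5, 10.0], [2.0, 4.0]])
mask = ~np.isnan(y)

print(masked_mae(y_hat, y, mask))
```

The prediction for the unknown target (10.0 here) has no effect on the result, which is the behavior wanted for stopped turbines.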

I have summarized the issue here, but please feel free to ask for additional details if needed. Feel also free to correct any misunderstanding here. Thank you for your support!

Best regards,

SpatioTemporalDataset.from_dataset does not accept the transform parameter.

Issue:
The convenience method SpatioTemporalDataset.from_dataset does not accept any transform parameter, so it either has to be set manually after instantiation or the user must use the standard constructor.

Proposed solution:
Add transform as parameter to from_dataset or move to a kwargs based implementation.

I can implement this based on an agreed upon solution.
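The kwargs-based variant could look roughly like the following. `ToyDataset` is a hypothetical stand-in with an invented constructor, not the real SpatioTemporalDataset; it only shows how extra arguments such as transform would reach the constructor.

```python
class ToyDataset:
    """Hypothetical stand-in; not the real SpatioTemporalDataset."""

    def __init__(self, target, window=12, horizon=12, transform=None):
        self.target = target
        self.window = window
        self.horizon = horizon
        self.transform = transform

    @classmethod
    def from_dataset(cls, dataset, **kwargs):
        """Build from a source dataset, forwarding extra constructor kwargs."""
        return cls(target=dataset["target"], **kwargs)


source = {"target": [1.0, 2.0, 3.0]}
ds = ToyDataset.from_dataset(source, window=2, transform={"target": "scaler"})
print(ds.window, ds.horizon, ds.transform)
```

Forwarding kwargs keeps from_dataset in sync with the constructor, at the cost of a less explicit signature.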

Running imputation with multiple GPUs

Hello,

I am trying to run a variant of run_imputation.py on multiple GPUs, but I get the following error with the dp strategy:

pytorch_lightning.utilities.exceptions.MisconfigurationException: Overriding `on_after_batch_transfer` is not supported in DP mode.

Do you know whether this can be fixed, or whether there is another way to take advantage of multiple GPUs for imputation?

Please provide some suggestions for multi timeseries training

Hello,

I am facing some challenges with time-series classification for stock data.
Some are resolvable, though I'm not sure whether there is a better solution.

  1. Many different series: around 4,000 stocks
    • There is a need to support some type of group training, such as by company type or industry (using the Dataset API).
    • Further training needs to be performed according to time segmentation (using get_walk_forward_splits, but category values vary over time: some values disappear and new ones appear).
  2. Unequal time lengths:
    • Almost no two stocks share the same time length; the starting date of each stock's data is different (if split by time, each range contains a different set of stocks).
  3. Handling missing values:
    • Some data points are missing due to a suspension in trading (no trading took place on these days for these particular stocks, although other stocks may have been active).
    • There are also genuinely missing data, like some fields in the financial reports. It is not feasible to simply fill these in with zeros or the mean value. A dynamic missing-value filling method that adjusts over time might be necessary, and I currently don't have a good solution for that.
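For the unequal lengths and suspension gaps listed above, one possible starting point is to align all series on a common date index and keep an availability mask instead of filling values. The sketch below uses made-up prices and hypothetical tickers; it is only one way to frame the data, not a recommendation from the tsl authors.

```python
# Toy daily closing prices keyed by date (made-up values).
# Each stock starts on a different day; "BBB" has a gap on 2024-01-03.
series = {
    "AAA": {"2024-01-01": 10.0, "2024-01-02": 10.5, "2024-01-03": 10.2},
    "BBB": {"2024-01-02": 20.0, "2024-01-04": 21.0},
    "CCC": {"2024-01-03": 5.0, "2024-01-04": 5.1},
}

dates = sorted({d for values in series.values() for d in values})
stocks = sorted(series)

# values[t][n] holds the price (None if absent); mask[t][n] flags availability.
values = [[series[s].get(d) for s in stocks] for d in dates]
mask = [[v is not None for v in row] for row in values]

for d, row in zip(dates, mask):
    print(d, row)
```

A mask built this way lets late starts and suspensions be excluded from the loss rather than imputed.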

After thinking about these problems, I'm confused about how to get started.
