
Comments (6)

wymli commented on September 12, 2024

I ran your experiment and got results like this:

{"best_config": {"config": {"model": "ECC", "device": "cuda", "batch_size": 32, "learning_rate": 0.1, "classifier_epochs": 1000,
"optimizer": "SGD", "scheduler": {"class": "ECCLR", "args": {"gamma": 0.1, "step_size": 10}}, "loss": "MulticlassClassificationLoss",
"gradient_clipping": null, "early_stopper": {"class": "Patience", "args": {"patience": 500, "use_loss": false}}, "shuffle": true, "l2": 0.0,
"dropout": 0.05, "dropout_final": 0.1, "num_layers": 2, "dim_embedding": 32, "dataset": "bbbp"},
"TR_score": 0.9383519814222812, "VL_score": 0.8699779249448123}, "OUTER_TR": 0.8983603755755274, "OUTER_TS": 0.7846266237472319}

{"best_config": {"config": {"model": "ECC", "device": "cuda", "batch_size": 32, "learning_rate": 0.1, "classifier_epochs": 1000,
"optimizer": "SGD", "scheduler": {"class": "ECCLR", "args": {"gamma": 0.1, "step_size": 10}}, "loss": "MulticlassClassificationLoss",
"gradient_clipping": null, "early_stopper": {"class": "Patience", "args": {"patience": 500, "use_loss": true}}, "shuffle": true, "l2": 0.0,
"dropout": 0.05, "dropout_final": 0.1, "num_layers": 2, "dim_embedding": 64, "dataset": "bbbp"},
"TR_score": 0.9381171050338343, "VL_score": 0.840649692712906}, "OUTER_TR": 0.8686136180509775, "OUTER_TS": 0.8346322009403071}

{"best_config": {"config": {"model": "ECC", "device": "cuda", "batch_size": 32, "learning_rate": 0.1, "classifier_epochs": 1000,
"optimizer": "SGD", "scheduler": {"class": "ECCLR", "args": {"gamma": 0.1, "step_size": 10}}, "loss": "MulticlassClassificationLoss",
"gradient_clipping": null, "early_stopper": {"class": "Patience", "args": {"patience": 500, "use_loss": false}}, "shuffle": true, "l2": 0.0,
"dropout": 0.05, "dropout_final": 0.1, "num_layers": 2, "dim_embedding": 64, "dataset": "bbbp"},
"TR_score": 0.9463063212497737, "VL_score": 0.8487089925347578}, "OUTER_TR": 0.951178197317892, "OUTER_TS": 0.8494901713444759}

{"best_config": {"config": {"model": "ECC", "device": "cuda", "batch_size": 32, "learning_rate": 0.01, "classifier_epochs": 1000,
"optimizer": "SGD", "scheduler": {"class": "ECCLR", "args": {"gamma": 0.1, "step_size": 10}}, "loss": "MulticlassClassificationLoss",
"gradient_clipping": null, "early_stopper": {"class": "Patience", "args": {"patience": 500, "use_loss": false}}, "shuffle": true, "l2": 0.0,
"dropout": 0.05, "dropout_final": 0.1, "num_layers": 2, "dim_embedding": 64, "dataset": "bbbp"},
"TR_score": 0.8298400357665173, "VL_score": 0.8640853959716811}, "OUTER_TR": 0.8546555830876911, "OUTER_TS": 0.7710857477319286}

{"best_config": {"config": {"model": "ECC", "device": "cuda", "batch_size": 32, "learning_rate": 0.1, "classifier_epochs": 1000,
"optimizer": "SGD", "scheduler": {"class": "ECCLR", "args": {"gamma": 0.1, "step_size": 10}}, "loss": "MulticlassClassificationLoss",
"gradient_clipping": null, "early_stopper": {"class": "Patience", "args": {"patience": 500, "use_loss": false}}, "shuffle": true, "l2": 0.0,
"dropout": 0.05, "dropout_final": 0.1, "num_layers": 2, "dim_embedding": 64, "dataset": "bbbp"},
"TR_score": 0.8259295808005054, "VL_score": 0.8700949453926788}, "OUTER_TR": 0.899964474619089, "OUTER_TS": 0.7435839880284324}

I'm wondering how I can choose my real best config, given that I get a different best config in each of the k folds.
Should I choose the config with the best performance, or the one that occurs most often? Or some other choice?
Thanks!

diningphil commented on September 12, 2024

Hi,

This is an important point that we also discuss in our paper (see Sections 3-4 and the Appendix).

TL;DR: There is no overall "best configuration" when doing risk assessment.

When you use k-fold cross-validation to compute an estimate of the performance of your model (i.e., risk assessment), there exists no overall "real best config". You should average the k OUTER_TS results you obtained to get the score you need.
Instead, when we talk about "best config", we usually refer to the process of model selection on a validation set. However, we must not select a best configuration using the k test sets used for risk assessment. This is one of the things we have criticized in our paper.
In k-fold cross-validation for risk assessment, you end up doing model selection k times. Therefore, it is perfectly normal to have k different best configurations, and you should not choose one of them. If you select one of them based on the OUTER_TS (test) results, that amounts to "cheating". This is why it is much harder to get state-of-the-art results this way, but the resulting performance estimate will be robust :).
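
For example, here is a minimal sketch of the aggregation step (the file names are hypothetical; adapt them to wherever your per-fold assessment results are stored):

import json
import statistics

# Hypothetical paths: point them at your per-fold assessment result files.
result_files = [f"OUTER_FOLD_{k}/outer_results.json" for k in range(1, 6)]

outer_test_scores = []
for path in result_files:
    with open(path) as f:
        outer_test_scores.append(json.load(f)["OUTER_TS"])

# The performance estimate is the mean (and std) over the k outer test folds,
# not the score of any single "best" configuration.
mean_score = statistics.mean(outer_test_scores)
std_score = statistics.stdev(outer_test_scores)
print(f"Estimated test performance: {mean_score:.4f} ± {std_score:.4f}")

With the five OUTER_TS values you posted, this gives roughly 0.797 on bbbp.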

A picture is worth a thousand words (taken from the paper):

[Figure: "overall-process", the model selection and risk assessment schematic from the paper]

Note that if you have to deploy your solution in production, that is a different story. Once you have your performance estimate from the techniques above, one possibility is to re-train your model with a simple train/validation/test split to find the best configuration on a validation set. That gives you a single trained model that you can reuse for prediction.
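
As a minimal, generic sketch of that final step (using scikit-learn as a stand-in for brevity; this is not the repository's code):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Toy data standing in for your real dataset.
X, y = make_classification(n_samples=500, random_state=0)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.3, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Model selection on the validation set only.
configs = [{"hidden_layer_sizes": (32,)}, {"hidden_layer_sizes": (64, 64)}]
best_cfg, best_val = None, -1.0
for cfg in configs:
    clf = MLPClassifier(max_iter=500, random_state=0, **cfg).fit(X_train, y_train)
    score = clf.score(X_val, y_val)
    if score > best_val:
        best_cfg, best_val = cfg, score

# Retrain once on train + validation: this single model is the one you deploy,
# while the reported performance estimate still comes from the averaged OUTER_TS scores.
final_model = MLPClassifier(max_iter=500, random_state=0, **best_cfg).fit(
    np.vstack([X_train, X_val]), np.concatenate([y_train, y_val])
)
print("held-out test score:", final_model.score(X_test, y_test))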

I hope this clarifies things.

pursueorigin commented on September 12, 2024

Hi,

In K_Fold_Assessment.py, the loop

for outer_k in range(self.outer_folds):

conducts model selection for each outer fold. For different folds, this procedure may therefore find different model configurations. That seems fine when selecting hyper-parameters such as the learning rate, but I do NOT think it is correct when selecting the number of layers or the hidden dimension, such as num_layers. For different folds, we should use the same network. With your method, different folds may have different neural network architectures. Am I right?

diningphil commented on September 12, 2024

Hi,
You are right, different folds may end up with different architectures, but that is the point of performing model selection. By treating the number of layers as a hyper-parameter, you are more flexible in finding the right architecture for each data fold. Why would you want to fix it in advance? It depends on the data, just like the learning rate does. Please let me know if this answers your question :).
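
As a generic illustration of this point (a scikit-learn stand-in, not the actual K_Fold_Assessment.py code), the architecture simply becomes part of the grid that each outer fold searches over:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, random_state=0)

# The architecture (number/size of layers) is just another hyper-parameter.
grid = {"hidden_layer_sizes": [(32,), (64,), (64, 64)],
        "learning_rate_init": [0.1, 0.01]}

outer_scores = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    # Inner model selection uses only the outer-fold training data.
    inner = GridSearchCV(MLPClassifier(max_iter=300, random_state=0), grid, cv=3)
    inner.fit(X[train_idx], y[train_idx])
    # The selected configuration, architecture included, can differ per fold.
    print("fold best config:", inner.best_params_)
    outer_scores.append(inner.score(X[test_idx], y[test_idx]))

print("risk assessment estimate:", np.mean(outer_scores))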

pursueorigin commented on September 12, 2024

Different architectures correspond to different models. If we use different architectures for different folds when evaluating the model performance, which model are we evaluating?

I haven't seen any previous work doing this. Could you please point me to literature supporting your method?

diningphil commented on September 12, 2024

Model assessment, by definition, evaluates the performance of the family of models associated with a specific method. There is no single architecture after model assessment, because the goal of this process is to get an estimate of the true error, not to produce an architecture.

The misconception that we are evaluating a single architecture leads to papers where authors find the best hyper-parameters on the test set rather than on the validation set (this was the reason we published our paper). There is no "best architecture" in general (see the "Understanding Machine Learning" book, in particular the no-free-lunch theorem).

However, do not take my word for it: See part 3 of Samy Bengio's lecture and the associated references on Statistical Learning Theory.
