Comments (2)
Hi @gboeer
I'm "happy" to see I am not the only one having issues logging with MLflow.
I am fine-tuning a pretrained transformer model on roughly 2000 images, so not an insane amount of data.
As you can see, metrics such as `validation_accuracy`, although recorded with `on_step=False, on_epoch=True`, only ever show me the value of the last epoch. I would like to see an actual graph with all my previous epochs, but it's just a scalar here.
Also, I tell my trainer to log every 50 steps, yet in my epoch/step plot I only see points at steps 49, 199, 349, 499, ..., not at every multiple of 50.
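For context, here is what I expect `on_step=False, on_epoch=True` to produce: one aggregated point per epoch, so a multi-epoch run should give a line plot rather than a single scalar. A rough pure-Python sketch of that expectation (the mean reduction here is my own assumption for illustration; the actual reduction depends on the metric):

```python
# Sketch of the expected behaviour of on_step=False, on_epoch=True:
# accumulate per-batch values and emit ONE aggregated point per epoch.
# (Mean reduction is an assumption made for this illustration.)

def epoch_points(batch_values_per_epoch):
    """Return one (epoch, mean_value) point per epoch."""
    return [
        (epoch, sum(vals) / len(vals))
        for epoch, vals in enumerate(batch_values_per_epoch)
    ]

history = epoch_points([[0.5, 0.7], [0.8, 0.9], [0.9, 1.0]])
# Three epochs should give three plotted points, not a single scalar.
```

So with my run I would expect as many points on the `validation_accuracy` plot as completed epochs.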
Here is my logger:
```python
logger = MLFlowLogger(
    experiment_name=config['logger']['experiment_name'],
    tracking_uri=config['logger']['tracking_uri'],
    log_model=config['logger']['log_model'],
)
```
Passed to my trainer:
```python
trainer = Trainer(
    accelerator=config['accelerator'],
    devices=config['devices'],
    max_epochs=config['max_epochs'],
    logger=logger,
    log_every_n_steps=50,
    callbacks=[early_stopping, lr_monitor, checkpoint, progress_bar],
)
```
My metrics are logged as follows in the `training_step` and `validation_step` functions:
```python
def training_step(self, batch, batch_idx):
    index, audio_name, targets, inputs = batch
    logits = self.model(inputs)
    loss = self.loss(logits, targets)
    predictions = torch.argmax(logits, dim=1)
    self.train_accuracy.update(predictions, targets)
    self.log("training_loss", loss, on_step=True, on_epoch=True, batch_size=self.hparams.batch_size, prog_bar=True)
    self.log("training_accuracy", self.train_accuracy, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("training_gpu_allocation", torch.cuda.memory_allocated(), on_step=True, on_epoch=False)
    return {"inputs": inputs, "targets": targets, "predictions": predictions, "loss": loss}

def validation_step(self, batch, batch_idx):
    index, audio_name, targets, inputs = batch
    logits = self.model(inputs)
    loss = self.loss(logits, targets)
    predictions = torch.argmax(logits, dim=1)
    self.validation_accuracy(predictions, targets)
    self.validation_precision(predictions, targets)
    self.validation_recall(predictions, targets)
    self.validation_f1_score(predictions, targets)
    self.validation_confmat.update(predictions, targets)
    self.log("validation_loss", loss, on_step=True, on_epoch=True, batch_size=self.hparams.batch_size, prog_bar=True)
    self.log("validation_accuracy", self.validation_accuracy, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_precision", self.validation_precision, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_recall", self.validation_recall, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
    self.log("validation_f1_score", self.validation_f1_score, on_step=False, on_epoch=True, batch_size=self.hparams.batch_size)
```
I guess it's a problem on the Lightning side, but I'm not 100% sure.
I hope we'll get support soon. I serve my ML models with MLflow and it works fine, so I don't want to go back to TensorBoard just for my DL models.
EDIT: My bad, it seems to do that only while training is still running. Once training is finished, the plots display correctly.
Still, I thought we were supposed to be able to follow the evolution of metrics as training progresses, and in that case it isn't really possible.
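In the meantime, one way I found to check that the values really are being recorded during training is to read the metric history from the tracking server directly instead of relying on the UI plot. A minimal sketch (`metric_history` and `make_client` are my own helper names; the run id would come from `logger.run_id`):

```python
def metric_history(client, run_id, key):
    """Return [(step, value), ...] for one logged metric.

    `client` is anything exposing get_metric_history(run_id, key),
    such as mlflow.tracking.MlflowClient.
    """
    return [(m.step, m.value) for m in client.get_metric_history(run_id, key)]

def make_client(tracking_uri):
    # Lazy import so metric_history stays usable without mlflow installed.
    from mlflow.tracking import MlflowClient
    return MlflowClient(tracking_uri=tracking_uri)

# Usage (hypothetical URI and run id):
# client = make_client("http://localhost:5000")
# print(metric_history(client, logger.run_id, "validation_accuracy"))
```

If the history comes back with one entry per epoch, the data is fine and only the UI plot is stale.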
from pytorch-lightning.
@Antoine101
Interesting that your plots change once training is finished. For me, they stay the same, though. I tried opening the app in a private window to rule out any caching issues, but it didn't change anything.
I guess what you observed about the step size may just come down to zero-indexing.
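To illustrate the zero-indexing guess: if the check is something like `(step + 1) % log_every_n_steps == 0` against a zero-indexed global step, the first logged point lands at step 49, not 50 (the exact condition inside Lightning is an assumption on my part):

```python
LOG_EVERY_N = 50

def logged_steps(total_steps, n=LOG_EVERY_N):
    """Zero-indexed steps at which a '(step + 1) % n == 0' check fires."""
    return [s for s in range(total_steps) if (s + 1) % n == 0]

logged_steps(200)  # [49, 99, 149, 199]
```

That would explain the 49 offset you see, though not by itself why your points are 150 steps apart.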