Running python -m train --cfg configs/config_h3d_stage1.yaml --nodebug after setting u

[Training] Stops with an error: The algorithm failed to converge because the input matrix contained non-finite values (openmotionlab/motiongpt)

Comments (12)

SuperIRabbit commented on June 9, 2024

@Yashaswini-Srirangarajan I hit the same issue and using scipy==1.11.1 solved my problem, although I'm not sure which version is mathematically more correct. See:
scipy/scipy#19415
mseitzer/pytorch-fid#103
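
For anyone who wants to check their own environment, here is a minimal sketch that flags SciPy versions falling between the two data points reported in this thread: 1.11.1 (reported working) and 1.13.0 (reported to carry at least a partial fix). The boundaries are inferred from these comments only, not from the SciPy changelog, and the helper names are hypothetical.

```python
import re

# Both boundaries come from comments in this thread (an assumption, not a
# statement about which SciPy releases are actually affected).
KNOWN_GOOD = (1, 11, 1)   # reported working here
FIX_RELEASE = (1, 13, 0)  # reported to include at least a partial fix


def parse_version(text):
    """Turn '1.11.4' or '1.13.0rc1' into a comparable tuple such as (1, 11, 4)."""
    numbers = re.findall(r"\d+", text)[:3]
    return tuple(int(n) for n in numbers) + (0,) * (3 - len(numbers))


def is_suspect(version_text):
    """True if the version lies strictly between the two reported-good releases."""
    return KNOWN_GOOD < parse_version(version_text) < FIX_RELEASE


if __name__ == "__main__":
    from importlib.metadata import PackageNotFoundError, version

    try:
        installed = version("scipy")
    except PackageNotFoundError:
        print("SciPy is not installed in this environment")
    else:
        state = "suspect" if is_suspect(installed) else "not flagged"
        print(f"SciPy {installed}: {state}")
```

"Not flagged" only means that nothing in this thread implicates that version; releases older than 1.11.1 are not classified either way.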

from motiongpt.

Yashaswini-Srirangarajan commented on June 9, 2024

At least a partial fix has come through at scipy/scipy#20212. We recommend trying again once SciPy 1.13.0 is released, to see whether the problems are gone.

@lucascolley, This fix now works for me :) thanks !!

from motiongpt.

lucascolley commented on June 9, 2024

fantastic - 1.13.0 should be out within the next few weeks

It was just released.

from motiongpt.

zybermonk commented on June 9, 2024

Hi @Yashaswini-Srirangarajan,
I noticed that a lot of people have run into this issue, myself included. The only fix that worked for me was changing the 'test' split to 'val' in the config files. See this for more details: #22 (comment)

However, this is a strange error: even after manually checking the data for non-finite values, and even after switching to a different dataset, it keeps resurfacing.

Asking @billl-jiang for any support with this issue and with debugging.
Cheers.

from motiongpt.

zybermonk commented on June 9, 2024

UPDATE:

Fixed this problem by checking all the .npy files for NaN values and other anomalies, going by their corresponding names in the .txt files (train, val and test).
Once you have found the faulty files, remove them from texts, new_joints and new_joint_vecs, and remove their names from the .txt files.
In the end, all your files and the listed names should point to the same number of samples.
Finally, and most importantly, delete the 'tmp' folder created during the training runs every time you alter the data.
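
The check described above could look roughly like the sketch below. It assumes each sample is a single float array saved as <id>.npy and treats an empty array as an "anomaly" as well; that second rule and the function name are assumptions, not something spelled out in this thread.

```python
from pathlib import Path

import numpy as np


def find_nonfinite(folder):
    """Return the sorted ids (file stems) of .npy files that are empty or hold NaN/inf."""
    bad = []
    for path in sorted(Path(folder).glob("*.npy")):
        array = np.load(path)
        # np.isfinite is False for NaN, +inf and -inf.
        if array.size == 0 or not np.isfinite(array).all():
            bad.append(path.stem)
    return bad
```

Running it once on new_joints and once on new_joint_vecs gives the ids to look for in the train, val and test .txt files.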

from motiongpt.

Yashaswini-Srirangarajan commented on June 9, 2024

@zybermonk Thanks for the input. How did you check for NaNs? It looks like none of my files in new_joint_vecs and new_joints contain NaNs. Am I missing a step in generating the HumanML3D dataset? Thanks a lot!

from motiongpt.

Yashaswini-Srirangarajan commented on June 9, 2024

I tried this approach as well, but I now seem to be getting a different error, shown below. Have you faced this before? Thanks!


Trainable params: 267 M
Non-trainable params: 65.1 M
Total params: 332 M
Total estimated model params size (MB): 1.3 K
Sanity Checking ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/2 0:00:02 • 0:00:00 1.64it/s
2024-01-30 16:40:28,994 Sanity checking ok.
/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py:293: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
2024-01-30 16:40:29,481 Training started
Epoch 0/999998 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:00 • 0:00:00 0.00it/s
Traceback (most recent call last):
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/yasha/workspace/mocap/MotionGPT/train.py", line 94, in <module>
    main()
  File "/home/yasha/workspace/mocap/MotionGPT/train.py", line 85, in main
    trainer.fit(model, datamodule=datamodule)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 137, in run
    self.on_advance_end(data_fetcher)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 285, in on_advance_end
    self.val_loop.run()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 141, in run
    return self.on_run_end()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 253, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 329, in _on_evaluation_epoch_end
    call._call_lightning_module_hook(trainer, hook_name)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/models/base.py", line 54, in on_validation_epoch_end
    dico.update(self.metrics_log_dict())
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/models/base.py", line 114, in metrics_log_dict
    metrics_dict = getattr(
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/torchmetrics/metric.py", line 610, in wrapped_func
    value = _squeeze_if_scalar(compute(*args, **kwargs))
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/metrics/t2m.py", line 195, in compute
    metrics["FID"] = calculate_frechet_distance_np(gt_mu, gt_cov, mu, cov)
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/metrics/utils.py", line 205, in calculate_frechet_distance_np
    raise ValueError("Imaginary component {}".format(m))
ValueError: Imaginary component 1.836488313288817e+26
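
The Fréchet distance between two Gaussians is ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)), so the quality of the matrix square root depends on the two covariance matrices passed in. One possible contributor to a huge imaginary component (a hypothesis, not confirmed anywhere in this thread) is a covariance estimated from very few samples, which is rank deficient and makes the square root ill-conditioned. The sketch below only inspects a covariance matrix before it reaches the metric; covariance_report is a hypothetical helper and is not part of MotionGPT.

```python
import numpy as np


def covariance_report(cov):
    """Summarise properties of a covariance matrix that matter for a matrix square root."""
    cov = np.asarray(cov, dtype=np.float64)
    finite = bool(np.isfinite(cov).all())
    report = {
        "finite": finite,
        "rank": None,
        "min_eigenvalue": None,
        "rank_deficient": None,
    }
    if finite:
        # Symmetrise first so eigvalsh sees an exactly symmetric matrix.
        sym = (cov + cov.T) / 2
        rank = int(np.linalg.matrix_rank(sym))
        report.update(
            rank=rank,
            min_eigenvalue=float(np.linalg.eigvalsh(sym).min()),
            rank_deficient=rank < sym.shape[0],
        )
    return report
```

Calling it on gt_cov and cov just before the FID computation would show whether either matrix is non-finite, singular, or has clearly negative eigenvalues.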

from motiongpt.

zybermonk commented on June 9, 2024

Hi @Yashaswini-Srirangarajan, sorry for the late response.
When you build HumanML3D, a few of the generated files contain faulty data by default. You can already notice this during the build itself; for example, the 3rd HumanML3D notebook prints the following output:

Evidently, the .npy files with the suffix 7975 contained NaN data when checked with np.isfinite() or similar.
Following the same method, verify every .npy file in new_joints and new_joint_vecs that corresponds to a file name in the train, test and val .txt files.

You will find that the following files also contain faulty data, as seen earlier when running the 2nd HumanML3D notebook:

The next step is to delete these files from the .npy folders and remove their names from the .txt files.

Most importantly, as I mentioned previously, make sure you delete the tmp folder before running your code with the edited dataset.
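
The clean-up steps above can be sketched as follows. This assumes the split files list one sample id per line and that each sample exists as texts/<id>.txt, new_joints/<id>.npy and new_joint_vecs/<id>.npy; the function names and the location of the tmp folder are placeholders to adapt to your own setup.

```python
import shutil
from pathlib import Path

# Assumed layout: (subfolder, file extension) for every per-sample file.
SAMPLE_FOLDERS = (
    ("texts", ".txt"),
    ("new_joints", ".npy"),
    ("new_joint_vecs", ".npy"),
)


def prune_split_file(split_path, bad_ids):
    """Rewrite a split file without the given ids; return how many lines were dropped."""
    split_path = Path(split_path)
    bad = set(bad_ids)
    lines = split_path.read_text().splitlines()
    kept = [line for line in lines if line.strip() not in bad]
    split_path.write_text("".join(line + "\n" for line in kept))
    return len(lines) - len(kept)


def remove_sample_files(root, bad_ids):
    """Delete the per-sample files of every bad id; return the paths that were removed."""
    removed = []
    for folder, extension in SAMPLE_FOLDERS:
        for sample_id in bad_ids:
            path = Path(root) / folder / f"{sample_id}{extension}"
            if path.exists():
                path.unlink()
                removed.append(path)
    return removed


def clear_cache(tmp_dir):
    """Remove the cached tmp folder so the next run rebuilds it from the edited data."""
    shutil.rmtree(tmp_dir, ignore_errors=True)
```

A typical order would be: call prune_split_file on train.txt, val.txt and test.txt, then remove_sample_files on the dataset root, then clear_cache on the tmp folder. Working on a backup copy of the dataset first is safer, since the deletions are not reversible.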

from motiongpt.

lucascolley commented on June 9, 2024

I hit the same issue and using scipy==1.11.1 solved my problem, although I'm not sure which version is mathematically more correct

If anyone has any input on which version is more mathematically correct, that would be great.

from motiongpt.

zybermonk commented on June 9, 2024

Just adding to this question: changing the version of these libraries indirectly requires finding a matching numpy version as well.

from motiongpt.

lucascolley commented on June 9, 2024

At least a partial fix has come through at scipy/scipy#20212. We recommend trying again once SciPy 1.13.0 is released, to see whether the problems are gone.

from motiongpt.

lucascolley commented on June 9, 2024

fantastic - 1.13.0 should be out within the next few weeks

from motiongpt.
