
Comments (12)

SuperIRabbit commented on June 9, 2024

@Yashaswini-Srirangarajan I hit the same issue, and pinning scipy==1.11.1 solved my problem, although I'm not sure which version is mathematically more correct. See:
scipy/scipy#19415
mseitzer/pytorch-fid#103

from motiongpt.

Yashaswini-Srirangarajan commented on June 9, 2024

At least a partial fix has come through at scipy/scipy#20212. We recommend trying again once SciPy 1.13.0 is released, to see whether the problems are gone.

@lucascolley, this fix now works for me :) Thanks!!

lucascolley commented on June 9, 2024

fantastic - 1.13.0 should be out within the next few weeks

It was just released.

zybermonk commented on June 9, 2024

Hi @Yashaswini-Srirangarajan,
I noticed a lot of people have encountered this issue, including myself. The only fix I found was to change the 'test' split to 'val' in the config files. Check this for more details: #22 (comment)

However, this seems to be a strange error: even after manually checking the data for non-finite values, and even with a different dataset, it keeps resurfacing.

Asking @billl-jiang for any help with debugging this issue.
Cheers.

zybermonk commented on June 9, 2024

UPDATE:

  • Fixed this problem by checking all the .npy files for NaN values and other anomalies, against their corresponding names in the .txt split files (train, val and test).
  • Once you have found the faulty files, remove them from texts, new_joints and new_joint_vecs, and also remove their names from the .txt files.
  • In the end, all your files and the names in the .txt files should point to the same set of samples.
  • Finally, and most importantly, delete the 'tmp' folder created during the training runs every time you alter the data.
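The checks above can be sketched in Python. This is a minimal sketch, not a script from MotionGPT or HumanML3D: it assumes a HumanML3D-style layout with `new_joints/`, `new_joint_vecs/` and `texts/` folders next to `train.txt`, `val.txt` and `test.txt` split files that list one sample name per line, and the helper names `listed_names` and `find_faulty_samples` are hypothetical. It flags every listed sample whose arrays are missing, empty or contain NaN/inf values, or whose text file is missing.

```python
from pathlib import Path

import numpy as np

# Assumed HumanML3D-style layout (hypothetical; adjust to your copy):
#   <root>/new_joints/<name>.npy
#   <root>/new_joint_vecs/<name>.npy
#   <root>/texts/<name>.txt
#   <root>/train.txt, val.txt, test.txt  (one sample name per line)
ARRAY_DIRS = ("new_joints", "new_joint_vecs")
SPLITS = ("train", "val", "test")


def listed_names(root):
    """Collect every sample name referenced by the split files."""
    names = set()
    for split in SPLITS:
        split_file = Path(root) / f"{split}.txt"
        if split_file.exists():
            names.update(split_file.read_text().split())
    return names


def find_faulty_samples(root):
    """Map each problematic sample name to a list of reasons."""
    root = Path(root)
    faulty = {}
    for name in sorted(listed_names(root)):
        reasons = []
        for sub in ARRAY_DIRS:
            path = root / sub / f"{name}.npy"
            if not path.exists():
                reasons.append(f"missing {sub}")
                continue
            arr = np.load(path)
            if arr.size == 0:
                reasons.append(f"empty {sub}")
            elif not np.isfinite(arr).all():  # catches NaN and +/-inf
                reasons.append(f"non-finite values in {sub}")
        if not (root / "texts" / f"{name}.txt").exists():
            reasons.append("missing texts")
        if reasons:
            faulty[name] = reasons
    return faulty
```

Calling `find_faulty_samples(root)` with `root` set to your dataset folder returns a dict from sample name to reasons; it is worth checking that the reported samples match what the data-building notebooks warned about before deleting anything.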

Yashaswini-Srirangarajan commented on June 9, 2024

@zybermonk Thanks for the input. How did you check for NaNs? It looks like none of my files in new_joint_vecs and new_joints contain NaNs. Am I missing a step when generating the HumanML3D dataset? Thanks a lot!

Yashaswini-Srirangarajan commented on June 9, 2024

I tried this approach as well, but I seem to be getting a different error, shown below. Have you faced this before? Thanks!


Trainable params: 267 M                                                         
Non-trainable params: 65.1 M                                                    
Total params: 332 M                                                             
Total estimated model params size (MB): 1.3 K                                   
Sanity Checking ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 2/2 0:00:02 • 0:00:00 1.64it/s
2024-01-30 16:40:28,994 Sanity checking ok.
/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py:293: The number of training batches (1) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.
2024-01-30 16:40:29,481 Training started
Epoch 0/999998 ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:00:00 • 0:00:00 0.00it/s 
Traceback (most recent call last):
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/yasha/workspace/mocap/MotionGPT/train.py", line 94, in <module>
    main()
  File "/home/yasha/workspace/mocap/MotionGPT/train.py", line 85, in main
    trainer.fit(model, datamodule=datamodule)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 44, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 989, in _run
    results = self._run_stage()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1035, in _run_stage
    self.fit_loop.run()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 202, in run
    self.advance()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py", line 359, in advance
    self.epoch_loop.run(self._data_fetcher)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 137, in run
    self.on_advance_end(data_fetcher)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/training_epoch_loop.py", line 285, in on_advance_end
    self.val_loop.run()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 141, in run
    return self.on_run_end()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 253, in on_run_end
    self._on_evaluation_epoch_end()
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 329, in _on_evaluation_epoch_end
    call._call_lightning_module_hook(trainer, hook_name)
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 157, in _call_lightning_module_hook
    output = fn(*args, **kwargs)
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/models/base.py", line 54, in on_validation_epoch_end
    dico.update(self.metrics_log_dict())
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/models/base.py", line 114, in metrics_log_dict
    metrics_dict = getattr(
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/torchmetrics/metric.py", line 610, in wrapped_func
    value = _squeeze_if_scalar(compute(*args, **kwargs))
  File "/home/yasha/miniconda3/envs/motiongpt_env/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/metrics/t2m.py", line 195, in compute
    metrics["FID"] = calculate_frechet_distance_np(gt_mu, gt_cov, mu, cov)
  File "/home/yasha/workspace/mocap/MotionGPT/mGPT/metrics/utils.py", line 205, in calculate_frechet_distance_np
    raise ValueError("Imaginary component {}".format(m))
ValueError: Imaginary component 1.836488313288817e+26
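For context on the error itself: the message matches the imaginary-component check used in the widely used pytorch-fid implementation of the Fréchet distance (mseitzer/pytorch-fid, linked above). A minimal sketch in that style follows; MotionGPT's own `calculate_frechet_distance_np` may differ in details, and `frechet_distance`, `eps` and `atol` are illustrative names. FID needs the matrix square root of the product of the two covariance matrices. For positive semi-definite covariances that root is real in exact arithmetic, so any imaginary part should be round-off; a component as large as 1.8e+26 instead suggests degenerate statistics (for example NaN-contaminated or rank-deficient features) or an inaccurate `scipy.linalg.sqrtm`.

```python
import numpy as np
from scipy import linalg


def frechet_distance(mu1, sigma1, mu2, sigma2, eps=1e-6, atol=1e-3):
    """||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1 S2)^(1/2)), pytorch-fid style."""
    mu1, mu2 = np.atleast_1d(mu1), np.atleast_1d(mu2)
    sigma1, sigma2 = np.atleast_2d(sigma1), np.atleast_2d(sigma2)
    diff = mu1 - mu2

    # Matrix square root of the (generally non-symmetric) product S1 @ S2.
    covmean = linalg.sqrtm(sigma1 @ sigma2)
    if not np.isfinite(covmean).all():
        # Retry with eps added to both diagonals to regularise near-singular inputs.
        offset = np.eye(sigma1.shape[0]) * eps
        covmean = linalg.sqrtm((sigma1 + offset) @ (sigma2 + offset))

    # In exact arithmetic the root is real for PSD inputs; drop round-off only.
    if np.iscomplexobj(covmean):
        if not np.allclose(np.diagonal(covmean).imag, 0, atol=atol):
            m = np.max(np.abs(covmean.imag))
            raise ValueError("Imaginary component {}".format(m))
        covmean = covmean.real

    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2 * np.trace(covmean))
```

The `atol` threshold is what turns a large imaginary part into the `ValueError` above, so raising it would only hide the symptom; cleaning the statistics or checking the SciPy version addresses the cause.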


zybermonk commented on June 9, 2024

Hi @Yashaswini-Srirangarajan, sorry for the late response.
When you build HumanML3D, a few files will contain faulty data by default. You can first notice this during the build itself; for example, the 3rd HumanML3D notebook prints the following output:
[Screenshot (2024-02-12): output of the 3rd HumanML3D notebook]

Evidently, the .npy files ending in 7975 contained NaN data when checked with np.isfinite() or similar.
Using the same check, verify all the .npy files in new_joints and new_joint_vecs whose names appear in the train, test and val .txt files.

You will find that the following files also contain faulty data, as I encountered previously after running the 2nd HumanML3D notebook:
[Image: list of the faulty files]

The next step is to delete these files from the .npy folders and remove their names from the .txt files.

  • Most importantly, as I mentioned before, make sure you delete the tmp folder before running your code with the newly edited dataset.
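The deletion steps can be scripted too. This is a minimal sketch under the same assumed HumanML3D-style layout (`new_joints/`, `new_joint_vecs/`, `texts/` and `train/val/test.txt`); `remove_samples` and its `tmp_dir` parameter are hypothetical, and the location of the 'tmp' folder depends on your MotionGPT setup, so it has to be passed explicitly. It defaults to a dry run that only lists what it would change.

```python
import shutil
from pathlib import Path

# Assumed HumanML3D-style layout (hypothetical; adjust to your copy).
ARRAY_DIRS = ("new_joints", "new_joint_vecs")
SPLITS = ("train", "val", "test")  # extend if your copy has more list files


def remove_samples(root, names, tmp_dir=None, dry_run=True):
    """Drop `names` from the split files and delete their per-sample files.

    With dry_run=True nothing is changed; the planned actions are returned.
    """
    root = Path(root)
    names = set(names)
    actions = []
    for split in SPLITS:
        split_file = root / f"{split}.txt"
        if not split_file.exists():
            continue
        listed = split_file.read_text().split()
        kept = [n for n in listed if n not in names]
        if len(kept) != len(listed):
            actions.append(f"rewrite {split_file}")
            if not dry_run:
                split_file.write_text("".join(f"{n}\n" for n in kept))
    for name in sorted(names):
        paths = [root / d / f"{name}.npy" for d in ARRAY_DIRS]
        paths.append(root / "texts" / f"{name}.txt")
        for path in paths:
            if path.exists():
                actions.append(f"delete {path}")
                if not dry_run:
                    path.unlink()
    # The thread recommends deleting the tmp folder after every data change.
    if tmp_dir is not None and Path(tmp_dir).exists():
        actions.append(f"delete {tmp_dir}")
        if not dry_run:
            shutil.rmtree(tmp_dir)
    return actions
```

Running it first with `dry_run=True` to review the plan, then with `dry_run=False`, keeps the .npy folders, texts and .txt split files pointing at the same samples.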

lucascolley commented on June 9, 2024

If anyone has any input on which version is more mathematically correct, that would be great.
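One way to get an independent reference value, rather than trusting either SciPy release, is to avoid the general (non-symmetric) `sqrtm` altogether. For symmetric PSD covariances S1 and S2 with S1 invertible, S1 S2 is similar to the symmetric PSD matrix S1^(1/2) S2 S1^(1/2), so Tr((S1 S2)^(1/2)) equals the sum of the square roots of that matrix's eigenvalues, which `numpy.linalg.eigvalsh` computes stably. This is a sketch of that cross-check, not what SciPy or MotionGPT does; `psd_sqrt`, `trace_sqrt_product` and `frechet_distance_eigh` are hypothetical names.

```python
import numpy as np


def psd_sqrt(a):
    """Symmetric PSD square root via eigendecomposition (negatives clipped)."""
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T


def trace_sqrt_product(sigma1, sigma2):
    """Tr((S1 S2)^(1/2)) computed as Tr((S1^(1/2) S2 S1^(1/2))^(1/2))."""
    root1 = psd_sqrt(sigma1)
    m = root1 @ sigma2 @ root1
    m = (m + m.T) / 2  # remove asymmetric round-off before eigvalsh
    w = np.linalg.eigvalsh(m)
    return float(np.sqrt(np.clip(w, 0.0, None)).sum())


def frechet_distance_eigh(mu1, sigma1, mu2, sigma2):
    """Fréchet distance with the trace term taken from symmetric eigenvalues."""
    diff = np.asarray(mu1) - np.asarray(mu2)
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2)
                 - 2 * trace_sqrt_product(sigma1, sigma2))
```

If this value agrees with the `sqrtm`-based FID to several digits on your statistics, that SciPy version is probably fine for that input; a large disagreement points at the `sqrtm` path.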

zybermonk commented on June 9, 2024

Just adding to this question: changing these library versions indirectly requires finding a compatible numpy version as well.

lucascolley commented on June 9, 2024

At least a partial fix has come through at scipy/scipy#20212. We recommend trying again once SciPy 1.13.0 is released, to see whether the problems are gone.
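A quick way to check whether an installed SciPy computes accurate square roots for this kind of input is to measure the residual of `scipy.linalg.sqrtm` on a product of two sample covariance matrices. This sketch assumes the regression shows up as an inaccurate root; a small residual does not by itself prove the FID values are correct, and `sqrtm_relative_residual` is a hypothetical helper with arbitrary default sizes.

```python
import numpy as np
import scipy
from scipy import linalg


def sqrtm_relative_residual(dim=64, n_samples=256, seed=0):
    """Relative error ||R @ R - A|| / ||A|| for R = sqrtm(A), A = S1 @ S2."""
    rng = np.random.default_rng(seed)
    s1 = np.cov(rng.standard_normal((n_samples, dim)), rowvar=False)
    s2 = np.cov(rng.standard_normal((n_samples, dim)), rowvar=False)
    a = s1 @ s2  # non-symmetric, like the product used in FID
    root = linalg.sqrtm(a)
    return float(np.linalg.norm(root @ root - a) / np.linalg.norm(a))


if __name__ == "__main__":
    print("SciPy", scipy.__version__, "residual", sqrtm_relative_residual())
```

Running the same script under each SciPy version being compared (for example 1.11.1 and 1.13.0) gives a like-for-like check.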

lucascolley commented on June 9, 2024

fantastic - 1.13.0 should be out within the next few weeks
