
t2m-gpt's Introduction

(CVPR 2023) T2M-GPT

Pytorch implementation of paper "T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations"

[Project Page] [Paper] [Notebook Demo] [HuggingFace] [Space Demo] [T2M-GPT+]

(Teaser figure)

If our project is helpful for your research, please consider citing:

@inproceedings{zhang2023generating,
  title={T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations},
  author={Zhang, Jianrong and Zhang, Yangsong and Cun, Xiaodong and Huang, Shaoli and Zhang, Yong and Zhao, Hongwei and Lu, Hongtao and Shen, Xi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023},
}

Table of Contents

1. Visual Results (More results can be found in our project page)

Text: a man steps forward and does a handstand.
(Animated comparison: GT, T2M, MDM, MotionDiffuse, Ours)

Text: A man rises from the ground, walks in a circle and sits back down on the ground.
(Animated comparison: GT, T2M, MDM, MotionDiffuse, Ours)

2. Installation

2.1. Environment

Our model can be trained on a single V100-32G GPU.

conda env create -f environment.yml
conda activate T2M-GPT

The code was tested on Python 3.8 and PyTorch 1.8.1.
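
A minimal environment sanity check (a sketch; it only verifies the versions mentioned above and that CUDA is visible):

import sys
import torch

# Tested configuration: Python 3.8, PyTorch 1.8.1
print(sys.version)
print(torch.__version__)
print(torch.cuda.is_available())  # should print True on a CUDA-capable machine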

2.2. Dependencies

bash dataset/prepare/download_glove.sh

2.3. Datasets

We use two 3D human motion-language datasets: HumanML3D and KIT-ML. For both datasets, you can find the details as well as the download links [here].

Taking HumanML3D as an example, the file directory should look like this (a quick sanity check follows the listing):

./dataset/HumanML3D/
├── new_joint_vecs/
├── texts/
├── Mean.npy # same as in [HumanML3D](https://github.com/EricGuo5513/HumanML3D) 
├── Std.npy # same as in [HumanML3D](https://github.com/EricGuo5513/HumanML3D) 
├── train.txt
├── val.txt
├── test.txt
├── train_val.txt
└── all.txt
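
A quick sanity check of the prepared folder (a sketch; the 263-dimensional feature size applies to the 22-joint HumanML3D features, and the paths follow the layout above):

import os
import numpy as np

root = './dataset/HumanML3D'

# Normalization statistics, same files as in the official HumanML3D repo
mean = np.load(os.path.join(root, 'Mean.npy'))
std = np.load(os.path.join(root, 'Std.npy'))
print(mean.shape, std.shape)  # expected: (263,) (263,)

# Load the first motion listed in train.txt and check its per-frame feature size
with open(os.path.join(root, 'train.txt')) as f:
    name = f.readline().strip()
motion = np.load(os.path.join(root, 'new_joint_vecs', name + '.npy'))
print(motion.shape)  # expected: (num_frames, 263)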

2.4. Motion & text feature extractors

We use the same extractors provided by t2m to evaluate our generated motions. Please download the extractors.

bash dataset/prepare/download_extractor.sh

2.5. Pre-trained models

The pretrained model files will be stored in the 'pretrained' folder:

bash dataset/prepare/download_model.sh

2.6. Render SMPL mesh (optional)

If you want to render the generated motion, you need to install:

sudo sh dataset/prepare/download_smpl.sh
conda install -c menpo osmesa
conda install h5py
conda install -c conda-forge shapely pyrender trimesh mapbox_earcut

3. Quick Start

A quick-start guide on how to use our code is available in demo.ipynb.

(Demo notebook preview)

4. Train

Note that for the KIT dataset, you only need to set '--dataname kit'.

4.1. VQ-VAE

The results are saved in the folder output.

VQ training
python3 train_vq.py \
--batch-size 256 \
--lr 2e-4 \
--total-iter 300000 \
--lr-scheduler 200000 \
--nb-code 512 \
--down-t 2 \
--depth 3 \
--dilation-growth-rate 3 \
--out-dir output \
--dataname t2m \
--vq-act relu \
--quantizer ema_reset \
--loss-vel 0.5 \
--recons-loss l1_smooth \
--exp-name VQVAE

4.2. GPT

The results are saved in the folder output.

GPT training
python3 train_t2m_trans.py  \
--exp-name GPT \
--batch-size 128 \
--num-layers 9 \
--embed-dim-gpt 1024 \
--nb-code 512 \
--n-head-gpt 16 \
--block-size 51 \
--ff-rate 4 \
--drop-out-rate 0.1 \
--resume-pth output/VQVAE/net_last.pth \
--vq-name VQVAE \
--out-dir output \
--total-iter 300000 \
--lr-scheduler 150000 \
--lr 0.0001 \
--dataname t2m \
--down-t 2 \
--depth 3 \
--quantizer ema_reset \
--eval-iter 10000 \
--pkeep 0.5 \
--dilation-growth-rate 3 \
--vq-act relu

5. Evaluation

5.1. VQ-VAE

VQ eval
python3 VQ_eval.py \
--batch-size 256 \
--lr 2e-4 \
--total-iter 300000 \
--lr-scheduler 200000 \
--nb-code 512 \
--down-t 2 \
--depth 3 \
--dilation-growth-rate 3 \
--out-dir output \
--dataname t2m \
--vq-act relu \
--quantizer ema_reset \
--loss-vel 0.5 \
--recons-loss l1_smooth \
--exp-name TEST_VQVAE \
--resume-pth output/VQVAE/net_last.pth

5.2. GPT

GPT eval

Following the evaluation protocol of text-to-motion, we evaluate our model 20 times and report the average result. Because of the multimodality metric, for which we need to generate 30 motions from the same text, the evaluation takes a long time.

python3 GPT_eval_multi.py  \
--exp-name TEST_GPT \
--batch-size 128 \
--num-layers 9 \
--embed-dim-gpt 1024 \
--nb-code 512 \
--n-head-gpt 16 \
--block-size 51 \
--ff-rate 4 \
--drop-out-rate 0.1 \
--resume-pth output/VQVAE/net_last.pth \
--vq-name VQVAE \
--out-dir output \
--total-iter 300000 \
--lr-scheduler 150000 \
--lr 0.0001 \
--dataname t2m \
--down-t 2 \
--depth 3 \
--quantizer ema_reset \
--eval-iter 10000 \
--pkeep 0.5 \
--dilation-growth-rate 3 \
--vq-act relu \
--resume-trans output/GPT/net_best_fid.pth

6. SMPL Mesh Rendering


You should pass the npy folder path and the motion names. Here is an example:

python3 render_final.py --filedir output/TEST_GPT/ --motion-list 000019 005485

7. Acknowledgement

We appreciate help from:

8. ChangLog

  • 2023/02/19: add the Hugging Face Space demo for both skeleton and SMPL mesh visualization.

t2m-gpt's People

Contributors

jiro-zhang, linghaochan, mael-zys, vumichien, xishen0220


t2m-gpt's Issues

question about vq-vae reconstruction loss

Hi, I found that the velocity calculation in the forward_vel function in losses.py may not use the correct features:

def forward_vel(self, motion_pred, motion_gt):
    loss = self.Loss(motion_pred[..., 4 : (self.nb_joints - 1) * 3 + 4], motion_gt[..., 4 : (self.nb_joints - 1) * 3 + 4])
    return loss

In the code, the data slice is [4 : (self.nb_joints - 1) * 3 + 4], which should be the rotation-invariant joint positions, not the velocities.
Did I miss something?
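
For reference, a sketch of the per-frame HumanML3D feature layout as it is commonly documented (the offsets below are assumptions for illustration, not taken from this repo's code):

# Hypothetical helper illustrating the commonly documented 263-dim HumanML3D layout
# for nb_joints = 22; double-check the offsets against the dataset / repo code.
def feature_slices(nb_joints=22):
    j = nb_joints
    slices = {}
    slices['root_rot_vel'] = (0, 1)                     # root angular velocity around the y-axis
    slices['root_lin_vel'] = (1, 3)                     # root linear velocity on the xz-plane
    slices['root_height'] = (3, 4)                      # root height
    start = 4
    slices['ric_positions'] = (start, start + (j - 1) * 3)   # rotation-invariant joint positions (the slice above)
    start += (j - 1) * 3
    slices['rot6d'] = (start, start + (j - 1) * 6)           # 6D joint rotations
    start += (j - 1) * 6
    slices['local_velocities'] = (start, start + j * 3)      # per-joint velocities
    start += j * 3
    slices['foot_contacts'] = (start, start + 4)             # binary foot-contact labels
    return slices

print(feature_slices())  # the last slice should end at 263 for 22 joints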

training stuck

When I tried to train the VQ-VAE, the program got stuck on line 52 of train_vq.py, which is for log printing. I added print(1) after line 52, but the "1" was never printed. Can you help me with some advice?

I checked again and found that the program was stuck on line 21 of evaluator_wrapper.py:
checkpoint = torch.load(pjoin(opt.checkpoints_dir, opt.dataset_name, 'text_mot_match', 'model', 'finest.tar'), map_location=opt.device)
How can I fix this?

Use Hugging Face

Hello. I'm wondering why the 'space demo' is not available?

Thanks in advance

Dataset difference may exist between your work and official HumanML3D

Hi, thanks for this attractive work!

I ran into a NaN problem when evaluating the VQ-VAE, and I found that it is caused by HumanML3D data that contains some NaN motion files.

However, I didn't see any preprocessing in your repo that handles the NaN problem. Does this mean the dataset you used is different from the official HumanML3D dataset?

Besides, with the dataset I processed according to the official HumanML3D (dropping the two raw motion files that contain NaN), your provided pretrained VQ-VAE gives an FID of about 0.090, which is higher than the 0.070 reported in your paper.
Does this also imply there is a minor difference between your dataset and the official HumanML3D dataset?

Thank you!

Below is the NaN data in the official HumanML3D: https://github.com/EricGuo5513/HumanML3D/blob/main/cal_mean_variance.ipynb
The notebook output shows that 007975.npy and M007975.npy contain NaN.
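
For anyone hitting the same issue, a sketch for locating NaN motion files in a prepared HumanML3D folder (the path and layout are assumptions based on the dataset section above):

import glob
import os
import numpy as np

# Scan all motion feature files for NaN values
for path in sorted(glob.glob('./dataset/HumanML3D/new_joint_vecs/*.npy')):
    motion = np.load(path)
    if np.isnan(motion).any():
        print('NaN found in', os.path.basename(path))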

Unable to reproduce the results using the pretrained weights

Thank you for sharing the great work!

I have downloaded the pretrained weights as instructed, and used them to evaluate your proposed method on HumanML3D.
All the numbers seem to be somewhere in the vicinity of the reported results, with the exception of the FID.

INFO FID. 0.320, conf. 0.024, Diversity. 9.957, conf. 0.072, TOP1. 0.496, conf. 0.011, TOP2. 0.693, conf. 0.007, TOP3. 0.796, conf. 0.008, Matching. 2.894, conf. 0.035, Multi. 1.636, conf. 0.040

I used the "net_best_fid.pth" file for both the VQ-VAE and the Transformer. It would be lovely if you could enlighten me on how to obtain the results reported in the paper. Thank you!

Difference between your work and TM2T.

Hi, thanks for your awesome work.

I found that TM2T (TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts) from ECCV 2022 is very similar to your work. You both employ a VQ-VAE and run inference with a GPT. TM2T further applies inverse alignment, while your work does not. Your pipeline is simpler than TM2T's, and your results are better.

Why are your results significantly better than TM2T in Table 2 (FID 0.737 vs. 3.599)? In TM2T Table 1, the ablation without the inverse alignment is worse.

Error in running demo code

OS: Debian 12
Graphics card: RTX 3060
Nvidia driver version: 530.30.20
CUDA version: 12.1

Code from: https://colab.research.google.com/drive/1Vy69w2q2d-Hg19F-KibqG0FRdpSj3L4O?usp=sharing#scrollTo=bz1iUfq3O4v8

# change the text here
clip_text = ["a person is jumping"]



import sys
sys.argv = ['GPT_eval_multi.py']
import options.option_transformer as option_trans
args = option_trans.get_args_parser()

args.dataname = 't2m'
args.resume_pth = 'pretrained/VQVAE/net_last.pth'
args.resume_trans = 'pretrained/VQTransformer_corruption05/net_best_fid.pth'
args.down_t = 2
args.depth = 3
args.block_size = 51
import clip
import torch
import numpy as np
import models.vqvae as vqvae
import models.t2m_trans as trans
import warnings
warnings.filterwarnings('ignore')

## load clip model and datasets
clip_model, clip_preprocess = clip.load("ViT-B/32", device=torch.device('cuda'), jit=False, download_root='./')  # Must set jit=False for training
clip.model.convert_weights(clip_model)  # Actually this line is unnecessary since clip by default already on float16
clip_model.eval()
for p in clip_model.parameters():
    p.requires_grad = False

net = vqvae.HumanVQVAE(args, ## use args to define different parameters in different quantizers
                       args.nb_code,
                       args.code_dim,
                       args.output_emb_width,
                       args.down_t,
                       args.stride_t,
                       args.width,
                       args.depth,
                       args.dilation_growth_rate)


trans_encoder = trans.Text2Motion_Transformer(num_vq=args.nb_code, 
                                embed_dim=1024, 
                                clip_dim=args.clip_dim, 
                                block_size=args.block_size, 
                                num_layers=9, 
                                n_head=16, 
                                drop_out_rate=args.drop_out_rate, 
                                fc_rate=args.ff_rate)


print ('loading checkpoint from {}'.format(args.resume_pth))
ckpt = torch.load(args.resume_pth, map_location='cpu')
net.load_state_dict(ckpt['net'], strict=True)
net.eval()
net.cuda()

print ('loading transformer checkpoint from {}'.format(args.resume_trans))
ckpt = torch.load(args.resume_trans, map_location='cpu')
trans_encoder.load_state_dict(ckpt['trans'], strict=True)
trans_encoder.eval()
trans_encoder.cuda()

mean = torch.from_numpy(np.load('./checkpoints/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/mean.npy')).cuda()
std = torch.from_numpy(np.load('./checkpoints/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/std.npy')).cuda()

text = clip.tokenize(clip_text, truncate=True).cuda()
feat_clip_text = clip_model.encode_text(text).float()
index_motion = trans_encoder.sample(feat_clip_text[0:1], False)
pred_pose = net.forward_decoder(index_motion)

from utils.motion_process import recover_from_ric
pred_xyz = recover_from_ric((pred_pose*std+mean).float(), 22)
xyz = pred_xyz.reshape(1, -1, 22, 3)

np.save('motion.npy', xyz.detach().cpu().numpy())

import visualization.plot_3d_global as plot_3d
pose_vis = plot_3d.draw_to_batch(xyz.detach().cpu().numpy(),clip_text, ['example.gif'])

Error:

(T2M-GPT) ck@deb:~/Work/T2M-GPT$ python ./wl_test.py 
loading checkpoint from pretrained/VQVAE/net_last.pth
loading transformer checkpoint from pretrained/VQTransformer_corruption05/net_best_fid.pth
Traceback (most recent call last):
  File "./wl_test.py", line 68, in <module>
    feat_clip_text = clip_model.encode_text(text).float()
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/clip/model.py", line 348, in encode_text
    x = self.transformer(x)
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/clip/model.py", line 203, in forward
    return self.resblocks(x)
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/clip/model.py", line 190, in forward
    x = x + self.attention(self.ln_1(x))
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/clip/model.py", line 187, in attention
    return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]                       
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl                                                                                  
    result = self.forward(*input, **kwargs)                                                          
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 980, in forward                                                                                 
    return F.multi_head_attention_forward(                                                           
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/functional.py", line 4790, in multi_head_attention_forward                                                                   
    attn_output_weights = torch.bmm(q, k.transpose(1, 2))                                            
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

A Colab result inconsistent with the paper

The Colab demo result for the following prompt looks very different from (and worse than) Figure 1 in the paper. Is the Colab / Notebook demo configured the same way as in the paper?

"a person walks quickly and intentionally in a zig-zag pattern forward"

Attached is the GIF generated from the Colab demo. As we can see, the character strafes to the left abruptly before turning to the right and walking forward. The expected result, as shown in Figure 1 of the paper, is a zigzag walk.

output

Maximum length of the code index sequence

Hello there!

In your paper you mention that: "The maximum length of the code index sequence is T = 50."

Is this number (50) based on the maximum length of the generated motion (196 frames)?
Consequently, if we want to re-train the model with double the generated motion length, should we also double T from 50 to 100?

Thank you in advance!
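
For context, a back-of-the-envelope check, assuming the VQ-VAE downsamples time by a factor of 2^down_t (a factor of 4 with the default --down-t 2) and a maximum motion length of 196 frames:

# Hypothetical sanity check: code index sequence length vs. motion length
down_t = 2
max_motion_frames = 196
max_tokens = max_motion_frames // (2 ** down_t)
print(max_tokens)  # 49, consistent (up to rounding / an end token) with the reported T = 50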

Generate motion results in bvh format?

There are many npy-to-bvh scripts that support the Human3.6M skeleton, e.g. https://github.com/HW140701/VideoTo3dPoseAndBvh,
but there is no support for HumanML3D yet,
so it is difficult to make use of the motion results generated by T2M-GPT, which are in numpy format.
Converting the npy files into bvh would make them usable in Blender, MMD and more.
Does such a script exist? Or is it possible to generate the results in bvh format directly?

[T2M-GPT] A loss question about VQVAE

This is outstanding work, and it has greatly aroused my interest.
But I have a small question: why is loss_embedding omitted during the training of the VQ-VAE?
It seems that only the reconstruction loss and the commitment loss are used in the loss function.
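
A likely explanation (an assumption on my part, not a confirmation from the authors): with --quantizer ema_reset the codebook is updated by an exponential moving average rather than by gradients, so the usual codebook/embedding loss is replaced by the EMA update and only the reconstruction and commitment terms remain in the objective. A minimal sketch of that idea:

import torch

def ema_codebook_update(codebook, cluster_size, embed_sum, z_e, codes, decay=0.99, eps=1e-5):
    # One EMA step for a (K, D) codebook given encoder outputs z_e of shape (N, D)
    # and their assigned code indices codes of shape (N,). Hypothetical helper, for illustration only.
    K, D = codebook.shape
    onehot = torch.nn.functional.one_hot(codes, K).type(z_e.dtype)   # (N, K)
    cluster_size.mul_(decay).add_(onehot.sum(0), alpha=1 - decay)    # EMA of code usage counts
    embed_sum.mul_(decay).add_(onehot.t() @ z_e, alpha=1 - decay)    # EMA of summed encoder outputs per code
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + K * eps) * n              # Laplace smoothing of the counts
    codebook.copy_(embed_sum / smoothed.unsqueeze(1))                # codebook moves toward its assigned features
    # No gradient-based codebook loss is needed; the training objective keeps only
    # the reconstruction term and the commitment term ||z_e - sg(z_q)||^2.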

Error when rendering SMPL

I'm getting an unpickling error when trying to render SMPL.

t2m-gpt/visualize/joints2smpl/src/prior.py:128, in MaxMixturePrior.__init__(self, prior_folder, num_gaussians, dtype, epsilon, use_merged, **kwargs)
126 with open(full_gmm_fn, 'rb') as f:
127 print(os.path.exists(full_gmm_fn))
--> 128 gmm = pickle.load(f, encoding='latin1')
130 if type(gmm) == dict:
131 means = gmm['means'].astype(np_dtype)

UnpicklingError: the STRING opcode argument must be quoted

How should I resolve this error?

export onnx model

What hardware was used for training? We tried to export the model to ONNX to speed up inference, but it ran out of memory on a 12 GB graphics card. Are there any other ways to accelerate it?

Getting an issue while evaluating the model

def plot_3d_motion(save_path, kinematic_tree, joints, title, dataset, figsize=(10, 10), fps=120, radius=3,
                   vis_mode='default', gt_frames=[]):
    matplotlib.use('Agg')

    title = '\n'.join(wrap(title, 20))

    def init():
        ax.set_xlim3d([-radius / 2, radius / 2])
        ax.set_ylim3d([0, radius])
        ax.set_zlim3d([-radius / 3., radius * 2 / 3.])
        # print(title)
        fig.suptitle(title, fontsize=10)
        ax.grid(b=False)

    def plot_xzPlane(minx, maxx, miny, minz, maxz):
        ## Plot a plane XZ
        verts = [
            [minx, miny, minz],
            [minx, miny, maxz],
            [maxx, miny, maxz],
            [maxx, miny, minz]
        ]
        xz_plane = Poly3DCollection([verts])
        xz_plane.set_facecolor((0.5, 0.5, 0.5, 0.5))
        ax.add_collection3d(xz_plane)

        return ax

    # (seq_len, joints_num, 3)
    data = joints.copy().reshape(len(joints), -1, 3)

    # preparation related to specific datasets
    if dataset == 'kit':
        data *= 0.003  # scale for visualization
    elif dataset == 'humanml':
        data *= 1.3  # scale for visualization
    elif dataset in ['humanact12', 'uestc']:
        data *= -1.5  # reverse axes, scale for visualization

    fig = plt.figure(figsize=figsize)
    plt.tight_layout()
    ax = p3.Axes3D(fig)
    init()
    MINS = data.min(axis=0).min(axis=0)
    MAXS = data.max(axis=0).max(axis=0)
    colors_blue = ["#4D84AA", "#5B9965", "#61CEB9", "#34C1E2", "#80B79A"]  # GT color
    colors_orange = ["#DD5A37", "#D69E00", "#B75A39", "#FF6D00", "#DDB50E"]  # Generation color
    colors = colors_orange
    if vis_mode == 'upper_body':  # lower body taken fixed to input motion
        colors[0] = colors_blue[0]
        colors[1] = colors_blue[1]
    elif vis_mode == 'gt':
        colors = colors_blue

    frame_number = data.shape[0]

    # print(dataset.shape)

    height_offset = MINS[1]
    data[:, :, 1] -= height_offset
    trajec = data[:, 0, [0, 2]]

    data[..., 0] -= data[:, 0:1, 0]
    data[..., 2] -= data[:, 0:1, 2]

    # print(trajec.shape)

    def update(index):
        # print(index)
        ax.lines = []
        ax.collections = []
        ax.view_init(elev=120, azim=-90)
        ax.dist = 7.5
        # ax =
        plot_xzPlane(MINS[0] - trajec[index, 0], MAXS[0] - trajec[index, 0], 0, MINS[2] - trajec[index, 1],
                     MAXS[2] - trajec[index, 1])
        # ax.scatter(dataset[index, :22, 0], dataset[index, :22, 1], dataset[index, :22, 2], color='black', s=3)

        # if index > 1:
        #     ax.plot3D(trajec[:index, 0] - trajec[index, 0], np.zeros_like(trajec[:index, 0]),
        #               trajec[:index, 1] - trajec[index, 1], linewidth=1.0,
        #               color='blue')
        # #             ax = plot_xzPlane(ax, MINS[0], MAXS[0], 0, MINS[2], MAXS[2])

        used_colors = colors_blue if index in gt_frames else colors
        for i, (chain, color) in enumerate(zip(kinematic_tree, used_colors)):
            if i < 5:
                linewidth = 4.0
            else:
                linewidth = 2.0
            ax.plot3D(data[index, chain, 0], data[index, chain, 1], data[index, chain, 2], linewidth=linewidth,
                      color=color)
        #         print(trajec[:index, 0].shape)

        plt.axis('off')
        ax.set_xticklabels([])
        ax.set_yticklabels([])
        ax.set_zticklabels([])

    ani = FuncAnimation(fig, update, frames=frame_number, interval=1000 / fps, repeat=False)

    writer = FFMpegFileWriter(fps=fps)

    ani.save(save_path, fps=fps)

    ani = FuncAnimation(fig, update, frames=frame_number, interval=1000 / fps, repeat=False, init_func=init)

    ani.save(save_path, writer='pillow', fps=1000 / fps)

    plt.close()

The error:
File "/usr/local/lib/python3.10/site-packages/matplotlib/animation.py", line 1767, in _draw_frame
self._drawn_artists = self._func(framedata, *self._args)
File "/content/motion-diffusion-model/data_loaders/humanml/utils/plot_script.py", line 95, in update
ax.lines = []
AttributeError: can't set attribute 'lines'

training from scratch

For VQVAE, I have reproduced the test results from the released checkpoints, but I cannot train one with similar performance by myself.

I use the following command:
python3 train_vq.py \
--batch-size 256 \
--lr 2e-4 \
--total-iter 300000 \
--lr-scheduler 200000 \
--nb-code 512 \
--down-t 2 \
--depth 3 \
--dilation-growth-rate 3 \
--out-dir output \
--dataname t2m \
--vq-act relu \
--quantizer ema_reset \
--loss-vel 0.5 \
--recons-loss l1_smooth \
--exp-name VQVAE

After training, I only got FID ≈ 0.11 for both net_last.pth and net_best_fid.pth.

Sequences duration

Hi,

Thanks for this work, it's impressive!
I have two questions. First, can I choose the sequence length before generating it?
Second, can I fine-tune it on a certain type of actions (for example, football actions)?

Thank you in advance.

The result of real motion

Hello. I'm wondering how to get the results for real motion. The FID should be 0 if you use the same data as the test data; however, the results show 0.002. Do you know where this comes from?

Thanks in advance

Controlling seed

I got the demo up and running, and it is working like a charm. Is there a way to set a seed for the outputs, e.g. to draw different samples from the same prompt?
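
A sketch of one way to do this in the demo script (assuming the sampling in trans_encoder.sample is stochastic and driven by PyTorch's global RNG; feat_clip_text, trans_encoder and net refer to the demo code above and are assumptions here):

import numpy as np
import torch

def set_seed(seed):
    # Fix the global RNGs so repeated runs with the same seed give the same sample
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)

# e.g. draw three different motions for the same text features:
# for seed in (0, 1, 2):
#     set_seed(seed)
#     index_motion = trans_encoder.sample(feat_clip_text[0:1], False)
#     pred_pose = net.forward_decoder(index_motion)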

Motion sequences relevance to input texts

Dear @Mael-zys ,

Thanks for sharing this work. While I was playing around with a random input: "a man rises from the ground, walks in a circle and dances.", I got a quite weird result where the rising action is totally ignored (shown below with the 1st frame captured). However, when I changed my text prompts to your given text: "a man rises from the ground, walks in a circle and sits back down on the ground.", the rising behavior is clearly shown in the first frame. Are we supposed to get this behavior?
end_with_dance

extractor model for kit

It seems that the extractor model "VQVAEV3_CB1024_CMT_H1024_NRES3" for KIT is missing. Can you provide it?

How were the meta data for the mean and standard deviation generated?

It looks like the predicted joint positions are denormalized using "mean.npy" and "std.npy" stored in the folder 'checkpoints/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta' for the T2M dataset. How were these two normalization files generated? Were they computed from the HumanML3D training dataset directly, or accumulated as statistics during training? Can you share the scripts that generate these meta files?
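
One common way such statistics are produced (a sketch under the assumption that they are simple per-dimension statistics over the training motion features, which may differ from the authors' actual script):

import glob
import numpy as np

# Stack all training motion features and compute per-dimension mean/std.
# Paths and layout are assumptions based on the dataset section above.
files = sorted(glob.glob('./dataset/HumanML3D/new_joint_vecs/*.npy'))
data = np.concatenate([np.load(f) for f in files], axis=0)  # (total_frames, 263)

mean = data.mean(axis=0)
std = data.std(axis=0)

np.save('mean.npy', mean)
np.save('std.npy', std)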

Generate more than 50 frames

Thank you for your amazing work.
From my understanding, the model supports only 50 frames, correct?
Is there an easy way to retrain it to generate a dynamic range of lengths?

Can this really run with just 32GB of VRAM?

I tried VQ-VAE training with the parameters suggested in the documentation (however, since we are still building the HumanML3D dataset, we used the '--dataname kit' option).
However, 'RuntimeError: Unable to find a valid cuDNN algorithm to run convolution' occurred.
This is commonly known as an error that occurs when running out of VRAM.
The error occurred even though the GPU used for training was an RTX A6000 with 48GB of VRAM, which is larger than the suggested 32GB.
What's even more puzzling is that the same error occurs even when the batch size is drastically lowered to 1.
The detailed error message is as follows:

Traceback (most recent call last):
  File "train_vq.py", line 116, in <module>
    loss.backward()
  File "/root/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

continuous text-to-motion generation

Hi, I see that the MDM method generates motion from a single textual prompt, but your method can generate motions from continuous texts. How do you split the HumanML3D and KIT datasets?

question about motion representations

In the paper, each pose is represented by (ṙ^a, ṙ^x, ṙ^z, r^y, j^p, j^v, j^r, c^f). Every parameter has an explanation except r^y.
What is the meaning of r^y?

ValueError: Imaginary component

Thanks for sharing such amazing work!

While running train_vq.py, after the warm-up iterations I am getting the following error:

2023-04-14 22:31:45,883 INFO Training on t2m, motions are with 22 joints
Reading checkpoints/t2m/Comp_v6_KLD005/opt.txt
Loading Evaluation Model Wrapper (Epoch 28) Completed!!
100%|██████████| 23384/23384 [00:22<00:00, 1033.65it/s]
  0%|          | 0/1460 [00:00<?, ?it/s]Total number of motions 20942
100%|██████████| 1460/1460 [00:02<00:00, 497.87it/s]
Pointer Pointing at 0
2023-04-14 22:32:37,786 INFO Warmup. Iter 200 :  lr 0.00004 	 Commit. 0.29882 	 PPL. 82.50 	 Recons.  0.70914
2023-04-14 22:32:59,786 INFO Warmup. Iter 400 :  lr 0.00008 	 Commit. 1.13799 	 PPL. 128.26 	 Recons.  0.50615
2023-04-14 22:33:22,425 INFO Warmup. Iter 600 :  lr 0.00012 	 Commit. 2.25313 	 PPL. 217.44 	 Recons.  0.39540
2023-04-14 22:33:43,956 INFO Warmup. Iter 800 :  lr 0.00016 	 Commit. 3.23983 	 PPL. 270.35 	 Recons.  0.32911
Traceback (most recent call last):
  File "/media/npapargyr/C07AF5B07AF5A2F8/WD_2TB_Elements/t2m-gpt/train_vq.py", line 132, in <module>
    best_fid, best_iter, best_div, best_top1, best_top2, best_top3, best_matching, writer, logger = eval_trans.evaluation_vqvae(
  File "/home/npapargyr/anaconda3/envs/onnx/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/media/npapargyr/C07AF5B07AF5A2F8/WD_2TB_Elements/t2m-gpt/utils/eval_trans.py", line 98, in evaluation_vqvae
    fid = calculate_frechet_distance(gt_mu, gt_cov, mu, cov)
  File "/media/npapargyr/C07AF5B07AF5A2F8/WD_2TB_Elements/t2m-gpt/utils/eval_trans.py", line 547, in calculate_frechet_distance
    raise ValueError('Imaginary component {}'.format(m))
ValueError: Imaginary component 2.1971031947393037e+113

Any ideas?

About the model's height and body shape

Thank you for doing such a great study.

I would like to ask one question: can the height of this human model be changed? If it can, please tell me how to do that.
I think you guys are doing really interesting research.

KIT VQ-VAE Training details

Hi, could you provide the opt.txt file for training VQ-VAE on KIT dataset? I am trying to reproduce the performance of the VQ reconstruction of T2M-GPT.

Number of motions computation

Hello, I have a question regarding the method used to compute the number of motions. While investigating the motion dataset and dataloader, it appears that the number of motions during training is counted as the total number of motion snippets across the training subset divided by the batch size. I can understand the reasoning behind this approach, but it leads to a significant reduction in the effective data size. Specifically, the number of motions in the training set was initially 23,384; after removing motions shorter than the window size of 64, it decreased to 20,942, and it was further reduced to 14,435 training motions by the aforementioned method.
In your paper, the number of motions and the total number of descriptions mentioned seem different from, and smaller than, those used for training.

Your clarification on this behavior would be greatly appreciated. Thank you.

95% confidence interval

Thank you for your amazing work.
I have a question: do you have the code for the "95% confidence interval" used in the evaluation function?
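
For reference, a common way to compute a 95% confidence interval over the 20 repeated evaluation runs (a sketch of standard practice, not necessarily the exact code used in this repo):

import numpy as np

def mean_and_conf95(values):
    # values: one metric value per repeated evaluation run (e.g. 20 FID values)
    values = np.asarray(values, dtype=np.float64)
    mean = values.mean()
    conf = 1.96 * values.std() / np.sqrt(len(values))  # normal-approximation half-width
    return mean, conf

fids = [0.14, 0.15, 0.13, 0.16, 0.14]  # dummy example values
print(mean_and_conf95(fids))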

Are there any ways to install osmesa without a conda environment?

It is not possible to install the osmesa library in Docker without conda, since it asks to install llvm-6.0 and there are problems with its installation too. Has anyone had similar problems? Is it possible to render a scene without these tools? And if I use conda, I get the following error:
ImportError: ('Unable to load OpenGL library', 'libgcrypt.so.11: cannot open shared object file: No such file or directory', '/home/miniconda39/envs/VQTrans/lib/libOSMesa.so.8', '/home/miniconda39/envs/VQTrans/lib/libOSMesa.so.8')

skeleton structure and indexes of body joints

Hi,

I notice that the motions in the KIT dataset have 21 joints. I would like to know the detailed skeleton structure and the indexes of the body joints for KIT.

Many Thanks for your reply and help.

Environment file

Could you please provide an environment.yml file compatible with CUDA 11.2+? My GPU does not support CUDA 10.x, which causes me a series of problems. I would be truly grateful. Thanks in advance.
