
t2m-gpt's Introduction

(CVPR 2023) T2M-GPT

Pytorch implementation of paper "T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations"

[Project Page] [Paper] [Notebook Demo] [HuggingFace] [Space Demo] [T2M-GPT+]

(Teaser figure)

If our project is helpful for your research, please consider citing:

@inproceedings{zhang2023generating,
  title={T2M-GPT: Generating Human Motion from Textual Descriptions with Discrete Representations},
  author={Zhang, Jianrong and Zhang, Yangsong and Cun, Xiaodong and Huang, Shaoli and Zhang, Yong and Zhao, Hongwei and Lu, Hongtao and Shen, Xi},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023},
}

Table of Contents

1. Visual Results (More results can be found in our project page)

Text: a man steps forward and does a handstand.
(Animated comparison: GT, T2M, MDM, MotionDiffuse, Ours)

Text: A man rises from the ground, walks in a circle and sits back down on the ground.
(Animated comparison: GT, T2M, MDM, MotionDiffuse, Ours)

2. Installation

2.1. Environment

Our model can be trained on a single V100-32G GPU.

conda env create -f environment.yml
conda activate T2M-GPT

The code was tested on Python 3.8 and PyTorch 1.8.1.
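
A minimal environment sanity check (a sketch; it only verifies the versions mentioned above and that CUDA is visible):

import sys
import torch

# Tested configuration: Python 3.8, PyTorch 1.8.1
print(sys.version)
print(torch.__version__)
print(torch.cuda.is_available())  # should print True on a CUDA-capable machine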

2.2. Dependencies

bash dataset/prepare/download_glove.sh

2.3. Datasets

We use two 3D human motion-language datasets: HumanML3D and KIT-ML. For both datasets, you can find the details as well as the download links [here].

Taking HumanML3D as an example, the file directory should look like this (a quick sanity check follows the listing):

./dataset/HumanML3D/
├── new_joint_vecs/
├── texts/
├── Mean.npy # same as in [HumanML3D](https://github.com/EricGuo5513/HumanML3D) 
├── Std.npy # same as in [HumanML3D](https://github.com/EricGuo5513/HumanML3D) 
├── train.txt
├── val.txt
├── test.txt
├── train_val.txt
└── all.txt
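
A quick sanity check of the prepared folder (a sketch; the 263-dimensional feature size applies to the 22-joint HumanML3D features, and the paths follow the layout above):

import os
import numpy as np

root = './dataset/HumanML3D'

# Normalization statistics, same files as in the official HumanML3D repo
mean = np.load(os.path.join(root, 'Mean.npy'))
std = np.load(os.path.join(root, 'Std.npy'))
print(mean.shape, std.shape)  # expected: (263,) (263,)

# Load the first motion listed in train.txt and check its per-frame feature size
with open(os.path.join(root, 'train.txt')) as f:
    name = f.readline().strip()
motion = np.load(os.path.join(root, 'new_joint_vecs', name + '.npy'))
print(motion.shape)  # expected: (num_frames, 263)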

2.4. Motion & text feature extractors

We use the same extractors provided by t2m to evaluate our generated motions. Please download the extractors.

bash dataset/prepare/download_extractor.sh

2.5. Pre-trained models

The pretrained model files will be stored in the 'pretrained' folder:

bash dataset/prepare/download_model.sh

2.6. Render SMPL mesh (optional)

If you want to render the generated motion, you need to install:

sudo sh dataset/prepare/download_smpl.sh
conda install -c menpo osmesa
conda install h5py
conda install -c conda-forge shapely pyrender trimesh mapbox_earcut

3. Quick Start

A quick-start guide on how to use our code is available in demo.ipynb.

(Demo notebook preview)

4. Train

Note that for the KIT dataset, you only need to set '--dataname kit'.

4.1. VQ-VAE

The results are saved in the folder output.

VQ training
python3 train_vq.py \
--batch-size 256 \
--lr 2e-4 \
--total-iter 300000 \
--lr-scheduler 200000 \
--nb-code 512 \
--down-t 2 \
--depth 3 \
--dilation-growth-rate 3 \
--out-dir output \
--dataname t2m \
--vq-act relu \
--quantizer ema_reset \
--loss-vel 0.5 \
--recons-loss l1_smooth \
--exp-name VQVAE

4.2. GPT

The results are saved in the folder output.

GPT training
python3 train_t2m_trans.py  \
--exp-name GPT \
--batch-size 128 \
--num-layers 9 \
--embed-dim-gpt 1024 \
--nb-code 512 \
--n-head-gpt 16 \
--block-size 51 \
--ff-rate 4 \
--drop-out-rate 0.1 \
--resume-pth output/VQVAE/net_last.pth \
--vq-name VQVAE \
--out-dir output \
--total-iter 300000 \
--lr-scheduler 150000 \
--lr 0.0001 \
--dataname t2m \
--down-t 2 \
--depth 3 \
--quantizer ema_reset \
--eval-iter 10000 \
--pkeep 0.5 \
--dilation-growth-rate 3 \
--vq-act relu

5. Evaluation

5.1. VQ-VAE

VQ eval
python3 VQ_eval.py \
--batch-size 256 \
--lr 2e-4 \
--total-iter 300000 \
--lr-scheduler 200000 \
--nb-code 512 \
--down-t 2 \
--depth 3 \
--dilation-growth-rate 3 \
--out-dir output \
--dataname t2m \
--vq-act relu \
--quantizer ema_reset \
--loss-vel 0.5 \
--recons-loss l1_smooth \
--exp-name TEST_VQVAE \
--resume-pth output/VQVAE/net_last.pth

5.2. GPT

GPT eval

Following the evaluation protocol of text-to-motion, we evaluate our model 20 times and report the average result. Because of the multimodality metric, for which we need to generate 30 motions from the same text, the evaluation takes a long time.

python3 GPT_eval_multi.py  \
--exp-name TEST_GPT \
--batch-size 128 \
--num-layers 9 \
--embed-dim-gpt 1024 \
--nb-code 512 \
--n-head-gpt 16 \
--block-size 51 \
--ff-rate 4 \
--drop-out-rate 0.1 \
--resume-pth output/VQVAE/net_last.pth \
--vq-name VQVAE \
--out-dir output \
--total-iter 300000 \
--lr-scheduler 150000 \
--lr 0.0001 \
--dataname t2m \
--down-t 2 \
--depth 3 \
--quantizer ema_reset \
--eval-iter 10000 \
--pkeep 0.5 \
--dilation-growth-rate 3 \
--vq-act relu \
--resume-trans output/GPT/net_best_fid.pth

6. SMPL Mesh Rendering


You should pass the npy folder path and the motion names. Here is an example:

python3 render_final.py --filedir output/TEST_GPT/ --motion-list 000019 005485

7. Acknowledgement

We appreciate help from:

8. ChangLog

  • 2023/02/19: add the Hugging Face Space demo for both skeleton and SMPL mesh visualization.

t2m-gpt's People

Contributors

jiro-zhang, linghaochan, mael-zys, vumichien, xishen0220


t2m-gpt's Issues

question about vq-vae reconstruction loss

Hi, I found that the velocity calculation in the forward_vel function in losses.py may not use the correct features:

def forward_vel(self, motion_pred, motion_gt):
    loss = self.Loss(motion_pred[..., 4 : (self.nb_joints - 1) * 3 + 4], motion_gt[..., 4 : (self.nb_joints - 1) * 3 + 4])
    return loss

In the code, the data slice is [4 : (self.nb_joints - 1) * 3 + 4], which should be the rotation-invariant joint positions, not the velocities.
Did I miss something?
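
For reference, a sketch of the per-frame HumanML3D feature layout as it is commonly documented (the offsets below are assumptions for illustration, not taken from this repo's code):

# Hypothetical helper illustrating the commonly documented 263-dim HumanML3D layout
# for nb_joints = 22; double-check the offsets against the dataset / repo code.
def feature_slices(nb_joints=22):
    j = nb_joints
    slices = {}
    slices['root_rot_vel'] = (0, 1)                     # root angular velocity around the y-axis
    slices['root_lin_vel'] = (1, 3)                     # root linear velocity on the xz-plane
    slices['root_height'] = (3, 4)                      # root height
    start = 4
    slices['ric_positions'] = (start, start + (j - 1) * 3)   # rotation-invariant joint positions (the slice above)
    start += (j - 1) * 3
    slices['rot6d'] = (start, start + (j - 1) * 6)           # 6D joint rotations
    start += (j - 1) * 6
    slices['local_velocities'] = (start, start + j * 3)      # per-joint velocities
    start += j * 3
    slices['foot_contacts'] = (start, start + 4)             # binary foot-contact labels
    return slices

print(feature_slices())  # the last slice should end at 263 for 22 joints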

training stuck

When I tried to train the VQ-VAE, the program got stuck on line 52 of train_vq.py, which is for log printing. I added print(1) after line 52, but the "1" was never printed. Can you help me with some advice?

I checked again and found that the program was stuck on line 21 of evaluator_wrapper.py:
checkpoint = torch.load(pjoin(opt.checkpoints_dir, opt.dataset_name, 'text_mot_match', 'model', 'finest.tar'), map_location=opt.device)
How can I fix this?

Use Hugging Face

Hello. I'm wondering why the 'space demo' is not available?

Thanks in advance

Dataset difference may exist between your work and official HumanML3D

Hi, thanks for this attractive work!

I ran into a NaN problem when evaluating the VQ-VAE, and I found that it is caused by HumanML3D data that contains some NaN motion files.

However, I didn't see any preprocessing in your repo that handles the NaN problem. Does this mean the dataset you used is different from the official HumanML3D dataset?

Besides, with the dataset I processed according to the official HumanML3D (dropping the two raw motion files that contain NaN), your provided pretrained VQ-VAE gives an FID of about 0.090, which is higher than the 0.070 reported in your paper.
Does this also imply there is a minor difference between your dataset and the official HumanML3D dataset?

Thank you!

Below is the NaN data in the official HumanML3D: https://github.com/EricGuo5513/HumanML3D/blob/main/cal_mean_variance.ipynb
The notebook output shows that 007975.npy and M007975.npy contain NaN.
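
For anyone hitting the same issue, a sketch for locating NaN motion files in a prepared HumanML3D folder (the path and layout are assumptions based on the dataset section above):

import glob
import os
import numpy as np

# Scan all motion feature files for NaN values
for path in sorted(glob.glob('./dataset/HumanML3D/new_joint_vecs/*.npy')):
    motion = np.load(path)
    if np.isnan(motion).any():
        print('NaN found in', os.path.basename(path))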

Unable to reproduce the results using the pretrained weights

Thank you for sharing the great work!

I have downloaded the pretrained weights as instructed, and used them to evaluate your proposed method on HumanML3D.
All the numbers seem to be somewhere in the vicinity of the reported results, with the exception of the FID.

INFO FID. 0.320, conf. 0.024, Diversity. 9.957, conf. 0.072, TOP1. 0.496, conf. 0.011, TOP2. 0.693, conf. 0.007, TOP3. 0.796, conf. 0.008, Matching. 2.894, conf. 0.035, Multi. 1.636, conf. 0.040

I used the "net_best_fid.pth" file for both the VQ-VAE and the Transformer. It would be lovely if you could enlighten me on how to obtain the results reported in the paper. Thank you!

Difference between your work and TM2T.

Hi, thanks for your awesome work.

I found that TM2T (TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts) from ECCV 2022 is very similar to your work. You both employ a VQ-VAE and run inference with a GPT. TM2T further applies inverse alignment, while your work does not. Your pipeline is simpler than TM2T's, and your results are better.

Why are your results significantly better than TM2T in Table 2 (FID 0.737 vs. 3.599)? In TM2T Table 1, the ablation without the inverse alignment is worse.

Error in running demo code

OS: Debian 12
Graphics card: RTX 3060
Nvidia driver version: 530.30.20
CUDA version: 12.1

Code from: https://colab.research.google.com/drive/1Vy69w2q2d-Hg19F-KibqG0FRdpSj3L4O?usp=sharing#scrollTo=bz1iUfq3O4v8

# change the text here
clip_text = ["a person is jumping"]



import sys
sys.argv = ['GPT_eval_multi.py']
import options.option_transformer as option_trans
args = option_trans.get_args_parser()

args.dataname = 't2m'
args.resume_pth = 'pretrained/VQVAE/net_last.pth'
args.resume_trans = 'pretrained/VQTransformer_corruption05/net_best_fid.pth'
args.down_t = 2
args.depth = 3
args.block_size = 51
import clip
import torch
import numpy as np
import models.vqvae as vqvae
import models.t2m_trans as trans
import warnings
warnings.filterwarnings('ignore')

## load clip model and datasets
clip_model, clip_preprocess = clip.load("ViT-B/32", device=torch.device('cuda'), jit=False, download_root='./')  # Must set jit=False for training
clip.model.convert_weights(clip_model)  # Actually this line is unnecessary since clip by default already on float16
clip_model.eval()
for p in clip_model.parameters():
    p.requires_grad = False

net = vqvae.HumanVQVAE(args, ## use args to define different parameters in different quantizers
                       args.nb_code,
                       args.code_dim,
                       args.output_emb_width,
                       args.down_t,
                       args.stride_t,
                       args.width,
                       args.depth,
                       args.dilation_growth_rate)


trans_encoder = trans.Text2Motion_Transformer(num_vq=args.nb_code, 
                                embed_dim=1024, 
                                clip_dim=args.clip_dim, 
                                block_size=args.block_size, 
                                num_layers=9, 
                                n_head=16, 
                                drop_out_rate=args.drop_out_rate, 
                                fc_rate=args.ff_rate)


print ('loading checkpoint from {}'.format(args.resume_pth))
ckpt = torch.load(args.resume_pth, map_location='cpu')
net.load_state_dict(ckpt['net'], strict=True)
net.eval()
net.cuda()

print ('loading transformer checkpoint from {}'.format(args.resume_trans))
ckpt = torch.load(args.resume_trans, map_location='cpu')
trans_encoder.load_state_dict(ckpt['trans'], strict=True)
trans_encoder.eval()
trans_encoder.cuda()

mean = torch.from_numpy(np.load('./checkpoints/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/mean.npy')).cuda()
std = torch.from_numpy(np.load('./checkpoints/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta/std.npy')).cuda()

text = clip.tokenize(clip_text, truncate=True).cuda()
feat_clip_text = clip_model.encode_text(text).float()
index_motion = trans_encoder.sample(feat_clip_text[0:1], False)
pred_pose = net.forward_decoder(index_motion)

from utils.motion_process import recover_from_ric
pred_xyz = recover_from_ric((pred_pose*std+mean).float(), 22)
xyz = pred_xyz.reshape(1, -1, 22, 3)

np.save('motion.npy', xyz.detach().cpu().numpy())

import visualization.plot_3d_global as plot_3d
pose_vis = plot_3d.draw_to_batch(xyz.detach().cpu().numpy(),clip_text, ['example.gif'])

Error:

(T2M-GPT) ck@deb:~/Work/T2M-GPT$ python ./wl_test.py 
loading checkpoint from pretrained/VQVAE/net_last.pth
loading transformer checkpoint from pretrained/VQTransformer_corruption05/net_best_fid.pth
Traceback (most recent call last):
  File "./wl_test.py", line 68, in <module>
    feat_clip_text = clip_model.encode_text(text).float()
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/clip/model.py", line 348, in encode_text
    x = self.transformer(x)
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/clip/model.py", line 203, in forward
    return self.resblocks(x)
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/modules/container.py", line 119, in forward
    input = module(input)
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/clip/model.py", line 190, in forward
    x = x + self.attention(self.ln_1(x))
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/clip/model.py", line 187, in attention
    return self.attn(x, x, x, need_weights=False, attn_mask=self.attn_mask)[0]                       
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl                                                                                  
    result = self.forward(*input, **kwargs)                                                          
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/modules/activation.py", line 980, in forward                                                                                 
    return F.multi_head_attention_forward(                                                           
  File "/home/ck/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/nn/functional.py", line 4790, in multi_head_attention_forward                                                                   
    attn_output_weights = torch.bmm(q, k.transpose(1, 2))                                            
RuntimeError: CUDA error: CUBLAS_STATUS_NOT_SUPPORTED when calling `cublasGemmStridedBatchedExFix( handle, opa, opb, m, n, k, (void*)(&falpha), a, CUDA_R_16F, lda, stridea, b, CUDA_R_16F, ldb, strideb, (void*)(&fbeta), c, CUDA_R_16F, ldc, stridec, num_batches, CUDA_R_32F, CUBLAS_GEMM_DEFAULT_TENSOR_OP)`

A Colab result inconsistent with the paper

The Colab demo result for the following prompt looks very different from (and worse than) Figure 1 in the paper. Is the Colab / Notebook demo configured the same way as in the paper?

"a person walks quickly and intentionally in a zig-zag pattern forward"

Attached is the GIF generated from the Colab demo. As we can see, the character strafes to the left abruptly before turning to the right and walking forward. The expected result, as shown in Figure 1 of the paper, is a zigzag walk.

output

Maximum length of the code index sequence

Hello there!

In your paper you mention that: "The maximum length of the code index sequence is T = 50."

Is this number (50) based on the maximum length of the generated motion (196 frames)?
Consequently, if we want to re-train the model with double the generated motion length, should we also double T from 50 to 100?

Thank you in advance!
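
For context, a back-of-the-envelope check, assuming the VQ-VAE downsamples time by a factor of 2^down_t (a factor of 4 with the default --down-t 2) and a maximum motion length of 196 frames:

# Hypothetical sanity check: code index sequence length vs. motion length
down_t = 2
max_motion_frames = 196
max_tokens = max_motion_frames // (2 ** down_t)
print(max_tokens)  # 49, consistent (up to rounding / an end token) with the reported T = 50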

Generate motion results in bvh format?

There are many npy-to-bvh scripts that support the Human3.6M skeleton, e.g. https://github.com/HW140701/VideoTo3dPoseAndBvh,
but there is no support for HumanML3D yet,
so it is difficult to make use of the motion results generated by T2M-GPT, which are in numpy format.
Converting the npy files into bvh would make them usable in Blender, MMD and more.
Does such a script exist? Or is it possible to generate the results in bvh format directly?

[T2M-GPT] A loss question about VQVAE

This is outstanding work, and it has greatly aroused my interest.
But I have a small question: why is loss_embedding omitted during the training of the VQ-VAE?
It seems that only the reconstruction loss and the commitment loss are used in the loss function.
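
A likely explanation (an assumption on my part, not a confirmation from the authors): with --quantizer ema_reset the codebook is updated by an exponential moving average rather than by gradients, so the usual codebook/embedding loss is replaced by the EMA update and only the reconstruction and commitment terms remain in the objective. A minimal sketch of that idea:

import torch

def ema_codebook_update(codebook, cluster_size, embed_sum, z_e, codes, decay=0.99, eps=1e-5):
    # One EMA step for a (K, D) codebook given encoder outputs z_e of shape (N, D)
    # and their assigned code indices codes of shape (N,). Hypothetical helper, for illustration only.
    K, D = codebook.shape
    onehot = torch.nn.functional.one_hot(codes, K).type(z_e.dtype)   # (N, K)
    cluster_size.mul_(decay).add_(onehot.sum(0), alpha=1 - decay)    # EMA of code usage counts
    embed_sum.mul_(decay).add_(onehot.t() @ z_e, alpha=1 - decay)    # EMA of summed encoder outputs per code
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + K * eps) * n              # Laplace smoothing of the counts
    codebook.copy_(embed_sum / smoothed.unsqueeze(1))                # codebook moves toward its assigned features
    # No gradient-based codebook loss is needed; the training objective keeps only
    # the reconstruction term and the commitment term ||z_e - sg(z_q)||^2.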

Error when rendering SMPL

I'm getting an unpickling error when trying to render SMPL.

t2m-gpt/visualize/joints2smpl/src/prior.py:128, in MaxMixturePrior.__init__(self, prior_folder, num_gaussians, dtype, epsilon, use_merged, **kwargs)
126 with open(full_gmm_fn, 'rb') as f:
127 print(os.path.exists(full_gmm_fn))
--> 128 gmm = pickle.load(f, encoding='latin1')
130 if type(gmm) == dict:
131 means = gmm['means'].astype(np_dtype)

UnpicklingError: the STRING opcode argument must be quoted

How should I resolve this error?

export onnx model

What hardware was used for training? We tried to export the model to ONNX to speed up inference, but it ran out of memory on a 12 GB graphics card. Are there any other ways to accelerate it?

Getting an issue while evaluating the model

def plot_3d_motion(save_path, kinematic_tree, joints, title, dataset, figsize=(10, 10), fps=120, radius=3,
                   vis_mode='default', gt_frames=[]):
    matplotlib.use('Agg')

    title = '\n'.join(wrap(title, 20))

    def init():
        ax.set_xlim3d([-radius / 2, radius / 2])
        ax.set_ylim3d([0, radius])
        ax.set_zlim3d([-radius / 3., radius * 2 / 3.])
        # print(title)
        fig.suptitle(title, fontsize=10)
        ax.grid(b=False)

    def plot_xzPlane(minx, maxx, miny, minz, maxz):
        ## Plot a plane XZ
        verts = [
            [minx, miny, minz],
            [minx, miny, maxz],
            [maxx, miny, maxz],
            [maxx, miny, minz]
        ]
        xz_plane = Poly3DCollection([verts])
        xz_plane.set_facecolor((0.5, 0.5, 0.5, 0.5))
        ax.add_collection3d(xz_plane)

        return ax

    # (seq_len, joints_num, 3)
    data = joints.copy().reshape(len(joints), -1, 3)

    # preparation related to specific datasets
    if dataset == 'kit':
        data *= 0.003  # scale for visualization
    elif dataset == 'humanml':
        data *= 1.3  # scale for visualization
    elif dataset in ['humanact12', 'uestc']:
        data *= -1.5  # reverse axes, scale for visualization

    fig = plt.figure(figsize=figsize)
    plt.tight_layout()
    ax = p3.Axes3D(fig)
    init()
    MINS = data.min(axis=0).min(axis=0)
    MAXS = data.max(axis=0).max(axis=0)
    colors_blue = ["#4D84AA", "#5B9965", "#61CEB9", "#34C1E2", "#80B79A"]  # GT color
    colors_orange = ["#DD5A37", "#D69E00", "#B75A39", "#FF6D00", "#DDB50E"]  # Generation color
    colors = colors_orange
    if vis_mode == 'upper_body':  # lower body taken fixed to input motion
        colors[0] = colors_blue[0]
        colors[1] = colors_blue[1]
    elif vis_mode == 'gt':
        colors = colors_blue

    frame_number = data.shape[0]

    # print(dataset.shape)

    height_offset = MINS[1]
    data[:, :, 1] -= height_offset
    trajec = data[:, 0, [0, 2]]

    data[..., 0] -= data[:, 0:1, 0]
    data[..., 2] -= data[:, 0:1, 2]

    # print(trajec.shape)

    def update(index):
        # print(index)
        ax.lines = []
        ax.collections = []
        ax.view_init(elev=120, azim=-90)
        ax.dist = 7.5
        # ax =
        plot_xzPlane(MINS[0] - trajec[index, 0], MAXS[0] - trajec[index, 0], 0, MINS[2] - trajec[index, 1],
                     MAXS[2] - trajec[index, 1])
        # ax.scatter(dataset[index, :22, 0], dataset[index, :22, 1], dataset[index, :22, 2], color='black', s=3)

        # if index > 1:
        #     ax.plot3D(trajec[:index, 0] - trajec[index, 0], np.zeros_like(trajec[:index, 0]),
        #               trajec[:index, 1] - trajec[index, 1], linewidth=1.0,
        #               color='blue')
        # #             ax = plot_xzPlane(ax, MINS[0], MAXS[0], 0, MINS[2], MAXS[2])

        used_colors = colors_blue if index in gt_frames else colors
        for i, (chain, color) in enumerate(zip(kinematic_tree, used_colors)):
            if i < 5:
                linewidth = 4.0
            else:
                linewidth = 2.0
            ax.plot3D(data[index, chain, 0], data[index, chain, 1], data[index, chain, 2], linewidth=linewidth,
                      color=color)
        #         print(trajec[:index, 0].shape)

        plt.axis('off')
        ax.set_xticklabels([])
        ax.set_yticklabels([])
        ax.set_zticklabels([])

    ani = FuncAnimation(fig, update, frames=frame_number, interval=1000 / fps, repeat=False)

    writer = FFMpegFileWriter(fps=fps)

    ani.save(save_path, fps=fps)

    ani = FuncAnimation(fig, update, frames=frame_number, interval=1000 / fps, repeat=False, init_func=init)

    ani.save(save_path, writer='pillow', fps=1000 / fps)

    plt.close()

The error:
File "/usr/local/lib/python3.10/site-packages/matplotlib/animation.py", line 1767, in _draw_frame
self._drawn_artists = self._func(framedata, *self._args)
File "/content/motion-diffusion-model/data_loaders/humanml/utils/plot_script.py", line 95, in update
ax.lines = []
AttributeError: can't set attribute 'lines'

training from scratch

For VQVAE, I have reproduced the test results from the released checkpoints, but I cannot train one with similar performance by myself.

I use the following command:
python3 train_vq.py \
--batch-size 256 \
--lr 2e-4 \
--total-iter 300000 \
--lr-scheduler 200000 \
--nb-code 512 \
--down-t 2 \
--depth 3 \
--dilation-growth-rate 3 \
--out-dir output \
--dataname t2m \
--vq-act relu \
--quantizer ema_reset \
--loss-vel 0.5 \
--recons-loss l1_smooth \
--exp-name VQVAE

After training, I only got FID ≈ 0.11 for both net_last.pth and net_best_fid.pth.

Sequences duration

Hi,

Thanks for this work, it's impressive!
I have two questions. First, can I choose the sequence length before generating it?
Second, can I fine-tune it on a certain type of actions (for example, football actions)?

Thank you in advance.

The result of real motion

Hello. I'm wondering how to get the results for real motion. The FID should be 0 if you use the same data as the test data; however, the results show 0.002. Do you know where this comes from?

Thanks in advance

Controlling seed

I got the demo up and running, and it is working like a charm. Is there a way to set a seed for the outputs, e.g. to draw different samples from the same prompt?
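
A sketch of one way to do this in the demo script (assuming the sampling in trans_encoder.sample is stochastic and driven by PyTorch's global RNG; feat_clip_text, trans_encoder and net refer to the demo code above and are assumptions here):

import numpy as np
import torch

def set_seed(seed):
    # Fix the global RNGs so repeated runs with the same seed give the same sample
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    np.random.seed(seed)

# e.g. draw three different motions for the same text features:
# for seed in (0, 1, 2):
#     set_seed(seed)
#     index_motion = trans_encoder.sample(feat_clip_text[0:1], False)
#     pred_pose = net.forward_decoder(index_motion)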

Motion sequences relevance to input texts

Dear @Mael-zys ,

Thanks for sharing this work. While I was playing around with a random input: "a man rises from the ground, walks in a circle and dances.", I got a quite weird result where the rising action is totally ignored (shown below with the 1st frame captured). However, when I changed my text prompts to your given text: "a man rises from the ground, walks in a circle and sits back down on the ground.", the rising behavior is clearly shown in the first frame. Are we supposed to get this behavior?
end_with_dance

extractor model for kit

It seems that the extractor model "VQVAEV3_CB1024_CMT_H1024_NRES3" for KIT is missing. Can you provide it?

How were the meta data for the mean and standard deviation generated?

It looks like the predicted joint positions are denormalized using "mean.npy" and "std.npy" stored in the folder 'checkpoints/t2m/VQVAEV3_CB1024_CMT_H1024_NRES3/meta' for the T2M dataset. How were these two normalization files generated? Were they computed from the HumanML3D training dataset directly, or accumulated as statistics during training? Can you share the scripts that generate these meta files?
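
One common way such statistics are produced (a sketch under the assumption that they are simple per-dimension statistics over the training motion features, which may differ from the authors' actual script):

import glob
import numpy as np

# Stack all training motion features and compute per-dimension mean/std.
# Paths and layout are assumptions based on the dataset section above.
files = sorted(glob.glob('./dataset/HumanML3D/new_joint_vecs/*.npy'))
data = np.concatenate([np.load(f) for f in files], axis=0)  # (total_frames, 263)

mean = data.mean(axis=0)
std = data.std(axis=0)

np.save('mean.npy', mean)
np.save('std.npy', std)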

Generate more than 50 frames

Thank you for your amazing work.
From my understanding, the model supports only 50 frames, correct?
Is there an easy way to retrain it to generate a dynamic range of lengths?

Can this really run with just 32GB of VRAM?

I tried VQ-VAE training with the parameters suggested in the documentation (however, since we are still building the HumanML3D dataset, we used the '--dataname kit' option).
However, 'RuntimeError: Unable to find a valid cuDNN algorithm to run convolution' occurred.
This is commonly known as an error that occurs when running out of VRAM.
The error occurred even though the GPU used for training was an RTX A6000 with 48GB of VRAM, which is larger than the suggested 32GB.
What's even more puzzling is that the same error occurs even when the batch size is drastically lowered to 1.
The detailed error message is as follows:

Traceback (most recent call last):
  File "train_vq.py", line 116, in <module>
    loss.backward()
  File "/root/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/tensor.py", line 245, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/root/anaconda3/envs/T2M-GPT/lib/python3.8/site-packages/torch/autograd/__init__.py", line 145, in backward
    Variable._execution_engine.run_backward(
RuntimeError: Unable to find a valid cuDNN algorithm to run convolution

continuous text-to-motion generation

Hi, I see that the MDM method generates motion from a single textual prompt, but your method can generate motions from continuous texts. How do you split the HumanML3D and KIT datasets?

question about motion representations

In the paper, each pose is represented by (ṙ^a, ṙ^x, ṙ^z, r^y, j^p, j^v, j^r, c^f). Every parameter has an explanation except r^y.
What is the meaning of r^y?

ValueError: Imaginary component

Thanks for sharing such amazing work!

While running train_vq.py, after the warm-up iterations I am getting the following error:

2023-04-14 22:31:45,883 INFO Training on t2m, motions are with 22 joints
Reading checkpoints/t2m/Comp_v6_KLD005/opt.txt
Loading Evaluation Model Wrapper (Epoch 28) Completed!!
100%|██████████| 23384/23384 [00:22<00:00, 1033.65it/s]
  0%|          | 0/1460 [00:00<?, ?it/s]Total number of motions 20942
100%|██████████| 1460/1460 [00:02<00:00, 497.87it/s]
Pointer Pointing at 0
2023-04-14 22:32:37,786 INFO Warmup. Iter 200 :  lr 0.00004 	 Commit. 0.29882 	 PPL. 82.50 	 Recons.  0.70914
2023-04-14 22:32:59,786 INFO Warmup. Iter 400 :  lr 0.00008 	 Commit. 1.13799 	 PPL. 128.26 	 Recons.  0.50615
2023-04-14 22:33:22,425 INFO Warmup. Iter 600 :  lr 0.00012 	 Commit. 2.25313 	 PPL. 217.44 	 Recons.  0.39540
2023-04-14 22:33:43,956 INFO Warmup. Iter 800 :  lr 0.00016 	 Commit. 3.23983 	 PPL. 270.35 	 Recons.  0.32911
Traceback (most recent call last):
  File "/media/npapargyr/C07AF5B07AF5A2F8/WD_2TB_Elements/t2m-gpt/train_vq.py", line 132, in <module>
    best_fid, best_iter, best_div, best_top1, best_top2, best_top3, best_matching, writer, logger = eval_trans.evaluation_vqvae(
  File "/home/npapargyr/anaconda3/envs/onnx/lib/python3.8/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/media/npapargyr/C07AF5B07AF5A2F8/WD_2TB_Elements/t2m-gpt/utils/eval_trans.py", line 98, in evaluation_vqvae
    fid = calculate_frechet_distance(gt_mu, gt_cov, mu, cov)
  File "/media/npapargyr/C07AF5B07AF5A2F8/WD_2TB_Elements/t2m-gpt/utils/eval_trans.py", line 547, in calculate_frechet_distance
    raise ValueError('Imaginary component {}'.format(m))
ValueError: Imaginary component 2.1971031947393037e+113

Any ideas?

About the model's height and body shape

Thank you for doing such a great study.

I would like to ask one question: can the height of this human model be changed? If it can, please tell me how to do that.
I think you guys are doing really interesting research.

KIT VQ-VAE Training details

Hi, could you provide the opt.txt file for training VQ-VAE on KIT dataset? I am trying to reproduce the performance of the VQ reconstruction of T2M-GPT.

Number of motions computation

Hello, I have a question regarding the method used to compute the number of motions. While investigating the motion dataset and dataloader, it appears that the number of motions during training is counted as the total number of motion snippets across the training subset divided by the batch size. I can understand the reasoning behind this approach, but it leads to a significant reduction in the effective data size. Specifically, the number of motions in the training set was initially 23,384; after removing motions shorter than the window size of 64, it decreased to 20,942, and it was further reduced to 14,435 training motions by the aforementioned method.
In your paper, the number of motions and the total number of descriptions mentioned seem different from, and smaller than, those used for training.

Your clarification on this behavior would be greatly appreciated. Thank you.

95% confidence interval

Thank you for your amazing work.
I have a question: do you have the code for the "95% confidence interval" used in the evaluation function?
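
For reference, a common way to compute a 95% confidence interval over the 20 repeated evaluation runs (a sketch of standard practice, not necessarily the exact code used in this repo):

import numpy as np

def mean_and_conf95(values):
    # values: one metric value per repeated evaluation run (e.g. 20 FID values)
    values = np.asarray(values, dtype=np.float64)
    mean = values.mean()
    conf = 1.96 * values.std() / np.sqrt(len(values))  # normal-approximation half-width
    return mean, conf

fids = [0.14, 0.15, 0.13, 0.16, 0.14]  # dummy example values
print(mean_and_conf95(fids))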

Are there any ways to install osmesa without a conda environment?

It is not possible to install the osmesa library in Docker without conda, since it asks to install llvm-6.0 and there are problems with its installation too. Has anyone had similar problems? Is it possible to render a scene without these tools? And if I use conda, I get the following error:
ImportError: ('Unable to load OpenGL library', 'libgcrypt.so.11: cannot open shared object file: No such file or directory', '/home/miniconda39/envs/VQTrans/lib/libOSMesa.so.8', '/home/miniconda39/envs/VQTrans/lib/libOSMesa.so.8')

skeleton structure and indexes of body joints

Hi,

I notice that the motions in the KIT dataset have 21 joints. I would like to know the detailed skeleton structure and the indexes of the body joints for KIT.

Many Thanks for your reply and help.

Environment file

Could you please provide an environment.yml file compatible with CUDA 11.2+? My GPU does not support CUDA 10.x, which causes me a series of problems. I would be truly grateful. Thanks in advance.
