
Comments (13)

ghenter commented on July 17, 2024

Have you looked at gestures generated from the model you are training? If the visual quality is good, I wouldn't worry too much.

In general, there is an important disagreement between likelihood and human perception, where the log-likelihood loss is especially sensitive to insufficient diversity in model output, whereas human perception is much more sensitive to the presence of poor-quality output examples instead; see for example Theis et al. (2016). This is why it is common to sneakily "reduce the temperature" to generate better-looking examples from generative models in many papers, whether these models are flows, VAEs, or GANs. (If you are really interested, I have a publication that discusses temperature reduction of generative models and presents a generalisation of the same principle to Gaussian processes and Markov chains.)
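
For illustration, here is a minimal, hypothetical sketch of what temperature reduction amounts to for a flow-based model (placeholder code, not from this repository or the paper): the latent vector is simply drawn from a narrower Gaussian, N(0, T^2 I) with T < 1, before being mapped through the inverse flow.

```python
# Hypothetical sketch of temperature-reduced sampling from a flow-based model.
# `inverse_flow` and `latent_dim` are placeholder names, not from this repo.
import torch

def sample_with_temperature(inverse_flow, latent_dim, temperature=0.7, n_samples=1):
    # Draw z ~ N(0, T^2 I) instead of N(0, I); a smaller T trades output
    # diversity for fewer poor-quality samples.
    z = temperature * torch.randn(n_samples, latent_dim)
    return inverse_flow(z)  # map the latents back to data (motion) space
```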

In any case, we noticed the same discrepancy between training and validation-set losses that you are describing for the models we trained in our paper presented at EUROGRAPHICS: Even though our validation-set log-likelihood started getting worse early on during training (indicating overfitting), we found that subjective gesture quality was still improving. In the original manuscript we submitted, we actually included a discussion about this finding, but the reviewers requested that additional content be added to the paper discussion while maintaining the same page budget, so we had to drop it from the revised version of the article for lack of space. You can instead find curves of training and validation losses of different MoGlow locomotion and gesture models in our paper published at INNF+ 2020. These curves are more or less equivalent to the training logs that you are requesting, and should be reasonably comparable to the results that you are getting. That paper also contains some additional discussion of entropy-reduction schemes such as reducing the temperature.


rainofmine commented on July 17, 2024

@ghenter Thank you for the explanation.

I trained the model for nearly 50k steps (batch size 480, 6 GPUs). The training loss is about -200 and the validation loss is about 60k. I visualized the results and found that in most cases both hands stay around the chest and only move slightly, which is different from the results in your YouTube demo (wide range of arm movements, one-handed gestures, ...). Could you give me some advice on how to get good results, such as the network parameters, the number of training steps (or epochs), and other tricks? Or could you provide a pretrained model from your paper?

By the way, since the training time is so long, will a large batch size (16 or 32 GPUs) affect the training results?

Thanks!


ghenter commented on July 17, 2024

I trained the model for nearly 50k steps (...) I visualized the results and found that in most cases both hands stay around the chest and only move slightly

Try visualising the results at various checkpoints during training and see where you think the motions look the best. Lots of training is not necessarily the best for motion quality.

The paper states that we trained our MG models for 387 epochs; you can probably convert your training setup to epochs for comparison. The training curves for the FB-U model reported in our INNF+ paper only reach a 10k validation loss after 80k steps. We did find that motions were more vivid (but possibly also more idiosyncratic) earlier on during training, which is why we trained our FB gesture models for a shorter amount of time than the earlier MG gesture models.
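
As a rough back-of-the-envelope illustration of that conversion (assuming each training step processes one batch; the numbers below are placeholders, not values from the paper or from your setup):

```python
# Rough conversion between optimiser steps and epochs, assuming one step
# consumes one batch of training windows. All numbers are placeholders.
def steps_to_epochs(steps, batch_size, n_train_samples):
    return steps * batch_size / n_train_samples

print(steps_to_epochs(steps=50_000, batch_size=80, n_train_samples=12_000))  # ~333 epochs
```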

could you provide a pretrained model from your paper?

We are planning to upload pre-trained models matching those in our papers; see #4. Note, however, that such a pre-trained model cannot be used for the GENEA Challenge, since the models in the EUROGRAPHICS paper were trained on a different train/test split than that used for the challenge. (In particular, the models in the paper have been trained on motion that is in the test set for the challenge, and training on test examples is improper and will lead to disqualification.)

will a large batch size (16 or 32 GPUs) affect the training results?

I do not know how the number of samples in a batch affects training. @simonalexanderson will know what batch size we used. However, I think the flow code we adapted for MoGlow has an issue where it does not work correctly when trained on multiple GPUs. @jonepatr, who knows more about this, tells me that the issue depends on what distributed backend one is using, and that one of the consequences of using the standard dp backend is that the actnorm layers are reset for each batch if more than one GPU is used. (The actnorm may also end up being initialised with different parameters on different parallel GPUs, which seems like another no-no to me.) I would therefore recommend either fixing the multi-GPU issues present in both the MoGlow code and the upstream Glow implementation – which would be a really useful contribution to this repo – or (easier) training on a single GPU for now, like we did in our papers.
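
To illustrate why this can happen, here is a hypothetical PyTorch sketch (not the actual MoGlow/Glow code): actnorm typically performs a data-dependent initialisation inside its first forward pass, guarded by a flag. Under nn.DataParallel, the module is re-replicated onto each GPU on every forward pass, and in-place updates made inside the replicas never propagate back to the master module, so the "initialised" flag never sticks and the initialisation can effectively run again on every batch.

```python
# Hypothetical actnorm sketch (not the code in this repository) showing the
# data-dependent initialisation pattern that breaks under nn.DataParallel.
# The log-determinant term needed for the flow likelihood is omitted for brevity.
import torch
import torch.nn as nn

class ActNorm(nn.Module):
    def __init__(self, num_channels):
        super().__init__()
        self.bias = nn.Parameter(torch.zeros(1, num_channels))
        self.scale = nn.Parameter(torch.ones(1, num_channels))
        self.register_buffer("initialized", torch.tensor(False))

    def forward(self, x):  # x: (batch, num_channels)
        if self.training and not bool(self.initialized):
            # Initialise so the first batch comes out with zero mean, unit variance.
            with torch.no_grad():
                self.bias.copy_(-x.mean(dim=0, keepdim=True))
                self.scale.copy_(1.0 / (x.std(dim=0, keepdim=True) + 1e-6))
            # Under nn.DataParallel this flag is set on a per-GPU replica that is
            # discarded after the forward pass, so the init repeats every batch.
            self.initialized.fill_(True)
        return (x + self.bias) * self.scale
```

(Whether other backends behave better depends on how and when the initialisation statistics are computed; training on one GPU sidesteps the question entirely.)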


rainofmine commented on July 17, 2024

@ghenter Following your advice, I re-trained the model with the provided hparams (style_gestures.json) on a single GPU. The training data is the GENEA challenge data (recordings 3-23 for training and 1-2 for val/test). I augment the data by mirroring and only downsample once to 20 fps (not like the paper, which uses frames t = 0,3,6,..., t = 1,4,7,..., and t = 2,5,8,...). In the end I get 11978 training samples. After 160k training steps, the training loss is about -300 and the validation loss is about 3k. I visualized the results at various checkpoints. The visualizations often show "lazy moving": for example, the hands move slightly in one position, and a few minutes later they move to another position and again only move slightly around it. I don't know what a reasonable result should look like with the Glow method. I also tried another method (an encoder-decoder with MSE loss); the generated gestures seem more restless (but of course with many strange gestures).

Don't worry, I will not use your model in the GENEA challenge. I am just using it in my research to reproduce the results. Looking forward to your updates. Thank you!

And will the code for the style-control approach be updated? What is the style-control input format?


ghenter commented on July 17, 2024

I visualized the results at various checkpoints. The visualizations often show "lazy moving".

Hmm. Everything you describe about your approach sounds good to me. I think we have reached the limit of my knowledge of training these systems. @simonalexanderson might be able to chime in with further hands-on insights, but he's on holiday right now and likely won't be back until a week from now, I believe.

And will the code for the style-control approach be updated? What is the style-control input format?

This is another thing that @simonalexanderson can answer better than I can.

I don't know what a reasonable result should look like with the Glow method.

I think the motion clips in our demonstration videos are representative of output from the method. In the supplementary material found in the EUROGRAPHICS digital library (open access) you can see more examples of gesture motion generated from speech not in the training data, specifically motion clips used in the subjective evaluation and motion generated for voices different from the training-data speaker.

I will not use your model in the GENEA challenge.

Great! And sorry if I was a bit forceful in pointing this out; I just wanted to make sure that there won't be any unfortunate misunderstandings. :)


rainofmine commented on July 17, 2024

Some questions about the style control. @ghenter @simonalexanderson

As proposed in the paper, style can be seen as another control input 's' besides the audio 'a'. So how does the style go into the network together with the audio input? The audio input has 27 channels and a style input such as height or speed should have 1 channel. So the final control input will have 28 channels? Is my understanding correct? Is there any imbalance between the different controls when they have different numbers of channels?

And does the style control include context (1 frame, or 1 + past + future frames)? And will it be normalized like in the preprocessing of the audio?

Thank you


ghenter commented on July 17, 2024

Hi @rainofmine,

The audio input has 27 channels and a style input such as height or speed should have 1 channel. So the final control input will have 28 channels? Is my understanding correct?

Your description sounds mostly correct to me. Each input frame is associated with an acoustic feature vector 'a_t' and an optional style input vector 's_t'. In the paper, 'a_t' is 27-dimensional while 's_t' is one-dimensional for our style controllable models. There are therefore 27+1=28 total distinct features in the input sequence per output frame in the paper. (But this 28-dimensional vector is not the same as the "control input" 'c_t' in the paper; see the penultimate section of this response.)

Is there any imbalance between the different controls when they have different numbers of channels?

You mean that the number of channels for the two input types is different (27 vs. 1), and that this might cause the network to be less responsive to the style input compared to the acoustic input? If that is what you mean, the effect of such input-dimensionality imbalance is not something we investigated. All I can say is that I think our results conclusively show that both acoustic and style inputs have a clear influence on the output motion, and that this influence matches our expectations of what 'a' as well as 's' should do.

And does the style control include context (1 frame, or 1 + past + future frames)?

As illustrated in Fig. 1 in the paper, the control input 'c_t' that is fed into the flow coupling layers for frame 't' is a concatenation of the 'a' and 's' vectors within the context window. In the paper, this window encompasses 5 frames back and 20 frames into the future, i.e., time indices {t-5,...,t+20}, for a total of 5+1+20=26 contiguous frames. For our style-controllable models, the control input vector 'c_t' for frame 't' thus comprises 26*28=728 elements, a concatenation of 26 distinct 28-dimensional frame-wise input vectors.
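
As a concrete illustration, here is a minimal NumPy sketch of that windowing (not the repository's actual preprocessing code; the boundary padding in particular is an assumption on my part):

```python
# Minimal sketch of the windowing described above: for every frame t,
# concatenate the 28-dim per-frame features (a_t ++ s_t) over frames
# t-5, ..., t+20, giving a 26 * 28 = 728-dim control vector c_t.
import numpy as np

def make_control_inputs(audio, style, n_past=5, n_future=20):
    feats = np.concatenate([audio, style], axis=1)   # (T, 27) ++ (T, 1) -> (T, 28)
    T = feats.shape[0]
    window = n_past + 1 + n_future                   # 26 frames
    # Edge handling (here: repeat the first/last frame) is an assumption.
    padded = np.concatenate([np.repeat(feats[:1], n_past, axis=0),
                             feats,
                             np.repeat(feats[-1:], n_future, axis=0)], axis=0)
    return np.stack([padded[t:t + window].reshape(-1) for t in range(T)])  # (T, 728)
```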

will it be normalized like in the preprocessing of the audio?

My understanding is that we standardised (a.k.a. "normalised") all our input and output features, by affinely transforming them to have zero mean and unit variance over the entirety of the data. This can be done either before or after applying the context-windowing operation to the 'a' and 's' sequences to create 'c', with little practical difference. (I would personally do it to 'a' and 's' individually prior to windowing and concatenation, but it is possible that the code implements this differently.)
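
For illustration, a small sketch of such standardisation (hypothetical helper functions, not this repository's code); whatever statistics are used, the same mean and standard deviation should then be applied to training, validation, and test inputs alike:

```python
# Small sketch of per-feature standardisation: zero mean and unit variance,
# with statistics computed once and then reused for all data splits.
import numpy as np

def fit_standardiser(x):                  # x: (n_frames, n_features)
    return x.mean(axis=0), x.std(axis=0) + 1e-8

def standardise(x, mean, std):
    return (x - mean) / std

# mean, std = fit_standardiser(train_features)
# train_features = standardise(train_features, mean, std)
# val_features = standardise(val_features, mean, std)   # same statistics reused
```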

@simonalexanderson can correct me if I'm wrong on any of these points.


simonalexanderson commented on July 17, 2024

Hi @rainofmine. I can confirm that @ghenter's answer is correct. Please check out the newly updated pre-processing code and guidelines. The GENEA data is in the correct input format (BVH and 48 kHz audio), so you can run the scripts without the manual MotionBuilder/Maya steps described in the guidelines.


rainofmine commented on July 17, 2024

@simonalexanderson Thank you for sharing the code!

I have used the data_processing code from Gesticulator (another work of yours) to process the data. I think that code does similar things to the code you have shared here (correct?). But I noticed that in your data_processing code, you use 6 s training clips, which differs from what is described in the paper (4 s), and you do not use the 3-times staggered downsampling (0,3,6,...; 1,4,7,...; 2,5,8,...). Why?

In my reproduction, the visualized results from the Glow model show small gesture movements. When testing with a long input, it tends to stay in a similar mode of motion. (Compared with testing on a 20 s clip, the autoregressive state and LSTM hidden state are not reset.) Do you have any suggestions on how to improve the performance?

I also tested the model on the training set and found that the output gestures are almost the same as the GT. Considering the gap between the train and val losses, is it a kind of overfitting? Does it result from insufficient training samples? Or is the relationship between audio and gesture hard to generalize?

And is there any plan to share the pretrained model?

Finally, I am grateful that you always answer me so patiently. Thank you again!


ghenter commented on July 17, 2024

I have used the data_processing code from Gesticulator (another work of yours) to process the data. (...) But I noticed that in your data_processing code, you use 6 s training clips, which differs from what is described in the paper (4 s), and you do not use the 3-times staggered downsampling

Are you saying that the code in the data_processing folder that @simonalexanderson checked in to this repo (StyleGestures) is different from the description in the EUROGRAPHICS paper? Or are you saying that the data_processing code from the Gesticulator repository differs from the description in the EUROGRAPHICS paper?

The Gesticulator code is expected to differ from this repository since it is a different work with different goals. For instance, in Gesticulator we wanted to enable gestures linked to semantic content, and training on longer segments may allow the model a better semantic understanding of the speech. Also, the work on Gesticulator was led by @Svito-zar, who might have made different design decisions simply because he is a different person than Simon. :)


simonalexanderson commented on July 17, 2024

In the paper we trained the upper-body systems with 4 s windows and staggered downsampling (as you say), while the full-body systems were trained at a later stage using the settings in the pre-processing code (6 s windows, no staggering). Basically, the results should be very similar with either setting (the staggering mostly adds training time without adding much information).
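
For concreteness, a hypothetical sketch of the two schemes (assuming the source clips are at three times the target frame rate; this is not the repository's actual preprocessing code):

```python
# Sketch of single vs. staggered 3x downsampling of a (T, n_features) array.
import numpy as np

def downsample_once(frames, factor=3):
    return frames[::factor]                    # keeps frames 0, 3, 6, ...

def downsample_staggered(frames, factor=3):
    # Keeps all three offset streams (0,3,6,...; 1,4,7,...; 2,5,8,...) as
    # separate training sequences, tripling the amount of training data
    # without adding much new information.
    return [frames[offset::factor] for offset in range(factor)]
```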


rainofmine commented on July 17, 2024

OK! Thank you. I get it.
And do you have any thoughts on my other questions? @simonalexanderson


ghenter commented on July 17, 2024

I also tested the model on the training set and found that the output gestures are almost the same as the GT. Considering the gap between the train and val losses, is it a kind of overfitting?

Are you saying that, if you feed speech from the training set into your trained model and then sample random output motion, the output motion is of similar quality to the training data itself? But when you feed in similarly processed data that you held out from training, the resulting sampled output motion is much less lively and you get what you termed "lazy moving"? If that is what you are saying then yes, this sounds like possible overfitting to me. (However, it could also be other things, e.g., implementation flaws such as using different standardisation/normalisation on training and test inputs.)

Does it result from insufficient training samples? Or is the relationship between audio and gesture hard to generalize?

It is not always that easy to conclusively attribute overfitting to a single item such as "too little training data" or "generalisation is hard on this problem". For one thing, different factors interact, so it could be that "it is hard to learn a model of the relationship between audio and gesture that generalises well, so we need more training data than we have".

That said, since we have managed to train models on the Trinity College Dublin gesture data that generate reasonable beat gestures in time with the speech, I do not think a lack of data is the key limiter here.

