Hi there, I have been running some experiments with mixture of exper


Improvements in validation/test ppl of the transformer-xl with MoE on wt103 about fastmoe HOT 4 CLOSED

laekov commented on September 15, 2024

Improvements in validation/test ppl of the transformer-xl with MoE on wt103

from fastmoe.

Comments (4)

laekov commented on September 15, 2024

@xptree @Sengxian any ideas?


xptree commented on September 15, 2024

@latifisalar This is not surprising. In your WT103 experiment, the MoE-Transformer-XL has 16x the parameters of the vanilla one, so it is much easier for the MoE model to overfit (given that WT103 is not a very hard dataset to fit).

On the other hand, it means the capacity of the MoE model is much larger than that of the vanilla Transformer-XL, even with similar FLOPs.
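To make the "16x parameters, similar FLOPs" point concrete, here is a rough sketch that counts feed-forward weights only. The sizes (`D_MODEL`, `D_INNER`), the expert count, and top-1 routing are illustrative assumptions, not the settings used in this issue, and the helper names are hypothetical rather than part of FastMoE.

```python
def ffn_params(d_model: int, d_inner: int) -> int:
    """Weights plus biases of a two-layer feed-forward block."""
    return (d_model * d_inner + d_inner) + (d_inner * d_model + d_model)


def moe_ffn_stats(d_model: int, d_inner: int, num_experts: int, top_k: int):
    """Return (total, active-per-token) parameter counts for an MoE FFN.

    Assumes every expert is a full-size FFN and a linear gate picks top_k experts.
    """
    per_expert = ffn_params(d_model, d_inner)
    gate = d_model * num_experts + num_experts  # linear router
    total = num_experts * per_expert + gate     # parameters that must be stored
    active = top_k * per_expert + gate          # parameters touched per token
    return total, active


# Hypothetical sizes, chosen only for illustration.
D_MODEL, D_INNER = 512, 2048
NUM_EXPERTS, TOP_K = 16, 1

dense = ffn_params(D_MODEL, D_INNER)
total, active = moe_ffn_stats(D_MODEL, D_INNER, NUM_EXPERTS, TOP_K)

print(f"stored parameters vs dense FFN: {total / dense:.2f}x")
print(f"active parameters per token vs dense FFN: {active / dense:.2f}x")
```

Under these assumptions the stored parameters grow roughly with the number of experts, while the per-token compute (which tracks the active parameters) stays close to the dense block. That extra capacity at roughly constant compute is what makes a small dataset easier to overfit.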


latifisalar commented on September 15, 2024

@xptree That is a valid point. I was expecting overfitting with a dataset as small as wt103. May I ask which dataset was used to train the GPT model in Section 5.4? Are you using the latest Wikipedia dump? Also, I am guessing the loss curves in Figure 7 correspond to the pre-training phase. Did you check the test accuracy at the end of pre-training, or validate the model on any downstream tasks? I just wanted to see whether overfitting stops being a problem on larger datasets and whether the MoE model actually improves test accuracy in the end.


xptree commented on September 15, 2024

@latifisalar We report the training loss and ppl in our manuscript, and I agree that a validation loss/ppl curve is necessary here (perhaps we can add it in our next version). The dataset we used for pre-training is wiki.
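For readers comparing training and validation numbers, a minimal sketch of how perplexity follows from per-token loss is given below. It assumes the loss is the negative log-likelihood in nats; the function names and sample values are hypothetical and are not taken from the manuscript or from FastMoE.

```python
import math


def perplexity(token_nlls):
    """Perplexity is exp of the mean per-token negative log-likelihood (nats)."""
    return math.exp(sum(token_nlls) / len(token_nlls))


def ppl_gap(train_nlls, valid_nlls):
    """Validation ppl minus training ppl; a growing gap suggests overfitting."""
    return perplexity(valid_nlls) - perplexity(train_nlls)


# Made-up per-token losses, for illustration only.
train_nlls = [2.9, 3.1, 3.0]
valid_nlls = [3.4, 3.6, 3.5]

print(f"train ppl: {perplexity(train_nlls):.2f}")
print(f"valid ppl: {perplexity(valid_nlls):.2f}")
print(f"gap: {ppl_gap(train_nlls, valid_nlls):.2f}")
```

Tracking this gap over training steps, rather than training ppl alone, is one way to see whether the larger MoE model is overfitting on a given dataset.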

