This is an umbrella issue where we can collectively tackle some problems and improve

TODOs about qanet HOT 6 OPEN

localminimum commented on June 26, 2024

TODOs

from qanet.

Comments (6)

commented on June 26, 2024

As of f0c79cc, I have moved dropout from "before" layer norm to "after" layer norm. It doesn't make sense to drop input channels going into layer norm: it normalizes across the channel dimension, so dropping channels causes a distribution mismatch between training and inference. We shall see how this improves the model.
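To make the two orderings concrete, here is a minimal pure-Python sketch for a single position's channel vector. It is not the code from the commit: the function names (`layer_norm`, `dropout`, `norm_then_dropout`, `dropout_then_norm`) are hypothetical, the layer norm has no learned scale or offset, and inverted dropout is assumed.

```python
import math
import random

def layer_norm(x, eps=1e-6):
    """Normalize one channel vector to zero mean and unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def dropout(x, rate, rng, training=True):
    """Inverted dropout: zero each channel with probability `rate`, rescale the rest."""
    if not training or rate == 0.0:
        return list(x)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in x]

def norm_then_dropout(x, rate, rng, training=True):
    # Ordering described above: layer norm sees the full, unmasked channels.
    return dropout(layer_norm(x), rate, rng, training)

def dropout_then_norm(x, rate, rng, training=True):
    # Earlier ordering: masked channels shift the mean and variance that
    # layer norm computes during training, but not at inference.
    return layer_norm(dropout(x, rate, rng, training))
```

With `norm_then_dropout`, the statistics layer norm computes are the same in training and inference, because the mask is applied only afterwards.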

commented on June 26, 2024

To overcome your GPU memory constraints, what about just decreasing batch size?

On a 1080 Ti (11GB), I'm able to run 128 hidden units, 8 attention heads, 300 glove_dim, and 300 char_dim with a batch size of 12. At a batch size of 16 and above, CUDA runs out of memory. Accuracy seems comparable so far.

commented on June 26, 2024

You have a valid point, and I would like to know how your experiment goes. I would also suggest trying group norm instead of layer norm, since the group norm paper reports better performance at small batch sizes.
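For reference, a simplified sketch of what group norm computes. It is an illustration only: `group_norm` is a hypothetical helper, it handles one channel vector at a time, it omits the learned scale and offset, and it ignores the sequence positions that group norm as usually defined also pools over. The point it shows is that the statistics come from a single sample, so they do not depend on batch size.

```python
import math

def group_norm(x, num_groups, eps=1e-6):
    """Normalize each contiguous group of channels to zero mean, unit variance."""
    if len(x) % num_groups != 0:
        raise ValueError("number of channels must be divisible by num_groups")
    size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        group = x[g * size:(g + 1) * size]
        mean = sum(group) / size
        var = sum((v - mean) ** 2 for v in group) / size
        out.extend((v - mean) / math.sqrt(var + eps) for v in group)
    return out
```

In this simplified form, `num_groups=1` reduces to layer norm over the channel dimension.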

commented on June 26, 2024

Good suggestion, Min. Since the paper compares against batch norm, have you found that layer norm generally outperforms batch norm lately? One could also try batch norm for comparison. Interestingly, the 'break-even' point between batch norm and group norm is at about batch size 12 under that paper's conditions. Layer norm is supposedly more robust to small mini-batches than batch norm.

Also, the configuration from the comment above runs fine on a 1070 GPU.

Do you have a sense of whether model parallelization across multiple GPUs is worth it for this type of model?

localminimum commented on June 26, 2024

Hi @mikalyoung, I haven't tried parallelisation across multiple GPUs, so I wouldn't know the best way to go about it. I have heard that data parallelism is easier to get working than model parallelisation. From #15, it seems that a bigger hidden size and more attention heads improve performance, so I would try fitting the bigger model with smaller batches across multiple GPUs.
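As a rough illustration of the data-parallel idea, the toy sketch below simulates it in plain Python: every replica holds the same weight, each sees one shard of the batch, and the shard gradients are averaged. It is not QANet code and uses no GPU API. The model `y = w * x` and the names `grad_mse`, `split_batch`, and `data_parallel_grad` are assumptions made for the example.

```python
def grad_mse(w, batch):
    """Gradient of mean squared error w.r.t. w for the toy model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def split_batch(batch, num_devices):
    """Split a batch into equal shards, one per simulated device."""
    if len(batch) % num_devices != 0:
        raise ValueError("batch size must be divisible by the number of devices")
    size = len(batch) // num_devices
    return [batch[i * size:(i + 1) * size] for i in range(num_devices)]

def data_parallel_grad(w, batch, num_devices):
    """Average the per-shard gradients, as a synchronous all-reduce would."""
    shards = split_batch(batch, num_devices)
    shard_grads = [grad_mse(w, shard) for shard in shards]
    return sum(shard_grads) / num_devices
```

With equal-sized shards the averaged gradient matches the full-batch gradient, so each device only has to hold activations for its own shard.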

JACKHAHA363 commented on June 26, 2024

What is the current status of reproducing the paper's results?
