
mixture-of-experts's Introduction

The Sparsely Gated Mixture of Experts Layer for PyTorch

source: https://techburst.io/outrageously-large-neural-network-gated-mixture-of-experts-billions-of-parameter-same-d3e901f2fe05

This repository contains a PyTorch re-implementation of the sparsely-gated MoE layer described in the paper Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer.

from moe import MoE
import torch

# instantiate the MoE layer
model = MoE(input_size=1000, output_size=20, num_experts=10, hidden_size=66, k=4, noisy_gating=True)

X = torch.rand(32, 1000)

# train
model.train()
# forward
y_hat, aux_loss = model(X)

# evaluation

model.eval()
y_hat, aux_loss = model(X)
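
A minimal training-step sketch (not from the repository's examples) showing one common way to combine a task loss with the returned aux_loss; the NLLLoss choice assumes the experts end in LogSoftmax, as discussed in the issues below:

import torch
from moe import MoE

model = MoE(input_size=1000, output_size=20, num_experts=10, hidden_size=66, k=4, noisy_gating=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = torch.nn.NLLLoss()  # the experts emit log-probabilities (LogSoftmax)

X = torch.rand(32, 1000)
y = torch.randint(0, 20, (32,))

model.train()
y_hat, aux_loss = model(X)
loss = criterion(y_hat, y) + aux_loss  # aux_loss encourages balanced expert load
optimizer.zero_grad()
loss.backward()
optimizer.step()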

Requirements

To install the requirements run:

pip install -r requirements.txt

Example

The file example.py contains a minimal working example illustrating how to train and evaluate the MoE layer with dummy inputs and targets. To run the example:

python example.py

CIFAR-10 example

The file cifar10_example.py contains a minimal working example on the CIFAR-10 dataset. It reaches an accuracy of 39% with arbitrary hyper-parameters and without training to convergence. To run the example:

python cifar10_example.py
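
A minimal sketch (not the contents of cifar10_example.py) of how the layer can be applied to CIFAR-10: images are flattened to vectors of size 3*32*32 = 3072 so they match the expected [batch_size, input_size] input; hidden_size=256 is an arbitrary choice here.

import torch
from moe import MoE

model = MoE(input_size=3072, output_size=10, num_experts=10, hidden_size=256, k=4, noisy_gating=True)
x = torch.rand(32, 3, 32, 32)          # dummy batch of CIFAR-sized images
y_hat, aux_loss = model(x.flatten(1))  # flatten to [32, 3072] before the MoE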

Used by

FastMoE: A Fast Mixture-of-Expert Training System. This repository served as the reference PyTorch implementation for single-GPU training.

Acknowledgements

The code is based on the original TensorFlow implementation.

Citing

@misc{rau2019moe,
    title={Sparsely-gated Mixture-of-Experts PyTorch implementation},
    author={Rau, David},
    howpublished={https://github.com/davidmrau/mixture-of-experts},
    year={2019}
}

mixture-of-experts's People

Contributors

davidmrau, elias-ramzi, inisis, panmianzhi


mixture-of-experts's Issues

MoE for transformers

Hi,

I want to use the MoE inside a transformer model. So instead of the current input of shape [batch_size, input_size], I have an input of shape [batch_size, sequence_length, input_size]. Do you know how I can make this work?

Best,
Elias
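
A minimal sketch, not from the repository, of the usual workaround: fold the sequence dimension into the batch dimension before the layer and unfold it again afterwards.

import torch
from moe import MoE

batch_size, seq_len, input_size = 8, 16, 1000
moe = MoE(input_size=input_size, output_size=20, num_experts=10, hidden_size=66, k=4, noisy_gating=True)
x = torch.rand(batch_size, seq_len, input_size)

y, aux_loss = moe(x.reshape(-1, input_size))  # [batch*seq, input_size]
y = y.reshape(batch_size, seq_len, -1)        # back to [batch, seq, output_size]

Each token is then routed independently, which is also how MoE layers are typically used inside transformer blocks.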

Why there is prob_if_in/out in MoE-Loss-load?

Can you explain why there are "prob_if_in" and "prob_if_out" in "MoE._prob_in_top_k"?
I am a little confused because the original paper doesn't mention that L_load needs two kinds of probabilities.

Why not GPU?

The README says the code runs on the CPU, but can it also run on a GPU, such as an A100?
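
A minimal sketch, assuming the layer's parameters are registered correctly (see the gates-parameter issue below), of running the layer on a GPU: move the module and the inputs to the same device.

import torch
from moe import MoE

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = MoE(input_size=1000, output_size=20, num_experts=10, hidden_size=66, k=4, noisy_gating=True).to(device)
X = torch.rand(32, 1000, device=device)
y_hat, aux_loss = model(X)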

Zero Grad of w_gate

I implemented MoE in a Transformer layer and discarded the load-balancing loss l_aux during training. However, I found that the gradient of w_gate is always zero and that parameter does not update. Is this because I don't include l_aux?

Why logsoftmax in the expert's output?

Hi, thanks for the repo!

Could you kindly clarify why you use nn.LogSoftmax() in the last layer of the MLP experts? Thanks!

I'm guessing the reason is that you use NLL loss in one of the examples. However, the CIFAR example uses CrossEntropyLoss.
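
A small check of the general PyTorch fact behind this guess (not repository code): CrossEntropyLoss is equivalent to LogSoftmax followed by NLLLoss, which is why LogSoftmax outputs pair with NLLLoss while raw logits pair with CrossEntropyLoss.

import torch
import torch.nn.functional as F

logits = torch.randn(32, 10)
targets = torch.randint(0, 10, (32,))

ce = F.cross_entropy(logits, targets)                    # expects raw logits
nll = F.nll_loss(F.log_softmax(logits, dim=1), targets)  # expects log-probs
assert torch.allclose(ce, nll)                           # same value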

Question about the noisy top-k gating

Hi! Thanks for your implementation of MoE! I am confused about the derivatives of w_gate and w_noise. It seems the computation of the logits:

https://github.com/davidmrau/mixture-of-experts/blob/master/moe.py#L239

is not differentiable because of the top-k operation, so w_gate and w_noise cannot be updated from the NLL loss. I am not sure this is the appropriate way to train the MoE.
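
A small check of the general PyTorch behaviour involved here (not repository code): topk is differentiable with respect to the returned values (only the indices are non-differentiable), so gradients can still reach the selected gate logits.

import torch

logits = torch.randn(4, 10, requires_grad=True)
values, indices = logits.topk(k=3, dim=1)
values.sum().backward()
print(logits.grad)  # nonzero exactly at the selected top-k positions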

cv_squared

Hi David,

Thanks for your great code!
I think there is a small mistake in your cv_squared function.
It seems it should be: return x.float().var() / (x.float().mean()**2 + eps)

Thanks
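
For reference, the squared coefficient of variation with the fix proposed above, written as a standalone sketch (the eps value is an assumption here):

import torch

def cv_squared(x, eps=1e-10):
    # var/mean^2: small when the per-expert loads in x are balanced
    x = x.float()
    return x.var() / (x.mean() ** 2 + eps)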

about aux_loss

Hello, I would like to ask what the overall trend of aux_loss should be during training. After I added MoE to my model, the loss decreased, but the aux_loss kept fluctuating.

Why is weighted sum calculated in the logarithmic space?

In the implementation of combine in class SparseDispatcher, the code first applies exp(), then calculates the weighted sum, and finally goes back to log space. Why do that? I think the result is not the same as in the original paper.
In the paper, we have
y = sum_i G(x)_i * E_i(x)
but in your code, I think you calculate
y = log( sum_i G(x)_i * exp(E_i(x)) )
which is not the same.
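
A small numeric check, not repository code, that the two formulas above indeed differ in general:

import torch

g = torch.tensor([0.7, 0.3])          # gate values for two experts
e = torch.tensor([0.2, -1.5])         # the experts' outputs

plain = (g * e).sum()                 # sum_i G(x)_i * E_i(x)
logspace = (g * e.exp()).sum().log()  # log( sum_i G(x)_i * exp(E_i(x)) )
print(plain, logspace)                # approx. -0.31 vs. -0.08: not the same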

multiple_by_gates after exp

Hi,

Thanks for your PyTorch implementation of sparse MoE! In the combine method of SparseDispatcher, stitched is transformed by exp and then multiplied by the gates. I wonder if stitched should first be multiplied by the gates and then transformed by exp, which would be consistent with the TF implementation.

def combine(self, expert_out, multiply_by_gates=True):
        # apply exp to expert outputs, so we are not longer in log space
        stitched = torch.cat(expert_out, 0).exp()

        if multiply_by_gates:
            stitched = stitched.mul(self._nonzero_gates)

How to use this layer in a sequence setting?

Hi, I am trying to use the MoE class in the decoder portion of a transformer architecture, where I want to replace the feed-forward step with a mixture of experts. The input to the class has shape [batch, input_size], but the sequence length at each step is variable, which leads to a variable input size. How can I use this class in that case?

A question for changing input size of moe

Hi, thanks for the update!
I noticed that you updated this project recently, and I am wondering: if I want to change the input size to torch.Size([1, seq, 64]), what code should I change in moe.py?

Wrong Implementation in SparseDispatcher

Line 57, self._batch_index = sorted_experts[index_sorted_experts[:, 1],0]
should be changed to self._batch_index = torch.nonzero(gates)[index_sorted_experts[:, 1],0]

some questions about the code

Hello, your code has provided great inspiration for my work. I also have some questions. First, why do we need to compute the self.cv_squared loss for both importance and load? Second, in _prob_in_top_k, can we replace prob_if_out with 1 - prob_if_in?

Issue with gates parameters

Hi,

Thank you for this repo!

I think there is an issue with the gates parameters defined in the MoE as:

self.w_gate = nn.Parameter(torch.zeros(input_size, num_experts), requires_grad=True).to(self.device)
self.w_noise = nn.Parameter(torch.zeros(input_size, num_experts), requires_grad=True).to(self.device)

In my version of torch (1.11.0), applying .to(self.device) to an nn.Parameter returns a plain Tensor, so the weights are not learned during training.
A simple fix is to just remove the .to(self.device), since the gate weights are registered by torch as model parameters and can be moved to the correct device outside of __init__.
Again, I don't know if this is specific to my torch version, but it prevented the MoE from learning for me.

Hope this helps!
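
A small sketch of the fix described above, using a hypothetical Gate module: register the weight as an nn.Parameter inside __init__ and move the whole module to the device afterwards, so the weight stays registered in model.parameters().

import torch
import torch.nn as nn

class Gate(nn.Module):
    def __init__(self, input_size=1000, num_experts=10):
        super().__init__()
        # register as-is; no .to(device) here, move the module later instead
        self.w_gate = nn.Parameter(torch.zeros(input_size, num_experts))

model = Gate()
print(sum(1 for _ in model.parameters()))  # 1: the gate weight is registered
model = model.to(torch.device('cpu'))      # device placement done on the module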

Why apply exp() and log() to the expert output in the combine() function of the SparseDispatcher class?

    def combine(self, expert_out, multiply_by_gates=True):
        """Sum together the expert output, weighted by the gates.
        The slice corresponding to a particular batch element `b` is computed
        as the sum over all experts `i` of the expert output, weighted by the
        corresponding gate values.  If `multiply_by_gates` is set to False, the
        gate values are ignored.
        Args:
          expert_out: a list of `num_experts` `Tensor`s, each with shape
            `[expert_batch_size_i, <extra_output_dims>]`.
          multiply_by_gates: a boolean
        Returns:
          a `Tensor` with shape `[batch_size, <extra_output_dims>]`.
        """
        # apply exp to expert outputs, so we are not longer in log space
        stitched = torch.cat(expert_out, 0).exp()

        if multiply_by_gates:
            stitched = stitched.mul(self._nonzero_gates)
        zeros = torch.zeros(self._gates.size(0), expert_out[-1].size(1), requires_grad=True, device=stitched.device)
        # combine samples that have been processed by the same k experts
        combined = zeros.index_add(0, self._batch_index, stitched.float())
        # add eps to all zero values in order to avoid nans when going back to log space
        combined[combined == 0] = np.finfo(float).eps
        # back to log space
        return combined.log()

I'm confused about the exp() and log() applied in the code above.

If we just want to predict one data item at inference time, do I need to apply these functions?

Log and Exp- Space

Hey,

thank you for the PyTorch implementation. In your combine implementation you transform the expert outputs by exp and afterwards transform them back. Can you explain why you do this?

    def combine(self, expert_out, multiply_by_gates=True):
        """Sum together the expert output, weighted by the gates.
        The slice corresponding to a particular batch element `b` is computed
        as the sum over all experts `i` of the expert output, weighted by the
        corresponding gate values.  If `multiply_by_gates` is set to False, the
        gate values are ignored.
        Args:
          expert_out: a list of `num_experts` `Tensor`s, each with shape
            `[expert_batch_size_i, <extra_output_dims>]`.
          multiply_by_gates: a boolean
        Returns:
          a `Tensor` with shape `[batch_size, <extra_output_dims>]`.
        """
        # apply exp to expert outputs, so we are not longer in log space
        stitched = torch.cat(expert_out, 0).exp()

        if multiply_by_gates:
            stitched = stitched.mul(self._nonzero_gates)
        zeros = torch.zeros(self._gates.size(0), expert_out[-1].size(1), requires_grad=True)
        # combine samples that have been processed by the same k experts
        combined = zeros.index_add(0, self._batch_index, stitched.float())
        # add eps to all zero values in order to avoid nans when going back to log space
        combined[combined == 0] = np.finfo(float).eps
        # back to log space
        return combined.log()

For def _prob_in_top_k

Hey, sorry to bother you! I am just confused that you wrote the function as:
prob_if_in = normal.cdf((clean_values - threshold_if_in)/noise_stddev)
prob_if_out = normal.cdf((clean_values - threshold_if_out)/noise_stddev)

From my reading of the original paper, both threshold_if_in and threshold_if_out should be taken from the initial logits (after adding the noise), before the softmax is applied. After running the code, I found that the threshold_if_in/out passed in actually come from softmax(initial logits).

Also, in the line is_in = torch.gt(noisy_values, threshold_if_in), the noisy_values are the initial logits, while threshold_if_in is an output of softmax(initial logits).

requires_grad = True not required for a variable under combine() method?

Inside combine() in SparseDispatcher, there is the line:

zeros = torch.zeros(self._gates.size(0), expert_out[-1].size(1), requires_grad=True, device=stitched.device)

It seems to me that requires_grad=True is not needed here, as zeros is not a weight or parameter to be learned. Is there any specific reason to set it to True?
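
A small check of the general PyTorch behaviour in question (not repository code): gradients flow through the out-of-place index_add via its source argument, so the zeros buffer itself does not need requires_grad=True.

import torch

src = torch.randn(3, 2, requires_grad=True)
index = torch.tensor([0, 2, 2])
zeros = torch.zeros(4, 2)              # no requires_grad here
out = zeros.index_add(0, index, src)   # out-of-place scatter-add
out.sum().backward()
print(src.grad)                        # all ones: the gradient reached src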
