Comments (7)
Keep thinking about it: is there a reason to support anything other than NumPy for CPU and CuPy for GPU?
TensorFlow is much more complex than CuPy, and more dissimilar to NumPy.
Numba for CPU should not provide a significant advantage over NumPy if array operations are used (even though there is an advantage, since, according to the QiboJIT paper, it can simulate 5-6 more qubits, i.e. a factor of ~50... I really wonder where it is coming from...), and cuQuantum mostly duplicates CuPy.
If we ever go through this exercise, I'd really consider trimming down the number of backends, so as to support all platforms while shaving off as much overhead as possible.
(In principle, we could swap CuPy with TensorFlow, relegating the existing backends to qibojit, but in practice this would mean keeping the whole backends' mechanism as it is, whereas if we only have to support NumPy and NumPy-compatible libraries for simulation + Qibolab, we might be able to refactor more.)
from qibojit.
@alecandido I agree with your point about `cuquantum` and `tensorflow`. However, there has been some demand for a `pytorch` backend, which could be a replacement for the `tensorflow` backend.
from qibojit.
In the spirit of the first message, PyTorch would definitely be better than TensorFlow.
However, considering the potential simplification, even PyTorch is still one more backend.
@renatomello are you aware of the benefit of a PyTorch backend?
If it's only to use PyTorch somewhere (possibly external to Qibo, or at least outside the circuit simulation), we could use DLPack and friends to cast arrays from one library to another (zero-copy), without the need for a full backend.
But if there is a need deeply connected to the circuit simulation, of course it's much better to plan on including a PyTorch backend from the beginning (if we ever start a refactor; this issue was mostly investigation until now - I just wanted to check if there is room for improvements and simplifications).
from qibojit.
@alecandido I personally haven't used `pytorch` (just haven't needed it yet). But what I have heard from multiple people using Qibo for optimization is that these tensor-based backends allow for automatic differentiation. If one is simulating the circuits instead of sending them to actual hardware, AD becomes a basic necessity. After that, it's a matter of preferring `pytorch` over `tensorflow` in general. But the main point of having at least one tensor-based backend is AD.
from qibojit.
I like the suggestions of the first post, I need to read it in more detail later, but I agree that the methods of Qibo's `AbstractBackend` could be simplified.
Other than that, regarding the existing backends:
> Numba for CPU should not provide a significant advantage over NumPy if array operations are used (even though there is an advantage, since, according to the QiboJIT paper, it can simulate 5-6 qubits more, i.e. a factor 50... I really wonder where it is coming from...),
The advantage only appears when the custom kernels are used, which exist only for applying gates to states and some state initialization. All other operations are delegated to numpy. I would say (without real proof) that the advantage comes from the following points, ordered by decreasing importance:

- In-place updates. In numba we modify the state vector in-place, while `np.einsum` creates a copy. An easy way to test this:

```python
import numpy as np
import qibo
from qibo import Circuit, gates

qibo.set_backend("qibojit")  # or "numpy"

c = Circuit(2)
c.add(gates.H(0))
c.add(gates.H(1))

state = np.random.random(4).astype(complex)
state2 = c(state)
print(state)
print(state2)
```

With numpy `state2 != state`, while with numba `state2 == state`.
- Numpy is single-threaded, while our numba kernels use parallelization (`prange`) to take advantage of multi-threaded CPUs. That being said, maybe there are simpler ways to make numpy (in particular `np.einsum`) compatible with multi-threading.
- We are using some binary operations to find the indices during gate multiplications, which are fast, but we never really proved how much advantage we get from this. I am guessing the low-level implementation of `np.einsum("ec,abcd->abed", gate, state)`, which applies a single-qubit gate to the 3rd qubit of a 4-qubit state, uses similar tricks, but I have never checked the actual code.
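For reference, a numpy-only sketch of what that `einsum` contraction does, with the state reshaped to one axis per qubit and a Pauli-X chosen only to make the check easy (all names here are illustrative):

```python
import numpy as np

nqubits = 4
state = np.random.random(2**nqubits) + 1j * np.random.random(2**nqubits)
state /= np.linalg.norm(state)

gate = np.array([[0, 1], [1, 0]], dtype=complex)  # Pauli-X

# One axis per qubit; contract the gate on the third axis (qubit 2).
# Note einsum allocates a fresh output array, i.e. this is not in-place.
tensor = state.reshape((2,) * nqubits)
new_state = np.einsum("ec,abcd->abed", gate, tensor).reshape(-1)

# Sanity check: X on qubit 2 just flips the state along that axis.
assert np.allclose(new_state, tensor[:, :, ::-1, :].reshape(-1))
```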
> and CuQuantum is mostly duplicating CuPy.
That's true, CuQuantum is there only for supporting an additional backend, which is backed by NVIDIA, and for allowing easy benchmarking (CuPy vs CuQuantum). It does not offer any additional features.
> TensorFlow is much more complex than CuPy, and more dissimilar to NumPy.
As @renatomello said, the main motivation for using TensorFlow is automatic differentiation. Compared to numpy, it also supports multi-threading and GPUs, but it is still slower than qibojit, primarily due to creating copies (point 1 above), which are needed for automatic differentiation. Indeed, there are alternative backends we could add for this purpose (PyTorch, JAX, etc.); I think we only have TensorFlow for historical reasons, as we started with it and with qibotf, the predecessor of qibojit.
from qibojit.
Thanks @stavros11 for the summary; I believe everything should be clear enough now.
My current understanding is that we'll need:
- basic and parallel CPU support
- hardware accelerators support (mostly GPU, but if possible any)
- automatic differentiation
So, I'm not sure that point 3 is strictly required for simulation, because strict simulation cannot differentiate a circuit (otherwise the same code would not run on hardware out of the box).
However, we could assume that we want it; as long as it's not blocking greater improvements, it would also be fine like that.
On the one hand, I have always been tempted to add a further requirement: go beyond Python. However, this, together with the three above, would be incredibly time-consuming, and I'm pretty sure it's not worth it in the current state of the project.
In Python many array libraries are available, with a NumPy-like API and broad hardware support, while to move to C the only strategy I can think of would be to make direct use of XLA, with all the niceties of Bazel...
Speaking of XLA, it seems like all the major ML frameworks are using it (in particular TensorFlow, JAX, and even PyTorch), and it should satisfy all the conditions above on its own.
Thinking twice, I actually wonder whether it would be worth investigating CuPy vs XLA-based libraries more deeply. Because if JAX or PyTorch are good enough (maybe not TF, since it's the least interoperable one, and it already somehow "failed"), and they support all the use cases, why should we dedicate effort to developing/maintaining multiple simulation backends ourselves?
Eventually, if we really needed something more fine-grained than what these libraries provide, making a trip into XLA itself might even be worth it (but I really hope not, at least for a long while... also because we would lose all/most of the autodiff...).
In particular: is there anything to be executed on GPU or differentiated that cannot be implemented with TensorFlow(/...)?
P.S.: about the copies, I was worried the problem could persist with the others, but there is room in JAX and PyTorch (all the trailing-`_` methods) for in-place operations. However, since it could even be an outer product, there is no way that `einsum` could work in-place in general (the output may require more memory than the input); we'd need explicit contractions (as I believe you implemented in `qibojit`).
from qibojit.
> So, I'm not sure that point 3. is strictly required for simulation, because strict simulation can not derive a circuit (otherwise the same code would not run on hardware out of the box).
> However, we could assume that we want it, unless it's blocking greater improvements it would also be fine like that.
Yes, AD is much better for gradient simulation than any other method that is hardware-compatible, so it is very necessary to keep.
Matching the computational complexity of AD on hardware is actually a hot topic right now in QML circles, and there are some theoretical results showing that it may even be impossible for a general circuit without violating complexity bounds. Of course, it can still be possible for specific circuits.
But the point is that AD is indispensable.
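For context, the standard hardware-compatible alternative is the parameter-shift rule, which recovers an exact gradient from two extra circuit executions per parameter. A numpy-only sketch for the single-parameter case, the expectation of Z after RY(theta) on |0> (all names here are illustrative):

```python
import numpy as np

def expectation_z(theta):
    # RY(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>, so <Z> = cos(theta).
    state = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    return state[0] ** 2 - state[1] ** 2

theta = 0.7
shift = np.pi / 2  # exact for gates generated by a Pauli operator

# Parameter-shift rule: gradient from two shifted circuit executions.
grad = (expectation_z(theta + shift) - expectation_z(theta - shift)) / 2

assert np.isclose(grad, -np.sin(theta))  # analytic d/dtheta of cos(theta)
```

With n parameters this costs 2n circuit executions, whereas AD in a tensor backend obtains all gradients in a single backward pass; that gap is what the complexity results above are about.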
from qibojit.