qiboteam / qibo Goto Github PK
View Code? Open in Web Editor NEWA framework for quantum computing
Home Page: https://qibo.science
License: Apache License 2.0
A framework for quantum computing
Home Page: https://qibo.science
License: Apache License 2.0
As discussed in #56 , the code has potentially several places where switching from GPU and CPU is a requirement in order to bypass the memory limitation.
We should propose a strategy to deal with this issue consistently in all parts of the code.
Hello,
I was trying to implement my own VQE (custom hamiltonian) as based on the vqe
branch. I noticed that there is a sigma
parameter passed to the abstract _build
function, which from what I understand only makes sense for building a Heisenberg hamiltonian. Maybe consider having a _build
signature of _build(self)
and using internal variables, or _build(self, *args, **kwargs)
?
As a side note, the master README
already mentions VQEs, even though the PR hasn't been merged yet.
Cheers
Currently the state is stored and manipulated as a rank-n tensor with dimension 2 for each index, instead of a 2^n vector. This requires reshaping the state at the beginning and end of the simulation and applying 2-qubit gates as rank-4 tensors.
I plan to create an alternative implementation (initially in a different branch) where all manipulations happen on the 2^n vector and then compare the two approaches using the QFT benchmarks. We can then keep the method that performs better.
So far Qibo includes a linear algebra approach which is limited by RAM.
There are however other strategies to simulate more qubits, e.g.:
As soon as we are happy with the linear algebra approach we should consider the implementation of other methods.
Hello,
last week we talked about the possibility of creating an algorithm such that
def create_circuit(state):
# Some code
return circuit
circuit has to be made such that, when applied to 0, it returns the original state.
This operation is exponentially costly, but it can be done
The example where we have found that is in QISkit UnivariateDistribution.build(). Univariate distribution is here https://github.com/Qiskit/qiskit-aqua/blob/master/qiskit/aqua/components/uncertainty_models/univariate_distribution.py
Thanks!
Would be interesting to check our performance against QSystem published in https://arxiv.org/abs/2004.03560 with code available https://gitlab.com/evandro-crr/qsystem.
I am opening this issue to discuss the custom operators needed for applying gates. I will start with a quick summary of the approaches that are currently implemented (using tf primmitives) and then discuss how I think we can improve using a custom kernel.
Assume that we have a 5-qubit state
and we want to apply a one-qubit gate
to the 3rd qubit. gate
is a (2, 2)
tensor. Current implementations are able to apply any m-qubit gate to an n-qubit state for arbitrary m, n (m < n), but let's keep it simpler here for the sake of the example.
DefaultEinsum
backendstate
is a tensor with shape 5 * (2,) = (2, 2, 2, 2, 2)
. The gate can be applied using a single einsum call contracting in the 3rd index: tf.einsum("abcde,Cc->abCde", state, gate)
.
MatmulEinsum
backendThis is algorithmically equivalent to einsum but uses tf.matmul
(initially because tf.einsum
had some gradient issues). In order to use matmul we have to reshape the state properly so that the gate is applied to the proper target qubit. For the example (state
is again a tensor with shape (2, 2, 2, 2, 2)
):
# Bring the target qubit (with index 2) in the beginning
state = tf.transpose(state, [2, 0, 1, 3, 4])
# Reshape state to a matrix because tf.matmul works only with matrices
state = tf.reshape(state, (2, 2 ** 4))
# Apply gate
state = tf.matmul(gate, state)
# Recover original shape
state = tf.reshape(state, 5 * (2,))
# Recover original qubit order
state = tf.transpose(state, [1, 2, 0, 3, 4])
Some disadvantages of these approaches are the following:
tf.einsum
/tf.matmul
kernels.Regarding the custom operator implementation, I have in mind the following (at a very high-level):
# Generate indices for the |i,j,0,k,l> and |i,j,1,k,l> subspaces
subspace0, subspace1 = ...
# Keep a copy of the state in the first subspace
buffer = copy(state[subspace0])
# Update the first subspace in-place
state[subspace0] = gate[0, 0] * state[subspace0] + gate[0, 1] * state[subspace1]
# Update the second subspace in-place
state[subspace1] = gate[1, 0] * buffer + gate[1, 1] * state[subspace1]
This applies a general rotation (non-sparse gate) and the total memory required is 1.5 * state vector (buffer is half state vector). This is assuming that indexing happens in a smart way and without having to cast the indices subspace0
, subspace1
in memory. Another advantage of this approach is that for sparse gates (eg. if gate[1, 0] = 0
) we can avoid the buffer at all.
In order to do this we need:
state
has shape 5 * (2,)
and the target qubit is 3:subspace0 = [(0, 0, 0, 0, 0), (0, 0, 0, 0, 1), (0, 0, 0, 1, 0), (0, 0, 0, 1, 1), (0, 1, 0, 0, 0), (0, 1, 0, 0, 1), (0, 1, 0, 1, 0), (0, 1, 0, 1, 1), (1, 0, 0, 0, 0), ...]
. If the state has shape (2 ** 5,)
we have the same indices in their decimal representation. A way to handle this is the binary
matrix in the Fortran code of #74.Regarding benchmarks (once we have a candidate implementation), I believe the easiest way is to compare with a simple einsum/matmul implementation such as the code I wrote above (just for >25 qubits). We should have at least equal performance for non-sparse gates but improve a lot on sparse gates.
Let me know what you think.
Another suggestion from @joseignaciolatorre , here some circuits that we should ship with QIBO for the release:
Are there new strategies to improve on the simulation of layers of single-qubit gates?
Here is a piece of fortran code with a different strategy to direct matrix rotation.
Idea: Precompute which element mix. Then, for a gate you do a single loop.
There is a need to update the state. The rotation needs an auxiliary vector, tmpsi, which is
casted back onto the vector psi after each gate. There are two options
Option 2) is better.
For n=24, t_1=2.23s, t_2=1.56s
Sorry for my fortran background.
Some operations may be much easier and fast in python.
integer qbits,n,nn
parameter(qbits=24,n=qbits-1,nn=2**qbits-1)
complex*8 psi(0:nn),tmppsi(0:nn)
real*8 norm
real*8 th(0:n)
integer iitmp,binary(0:nn,0:n)
integer igate,ii,jj,i0,i1
real*8 uz(2,2)
real t1,t2
do ii=0,nn
iitmp=ii
do jj=0,n
binary(ii,jj)=mod(iitmp,2)
iitmp=int(iitmp/2)
enddo
c write(6,*) ii,(binary(ii,jj),jj=0,n-1)
enddo
do igate=0,n
th(igate)=acos(-1d0)/4.
enddo
c STRATEGY 1
c Use Binary to remember which gates mixed. Keep a separate vector "tmppsi"
c to the right rotation. Then, recast tmpsi onto psi.
if(.true.)then
call CPU_TIME(t1)
do igate=0,n
thtmp=th(igate)
do ii=0,nn
iitmp=binary(ii,igate)
jj=ii-2**igate*iitmp+2**igate*(1-iitmp)
tmppsi(ii)= cos(thtmp)*psi(ii)
. + (2*iitmp-1)*sin(thtmp)*psi(jj)
enddo
do ii=0,nn
psi(ii)=tmppsi(ii)
c write(6,*)ii,psi(ii)
enddo
enddo
call CPU_TIME(t2)
c write(6,*)psi
write(6,*)'immediate recast ',t2-t1
endif
c STRATEGY 2
c Use Binary to remember which gates mixed. Recast psi - tmpsi for even and odd gates.
if(.true.)then
call CPU_TIME(t3)
do igate=0,n,2
thtmp=th(igate)
do ii=0,nn
iitmp=binary(ii,igate)
jj=ii-2**igate*iitmp+2**igate*(1-iitmp)
tmppsi(ii)= cos(thtmp)*psi(ii)
. + (2*iitmp-1)*sin(thtmp)*psi(jj)
enddo
thtmp=th(igate+1)
do ii=0,nn
iitmp=binary(ii,igate+1)
jj=ii-2**(igate+1)*iitmp+2**(igate+1)*(1-iitmp)
psi(ii)= cos(thtmp)*tmppsi(ii)
. + (2*iitmp-1)*sin(thtmp)*tmppsi(jj)
enddo
enddo
call CPU_TIME(t4)
c write(6,*)psi
write(6,*)'cast every two gates ',t4-t3
endif
stop
end
I am playing around with QIBO, and I believe I might have found an issue. In particular, I am applying RY gates to real value wavefunctions, i.e. |0>, and I am obtaining weird superpositions with complex values, which as far as I know, should not happen when applying RY gates.
For instance, RY(np.pi)|0> is delivering [0.04865738-0.21515073j -0.21515073+0.95134262j] when it should be [0.0 1.0]
Can anyone check? In case is due to something I might be doing wrong. If it's a bug, may be happening as well for the other rotations?
Following a private exchange with @joseignaciolatorre , we should consider the implementation of gate union, i.e. in the VQE we could merge the first column of 1 qubits gates with the CZPow and thus perform the evaluation with a single custom operator.
Refers to developments in #90.
There are some features that are expected by users of quantum simulators that we should try to implement if we intend to release this to the public. QCGPU is the fastest simulator available, but it is still heavily underused when compared to cirq or even Qiskit (which is remarkably slow) as it lacks basic QoL features.
Measure gates: you should be able to measure only a set of qubits and get their input in. binary form. Ideally you are able to keep measures of different registers as separate even in the same experiment.
Shot simulation: An easy way to simulate running an experiment with a set of measurements and to extract the numbers each qubit string has been repeated. Useful to plot histograms and extract data without having to deal with the full state.
Multi-controlled Toffoli: Implementation of a CNOT gate controlled by as many qubits as necessary.
Controlled-by gates: An easy way to implement any gate controlled by a set of other qubits. This simplifies a lot simulations, and circuit creation.
Simple error map: Error map where probabilities can be set for bitflip and phaseflip errors, as well as measurement errors.
Circuit visualization: A way to print the circuit on screen
Gate decomposer: Method to decompose a circuit to a basic set of gates in order to gauge its real depth on a quantum device.
Gate compiler: Simplification of the circuit if redundant gates are detected.
Circuit architecture: Implementation of a particular device architecture in which to run the circuit.
Advanced error map: Control over T1 and T2 as a device-like enviroment. Map to a similar error model as found in an actual device.
These are some of the most relevant things that I can come up with both as short and long term implementations. But this is open for anyone to comment on other issues that we may find interesting to add to qibo on an end user perspective.
Given that the computation of circuits with more than 30 qubits requires non negligible amount of memory, we can consider a file-based solution such as Parquet, with I.O controlled through Pandas and the orchestration of parallel files processing through Dask.
As discussed in the previous meeting, an error simulation that does not require more memory is a good alternative in order to test noise effects on larger quantum circuits.
This approach is not fast, as one will have to compute the noisy circuit every time a measurement is made, but might be the only alternative in cases where there is not enough memory to store the density matrices.
I think it would be useful to be able to pass a list, an array, a range or similar as argument to the measurement gate. I encountered this need when programing a quantum classifier with qibo. Depending on the number of clases to be classified, the number of qubits measured is different. In order to automatize which qubits are measured as a function of the number of clases, this feature is desirable.
A useful feature when designing quantum circuits is the ability to analyze the scaling of different types of gates as the circuit size grows.
A way to access the number of Toffoli gates (three-qubit), CNOT gates (two-qubit) and single qubit gates queued in a given circuit would be a welcome addition to Qibo.
Additionally, maybe latter down the line, an automated way to know the real depth of the circuit would also be useful. By real depth I refer to depth accounting for simultaneous application of gates in the real device. That is, applying Hadamard gates to a quantum register would incur only 1 depth.
Hi all,
as we talked yesterday, I write some facts about noisy circuits.
Google's cirq uses three different methods for simulating wavefunctions:
I think if we want to implement noisy circuits, the optimal general approach should be 3) Density matrix. mixtures could become intractable if many errors happen.
I was playing around with this kind of tensors, an einsum strategy could work in this setting. For controlled gates, einsum would work with proper indices. On the contrary, an slicing strategy like the one in QCGPU should be done in different steps.
This can stay here until QIBO is ready to implement noisy circuits.
Cheers
Currently Circuit
is just a list and Circuit.execute()
applies each gate from the list sequentially. Alternatively one can apply gates that act on different qubits in parallel, for example in
c = Circuit(4)
c.add(X(0))
c.add(H(1))
c.add(Y(2))
all gates would be applied simultaneously in a hardware implementation.
Mathematically it is optimal to apply gates sequentially rather than composing gates to a large matrix. The cost of applying M gates on N qubits sequentially is O(M * 2^N), while composing a 2^M x 2^M matrix and applying this is O(2 ^ (N + M)).
From an implementation point of view, it may be beneficial to group gates to moments (as in Cirq) and include multiple gates acting on different qubits in one einsum call. Note that both numpy and tensorflow einsum support multiple inputs. This would require to change the Circuit
data structure a little bit and use a nested instead of a simple list.
I am leaving this open for now as its implementation would depend on whether we stick to einsum or switch to matmul or something else (eg. the result of #14).
While from qibo import gates
works, doing from qibo.gates import *
fails with ModuleNotFoundError: No module named 'qibo.gates'. This is probably because we there is no file named gates.py
in the source directory and gates is imported indirectly in __init__
.
Quick issue to outline the noise discussion we had in the meeting.
Having the possibility to add noise exactly where we want to as an extra gate is really useful. However, if one wanted to add noise to an already designed circuit, adding a gate after every gate of the system would be very time consuming.
Therefore, an option to run an existing (noiseless) circuit with an extra variable to automatically implement a noise gate after every circuit gate would be really helpful.
The main bottleneck of the multi-GPU implementation is swapping the global qubits. In some circuits the number of swaps required can be reduced by rearranging commuting gates. It would be useful to implement an algorithm that does this automatically to the extend that this is possible. I don't know if there is a general algorithm that can find the optimal gate arrangement for any circuit.
As soon as we complete the basic setup for GPUs, we should remember to check again the CPU performance and eventually apply changes to obtain the best performance.
As commented earlier, a small issue with the measurement gates:
This might not be relevant to all computations but in the hash program we rely on relabeling qubits in order to simulate bit shuffling.
Implement a tensorflow backend and testing.
Our parametrized gates should work if we pass a tf.Variable
for theta
instead of a constant float
value. We should add a test for this and also check if automatic differentiation works properly.
We have discussed about the definitions of Pauli gates with regard to the overall phase. On our side, we define gates differently:
R_x(t) = [[cos(t / 2), -i sin(t / 2)],[-i sin (t / 2), cos(t / 2)]]
R_y(t) = [[cos(t / 2), -sin(t / 2)],[sin (t / 2), cos(t / 2)]]
R_x(t) = [[e ^ (-i t / 2), 0],[0, e ^ (i t / 2)]]
The reason why we use this convention is as follows:
R_j(t) = exp(-i t / 2 sigma_j) = I - i t / 2 sigma_j - 1 / 2 (t / 2)^2 I + ... (Taylor series)
= cos(t / 2) I - i sin(t / 2) sigma_j
Thus, there is no place for the overall complex phase (although it could be added somewhat artificially). In addition we think that getting this phase represents an extra computation step.
What do you think about it?
I think we need something like qibo.set_precision('complex64')
function which provides a programmatic mechanism to switch between simple and double precision simulation.
As requested in #30, I did some GPU benchmarks of the newly implemented measurements and there are some interesting findings.
First, I used a supremacy-like circuit with alternating layers of random one-qubit and fixed two-qubit gates. Although the exact gate sets are different than the ones from the experiment (for code simplification), I expect no change in performance as we are still doing matrix multiplications with the same sizes (just different numbers).
The first observation is that without measurements, just simple state vector simulation, our difference with Cirq on GPU is better than the QFT:
When we add measurements we get the following plot
which shows two problems:
(A) tf.random.categorical
which we currently use for sampling measurements fails on GPU for more than 18 qubits as it tries to allocate a large tensor of shape (1, nshots, 2 ** nqubits)
. This does not happen on CPU where you can sample distributions up to 2^30 without problem. Particularly the code
probs = np.random.random(2 ** n)
probs = probs / probs.sum()
samples = tf.random.categorical(np.log(probs)[np.newaxis], 10000)[0]
works on CPU but fails on GPU.
(B) Compiling becomes slower than Eager when measurements are used. This is not true without measurements as shown in the plot above.
I tried several things and I think there are easy solutions to both problems. For (A) we can move the sampling part on CPU using with tf.device("/:CPU0"):
and for (B) we can move the measurement gate action outside of the function that gets compiled. With these two fixes the GPU performance with measurements becomes similar to the state vector case:
@scarrazza if these two solutions are acceptable I can open a short PR that implements them (I already have the code). Otherwise, if for some reason we want to keep sampling on GPU, we may have to look for an alternative to tf.random.categorical
that is better implemented on GPUs and then check again if the compile problem remains.
We should add tests that ensure that our custom gates action agrees with the equivalent gates from Cirq or any other similar library. This will help identify potential bugs in our custom operators.
It may be possible to use this testing mechanism for benchmarks as well.
As was introduced in the last meeting, similarly to the to_qasm
module, qibo could also include a from_qasm
function that reads a qasm file and transforms it into qibo circuit.
This could be useful in order to transfer circuits from outside to qibo (either generated from another library or found elsewhere) and also potentially as a way to receive data from the quantum device and get it back in qibo as well.
Check if we can reproduce the GCGPU approach (slicing+in place amplitude update) using a custom tf operator with cuda, following the approach presented here:
As commented in today's meeting, in order to properly simulate the errors present on a real machine, there has to be a way to decompose multi-controlled gates up to Toffoli, CNOT and single qubit gates.
There has been a recent implementation in the new cirq version:
https://cirq.readthedocs.io/en/stable/generated/cirq.decompose_multi_controlled_x.html?highlight=decompose
(as one example, there are a few more decompositions)
I implemented a function that specifically decomposed a multi-CNOT gate with one work qubit present in the Hash notebooks:
def n_mCNOT(controls, target, work):
i = 0
yield gates.TOFFOLI(controls[-1], work[-1], target)
for i in range(1,len(controls)-2):
yield gates.TOFFOLI(controls[-1-i], work[-1-i], work[-1-i+1])
yield gates.TOFFOLI(controls[0], controls[1], work[-1-i])
for i in reversed(range(1,len(controls)-2)):
yield gates.TOFFOLI(controls[-1-i], work[-1-i], work[-1-i+1])
yield gates.TOFFOLI(controls[-1], work[-1], target)
for i in range(1,len(controls)-2):
yield gates.TOFFOLI(controls[-1-i], work[-1-i], work[-1-i+1])
yield gates.TOFFOLI(controls[0], controls[1], work[-1-i])
for i in reversed(range(1,len(controls)-2)):
yield gates.TOFFOLI(controls[-1-i], work[-1-i], work[-1-i+1])
def n_2CNOT(controls, target, work):
m1 = int(((len(controls)+2)/2)+0.5)
m2 = int(len(controls)+2-m1-1)
yield n_mCNOT(controls[0:m1], work, controls[m1:len(controls)]+[target])
yield n_mCNOT((controls+[work])[m1:m1+m2], target, controls[0:m1])
yield n_mCNOT(controls[0:m1], work, controls[m1:len(controls)]+[target])
yield n_mCNOT((controls+[work])[m1:m1+m2], target, controls[0:m1])
Both my implementation and cirq's are based on the following paper:
https://arxiv.org/abs/quant-ph/9503016
I think implementing something similar is important for error studies and scaling purposes.
I accidentally found QuantumFlow yesterday, which seems to be similar to QIBO in the sense that they simulate circuits using various backends. They have a numpy and a Tensorflow backend. I have not tried using it yet but the idea seems very similar to QIBO and they also say that Tensorflow can be used for gradient descent, etc. It seems that it is an old Rigetti project and the code looks much more complicated than QIBO. We may want to check how similar we are and also run some benchmarks.
Here an example of tf + multi-gpu setup:
https://arxiv.org/abs/2002.12921
Hi all,
I am including some issues about errors.
First of all: measurements error
As far as we know, some others simulation packages include measurement errors in two differnt ways. One way is to apply a channel before applying the gate. The other way is to perform a measurement and then flip the result with certain probability. This probability does not have to be the same for 0-> 1 and 1-> 0, so it allows non-symmetric errors. Is it possible to implement this kind of errors?
Second: thermal relaxation errors
These errors are designed to simulate the interaction between the qubit and the environment. It is inherited from NMR: https://en.wikipedia.org/wiki/Relaxation_%28NMR%29, https://arxiv.org/pdf/1609.04485.pdf . It can be understood as "how the qubit loses its quantum behaviour", however I am not an expert of this file. Take a look at that and tell me if this makes sense to you.
@scarrazza, I am opening to discuss how we would like measurements to be implemented. The main question is whether we want to reuse qubits after measurements.
If not, then all measurements should happen at the end of simulation and are equivalent to sampling the final wave function. This allows to efficiently sample as many samples (shots) as we want, without having to re-simulate the circuit.
If we want to allow reusing measured qubits then we have to implement measurements as gates that project the state to the measured value. Since the measurement is non-deterministic, getting many shots would require to run the simulation many times. Therefore this approach is less efficient but at the same time it will be more robust when we add noise.
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.