qiboteam / qibo Goto Github PK

View Code? Open in Web Editor NEW

279.0 32.0 56.0 88.39 MB

A framework for quantum computing

Home Page: https://qibo.science

License: Apache License 2.0

Python 99.88% Shell 0.05% Nix 0.07%

quantum quantum-circuit quantum-computing quantum-algorithms gpu quantum-annealing

qibo's People

Contributors

Stargazers

Watchers

qibo's Issues

Implement standard fallback mechanism for memory overflow

As discussed in #56 , the code has potentially several places where switching from GPU and CPU is a requirement in order to bypass the memory limitation.

We should propose a strategy to deal with this issue consistently in all parts of the code.

On Creating Custom Hamiltonian Objects

Hello,

I was trying to implement my own VQE (custom hamiltonian) as based on the vqe branch. I noticed that there is a sigma parameter passed to the abstract _build function, which from what I understand only makes sense for building a Heisenberg hamiltonian. Maybe consider having a _build signature of _build(self) and using internal variables, or _build(self, *args, **kwargs)?

As a side note, the master README already mentions VQEs, even though the PR hasn't been merged yet.

Cheers

Update README with minimal information and link to docs

Create version that avoids reshaping

Currently the state is stored and manipulated as a rank-n tensor with dimension 2 for each index, instead of a 2^n vector. This requires reshaping the state at the beginning and end of the simulation and applying 2-qubit gates as rank-4 tensors.

I plan to create an alternative implementation (initially in a different branch) where all manipulations happen on the 2^n vector and then compare the two approaches using the QFT benchmarks. We can then keep the method that performs better.

Simulation approaches

So far Qibo includes a linear algebra approach which is limited by RAM.

There are however other strategies to simulate more qubits, e.g.:

feynman path method
tensor networks

As soon as we are happy with the linear algebra approach we should consider the implementation of other methods.

Implement VQE

Perform tests/workflows on GPU

Unroll - extract circuit to create a given quantum state

Hello,

last week we talked about the possibility of creating an algorithm such that

def create_circuit(state):
# Some code
return circuit

circuit has to be made such that, when applied to 0, it returns the original state.

This operation is exponentially costly, but it can be done

The example where we have found that is in QISkit UnivariateDistribution.build(). Univariate distribution is here https://github.com/Qiskit/qiskit-aqua/blob/master/qiskit/aqua/components/uncertainty_models/univariate_distribution.py

Thanks!

Compare qibo to QSystem

Would be interesting to check our performance against QSystem published in https://arxiv.org/abs/2004.03560 with code available https://gitlab.com/evandro-crr/qsystem.

The most impressive performance claim is the GHZ state:

Custom operators for applying gates

I am opening this issue to discuss the custom operators needed for applying gates. I will start with a quick summary of the approaches that are currently implemented (using tf primmitives) and then discuss how I think we can improve using a custom kernel.

Assume that we have a 5-qubit state and we want to apply a one-qubit gate to the 3rd qubit. gate is a (2, 2) tensor. Current implementations are able to apply any m-qubit gate to an n-qubit state for arbitrary m, n (m < n), but let's keep it simpler here for the sake of the example.

Approach 1: DefaultEinsum backend

state is a tensor with shape 5 * (2,) = (2, 2, 2, 2, 2). The gate can be applied using a single einsum call contracting in the 3rd index: tf.einsum("abcde,Cc->abCde", state, gate).

Approach 2: MatmulEinsum backend

This is algorithmically equivalent to einsum but uses tf.matmul (initially because tf.einsum had some gradient issues). In order to use matmul we have to reshape the state properly so that the gate is applied to the proper target qubit. For the example (state is again a tensor with shape (2, 2, 2, 2, 2)):

# Bring the target qubit (with index 2) in the beginning
state = tf.transpose(state, [2, 0, 1, 3, 4])
# Reshape state to a matrix because tf.matmul works only with matrices
state = tf.reshape(state, (2, 2 ** 4))
# Apply gate
state = tf.matmul(gate, state)
# Recover original shape
state = tf.reshape(state, 5 * (2,))
# Recover original qubit order
state = tf.transpose(state, [1, 2, 0, 3, 4])

Some disadvantages of these approaches are the following:

We have no control on how many copies of state are used as this is handled automatically by the tf.einsum/tf.matmul kernels.
All gates are handled equivalently as matrices. This is okay for general rotations (non-sparse gates), however for sparse gates (such as the Z gate or controlled gates) it is possible to have more efficient in-place updates without using any copies.

Regarding the custom operator implementation, I have in mind the following (at a very high-level):

# Generate indices for the |i,j,0,k,l> and |i,j,1,k,l> subspaces
subspace0, subspace1 = ...
# Keep a copy of the state in the first subspace
buffer = copy(state[subspace0]) 
# Update the first subspace in-place
state[subspace0] = gate[0, 0] * state[subspace0] + gate[0, 1] * state[subspace1]
# Update the second subspace in-place
state[subspace1] = gate[1, 0] * buffer + gate[1, 1] * state[subspace1]

This applies a general rotation (non-sparse gate) and the total memory required is 1.5 * state vector (buffer is half state vector). This is assuming that indexing happens in a smart way and without having to cast the indices subspace0, subspace1 in memory. Another advantage of this approach is that for sparse gates (eg. if gate[1, 0] = 0) we can avoid the buffer at all.

In order to do this we need:

A mechanism that efficiently finds the subspace indices according to the target qubit. If state has shape 5 * (2,) and the target qubit is 3:
subspace0 = [(0, 0, 0, 0, 0), (0, 0, 0, 0, 1), (0, 0, 0, 1, 0), (0, 0, 0, 1, 1), (0, 1, 0, 0, 0), (0, 1, 0, 0, 1), (0, 1, 0, 1, 0), (0, 1, 0, 1, 1), (1, 0, 0, 0, 0), ...]. If the state has shape (2 ** 5,) we have the same indices in their decimal representation. A way to handle this is the binary matrix in the Fortran code of #74.
A way to update the entries at these indices in-place.
A way to copy the entries at these indices in a temporary buffer.

Regarding benchmarks (once we have a candidate implementation), I believe the easiest way is to compare with a simple einsum/matmul implementation such as the code I wrote above (just for >25 qubits). We should have at least equal performance for non-sparse gates but improve a lot on sparse gates.

Let me know what you think.

Circuit examples for release

Another suggestion from @joseignaciolatorre , here some circuits that we should ship with QIBO for the release:

optimization of single qubit gates layer

Are there new strategies to improve on the simulation of layers of single-qubit gates?

Here is a piece of fortran code with a different strategy to direct matrix rotation.
Idea: Precompute which element mix. Then, for a gate you do a single loop.
There is a need to update the state. The rotation needs an auxiliary vector, tmpsi, which is
casted back onto the vector psi after each gate. There are two options

plain recast vector after each gate
cast even gates to a auxiliary vector, and then back to the register with the odd ones.

Option 2) is better.
For n=24, t_1=2.23s, t_2=1.56s

Sorry for my fortran background.
Some operations may be much easier and fast in python.

       integer qbits,n,nn
       parameter(qbits=24,n=qbits-1,nn=2**qbits-1)
       complex*8 psi(0:nn),tmppsi(0:nn)
       real*8 norm
       real*8 th(0:n)
       integer iitmp,binary(0:nn,0:n)
       integer igate,ii,jj,i0,i1
       real*8 uz(2,2)
       real t1,t2

	do ii=0,nn
	  iitmp=ii
	  do jj=0,n
	    binary(ii,jj)=mod(iitmp,2)
	    iitmp=int(iitmp/2)
	  enddo
c	write(6,*) ii,(binary(ii,jj),jj=0,n-1)
	enddo

       do igate=0,n
         th(igate)=acos(-1d0)/4.
       enddo


c STRATEGY 1
c Use Binary to remember which gates mixed. Keep a separate vector "tmppsi"
c  to the right rotation. Then, recast tmpsi onto psi.
       if(.true.)then
       call CPU_TIME(t1)
       do igate=0,n
	  thtmp=th(igate)
	  do ii=0,nn
            iitmp=binary(ii,igate)
	    jj=ii-2**igate*iitmp+2**igate*(1-iitmp)
            tmppsi(ii)= cos(thtmp)*psi(ii)
     .      + (2*iitmp-1)*sin(thtmp)*psi(jj)
	  enddo
         do ii=0,nn
            psi(ii)=tmppsi(ii)
c            write(6,*)ii,psi(ii)
         enddo
       enddo
       call CPU_TIME(t2)
c       write(6,*)psi
       write(6,*)'immediate recast ',t2-t1
       endif

c STRATEGY 2
c Use Binary to remember which gates mixed. Recast psi - tmpsi for even and odd gates.


       if(.true.)then
       call CPU_TIME(t3)
       do igate=0,n,2
	  thtmp=th(igate)
	  do ii=0,nn
            iitmp=binary(ii,igate)
	    jj=ii-2**igate*iitmp+2**igate*(1-iitmp)
            tmppsi(ii)= cos(thtmp)*psi(ii)
     .      + (2*iitmp-1)*sin(thtmp)*psi(jj)
	  enddo
	  thtmp=th(igate+1)
	  do ii=0,nn
            iitmp=binary(ii,igate+1)
	    jj=ii-2**(igate+1)*iitmp+2**(igate+1)*(1-iitmp)
            psi(ii)= cos(thtmp)*tmppsi(ii)
     .      + (2*iitmp-1)*sin(thtmp)*tmppsi(jj)
	  enddo
       enddo
       call CPU_TIME(t4)
c       write(6,*)psi
       write(6,*)'cast every two gates ',t4-t3
       endif

       stop
       end

RY gate

I am playing around with QIBO, and I believe I might have found an issue. In particular, I am applying RY gates to real value wavefunctions, i.e. |0>, and I am obtaining weird superpositions with complex values, which as far as I know, should not happen when applying RY gates.

For instance, RY(np.pi)|0> is delivering [0.04865738-0.21515073j -0.21515073+0.95134262j] when it should be [0.0 1.0]

Can anyone check? In case is due to something I might be doing wrong. If it's a bug, may be happening as well for the other rotations?

Gates union

Following a private exchange with @joseignaciolatorre , we should consider the implementation of gate union, i.e. in the VQE we could merge the first column of 1 qubits gates with the CZPow and thus perform the evaluation with a single custom operator.

Refers to developments in #90.

Create pip package

Quality of life improvements

There are some features that are expected by users of quantum simulators that we should try to implement if we intend to release this to the public. QCGPU is the fastest simulator available, but it is still heavily underused when compared to cirq or even Qiskit (which is remarkably slow) as it lacks basic QoL features.

Basic:

Measure gates: you should be able to measure only a set of qubits and get their input in. binary form. Ideally you are able to keep measures of different registers as separate even in the same experiment.
Shot simulation: An easy way to simulate running an experiment with a set of measurements and to extract the numbers each qubit string has been repeated. Useful to plot histograms and extract data without having to deal with the full state.
Multi-controlled Toffoli: Implementation of a CNOT gate controlled by as many qubits as necessary.
Controlled-by gates: An easy way to implement any gate controlled by a set of other qubits. This simplifies a lot simulations, and circuit creation.
Simple error map: Error map where probabilities can be set for bitflip and phaseflip errors, as well as measurement errors.

Advanced:

Circuit visualization: A way to print the circuit on screen
Gate decomposer: Method to decompose a circuit to a basic set of gates in order to gauge its real depth on a quantum device.
Gate compiler: Simplification of the circuit if redundant gates are detected.
Circuit architecture: Implementation of a particular device architecture in which to run the circuit.
Advanced error map: Control over T1 and T2 as a device-like enviroment. Map to a similar error model as found in an actual device.

These are some of the most relevant things that I can come up with both as short and long term implementations. But this is open for anyone to comment on other issues that we may find interesting to add to qibo on an end user perspective.

File-based solution for large circuits

Given that the computation of circuits with more than 30 qubits requires non negligible amount of memory, we can consider a file-based solution such as Parquet, with I.O controlled through Pandas and the orchestration of parallel files processing through Dask.

Prepare qibo landing page

Errors without density matrices

As discussed in the previous meeting, an error simulation that does not require more memory is a good alternative in order to test noise effects on larger quantum circuits.

This approach is not fast, as one will have to compute the noisy circuit every time a measurement is made, but might be the only alternative in cases where there is not enough memory to store the density matrices.

Make sure code coverage is acceptable

Passing a list, an array or a range as argument to the Measurement gate

I think it would be useful to be able to pass a list, an array, a range or similar as argument to the measurement gate. I encountered this need when programing a quantum classifier with qibo. Depending on the number of clases to be classified, the number of qubits measured is different. In order to automatize which qubits are measured as a function of the number of clases, this feature is desirable.

Circuit summary and gate counting

A useful feature when designing quantum circuits is the ability to analyze the scaling of different types of gates as the circuit size grows.

A way to access the number of Toffoli gates (three-qubit), CNOT gates (two-qubit) and single qubit gates queued in a given circuit would be a welcome addition to Qibo.

Additionally, maybe latter down the line, an automated way to know the real depth of the circuit would also be useful. By real depth I refer to depth accounting for simultaneous application of gates in the real device. That is, applying Hadamard gates to a quantum register would incur only 1 depth.

Adding noisy circuits

Hi all,
as we talked yesterday, I write some facts about noisy circuits.
Google's cirq uses three different methods for simulating wavefunctions:

Standard wavefunction
Mixture: several independent wavefunctions and the probability of occuring
Density matrix
Cirq/cirq/sim has got the files for the wf and the dm. Mixtures are defined in Cirq/cirq/protocols

I think if we want to implement noisy circuits, the optimal general approach should be 3) Density matrix. mixtures could become intractable if many errors happen.

I was playing around with this kind of tensors, an einsum strategy could work in this setting. For controlled gates, einsum would work with proper indices. On the contrary, an slicing strategy like the one in QCGPU should be done in different steps.

This can stay here until QIBO is ready to implement noisy circuits.

Cheers

Applying gates in parallel

Currently Circuit is just a list and Circuit.execute() applies each gate from the list sequentially. Alternatively one can apply gates that act on different qubits in parallel, for example in

c = Circuit(4)
c.add(X(0))
c.add(H(1))
c.add(Y(2))

all gates would be applied simultaneously in a hardware implementation.

Mathematically it is optimal to apply gates sequentially rather than composing gates to a large matrix. The cost of applying M gates on N qubits sequentially is O(M * 2^N), while composing a 2^M x 2^M matrix and applying this is O(2 ^ (N + M)).

From an implementation point of view, it may be beneficial to group gates to moments (as in Cirq) and include multiple gates acting on different qubits in one einsum call. Note that both numpy and tensorflow einsum support multiple inputs. This would require to change the Circuit data structure a little bit and use a nested instead of a simple list.

I am leaving this open for now as its implementation would depend on whether we stick to einsum or switch to matmul or something else (eg. the result of #14).

Importing qibo.gates

While from qibo import gates works, doing from qibo.gates import * fails with ModuleNotFoundError: No module named 'qibo.gates'. This is probably because we there is no file named gates.py in the source directory and gates is imported indirectly in __init__.

Add unit testing examples

Noise as a gate vs. Noise everywhere

Quick issue to outline the noise discussion we had in the meeting.

Having the possibility to add noise exactly where we want to as an extra gate is really useful. However, if one wanted to add noise to an already designed circuit, adding a gate after every gate of the system would be very time consuming.

Therefore, an option to run an existing (noiseless) circuit with an extra variable to automatically implement a noise gate after every circuit gate would be really helpful.

Add QFT test

Implement github workflow action for pip package upload

Fix nvcc compilation on AWS

Gate order optimization for multi-GPU circuits

The main bottleneck of the multi-GPU implementation is swapping the global qubits. In some circuits the number of swaps required can be reduced by rearranging commuting gates. It would be useful to implement an algorithm that does this automatically to the extend that this is possible. I don't know if there is a general algorithm that can find the optimal gate arrangement for any circuit.

Complete documentation

Review CPU performance

As soon as we complete the basic setup for GPUs, we should remember to check again the CPU performance and eventually apply changes to obtain the best performance.

Measurement gate order.

As commented earlier, a small issue with the measurement gates:

the output order should match the order in which the measurement gates are applied.

This might not be relevant to all computations but in the hash program we rely on relabeling qubits in order to simulate bit shuffling.

Tensorflow backend

Implement a tensorflow backend and testing.

Add tf.Variable tests for parametrized gates

Our parametrized gates should work if we pass a tf.Variable for theta instead of a constant float value. We should add a test for this and also check if automatic differentiation works properly.

Is there an extra phase in the gates?

We have discussed about the definitions of Pauli gates with regard to the overall phase. On our side, we define gates differently:

R_x(t) = [[cos(t / 2), -i sin(t / 2)],[-i sin (t / 2), cos(t / 2)]]
R_y(t) = [[cos(t / 2), -sin(t / 2)],[sin (t / 2), cos(t / 2)]]
R_x(t) = [[e ^ (-i t / 2), 0],[0, e ^ (i t / 2)]]

The reason why we use this convention is as follows:

R_j(t) = exp(-i t / 2 sigma_j) = I - i t / 2 sigma_j - 1 / 2 (t / 2)^2 I + ... (Taylor series)
= cos(t / 2) I - i sin(t / 2) sigma_j

Thus, there is no place for the overall complex phase (although it could be added somewhat artificially). In addition we think that getting this phase represents an extra computation step.

What do you think about it?

Implement conda recipe in conda-forge

Complete TF gates

Implement github workflow

Implement switcher between complex64 and complex128

I think we need something like qibo.set_precision('complex64') function which provides a programmatic mechanism to switch between simple and double precision simulation.

Measurements fail on GPU

As requested in #30, I did some GPU benchmarks of the newly implemented measurements and there are some interesting findings.

First, I used a supremacy-like circuit with alternating layers of random one-qubit and fixed two-qubit gates. Although the exact gate sets are different than the ones from the experiment (for code simplification), I expect no change in performance as we are still doing matrix multiplications with the same sizes (just different numbers).

The first observation is that without measurements, just simple state vector simulation, our difference with Cirq on GPU is better than the QFT:

When we add measurements we get the following plot

which shows two problems:

(A) tf.random.categorical which we currently use for sampling measurements fails on GPU for more than 18 qubits as it tries to allocate a large tensor of shape (1, nshots, 2 ** nqubits). This does not happen on CPU where you can sample distributions up to 2^30 without problem. Particularly the code

probs = np.random.random(2 ** n)
probs = probs / probs.sum()
samples = tf.random.categorical(np.log(probs)[np.newaxis], 10000)[0]

works on CPU but fails on GPU.

(B) Compiling becomes slower than Eager when measurements are used. This is not true without measurements as shown in the plot above.

I tried several things and I think there are easy solutions to both problems. For (A) we can move the sampling part on CPU using with tf.device("/:CPU0"): and for (B) we can move the measurement gate action outside of the function that gets compiled. With these two fixes the GPU performance with measurements becomes similar to the state vector case:

@scarrazza if these two solutions are acceptable I can open a short PR that implements them (I already have the code). Otherwise, if for some reason we want to keep sampling on GPU, we may have to look for an alternative to tf.random.categorical that is better implemented on GPUs and then check again if the compile problem remains.

Test gates using Cirq

We should add tests that ensure that our custom gates action agrees with the equivalent gates from Cirq or any other similar library. This will help identify potential bugs in our custom operators.

It may be possible to use this testing mechanism for benchmarks as well.

From qasm

As was introduced in the last meeting, similarly to the to_qasm module, qibo could also include a from_qasm function that reads a qasm file and transforms it into qibo circuit.

This could be useful in order to transfer circuits from outside to qibo (either generated from another library or found elsewhere) and also potentially as a way to receive data from the quantum device and get it back in qibo as well.

Custom operator for GPU

Check if we can reproduce the GCGPU approach (slicing+in place amplitude update) using a custom tf operator with cuda, following the approach presented here:

https://www.tensorflow.org/guide/create_op

Multi-controlled gates decomposition

As commented in today's meeting, in order to properly simulate the errors present on a real machine, there has to be a way to decompose multi-controlled gates up to Toffoli, CNOT and single qubit gates.

There has been a recent implementation in the new cirq version:
https://cirq.readthedocs.io/en/stable/generated/cirq.decompose_multi_controlled_x.html?highlight=decompose
(as one example, there are a few more decompositions)

I implemented a function that specifically decomposed a multi-CNOT gate with one work qubit present in the Hash notebooks:

def n_mCNOT(controls, target, work):
    i = 0
    yield gates.TOFFOLI(controls[-1], work[-1], target)
    for i in range(1,len(controls)-2):
        yield gates.TOFFOLI(controls[-1-i], work[-1-i], work[-1-i+1])
    yield gates.TOFFOLI(controls[0], controls[1], work[-1-i])
    for i in reversed(range(1,len(controls)-2)):
        yield gates.TOFFOLI(controls[-1-i], work[-1-i], work[-1-i+1])
    yield gates.TOFFOLI(controls[-1], work[-1], target)
    for i in range(1,len(controls)-2):
        yield gates.TOFFOLI(controls[-1-i], work[-1-i], work[-1-i+1])
    yield gates.TOFFOLI(controls[0], controls[1], work[-1-i])
    for i in reversed(range(1,len(controls)-2)):
        yield gates.TOFFOLI(controls[-1-i], work[-1-i], work[-1-i+1])

def n_2CNOT(controls, target, work):
    m1 = int(((len(controls)+2)/2)+0.5)
    m2 = int(len(controls)+2-m1-1)
    yield n_mCNOT(controls[0:m1], work, controls[m1:len(controls)]+[target])
    yield n_mCNOT((controls+[work])[m1:m1+m2], target, controls[0:m1])
    yield n_mCNOT(controls[0:m1], work, controls[m1:len(controls)]+[target])
    yield n_mCNOT((controls+[work])[m1:m1+m2], target, controls[0:m1])

Both my implementation and cirq's are based on the following paper:
https://arxiv.org/abs/quant-ph/9503016

I think implementing something similar is important for error studies and scaling purposes.

QuantumFlow: Similar project?

I accidentally found QuantumFlow yesterday, which seems to be similar to QIBO in the sense that they simulate circuits using various backends. They have a numpy and a Tensorflow backend. I have not tried using it yet but the idea seems very similar to QIBO and they also say that Tensorflow can be used for gradient descent, etc. It seems that it is an old Rigetti project and the code looks much more complicated than QIBO. We may want to check how similar we are and also run some benchmarks.

Multi-GPU approach

Here an example of tf + multi-gpu setup:
https://arxiv.org/abs/2002.12921

Thermal relaxation and measurement errors

Hi all,

I am including some issues about errors.

First of all: measurements error
As far as we know, some others simulation packages include measurement errors in two differnt ways. One way is to apply a channel before applying the gate. The other way is to perform a measurement and then flip the result with certain probability. This probability does not have to be the same for 0-> 1 and 1-> 0, so it allows non-symmetric errors. Is it possible to implement this kind of errors?

Second: thermal relaxation errors
These errors are designed to simulate the interaction between the qubit and the environment. It is inherited from NMR: https://en.wikipedia.org/wiki/Relaxation_%28NMR%29, https://arxiv.org/pdf/1609.04485.pdf . It can be understood as "how the qubit loses its quantum behaviour", however I am not an expert of this file. Take a look at that and tell me if this makes sense to you.

Implement measurements

@scarrazza, I am opening to discuss how we would like measurements to be implemented. The main question is whether we want to reuse qubits after measurements.

If not, then all measurements should happen at the end of simulation and are equivalent to sampling the final wave function. This allows to efficiently sample as many samples (shots) as we want, without having to re-simulate the circuit.

If we want to allow reusing measured qubits then we have to implement measurements as gates that project the state to the measured value. Since the measurement is non-deterministic, getting many shots would require to run the simulation many times. Therefore this approach is less efficient but at the same time it will be more robust when we add noise.

qiboteam / qibo Goto Github PK

qibo's People

Contributors

Stargazers

Watchers

Forkers

qibo's Issues

Basic:

Advanced:

Recommend Projects

Recommend Topics

Recommend Org