We should benchmark the performance in single precision vs that in double precision an

<a class="user-mention notranslate" data-hovercard-type="user" data-hover

Which function call did you benchmark? e.g. exactly <code class="no

Single precision vs double precision performance,about qiboteam/qibojit

Comments (19)

stavros11 commented on August 12, 2024

@stavros11 when you have time, could you post your results with multiqubit gates?

Thanks for opening the issue. Here are the results for the multiqubit benchmark from #51 on CPU:

nqubits=18 - simulation times

ntargets	qibo single	qibo double	qibojit single	qibojit double	qiskit single	qiskit double
3	0.00252	0.00293	0.00279	0.00293	0.00255	0.00273
4	0.00257	0.00386	0.00345	0.00363	0.00316	0.00374
5	0.00523	0.00559	0.00499	0.00553	0.00391	0.00476
6	0.01301	0.01194	0.01284	0.01184	0.00863	0.00972
7	0.02037	0.01597	0.02032	0.01590	0.04467	0.04942
8	0.03582	0.02775	0.03580	0.02734	0.16707	0.18929
9	0.06475	0.04812	0.06510	0.04808	0.88611	0.91469
10	0.12337	0.07982	0.12047	0.11339	5.63898	6.16987

nqubits=19 - simulation times

ntargets	qibo single	qibo double	qibojit single	qibojit double	qiskit single	qiskit double
3	0.00480	0.00435	0.00433	0.00443	0.00324	0.00362
4	0.00645	0.00667	0.00638	0.00695	0.00356	0.00470
5	0.01023	0.01049	0.00996	0.01044	0.00552	0.00699
6	0.02816	0.02578	0.02810	0.02563	0.01156	0.01446
7	0.04366	0.03425	0.04413	0.03410	0.06701	0.07808
8	0.09011	0.06150	0.07839	0.06002	0.21946	0.26088
9	0.14856	0.11064	0.14990	0.11285	1.12933	1.11194
10	0.26818	0.19493	0.23539	0.21493	6.54414	7.63080

nqubits=20 - simulation times

ntargets	qibo single	qibo double	qibojit single	qibojit double	qiskit single	qiskit double
3	0.00847	0.00832	0.00885	0.00893	0.00448	0.00557
4	0.01322	0.01523	0.01336	0.01410	0.00565	0.00783
5	0.02097	0.01520	0.02083	0.02229	0.00867	0.01183
6	0.06118	0.05561	0.06151	0.05508	0.01762	0.02485
7	0.09197	0.07478	0.11024	0.07456	0.12033	0.13969
8	0.16788	0.09990	0.18165	0.13632	0.29780	0.40871
9	0.32098	0.20416	0.26321	0.22551	1.48667	1.46820
10	0.32529	0.38266	0.58053	0.35156	7.72730	8.52348

nqubits=21 - simulation times

ntargets	qibo single	qibo double	qibojit single	qibojit double	qiskit single	qiskit double
3	0.01766	0.02236	0.01769	0.02222	0.00683	0.00936
4	0.02769	0.03944	0.02949	0.02284	0.00920	0.01479
5	0.04744	0.05407	0.04725	0.05816	0.01613	0.02674
6	0.13562	0.09652	0.13764	0.12275	0.03251	0.04883
7	0.22272	0.12269	0.21994	0.18551	0.22816	0.30534
8	0.37997	0.28580	0.37997	0.29958	0.49158	0.67552
9	0.46696	0.50157	0.67388	0.31254	2.14137	2.14235
10	0.74231	0.79214	0.79269	0.85835	9.66886	15.46075

nqubits=22 - simulation times

ntargets	qibo single	qibo double	qibojit single	qibojit double	qiskit single	qiskit double
3	0.04161	0.04240	0.04167	0.04218	0.01237	0.01709
4	0.06720	0.07693	0.07538	0.08310	0.01839	0.02563
5	0.12196	0.13427	0.12148	0.13557	0.03189	0.04959
6	0.20233	0.26014	0.26395	0.23517	0.06211	0.09527
7	0.32837	0.37219	0.41109	0.24786	0.47313	0.35765
8	0.52610	0.58221	0.80508	0.51965	0.84729	1.28193
9	0.85750	0.87660	0.97282	0.78351	3.37298	3.36922
10	1.61882	1.59471	1.58853	1.66941	12.36304	24.87413

nqubits=23 - simulation times

ntargets	qibo single	qibo double	qibojit single	qibojit double	qiskit single	qiskit double
3	0.04492	0.09326	0.08448	0.06830	0.02291	0.03834
4	0.15546	0.17647	0.15955	0.16432	0.03577	0.05503
5	0.27392	0.25915	0.19172	0.24949	0.06591	0.10277
6	0.54600	0.45729	0.64013	0.48950	0.12116	0.18796
7	0.71717	0.65344	0.61897	0.78138	0.57905	0.74864
8	1.07134	0.98115	1.09886	1.19669	1.98408	2.51820
9	2.20403	1.54533	1.85110	1.62834	5.86612	5.98414
10	3.48476	3.47243	3.40456	3.38084	18.15899	42.43832

nqubits=24 - simulation times

ntargets	qibo single	qibo double	qibojit single	qibojit double	qiskit single	qiskit double
3	0.11587	0.15120	0.18880	0.21074	0.05287	0.08805
4	0.24533	0.35044	0.30943	0.30915	0.07027	0.12071
5	0.53195	0.58445	0.43483	0.56411	0.13495	0.20928
6	1.19677	1.09115	1.10584	1.02924	0.24610	0.34992
7	1.37310	1.16172	1.42246	1.22445	1.41638	1.23091
8	2.55380	2.01835	2.28003	1.91831	3.50500	5.66647
9	4.18953	3.32717	3.98467	3.47910	10.78300	11.75414
10	7.22219	7.11032	7.72539	6.52692	28.78554	73.70552

nqubits=25 - simulation times

ntargets	qibo single	qibo double	qibojit single	qibojit double	qiskit single	qiskit double
3	0.29793	0.31643	0.20984	0.36482	0.10907	0.24589
4	0.60118	0.59639	0.51942	0.58661	0.15394	0.26426
5	1.04315	1.27955	1.14314	1.26141	0.25628	0.38248
6	2.11482	2.18827	1.98463	2.08555	0.46081	0.71935
7	2.77332	2.47914	2.94143	2.32125	2.50428	2.56939
8	4.86569	4.05328	4.74529	4.07807	7.38578	11.53120
9	8.43331	6.88069	8.43319	6.86786	23.06335	22.85148
10	15.40516	14.79330	15.42111	13.21936	51.77071	133.96834

For qiskit single precision is faster than double, as expected, while for qibo there is no difference and in some cases single is even slower.

from qibojit.

stavros11 commented on August 12, 2024

Here are some numbers for single vs double precision for various custom operators on CPU:

apply_gate

nqubits	single (64 threads)	double (64 threads)	single (1 thread)	double (1 thread)
10	0.00003	0.00003	0.00001	0.00001
11	0.00003	0.00003	0.00001	0.00001
12	0.00003	0.00003	0.00001	0.00001
13	0.00003	0.00003	0.00002	0.00002
14	0.00003	0.00003	0.00003	0.00004
15	0.00003	0.00004	0.00007	0.00007
16	0.00004	0.00004	0.00013	0.00013
17	0.00005	0.00006	0.00025	0.00025
18	0.00007	0.00010	0.00050	0.00050
19	0.00012	0.00015	0.00100	0.00099
20	0.00017	0.00026	0.00197	0.00200
21	0.00033	0.00061	0.00396	0.00410
22	0.00059	0.00132	0.00791	0.00815
23	0.00121	0.00366	0.01579	0.01631
24	0.00334	0.00609	0.03154	0.03221
25	0.00635	0.00891	0.06291	0.08589
26	0.00987	0.01903	0.14184	0.17122
27	0.02124	0.03823	0.28522	0.34151
28	0.04245	0.07628	0.56901	0.68857

apply_x

nqubits	single (64 threads)	double (64 threads)	single (1 thread)	double (1 thread)
10	0.00005	0.00003	0.00000	0.00000
11	0.00003	0.00003	0.00001	0.00001
12	0.00005	0.00003	0.00001	0.00001
13	0.00003	0.00003	0.00001	0.00001
14	0.00003	0.00003	0.00002	0.00002
15	0.00003	0.00003	0.00003	0.00003
16	0.00003	0.00004	0.00006	0.00005
17	0.00004	0.00005	0.00011	0.00010
18	0.00006	0.00008	0.00022	0.00020
19	0.00008	0.00013	0.00043	0.00063
20	0.00014	0.00025	0.00085	0.00086
21	0.00024	0.00055	0.00173	0.00201
22	0.00061	0.00137	0.00360	0.00400
23	0.00123	0.00360	0.00709	0.00771
24	0.00339	0.00645	0.01476	0.01651
25	0.00613	0.00811	0.02816	0.05608
26	0.00888	0.01910	0.09013	0.10297
27	0.01873	0.03655	0.17821	0.20354
28	0.03727	0.07210	0.35670	0.41257

apply_two_qubit_gate

nqubits	single (64 threads)	double (64 threads)	single (1 thread)	double (1 thread)
10	0.00003	0.00003	0.00001	0.00001
11	0.00003	0.00003	0.00001	0.00001
12	0.00003	0.00003	0.00002	0.00002
13	0.00003	0.00003	0.00004	0.00004
14	0.00003	0.00004	0.00007	0.00008
15	0.00004	0.00004	0.00014	0.00015
16	0.00005	0.00005	0.00028	0.00029
17	0.00006	0.00008	0.00055	0.00055
18	0.00010	0.00014	0.00111	0.00113
19	0.00020	0.00022	0.00220	0.00226
20	0.00036	0.00036	0.00438	0.00454
21	0.00067	0.00077	0.00878	0.00909
22	0.00110	0.00159	0.01757	0.01817
23	0.00201	0.00344	0.03469	0.03619
24	0.00465	0.00766	0.07004	0.07252
25	0.00866	0.00967	0.13983	0.17484
26	0.01276	0.02150	0.33970	0.48276
27	0.02858	0.04276	0.89865	0.97283
28	0.05609	0.09051	1.79465	1.93739

apply_multi_qubit_gate - ntargets=3

nqubits	single (64 threads)	double (64 threads)	single (1 thread)	double (1 thread)
10	0.00003	0.00003	0.00002	0.00002
11	0.00004	0.00003	0.00003	0.00003
12	0.00003	0.00004	0.00006	0.00005
13	0.00004	0.00004	0.00010	0.00010
14	0.00007	0.00005	0.00019	0.00019
15	0.00007	0.00007	0.00037	0.00038
16	0.00011	0.00010	0.00074	0.00075
17	0.00010	0.00016	0.00148	0.00150
18	0.00030	0.00026	0.00294	0.00304
19	0.00051	0.00051	0.00590	0.00604
20	0.00092	0.00082	0.01178	0.01216
21	0.00148	0.00174	0.02375	0.02441
22	0.00274	0.00306	0.04756	0.04895
23	0.00556	0.00645	0.09580	0.09791
24	0.01099	0.01273	0.19194	0.19609
25	0.02151	0.02069	0.38340	0.40367
26	0.04200	0.03610	0.78617	0.91813
27	0.07741	0.07561	1.76592	1.98405
28	0.13882	0.15438	3.84446	3.95124

apply_multi_qubit_gate - ntargets=4

nqubits	single (64 threads)	double (64 threads)	single (1 thread)	double (1 thread)
10	0.00004	0.00004	0.00003	0.00003
11	0.00004	0.00003	0.00005	0.00005
12	0.00004	0.00004	0.00009	0.00009
13	0.00004	0.00005	0.00018	0.00018
14	0.00005	0.00006	0.00035	0.00034
15	0.00010	0.00010	0.00071	0.00067
16	0.00014	0.00014	0.00135	0.00133
17	0.00018	0.00025	0.00269	0.00270
18	0.00042	0.00049	0.00540	0.00608
19	0.00085	0.00082	0.01268	0.01607
20	0.00145	0.00166	0.03330	0.03659
21	0.00225	0.00424	0.07047	0.07545
22	0.00456	0.00619	0.14903	0.15434
23	0.01973	0.01497	0.30761	0.31311
24	0.01979	0.03174	0.61445	0.62966
25	0.05212	0.06423	1.24524	1.29693
26	0.10581	0.11538	2.55039	2.41001
27	0.17856	0.16543	4.73377	4.29563
28	0.29317	0.23916	8.79219	8.72099

apply_multi_qubit_gate - ntargets=5

nqubits	single (64 threads)	double (64 threads)	single (1 thread)	double (1 thread)
10	0.00004	0.00004	0.00004	0.00004
11	0.00005	0.00004	0.00008	0.00008
12	0.00005	0.00005	0.00015	0.00015
13	0.00006	0.00006	0.00031	0.00030
14	0.00008	0.00010	0.00060	0.00060
15	0.00013	0.00014	0.00125	0.00118
16	0.00023	0.00025	0.00232	0.00237
17	0.00040	0.00040	0.00464	0.00471
18	0.00043	0.00078	0.00923	0.01119
19	0.00125	0.00112	0.04019	0.02804
20	0.00203	0.00219	0.05659	0.06124
21	0.00375	0.00773	0.12062	0.22236
22	0.01474	0.01367	0.41249	0.45251
23	0.02889	0.02938	0.85467	0.93634
24	0.05221	0.05733	1.72745	1.88799
25	0.11024	0.10648	3.42486	3.81983
26	0.18496	0.20348	7.00453	7.75527
27	0.37176	0.35060	14.17189	14.21730
28	0.67633	0.83049	26.16925	36.41165

apply_multi_qubit_gate - ntargets=6

nqubits	single (64 threads)	double (64 threads)	single (1 thread)	double (1 thread)
10	0.00006	0.00004	0.00011	0.00008
11	0.00006	0.00005	0.00021	0.00016
12	0.00007	0.00006	0.00041	0.00031
13	0.00008	0.00013	0.00080	0.00062
14	0.00018	0.00016	0.00160	0.00124
15	0.00024	0.00023	0.00320	0.00250
16	0.00052	0.00045	0.00641	0.00506
17	0.00100	0.00052	0.01308	0.00998
18	0.00120	0.00137	0.02585	0.02033
19	0.00271	0.00197	0.05230	0.04444
20	0.00402	0.00431	0.10630	0.08368
21	0.00917	0.01023	0.21530	0.30033
22	0.02224	0.02078	0.63213	0.62978
23	0.04513	0.04106	1.31287	1.31433
24	0.08741	0.08156	2.65060	2.64123
25	0.16587	0.16127	5.30112	5.19375
26	0.31596	0.31030	10.74703	10.39643
27	0.63359	0.62503	21.16741	20.80903
28	1.24665	1.26802	42.51790	41.84317

apply_multi_qubit_gate - ntargets=7

nqubits	single (64 threads)	double (64 threads)	single (1 thread)	double (1 thread)
10	0.00009	0.00008	0.00020	0.00015
11	0.00008	0.00009	0.00040	0.00030
12	0.00014	0.00009	0.00079	0.00060
13	0.00018	0.00010	0.00156	0.00121
14	0.00028	0.00024	0.00344	0.00233
15	0.00048	0.00025	0.00630	0.00467
16	0.00092	0.00071	0.01247	0.00938
17	0.00144	0.00102	0.02492	0.01889
18	0.00219	0.00190	0.05043	0.03780
19	0.00409	0.00288	0.10056	0.07645
20	0.00864	0.00657	0.20283	0.15400
21	0.01519	0.01503	0.40817	0.46041
22	0.03118	0.02859	1.02093	0.93423
23	0.06537	0.05591	2.10929	1.95459
24	0.12585	0.11040	4.22156	3.92964
25	0.23822	0.21403	8.40546	7.69408
26	0.45770	0.41094	16.92144	15.54976
27	0.94539	0.83017	33.80684	31.00753
28	1.82106	1.65167	67.47716	62.36736

apply_multi_qubit_gate - ntargets=8

nqubits	single (64 threads)	double (64 threads)	single (1 thread)	double (1 thread)
10	0.00013	0.00020	0.00038	0.00028
11	0.00014	0.00021	0.00075	0.00056
12	0.00025	0.00020	0.00148	0.00111
13	0.00025	0.00020	0.00296	0.00221
14	0.00048	0.00038	0.00591	0.00439
15	0.00090	0.00070	0.01181	0.00879
16	0.00134	0.00074	0.02359	0.01753
17	0.00235	0.00198	0.04727	0.03541
18	0.00423	0.00317	0.09454	0.07123
19	0.00735	0.00588	0.19099	0.14196
20	0.01550	0.01102	0.37985	0.28841
21	0.02907	0.02451	0.76924	0.77398
22	0.05943	0.04698	1.82490	1.57743
23	0.11351	0.09371	3.69565	3.22068
24	0.21270	0.17893	7.31751	6.34632
25	0.40647	0.34237	14.74788	12.70273
26	0.79980	0.67554	29.88017	25.75951
27	1.57292	1.33787	59.09961	51.23731
28	3.10821	2.62125	118.59848	102.75361

Apparently the issue appears only for the multi-qubit op and exists for single thread too. For other ops the difference between single and double is as expected.

from qibojit.

mlazzarin commented on August 12, 2024

Good to know, at least the problem is restricted to a specific op. Two questions:

Which function call did you benchmark? e.g. exactly apply_multi_qubit_gate or the whole multi_qubit_base?
Did you try to set parallel=False instead of setting the number of threads?

from qibojit.

stavros11 commented on August 12, 2024

Which function call did you benchmark? e.g. exactly apply_multi_qubit_gate or the whole multi_qubit_base?

I called apply_multi_qubit_gate directly and the same for the other ops. Here is the full script I used:

Benchmark script

import os
import argparse
import json
import time
import numpy as np


parser = argparse.ArgumentParser()
parser.add_argument("--nqubits", default=20, type=int)
parser.add_argument("--nreps", default=1, type=int)
parser.add_argument("--method", default="apply_gate", type=str)
parser.add_argument("--ntargets", default=1, type=int)
parser.add_argument("--precision", default="double", type=str)
parser.add_argument("--nthreads", default=None, type=int)
parser.add_argument("--filename", default=None, type=str)


def main(nqubits, nreps, method, ntargets, precision, nthreads, filename):
    if filename is not None:
        if os.path.isfile(filename):
            with open(filename, "r") as file:
                logs = json.load(file)
            print("Extending existing logs from {}.".format(filename))
        else:
            print("Creating new logs in {}.".format(filename))
            logs = []
    else:
        logs = []

    logs.append({
        "nqubits": nqubits, "nreps": nreps,
        "method": method, "ntargets": ntargets,
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")
        })

    # Create backend object
    start_time = time.time()
    from qibo import K
    logs[-1]["import_time"] = time.time() - start_time

    # Set backend precision
    K.set_precision(precision)
    logs[-1]["precision"] = K.precision
    logs[-1]["backend"] = K.name
    # Set number of threads
    if nthreads is not None:
        K.set_threads(nthreads)
    logs[-1]["nthreads"] = K.nthreads

    # Calculate ``qubits`` tensor
    targets = list(range(ntargets))
    qubits = [nqubits - q - 1 for q in targets]
    #qubits.extend(nqubits - q - 1 for q in controls)
    qubits = K.cast(sorted(qubits), dtype="int32")

    # Generate ``gate`` matrix if needed
    if method in ("apply_gate", "apply_two_qubit_gate", "apply_multi_qubit_gate"):
        shape = 2 * (2 ** len(targets),)
        gate = K.cast(np.random.random(shape) + 1j * np.random.random(shape))
        executor = lambda x: getattr(K, method)(x, gate, nqubits, targets, qubits)
    else:
        executor = lambda x: getattr(K, method)(x, nqubits, targets, qubits)

    start_time = time.time()
    state  = K.initial_state(nqubits)
    logs[-1]["initial_state_time"] = time.time() - start_time
    start_time = time.time()
    state = executor(state)
    logs[-1]["dry_run_time"] = time.time() - start_time
    logs[-1]["dtype"] = str(state.dtype)
    del(state)

    logs[-1]["simulation_times"] = []
    for _ in range(nreps):
        state  = K.initial_state(nqubits)
        start_time = time.time()
        state = executor(state)
        logs[-1]["simulation_times"].append(time.time() - start_time)
        del(state)
    logs[-1]["simulation_time_mean"] = np.mean(logs[-1]["simulation_times"])
    logs[-1]["simulation_time_std"] = np.std(logs[-1]["simulation_times"])

    for k, v in logs[-1].items():
        print("{}: {}".format(k, v))
    print()

    if filename is not None:
        with open(filename, "w") as file:
            json.dump(logs, file)


if __name__ == "__main__":
    args = vars(parser.parse_args())
    main(**args)

Did you try to set parallel=False instead of setting the number of threads?

No, I haven't tried that. As shown in the above script I use

from qibo import K
K.set_threads(nthreads)

in order to set the number of threads. I also haven't tried running this on GPU but in principle the same script would work.

EDIT: Here are some results with parallel=False and single thread

apply_multi_qubit_gate - ntargets=3

nqubits	single	double
10	0.00002	0.00002
11	0.00003	0.00003
12	0.00005	0.00005
13	0.00010	0.00010
14	0.00019	0.00019
15	0.00037	0.00038
16	0.00074	0.00075
17	0.00148	0.00149
18	0.00296	0.00296
19	0.00591	0.01222
20	0.01182	0.01196
21	0.02399	0.02434
22	0.04796	0.04896
23	0.09642	0.09773
24	0.19295	0.19594
25	0.38674	0.40418
26	0.78994	0.89543
27	1.77743	1.97553

apply_multi_qubit_gate - ntargets=4

nqubits	single	double
10	0.00003	0.00003
11	0.00005	0.00005
12	0.00009	0.00009
13	0.00017	0.00018
14	0.00035	0.00035
15	0.00073	0.00073
16	0.00142	0.00140
17	0.00278	0.00279
18	0.00567	0.00566
19	0.01255	0.01666
20	0.03295	0.03674
21	0.07555	0.07934
22	0.14994	0.15966
23	0.30840	0.32393
24	0.62203	0.65096
25	1.25948	1.34278
26	2.57411	2.47533
27	4.86673	4.34140

apply_multi_qubit_gate - ntargets=5

nqubits	single	double
10	0.00004	0.00004
11	0.00008	0.00008
12	0.00015	0.00015
13	0.00029	0.00030
14	0.00059	0.00060
15	0.00118	0.00115
16	0.00228	0.00235
17	0.00458	0.00464
18	0.00912	0.01179
19	0.02251	0.02766
20	0.05465	0.06129
21	0.12234	0.22476
22	0.39308	0.43870
23	0.80829	0.89043
24	1.64349	1.81944
25	3.27383	3.67461
26	6.82895	7.51540

apply_multi_qubit_gate - ntargets=6

nqubits	single	double
10	0.00006	0.00007
11	0.00011	0.00013
12	0.00022	0.00025
13	0.00044	0.00050
14	0.00088	0.00099
15	0.00177	0.00200
16	0.00355	0.00427
17	0.00723	0.00826
18	0.01462	0.01709
19	0.02899	0.03485
20	0.06035	0.07053
21	0.11944	0.37921
22	0.69294	0.77325
23	1.45051	1.58418
24	2.98322	3.17140
25	6.04239	6.37544
26	12.26439	13.06399

apply_multi_qubit_gate - ntargets=7

nqubits	single	double
10	0.00007	0.00009
11	0.00014	0.00016
12	0.00027	0.00033
13	0.00053	0.00064
14	0.00107	0.00126
15	0.00211	0.00252
16	0.00428	0.00507
17	0.00866	0.01026
18	0.01786	0.02033
19	0.03554	0.04163
20	0.07032	0.08453
21	0.14418	0.43070
22	0.78342	0.89268
23	1.62434	1.82199
24	3.36772	3.65349
25	6.83152	7.30669
26	14.06274	15.98979

apply_multi_qubit_gate - ntargets=8

nqubits	single	double
10	0.00009	0.00012
11	0.00018	0.00024
12	0.00034	0.00048
13	0.00067	0.00096
14	0.00135	0.00197
15	0.00275	0.00372
16	0.00552	0.00764
17	0.01094	0.01451
18	0.02138	0.03014
19	0.04558	0.06331
20	0.09554	0.13397
21	0.18050	0.54658
22	0.91581	1.08427
23	1.82818	2.20956
24	3.75763	4.48631
25	7.77429	9.19321
26	15.59674	18.21344

from qibojit.

mlazzarin commented on August 12, 2024

I checked the parallelization logs, and I don't see anything wrong in them. (even if I don't understand all details)

================================================================================
 Parallel Accelerator Optimizing:  Function apply_multi_qubit_gate_kernel, 
.../qibojit/src/qibojit/custom_operators/gates.py (381)  
================================================================================


Parallel loop listing for  Function apply_multi_qubit_gate_kernel, .../qibojit/src/qibojit/custom_operators/gates.py (381) 
-----------------------------------------------------------------------------------------|loop #ID
@njit(["complex64[:](complex64[:], complex64[:,::1], int32[:], int64, int64[:])",        | 
       "complex128[:](complex128[:], complex128[:,::1], int32[:], int64, int64[:])"],    | 
       parallel=True, nogil=True,                                                        | 
       cache=True)                                                                       | 
def apply_multi_qubit_gate_kernel(state, gate, qubits, nstates, targets):                | 
    nsubstates = 1 << len(targets)                                                       | 
    for g in prange(nstates):  # pylint: disable=not-an-iterable-------------------------| #41
        ig = multicontrol_index(g, qubits)                                               | 
        buffer = np.empty(nsubstates, dtype=state.dtype)                                 | 
        for i in range(nsubstates):                                                      | 
            t = ig - multitarget_index(nsubstates - i - 1, targets)                      | 
            buffer[i] = state[t]                                                         | 
        for i in range(nsubstates):                                                      | 
            t = ig - multitarget_index(nsubstates - i - 1, targets)                      | 
            state[t] = np.dot(gate[i], buffer)-------------------------------------------| #40
    return state                                                                         | 
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
----------------------------- Before Optimisation ------------------------------
Parallel region 0:
+--41 (parallel)
   +--40 (parallel)


--------------------------------------------------------------------------------
------------------------------ After Optimisation ------------------------------
Parallel region 0:
+--41 (parallel)
   +--40 (serial)


 
Parallel region 0 (loop #41) had 0 loop(s) fused and 1 loop(s) serialized as 
part of the larger parallel loop (#41).
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
 
---------------------------Loop invariant code motion---------------------------
Allocation hoisting:
The memory allocation derived from the instruction at 
.../qibojit/src/qibojit/custom_operators/gates.py (389) is 
hoisted out of the parallel loop labelled #41 (it will be performed before the 
loop is executed and reused inside the loop):
   Allocation:: buffer = np.empty(nsubstates, dtype=state.dtype)
    - numpy.empty() is used for the allocation.

Instruction hoisting:
loop #41:
  Has the following hoisted:
    $100load_global.1 = global(range: <class 'range'>)
    $114load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fd5b0ae0040>))
    $const122.8 = const(int, 1)
    $const18.7.2435 = const(int, 0)
    $np_typ_var.2470 = const(str, complex128)
    $26load_global.2 = global(multicontrol_index: CPUDispatcher(<function multicontrol_index at 0x7fd5dc117040>))
    $36load_global.6 = global(np: <module 'numpy' from '.../numpy/__init__.py'>)
    $38load_attr.7 = getattr(value=$36load_global.6, attr=empty)
    $52load_global.13 = global(range: <class 'range'>)
    $66load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fd5b0ae0040>))
    $const74.8 = const(int, 1)
  Failed to hoist the following:
    not pure: $104call_function.3 = call $push_global_to_block.2576(nsubstates, func=$push_global_to_block.2576, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: $106get_iter.4 = getiter(value=$104call_function.3)
    dependency: $108for_iter.2 = iternext(value=$106get_iter.4)
    dependency: $i.1.2494 = pair_first(value=$108for_iter.2)
    dependency: $108for_iter.4 = pair_second(value=$108for_iter.2)
    dependency: $120binary_subtract.7 = nsubstates - $i.1.2494
    dependency: $124binary_subtract.9 = $120binary_subtract.7 - $const122.8
    dependency: $128call_function.11 = call $push_global_to_block.2577($124binary_subtract.9, targets, func=$push_global_to_block.2577, args=[Var($124binary_subtract.9, gates.py:394), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: t.1 = ig - $128call_function.11
    dependency: $a.2426.2495 = getitem(value=gate, index=$i.1.2494, fn=<built-in function getitem>)
    dependency: $16load_attr.6.2434 = getattr(value=$a.2426.2495, attr=shape)
    dependency: l.2437 = getitem(value=$16load_attr.6.2434, index=$const18.7.2435, fn=<built-in function getitem>)
    dependency: $dtype_attr_var.2471 = getattr(value=$np_g_var.2469, attr=dtype)
    dependency: $dtype_var.2472 = call $dtype_attr_var.2471($np_typ_var.2470, func=$dtype_attr_var.2471, args=[Var($np_typ_var.2470, gates.py:389)], kws=(), vararg=None, target=None)
    dependency: g = $parfor__index_2489.2560
    dependency: ig = call $push_global_to_block.2572($parfor__index_2489.2560, qubits, func=$push_global_to_block.2572, args=[Var($parfor__index_2489.2560, <string>:2), Var(qubits, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: buffer = call $push_getattr_to_block.2573(nsubstates, func=$push_getattr_to_block.2573, args=[Var(nsubstates, gates.py:386)], kws=[('dtype', Var($dtype_var.2472, gates.py:389))], vararg=None, target=None)
    not pure: $56call_function.15 = call $push_global_to_block.2575(nsubstates, func=$push_global_to_block.2575, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: $58get_iter.16 = getiter(value=$56call_function.15)
    dependency: $60for_iter.2 = iternext(value=$58get_iter.16)
    dependency: $i.2497 = pair_first(value=$60for_iter.2)
    dependency: $60for_iter.4 = pair_second(value=$60for_iter.2)
    dependency: $72binary_subtract.7 = nsubstates - $i.2497
    dependency: $76binary_subtract.9 = $72binary_subtract.7 - $const74.8
    dependency: $80call_function.11 = call $push_global_to_block.2578($76binary_subtract.9, targets, func=$push_global_to_block.2578, args=[Var($76binary_subtract.9, gates.py:391), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: t = ig - $80call_function.11
    dependency: $90binary_subscr.15 = getitem(value=state, index=t, fn=<built-in function getitem>)

Besides parallelization, I tried to inspect the intermediate representation of such kernel and I believe types are inferred properly:

>>> gates.apply_multi_qubit_gate_kernel.inspect_types()
apply_multi_qubit_gate_kernel (array(complex64, 1d, A), array(complex64, 2d, C), array(int32, 1d, A), int64, array(int64, 1d, A))
--------------------------------------------------------------------------------
# File: .../qibojit/src/qibojit/custom_operators/gates.py
# --- LINE 381 --- 

@njit(["complex64[:](complex64[:], complex64[:,::1], int32[:], int64, int64[:])",

       # --- LINE 382 --- 

       "complex128[:](complex128[:], complex128[:,::1], int32[:], int64, int64[:])"],

       # --- LINE 383 --- 

       parallel=True,

       # --- LINE 384 --- 

       cache=True)

# --- LINE 385 --- 

def apply_multi_qubit_gate_kernel(state, gate, qubits, nstates, targets):

    # --- LINE 386 --- 
    # label 0
    #   state = arg(0, name=state)  :: array(complex64, 1d, A)
    #   gate = arg(1, name=gate)  :: array(complex64, 2d, C)
    #   qubits = arg(2, name=qubits)  :: array(int32, 1d, A)
    #   nstates = arg(3, name=nstates)  :: int64
    #   targets = arg(4, name=targets)  :: array(int64, 1d, A)
    #   $const2.0 = const(int, 1)  :: Literal[int](1)
    #   $4load_global.1 = global(len: <built-in function len>)  :: Function(<built-in function len>)
    #   $8call_function.3 = targets_size0.647  :: int64
    #   del $4load_global.1
    #   nsubstates = $const2.0 << targets_size0.647  :: int64
    #   del $const2.0
    #   del $8call_function.3

    nsubstates = 1 << len(targets)

    # --- LINE 387 --- 
    #   $14load_global.5 = global(prange: <class 'numba.misc.special.prange'>)  :: Function(<class 'numba.misc.special.prange'>)
    #   $18call_function.7 = call $14load_global.5(nstates, func=$14load_global.5, args=[Var(nstates, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64,) -> range_state_int64
    #   del nstates
    #   del $14load_global.5
    #   $20get_iter.8 = getiter(value=$18call_function.7)  :: range_iter_int64
    #   del $18call_function.7
    #   $phi22.0 = $20get_iter.8  :: range_iter_int64
    #   del $20get_iter.8
    #   jump 22
    # label 22
    #   $22for_iter.1 = iternext(value=$20get_iter.8)  :: pair<int64, bool>
    #   $22for_iter.2 = pair_first(value=$22for_iter.1)  :: int64
    #   $22for_iter.3 = pair_second(value=$22for_iter.1)  :: bool
    #   del $22for_iter.1
    #   parfor_index.655 = parfor_index.655  :: uint64
    #   del $22for_iter.2
    #   branch $22for_iter.3, 24, 1597
    # label 24
    #   del $22for_iter.3
    #   g = $parfor__index_655.726  :: int64
    #   del $phi24.1

    for g in prange(nstates):  # pylint: disable=not-an-iterable

        # --- LINE 388 --- 
        #   $26load_global.2 = global(multicontrol_index: CPUDispatcher(<function multicontrol_index at 0x7fe75d6714c0>))  :: type(CPUDispatcher(<function multicontrol_index at 0x7fe75d6714c0>))
        #   ig = call $push_global_to_block.738($parfor__index_655.726, qubits, func=$push_global_to_block.738, args=[Var($parfor__index_655.726, <string>:2), Var(qubits, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64, array(int32, 1d, A)) -> int64
        #   del g
        #   del $26load_global.2

        ig = multicontrol_index(g, qubits)

        # --- LINE 389 --- 
        #   $36load_global.6 = global(np: <module 'numpy' from '.../numpy/__init__.py'>)  :: Module(<module 'numpy' from '.../numpy/__init__.py'>)
        #   $38load_attr.7 = getattr(value=$36load_global.6, attr=empty)  :: Function(<built-in function empty>)
        #   del $36load_global.6
        #   $44load_attr.10 = $dtype_var.638  :: dtype(complex64)
        #   buffer = call $push_getattr_to_block.739(nsubstates, func=$push_getattr_to_block.739, args=[Var(nsubstates, gates.py:386)], kws=[('dtype', Var($dtype_var.638, gates.py:389))], vararg=None, target=None)  :: (int64, dtype(complex64)) -> array(complex64, 1d, C)
        #   del $44load_attr.10
        #   del $38load_attr.7

        buffer = np.empty(nsubstates, dtype=state.dtype)

        # --- LINE 390 --- 
        #   $52load_global.13 = global(range: <class 'range'>)  :: Function(<class 'range'>)
        #   $56call_function.15 = call $push_global_to_block.741(nsubstates, func=$push_global_to_block.741, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64,) -> range_state_int64
        #   del $52load_global.13
        #   $58get_iter.16 = getiter(value=$56call_function.15)  :: range_iter_int64
        #   del $56call_function.15
        #   $phi60.1 = $58get_iter.16  :: range_iter_int64
        #   del $58get_iter.16
        #   jump 60
        # label 60
        #   $60for_iter.2 = iternext(value=$58get_iter.16)  :: pair<int64, bool>
        #   $i.663 = pair_first(value=$60for_iter.2)  :: int64
        #   $60for_iter.4 = pair_second(value=$60for_iter.2)  :: bool
        #   del $60for_iter.2
        #   $phi62.2 = $i.663  :: int64
        #   del $60for_iter.3
        #   branch $60for_iter.4, 62, 100
        # label 62
        #   del $60for_iter.4
        #   i = $i.663  :: int64
        #   del $phi62.2

        for i in range(nsubstates):

            # --- LINE 391 --- 
            #   $66load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))  :: type(CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))
            #   $72binary_subtract.7 = nsubstates - $i.663  :: int64
            #   $const74.8 = const(int, 1)  :: Literal[int](1)
            #   $76binary_subtract.9 = $72binary_subtract.7 - $const74.8  :: int64
            #   del $const74.8
            #   del $72binary_subtract.7
            #   $80call_function.11 = call $push_global_to_block.744($76binary_subtract.9, targets, func=$push_global_to_block.744, args=[Var($76binary_subtract.9, gates.py:391), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64, array(int64, 1d, A)) -> int64
            #   del $76binary_subtract.9
            #   del $66load_global.4
            #   t = ig - $80call_function.11  :: int64
            #   del $80call_function.11

            t = ig - multitarget_index(nsubstates - i - 1, targets)

            # --- LINE 392 --- 
            #   $90binary_subscr.15 = getitem(value=state, index=t, fn=<built-in function getitem>)  :: complex64
            #   del t
            #   buffer[$i.663] = $90binary_subscr.15  :: (array(complex64, 1d, C), int64, complex64) -> none
            #   del i
            #   del $90binary_subscr.15
            #   jump 60

            buffer[i] = state[t]

        # --- LINE 393 --- 
        # label 100
        #   del $phi62.2
        #   del $phi60.1
        #   del $60for_iter.4
        #   $100load_global.1 = global(range: <class 'range'>)  :: Function(<class 'range'>)
        #   $104call_function.3 = call $push_global_to_block.742(nsubstates, func=$push_global_to_block.742, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64,) -> range_state_int64
        #   del $100load_global.1
        #   $106get_iter.4 = getiter(value=$104call_function.3)  :: range_iter_int64
        #   del $104call_function.3
        #   $phi108.1 = $106get_iter.4  :: range_iter_int64
        #   del $106get_iter.4
        #   jump 108
        # label 108
        #   $108for_iter.2 = iternext(value=$106get_iter.4)  :: pair<int64, bool>
        #   $i.1.660 = pair_first(value=$108for_iter.2)  :: int64
        #   $108for_iter.4 = pair_second(value=$108for_iter.2)  :: bool
        #   del $108for_iter.2
        #   $phi110.2 = $i.1.660  :: int64
        #   del $108for_iter.3
        #   branch $108for_iter.4, 110, 1573
        # label 110
        #   del $108for_iter.4
        #   i.1 = $i.1.660  :: int64
        #   del $phi110.2

        for i in range(nsubstates):

            # --- LINE 394 --- 
            #   $114load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))  :: type(CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))
            #   $120binary_subtract.7 = nsubstates - $i.1.660  :: int64
            #   $const122.8 = const(int, 1)  :: Literal[int](1)
            #   $124binary_subtract.9 = $120binary_subtract.7 - $const122.8  :: int64
            #   del $const122.8
            #   del $120binary_subtract.7
            #   $128call_function.11 = call $push_global_to_block.743($124binary_subtract.9, targets, func=$push_global_to_block.743, args=[Var($124binary_subtract.9, gates.py:394), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64, array(int64, 1d, A)) -> int64
            #   del $124binary_subtract.9
            #   del $114load_global.4
            #   t.1 = ig - $128call_function.11  :: int64
            #   del $128call_function.11

            t = ig - multitarget_index(nsubstates - i - 1, targets)

            # --- LINE 395 --- 
            #   $134load_global.13 = global(np: <module 'numpy' from '.../numpy/__init__.py'>)  :: Module(<module 'numpy' from '.../numpy/__init__.py'>)
            #   $136load_method.14 = getattr(value=$134load_global.13, attr=dot)  :: Function(<function dot at 0x7fe76a92c550>)
            #   del $134load_global.13
            #   $a.592.661 = getitem(value=gate, index=$i.1.660, fn=<built-in function getitem>)  :: array(complex64, 1d, C)
            #   del i.1
            #   $146call_method.19 = call $136load_method.14($142binary_subscr.17, buffer, func=$136load_method.14, args=[Var($142binary_subscr.17, gates.py:395), Var(buffer, gates.py:389)], kws=(), vararg=None, target=None)  :: (array(complex64, 1d, C), array(complex64, 1d, C)) -> complex64
            #   del $142binary_subscr.17
            #   del $136load_method.14
            #   state[t.1] = s.610  :: (array(complex64, 1d, A), int64, complex64) -> none
            #   del t.1
            #   del $146call_method.19
            #   jump 108
            # label 156
            #   del ig
            #   del buffer
            #   del $phi110.2
            #   del $phi108.1
            #   del $108for_iter.4
            #   jump 1600

            state[t] = np.dot(gate[i], buffer)

    # --- LINE 396 --- 
    # label 158
    #   del targets
    #   del qubits
    #   del nsubstates
    #   del gate
    #   del $phi24.1
    #   del $phi22.0
    #   del $22for_iter.3
    #   $160return_value.1 = cast(value=state)  :: array(complex64, 1d, A)
    #   del state
    #   return $160return_value.1

    return state


================================================================================
apply_multi_qubit_gate_kernel (array(complex128, 1d, A), array(complex128, 2d, C), array(int32, 1d, A), int64, array(int64, 1d, A))
--------------------------------------------------------------------------------
# File: .../qibojit/src/qibojit/custom_operators/gates.py
# --- LINE 381 --- 

@njit(["complex64[:](complex64[:], complex64[:,::1], int32[:], int64, int64[:])",

       # --- LINE 382 --- 

       "complex128[:](complex128[:], complex128[:,::1], int32[:], int64, int64[:])"],

       # --- LINE 383 --- 

       parallel=True,

       # --- LINE 384 --- 

       cache=True)

# --- LINE 385 --- 

def apply_multi_qubit_gate_kernel(state, gate, qubits, nstates, targets):

    # --- LINE 386 --- 
    # label 0
    #   state = arg(0, name=state)  :: array(complex128, 1d, A)
    #   gate = arg(1, name=gate)  :: array(complex128, 2d, C)
    #   qubits = arg(2, name=qubits)  :: array(int32, 1d, A)
    #   nstates = arg(3, name=nstates)  :: int64
    #   targets = arg(4, name=targets)  :: array(int64, 1d, A)
    #   $const2.0 = const(int, 1)  :: Literal[int](1)
    #   $4load_global.1 = global(len: <built-in function len>)  :: Function(<built-in function len>)
    #   $8call_function.3 = targets_size0.815  :: int64
    #   del $4load_global.1
    #   nsubstates = $const2.0 << targets_size0.815  :: int64
    #   del $const2.0
    #   del $8call_function.3

    nsubstates = 1 << len(targets)

    # --- LINE 387 --- 
    #   $14load_global.5 = global(prange: <class 'numba.misc.special.prange'>)  :: Function(<class 'numba.misc.special.prange'>)
    #   $18call_function.7 = call $14load_global.5(nstates, func=$14load_global.5, args=[Var(nstates, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64,) -> range_state_int64
    #   del nstates
    #   del $14load_global.5
    #   $20get_iter.8 = getiter(value=$18call_function.7)  :: range_iter_int64
    #   del $18call_function.7
    #   $phi22.0 = $20get_iter.8  :: range_iter_int64
    #   del $20get_iter.8
    #   jump 22
    # label 22
    #   $22for_iter.1 = iternext(value=$20get_iter.8)  :: pair<int64, bool>
    #   $22for_iter.2 = pair_first(value=$22for_iter.1)  :: int64
    #   $22for_iter.3 = pair_second(value=$22for_iter.1)  :: bool
    #   del $22for_iter.1
    #   parfor_index.823 = parfor_index.823  :: uint64
    #   del $22for_iter.2
    #   branch $22for_iter.3, 24, 1699
    # label 24
    #   del $22for_iter.3
    #   g = $parfor__index_823.894  :: int64
    #   del $phi24.1

    for g in prange(nstates):  # pylint: disable=not-an-iterable

        # --- LINE 388 --- 
        #   $26load_global.2 = global(multicontrol_index: CPUDispatcher(<function multicontrol_index at 0x7fe75d6714c0>))  :: type(CPUDispatcher(<function multicontrol_index at 0x7fe75d6714c0>))
        #   ig = call $push_global_to_block.906($parfor__index_823.894, qubits, func=$push_global_to_block.906, args=[Var($parfor__index_823.894, <string>:2), Var(qubits, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64, array(int32, 1d, A)) -> int64
        #   del g
        #   del $26load_global.2

        ig = multicontrol_index(g, qubits)

        # --- LINE 389 --- 
        #   $36load_global.6 = global(np: <module 'numpy' from '.../numpy/__init__.py'>)  :: Module(<module 'numpy' from '.../numpy/__init__.py'>)
        #   $38load_attr.7 = getattr(value=$36load_global.6, attr=empty)  :: Function(<built-in function empty>)
        #   del $36load_global.6
        #   $44load_attr.10 = $dtype_var.806  :: dtype(complex128)
        #   buffer = call $push_getattr_to_block.907(nsubstates, func=$push_getattr_to_block.907, args=[Var(nsubstates, gates.py:386)], kws=[('dtype', Var($dtype_var.806, gates.py:389))], vararg=None, target=None)  :: (int64, dtype(complex128)) -> array(complex128, 1d, C)
        #   del $44load_attr.10
        #   del $38load_attr.7

        buffer = np.empty(nsubstates, dtype=state.dtype)

        # --- LINE 390 --- 
        #   $52load_global.13 = global(range: <class 'range'>)  :: Function(<class 'range'>)
        #   $56call_function.15 = call $push_global_to_block.909(nsubstates, func=$push_global_to_block.909, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64,) -> range_state_int64
        #   del $52load_global.13
        #   $58get_iter.16 = getiter(value=$56call_function.15)  :: range_iter_int64
        #   del $56call_function.15
        #   $phi60.1 = $58get_iter.16  :: range_iter_int64
        #   del $58get_iter.16
        #   jump 60
        # label 60
        #   $60for_iter.2 = iternext(value=$58get_iter.16)  :: pair<int64, bool>
        #   $i.831 = pair_first(value=$60for_iter.2)  :: int64
        #   $60for_iter.4 = pair_second(value=$60for_iter.2)  :: bool
        #   del $60for_iter.2
        #   $phi62.2 = $i.831  :: int64
        #   del $60for_iter.3
        #   branch $60for_iter.4, 62, 100
        # label 62
        #   del $60for_iter.4
        #   i = $i.831  :: int64
        #   del $phi62.2

        for i in range(nsubstates):

            # --- LINE 391 --- 
            #   $66load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))  :: type(CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))
            #   $72binary_subtract.7 = nsubstates - $i.831  :: int64
            #   $const74.8 = const(int, 1)  :: Literal[int](1)
            #   $76binary_subtract.9 = $72binary_subtract.7 - $const74.8  :: int64
            #   del $const74.8
            #   del $72binary_subtract.7
            #   $80call_function.11 = call $push_global_to_block.912($76binary_subtract.9, targets, func=$push_global_to_block.912, args=[Var($76binary_subtract.9, gates.py:391), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64, array(int64, 1d, A)) -> int64
            #   del $76binary_subtract.9
            #   del $66load_global.4
            #   t = ig - $80call_function.11  :: int64
            #   del $80call_function.11

            t = ig - multitarget_index(nsubstates - i - 1, targets)

            # --- LINE 392 --- 
            #   $90binary_subscr.15 = getitem(value=state, index=t, fn=<built-in function getitem>)  :: complex128
            #   del t
            #   buffer[$i.831] = $90binary_subscr.15  :: (array(complex128, 1d, C), int64, complex128) -> none
            #   del i
            #   del $90binary_subscr.15
            #   jump 60

            buffer[i] = state[t]

        # --- LINE 393 --- 
        # label 100
        #   del $phi62.2
        #   del $phi60.1
        #   del $60for_iter.4
        #   $100load_global.1 = global(range: <class 'range'>)  :: Function(<class 'range'>)
        #   $104call_function.3 = call $push_global_to_block.910(nsubstates, func=$push_global_to_block.910, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64,) -> range_state_int64
        #   del $100load_global.1
        #   $106get_iter.4 = getiter(value=$104call_function.3)  :: range_iter_int64
        #   del $104call_function.3
        #   $phi108.1 = $106get_iter.4  :: range_iter_int64
        #   del $106get_iter.4
        #   jump 108
        # label 108
        #   $108for_iter.2 = iternext(value=$106get_iter.4)  :: pair<int64, bool>
        #   $i.1.829 = pair_first(value=$108for_iter.2)  :: int64
        #   $108for_iter.4 = pair_second(value=$108for_iter.2)  :: bool
        #   del $108for_iter.2
        #   $phi110.2 = $i.1.829  :: int64
        #   del $108for_iter.3
        #   branch $108for_iter.4, 110, 1675
        # label 110
        #   del $108for_iter.4
        #   i.1 = $i.1.829  :: int64
        #   del $phi110.2

        for i in range(nsubstates):

            # --- LINE 394 --- 
            #   $114load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))  :: type(CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))
            #   $120binary_subtract.7 = nsubstates - $i.1.829  :: int64
            #   $const122.8 = const(int, 1)  :: Literal[int](1)
            #   $124binary_subtract.9 = $120binary_subtract.7 - $const122.8  :: int64
            #   del $const122.8
            #   del $120binary_subtract.7
            #   $128call_function.11 = call $push_global_to_block.911($124binary_subtract.9, targets, func=$push_global_to_block.911, args=[Var($124binary_subtract.9, gates.py:394), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64, array(int64, 1d, A)) -> int64
            #   del $124binary_subtract.9
            #   del $114load_global.4
            #   t.1 = ig - $128call_function.11  :: int64
            #   del $128call_function.11

            t = ig - multitarget_index(nsubstates - i - 1, targets)

            # --- LINE 395 --- 
            #   $134load_global.13 = global(np: <module 'numpy' from '.../numpy/__init__.py'>)  :: Module(<module 'numpy' from '.../numpy/__init__.py'>)
            #   $136load_method.14 = getattr(value=$134load_global.13, attr=dot)  :: Function(<function dot at 0x7fe76a92c550>)
            #   del $134load_global.13
            #   $a.760.828 = getitem(value=gate, index=$i.1.829, fn=<built-in function getitem>)  :: array(complex128, 1d, C)
            #   del i.1
            #   $146call_method.19 = call $136load_method.14($142binary_subscr.17, buffer, func=$136load_method.14, args=[Var($142binary_subscr.17, gates.py:395), Var(buffer, gates.py:389)], kws=(), vararg=None, target=None)  :: (array(complex128, 1d, C), array(complex128, 1d, C)) -> complex128
            #   del $142binary_subscr.17
            #   del $136load_method.14
            #   state[t.1] = s.778  :: (array(complex128, 1d, A), int64, complex128) -> none
            #   del t.1
            #   del $146call_method.19
            #   jump 108
            # label 156
            #   del ig
            #   del buffer
            #   del $phi110.2
            #   del $phi108.1
            #   del $108for_iter.4
            #   jump 1702

            state[t] = np.dot(gate[i], buffer)

    # --- LINE 396 --- 
    # label 158
    #   del targets
    #   del qubits
    #   del nsubstates
    #   del gate
    #   del $phi24.1
    #   del $phi22.0
    #   del $22for_iter.3
    #   $160return_value.1 = cast(value=state)  :: array(complex128, 1d, A)
    #   del state
    #   return $160return_value.1

    return state


================================================================================

from qibojit.

scarrazza commented on August 12, 2024

Ok, then at this point we have to check the argument types before and after the kernel execution. Btw, what happens if we disable double precision and execute in single precision? Does the code complains or runs?

from qibojit.

stavros11 commented on August 12, 2024

Something I forgot to mention above is that for the latest numbers I used the fixcast branch, however I do not think there would be any difference even if I used main.

Ok, then at this point we have to check the argument types before and after the kernel execution.

I always check the state vector type after execution and it has the correct type. The gate matrix should be passed in compatible precision because of the signature. If I pass complex128 matrix with complex64 state it will raise an error. So at least for inputs and outputs the types are correct.

Btw, what happens if we disable double precision and execute in single precision? Does the code complains or runs?

What exactly do you mean by disable? Remove the double precision signature?

from qibojit.

scarrazza commented on August 12, 2024

What exactly do you mean by disable? Remove the double precision signature?

yes, to make sure that we are passing single precision numbers.

If this test passes, then the debug should start by pruning the kernel until we find the operation which does not performs well.

from qibojit.

mlazzarin commented on August 12, 2024

If I remove the double precision signature, the code works fine with single precision but crashes with double

TypeError: No matching definition for argument type(s) array(complex128, 1d, C), array(complex128, 2d, C), array(int32, 1d, C), int64, array(int64, 1d, C)

as expected.

from qibojit.

scarrazza commented on August 12, 2024

Does the performance matches the previous single precision runs?

from qibojit.

mlazzarin commented on August 12, 2024

I believe there's no difference in the execution time (performances are a bit unstable on my laptop, but they oscillate in the same range).

from qibojit.

mlazzarin commented on August 12, 2024

I'd like to add some addition results. I think it's a bit difficult to solve this issue because we don't have a baseline that shows which is the optimal result. For example, I tried to run some benchmarks (the same as in issue #51) with qibo, numpy backend:

Are these results expected?
I think it may be interesting to run the same benchmark with all our backends that support multiqubit gates (numpy, tensorflow, qibojit) and with CPU/GPU to have a general overview of the problem.

from qibojit.

scarrazza commented on August 12, 2024

In theory I would say no, however this means that (a) the computation is not the bottleneck, or (b) for multiqubit gates there are pre or post processing done in double precision. If you can please profile the code to exclude (a).

from qibojit.

mlazzarin commented on August 12, 2024

I tried to profile my benchmark code with cProfile. Configuration with 25 qubits, 6 targets, NumPy backend.
I see that nearly all the simulation time is spent in numpy.core._multiarray_umath.c_einsum which is called by NumPy' einsum. Time per call: 6.205 s (single), 6.29 s (double)

from qibojit.

scarrazza commented on August 12, 2024

Would be great to repeat this exercise with qibojit.

from qibojit.

mlazzarin commented on August 12, 2024

I compared single vs double precision performance across different hardware and backends:

DGX

Numpy

TensorFlow

Qibojit

TensorFlow GPU

Qibojit GPU

Qiskit

Qiskit GPU

Qibomachine

Numpy

TensorFlow

Qibojit

TensorFlow GPU

Qibojit GPU

Generally speaking, we always see a consistent speed-up with single precision over double when we use GPU. Instead, with CPU we have different behaviors.
In particular, with NumPy we don't see consistent improvements, with TensorFlow we see a modest speed-up and with qibojit we see a speed-up only for ntargets < 3.

EDIT: I added the results for Qiskit, which show a moderate speed-up of single precision w.r.t. double, at least in average.

from qibojit.

scarrazza commented on August 12, 2024

@mlazzarin thank you very much. This looks quite consistent from the qibojit side. How many repetitions are you using for these numbers?

from qibojit.

mlazzarin commented on August 12, 2024

20 for NQUBITS < 21, 10 for 21<=NQUBITS<=26, 3 for NQUBITS>26.

from qibojit.

scarrazza commented on August 12, 2024

Ok, good. Then going back to #53, I assume there is a noticeable difference between us and qiskit in terms of algorithm.

from qibojit.

Single precision vs double precision performance about qibojit HOT 19 CLOSED

Comments (19)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent