Git Product home page Git Product logo

Comments (19)

stavros11 avatar stavros11 commented on August 12, 2024

@stavros11 when you have time, could you post your results with multiqubit gates?

Thanks for opening the issue. Here are the results for the multiqubit benchmark from #51 on CPU:

nqubits=18 - simulation times
ntargets qibo single qibo double qibojit single qibojit double qiskit single qiskit double
3 0.00252 0.00293 0.00279 0.00293 0.00255 0.00273
4 0.00257 0.00386 0.00345 0.00363 0.00316 0.00374
5 0.00523 0.00559 0.00499 0.00553 0.00391 0.00476
6 0.01301 0.01194 0.01284 0.01184 0.00863 0.00972
7 0.02037 0.01597 0.02032 0.01590 0.04467 0.04942
8 0.03582 0.02775 0.03580 0.02734 0.16707 0.18929
9 0.06475 0.04812 0.06510 0.04808 0.88611 0.91469
10 0.12337 0.07982 0.12047 0.11339 5.63898 6.16987
nqubits=19 - simulation times
ntargets qibo single qibo double qibojit single qibojit double qiskit single qiskit double
3 0.00480 0.00435 0.00433 0.00443 0.00324 0.00362
4 0.00645 0.00667 0.00638 0.00695 0.00356 0.00470
5 0.01023 0.01049 0.00996 0.01044 0.00552 0.00699
6 0.02816 0.02578 0.02810 0.02563 0.01156 0.01446
7 0.04366 0.03425 0.04413 0.03410 0.06701 0.07808
8 0.09011 0.06150 0.07839 0.06002 0.21946 0.26088
9 0.14856 0.11064 0.14990 0.11285 1.12933 1.11194
10 0.26818 0.19493 0.23539 0.21493 6.54414 7.63080
nqubits=20 - simulation times
ntargets qibo single qibo double qibojit single qibojit double qiskit single qiskit double
3 0.00847 0.00832 0.00885 0.00893 0.00448 0.00557
4 0.01322 0.01523 0.01336 0.01410 0.00565 0.00783
5 0.02097 0.01520 0.02083 0.02229 0.00867 0.01183
6 0.06118 0.05561 0.06151 0.05508 0.01762 0.02485
7 0.09197 0.07478 0.11024 0.07456 0.12033 0.13969
8 0.16788 0.09990 0.18165 0.13632 0.29780 0.40871
9 0.32098 0.20416 0.26321 0.22551 1.48667 1.46820
10 0.32529 0.38266 0.58053 0.35156 7.72730 8.52348
nqubits=21 - simulation times
ntargets qibo single qibo double qibojit single qibojit double qiskit single qiskit double
3 0.01766 0.02236 0.01769 0.02222 0.00683 0.00936
4 0.02769 0.03944 0.02949 0.02284 0.00920 0.01479
5 0.04744 0.05407 0.04725 0.05816 0.01613 0.02674
6 0.13562 0.09652 0.13764 0.12275 0.03251 0.04883
7 0.22272 0.12269 0.21994 0.18551 0.22816 0.30534
8 0.37997 0.28580 0.37997 0.29958 0.49158 0.67552
9 0.46696 0.50157 0.67388 0.31254 2.14137 2.14235
10 0.74231 0.79214 0.79269 0.85835 9.66886 15.46075
nqubits=22 - simulation times
ntargets qibo single qibo double qibojit single qibojit double qiskit single qiskit double
3 0.04161 0.04240 0.04167 0.04218 0.01237 0.01709
4 0.06720 0.07693 0.07538 0.08310 0.01839 0.02563
5 0.12196 0.13427 0.12148 0.13557 0.03189 0.04959
6 0.20233 0.26014 0.26395 0.23517 0.06211 0.09527
7 0.32837 0.37219 0.41109 0.24786 0.47313 0.35765
8 0.52610 0.58221 0.80508 0.51965 0.84729 1.28193
9 0.85750 0.87660 0.97282 0.78351 3.37298 3.36922
10 1.61882 1.59471 1.58853 1.66941 12.36304 24.87413
nqubits=23 - simulation times
ntargets qibo single qibo double qibojit single qibojit double qiskit single qiskit double
3 0.04492 0.09326 0.08448 0.06830 0.02291 0.03834
4 0.15546 0.17647 0.15955 0.16432 0.03577 0.05503
5 0.27392 0.25915 0.19172 0.24949 0.06591 0.10277
6 0.54600 0.45729 0.64013 0.48950 0.12116 0.18796
7 0.71717 0.65344 0.61897 0.78138 0.57905 0.74864
8 1.07134 0.98115 1.09886 1.19669 1.98408 2.51820
9 2.20403 1.54533 1.85110 1.62834 5.86612 5.98414
10 3.48476 3.47243 3.40456 3.38084 18.15899 42.43832
nqubits=24 - simulation times
ntargets qibo single qibo double qibojit single qibojit double qiskit single qiskit double
3 0.11587 0.15120 0.18880 0.21074 0.05287 0.08805
4 0.24533 0.35044 0.30943 0.30915 0.07027 0.12071
5 0.53195 0.58445 0.43483 0.56411 0.13495 0.20928
6 1.19677 1.09115 1.10584 1.02924 0.24610 0.34992
7 1.37310 1.16172 1.42246 1.22445 1.41638 1.23091
8 2.55380 2.01835 2.28003 1.91831 3.50500 5.66647
9 4.18953 3.32717 3.98467 3.47910 10.78300 11.75414
10 7.22219 7.11032 7.72539 6.52692 28.78554 73.70552
nqubits=25 - simulation times
ntargets qibo single qibo double qibojit single qibojit double qiskit single qiskit double
3 0.29793 0.31643 0.20984 0.36482 0.10907 0.24589
4 0.60118 0.59639 0.51942 0.58661 0.15394 0.26426
5 1.04315 1.27955 1.14314 1.26141 0.25628 0.38248
6 2.11482 2.18827 1.98463 2.08555 0.46081 0.71935
7 2.77332 2.47914 2.94143 2.32125 2.50428 2.56939
8 4.86569 4.05328 4.74529 4.07807 7.38578 11.53120
9 8.43331 6.88069 8.43319 6.86786 23.06335 22.85148
10 15.40516 14.79330 15.42111 13.21936 51.77071 133.96834

For qiskit single precision is faster than double, as expected, while for qibo there is no difference and in some cases single is even slower.

from qibojit.

stavros11 avatar stavros11 commented on August 12, 2024

Here are some numbers for single vs double precision for various custom operators on CPU:

apply_gate
nqubits single (64 threads) double (64 threads) single (1 thread) double (1 thread)
10 0.00003 0.00003 0.00001 0.00001
11 0.00003 0.00003 0.00001 0.00001
12 0.00003 0.00003 0.00001 0.00001
13 0.00003 0.00003 0.00002 0.00002
14 0.00003 0.00003 0.00003 0.00004
15 0.00003 0.00004 0.00007 0.00007
16 0.00004 0.00004 0.00013 0.00013
17 0.00005 0.00006 0.00025 0.00025
18 0.00007 0.00010 0.00050 0.00050
19 0.00012 0.00015 0.00100 0.00099
20 0.00017 0.00026 0.00197 0.00200
21 0.00033 0.00061 0.00396 0.00410
22 0.00059 0.00132 0.00791 0.00815
23 0.00121 0.00366 0.01579 0.01631
24 0.00334 0.00609 0.03154 0.03221
25 0.00635 0.00891 0.06291 0.08589
26 0.00987 0.01903 0.14184 0.17122
27 0.02124 0.03823 0.28522 0.34151
28 0.04245 0.07628 0.56901 0.68857
apply_x
nqubits single (64 threads) double (64 threads) single (1 thread) double (1 thread)
10 0.00005 0.00003 0.00000 0.00000
11 0.00003 0.00003 0.00001 0.00001
12 0.00005 0.00003 0.00001 0.00001
13 0.00003 0.00003 0.00001 0.00001
14 0.00003 0.00003 0.00002 0.00002
15 0.00003 0.00003 0.00003 0.00003
16 0.00003 0.00004 0.00006 0.00005
17 0.00004 0.00005 0.00011 0.00010
18 0.00006 0.00008 0.00022 0.00020
19 0.00008 0.00013 0.00043 0.00063
20 0.00014 0.00025 0.00085 0.00086
21 0.00024 0.00055 0.00173 0.00201
22 0.00061 0.00137 0.00360 0.00400
23 0.00123 0.00360 0.00709 0.00771
24 0.00339 0.00645 0.01476 0.01651
25 0.00613 0.00811 0.02816 0.05608
26 0.00888 0.01910 0.09013 0.10297
27 0.01873 0.03655 0.17821 0.20354
28 0.03727 0.07210 0.35670 0.41257
apply_two_qubit_gate
nqubits single (64 threads) double (64 threads) single (1 thread) double (1 thread)
10 0.00003 0.00003 0.00001 0.00001
11 0.00003 0.00003 0.00001 0.00001
12 0.00003 0.00003 0.00002 0.00002
13 0.00003 0.00003 0.00004 0.00004
14 0.00003 0.00004 0.00007 0.00008
15 0.00004 0.00004 0.00014 0.00015
16 0.00005 0.00005 0.00028 0.00029
17 0.00006 0.00008 0.00055 0.00055
18 0.00010 0.00014 0.00111 0.00113
19 0.00020 0.00022 0.00220 0.00226
20 0.00036 0.00036 0.00438 0.00454
21 0.00067 0.00077 0.00878 0.00909
22 0.00110 0.00159 0.01757 0.01817
23 0.00201 0.00344 0.03469 0.03619
24 0.00465 0.00766 0.07004 0.07252
25 0.00866 0.00967 0.13983 0.17484
26 0.01276 0.02150 0.33970 0.48276
27 0.02858 0.04276 0.89865 0.97283
28 0.05609 0.09051 1.79465 1.93739
apply_multi_qubit_gate - ntargets=3
nqubits single (64 threads) double (64 threads) single (1 thread) double (1 thread)
10 0.00003 0.00003 0.00002 0.00002
11 0.00004 0.00003 0.00003 0.00003
12 0.00003 0.00004 0.00006 0.00005
13 0.00004 0.00004 0.00010 0.00010
14 0.00007 0.00005 0.00019 0.00019
15 0.00007 0.00007 0.00037 0.00038
16 0.00011 0.00010 0.00074 0.00075
17 0.00010 0.00016 0.00148 0.00150
18 0.00030 0.00026 0.00294 0.00304
19 0.00051 0.00051 0.00590 0.00604
20 0.00092 0.00082 0.01178 0.01216
21 0.00148 0.00174 0.02375 0.02441
22 0.00274 0.00306 0.04756 0.04895
23 0.00556 0.00645 0.09580 0.09791
24 0.01099 0.01273 0.19194 0.19609
25 0.02151 0.02069 0.38340 0.40367
26 0.04200 0.03610 0.78617 0.91813
27 0.07741 0.07561 1.76592 1.98405
28 0.13882 0.15438 3.84446 3.95124
apply_multi_qubit_gate - ntargets=4
nqubits single (64 threads) double (64 threads) single (1 thread) double (1 thread)
10 0.00004 0.00004 0.00003 0.00003
11 0.00004 0.00003 0.00005 0.00005
12 0.00004 0.00004 0.00009 0.00009
13 0.00004 0.00005 0.00018 0.00018
14 0.00005 0.00006 0.00035 0.00034
15 0.00010 0.00010 0.00071 0.00067
16 0.00014 0.00014 0.00135 0.00133
17 0.00018 0.00025 0.00269 0.00270
18 0.00042 0.00049 0.00540 0.00608
19 0.00085 0.00082 0.01268 0.01607
20 0.00145 0.00166 0.03330 0.03659
21 0.00225 0.00424 0.07047 0.07545
22 0.00456 0.00619 0.14903 0.15434
23 0.01973 0.01497 0.30761 0.31311
24 0.01979 0.03174 0.61445 0.62966
25 0.05212 0.06423 1.24524 1.29693
26 0.10581 0.11538 2.55039 2.41001
27 0.17856 0.16543 4.73377 4.29563
28 0.29317 0.23916 8.79219 8.72099
apply_multi_qubit_gate - ntargets=5
nqubits single (64 threads) double (64 threads) single (1 thread) double (1 thread)
10 0.00004 0.00004 0.00004 0.00004
11 0.00005 0.00004 0.00008 0.00008
12 0.00005 0.00005 0.00015 0.00015
13 0.00006 0.00006 0.00031 0.00030
14 0.00008 0.00010 0.00060 0.00060
15 0.00013 0.00014 0.00125 0.00118
16 0.00023 0.00025 0.00232 0.00237
17 0.00040 0.00040 0.00464 0.00471
18 0.00043 0.00078 0.00923 0.01119
19 0.00125 0.00112 0.04019 0.02804
20 0.00203 0.00219 0.05659 0.06124
21 0.00375 0.00773 0.12062 0.22236
22 0.01474 0.01367 0.41249 0.45251
23 0.02889 0.02938 0.85467 0.93634
24 0.05221 0.05733 1.72745 1.88799
25 0.11024 0.10648 3.42486 3.81983
26 0.18496 0.20348 7.00453 7.75527
27 0.37176 0.35060 14.17189 14.21730
28 0.67633 0.83049 26.16925 36.41165
apply_multi_qubit_gate - ntargets=6
nqubits single (64 threads) double (64 threads) single (1 thread) double (1 thread)
10 0.00006 0.00004 0.00011 0.00008
11 0.00006 0.00005 0.00021 0.00016
12 0.00007 0.00006 0.00041 0.00031
13 0.00008 0.00013 0.00080 0.00062
14 0.00018 0.00016 0.00160 0.00124
15 0.00024 0.00023 0.00320 0.00250
16 0.00052 0.00045 0.00641 0.00506
17 0.00100 0.00052 0.01308 0.00998
18 0.00120 0.00137 0.02585 0.02033
19 0.00271 0.00197 0.05230 0.04444
20 0.00402 0.00431 0.10630 0.08368
21 0.00917 0.01023 0.21530 0.30033
22 0.02224 0.02078 0.63213 0.62978
23 0.04513 0.04106 1.31287 1.31433
24 0.08741 0.08156 2.65060 2.64123
25 0.16587 0.16127 5.30112 5.19375
26 0.31596 0.31030 10.74703 10.39643
27 0.63359 0.62503 21.16741 20.80903
28 1.24665 1.26802 42.51790 41.84317
apply_multi_qubit_gate - ntargets=7
nqubits single (64 threads) double (64 threads) single (1 thread) double (1 thread)
10 0.00009 0.00008 0.00020 0.00015
11 0.00008 0.00009 0.00040 0.00030
12 0.00014 0.00009 0.00079 0.00060
13 0.00018 0.00010 0.00156 0.00121
14 0.00028 0.00024 0.00344 0.00233
15 0.00048 0.00025 0.00630 0.00467
16 0.00092 0.00071 0.01247 0.00938
17 0.00144 0.00102 0.02492 0.01889
18 0.00219 0.00190 0.05043 0.03780
19 0.00409 0.00288 0.10056 0.07645
20 0.00864 0.00657 0.20283 0.15400
21 0.01519 0.01503 0.40817 0.46041
22 0.03118 0.02859 1.02093 0.93423
23 0.06537 0.05591 2.10929 1.95459
24 0.12585 0.11040 4.22156 3.92964
25 0.23822 0.21403 8.40546 7.69408
26 0.45770 0.41094 16.92144 15.54976
27 0.94539 0.83017 33.80684 31.00753
28 1.82106 1.65167 67.47716 62.36736
apply_multi_qubit_gate - ntargets=8
nqubits single (64 threads) double (64 threads) single (1 thread) double (1 thread)
10 0.00013 0.00020 0.00038 0.00028
11 0.00014 0.00021 0.00075 0.00056
12 0.00025 0.00020 0.00148 0.00111
13 0.00025 0.00020 0.00296 0.00221
14 0.00048 0.00038 0.00591 0.00439
15 0.00090 0.00070 0.01181 0.00879
16 0.00134 0.00074 0.02359 0.01753
17 0.00235 0.00198 0.04727 0.03541
18 0.00423 0.00317 0.09454 0.07123
19 0.00735 0.00588 0.19099 0.14196
20 0.01550 0.01102 0.37985 0.28841
21 0.02907 0.02451 0.76924 0.77398
22 0.05943 0.04698 1.82490 1.57743
23 0.11351 0.09371 3.69565 3.22068
24 0.21270 0.17893 7.31751 6.34632
25 0.40647 0.34237 14.74788 12.70273
26 0.79980 0.67554 29.88017 25.75951
27 1.57292 1.33787 59.09961 51.23731
28 3.10821 2.62125 118.59848 102.75361

Apparently the issue appears only for the multi-qubit op and exists for single thread too. For other ops the difference between single and double is as expected.

from qibojit.

mlazzarin avatar mlazzarin commented on August 12, 2024

Good to know, at least the problem is restricted to a specific op. Two questions:

  • Which function call did you benchmark? e.g. exactly apply_multi_qubit_gate or the whole multi_qubit_base?
  • Did you try to set parallel=False instead of setting the number of threads?

from qibojit.

stavros11 avatar stavros11 commented on August 12, 2024
  • Which function call did you benchmark? e.g. exactly apply_multi_qubit_gate or the whole multi_qubit_base?

I called apply_multi_qubit_gate directly and the same for the other ops. Here is the full script I used:

Benchmark script
import os
import argparse
import json
import time
import numpy as np


parser = argparse.ArgumentParser()
parser.add_argument("--nqubits", default=20, type=int)
parser.add_argument("--nreps", default=1, type=int)
parser.add_argument("--method", default="apply_gate", type=str)
parser.add_argument("--ntargets", default=1, type=int)
parser.add_argument("--precision", default="double", type=str)
parser.add_argument("--nthreads", default=None, type=int)
parser.add_argument("--filename", default=None, type=str)


def main(nqubits, nreps, method, ntargets, precision, nthreads, filename):
    if filename is not None:
        if os.path.isfile(filename):
            with open(filename, "r") as file:
                logs = json.load(file)
            print("Extending existing logs from {}.".format(filename))
        else:
            print("Creating new logs in {}.".format(filename))
            logs = []
    else:
        logs = []

    logs.append({
        "nqubits": nqubits, "nreps": nreps,
        "method": method, "ntargets": ntargets,
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")
        })

    # Create backend object
    start_time = time.time()
    from qibo import K
    logs[-1]["import_time"] = time.time() - start_time

    # Set backend precision
    K.set_precision(precision)
    logs[-1]["precision"] = K.precision
    logs[-1]["backend"] = K.name
    # Set number of threads
    if nthreads is not None:
        K.set_threads(nthreads)
    logs[-1]["nthreads"] = K.nthreads

    # Calculate ``qubits`` tensor
    targets = list(range(ntargets))
    qubits = [nqubits - q - 1 for q in targets]
    #qubits.extend(nqubits - q - 1 for q in controls)
    qubits = K.cast(sorted(qubits), dtype="int32")

    # Generate ``gate`` matrix if needed
    if method in ("apply_gate", "apply_two_qubit_gate", "apply_multi_qubit_gate"):
        shape = 2 * (2 ** len(targets),)
        gate = K.cast(np.random.random(shape) + 1j * np.random.random(shape))
        executor = lambda x: getattr(K, method)(x, gate, nqubits, targets, qubits)
    else:
        executor = lambda x: getattr(K, method)(x, nqubits, targets, qubits)

    start_time = time.time()
    state  = K.initial_state(nqubits)
    logs[-1]["initial_state_time"] = time.time() - start_time
    start_time = time.time()
    state = executor(state)
    logs[-1]["dry_run_time"] = time.time() - start_time
    logs[-1]["dtype"] = str(state.dtype)
    del(state)

    logs[-1]["simulation_times"] = []
    for _ in range(nreps):
        state  = K.initial_state(nqubits)
        start_time = time.time()
        state = executor(state)
        logs[-1]["simulation_times"].append(time.time() - start_time)
        del(state)
    logs[-1]["simulation_time_mean"] = np.mean(logs[-1]["simulation_times"])
    logs[-1]["simulation_time_std"] = np.std(logs[-1]["simulation_times"])

    for k, v in logs[-1].items():
        print("{}: {}".format(k, v))
    print()

    if filename is not None:
        with open(filename, "w") as file:
            json.dump(logs, file)


if __name__ == "__main__":
    args = vars(parser.parse_args())
    main(**args)
  • Did you try to set parallel=False instead of setting the number of threads?

No, I haven't tried that. As shown in the above script I use

from qibo import K
K.set_threads(nthreads)

in order to set the number of threads. I also haven't tried running this on GPU but in principle the same script would work.

EDIT: Here are some results with parallel=False and single thread

apply_multi_qubit_gate - ntargets=3
nqubits single double
10 0.00002 0.00002
11 0.00003 0.00003
12 0.00005 0.00005
13 0.00010 0.00010
14 0.00019 0.00019
15 0.00037 0.00038
16 0.00074 0.00075
17 0.00148 0.00149
18 0.00296 0.00296
19 0.00591 0.01222
20 0.01182 0.01196
21 0.02399 0.02434
22 0.04796 0.04896
23 0.09642 0.09773
24 0.19295 0.19594
25 0.38674 0.40418
26 0.78994 0.89543
27 1.77743 1.97553
apply_multi_qubit_gate - ntargets=4
nqubits single double
10 0.00003 0.00003
11 0.00005 0.00005
12 0.00009 0.00009
13 0.00017 0.00018
14 0.00035 0.00035
15 0.00073 0.00073
16 0.00142 0.00140
17 0.00278 0.00279
18 0.00567 0.00566
19 0.01255 0.01666
20 0.03295 0.03674
21 0.07555 0.07934
22 0.14994 0.15966
23 0.30840 0.32393
24 0.62203 0.65096
25 1.25948 1.34278
26 2.57411 2.47533
27 4.86673 4.34140
apply_multi_qubit_gate - ntargets=5
nqubits single double
10 0.00004 0.00004
11 0.00008 0.00008
12 0.00015 0.00015
13 0.00029 0.00030
14 0.00059 0.00060
15 0.00118 0.00115
16 0.00228 0.00235
17 0.00458 0.00464
18 0.00912 0.01179
19 0.02251 0.02766
20 0.05465 0.06129
21 0.12234 0.22476
22 0.39308 0.43870
23 0.80829 0.89043
24 1.64349 1.81944
25 3.27383 3.67461
26 6.82895 7.51540
apply_multi_qubit_gate - ntargets=6
nqubits single double
10 0.00006 0.00007
11 0.00011 0.00013
12 0.00022 0.00025
13 0.00044 0.00050
14 0.00088 0.00099
15 0.00177 0.00200
16 0.00355 0.00427
17 0.00723 0.00826
18 0.01462 0.01709
19 0.02899 0.03485
20 0.06035 0.07053
21 0.11944 0.37921
22 0.69294 0.77325
23 1.45051 1.58418
24 2.98322 3.17140
25 6.04239 6.37544
26 12.26439 13.06399
apply_multi_qubit_gate - ntargets=7
nqubits single double
10 0.00007 0.00009
11 0.00014 0.00016
12 0.00027 0.00033
13 0.00053 0.00064
14 0.00107 0.00126
15 0.00211 0.00252
16 0.00428 0.00507
17 0.00866 0.01026
18 0.01786 0.02033
19 0.03554 0.04163
20 0.07032 0.08453
21 0.14418 0.43070
22 0.78342 0.89268
23 1.62434 1.82199
24 3.36772 3.65349
25 6.83152 7.30669
26 14.06274 15.98979
apply_multi_qubit_gate - ntargets=8
nqubits single double
10 0.00009 0.00012
11 0.00018 0.00024
12 0.00034 0.00048
13 0.00067 0.00096
14 0.00135 0.00197
15 0.00275 0.00372
16 0.00552 0.00764
17 0.01094 0.01451
18 0.02138 0.03014
19 0.04558 0.06331
20 0.09554 0.13397
21 0.18050 0.54658
22 0.91581 1.08427
23 1.82818 2.20956
24 3.75763 4.48631
25 7.77429 9.19321
26 15.59674 18.21344

from qibojit.

mlazzarin avatar mlazzarin commented on August 12, 2024

I checked the parallelization logs, and I don't see anything wrong in them. (even if I don't understand all details)

================================================================================
 Parallel Accelerator Optimizing:  Function apply_multi_qubit_gate_kernel, 
.../qibojit/src/qibojit/custom_operators/gates.py (381)  
================================================================================


Parallel loop listing for  Function apply_multi_qubit_gate_kernel, .../qibojit/src/qibojit/custom_operators/gates.py (381) 
-----------------------------------------------------------------------------------------|loop #ID
@njit(["complex64[:](complex64[:], complex64[:,::1], int32[:], int64, int64[:])",        | 
       "complex128[:](complex128[:], complex128[:,::1], int32[:], int64, int64[:])"],    | 
       parallel=True, nogil=True,                                                        | 
       cache=True)                                                                       | 
def apply_multi_qubit_gate_kernel(state, gate, qubits, nstates, targets):                | 
    nsubstates = 1 << len(targets)                                                       | 
    for g in prange(nstates):  # pylint: disable=not-an-iterable-------------------------| #41
        ig = multicontrol_index(g, qubits)                                               | 
        buffer = np.empty(nsubstates, dtype=state.dtype)                                 | 
        for i in range(nsubstates):                                                      | 
            t = ig - multitarget_index(nsubstates - i - 1, targets)                      | 
            buffer[i] = state[t]                                                         | 
        for i in range(nsubstates):                                                      | 
            t = ig - multitarget_index(nsubstates - i - 1, targets)                      | 
            state[t] = np.dot(gate[i], buffer)-------------------------------------------| #40
    return state                                                                         | 
--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
----------------------------- Before Optimisation ------------------------------
Parallel region 0:
+--41 (parallel)
   +--40 (parallel)


--------------------------------------------------------------------------------
------------------------------ After Optimisation ------------------------------
Parallel region 0:
+--41 (parallel)
   +--40 (serial)


 
Parallel region 0 (loop #41) had 0 loop(s) fused and 1 loop(s) serialized as 
part of the larger parallel loop (#41).
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
 
---------------------------Loop invariant code motion---------------------------
Allocation hoisting:
The memory allocation derived from the instruction at 
.../qibojit/src/qibojit/custom_operators/gates.py (389) is 
hoisted out of the parallel loop labelled #41 (it will be performed before the 
loop is executed and reused inside the loop):
   Allocation:: buffer = np.empty(nsubstates, dtype=state.dtype)
    - numpy.empty() is used for the allocation.

Instruction hoisting:
loop #41:
  Has the following hoisted:
    $100load_global.1 = global(range: <class 'range'>)
    $114load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fd5b0ae0040>))
    $const122.8 = const(int, 1)
    $const18.7.2435 = const(int, 0)
    $np_typ_var.2470 = const(str, complex128)
    $26load_global.2 = global(multicontrol_index: CPUDispatcher(<function multicontrol_index at 0x7fd5dc117040>))
    $36load_global.6 = global(np: <module 'numpy' from '.../numpy/__init__.py'>)
    $38load_attr.7 = getattr(value=$36load_global.6, attr=empty)
    $52load_global.13 = global(range: <class 'range'>)
    $66load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fd5b0ae0040>))
    $const74.8 = const(int, 1)
  Failed to hoist the following:
    not pure: $104call_function.3 = call $push_global_to_block.2576(nsubstates, func=$push_global_to_block.2576, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: $106get_iter.4 = getiter(value=$104call_function.3)
    dependency: $108for_iter.2 = iternext(value=$106get_iter.4)
    dependency: $i.1.2494 = pair_first(value=$108for_iter.2)
    dependency: $108for_iter.4 = pair_second(value=$108for_iter.2)
    dependency: $120binary_subtract.7 = nsubstates - $i.1.2494
    dependency: $124binary_subtract.9 = $120binary_subtract.7 - $const122.8
    dependency: $128call_function.11 = call $push_global_to_block.2577($124binary_subtract.9, targets, func=$push_global_to_block.2577, args=[Var($124binary_subtract.9, gates.py:394), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: t.1 = ig - $128call_function.11
    dependency: $a.2426.2495 = getitem(value=gate, index=$i.1.2494, fn=<built-in function getitem>)
    dependency: $16load_attr.6.2434 = getattr(value=$a.2426.2495, attr=shape)
    dependency: l.2437 = getitem(value=$16load_attr.6.2434, index=$const18.7.2435, fn=<built-in function getitem>)
    dependency: $dtype_attr_var.2471 = getattr(value=$np_g_var.2469, attr=dtype)
    dependency: $dtype_var.2472 = call $dtype_attr_var.2471($np_typ_var.2470, func=$dtype_attr_var.2471, args=[Var($np_typ_var.2470, gates.py:389)], kws=(), vararg=None, target=None)
    dependency: g = $parfor__index_2489.2560
    dependency: ig = call $push_global_to_block.2572($parfor__index_2489.2560, qubits, func=$push_global_to_block.2572, args=[Var($parfor__index_2489.2560, <string>:2), Var(qubits, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: buffer = call $push_getattr_to_block.2573(nsubstates, func=$push_getattr_to_block.2573, args=[Var(nsubstates, gates.py:386)], kws=[('dtype', Var($dtype_var.2472, gates.py:389))], vararg=None, target=None)
    not pure: $56call_function.15 = call $push_global_to_block.2575(nsubstates, func=$push_global_to_block.2575, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: $58get_iter.16 = getiter(value=$56call_function.15)
    dependency: $60for_iter.2 = iternext(value=$58get_iter.16)
    dependency: $i.2497 = pair_first(value=$60for_iter.2)
    dependency: $60for_iter.4 = pair_second(value=$60for_iter.2)
    dependency: $72binary_subtract.7 = nsubstates - $i.2497
    dependency: $76binary_subtract.9 = $72binary_subtract.7 - $const74.8
    dependency: $80call_function.11 = call $push_global_to_block.2578($76binary_subtract.9, targets, func=$push_global_to_block.2578, args=[Var($76binary_subtract.9, gates.py:391), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: t = ig - $80call_function.11
    dependency: $90binary_subscr.15 = getitem(value=state, index=t, fn=<built-in function getitem>)

Besides parallelization, I tried to inspect the intermediate representation of such kernel and I believe types are inferred properly:

>>> gates.apply_multi_qubit_gate_kernel.inspect_types()
apply_multi_qubit_gate_kernel (array(complex64, 1d, A), array(complex64, 2d, C), array(int32, 1d, A), int64, array(int64, 1d, A))
--------------------------------------------------------------------------------
# File: .../qibojit/src/qibojit/custom_operators/gates.py
# --- LINE 381 --- 

@njit(["complex64[:](complex64[:], complex64[:,::1], int32[:], int64, int64[:])",

       # --- LINE 382 --- 

       "complex128[:](complex128[:], complex128[:,::1], int32[:], int64, int64[:])"],

       # --- LINE 383 --- 

       parallel=True,

       # --- LINE 384 --- 

       cache=True)

# --- LINE 385 --- 

def apply_multi_qubit_gate_kernel(state, gate, qubits, nstates, targets):

    # --- LINE 386 --- 
    # label 0
    #   state = arg(0, name=state)  :: array(complex64, 1d, A)
    #   gate = arg(1, name=gate)  :: array(complex64, 2d, C)
    #   qubits = arg(2, name=qubits)  :: array(int32, 1d, A)
    #   nstates = arg(3, name=nstates)  :: int64
    #   targets = arg(4, name=targets)  :: array(int64, 1d, A)
    #   $const2.0 = const(int, 1)  :: Literal[int](1)
    #   $4load_global.1 = global(len: <built-in function len>)  :: Function(<built-in function len>)
    #   $8call_function.3 = targets_size0.647  :: int64
    #   del $4load_global.1
    #   nsubstates = $const2.0 << targets_size0.647  :: int64
    #   del $const2.0
    #   del $8call_function.3

    nsubstates = 1 << len(targets)

    # --- LINE 387 --- 
    #   $14load_global.5 = global(prange: <class 'numba.misc.special.prange'>)  :: Function(<class 'numba.misc.special.prange'>)
    #   $18call_function.7 = call $14load_global.5(nstates, func=$14load_global.5, args=[Var(nstates, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64,) -> range_state_int64
    #   del nstates
    #   del $14load_global.5
    #   $20get_iter.8 = getiter(value=$18call_function.7)  :: range_iter_int64
    #   del $18call_function.7
    #   $phi22.0 = $20get_iter.8  :: range_iter_int64
    #   del $20get_iter.8
    #   jump 22
    # label 22
    #   $22for_iter.1 = iternext(value=$20get_iter.8)  :: pair<int64, bool>
    #   $22for_iter.2 = pair_first(value=$22for_iter.1)  :: int64
    #   $22for_iter.3 = pair_second(value=$22for_iter.1)  :: bool
    #   del $22for_iter.1
    #   parfor_index.655 = parfor_index.655  :: uint64
    #   del $22for_iter.2
    #   branch $22for_iter.3, 24, 1597
    # label 24
    #   del $22for_iter.3
    #   g = $parfor__index_655.726  :: int64
    #   del $phi24.1

    for g in prange(nstates):  # pylint: disable=not-an-iterable

        # --- LINE 388 --- 
        #   $26load_global.2 = global(multicontrol_index: CPUDispatcher(<function multicontrol_index at 0x7fe75d6714c0>))  :: type(CPUDispatcher(<function multicontrol_index at 0x7fe75d6714c0>))
        #   ig = call $push_global_to_block.738($parfor__index_655.726, qubits, func=$push_global_to_block.738, args=[Var($parfor__index_655.726, <string>:2), Var(qubits, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64, array(int32, 1d, A)) -> int64
        #   del g
        #   del $26load_global.2

        ig = multicontrol_index(g, qubits)

        # --- LINE 389 --- 
        #   $36load_global.6 = global(np: <module 'numpy' from '.../numpy/__init__.py'>)  :: Module(<module 'numpy' from '.../numpy/__init__.py'>)
        #   $38load_attr.7 = getattr(value=$36load_global.6, attr=empty)  :: Function(<built-in function empty>)
        #   del $36load_global.6
        #   $44load_attr.10 = $dtype_var.638  :: dtype(complex64)
        #   buffer = call $push_getattr_to_block.739(nsubstates, func=$push_getattr_to_block.739, args=[Var(nsubstates, gates.py:386)], kws=[('dtype', Var($dtype_var.638, gates.py:389))], vararg=None, target=None)  :: (int64, dtype(complex64)) -> array(complex64, 1d, C)
        #   del $44load_attr.10
        #   del $38load_attr.7

        buffer = np.empty(nsubstates, dtype=state.dtype)

        # --- LINE 390 --- 
        #   $52load_global.13 = global(range: <class 'range'>)  :: Function(<class 'range'>)
        #   $56call_function.15 = call $push_global_to_block.741(nsubstates, func=$push_global_to_block.741, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64,) -> range_state_int64
        #   del $52load_global.13
        #   $58get_iter.16 = getiter(value=$56call_function.15)  :: range_iter_int64
        #   del $56call_function.15
        #   $phi60.1 = $58get_iter.16  :: range_iter_int64
        #   del $58get_iter.16
        #   jump 60
        # label 60
        #   $60for_iter.2 = iternext(value=$58get_iter.16)  :: pair<int64, bool>
        #   $i.663 = pair_first(value=$60for_iter.2)  :: int64
        #   $60for_iter.4 = pair_second(value=$60for_iter.2)  :: bool
        #   del $60for_iter.2
        #   $phi62.2 = $i.663  :: int64
        #   del $60for_iter.3
        #   branch $60for_iter.4, 62, 100
        # label 62
        #   del $60for_iter.4
        #   i = $i.663  :: int64
        #   del $phi62.2

        for i in range(nsubstates):

            # --- LINE 391 --- 
            #   $66load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))  :: type(CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))
            #   $72binary_subtract.7 = nsubstates - $i.663  :: int64
            #   $const74.8 = const(int, 1)  :: Literal[int](1)
            #   $76binary_subtract.9 = $72binary_subtract.7 - $const74.8  :: int64
            #   del $const74.8
            #   del $72binary_subtract.7
            #   $80call_function.11 = call $push_global_to_block.744($76binary_subtract.9, targets, func=$push_global_to_block.744, args=[Var($76binary_subtract.9, gates.py:391), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64, array(int64, 1d, A)) -> int64
            #   del $76binary_subtract.9
            #   del $66load_global.4
            #   t = ig - $80call_function.11  :: int64
            #   del $80call_function.11

            t = ig - multitarget_index(nsubstates - i - 1, targets)

            # --- LINE 392 --- 
            #   $90binary_subscr.15 = getitem(value=state, index=t, fn=<built-in function getitem>)  :: complex64
            #   del t
            #   buffer[$i.663] = $90binary_subscr.15  :: (array(complex64, 1d, C), int64, complex64) -> none
            #   del i
            #   del $90binary_subscr.15
            #   jump 60

            buffer[i] = state[t]

        # --- LINE 393 --- 
        # label 100
        #   del $phi62.2
        #   del $phi60.1
        #   del $60for_iter.4
        #   $100load_global.1 = global(range: <class 'range'>)  :: Function(<class 'range'>)
        #   $104call_function.3 = call $push_global_to_block.742(nsubstates, func=$push_global_to_block.742, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64,) -> range_state_int64
        #   del $100load_global.1
        #   $106get_iter.4 = getiter(value=$104call_function.3)  :: range_iter_int64
        #   del $104call_function.3
        #   $phi108.1 = $106get_iter.4  :: range_iter_int64
        #   del $106get_iter.4
        #   jump 108
        # label 108
        #   $108for_iter.2 = iternext(value=$106get_iter.4)  :: pair<int64, bool>
        #   $i.1.660 = pair_first(value=$108for_iter.2)  :: int64
        #   $108for_iter.4 = pair_second(value=$108for_iter.2)  :: bool
        #   del $108for_iter.2
        #   $phi110.2 = $i.1.660  :: int64
        #   del $108for_iter.3
        #   branch $108for_iter.4, 110, 1573
        # label 110
        #   del $108for_iter.4
        #   i.1 = $i.1.660  :: int64
        #   del $phi110.2

        for i in range(nsubstates):

            # --- LINE 394 --- 
            #   $114load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))  :: type(CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))
            #   $120binary_subtract.7 = nsubstates - $i.1.660  :: int64
            #   $const122.8 = const(int, 1)  :: Literal[int](1)
            #   $124binary_subtract.9 = $120binary_subtract.7 - $const122.8  :: int64
            #   del $const122.8
            #   del $120binary_subtract.7
            #   $128call_function.11 = call $push_global_to_block.743($124binary_subtract.9, targets, func=$push_global_to_block.743, args=[Var($124binary_subtract.9, gates.py:394), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64, array(int64, 1d, A)) -> int64
            #   del $124binary_subtract.9
            #   del $114load_global.4
            #   t.1 = ig - $128call_function.11  :: int64
            #   del $128call_function.11

            t = ig - multitarget_index(nsubstates - i - 1, targets)

            # --- LINE 395 --- 
            #   $134load_global.13 = global(np: <module 'numpy' from '.../numpy/__init__.py'>)  :: Module(<module 'numpy' from '.../numpy/__init__.py'>)
            #   $136load_method.14 = getattr(value=$134load_global.13, attr=dot)  :: Function(<function dot at 0x7fe76a92c550>)
            #   del $134load_global.13
            #   $a.592.661 = getitem(value=gate, index=$i.1.660, fn=<built-in function getitem>)  :: array(complex64, 1d, C)
            #   del i.1
            #   $146call_method.19 = call $136load_method.14($142binary_subscr.17, buffer, func=$136load_method.14, args=[Var($142binary_subscr.17, gates.py:395), Var(buffer, gates.py:389)], kws=(), vararg=None, target=None)  :: (array(complex64, 1d, C), array(complex64, 1d, C)) -> complex64
            #   del $142binary_subscr.17
            #   del $136load_method.14
            #   state[t.1] = s.610  :: (array(complex64, 1d, A), int64, complex64) -> none
            #   del t.1
            #   del $146call_method.19
            #   jump 108
            # label 156
            #   del ig
            #   del buffer
            #   del $phi110.2
            #   del $phi108.1
            #   del $108for_iter.4
            #   jump 1600

            state[t] = np.dot(gate[i], buffer)

    # --- LINE 396 --- 
    # label 158
    #   del targets
    #   del qubits
    #   del nsubstates
    #   del gate
    #   del $phi24.1
    #   del $phi22.0
    #   del $22for_iter.3
    #   $160return_value.1 = cast(value=state)  :: array(complex64, 1d, A)
    #   del state
    #   return $160return_value.1

    return state


================================================================================
apply_multi_qubit_gate_kernel (array(complex128, 1d, A), array(complex128, 2d, C), array(int32, 1d, A), int64, array(int64, 1d, A))
--------------------------------------------------------------------------------
# File: .../qibojit/src/qibojit/custom_operators/gates.py
# --- LINE 381 --- 

@njit(["complex64[:](complex64[:], complex64[:,::1], int32[:], int64, int64[:])",

       # --- LINE 382 --- 

       "complex128[:](complex128[:], complex128[:,::1], int32[:], int64, int64[:])"],

       # --- LINE 383 --- 

       parallel=True,

       # --- LINE 384 --- 

       cache=True)

# --- LINE 385 --- 

def apply_multi_qubit_gate_kernel(state, gate, qubits, nstates, targets):

    # --- LINE 386 --- 
    # label 0
    #   state = arg(0, name=state)  :: array(complex128, 1d, A)
    #   gate = arg(1, name=gate)  :: array(complex128, 2d, C)
    #   qubits = arg(2, name=qubits)  :: array(int32, 1d, A)
    #   nstates = arg(3, name=nstates)  :: int64
    #   targets = arg(4, name=targets)  :: array(int64, 1d, A)
    #   $const2.0 = const(int, 1)  :: Literal[int](1)
    #   $4load_global.1 = global(len: <built-in function len>)  :: Function(<built-in function len>)
    #   $8call_function.3 = targets_size0.815  :: int64
    #   del $4load_global.1
    #   nsubstates = $const2.0 << targets_size0.815  :: int64
    #   del $const2.0
    #   del $8call_function.3

    nsubstates = 1 << len(targets)

    # --- LINE 387 --- 
    #   $14load_global.5 = global(prange: <class 'numba.misc.special.prange'>)  :: Function(<class 'numba.misc.special.prange'>)
    #   $18call_function.7 = call $14load_global.5(nstates, func=$14load_global.5, args=[Var(nstates, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64,) -> range_state_int64
    #   del nstates
    #   del $14load_global.5
    #   $20get_iter.8 = getiter(value=$18call_function.7)  :: range_iter_int64
    #   del $18call_function.7
    #   $phi22.0 = $20get_iter.8  :: range_iter_int64
    #   del $20get_iter.8
    #   jump 22
    # label 22
    #   $22for_iter.1 = iternext(value=$20get_iter.8)  :: pair<int64, bool>
    #   $22for_iter.2 = pair_first(value=$22for_iter.1)  :: int64
    #   $22for_iter.3 = pair_second(value=$22for_iter.1)  :: bool
    #   del $22for_iter.1
    #   parfor_index.823 = parfor_index.823  :: uint64
    #   del $22for_iter.2
    #   branch $22for_iter.3, 24, 1699
    # label 24
    #   del $22for_iter.3
    #   g = $parfor__index_823.894  :: int64
    #   del $phi24.1

    for g in prange(nstates):  # pylint: disable=not-an-iterable

        # --- LINE 388 --- 
        #   $26load_global.2 = global(multicontrol_index: CPUDispatcher(<function multicontrol_index at 0x7fe75d6714c0>))  :: type(CPUDispatcher(<function multicontrol_index at 0x7fe75d6714c0>))
        #   ig = call $push_global_to_block.906($parfor__index_823.894, qubits, func=$push_global_to_block.906, args=[Var($parfor__index_823.894, <string>:2), Var(qubits, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64, array(int32, 1d, A)) -> int64
        #   del g
        #   del $26load_global.2

        ig = multicontrol_index(g, qubits)

        # --- LINE 389 --- 
        #   $36load_global.6 = global(np: <module 'numpy' from '.../numpy/__init__.py'>)  :: Module(<module 'numpy' from '.../numpy/__init__.py'>)
        #   $38load_attr.7 = getattr(value=$36load_global.6, attr=empty)  :: Function(<built-in function empty>)
        #   del $36load_global.6
        #   $44load_attr.10 = $dtype_var.806  :: dtype(complex128)
        #   buffer = call $push_getattr_to_block.907(nsubstates, func=$push_getattr_to_block.907, args=[Var(nsubstates, gates.py:386)], kws=[('dtype', Var($dtype_var.806, gates.py:389))], vararg=None, target=None)  :: (int64, dtype(complex128)) -> array(complex128, 1d, C)
        #   del $44load_attr.10
        #   del $38load_attr.7

        buffer = np.empty(nsubstates, dtype=state.dtype)

        # --- LINE 390 --- 
        #   $52load_global.13 = global(range: <class 'range'>)  :: Function(<class 'range'>)
        #   $56call_function.15 = call $push_global_to_block.909(nsubstates, func=$push_global_to_block.909, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64,) -> range_state_int64
        #   del $52load_global.13
        #   $58get_iter.16 = getiter(value=$56call_function.15)  :: range_iter_int64
        #   del $56call_function.15
        #   $phi60.1 = $58get_iter.16  :: range_iter_int64
        #   del $58get_iter.16
        #   jump 60
        # label 60
        #   $60for_iter.2 = iternext(value=$58get_iter.16)  :: pair<int64, bool>
        #   $i.831 = pair_first(value=$60for_iter.2)  :: int64
        #   $60for_iter.4 = pair_second(value=$60for_iter.2)  :: bool
        #   del $60for_iter.2
        #   $phi62.2 = $i.831  :: int64
        #   del $60for_iter.3
        #   branch $60for_iter.4, 62, 100
        # label 62
        #   del $60for_iter.4
        #   i = $i.831  :: int64
        #   del $phi62.2

        for i in range(nsubstates):

            # --- LINE 391 --- 
            #   $66load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))  :: type(CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))
            #   $72binary_subtract.7 = nsubstates - $i.831  :: int64
            #   $const74.8 = const(int, 1)  :: Literal[int](1)
            #   $76binary_subtract.9 = $72binary_subtract.7 - $const74.8  :: int64
            #   del $const74.8
            #   del $72binary_subtract.7
            #   $80call_function.11 = call $push_global_to_block.912($76binary_subtract.9, targets, func=$push_global_to_block.912, args=[Var($76binary_subtract.9, gates.py:391), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64, array(int64, 1d, A)) -> int64
            #   del $76binary_subtract.9
            #   del $66load_global.4
            #   t = ig - $80call_function.11  :: int64
            #   del $80call_function.11

            t = ig - multitarget_index(nsubstates - i - 1, targets)

            # --- LINE 392 --- 
            #   $90binary_subscr.15 = getitem(value=state, index=t, fn=<built-in function getitem>)  :: complex128
            #   del t
            #   buffer[$i.831] = $90binary_subscr.15  :: (array(complex128, 1d, C), int64, complex128) -> none
            #   del i
            #   del $90binary_subscr.15
            #   jump 60

            buffer[i] = state[t]

        # --- LINE 393 --- 
        # label 100
        #   del $phi62.2
        #   del $phi60.1
        #   del $60for_iter.4
        #   $100load_global.1 = global(range: <class 'range'>)  :: Function(<class 'range'>)
        #   $104call_function.3 = call $push_global_to_block.910(nsubstates, func=$push_global_to_block.910, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64,) -> range_state_int64
        #   del $100load_global.1
        #   $106get_iter.4 = getiter(value=$104call_function.3)  :: range_iter_int64
        #   del $104call_function.3
        #   $phi108.1 = $106get_iter.4  :: range_iter_int64
        #   del $106get_iter.4
        #   jump 108
        # label 108
        #   $108for_iter.2 = iternext(value=$106get_iter.4)  :: pair<int64, bool>
        #   $i.1.829 = pair_first(value=$108for_iter.2)  :: int64
        #   $108for_iter.4 = pair_second(value=$108for_iter.2)  :: bool
        #   del $108for_iter.2
        #   $phi110.2 = $i.1.829  :: int64
        #   del $108for_iter.3
        #   branch $108for_iter.4, 110, 1675
        # label 110
        #   del $108for_iter.4
        #   i.1 = $i.1.829  :: int64
        #   del $phi110.2

        for i in range(nsubstates):

            # --- LINE 394 --- 
            #   $114load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))  :: type(CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))
            #   $120binary_subtract.7 = nsubstates - $i.1.829  :: int64
            #   $const122.8 = const(int, 1)  :: Literal[int](1)
            #   $124binary_subtract.9 = $120binary_subtract.7 - $const122.8  :: int64
            #   del $const122.8
            #   del $120binary_subtract.7
            #   $128call_function.11 = call $push_global_to_block.911($124binary_subtract.9, targets, func=$push_global_to_block.911, args=[Var($124binary_subtract.9, gates.py:394), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)  :: (int64, array(int64, 1d, A)) -> int64
            #   del $124binary_subtract.9
            #   del $114load_global.4
            #   t.1 = ig - $128call_function.11  :: int64
            #   del $128call_function.11

            t = ig - multitarget_index(nsubstates - i - 1, targets)

            # --- LINE 395 --- 
            #   $134load_global.13 = global(np: <module 'numpy' from '.../numpy/__init__.py'>)  :: Module(<module 'numpy' from '.../numpy/__init__.py'>)
            #   $136load_method.14 = getattr(value=$134load_global.13, attr=dot)  :: Function(<function dot at 0x7fe76a92c550>)
            #   del $134load_global.13
            #   $a.760.828 = getitem(value=gate, index=$i.1.829, fn=<built-in function getitem>)  :: array(complex128, 1d, C)
            #   del i.1
            #   $146call_method.19 = call $136load_method.14($142binary_subscr.17, buffer, func=$136load_method.14, args=[Var($142binary_subscr.17, gates.py:395), Var(buffer, gates.py:389)], kws=(), vararg=None, target=None)  :: (array(complex128, 1d, C), array(complex128, 1d, C)) -> complex128
            #   del $142binary_subscr.17
            #   del $136load_method.14
            #   state[t.1] = s.778  :: (array(complex128, 1d, A), int64, complex128) -> none
            #   del t.1
            #   del $146call_method.19
            #   jump 108
            # label 156
            #   del ig
            #   del buffer
            #   del $phi110.2
            #   del $phi108.1
            #   del $108for_iter.4
            #   jump 1702

            state[t] = np.dot(gate[i], buffer)

    # --- LINE 396 --- 
    # label 158
    #   del targets
    #   del qubits
    #   del nsubstates
    #   del gate
    #   del $phi24.1
    #   del $phi22.0
    #   del $22for_iter.3
    #   $160return_value.1 = cast(value=state)  :: array(complex128, 1d, A)
    #   del state
    #   return $160return_value.1

    return state


================================================================================

from qibojit.

scarrazza avatar scarrazza commented on August 12, 2024

Ok, then at this point we have to check the argument types before and after the kernel execution. Btw, what happens if we disable double precision and execute in single precision? Does the code complains or runs?

from qibojit.

stavros11 avatar stavros11 commented on August 12, 2024

Something I forgot to mention above is that for the latest numbers I used the fixcast branch, however I do not think there would be any difference even if I used main.

Ok, then at this point we have to check the argument types before and after the kernel execution.

I always check the state vector type after execution and it has the correct type. The gate matrix should be passed in compatible precision because of the signature. If I pass complex128 matrix with complex64 state it will raise an error. So at least for inputs and outputs the types are correct.

Btw, what happens if we disable double precision and execute in single precision? Does the code complains or runs?

What exactly do you mean by disable? Remove the double precision signature?

from qibojit.

scarrazza avatar scarrazza commented on August 12, 2024

What exactly do you mean by disable? Remove the double precision signature?

yes, to make sure that we are passing single precision numbers.

If this test passes, then the debug should start by pruning the kernel until we find the operation which does not performs well.

from qibojit.

mlazzarin avatar mlazzarin commented on August 12, 2024

If I remove the double precision signature, the code works fine with single precision but crashes with double

TypeError: No matching definition for argument type(s) array(complex128, 1d, C), array(complex128, 2d, C), array(int32, 1d, C), int64, array(int64, 1d, C)

as expected.

from qibojit.

scarrazza avatar scarrazza commented on August 12, 2024

Does the performance matches the previous single precision runs?

from qibojit.

mlazzarin avatar mlazzarin commented on August 12, 2024

I believe there's no difference in the execution time (performances are a bit unstable on my laptop, but they oscillate in the same range).

from qibojit.

mlazzarin avatar mlazzarin commented on August 12, 2024

I'd like to add some addition results. I think it's a bit difficult to solve this issue because we don't have a baseline that shows which is the optimal result. For example, I tried to run some benchmarks (the same as in issue #51) with qibo, numpy backend:

numpy_backend_single_vs_double_prec

Are these results expected?
I think it may be interesting to run the same benchmark with all our backends that support multiqubit gates (numpy, tensorflow, qibojit) and with CPU/GPU to have a general overview of the problem.

from qibojit.

scarrazza avatar scarrazza commented on August 12, 2024

In theory I would say no, however this means that (a) the computation is not the bottleneck, or (b) for multiqubit gates there are pre or post processing done in double precision. If you can please profile the code to exclude (a).

from qibojit.

mlazzarin avatar mlazzarin commented on August 12, 2024

I tried to profile my benchmark code with cProfile. Configuration with 25 qubits, 6 targets, NumPy backend.
I see that nearly all the simulation time is spent in numpy.core._multiarray_umath.c_einsum which is called by NumPy' einsum. Time per call: 6.205 s (single), 6.29 s (double)

from qibojit.

scarrazza avatar scarrazza commented on August 12, 2024

Would be great to repeat this exercise with qibojit.

from qibojit.

mlazzarin avatar mlazzarin commented on August 12, 2024

I compared single vs double precision performance across different hardware and backends:

DGX
Numpy

dgx_numpy_single_vs_double_prec

TensorFlow

dgx_tensorflow_single_vs_double_prec

Qibojit

dgx_qibojit_single_vs_double_prec

TensorFlow GPU

dgx_gpu_tensorflow_gpu_single_vs_double_prec

Qibojit GPU

dgx_qibojit_gpu_single_vs_double_prec

Qiskit

dgx_qiskit_single_vs_double_prec

Qiskit GPU

dgx_qiskit_gpu_single_vs_double_prec

Qibomachine
Numpy

qibomachine_numpy_single_vs_double_prec

TensorFlow

qibomachine_tensorflow_single_vs_double_prec

Qibojit

qibomachine_qibojit_single_vs_double_prec

TensorFlow GPU

qibomachine_gpu_tensorflow_gpu_single_vs_double_prec

Qibojit GPU

qibomachine_qibojit_gpu_single_vs_double_prec

Generally speaking, we always see a consistent speed-up with single precision over double when we use GPU. Instead, with CPU we have different behaviors.
In particular, with NumPy we don't see consistent improvements, with TensorFlow we see a modest speed-up and with qibojit we see a speed-up only for ntargets < 3.

EDIT: I added the results for Qiskit, which show a moderate speed-up of single precision w.r.t. double, at least in average.

from qibojit.

scarrazza avatar scarrazza commented on August 12, 2024

@mlazzarin thank you very much. This looks quite consistent from the qibojit side. How many repetitions are you using for these numbers?

from qibojit.

mlazzarin avatar mlazzarin commented on August 12, 2024

20 for NQUBITS < 21, 10 for 21<=NQUBITS<=26, 3 for NQUBITS>26.

from qibojit.

scarrazza avatar scarrazza commented on August 12, 2024

Ok, good. Then going back to #53, I assume there is a noticeable difference between us and qiskit in terms of algorithm.

from qibojit.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.