@stavros11 when you have time, could you post your results with multiqubit gates?
Thanks for opening the issue. Here are the results for the multiqubit benchmark from #51 on CPU (all times in seconds):
nqubits=18 - simulation times
ntargets | qibo single | qibo double | qibojit single | qibojit double | qiskit single | qiskit double |
---|---|---|---|---|---|---|
3 | 0.00252 | 0.00293 | 0.00279 | 0.00293 | 0.00255 | 0.00273 |
4 | 0.00257 | 0.00386 | 0.00345 | 0.00363 | 0.00316 | 0.00374 |
5 | 0.00523 | 0.00559 | 0.00499 | 0.00553 | 0.00391 | 0.00476 |
6 | 0.01301 | 0.01194 | 0.01284 | 0.01184 | 0.00863 | 0.00972 |
7 | 0.02037 | 0.01597 | 0.02032 | 0.01590 | 0.04467 | 0.04942 |
8 | 0.03582 | 0.02775 | 0.03580 | 0.02734 | 0.16707 | 0.18929 |
9 | 0.06475 | 0.04812 | 0.06510 | 0.04808 | 0.88611 | 0.91469 |
10 | 0.12337 | 0.07982 | 0.12047 | 0.11339 | 5.63898 | 6.16987 |
nqubits=19 - simulation times
ntargets | qibo single | qibo double | qibojit single | qibojit double | qiskit single | qiskit double |
---|---|---|---|---|---|---|
3 | 0.00480 | 0.00435 | 0.00433 | 0.00443 | 0.00324 | 0.00362 |
4 | 0.00645 | 0.00667 | 0.00638 | 0.00695 | 0.00356 | 0.00470 |
5 | 0.01023 | 0.01049 | 0.00996 | 0.01044 | 0.00552 | 0.00699 |
6 | 0.02816 | 0.02578 | 0.02810 | 0.02563 | 0.01156 | 0.01446 |
7 | 0.04366 | 0.03425 | 0.04413 | 0.03410 | 0.06701 | 0.07808 |
8 | 0.09011 | 0.06150 | 0.07839 | 0.06002 | 0.21946 | 0.26088 |
9 | 0.14856 | 0.11064 | 0.14990 | 0.11285 | 1.12933 | 1.11194 |
10 | 0.26818 | 0.19493 | 0.23539 | 0.21493 | 6.54414 | 7.63080 |
nqubits=20 - simulation times
ntargets | qibo single | qibo double | qibojit single | qibojit double | qiskit single | qiskit double |
---|---|---|---|---|---|---|
3 | 0.00847 | 0.00832 | 0.00885 | 0.00893 | 0.00448 | 0.00557 |
4 | 0.01322 | 0.01523 | 0.01336 | 0.01410 | 0.00565 | 0.00783 |
5 | 0.02097 | 0.01520 | 0.02083 | 0.02229 | 0.00867 | 0.01183 |
6 | 0.06118 | 0.05561 | 0.06151 | 0.05508 | 0.01762 | 0.02485 |
7 | 0.09197 | 0.07478 | 0.11024 | 0.07456 | 0.12033 | 0.13969 |
8 | 0.16788 | 0.09990 | 0.18165 | 0.13632 | 0.29780 | 0.40871 |
9 | 0.32098 | 0.20416 | 0.26321 | 0.22551 | 1.48667 | 1.46820 |
10 | 0.32529 | 0.38266 | 0.58053 | 0.35156 | 7.72730 | 8.52348 |
nqubits=21 - simulation times
ntargets | qibo single | qibo double | qibojit single | qibojit double | qiskit single | qiskit double |
---|---|---|---|---|---|---|
3 | 0.01766 | 0.02236 | 0.01769 | 0.02222 | 0.00683 | 0.00936 |
4 | 0.02769 | 0.03944 | 0.02949 | 0.02284 | 0.00920 | 0.01479 |
5 | 0.04744 | 0.05407 | 0.04725 | 0.05816 | 0.01613 | 0.02674 |
6 | 0.13562 | 0.09652 | 0.13764 | 0.12275 | 0.03251 | 0.04883 |
7 | 0.22272 | 0.12269 | 0.21994 | 0.18551 | 0.22816 | 0.30534 |
8 | 0.37997 | 0.28580 | 0.37997 | 0.29958 | 0.49158 | 0.67552 |
9 | 0.46696 | 0.50157 | 0.67388 | 0.31254 | 2.14137 | 2.14235 |
10 | 0.74231 | 0.79214 | 0.79269 | 0.85835 | 9.66886 | 15.46075 |
nqubits=22 - simulation times
ntargets | qibo single | qibo double | qibojit single | qibojit double | qiskit single | qiskit double |
---|---|---|---|---|---|---|
3 | 0.04161 | 0.04240 | 0.04167 | 0.04218 | 0.01237 | 0.01709 |
4 | 0.06720 | 0.07693 | 0.07538 | 0.08310 | 0.01839 | 0.02563 |
5 | 0.12196 | 0.13427 | 0.12148 | 0.13557 | 0.03189 | 0.04959 |
6 | 0.20233 | 0.26014 | 0.26395 | 0.23517 | 0.06211 | 0.09527 |
7 | 0.32837 | 0.37219 | 0.41109 | 0.24786 | 0.47313 | 0.35765 |
8 | 0.52610 | 0.58221 | 0.80508 | 0.51965 | 0.84729 | 1.28193 |
9 | 0.85750 | 0.87660 | 0.97282 | 0.78351 | 3.37298 | 3.36922 |
10 | 1.61882 | 1.59471 | 1.58853 | 1.66941 | 12.36304 | 24.87413 |
nqubits=23 - simulation times
ntargets | qibo single | qibo double | qibojit single | qibojit double | qiskit single | qiskit double |
---|---|---|---|---|---|---|
3 | 0.04492 | 0.09326 | 0.08448 | 0.06830 | 0.02291 | 0.03834 |
4 | 0.15546 | 0.17647 | 0.15955 | 0.16432 | 0.03577 | 0.05503 |
5 | 0.27392 | 0.25915 | 0.19172 | 0.24949 | 0.06591 | 0.10277 |
6 | 0.54600 | 0.45729 | 0.64013 | 0.48950 | 0.12116 | 0.18796 |
7 | 0.71717 | 0.65344 | 0.61897 | 0.78138 | 0.57905 | 0.74864 |
8 | 1.07134 | 0.98115 | 1.09886 | 1.19669 | 1.98408 | 2.51820 |
9 | 2.20403 | 1.54533 | 1.85110 | 1.62834 | 5.86612 | 5.98414 |
10 | 3.48476 | 3.47243 | 3.40456 | 3.38084 | 18.15899 | 42.43832 |
nqubits=24 - simulation times
ntargets | qibo single | qibo double | qibojit single | qibojit double | qiskit single | qiskit double |
---|---|---|---|---|---|---|
3 | 0.11587 | 0.15120 | 0.18880 | 0.21074 | 0.05287 | 0.08805 |
4 | 0.24533 | 0.35044 | 0.30943 | 0.30915 | 0.07027 | 0.12071 |
5 | 0.53195 | 0.58445 | 0.43483 | 0.56411 | 0.13495 | 0.20928 |
6 | 1.19677 | 1.09115 | 1.10584 | 1.02924 | 0.24610 | 0.34992 |
7 | 1.37310 | 1.16172 | 1.42246 | 1.22445 | 1.41638 | 1.23091 |
8 | 2.55380 | 2.01835 | 2.28003 | 1.91831 | 3.50500 | 5.66647 |
9 | 4.18953 | 3.32717 | 3.98467 | 3.47910 | 10.78300 | 11.75414 |
10 | 7.22219 | 7.11032 | 7.72539 | 6.52692 | 28.78554 | 73.70552 |
nqubits=25 - simulation times
ntargets | qibo single | qibo double | qibojit single | qibojit double | qiskit single | qiskit double |
---|---|---|---|---|---|---|
3 | 0.29793 | 0.31643 | 0.20984 | 0.36482 | 0.10907 | 0.24589 |
4 | 0.60118 | 0.59639 | 0.51942 | 0.58661 | 0.15394 | 0.26426 |
5 | 1.04315 | 1.27955 | 1.14314 | 1.26141 | 0.25628 | 0.38248 |
6 | 2.11482 | 2.18827 | 1.98463 | 2.08555 | 0.46081 | 0.71935 |
7 | 2.77332 | 2.47914 | 2.94143 | 2.32125 | 2.50428 | 2.56939 |
8 | 4.86569 | 4.05328 | 4.74529 | 4.07807 | 7.38578 | 11.53120 |
9 | 8.43331 | 6.88069 | 8.43319 | 6.86786 | 23.06335 | 22.85148 |
10 | 15.40516 | 14.79330 | 15.42111 | 13.21936 | 51.77071 | 133.96834 |
For qiskit, single precision is faster than double, as expected, while for qibo there is no difference and in some cases single is even slower.
from qibojit.
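As a rough sanity check (illustrative only, not part of the benchmark above), the expected single-precision advantage for bandwidth-bound gate kernels comes simply from halving the amplitude size:

```python
import numpy as np

# complex64 amplitudes are half the size of complex128 ones, so a
# bandwidth-bound gate kernel is expected to run up to ~2x faster in
# single precision; the qibo numbers above do not show this gap.
nqubits = 20
single = np.zeros(2 ** nqubits, dtype=np.complex64)
double = np.zeros(2 ** nqubits, dtype=np.complex128)
print(single.nbytes // 2 ** 20, double.nbytes // 2 ** 20)  # → 8 16 (MiB)
```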
Here are some numbers for single vs double precision for various custom operators on CPU:
`apply_gate`
nqubits | single (64 threads) | double (64 threads) | single (1 thread) | double (1 thread) |
---|---|---|---|---|
10 | 0.00003 | 0.00003 | 0.00001 | 0.00001 |
11 | 0.00003 | 0.00003 | 0.00001 | 0.00001 |
12 | 0.00003 | 0.00003 | 0.00001 | 0.00001 |
13 | 0.00003 | 0.00003 | 0.00002 | 0.00002 |
14 | 0.00003 | 0.00003 | 0.00003 | 0.00004 |
15 | 0.00003 | 0.00004 | 0.00007 | 0.00007 |
16 | 0.00004 | 0.00004 | 0.00013 | 0.00013 |
17 | 0.00005 | 0.00006 | 0.00025 | 0.00025 |
18 | 0.00007 | 0.00010 | 0.00050 | 0.00050 |
19 | 0.00012 | 0.00015 | 0.00100 | 0.00099 |
20 | 0.00017 | 0.00026 | 0.00197 | 0.00200 |
21 | 0.00033 | 0.00061 | 0.00396 | 0.00410 |
22 | 0.00059 | 0.00132 | 0.00791 | 0.00815 |
23 | 0.00121 | 0.00366 | 0.01579 | 0.01631 |
24 | 0.00334 | 0.00609 | 0.03154 | 0.03221 |
25 | 0.00635 | 0.00891 | 0.06291 | 0.08589 |
26 | 0.00987 | 0.01903 | 0.14184 | 0.17122 |
27 | 0.02124 | 0.03823 | 0.28522 | 0.34151 |
28 | 0.04245 | 0.07628 | 0.56901 | 0.68857 |
`apply_x`
nqubits | single (64 threads) | double (64 threads) | single (1 thread) | double (1 thread) |
---|---|---|---|---|
10 | 0.00005 | 0.00003 | 0.00000 | 0.00000 |
11 | 0.00003 | 0.00003 | 0.00001 | 0.00001 |
12 | 0.00005 | 0.00003 | 0.00001 | 0.00001 |
13 | 0.00003 | 0.00003 | 0.00001 | 0.00001 |
14 | 0.00003 | 0.00003 | 0.00002 | 0.00002 |
15 | 0.00003 | 0.00003 | 0.00003 | 0.00003 |
16 | 0.00003 | 0.00004 | 0.00006 | 0.00005 |
17 | 0.00004 | 0.00005 | 0.00011 | 0.00010 |
18 | 0.00006 | 0.00008 | 0.00022 | 0.00020 |
19 | 0.00008 | 0.00013 | 0.00043 | 0.00063 |
20 | 0.00014 | 0.00025 | 0.00085 | 0.00086 |
21 | 0.00024 | 0.00055 | 0.00173 | 0.00201 |
22 | 0.00061 | 0.00137 | 0.00360 | 0.00400 |
23 | 0.00123 | 0.00360 | 0.00709 | 0.00771 |
24 | 0.00339 | 0.00645 | 0.01476 | 0.01651 |
25 | 0.00613 | 0.00811 | 0.02816 | 0.05608 |
26 | 0.00888 | 0.01910 | 0.09013 | 0.10297 |
27 | 0.01873 | 0.03655 | 0.17821 | 0.20354 |
28 | 0.03727 | 0.07210 | 0.35670 | 0.41257 |
`apply_two_qubit_gate`
nqubits | single (64 threads) | double (64 threads) | single (1 thread) | double (1 thread) |
---|---|---|---|---|
10 | 0.00003 | 0.00003 | 0.00001 | 0.00001 |
11 | 0.00003 | 0.00003 | 0.00001 | 0.00001 |
12 | 0.00003 | 0.00003 | 0.00002 | 0.00002 |
13 | 0.00003 | 0.00003 | 0.00004 | 0.00004 |
14 | 0.00003 | 0.00004 | 0.00007 | 0.00008 |
15 | 0.00004 | 0.00004 | 0.00014 | 0.00015 |
16 | 0.00005 | 0.00005 | 0.00028 | 0.00029 |
17 | 0.00006 | 0.00008 | 0.00055 | 0.00055 |
18 | 0.00010 | 0.00014 | 0.00111 | 0.00113 |
19 | 0.00020 | 0.00022 | 0.00220 | 0.00226 |
20 | 0.00036 | 0.00036 | 0.00438 | 0.00454 |
21 | 0.00067 | 0.00077 | 0.00878 | 0.00909 |
22 | 0.00110 | 0.00159 | 0.01757 | 0.01817 |
23 | 0.00201 | 0.00344 | 0.03469 | 0.03619 |
24 | 0.00465 | 0.00766 | 0.07004 | 0.07252 |
25 | 0.00866 | 0.00967 | 0.13983 | 0.17484 |
26 | 0.01276 | 0.02150 | 0.33970 | 0.48276 |
27 | 0.02858 | 0.04276 | 0.89865 | 0.97283 |
28 | 0.05609 | 0.09051 | 1.79465 | 1.93739 |
`apply_multi_qubit_gate` - ntargets=3
nqubits | single (64 threads) | double (64 threads) | single (1 thread) | double (1 thread) |
---|---|---|---|---|
10 | 0.00003 | 0.00003 | 0.00002 | 0.00002 |
11 | 0.00004 | 0.00003 | 0.00003 | 0.00003 |
12 | 0.00003 | 0.00004 | 0.00006 | 0.00005 |
13 | 0.00004 | 0.00004 | 0.00010 | 0.00010 |
14 | 0.00007 | 0.00005 | 0.00019 | 0.00019 |
15 | 0.00007 | 0.00007 | 0.00037 | 0.00038 |
16 | 0.00011 | 0.00010 | 0.00074 | 0.00075 |
17 | 0.00010 | 0.00016 | 0.00148 | 0.00150 |
18 | 0.00030 | 0.00026 | 0.00294 | 0.00304 |
19 | 0.00051 | 0.00051 | 0.00590 | 0.00604 |
20 | 0.00092 | 0.00082 | 0.01178 | 0.01216 |
21 | 0.00148 | 0.00174 | 0.02375 | 0.02441 |
22 | 0.00274 | 0.00306 | 0.04756 | 0.04895 |
23 | 0.00556 | 0.00645 | 0.09580 | 0.09791 |
24 | 0.01099 | 0.01273 | 0.19194 | 0.19609 |
25 | 0.02151 | 0.02069 | 0.38340 | 0.40367 |
26 | 0.04200 | 0.03610 | 0.78617 | 0.91813 |
27 | 0.07741 | 0.07561 | 1.76592 | 1.98405 |
28 | 0.13882 | 0.15438 | 3.84446 | 3.95124 |
`apply_multi_qubit_gate` - ntargets=4
nqubits | single (64 threads) | double (64 threads) | single (1 thread) | double (1 thread) |
---|---|---|---|---|
10 | 0.00004 | 0.00004 | 0.00003 | 0.00003 |
11 | 0.00004 | 0.00003 | 0.00005 | 0.00005 |
12 | 0.00004 | 0.00004 | 0.00009 | 0.00009 |
13 | 0.00004 | 0.00005 | 0.00018 | 0.00018 |
14 | 0.00005 | 0.00006 | 0.00035 | 0.00034 |
15 | 0.00010 | 0.00010 | 0.00071 | 0.00067 |
16 | 0.00014 | 0.00014 | 0.00135 | 0.00133 |
17 | 0.00018 | 0.00025 | 0.00269 | 0.00270 |
18 | 0.00042 | 0.00049 | 0.00540 | 0.00608 |
19 | 0.00085 | 0.00082 | 0.01268 | 0.01607 |
20 | 0.00145 | 0.00166 | 0.03330 | 0.03659 |
21 | 0.00225 | 0.00424 | 0.07047 | 0.07545 |
22 | 0.00456 | 0.00619 | 0.14903 | 0.15434 |
23 | 0.01973 | 0.01497 | 0.30761 | 0.31311 |
24 | 0.01979 | 0.03174 | 0.61445 | 0.62966 |
25 | 0.05212 | 0.06423 | 1.24524 | 1.29693 |
26 | 0.10581 | 0.11538 | 2.55039 | 2.41001 |
27 | 0.17856 | 0.16543 | 4.73377 | 4.29563 |
28 | 0.29317 | 0.23916 | 8.79219 | 8.72099 |
`apply_multi_qubit_gate` - ntargets=5
nqubits | single (64 threads) | double (64 threads) | single (1 thread) | double (1 thread) |
---|---|---|---|---|
10 | 0.00004 | 0.00004 | 0.00004 | 0.00004 |
11 | 0.00005 | 0.00004 | 0.00008 | 0.00008 |
12 | 0.00005 | 0.00005 | 0.00015 | 0.00015 |
13 | 0.00006 | 0.00006 | 0.00031 | 0.00030 |
14 | 0.00008 | 0.00010 | 0.00060 | 0.00060 |
15 | 0.00013 | 0.00014 | 0.00125 | 0.00118 |
16 | 0.00023 | 0.00025 | 0.00232 | 0.00237 |
17 | 0.00040 | 0.00040 | 0.00464 | 0.00471 |
18 | 0.00043 | 0.00078 | 0.00923 | 0.01119 |
19 | 0.00125 | 0.00112 | 0.04019 | 0.02804 |
20 | 0.00203 | 0.00219 | 0.05659 | 0.06124 |
21 | 0.00375 | 0.00773 | 0.12062 | 0.22236 |
22 | 0.01474 | 0.01367 | 0.41249 | 0.45251 |
23 | 0.02889 | 0.02938 | 0.85467 | 0.93634 |
24 | 0.05221 | 0.05733 | 1.72745 | 1.88799 |
25 | 0.11024 | 0.10648 | 3.42486 | 3.81983 |
26 | 0.18496 | 0.20348 | 7.00453 | 7.75527 |
27 | 0.37176 | 0.35060 | 14.17189 | 14.21730 |
28 | 0.67633 | 0.83049 | 26.16925 | 36.41165 |
`apply_multi_qubit_gate` - ntargets=6
nqubits | single (64 threads) | double (64 threads) | single (1 thread) | double (1 thread) |
---|---|---|---|---|
10 | 0.00006 | 0.00004 | 0.00011 | 0.00008 |
11 | 0.00006 | 0.00005 | 0.00021 | 0.00016 |
12 | 0.00007 | 0.00006 | 0.00041 | 0.00031 |
13 | 0.00008 | 0.00013 | 0.00080 | 0.00062 |
14 | 0.00018 | 0.00016 | 0.00160 | 0.00124 |
15 | 0.00024 | 0.00023 | 0.00320 | 0.00250 |
16 | 0.00052 | 0.00045 | 0.00641 | 0.00506 |
17 | 0.00100 | 0.00052 | 0.01308 | 0.00998 |
18 | 0.00120 | 0.00137 | 0.02585 | 0.02033 |
19 | 0.00271 | 0.00197 | 0.05230 | 0.04444 |
20 | 0.00402 | 0.00431 | 0.10630 | 0.08368 |
21 | 0.00917 | 0.01023 | 0.21530 | 0.30033 |
22 | 0.02224 | 0.02078 | 0.63213 | 0.62978 |
23 | 0.04513 | 0.04106 | 1.31287 | 1.31433 |
24 | 0.08741 | 0.08156 | 2.65060 | 2.64123 |
25 | 0.16587 | 0.16127 | 5.30112 | 5.19375 |
26 | 0.31596 | 0.31030 | 10.74703 | 10.39643 |
27 | 0.63359 | 0.62503 | 21.16741 | 20.80903 |
28 | 1.24665 | 1.26802 | 42.51790 | 41.84317 |
`apply_multi_qubit_gate` - ntargets=7
nqubits | single (64 threads) | double (64 threads) | single (1 thread) | double (1 thread) |
---|---|---|---|---|
10 | 0.00009 | 0.00008 | 0.00020 | 0.00015 |
11 | 0.00008 | 0.00009 | 0.00040 | 0.00030 |
12 | 0.00014 | 0.00009 | 0.00079 | 0.00060 |
13 | 0.00018 | 0.00010 | 0.00156 | 0.00121 |
14 | 0.00028 | 0.00024 | 0.00344 | 0.00233 |
15 | 0.00048 | 0.00025 | 0.00630 | 0.00467 |
16 | 0.00092 | 0.00071 | 0.01247 | 0.00938 |
17 | 0.00144 | 0.00102 | 0.02492 | 0.01889 |
18 | 0.00219 | 0.00190 | 0.05043 | 0.03780 |
19 | 0.00409 | 0.00288 | 0.10056 | 0.07645 |
20 | 0.00864 | 0.00657 | 0.20283 | 0.15400 |
21 | 0.01519 | 0.01503 | 0.40817 | 0.46041 |
22 | 0.03118 | 0.02859 | 1.02093 | 0.93423 |
23 | 0.06537 | 0.05591 | 2.10929 | 1.95459 |
24 | 0.12585 | 0.11040 | 4.22156 | 3.92964 |
25 | 0.23822 | 0.21403 | 8.40546 | 7.69408 |
26 | 0.45770 | 0.41094 | 16.92144 | 15.54976 |
27 | 0.94539 | 0.83017 | 33.80684 | 31.00753 |
28 | 1.82106 | 1.65167 | 67.47716 | 62.36736 |
`apply_multi_qubit_gate` - ntargets=8
nqubits | single (64 threads) | double (64 threads) | single (1 thread) | double (1 thread) |
---|---|---|---|---|
10 | 0.00013 | 0.00020 | 0.00038 | 0.00028 |
11 | 0.00014 | 0.00021 | 0.00075 | 0.00056 |
12 | 0.00025 | 0.00020 | 0.00148 | 0.00111 |
13 | 0.00025 | 0.00020 | 0.00296 | 0.00221 |
14 | 0.00048 | 0.00038 | 0.00591 | 0.00439 |
15 | 0.00090 | 0.00070 | 0.01181 | 0.00879 |
16 | 0.00134 | 0.00074 | 0.02359 | 0.01753 |
17 | 0.00235 | 0.00198 | 0.04727 | 0.03541 |
18 | 0.00423 | 0.00317 | 0.09454 | 0.07123 |
19 | 0.00735 | 0.00588 | 0.19099 | 0.14196 |
20 | 0.01550 | 0.01102 | 0.37985 | 0.28841 |
21 | 0.02907 | 0.02451 | 0.76924 | 0.77398 |
22 | 0.05943 | 0.04698 | 1.82490 | 1.57743 |
23 | 0.11351 | 0.09371 | 3.69565 | 3.22068 |
24 | 0.21270 | 0.17893 | 7.31751 | 6.34632 |
25 | 0.40647 | 0.34237 | 14.74788 | 12.70273 |
26 | 0.79980 | 0.67554 | 29.88017 | 25.75951 |
27 | 1.57292 | 1.33787 | 59.09961 | 51.23731 |
28 | 3.10821 | 2.62125 | 118.59848 | 102.75361 |
Apparently the issue appears only for the multi-qubit op and persists even with a single thread. For the other ops the difference between single and double precision is as expected.
Good to know, at least the problem is restricted to a specific op. Two questions:

- Which function call did you benchmark? E.g. exactly `apply_multi_qubit_gate` or the whole `multi_qubit_base`?
- Did you try to set `parallel=False` instead of setting the number of threads?
> Which function call did you benchmark? E.g. exactly `apply_multi_qubit_gate` or the whole `multi_qubit_base`?

I called `apply_multi_qubit_gate` directly, and the same for the other ops. Here is the full script I used:
**Benchmark script**

```python
import os
import argparse
import json
import time
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument("--nqubits", default=20, type=int)
parser.add_argument("--nreps", default=1, type=int)
parser.add_argument("--method", default="apply_gate", type=str)
parser.add_argument("--ntargets", default=1, type=int)
parser.add_argument("--precision", default="double", type=str)
parser.add_argument("--nthreads", default=None, type=int)
parser.add_argument("--filename", default=None, type=str)


def main(nqubits, nreps, method, ntargets, precision, nthreads, filename):
    if filename is not None:
        if os.path.isfile(filename):
            with open(filename, "r") as file:
                logs = json.load(file)
            print("Extending existing logs from {}.".format(filename))
        else:
            print("Creating new logs in {}.".format(filename))
            logs = []
    else:
        logs = []
    logs.append({
        "nqubits": nqubits, "nreps": nreps,
        "method": method, "ntargets": ntargets,
        "CUDA_VISIBLE_DEVICES": os.environ.get("CUDA_VISIBLE_DEVICES")
    })

    # Create backend object
    start_time = time.time()
    from qibo import K
    logs[-1]["import_time"] = time.time() - start_time

    # Set backend precision
    K.set_precision(precision)
    logs[-1]["precision"] = K.precision
    logs[-1]["backend"] = K.name

    # Set number of threads
    if nthreads is not None:
        K.set_threads(nthreads)
    logs[-1]["nthreads"] = K.nthreads

    # Calculate ``qubits`` tensor
    targets = list(range(ntargets))
    qubits = [nqubits - q - 1 for q in targets]
    #qubits.extend(nqubits - q - 1 for q in controls)
    qubits = K.cast(sorted(qubits), dtype="int32")

    # Generate ``gate`` matrix if needed
    if method in ("apply_gate", "apply_two_qubit_gate", "apply_multi_qubit_gate"):
        shape = 2 * (2 ** len(targets),)
        gate = K.cast(np.random.random(shape) + 1j * np.random.random(shape))
        executor = lambda x: getattr(K, method)(x, gate, nqubits, targets, qubits)
    else:
        executor = lambda x: getattr(K, method)(x, nqubits, targets, qubits)

    start_time = time.time()
    state = K.initial_state(nqubits)
    logs[-1]["initial_state_time"] = time.time() - start_time

    start_time = time.time()
    state = executor(state)
    logs[-1]["dry_run_time"] = time.time() - start_time
    logs[-1]["dtype"] = str(state.dtype)
    del state

    logs[-1]["simulation_times"] = []
    for _ in range(nreps):
        state = K.initial_state(nqubits)
        start_time = time.time()
        state = executor(state)
        logs[-1]["simulation_times"].append(time.time() - start_time)
        del state

    logs[-1]["simulation_time_mean"] = np.mean(logs[-1]["simulation_times"])
    logs[-1]["simulation_time_std"] = np.std(logs[-1]["simulation_times"])
    for k, v in logs[-1].items():
        print("{}: {}".format(k, v))
    print()

    if filename is not None:
        with open(filename, "w") as file:
            json.dump(logs, file)


if __name__ == "__main__":
    args = vars(parser.parse_args())
    main(**args)
```
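The JSON logs the script writes can later be aggregated with a small reader; the `summarize` helper below is hypothetical, but the keys match the log format produced above:

```python
import json

def summarize(logs):
    # Map (nqubits, ntargets) to the mean simulation time recorded
    # in each log entry written by the benchmark script.
    return {(e["nqubits"], e["ntargets"]): e["simulation_time_mean"]
            for e in logs}

# In-memory example entry matching the script's log format:
logs = json.loads('[{"nqubits": 20, "ntargets": 3, "simulation_time_mean": 0.0009}]')
print(summarize(logs))  # → {(20, 3): 0.0009}
```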
> Did you try to set `parallel=False` instead of setting the number of threads?

No, I haven't tried that. As shown in the script above I use

```python
from qibo import K
K.set_threads(nthreads)
```

in order to set the number of threads. I also haven't tried running this on GPU, but in principle the same script would work.

EDIT: Here are some results with `parallel=False` and a single thread:
`apply_multi_qubit_gate` - ntargets=3
nqubits | single | double |
---|---|---|
10 | 0.00002 | 0.00002 |
11 | 0.00003 | 0.00003 |
12 | 0.00005 | 0.00005 |
13 | 0.00010 | 0.00010 |
14 | 0.00019 | 0.00019 |
15 | 0.00037 | 0.00038 |
16 | 0.00074 | 0.00075 |
17 | 0.00148 | 0.00149 |
18 | 0.00296 | 0.00296 |
19 | 0.00591 | 0.01222 |
20 | 0.01182 | 0.01196 |
21 | 0.02399 | 0.02434 |
22 | 0.04796 | 0.04896 |
23 | 0.09642 | 0.09773 |
24 | 0.19295 | 0.19594 |
25 | 0.38674 | 0.40418 |
26 | 0.78994 | 0.89543 |
27 | 1.77743 | 1.97553 |
`apply_multi_qubit_gate` - ntargets=4
nqubits | single | double |
---|---|---|
10 | 0.00003 | 0.00003 |
11 | 0.00005 | 0.00005 |
12 | 0.00009 | 0.00009 |
13 | 0.00017 | 0.00018 |
14 | 0.00035 | 0.00035 |
15 | 0.00073 | 0.00073 |
16 | 0.00142 | 0.00140 |
17 | 0.00278 | 0.00279 |
18 | 0.00567 | 0.00566 |
19 | 0.01255 | 0.01666 |
20 | 0.03295 | 0.03674 |
21 | 0.07555 | 0.07934 |
22 | 0.14994 | 0.15966 |
23 | 0.30840 | 0.32393 |
24 | 0.62203 | 0.65096 |
25 | 1.25948 | 1.34278 |
26 | 2.57411 | 2.47533 |
27 | 4.86673 | 4.34140 |
`apply_multi_qubit_gate` - ntargets=5
nqubits | single | double |
---|---|---|
10 | 0.00004 | 0.00004 |
11 | 0.00008 | 0.00008 |
12 | 0.00015 | 0.00015 |
13 | 0.00029 | 0.00030 |
14 | 0.00059 | 0.00060 |
15 | 0.00118 | 0.00115 |
16 | 0.00228 | 0.00235 |
17 | 0.00458 | 0.00464 |
18 | 0.00912 | 0.01179 |
19 | 0.02251 | 0.02766 |
20 | 0.05465 | 0.06129 |
21 | 0.12234 | 0.22476 |
22 | 0.39308 | 0.43870 |
23 | 0.80829 | 0.89043 |
24 | 1.64349 | 1.81944 |
25 | 3.27383 | 3.67461 |
26 | 6.82895 | 7.51540 |
`apply_multi_qubit_gate` - ntargets=6
nqubits | single | double |
---|---|---|
10 | 0.00006 | 0.00007 |
11 | 0.00011 | 0.00013 |
12 | 0.00022 | 0.00025 |
13 | 0.00044 | 0.00050 |
14 | 0.00088 | 0.00099 |
15 | 0.00177 | 0.00200 |
16 | 0.00355 | 0.00427 |
17 | 0.00723 | 0.00826 |
18 | 0.01462 | 0.01709 |
19 | 0.02899 | 0.03485 |
20 | 0.06035 | 0.07053 |
21 | 0.11944 | 0.37921 |
22 | 0.69294 | 0.77325 |
23 | 1.45051 | 1.58418 |
24 | 2.98322 | 3.17140 |
25 | 6.04239 | 6.37544 |
26 | 12.26439 | 13.06399 |
`apply_multi_qubit_gate` - ntargets=7
nqubits | single | double |
---|---|---|
10 | 0.00007 | 0.00009 |
11 | 0.00014 | 0.00016 |
12 | 0.00027 | 0.00033 |
13 | 0.00053 | 0.00064 |
14 | 0.00107 | 0.00126 |
15 | 0.00211 | 0.00252 |
16 | 0.00428 | 0.00507 |
17 | 0.00866 | 0.01026 |
18 | 0.01786 | 0.02033 |
19 | 0.03554 | 0.04163 |
20 | 0.07032 | 0.08453 |
21 | 0.14418 | 0.43070 |
22 | 0.78342 | 0.89268 |
23 | 1.62434 | 1.82199 |
24 | 3.36772 | 3.65349 |
25 | 6.83152 | 7.30669 |
26 | 14.06274 | 15.98979 |
`apply_multi_qubit_gate` - ntargets=8
nqubits | single | double |
---|---|---|
10 | 0.00009 | 0.00012 |
11 | 0.00018 | 0.00024 |
12 | 0.00034 | 0.00048 |
13 | 0.00067 | 0.00096 |
14 | 0.00135 | 0.00197 |
15 | 0.00275 | 0.00372 |
16 | 0.00552 | 0.00764 |
17 | 0.01094 | 0.01451 |
18 | 0.02138 | 0.03014 |
19 | 0.04558 | 0.06331 |
20 | 0.09554 | 0.13397 |
21 | 0.18050 | 0.54658 |
22 | 0.91581 | 1.08427 |
23 | 1.82818 | 2.20956 |
24 | 3.75763 | 4.48631 |
25 | 7.77429 | 9.19321 |
26 | 15.59674 | 18.21344 |
I checked the parallelization logs and I don't see anything wrong in them (even if I don't understand all the details):
```
================================================================================
Parallel Accelerator Optimizing: Function apply_multi_qubit_gate_kernel,
.../qibojit/src/qibojit/custom_operators/gates.py (381)
================================================================================

Parallel loop listing for Function apply_multi_qubit_gate_kernel, .../qibojit/src/qibojit/custom_operators/gates.py (381)
-----------------------------------------------------------------------------------------|loop #ID
@njit(["complex64[:](complex64[:], complex64[:,::1], int32[:], int64, int64[:])",        |
       "complex128[:](complex128[:], complex128[:,::1], int32[:], int64, int64[:])"],    |
      parallel=True, nogil=True,                                                         |
      cache=True)                                                                        |
def apply_multi_qubit_gate_kernel(state, gate, qubits, nstates, targets):                |
    nsubstates = 1 << len(targets)                                                       |
    for g in prange(nstates):  # pylint: disable=not-an-iterable---------------------------| #41
        ig = multicontrol_index(g, qubits)                                               |
        buffer = np.empty(nsubstates, dtype=state.dtype)                                 |
        for i in range(nsubstates):                                                      |
            t = ig - multitarget_index(nsubstates - i - 1, targets)                      |
            buffer[i] = state[t]                                                         |
        for i in range(nsubstates):                                                      |
            t = ig - multitarget_index(nsubstates - i - 1, targets)                      |
            state[t] = np.dot(gate[i], buffer)---------------------------------------------| #40
    return state                                                                         |

--------------------------------- Fusing loops ---------------------------------
Attempting fusion of parallel loops (combines loops with similar properties)...
----------------------------- Before Optimisation ------------------------------
Parallel region 0:
+--41 (parallel)
   +--40 (parallel)
--------------------------------------------------------------------------------
------------------------------ After Optimisation ------------------------------
Parallel region 0:
+--41 (parallel)
   +--40 (serial)

Parallel region 0 (loop #41) had 0 loop(s) fused and 1 loop(s) serialized as
part of the larger parallel loop (#41).
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------

---------------------------Loop invariant code motion---------------------------
Allocation hoisting:
The memory allocation derived from the instruction at
.../qibojit/src/qibojit/custom_operators/gates.py (389) is
hoisted out of the parallel loop labelled #41 (it will be performed before the
loop is executed and reused inside the loop):
   Allocation:: buffer = np.empty(nsubstates, dtype=state.dtype)
    - numpy.empty() is used for the allocation.

Instruction hoisting:
loop #41:
  Has the following hoisted:
    $100load_global.1 = global(range: <class 'range'>)
    $114load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fd5b0ae0040>))
    $const122.8 = const(int, 1)
    $const18.7.2435 = const(int, 0)
    $np_typ_var.2470 = const(str, complex128)
    $26load_global.2 = global(multicontrol_index: CPUDispatcher(<function multicontrol_index at 0x7fd5dc117040>))
    $36load_global.6 = global(np: <module 'numpy' from '.../numpy/__init__.py'>)
    $38load_attr.7 = getattr(value=$36load_global.6, attr=empty)
    $52load_global.13 = global(range: <class 'range'>)
    $66load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fd5b0ae0040>))
    $const74.8 = const(int, 1)
  Failed to hoist the following:
    not pure: $104call_function.3 = call $push_global_to_block.2576(nsubstates, func=$push_global_to_block.2576, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: $106get_iter.4 = getiter(value=$104call_function.3)
    dependency: $108for_iter.2 = iternext(value=$106get_iter.4)
    dependency: $i.1.2494 = pair_first(value=$108for_iter.2)
    dependency: $108for_iter.4 = pair_second(value=$108for_iter.2)
    dependency: $120binary_subtract.7 = nsubstates - $i.1.2494
    dependency: $124binary_subtract.9 = $120binary_subtract.7 - $const122.8
    dependency: $128call_function.11 = call $push_global_to_block.2577($124binary_subtract.9, targets, func=$push_global_to_block.2577, args=[Var($124binary_subtract.9, gates.py:394), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: t.1 = ig - $128call_function.11
    dependency: $a.2426.2495 = getitem(value=gate, index=$i.1.2494, fn=<built-in function getitem>)
    dependency: $16load_attr.6.2434 = getattr(value=$a.2426.2495, attr=shape)
    dependency: l.2437 = getitem(value=$16load_attr.6.2434, index=$const18.7.2435, fn=<built-in function getitem>)
    dependency: $dtype_attr_var.2471 = getattr(value=$np_g_var.2469, attr=dtype)
    dependency: $dtype_var.2472 = call $dtype_attr_var.2471($np_typ_var.2470, func=$dtype_attr_var.2471, args=[Var($np_typ_var.2470, gates.py:389)], kws=(), vararg=None, target=None)
    dependency: g = $parfor__index_2489.2560
    dependency: ig = call $push_global_to_block.2572($parfor__index_2489.2560, qubits, func=$push_global_to_block.2572, args=[Var($parfor__index_2489.2560, <string>:2), Var(qubits, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: buffer = call $push_getattr_to_block.2573(nsubstates, func=$push_getattr_to_block.2573, args=[Var(nsubstates, gates.py:386)], kws=[('dtype', Var($dtype_var.2472, gates.py:389))], vararg=None, target=None)
    not pure: $56call_function.15 = call $push_global_to_block.2575(nsubstates, func=$push_global_to_block.2575, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: $58get_iter.16 = getiter(value=$56call_function.15)
    dependency: $60for_iter.2 = iternext(value=$58get_iter.16)
    dependency: $i.2497 = pair_first(value=$60for_iter.2)
    dependency: $60for_iter.4 = pair_second(value=$60for_iter.2)
    dependency: $72binary_subtract.7 = nsubstates - $i.2497
    dependency: $76binary_subtract.9 = $72binary_subtract.7 - $const74.8
    dependency: $80call_function.11 = call $push_global_to_block.2578($76binary_subtract.9, targets, func=$push_global_to_block.2578, args=[Var($76binary_subtract.9, gates.py:391), Var(targets, gates.py:386)], kws=(), vararg=None, target=None)
    dependency: t = ig - $80call_function.11
    dependency: $90binary_subscr.15 = getitem(value=state, index=t, fn=<built-in function getitem>)
```
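For reference, the gather/apply/scatter pattern that the kernel in the listing above implements (ignoring control qubits) can be sketched in plain numpy. This is an illustrative reference implementation under a qubit-0-first ordering assumption, not the qibojit code:

```python
import numpy as np

def apply_multi_qubit_gate(state, gate, nqubits, targets):
    """Apply a (2^k x 2^k) ``gate`` on ``targets`` of a 2^nqubits state.

    Same gather / np.dot / scatter effect as the kernel, expressed with
    reshapes and transposes; qubit 0 is taken as the first tensor axis.
    """
    k = len(targets)
    rest = [q for q in range(nqubits) if q not in targets]
    tensor = state.reshape([2] * nqubits)
    # Bring the target axes to the front and merge them into one 2^k axis
    tensor = np.transpose(tensor, targets + rest).reshape(2 ** k, -1)
    # Apply the gate on the target subspace (the kernel's np.dot step)
    tensor = gate @ tensor
    # Restore the original axis order and flatten back to a state vector
    tensor = tensor.reshape([2] * nqubits)
    return np.transpose(tensor, np.argsort(targets + rest)).reshape(-1)

# Example: X on qubit 0 of |000> gives |100> (index 4 in this convention)
X = np.array([[0, 1], [1, 0]], dtype=np.complex128)
state = np.zeros(8, dtype=np.complex128)
state[0] = 1
print(int(np.argmax(np.abs(apply_multi_qubit_gate(state, X, 3, [0])))))  # → 4
```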
Besides parallelization, I also inspected the intermediate representation of this kernel, and I believe the types are inferred properly:
>>> gates.apply_multi_qubit_gate_kernel.inspect_types()
apply_multi_qubit_gate_kernel (array(complex64, 1d, A), array(complex64, 2d, C), array(int32, 1d, A), int64, array(int64, 1d, A))
--------------------------------------------------------------------------------
# File: .../qibojit/src/qibojit/custom_operators/gates.py
# --- LINE 381 ---
@njit(["complex64[:](complex64[:], complex64[:,::1], int32[:], int64, int64[:])",
# --- LINE 382 ---
"complex128[:](complex128[:], complex128[:,::1], int32[:], int64, int64[:])"],
# --- LINE 383 ---
parallel=True,
# --- LINE 384 ---
cache=True)
# --- LINE 385 ---
def apply_multi_qubit_gate_kernel(state, gate, qubits, nstates, targets):
# --- LINE 386 ---
# label 0
# state = arg(0, name=state) :: array(complex64, 1d, A)
# gate = arg(1, name=gate) :: array(complex64, 2d, C)
# qubits = arg(2, name=qubits) :: array(int32, 1d, A)
# nstates = arg(3, name=nstates) :: int64
# targets = arg(4, name=targets) :: array(int64, 1d, A)
# $const2.0 = const(int, 1) :: Literal[int](1)
# $4load_global.1 = global(len: <built-in function len>) :: Function(<built-in function len>)
# $8call_function.3 = targets_size0.647 :: int64
# del $4load_global.1
# nsubstates = $const2.0 << targets_size0.647 :: int64
# del $const2.0
# del $8call_function.3
nsubstates = 1 << len(targets)
# --- LINE 387 ---
# $14load_global.5 = global(prange: <class 'numba.misc.special.prange'>) :: Function(<class 'numba.misc.special.prange'>)
# $18call_function.7 = call $14load_global.5(nstates, func=$14load_global.5, args=[Var(nstates, gates.py:386)], kws=(), vararg=None, target=None) :: (int64,) -> range_state_int64
# del nstates
# del $14load_global.5
# $20get_iter.8 = getiter(value=$18call_function.7) :: range_iter_int64
# del $18call_function.7
# $phi22.0 = $20get_iter.8 :: range_iter_int64
# del $20get_iter.8
# jump 22
# label 22
# $22for_iter.1 = iternext(value=$20get_iter.8) :: pair<int64, bool>
# $22for_iter.2 = pair_first(value=$22for_iter.1) :: int64
# $22for_iter.3 = pair_second(value=$22for_iter.1) :: bool
# del $22for_iter.1
# parfor_index.655 = parfor_index.655 :: uint64
# del $22for_iter.2
# branch $22for_iter.3, 24, 1597
# label 24
# del $22for_iter.3
# g = $parfor__index_655.726 :: int64
# del $phi24.1
for g in prange(nstates): # pylint: disable=not-an-iterable
# --- LINE 388 ---
# $26load_global.2 = global(multicontrol_index: CPUDispatcher(<function multicontrol_index at 0x7fe75d6714c0>)) :: type(CPUDispatcher(<function multicontrol_index at 0x7fe75d6714c0>))
# ig = call $push_global_to_block.738($parfor__index_655.726, qubits, func=$push_global_to_block.738, args=[Var($parfor__index_655.726, <string>:2), Var(qubits, gates.py:386)], kws=(), vararg=None, target=None) :: (int64, array(int32, 1d, A)) -> int64
# del g
# del $26load_global.2
ig = multicontrol_index(g, qubits)
# --- LINE 389 ---
# $36load_global.6 = global(np: <module 'numpy' from '.../numpy/__init__.py'>) :: Module(<module 'numpy' from '.../numpy/__init__.py'>)
# $38load_attr.7 = getattr(value=$36load_global.6, attr=empty) :: Function(<built-in function empty>)
# del $36load_global.6
# $44load_attr.10 = $dtype_var.638 :: dtype(complex64)
# buffer = call $push_getattr_to_block.739(nsubstates, func=$push_getattr_to_block.739, args=[Var(nsubstates, gates.py:386)], kws=[('dtype', Var($dtype_var.638, gates.py:389))], vararg=None, target=None) :: (int64, dtype(complex64)) -> array(complex64, 1d, C)
# del $44load_attr.10
# del $38load_attr.7
buffer = np.empty(nsubstates, dtype=state.dtype)
# --- LINE 390 ---
# $52load_global.13 = global(range: <class 'range'>) :: Function(<class 'range'>)
# $56call_function.15 = call $push_global_to_block.741(nsubstates, func=$push_global_to_block.741, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None) :: (int64,) -> range_state_int64
# del $52load_global.13
# $58get_iter.16 = getiter(value=$56call_function.15) :: range_iter_int64
# del $56call_function.15
# $phi60.1 = $58get_iter.16 :: range_iter_int64
# del $58get_iter.16
# jump 60
# label 60
# $60for_iter.2 = iternext(value=$58get_iter.16) :: pair<int64, bool>
# $i.663 = pair_first(value=$60for_iter.2) :: int64
# $60for_iter.4 = pair_second(value=$60for_iter.2) :: bool
# del $60for_iter.2
# $phi62.2 = $i.663 :: int64
# del $60for_iter.3
# branch $60for_iter.4, 62, 100
# label 62
# del $60for_iter.4
# i = $i.663 :: int64
# del $phi62.2
for i in range(nsubstates):
# --- LINE 391 ---
# $66load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fe732519f70>)) :: type(CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))
# $72binary_subtract.7 = nsubstates - $i.663 :: int64
# $const74.8 = const(int, 1) :: Literal[int](1)
# $76binary_subtract.9 = $72binary_subtract.7 - $const74.8 :: int64
# del $const74.8
# del $72binary_subtract.7
# $80call_function.11 = call $push_global_to_block.744($76binary_subtract.9, targets, func=$push_global_to_block.744, args=[Var($76binary_subtract.9, gates.py:391), Var(targets, gates.py:386)], kws=(), vararg=None, target=None) :: (int64, array(int64, 1d, A)) -> int64
# del $76binary_subtract.9
# del $66load_global.4
# t = ig - $80call_function.11 :: int64
# del $80call_function.11
t = ig - multitarget_index(nsubstates - i - 1, targets)
# --- LINE 392 ---
# $90binary_subscr.15 = getitem(value=state, index=t, fn=<built-in function getitem>) :: complex64
# del t
# buffer[$i.663] = $90binary_subscr.15 :: (array(complex64, 1d, C), int64, complex64) -> none
# del i
# del $90binary_subscr.15
# jump 60
buffer[i] = state[t]
# --- LINE 393 ---
# label 100
# del $phi62.2
# del $phi60.1
# del $60for_iter.4
# $100load_global.1 = global(range: <class 'range'>) :: Function(<class 'range'>)
# $104call_function.3 = call $push_global_to_block.742(nsubstates, func=$push_global_to_block.742, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None) :: (int64,) -> range_state_int64
# del $100load_global.1
# $106get_iter.4 = getiter(value=$104call_function.3) :: range_iter_int64
# del $104call_function.3
# $phi108.1 = $106get_iter.4 :: range_iter_int64
# del $106get_iter.4
# jump 108
# label 108
# $108for_iter.2 = iternext(value=$106get_iter.4) :: pair<int64, bool>
# $i.1.660 = pair_first(value=$108for_iter.2) :: int64
# $108for_iter.4 = pair_second(value=$108for_iter.2) :: bool
# del $108for_iter.2
# $phi110.2 = $i.1.660 :: int64
# del $108for_iter.3
# branch $108for_iter.4, 110, 1573
# label 110
# del $108for_iter.4
# i.1 = $i.1.660 :: int64
# del $phi110.2
for i in range(nsubstates):
# --- LINE 394 ---
# $114load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fe732519f70>)) :: type(CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))
# $120binary_subtract.7 = nsubstates - $i.1.660 :: int64
# $const122.8 = const(int, 1) :: Literal[int](1)
# $124binary_subtract.9 = $120binary_subtract.7 - $const122.8 :: int64
# del $const122.8
# del $120binary_subtract.7
# $128call_function.11 = call $push_global_to_block.743($124binary_subtract.9, targets, func=$push_global_to_block.743, args=[Var($124binary_subtract.9, gates.py:394), Var(targets, gates.py:386)], kws=(), vararg=None, target=None) :: (int64, array(int64, 1d, A)) -> int64
# del $124binary_subtract.9
# del $114load_global.4
# t.1 = ig - $128call_function.11 :: int64
# del $128call_function.11
t = ig - multitarget_index(nsubstates - i - 1, targets)
# --- LINE 395 ---
# $134load_global.13 = global(np: <module 'numpy' from '.../numpy/__init__.py'>) :: Module(<module 'numpy' from '.../numpy/__init__.py'>)
# $136load_method.14 = getattr(value=$134load_global.13, attr=dot) :: Function(<function dot at 0x7fe76a92c550>)
# del $134load_global.13
# $a.592.661 = getitem(value=gate, index=$i.1.660, fn=<built-in function getitem>) :: array(complex64, 1d, C)
# del i.1
# $146call_method.19 = call $136load_method.14($142binary_subscr.17, buffer, func=$136load_method.14, args=[Var($142binary_subscr.17, gates.py:395), Var(buffer, gates.py:389)], kws=(), vararg=None, target=None) :: (array(complex64, 1d, C), array(complex64, 1d, C)) -> complex64
# del $142binary_subscr.17
# del $136load_method.14
# state[t.1] = s.610 :: (array(complex64, 1d, A), int64, complex64) -> none
# del t.1
# del $146call_method.19
# jump 108
# label 156
# del ig
# del buffer
# del $phi110.2
# del $phi108.1
# del $108for_iter.4
# jump 1600
state[t] = np.dot(gate[i], buffer)
# --- LINE 396 ---
# label 158
# del targets
# del qubits
# del nsubstates
# del gate
# del $phi24.1
# del $phi22.0
# del $22for_iter.3
# $160return_value.1 = cast(value=state) :: array(complex64, 1d, A)
# del state
# return $160return_value.1
return state
================================================================================
apply_multi_qubit_gate_kernel (array(complex128, 1d, A), array(complex128, 2d, C), array(int32, 1d, A), int64, array(int64, 1d, A))
--------------------------------------------------------------------------------
# File: .../qibojit/src/qibojit/custom_operators/gates.py
# --- LINE 381 ---
@njit(["complex64[:](complex64[:], complex64[:,::1], int32[:], int64, int64[:])",
# --- LINE 382 ---
"complex128[:](complex128[:], complex128[:,::1], int32[:], int64, int64[:])"],
# --- LINE 383 ---
parallel=True,
# --- LINE 384 ---
cache=True)
# --- LINE 385 ---
def apply_multi_qubit_gate_kernel(state, gate, qubits, nstates, targets):
# --- LINE 386 ---
# label 0
# state = arg(0, name=state) :: array(complex128, 1d, A)
# gate = arg(1, name=gate) :: array(complex128, 2d, C)
# qubits = arg(2, name=qubits) :: array(int32, 1d, A)
# nstates = arg(3, name=nstates) :: int64
# targets = arg(4, name=targets) :: array(int64, 1d, A)
# $const2.0 = const(int, 1) :: Literal[int](1)
# $4load_global.1 = global(len: <built-in function len>) :: Function(<built-in function len>)
# $8call_function.3 = targets_size0.815 :: int64
# del $4load_global.1
# nsubstates = $const2.0 << targets_size0.815 :: int64
# del $const2.0
# del $8call_function.3
nsubstates = 1 << len(targets)
# --- LINE 387 ---
# $14load_global.5 = global(prange: <class 'numba.misc.special.prange'>) :: Function(<class 'numba.misc.special.prange'>)
# $18call_function.7 = call $14load_global.5(nstates, func=$14load_global.5, args=[Var(nstates, gates.py:386)], kws=(), vararg=None, target=None) :: (int64,) -> range_state_int64
# del nstates
# del $14load_global.5
# $20get_iter.8 = getiter(value=$18call_function.7) :: range_iter_int64
# del $18call_function.7
# $phi22.0 = $20get_iter.8 :: range_iter_int64
# del $20get_iter.8
# jump 22
# label 22
# $22for_iter.1 = iternext(value=$20get_iter.8) :: pair<int64, bool>
# $22for_iter.2 = pair_first(value=$22for_iter.1) :: int64
# $22for_iter.3 = pair_second(value=$22for_iter.1) :: bool
# del $22for_iter.1
# parfor_index.823 = parfor_index.823 :: uint64
# del $22for_iter.2
# branch $22for_iter.3, 24, 1699
# label 24
# del $22for_iter.3
# g = $parfor__index_823.894 :: int64
# del $phi24.1
for g in prange(nstates): # pylint: disable=not-an-iterable
# --- LINE 388 ---
# $26load_global.2 = global(multicontrol_index: CPUDispatcher(<function multicontrol_index at 0x7fe75d6714c0>)) :: type(CPUDispatcher(<function multicontrol_index at 0x7fe75d6714c0>))
# ig = call $push_global_to_block.906($parfor__index_823.894, qubits, func=$push_global_to_block.906, args=[Var($parfor__index_823.894, <string>:2), Var(qubits, gates.py:386)], kws=(), vararg=None, target=None) :: (int64, array(int32, 1d, A)) -> int64
# del g
# del $26load_global.2
ig = multicontrol_index(g, qubits)
# --- LINE 389 ---
# $36load_global.6 = global(np: <module 'numpy' from '.../numpy/__init__.py'>) :: Module(<module 'numpy' from '.../numpy/__init__.py'>)
# $38load_attr.7 = getattr(value=$36load_global.6, attr=empty) :: Function(<built-in function empty>)
# del $36load_global.6
# $44load_attr.10 = $dtype_var.806 :: dtype(complex128)
# buffer = call $push_getattr_to_block.907(nsubstates, func=$push_getattr_to_block.907, args=[Var(nsubstates, gates.py:386)], kws=[('dtype', Var($dtype_var.806, gates.py:389))], vararg=None, target=None) :: (int64, dtype(complex128)) -> array(complex128, 1d, C)
# del $44load_attr.10
# del $38load_attr.7
buffer = np.empty(nsubstates, dtype=state.dtype)
# --- LINE 390 ---
# $52load_global.13 = global(range: <class 'range'>) :: Function(<class 'range'>)
# $56call_function.15 = call $push_global_to_block.909(nsubstates, func=$push_global_to_block.909, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None) :: (int64,) -> range_state_int64
# del $52load_global.13
# $58get_iter.16 = getiter(value=$56call_function.15) :: range_iter_int64
# del $56call_function.15
# $phi60.1 = $58get_iter.16 :: range_iter_int64
# del $58get_iter.16
# jump 60
# label 60
# $60for_iter.2 = iternext(value=$58get_iter.16) :: pair<int64, bool>
# $i.831 = pair_first(value=$60for_iter.2) :: int64
# $60for_iter.4 = pair_second(value=$60for_iter.2) :: bool
# del $60for_iter.2
# $phi62.2 = $i.831 :: int64
# del $60for_iter.3
# branch $60for_iter.4, 62, 100
# label 62
# del $60for_iter.4
# i = $i.831 :: int64
# del $phi62.2
for i in range(nsubstates):
# --- LINE 391 ---
# $66load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fe732519f70>)) :: type(CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))
# $72binary_subtract.7 = nsubstates - $i.831 :: int64
# $const74.8 = const(int, 1) :: Literal[int](1)
# $76binary_subtract.9 = $72binary_subtract.7 - $const74.8 :: int64
# del $const74.8
# del $72binary_subtract.7
# $80call_function.11 = call $push_global_to_block.912($76binary_subtract.9, targets, func=$push_global_to_block.912, args=[Var($76binary_subtract.9, gates.py:391), Var(targets, gates.py:386)], kws=(), vararg=None, target=None) :: (int64, array(int64, 1d, A)) -> int64
# del $76binary_subtract.9
# del $66load_global.4
# t = ig - $80call_function.11 :: int64
# del $80call_function.11
t = ig - multitarget_index(nsubstates - i - 1, targets)
# --- LINE 392 ---
# $90binary_subscr.15 = getitem(value=state, index=t, fn=<built-in function getitem>) :: complex128
# del t
# buffer[$i.831] = $90binary_subscr.15 :: (array(complex128, 1d, C), int64, complex128) -> none
# del i
# del $90binary_subscr.15
# jump 60
buffer[i] = state[t]
# --- LINE 393 ---
# label 100
# del $phi62.2
# del $phi60.1
# del $60for_iter.4
# $100load_global.1 = global(range: <class 'range'>) :: Function(<class 'range'>)
# $104call_function.3 = call $push_global_to_block.910(nsubstates, func=$push_global_to_block.910, args=[Var(nsubstates, gates.py:386)], kws=(), vararg=None, target=None) :: (int64,) -> range_state_int64
# del $100load_global.1
# $106get_iter.4 = getiter(value=$104call_function.3) :: range_iter_int64
# del $104call_function.3
# $phi108.1 = $106get_iter.4 :: range_iter_int64
# del $106get_iter.4
# jump 108
# label 108
# $108for_iter.2 = iternext(value=$106get_iter.4) :: pair<int64, bool>
# $i.1.829 = pair_first(value=$108for_iter.2) :: int64
# $108for_iter.4 = pair_second(value=$108for_iter.2) :: bool
# del $108for_iter.2
# $phi110.2 = $i.1.829 :: int64
# del $108for_iter.3
# branch $108for_iter.4, 110, 1675
# label 110
# del $108for_iter.4
# i.1 = $i.1.829 :: int64
# del $phi110.2
for i in range(nsubstates):
# --- LINE 394 ---
# $114load_global.4 = global(multitarget_index: CPUDispatcher(<function multitarget_index at 0x7fe732519f70>)) :: type(CPUDispatcher(<function multitarget_index at 0x7fe732519f70>))
# $120binary_subtract.7 = nsubstates - $i.1.829 :: int64
# $const122.8 = const(int, 1) :: Literal[int](1)
# $124binary_subtract.9 = $120binary_subtract.7 - $const122.8 :: int64
# del $const122.8
# del $120binary_subtract.7
# $128call_function.11 = call $push_global_to_block.911($124binary_subtract.9, targets, func=$push_global_to_block.911, args=[Var($124binary_subtract.9, gates.py:394), Var(targets, gates.py:386)], kws=(), vararg=None, target=None) :: (int64, array(int64, 1d, A)) -> int64
# del $124binary_subtract.9
# del $114load_global.4
# t.1 = ig - $128call_function.11 :: int64
# del $128call_function.11
t = ig - multitarget_index(nsubstates - i - 1, targets)
# --- LINE 395 ---
# $134load_global.13 = global(np: <module 'numpy' from '.../numpy/__init__.py'>) :: Module(<module 'numpy' from '.../numpy/__init__.py'>)
# $136load_method.14 = getattr(value=$134load_global.13, attr=dot) :: Function(<function dot at 0x7fe76a92c550>)
# del $134load_global.13
# $a.760.828 = getitem(value=gate, index=$i.1.829, fn=<built-in function getitem>) :: array(complex128, 1d, C)
# del i.1
# $146call_method.19 = call $136load_method.14($142binary_subscr.17, buffer, func=$136load_method.14, args=[Var($142binary_subscr.17, gates.py:395), Var(buffer, gates.py:389)], kws=(), vararg=None, target=None) :: (array(complex128, 1d, C), array(complex128, 1d, C)) -> complex128
# del $142binary_subscr.17
# del $136load_method.14
# state[t.1] = s.778 :: (array(complex128, 1d, A), int64, complex128) -> none
# del t.1
# del $146call_method.19
# jump 108
# label 156
# del ig
# del buffer
# del $phi110.2
# del $phi108.1
# del $108for_iter.4
# jump 1702
state[t] = np.dot(gate[i], buffer)
# --- LINE 396 ---
# label 158
# del targets
# del qubits
# del nsubstates
# del gate
# del $phi24.1
# del $phi22.0
# del $22for_iter.3
# $160return_value.1 = cast(value=state) :: array(complex128, 1d, A)
# del state
# return $160return_value.1
return state
================================================================================
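The dump above interleaves the kernel's Python source with Numba's typed IR. For readers who just want the algorithm, here is a pure-NumPy sketch of the same gather/dot/scatter structure. It is an illustrative reimplementation, not the qibojit code path: control qubits are omitted, the bit convention is an assumption made for the example, and the index arithmetic here stands in for what `multicontrol_index`/`multitarget_index` compute in the real kernel.

```python
import numpy as np

def apply_multi_target_gate(state, gate, targets, nqubits):
    # Sketch of the kernel's gather/dot/scatter structure (no controls).
    # Assumed convention: qubit q corresponds to bit (nqubits - 1 - q).
    ntargets = len(targets)
    nsubstates = 1 << ntargets
    bits = [nqubits - 1 - q for q in targets]            # bit positions of targets
    other = [b for b in range(nqubits) if b not in bits]  # non-target bit positions
    state = state.copy()
    for g in range(1 << (nqubits - ntargets)):
        # distribute the bits of g over the non-target positions
        base = 0
        for k, b in enumerate(other):
            base |= ((g >> k) & 1) << b
        # indices of the 2**ntargets amplitudes touched by this gate block
        idx = np.array([base | sum((((i >> (ntargets - 1 - j)) & 1) << bits[j])
                                   for j in range(ntargets))
                        for i in range(nsubstates)])
        # gather into a buffer, apply the gate, scatter back
        state[idx] = gate @ state[idx]
    return state
```

The fancy-indexed assignment plays the role of the explicit `buffer` in the kernel; the real kernel parallelizes the outer loop with `prange`.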
from qibojit.
Ok, then at this point we have to check the argument types before and after the kernel execution. Btw, what happens if we disable double precision and execute in single precision? Does the code complain or run?
Something I forgot to mention above is that for the latest numbers I used the fixcast
branch, however I do not think there would be any difference even if I used main.
Ok, then at this point we have to check the argument types before and after the kernel execution.
I always check the state vector type after execution and it has the correct type. The gate matrix must be passed in a compatible precision because of the signature; if I pass a complex128 matrix with a complex64 state, it raises an error. So at least for inputs and outputs the types are correct.
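A check of the argument types before and after the kernel execution could be wrapped in a small helper like the following. This is a hypothetical debugging utility, not part of qibojit; `kernel` stands for any of the dispatchers shown above.

```python
import numpy as np

def check_kernel_dtypes(kernel, state, gate, qubits, nstates, targets):
    # Hypothetical debugging wrapper: assert that the inputs agree in
    # precision and that the kernel preserves the state precision.
    assert state.dtype == gate.dtype, "state/gate precision mismatch"
    out = kernel(state, gate, qubits, nstates, targets)
    assert out.dtype == state.dtype, "kernel changed the state precision"
    return out
```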
Btw, what happens if we disable double precision and execute in single precision? Does the code complain or run?
What exactly do you mean by disable? Remove the double precision signature?
What exactly do you mean by disable? Remove the double precision signature?
yes, to make sure that we are passing single precision numbers.
If this test passes, then the debugging should start by pruning the kernel until we find the operation which does not perform well.
If I remove the double precision signature, the code works fine with single precision but crashes with double:
`TypeError: No matching definition for argument type(s) array(complex128, 1d, C), array(complex128, 2d, C), array(int32, 1d, C), int64, array(int64, 1d, C)`
as expected.
Does the performance match the previous single precision runs?
I believe there's no difference in the execution time (performance is a bit unstable on my laptop, but the times oscillate in the same range).
I'd like to add some additional results. I think it's a bit difficult to solve this issue because we don't have a baseline that shows what the optimal result would be. For example, I tried to run some benchmarks (the same as in issue #51) with qibo's numpy backend:
Are these results expected?
I think it may be interesting to run the same benchmark with all our backends that support multiqubit gates (numpy, tensorflow, qibojit), on both CPU and GPU, to get a general overview of the problem.
In theory I would say no; however, this means that either (a) the computation is not the bottleneck, or (b) for multiqubit gates there is pre- or post-processing done in double precision. If you can, please profile the code to exclude (a).
I tried to profile my benchmark code with cProfile (configuration: 25 qubits, 6 targets, NumPy backend).
I see that nearly all the simulation time is spent in numpy.core._multiarray_umath.c_einsum, which is called by NumPy's einsum. Time per call: 6.205 s (single), 6.29 s (double).
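The profiling step can be reproduced along these lines. The gate-application function here is a stand-in (a generic tensordot-based contraction with an assumed axis convention, not qibo's actual einsum call), sized down to 18 qubits so it runs quickly:

```python
import cProfile
import io
import pstats
import numpy as np

def apply_gate_tensordot(state, gate, targets, nqubits):
    # Stand-in for the numpy backend's einsum-based gate application:
    # one axis per qubit, contract the gate over the target axes.
    ntargets = len(targets)
    state = state.reshape(nqubits * (2,))
    gate = gate.reshape(2 * ntargets * (2,))
    out = np.tensordot(gate, state,
                       axes=(range(ntargets, 2 * ntargets), targets))
    return np.moveaxis(out, range(ntargets), targets).ravel()

state = (np.random.rand(2**18) + 1j * np.random.rand(2**18)).astype(np.complex64)
gate = np.random.rand(64, 64).astype(np.complex64)

profiler = cProfile.Profile()
profiler.enable()
result = apply_gate_tensordot(state, gate, (0, 1, 2, 3, 4, 5), 18)
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())  # top entries show where the contraction time goes
```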
Would be great to repeat this exercise with qibojit.
I compared single vs double precision performance across different hardware and backends.
Generally speaking, we always see a consistent speed-up with single precision over double when we use a GPU; on CPU, instead, the behavior varies.
In particular, with NumPy we don't see consistent improvements, with TensorFlow we see a modest speed-up, and with qibojit we see a speed-up only for ntargets < 3.
EDIT: I added the results for Qiskit, which show a moderate speed-up of single precision w.r.t. double, at least on average.
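As a sanity check of the raw linear-algebra cost, one can time the dense matvec at the core of the kernel in both precisions. This is a micro-benchmark sketch under assumed sizes, not the full circuit benchmark above:

```python
import timeit
import numpy as np

def bench_matvec(dtype, ntargets=6, nblocks=4096, repeat=3):
    # Time nblocks applications of a 2**ntargets dense gate block,
    # i.e. the np.dot in the kernel's inner loop, in the given precision.
    rng = np.random.default_rng(0)
    n = 1 << ntargets
    gate = rng.standard_normal((n, n)).astype(dtype)
    buf = rng.standard_normal(n).astype(dtype)
    times = timeit.repeat(lambda: gate @ buf, number=nblocks, repeat=repeat)
    return min(times)

t_single = bench_matvec(np.complex64)
t_double = bench_matvec(np.complex128)
```

The ratio `t_double / t_single` then shows whether the plain BLAS matvec alone benefits from single precision on a given CPU, independently of any pre- or post-processing.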
@mlazzarin thank you very much. This looks quite consistent from the qibojit side. How many repetitions are you using for these numbers?
20 for NQUBITS < 21, 10 for 21 <= NQUBITS <= 26, 3 for NQUBITS > 26.
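The repetition schedule just described can be written down as a one-liner helper (hypothetical, for reference):

```python
def nreps(nqubits):
    # Repetition schedule used for the benchmark numbers above.
    if nqubits < 21:
        return 20
    if nqubits <= 26:
        return 10
    return 3
```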
Ok, good. Then going back to #53, I assume there is a noticeable difference between us and qiskit in terms of algorithm.