accl's People

Contributors

bo3z, danieleparravicini, hpc-ken, mar-ven, mellich, pedroohr, preusser, quetric, tobi-alonso, tristanlaan

accl's Issues

Broadcast segmentation fails when root != 0

The ACCL XRT test suite fails on the bcast tests with root 1 when the count is large.
ACCL was synthesized with the TCP stack and the UDP stack from dev at 36eebbb. All other tests passed.
The error appears in both the TCP and the UDP version.

Setup:

  • 2x Alveo U280, directly connected
  • XRT: 2.12
  • Platform: xilinx_u280_xdma_201920_3
  • Vitis: 2021.2
  • Tests using streams are commented out to prevent deadlocks

Execution:
TCP version: mpirun -n 2 bin/test -f -x the.xclbin -t -b 2 -s 512
UDP version: mpirun -n 2 bin/test -f -x the.xclbin -u -b 2 -s 512
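
The failing runs print one line per mismatching element, so it can help to condense the log before reading it. The following is a minimal sketch, assuming the line format seen in the output of this report (`[job,rank]<stdout>:Nth item is incorrect! (a != b)` and `Start <test> ...`). The function name `summarize` and the regular expressions are hypothetical helpers and not part of the ACCL test suite.

```python
import re
from collections import defaultdict

# Assumed log format, taken from the output in this report:
#   [1,0]<stdout>:Start bcast test with root 1 ...
#   [1,0]<stdout>:257th item is incorrect! (629.447388 != -480.259155)
START_RE = re.compile(r"\[\d+,(?P<rank>\d+)\]<stdout>:Start (?P<test>.+?) \.\.\.")
ITEM_RE = re.compile(
    r"\[\d+,(?P<rank>\d+)\]<stdout>:(?P<idx>\d+)(?:st|nd|rd|th) item is incorrect!"
)


def summarize(lines):
    """Group mismatch lines by (rank, test) and return
    {(rank, test): (first_index, last_index, number_of_lines)}."""
    current = {}                  # rank -> test announced most recently on that rank
    indices = defaultdict(list)   # (rank, test) -> reported item numbers
    for line in lines:
        m = START_RE.search(line)
        if m:
            current[int(m["rank"])] = m["test"]
            continue
        m = ITEM_RE.search(line)
        if m:
            rank = int(m["rank"])
            indices[(rank, current.get(rank, "unknown"))].append(int(m["idx"]))
    return {key: (min(v), max(v), len(v)) for key, v in indices.items()}


if __name__ == "__main__":
    # A few lines copied from the report; a real run would read the full log.
    sample = [
        "[1,0]<stdout>:Start bcast test with root 1 ...",
        "[1,0]<stdout>:257th item is incorrect! (629.447388 != -480.259155)",
        "[1,0]<stdout>:258th item is incorrect! (-729.046021 != -909.880920)",
        "[1,0]<stdout>:512th item is incorrect! (711.750244 != -659.695435)",
    ]
    for (rank, test), (first, last, count) in summarize(sample).items():
        print(f"rank {rank}, {test}: items {first}..{last} ({count} lines)")
```

Applied to the full log of a run with `-s 512`, a summary like this makes it easy to see whether the mismatches are confined to one contiguous range of the buffer, which is the kind of pattern a segmentation problem would be expected to produce.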

Output for the two failing tests:

[1,1]<stdout>:Start bcast test with root 1 ...
[1,0]<stdout>:Start bcast test with root 1 ...
[1,0]<stdout>:257th item is incorrect! (629.447388 != -480.259155)
[1,0]<stdout>:258th item is incorrect! (-729.046021 != -909.880920)
[1,0]<stdout>:259th item is incorrect! (811.583862 != 600.136963)
[1,0]<stdout>:260th item is incorrect! (670.017090 != 320.238892)
[1,0]<stdout>:261th item is incorrect! (-746.026367 != -137.172363)
[1,0]<stdout>:262th item is incorrect! (937.735596 != 499.881958)
[1,0]<stdout>:263th item is incorrect! (826.751709 != 821.295166)
[1,0]<stdout>:264th item is incorrect! (-557.931885 != -734.007935)
[1,0]<stdout>:265th item is incorrect! (264.718506 != -636.305969)
[1,0]<stdout>:266th item is incorrect! (-383.665894 != 964.721069)
[1,0]<stdout>:267th item is incorrect! (-804.919189 != -472.394165)
[1,0]<stdout>:268th item is incorrect! (94.441162 != -809.289673)
[1,0]<stdout>:269th item is incorrect! (-443.003540 != -708.921997)
[1,0]<stdout>:270th item is incorrect! (-623.236084 != -434.653381)
[1,0]<stdout>:271th item is incorrect! (93.762939 != -727.862915)
[1,0]<stdout>:272th item is incorrect! (985.762573 != 604.222900)
[1,0]<stdout>:273th item is incorrect! (915.013672 != 738.584351)
[1,0]<stdout>:274th item is incorrect! (992.922607 != -844.885925)
[1,0]<stdout>:275th item is incorrect! (929.776978 != 159.409180)
[1,0]<stdout>:276th item is incorrect! (935.389893 != 254.768677)
[1,0]<stdout>:277th item is incorrect! (-684.773804 != 99.720337)
[1,0]<stdout>:278th item is incorrect! (451.677979 != -983.812195)
[1,0]<stdout>:279th item is incorrect! (941.185547 != -710.090393)
[1,0]<stdout>:280th item is incorrect! (962.219360 != 360.574097)
[1,0]<stdout>:281th item is incorrect! (914.333984 != 706.062256)
[1,0]<stdout>:282th item is incorrect! (-780.276489 != 67.866211)
[1,0]<stdout>:283th item is incorrect! (-29.248718 != 244.110229)
[1,0]<stdout>:284th item is incorrect! (596.211670 != -122.666382)
[1,0]<stdout>:285th item is incorrect! (600.560913 != -298.095215)
[1,0]<stdout>:286th item is incorrect! (-405.941101 != -600.897583)
[1,0]<stdout>:287th item is incorrect! (-716.227295 != 26.499023)
[1,0]<stdout>:288th item is incorrect! (-990.433044 != -723.997314)
[1,0]<stdout>:289th item is incorrect! (-156.477478 != -196.383972)
[1,0]<stdout>:290th item is incorrect! (-775.070984 != -235.334106)
[1,0]<stdout>:291th item is incorrect! (831.471069 != -848.066650)
[1,0]<stdout>:292th item is incorrect! (279.526733 != 524.842651)
[1,0]<stdout>:293th item is incorrect! (584.414551 != -520.167664)
[1,0]<stdout>:294th item is incorrect! (756.861328 != -919.057800)
[1,0]<stdout>:295th item is incorrect! (918.984863 != -753.362122)
[1,0]<stdout>:296th item is incorrect! (7.325439 != -494.088043)
[1,0]<stdout>:297th item is incorrect! (311.481323 != -632.184448)
[1,0]<stdout>:298th item is incorrect! (595.857300 != 9.541992)
[1,0]<stdout>:299th item is incorrect! (-928.576660 != -520.094971)
[1,0]<stdout>:300th item is incorrect! (-277.411987 != 645.209961)
[1,0]<stdout>:301th item is incorrect! (698.258667 != -165.465881)
[1,0]<stdout>:302th item is incorrect! (-576.151367 != 963.446289)
[1,0]<stdout>:303th item is incorrect! (867.986450 != -900.691162)
[1,0]<stdout>:304th item is incorrect! (362.719116 != 646.910767)
[1,0]<stdout>:305th item is incorrect! (357.470215 != 805.432251)
[1,0]<stdout>:306th item is incorrect! (-202.522949 != -396.345398)
[1,0]<stdout>:307th item is incorrect! (515.480225 != 889.574463)
[1,0]<stdout>:308th item is incorrect! (481.294556 != -904.111450)
[1,0]<stdout>:309th item is incorrect! (486.264893 != -18.271790)
[1,0]<stdout>:310th item is incorrect! (-50.482605 != -504.304810)
[1,0]<stdout>:311th item is incorrect! (-215.545959 != -21.494751)
[1,0]<stdout>:312th item is incorrect! (-155.824646 != 88.112183)
[1,0]<stdout>:313th item is incorrect! (310.955811 != -324.561157)
[1,0]<stdout>:314th item is incorrect! (-652.269653 != 775.452271)
[1,0]<stdout>:315th item is incorrect! (-657.626587 != 800.107666)
[1,0]<stdout>:316th item is incorrect! (-396.173767 != -312.320923)
[1,0]<stdout>:317th item is incorrect! (412.092163 != -261.506409)
[1,0]<stdout>:318th item is incorrect! (594.559814 != -933.462891)
[1,0]<stdout>:319th item is incorrect! (-936.334290 != -777.594482)
[1,0]<stdout>:320th item is incorrect! (-366.899109 != -674.255737)
[1,0]<stdout>:321th item is incorrect! (-446.154053 != 560.504028)
[1,0]<stdout>:322th item is incorrect! (744.857666 != 754.727783)
[1,0]<stdout>:323th item is incorrect! (-907.657227 != -220.522339)
[1,0]<stdout>:324th item is incorrect! (-701.772034 != -579.396240)
[1,0]<stdout>:325th item is incorrect! (-805.736450 != -516.617432)
[1,0]<stdout>:326th item is incorrect! (988.136963 != -454.493958)
[1,0]<stdout>:327th item is incorrect! (646.915649 != -192.175720)
[1,0]<stdout>:328th item is incorrect! (643.806519 != -15.116028)
[1,0]<stdout>:329th item is incorrect! (389.657227 != -807.090942)
[1,0]<stdout>:330th item is incorrect! (-749.634460 != -553.160095)
[1,0]<stdout>:331th item is incorrect! (-365.801025 != -736.053406)
[1,0]<stdout>:332th item is incorrect! (527.500000 != -19.397034)
[1,0]<stdout>:333th item is incorrect! (900.444092 != 884.101196)
[1,0]<stdout>:334th item is incorrect! (-18.821899 != 909.886719)
[1,0]<stdout>:335th item is incorrect! (-931.107849 != 912.269165)
[1,0]<stdout>:336th item is incorrect! (327.211060 != 303.937256)
[1,0]<stdout>:337th item is incorrect! (-122.511292 != 150.417236)
[1,0]<stdout>:338th item is incorrect! (-748.206726 != 515.007202)
[1,0]<stdout>:339th item is incorrect! (-236.883118 != -880.440918)
[1,0]<stdout>:340th item is incorrect! (-579.581848 != -129.508606)
[1,0]<stdout>:341th item is incorrect! (531.033691 != -530.440186)
[1,0]<stdout>:342th item is incorrect! (-897.567139 != 105.762329)
[1,0]<stdout>:343th item is incorrect! (590.399902 != -293.682861)
[1,0]<stdout>:344th item is incorrect! (-927.117493 != -893.694885)
[1,0]<stdout>:345th item is incorrect! (-626.254761 != 642.388062)
[1,0]<stdout>:346th item is incorrect! (-182.537659 != -354.978516)
[1,0]<stdout>:347th item is incorrect! (-20.471191 != -969.193115)
[1,0]<stdout>:348th item is incorrect! (-84.021667 != -190.037231)
[1,0]<stdout>:349th item is incorrect! (-108.827576 != -913.952393)
[1,0]<stdout>:350th item is incorrect! (-24.862183 != 816.867188)
[1,0]<stdout>:351th item is incorrect! (292.625977 != -662.019897)
[1,0]<stdout>:352th item is incorrect! (587.949951 != 618.274414)
[1,0]<stdout>:353th item is incorrect! (418.729614 != 298.230957)
[1,0]<stdout>:354th item is incorrect! (841.749512 != -483.408203)
[1,0]<stdout>:355th item is incorrect! (509.373291 != 463.444824)
[1,0]<stdout>:356th item is incorrect! (615.062012 != -740.306946)
[1,0]<stdout>:357th item is incorrect! (-447.949829 != 295.491943)
[1,0]<stdout>:358th item is incorrect! (411.548462 != -13.346313)
[1,0]<stdout>:359th item is incorrect! (359.405396 != -98.152588)
[1,0]<stdout>:360th item is incorrect! (-994.363159 != -242.999146)
[1,0]<stdout>:361th item is incorrect! (310.196045 != 94.017700)
[1,0]<stdout>:362th item is incorrect! (421.407715 != 436.939941)
[1,0]<stdout>:363th item is incorrect! (-674.776489 != -407.358398)
[1,0]<stdout>:364th item is incorrect! (287.921875 != -424.390015)
[1,0]<stdout>:365th item is incorrect! (-762.004639 != 489.385620)
[1,0]<stdout>:366th item is incorrect! (-87.934387 != 246.871094)
[1,0]<stdout>:367th item is incorrect! (-3.271851 != -622.089966)
[1,0]<stdout>:368th item is incorrect! (547.834229 != 623.747925)
[1,0]<stdout>:369th item is incorrect! (919.487915 != 373.550903)
[1,0]<stdout>:370th item is incorrect! (147.509277 != -374.984009)
[1,0]<stdout>:371th item is incorrect! (-319.228516 != -632.977661)
[1,0]<stdout>:372th item is incorrect! (753.514893 != -228.578979)
[1,0]<stdout>:373th item is incorrect! (170.535400 != -263.030823)
[1,0]<stdout>:374th item is incorrect! (616.350952 != -312.140503)
[1,0]<stdout>:375th item is incorrect! (-552.376099 != 251.237183)
[1,0]<stdout>:376th item is incorrect! (-964.452209 != 631.538208)
[1,0]<stdout>:377th item is incorrect! (502.534180 != 560.454834)
[1,0]<stdout>:378th item is incorrect! (642.491943 != 338.570190)
[1,0]<stdout>:379th item is incorrect! (-489.809753 != -837.748474)
[1,0]<stdout>:380th item is incorrect! (641.681519 != -241.162292)
[1,0]<stdout>:381th item is incorrect! (11.914124 != 858.771973)
[1,0]<stdout>:382th item is incorrect! (880.148071 != -83.006226)
[1,0]<stdout>:383th item is incorrect! (398.153442 != 551.425293)
[1,0]<stdout>:384th item is incorrect! (-174.666931 != -392.772095)
[1,0]<stdout>:385th item is incorrect! (781.806519 != -26.416748)
[1,0]<stdout>:386th item is incorrect! (-153.669800 != 821.129883)
[1,0]<stdout>:387th item is incorrect! (918.582764 != -128.282837)
[1,0]<stdout>:388th item is incorrect! (161.913452 != -22.764587)
[1,0]<stdout>:389th item is incorrect! (94.431030 != -106.432495)
[1,0]<stdout>:390th item is incorrect! (-683.884827 != 511.579712)
[1,0]<stdout>:391th item is incorrect! (-722.751099 != -387.301025)
[1,0]<stdout>:392th item is incorrect! (523.462402 != -783.876160)
[1,0]<stdout>:393th item is incorrect! (-701.411987 != 17.017334)
[1,0]<stdout>:394th item is incorrect! (-539.687866 != -212.820312)
[1,0]<stdout>:395th item is incorrect! (-484.983521 != 21.543152)
[1,0]<stdout>:396th item is incorrect! (619.468994 != 740.373535)
[1,0]<stdout>:397th item is incorrect! (681.434570 != 635.255493)
[1,0]<stdout>:398th item is incorrect! (977.043213 != -278.278564)
[1,0]<stdout>:399th item is incorrect! (-491.435638 != 589.662842)
[1,0]<stdout>:400th item is incorrect! (-335.103455 != 834.235840)
[1,0]<stdout>:401th item is incorrect! (628.569580 != 288.636230)
[1,0]<stdout>:402th item is incorrect! (-400.336548 != 87.610962)
[1,0]<stdout>:403th item is incorrect! (-512.950073 != -242.781250)
[1,0]<stdout>:404th item is incorrect! (-972.921753 != -719.712280)
[1,0]<stdout>:405th item is incorrect! (858.527344 != 623.161011)
[1,0]<stdout>:406th item is incorrect! (-565.524292 != -600.254272)
[1,0]<stdout>:407th item is incorrect! (-300.032471 != 65.651123)
[1,0]<stdout>:408th item is incorrect! (814.729492 != 897.850098)
[1,0]<stdout>:409th item is incorrect! (-606.809509 != -298.545776)
[1,0]<stdout>:410th item is incorrect! (696.935547 != 980.219971)
[1,0]<stdout>:411th item is incorrect! (-497.832306 != 878.003174)
[1,0]<stdout>:412th item is incorrect! (910.035156 != -519.848145)
[1,0]<stdout>:413th item is incorrect! (232.089355 != 751.885620)
[1,0]<stdout>:414th item is incorrect! (557.795410 != -966.958862)
[1,0]<stdout>:415th item is incorrect! (-53.422302 != 100.312744)
[1,0]<stdout>:416th item is incorrect! (974.919189 != -222.769653)
[1,0]<stdout>:417th item is incorrect! (-296.680969 != 244.950195)
[1,0]<stdout>:418th item is incorrect! (-864.809265 != 559.378784)
[1,0]<stdout>:419th item is incorrect! (661.657227 != 174.089478)
[1,0]<stdout>:420th item is incorrect! (587.195190 != -46.723816)
[1,0]<stdout>:421th item is incorrect! (170.528198 != -584.515381)
[1,0]<stdout>:422th item is incorrect! (189.007202 != 80.276001)
[1,0]<stdout>:423th item is incorrect! (99.447266 != -397.507324)
[1,0]<stdout>:424th item is incorrect! (465.597412 != -963.098328)
[1,0]<stdout>:425th item is incorrect! (834.387329 != -58.153320)
[1,0]<stdout>:426th item is incorrect! (390.465698 != 800.366455)
[1,0]<stdout>:427th item is incorrect! (-428.321960 != -539.023682)
[1,0]<stdout>:428th item is incorrect! (359.639526 != -611.009277)
[1,0]<stdout>:429th item is incorrect! (514.400513 != 688.617554)
[1,0]<stdout>:430th item is incorrect! (-215.359070 != 771.996216)
[1,0]<stdout>:431th item is incorrect! (507.458252 != -610.471436)
[1,0]<stdout>:432th item is incorrect! (123.114990 != -117.553040)
[1,0]<stdout>:433th item is incorrect! (-239.108337 != -548.156433)
[1,0]<stdout>:434th item is incorrect! (-583.863892 != -704.341980)
[1,0]<stdout>:435th item is incorrect! (135.643188 != -658.583923)
[1,0]<stdout>:436th item is incorrect! (54.742920 != -520.995117)
[1,0]<stdout>:437th item is incorrect! (-848.291382 != -544.671387)
[1,0]<stdout>:438th item is incorrect! (-191.583008 != 599.306396)
[1,0]<stdout>:439th item is incorrect! (-892.099731 != -128.602600)
[1,0]<stdout>:440th item is incorrect! (-294.475220 != -53.969849)
[1,0]<stdout>:441th item is incorrect! (61.595093 != -377.795410)
[1,0]<stdout>:442th item is incorrect! (185.647705 != -820.353699)
[1,0]<stdout>:443th item is incorrect! (558.334473 != 846.759277)
[1,0]<stdout>:444th item is incorrect! (-287.309692 != 289.101074)
[1,0]<stdout>:445th item is incorrect! (868.021362 != -139.585205)
[1,0]<stdout>:446th item is incorrect! (929.932739 != 266.127319)
[1,0]<stdout>:447th item is incorrect! (-740.187622 != -630.367371)
[1,0]<stdout>:448th item is incorrect! (-691.123169 != 168.764526)
[1,0]<stdout>:449th item is incorrect! (137.647217 != 809.761841)
[1,0]<stdout>:450th item is incorrect! (-210.183533 != 453.308838)
[1,0]<stdout>:451th item is incorrect! (-61.218750 != 959.496704)
[1,0]<stdout>:452th item is incorrect! (-225.408203 != -290.723755)
[1,0]<stdout>:453th item is incorrect! (-976.195862 != -122.260010)
[1,0]<stdout>:454th item is incorrect! (453.909424 != 360.813232)
[1,0]<stdout>:455th item is incorrect! (-325.754700 != -777.761597)
[1,0]<stdout>:456th item is incorrect! (-222.860413 != 414.643066)
[1,0]<stdout>:457th item is incorrect! (-675.635376 != -483.870605)
[1,0]<stdout>:458th item is incorrect! (854.985718 != -675.343018)
[1,0]<stdout>:459th item is incorrect! (588.569092 != -182.560303)
[1,0]<stdout>:460th item is incorrect! (-127.764893 != -732.527222)
[1,0]<stdout>:461th item is incorrect! (-377.569885 != 189.792114)
[1,0]<stdout>:462th item is incorrect! (725.356323 != -100.887817)
[1,0]<stdout>:463th item is incorrect! (57.066284 != -475.576538)
[1,0]<stdout>:464th item is incorrect! (240.720093 != -915.891724)
[1,0]<stdout>:465th item is incorrect! (-668.702515 != 205.686157)
[1,0]<stdout>:466th item is incorrect! (-760.905640 != 594.728271)
[1,0]<stdout>:467th item is incorrect! (203.963867 != 422.431641)
[1,0]<stdout>:468th item is incorrect! (-56.086426 != -664.888428)
[1,0]<stdout>:469th item is incorrect! (-474.057434 != -556.506531)
[1,0]<stdout>:470th item is incorrect! (-319.560608 != 662.428589)
[1,0]<stdout>:471th item is incorrect! (308.158203 != -765.164673)
[1,0]<stdout>:472th item is incorrect! (59.683960 != -350.072937)
[1,0]<stdout>:473th item is incorrect! (378.429077 != -406.648254)
[1,0]<stdout>:474th item is incorrect! (432.201416 != 311.559814)
[1,0]<stdout>:475th item is incorrect! (496.303223 != -362.443359)
[1,0]<stdout>:476th item is incorrect! (976.758789 != -159.620117)
[1,0]<stdout>:477th item is incorrect! (-98.916809 != -151.666443)
[1,0]<stdout>:478th item is incorrect! (440.986816 != 563.818359)
[1,0]<stdout>:479th item is incorrect! (-832.357239 != 15.716553)
[1,0]<stdout>:480th item is incorrect! (825.155029 != -773.614929)
[1,0]<stdout>:481th item is incorrect! (-542.046082 != -828.968384)
[1,0]<stdout>:482th item is incorrect! (10.997070 != 987.069458)
[1,0]<stdout>:483th item is incorrect! (826.674683 != -475.035522)
[1,0]<stdout>:484th item is incorrect! (116.537598 != -636.854065)
[1,0]<stdout>:485th item is incorrect! (-695.243958 != 602.029175)
[1,0]<stdout>:486th item is incorrect! (6.380066 != 525.862793)
[1,0]<stdout>:487th item is incorrect! (651.634033 != -941.559448)
[1,0]<stdout>:488th item is incorrect! (-75.051636 != -468.158691)
[1,0]<stdout>:489th item is incorrect! (76.684814 != 857.708374)
[1,0]<stdout>:490th item is incorrect! (93.183838 != -511.723450)
[1,0]<stdout>:491th item is incorrect! (992.269409 != 460.661743)
[1,0]<stdout>:492th item is incorrect! (-104.831238 != -798.523010)
[1,0]<stdout>:493th item is incorrect! (-843.648926 != -22.782043)
[1,0]<stdout>:494th item is incorrect! (708.901978 != -311.344116)
[1,0]<stdout>:495th item is incorrect! (-114.643433 != 157.050171)
[1,0]<stdout>:496th item is incorrect! (208.463013 != -436.745361)
[1,0]<stdout>:497th item is incorrect! (-786.694458 != -525.432861)
[1,0]<stdout>:498th item is incorrect! (-2.911621 != 937.280762)
[1,0]<stdout>:499th item is incorrect! (923.796143 != -82.302307)
[1,0]<stdout>:500th item is incorrect! (959.851318 != -572.454468)
[1,0]<stdout>:501th item is incorrect! (-990.731567 != 926.177002)
[1,0]<stdout>:502th item is incorrect! (-931.365356 != 211.856323)
[1,0]<stdout>:503th item is incorrect! (549.820923 != 93.611450)
[1,0]<stdout>:504th item is incorrect! (954.004028 != -546.897339)
[1,0]<stdout>:505th item is incorrect! (634.606445 != 42.271606)
[1,0]<stdout>:506th item is incorrect! (-273.627075 != -620.189026)
[1,0]<stdout>:507th item is incorrect! (737.389404 != -536.811218)
[1,0]<stdout>:508th item is incorrect! (359.039429 != -196.857056)
[1,0]<stdout>:509th item is incorrect! (-831.128296 != -22.204529)
[1,0]<stdout>:510th item is incorrect! (-307.533203 != -173.418762)
[1,0]<stdout>:511th item is incorrect! (-200.434692 != 248.120239)
[1,0]<stdout>:512th item is incorrect! (711.750244 != -659.695435)
[1,0]<stdout>:256 errors!
[1,1]<stdout>:Start bcast compression test with root 1 ...
[1,0]<stdout>:Start bcast compression test with root 1 ...
[1,0]<stdout>:257th item is incorrect! (629.500000 != -480.259155)
[1,0]<stdout>:258th item is incorrect! (-729.000000 != -909.880920)
[1,0]<stdout>:259th item is incorrect! (811.500000 != 600.136963)
[1,0]<stdout>:260th item is incorrect! (670.000000 != 320.238892)
[1,0]<stdout>:261th item is incorrect! (-746.000000 != -137.172363)
[1,0]<stdout>:262th item is incorrect! (937.500000 != 499.881958)
[1,0]<stdout>:263th item is incorrect! (827.000000 != 821.295166)
[1,0]<stdout>:264th item is incorrect! (-558.000000 != -734.007935)
[1,0]<stdout>:265th item is incorrect! (264.750000 != -636.305969)
[1,0]<stdout>:266th item is incorrect! (-383.750000 != 964.721069)
[1,0]<stdout>:267th item is incorrect! (-805.000000 != -472.394165)
[1,0]<stdout>:268th item is incorrect! (94.437500 != -809.289673)
[1,0]<stdout>:269th item is incorrect! (-443.000000 != -708.921997)
[1,0]<stdout>:270th item is incorrect! (-623.000000 != -434.653381)
[1,0]<stdout>:271th item is incorrect! (93.750000 != -727.862915)
[1,0]<stdout>:272th item is incorrect! (986.000000 != 604.222900)
[1,0]<stdout>:273th item is incorrect! (915.000000 != 738.584351)
[1,0]<stdout>:274th item is incorrect! (993.000000 != -844.885925)
[1,0]<stdout>:275th item is incorrect! (930.000000 != 159.409180)
[1,0]<stdout>:276th item is incorrect! (935.500000 != 254.768677)
[1,0]<stdout>:277th item is incorrect! (-685.000000 != 99.720337)
[1,0]<stdout>:278th item is incorrect! (451.750000 != -983.812195)
[1,0]<stdout>:279th item is incorrect! (941.000000 != -710.090393)
[1,0]<stdout>:280th item is incorrect! (962.000000 != 360.574097)
[1,0]<stdout>:281th item is incorrect! (914.500000 != 706.062256)
[1,0]<stdout>:282th item is incorrect! (-780.500000 != 67.866211)
[1,0]<stdout>:283th item is incorrect! (-29.250000 != 244.110229)
[1,0]<stdout>:284th item is incorrect! (596.000000 != -122.666382)
[1,0]<stdout>:285th item is incorrect! (600.500000 != -298.095215)
[1,0]<stdout>:286th item is incorrect! (-406.000000 != -600.897583)
[1,0]<stdout>:287th item is incorrect! (-716.000000 != 26.499023)
[1,0]<stdout>:288th item is incorrect! (-990.500000 != -723.997314)
[1,0]<stdout>:289th item is incorrect! (-156.500000 != -196.383972)
[1,0]<stdout>:290th item is incorrect! (-775.000000 != -235.334106)
[1,0]<stdout>:291th item is incorrect! (831.500000 != -848.066650)
[1,0]<stdout>:292th item is incorrect! (279.500000 != 524.842651)
[1,0]<stdout>:293th item is incorrect! (584.500000 != -520.167664)
[1,0]<stdout>:294th item is incorrect! (757.000000 != -919.057800)
[1,0]<stdout>:295th item is incorrect! (919.000000 != -753.362122)
[1,0]<stdout>:296th item is incorrect! (7.324219 != -494.088043)
[1,0]<stdout>:297th item is incorrect! (311.500000 != -632.184448)
[1,0]<stdout>:298th item is incorrect! (596.000000 != 9.541992)
[1,0]<stdout>:299th item is incorrect! (-928.500000 != -520.094971)
[1,0]<stdout>:300th item is incorrect! (-277.500000 != 645.209961)
[1,0]<stdout>:301th item is incorrect! (698.500000 != -165.465881)
[1,0]<stdout>:302th item is incorrect! (-576.000000 != 963.446289)
[1,0]<stdout>:303th item is incorrect! (868.000000 != -900.691162)
[1,0]<stdout>:304th item is incorrect! (362.750000 != 646.910767)
[1,0]<stdout>:305th item is incorrect! (357.500000 != 805.432251)
[1,0]<stdout>:306th item is incorrect! (-202.500000 != -396.345398)
[1,0]<stdout>:307th item is incorrect! (515.500000 != 889.574463)
[1,0]<stdout>:308th item is incorrect! (481.250000 != -904.111450)
[1,0]<stdout>:309th item is incorrect! (486.250000 != -18.271790)
[1,0]<stdout>:310th item is incorrect! (-50.468750 != -504.304810)
[1,0]<stdout>:311th item is incorrect! (-215.500000 != -21.494751)
[1,0]<stdout>:312th item is incorrect! (-155.875000 != 88.112183)
[1,0]<stdout>:313th item is incorrect! (311.000000 != -324.561157)
[1,0]<stdout>:314th item is incorrect! (-652.500000 != 775.452271)
[1,0]<stdout>:315th item is incorrect! (-657.500000 != 800.107666)
[1,0]<stdout>:316th item is incorrect! (-396.250000 != -312.320923)
[1,0]<stdout>:317th item is incorrect! (412.000000 != -261.506409)
[1,0]<stdout>:318th item is incorrect! (594.500000 != -933.462891)
[1,0]<stdout>:319th item is incorrect! (-936.500000 != -777.594482)
[1,0]<stdout>:320th item is incorrect! (-367.000000 != -674.255737)
[1,0]<stdout>:321th item is incorrect! (-446.250000 != 560.504028)
[1,0]<stdout>:322th item is incorrect! (745.000000 != 754.727783)
[1,0]<stdout>:323th item is incorrect! (-907.500000 != -220.522339)
[1,0]<stdout>:324th item is incorrect! (-702.000000 != -579.396240)
[1,0]<stdout>:325th item is incorrect! (-805.500000 != -516.617432)
[1,0]<stdout>:326th item is incorrect! (988.000000 != -454.493958)
[1,0]<stdout>:327th item is incorrect! (647.000000 != -192.175720)
[1,0]<stdout>:328th item is incorrect! (644.000000 != -15.116028)
[1,0]<stdout>:329th item is incorrect! (389.750000 != -807.090942)
[1,0]<stdout>:330th item is incorrect! (-749.500000 != -553.160095)
[1,0]<stdout>:331th item is incorrect! (-365.750000 != -736.053406)
[1,0]<stdout>:332th item is incorrect! (527.500000 != -19.397034)
[1,0]<stdout>:333th item is incorrect! (900.500000 != 884.101196)
[1,0]<stdout>:334th item is incorrect! (-18.828125 != 909.886719)
[1,0]<stdout>:335th item is incorrect! (-931.000000 != 912.269165)
[1,0]<stdout>:336th item is incorrect! (327.250000 != 303.937256)
[1,0]<stdout>:337th item is incorrect! (-122.500000 != 150.417236)
[1,0]<stdout>:338th item is incorrect! (-748.000000 != 515.007202)
[1,0]<stdout>:339th item is incorrect! (-236.875000 != -880.440918)
[1,0]<stdout>:340th item is incorrect! (-579.500000 != -129.508606)
[1,0]<stdout>:341th item is incorrect! (531.000000 != -530.440186)
[1,0]<stdout>:342th item is incorrect! (-897.500000 != 105.762329)
[1,0]<stdout>:343th item is incorrect! (590.500000 != -293.682861)
[1,0]<stdout>:344th item is incorrect! (-927.000000 != -893.694885)
[1,0]<stdout>:345th item is incorrect! (-626.500000 != 642.388062)
[1,0]<stdout>:346th item is incorrect! (-182.500000 != -354.978516)
[1,0]<stdout>:347th item is incorrect! (-20.468750 != -969.193115)
[1,0]<stdout>:348th item is incorrect! (-84.000000 != -190.037231)
[1,0]<stdout>:349th item is incorrect! (-108.812500 != -913.952393)
[1,0]<stdout>:350th item is incorrect! (-24.859375 != 816.867188)
[1,0]<stdout>:351th item is incorrect! (292.750000 != -662.019897)
[1,0]<stdout>:352th item is incorrect! (588.000000 != 618.274414)
[1,0]<stdout>:353th item is incorrect! (418.750000 != 298.230957)
[1,0]<stdout>:354th item is incorrect! (841.500000 != -483.408203)
[1,0]<stdout>:355th item is incorrect! (509.250000 != 463.444824)
[1,0]<stdout>:356th item is incorrect! (615.000000 != -740.306946)
[1,0]<stdout>:357th item is incorrect! (-448.000000 != 295.491943)
[1,0]<stdout>:358th item is incorrect! (411.500000 != -13.346313)
[1,0]<stdout>:359th item is incorrect! (359.500000 != -98.152588)
[1,0]<stdout>:360th item is incorrect! (-994.500000 != -242.999146)
[1,0]<stdout>:361th item is incorrect! (310.250000 != 94.017700)
[1,0]<stdout>:362th item is incorrect! (421.500000 != 436.939941)
[1,0]<stdout>:363th item is incorrect! (-675.000000 != -407.358398)
[1,0]<stdout>:364th item is incorrect! (288.000000 != -424.390015)
[1,0]<stdout>:365th item is incorrect! (-762.000000 != 489.385620)
[1,0]<stdout>:366th item is incorrect! (-87.937500 != 246.871094)
[1,0]<stdout>:367th item is incorrect! (-3.271484 != -622.089966)
[1,0]<stdout>:368th item is incorrect! (548.000000 != 623.747925)
[1,0]<stdout>:369th item is incorrect! (919.500000 != 373.550903)
[1,0]<stdout>:370th item is incorrect! (147.500000 != -374.984009)
[1,0]<stdout>:371th item is incorrect! (-319.250000 != -632.977661)
[1,0]<stdout>:372th item is incorrect! (753.500000 != -228.578979)
[1,0]<stdout>:373th item is incorrect! (170.500000 != -263.030823)
[1,0]<stdout>:374th item is incorrect! (616.500000 != -312.140503)
[1,0]<stdout>:375th item is incorrect! (-552.500000 != 251.237183)
[1,0]<stdout>:376th item is incorrect! (-964.500000 != 631.538208)
[1,0]<stdout>:377th item is incorrect! (502.500000 != 560.454834)
[1,0]<stdout>:378th item is incorrect! (642.500000 != 338.570190)
[1,0]<stdout>:379th item is incorrect! (-489.750000 != -837.748474)
[1,0]<stdout>:380th item is incorrect! (641.500000 != -241.162292)
[1,0]<stdout>:381th item is incorrect! (11.914062 != 858.771973)
[1,0]<stdout>:382th item is incorrect! (880.000000 != -83.006226)
[1,0]<stdout>:383th item is incorrect! (398.250000 != 551.425293)
[1,0]<stdout>:384th item is incorrect! (-174.625000 != -392.772095)
[1,0]<stdout>:385th item is incorrect! (782.000000 != -26.416748)
[1,0]<stdout>:386th item is incorrect! (-153.625000 != 821.129883)
[1,0]<stdout>:387th item is incorrect! (918.500000 != -128.282837)
[1,0]<stdout>:388th item is incorrect! (161.875000 != -22.764587)
[1,0]<stdout>:389th item is incorrect! (94.437500 != -106.432495)
[1,0]<stdout>:390th item is incorrect! (-684.000000 != 511.579712)
[1,0]<stdout>:391th item is incorrect! (-723.000000 != -387.301025)
[1,0]<stdout>:392th item is incorrect! (523.500000 != -783.876160)
[1,0]<stdout>:393th item is incorrect! (-701.500000 != 17.017334)
[1,0]<stdout>:394th item is incorrect! (-539.500000 != -212.820312)
[1,0]<stdout>:395th item is incorrect! (-485.000000 != 21.543152)
[1,0]<stdout>:396th item is incorrect! (619.500000 != 740.373535)
[1,0]<stdout>:397th item is incorrect! (681.500000 != 635.255493)
[1,0]<stdout>:398th item is incorrect! (977.000000 != -278.278564)
[1,0]<stdout>:399th item is incorrect! (-491.500000 != 589.662842)
[1,0]<stdout>:400th item is incorrect! (-335.000000 != 834.235840)
[1,0]<stdout>:401th item is incorrect! (628.500000 != 288.636230)
[1,0]<stdout>:402th item is incorrect! (-400.250000 != 87.610962)
[1,0]<stdout>:403th item is incorrect! (-513.000000 != -242.781250)
[1,0]<stdout>:404th item is incorrect! (-973.000000 != -719.712280)
[1,0]<stdout>:405th item is incorrect! (858.500000 != 623.161011)
[1,0]<stdout>:406th item is incorrect! (-565.500000 != -600.254272)
[1,0]<stdout>:407th item is incorrect! (-300.000000 != 65.651123)
[1,0]<stdout>:408th item is incorrect! (814.500000 != 897.850098)
[1,0]<stdout>:409th item is incorrect! (-607.000000 != -298.545776)
[1,0]<stdout>:410th item is incorrect! (697.000000 != 980.219971)
[1,0]<stdout>:411th item is incorrect! (-497.750000 != 878.003174)
[1,0]<stdout>:412th item is incorrect! (910.000000 != -519.848145)
[1,0]<stdout>:413th item is incorrect! (232.125000 != 751.885620)
[1,0]<stdout>:414th item is incorrect! (558.000000 != -966.958862)
[1,0]<stdout>:415th item is incorrect! (-53.437500 != 100.312744)
[1,0]<stdout>:416th item is incorrect! (975.000000 != -222.769653)
[1,0]<stdout>:417th item is incorrect! (-296.750000 != 244.950195)
[1,0]<stdout>:418th item is incorrect! (-865.000000 != 559.378784)
[1,0]<stdout>:419th item is incorrect! (661.500000 != 174.089478)
[1,0]<stdout>:420th item is incorrect! (587.000000 != -46.723816)
[1,0]<stdout>:421th item is incorrect! (170.500000 != -584.515381)
[1,0]<stdout>:422th item is incorrect! (189.000000 != 80.276001)
[1,0]<stdout>:423th item is incorrect! (99.437500 != -397.507324)
[1,0]<stdout>:424th item is incorrect! (465.500000 != -963.098328)
[1,0]<stdout>:425th item is incorrect! (834.500000 != -58.153320)
[1,0]<stdout>:426th item is incorrect! (390.500000 != 800.366455)
[1,0]<stdout>:427th item is incorrect! (-428.250000 != -539.023682)
[1,0]<stdout>:428th item is incorrect! (359.750000 != -611.009277)
[1,0]<stdout>:429th item is incorrect! (514.500000 != 688.617554)
[1,0]<stdout>:430th item is incorrect! (-215.375000 != 771.996216)
[1,0]<stdout>:431th item is incorrect! (507.500000 != -610.471436)
[1,0]<stdout>:432th item is incorrect! (123.125000 != -117.553040)
[1,0]<stdout>:433th item is incorrect! (-239.125000 != -548.156433)
[1,0]<stdout>:434th item is incorrect! (-584.000000 != -704.341980)
[1,0]<stdout>:435th item is incorrect! (135.625000 != -658.583923)
[1,0]<stdout>:436th item is incorrect! (54.750000 != -520.995117)
[1,0]<stdout>:437th item is incorrect! (-848.500000 != -544.671387)
[1,0]<stdout>:438th item is incorrect! (-191.625000 != 599.306396)
[1,0]<stdout>:439th item is incorrect! (-892.000000 != -128.602600)
[1,0]<stdout>:440th item is incorrect! (-294.500000 != -53.969849)
[1,0]<stdout>:441th item is incorrect! (61.593750 != -377.795410)
[1,0]<stdout>:442th item is incorrect! (185.625000 != -820.353699)
[1,0]<stdout>:443th item is incorrect! (558.500000 != 846.759277)
[1,0]<stdout>:444th item is incorrect! (-287.250000 != 289.101074)
[1,0]<stdout>:445th item is incorrect! (868.000000 != -139.585205)
[1,0]<stdout>:446th item is incorrect! (930.000000 != 266.127319)
[1,0]<stdout>:447th item is incorrect! (-740.000000 != -630.367371)
[1,0]<stdout>:448th item is incorrect! (-691.000000 != 168.764526)
[1,0]<stdout>:449th item is incorrect! (137.625000 != 809.761841)
[1,0]<stdout>:450th item is incorrect! (-210.125000 != 453.308838)
[1,0]<stdout>:451th item is incorrect! (-61.218750 != 959.496704)
[1,0]<stdout>:452th item is incorrect! (-225.375000 != -290.723755)
[1,0]<stdout>:453th item is incorrect! (-976.000000 != -122.260010)
[1,0]<stdout>:454th item is incorrect! (454.000000 != 360.813232)
[1,0]<stdout>:455th item is incorrect! (-325.750000 != -777.761597)
[1,0]<stdout>:456th item is incorrect! (-222.875000 != 414.643066)
[1,0]<stdout>:457th item is incorrect! (-675.500000 != -483.870605)
[1,0]<stdout>:458th item is incorrect! (855.000000 != -675.343018)
[1,0]<stdout>:459th item is incorrect! (588.500000 != -182.560303)
[1,0]<stdout>:460th item is incorrect! (-127.750000 != -732.527222)
[1,0]<stdout>:461th item is incorrect! (-377.500000 != 189.792114)
[1,0]<stdout>:462th item is incorrect! (725.500000 != -100.887817)
[1,0]<stdout>:463th item is incorrect! (57.062500 != -475.576538)
[1,0]<stdout>:464th item is incorrect! (240.750000 != -915.891724)
[1,0]<stdout>:465th item is incorrect! (-668.500000 != 205.686157)
[1,0]<stdout>:466th item is incorrect! (-761.000000 != 594.728271)
[1,0]<stdout>:467th item is incorrect! (204.000000 != 422.431641)
[1,0]<stdout>:468th item is incorrect! (-56.093750 != -664.888428)
[1,0]<stdout>:469th item is incorrect! (-474.000000 != -556.506531)
[1,0]<stdout>:470th item is incorrect! (-319.500000 != 662.428589)
[1,0]<stdout>:471th item is incorrect! (308.250000 != -765.164673)
[1,0]<stdout>:472th item is incorrect! (59.687500 != -350.072937)
[1,0]<stdout>:473th item is incorrect! (378.500000 != -406.648254)
[1,0]<stdout>:474th item is incorrect! (432.250000 != 311.559814)
[1,0]<stdout>:475th item is incorrect! (496.250000 != -362.443359)
[1,0]<stdout>:476th item is incorrect! (977.000000 != -159.620117)
[1,0]<stdout>:477th item is incorrect! (-98.937500 != -151.666443)
[1,0]<stdout>:478th item is incorrect! (441.000000 != 563.818359)
[1,0]<stdout>:479th item is incorrect! (-832.500000 != 15.716553)
[1,0]<stdout>:480th item is incorrect! (825.000000 != -773.614929)
[1,0]<stdout>:481th item is incorrect! (-542.000000 != -828.968384)
[1,0]<stdout>:482th item is incorrect! (11.000000 != 987.069458)
[1,0]<stdout>:483th item is incorrect! (826.500000 != -475.035522)
[1,0]<stdout>:484th item is incorrect! (116.562500 != -636.854065)
[1,0]<stdout>:485th item is incorrect! (-695.000000 != 602.029175)
[1,0]<stdout>:486th item is incorrect! (6.378906 != 525.862793)
[1,0]<stdout>:487th item is incorrect! (651.500000 != -941.559448)
[1,0]<stdout>:488th item is incorrect! (-75.062500 != -468.158691)
[1,0]<stdout>:489th item is incorrect! (76.687500 != 857.708374)
[1,0]<stdout>:490th item is incorrect! (93.187500 != -511.723450)
[1,0]<stdout>:491th item is incorrect! (992.500000 != 460.661743)
[1,0]<stdout>:492th item is incorrect! (-104.812500 != -798.523010)
[1,0]<stdout>:493th item is incorrect! (-843.500000 != -22.782043)
[1,0]<stdout>:494th item is incorrect! (709.000000 != -311.344116)
[1,0]<stdout>:495th item is incorrect! (-114.625000 != 157.050171)
[1,0]<stdout>:496th item is incorrect! (208.500000 != -436.745361)
[1,0]<stdout>:497th item is incorrect! (-786.500000 != -525.432861)
[1,0]<stdout>:498th item is incorrect! (-2.912109 != 937.280762)
[1,0]<stdout>:499th item is incorrect! (924.000000 != -82.302307)
[1,0]<stdout>:500th item is incorrect! (960.000000 != -572.454468)
[1,0]<stdout>:501th item is incorrect! (-990.500000 != 926.177002)
[1,0]<stdout>:502th item is incorrect! (-931.500000 != 211.856323)
[1,0]<stdout>:503th item is incorrect! (550.000000 != 93.611450)
[1,0]<stdout>:504th item is incorrect! (954.000000 != -546.897339)
[1,0]<stdout>:505th item is incorrect! (634.500000 != 42.271606)
[1,0]<stdout>:506th item is incorrect! (-273.750000 != -620.189026)
[1,0]<stdout>:507th item is incorrect! (737.500000 != -536.811218)
[1,0]<stdout>:508th item is incorrect! (359.000000 != -196.857056)
[1,0]<stdout>:509th item is incorrect! (-831.000000 != -22.204529)
[1,0]<stdout>:510th item is incorrect! (-307.500000 != -173.418762)
[1,0]<stdout>:511th item is incorrect! (-200.375000 != 248.120239)
[1,0]<stdout>:512th item is incorrect! (712.000000 != -659.695435)
[1,0]<stdout>:256 errors!

Hierarchical Collectives

Add support for hierarchical collectives within the confines of fanin == 1. Some examples:

  • hierarchical rings for (all)gather, (all)reduce, scatter-reduce
  • hierarchical trees for broadcast and scatter

ACCL buffers don't support Vitis sw_emu

When the XRT driver is used to create an ACCL buffer for simulation, all create_buffer calls except the one that takes an xrt::bo create a SimBuffer object whose SimBuffer::bo() call returns a nullptr.

This becomes a problem if buffers created this way should also be used in a user kernel emulated with Vitis sw_emu.
In this case, an invalid address (nullptr) is passed to the user kernel, which leads to undefined behavior. In hardware execution, however, the same code works because a different buffer class is used underneath.

This is an inconsistency rather than a bug.

Multiple collectives overlapping

Assuming #3 is fixed and we can receive from multiple nodes simultaneously, add driver/firmware support for configuring and operating multiple communicators simultaneously.

FPGABuffer class doesn't return physical hardware address after calling .bo() function

The code below creates two FPGA buffers and syncs them to the device. However, the addresses obtained from tx_buf_network->bo() and rx_buf_network->bo() do not represent physical FPGA memory addresses, so the network kernel cannot write to memory, or can write only a very small amount of data.

Buffer<int8_t> *tx_buf_network = new FPGABuffer<int8_t>(32*1024*1024, dataType::int8, device, networkmem);
Buffer<int8_t> *rx_buf_network = new FPGABuffer<int8_t>(32*1024*1024, dataType::int8, device, networkmem);
tx_buf_network->sync_to_device();
rx_buf_network->sync_to_device();
network_krnl(localFPGAIP, uint(rank), localFPGAIP, tx_buf_network->bo(), rx_buf_network->bo());

After changing the buffer instantiation to use the native XRT API, it works fine:

auto tx_buf_network = xrt::bo(device, 8*1024*1024*sizeof(int8_t), networkmem);
tx_buf_network.sync(XCL_BO_SYNC_BO_TO_DEVICE);
auto rx_buf_network = xrt::bo(device, 8*1024*1024*sizeof(int8_t), networkmem);
rx_buf_network.sync(XCL_BO_SYNC_BO_TO_DEVICE);
network_krnl(localFPGAIP, uint(rank), localFPGAIP, tx_buf_network, rx_buf_network);

Command Scheduling from PL is stuck

The following user kernel is used to schedule sends and receives from PL:

#include "accl_hls.h"


void send_recv(const float *read_buffer,float *write_buffer,  ap_uint<32> size, ap_uint<32> num_iterations, 
                ap_uint<32> neighbor_rank, ap_uint<32> communicator_addr, ap_uint<32> datapath_cfg,
                STREAM<command_word> &cmd, STREAM<command_word> &sts) {
    accl_hls::ACCLCommand accl_cmd(cmd, sts, communicator_addr, datapath_cfg,0,0);
    for (int i = 0; i < num_iterations; i++) {
        accl_cmd.send(size, 0, neighbor_rank, (ap_uint<64>)read_buffer);
        accl_cmd.recv(size, 0, neighbor_rank, (ap_uint<64>)write_buffer);
    }
}

The user kernel is linked with the ACCL cclo and plugin kernels of the latest dev branch like this: https://github.com/XilinxDublinLabs/HPCBenchmarks/blob/accl/b_eff/settings/settings.link.xilinx.accl_pl.u55c.hbm.profile.ini

The execution of the design gets stuck when executing the send_recv kernel. Profiling data shows that the commands of the user kernel do not get passed to the client_arbiter and cclo:

Accelerator Monitor Counters (hex values are cycle count)
  Compute Unit       Ends      Starts    Max Parallel Itr  Execution         Memory Stall      Pipe Stall        Stream Stall      Min Exec          Max Exec        
  ccl_offload_0      0         0         0                 0x0               0x0               0x0               0x0               0xffffffffffffffff  0x0             
  hostctrl_0         4         4         1                 0x6ab             0x0               0x0               0x0               0xc6              0x45d           
  networklayer_0     0         1         1                 0x27aad2d70       0x0               0x0               0x0               0xffffffffffffffff  0x0             
  sendrecv           0         1         1                 0x220eb9f1e       0x0               0x0               0x0               0xffffffffffffffff  0x0             
  cmac_0             0         0         0                 0x0               0x0               0x0               0x0               0xffffffffffffffff  0x0             


AXI Stream Monitor Counters
  Stream Master                        Stream Slave                   Num Trans.        Data kBytes       Busy Cycles       Stall Cycles      Starve Cycles   
  cmac_0/M_AXIS                        networklayer_0/S_AXIS_eth2nl   48                0.832             118               0                 14              
  networklayer_0/M_AXIS_nl2eth         cmac_0/S_AXIS                  512               4.096             512               0                 0               
  networklayer_0/M_AXIS_nl2sk          PIPE                           0                 0.000             0                 0                 0               
  ccl_offload_0/m_axis_eth_tx_data     PIPE                           0                 0.000             0                 0                 0               
  sendrecv/cmd                         client_arbiter/cmd_clients_1   0                 0.000             0                 0                 0               
  ccl_offload_0/m_axis_call_ack        client_arbiter/ack_cclo        0                 0.016             9142780514        0                 9142780510      
  client_arbiter/ack_clients_0         hostctrl_0/sts                 0                 0.016             9142809039        0                 9142809035      
  client_arbiter/ack_clients_1         sendrecv/sts                   0                 0.000             0                 0                 0               
  client_arbiter/cmd_cclo              ccl_offload_0/s_axis_call_req  4                 0.240             76                16                0               
  hostctrl_0/cmd                       client_arbiter/cmd_clients_0   4                 0.240             94                34                0   

Use PLRAM as exchange memory

Currently we have a small exchange memory mapped both in the Microblaze and host address spaces to hold configuration data. The size of this memory is limiting, e.g. some users might need much more memory than others. To enable user-configurable and potentially very large exchange memory, we could read configuration from PLRAM. Performance implications must be considered.

Warning for ACCL operations on too large data buffers

The behaviour of ACCL is undefined when doing operations where the data being exchanged is larger than the size of the rx buffers. This is an easy mistake to overlook, and results in the CCLO hanging without clear cause. It would be helpful if the ACCL driver would issue a warning when trying to perform an operation that is too large to fit in the rx buffers.
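
A minimal sketch of such a check, assuming the driver knows the configured rx buffer size in bytes. The function names `exceeds_rx_buffer` and `warn_if_too_large` and their parameters are hypothetical and not part of the ACCL API:

```cpp
#include <cstdint>
#include <iostream>

// Hypothetical helper, not part of the ACCL driver: returns true when a
// transfer of `count` elements of `elem_bytes` bytes each does not fit
// into an rx buffer of `rx_buf_bytes` bytes.
bool exceeds_rx_buffer(std::uint64_t count, std::uint64_t elem_bytes,
                       std::uint64_t rx_buf_bytes) {
    if (elem_bytes == 0) return false;  // nothing to transfer
    // Divide instead of multiplying to avoid overflow for huge counts.
    return count > rx_buf_bytes / elem_bytes;
}

// Emits a warning on stderr and reports whether one was issued.
bool warn_if_too_large(std::uint64_t count, std::uint64_t elem_bytes,
                       std::uint64_t rx_buf_bytes) {
    if (!exceeds_rx_buffer(count, elem_bytes, rx_buf_bytes)) return false;
    std::cerr << "warning: transfer of " << count << " elements ("
              << elem_bytes << " B each) exceeds the rx buffer size of "
              << rx_buf_bytes << " B\n";
    return true;
}
```

A driver could call such a helper at the start of every primitive or collective, before the command is handed to the CCLO.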

MPI_ABORT invoked at the end of test run with failures

Observed on a test against the emulator, with 8 ranks. All tests run, some fail, test killed at the very end with:

[1,7]<stdout>:3 tests failed on rank 7 (skipped 1 tests).
[1,6]<stdout>:3 tests failed on rank 6 (skipped 1 tests).
[1,2]<stdout>:3 tests failed on rank 2 (skipped 1 tests).
[1,3]<stdout>:3 tests failed on rank 3 (skipped 1 tests).
[1,4]<stdout>:3 tests failed on rank 4 (skipped 1 tests).
[1,1]<stdout>:3 tests failed on rank 1 (skipped 1 tests).
[1,5]<stdout>:3 tests failed on rank 5 (skipped 1 tests).
[1,0]<stdout>:3 tests failed on rank 0 (skipped 1 tests).
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

Tri design for performance measurements

  • A single kernel sends commands to 3 or more CCLOs.
  • The CCLOs are connected via an AXI-Stream switch.
  • The performance kernel can be connected to a timer to measure the time elapsed between sending a command and receiving its acknowledgement.
  • It can invoke the collective multiple times.
  • This removes PCIe communication from the measurement.

Broken links in README

This is just a small usability bug I came across: most of the links in the section of the README that describes the repository structure are broken, and the structure seems to be outdated, especially for the demo/test directory.

Evaluate using HOSTMEM where available

HOST memory on selected shells allows direct DMA access into host DDR from the FPGA. Using HOST memory instead of FPGA DDR/HBM may reduce the latency of host-to-host collectives or primitives. Performance implications on throughput must be evaluated.

Allocate memory in simulation with smaller JSON requests

Add a new memory operation in the emulator and simulator that only specifies the memory address and size, and allocates the memory without initializing it. This will allow the use of bigger buffers without the large performance impact of sending large messages over the network.

CCLO appears configured error in XRT driver

Currently the XRT driver throws a "CCLO appears configured" error, even though reset_periph is called on deinit. This behavior is not observed in the pynq driver.

Allreduce test fails in emulator, ranks==8

To reproduce, run the tests against an emulator session with 8 ranks.

Some outputs slightly different from expected:

[1,7]<stdout>:1th item is incorrect! (5035.578613 != 5035.579102)
[1,7]<stdout>:2th item is incorrect! (-5832.367676 != -5832.368164)
[1,7]<stdout>:3th item is incorrect! (6492.671387 != 6492.670898)
[1,7]<stdout>:6th item is incorrect! (7501.883789 != 7501.884766)
[1,7]<stdout>:7th item is incorrect! (6614.014648 != 6614.013672)
[1,7]<stdout>:13th item is incorrect! (-3544.027832 != -3544.028320)
[1,7]<stdout>:16th item is incorrect! (7886.101074 != 7886.100586)
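
The differences above are one or two units in the last place of a 32-bit float at this magnitude (about 0.0005 near 5000). That is what one would expect if the reduction adds the operands in a different order than the reference computation, since floating-point addition is not associative; this is an assumption, not a confirmed diagnosis. Under that assumption, the test could compare with a relative tolerance instead of exact equality. A minimal sketch, where the function name and the default tolerance are illustrative:

```cpp
#include <algorithm>
#include <cmath>

// Illustrative helper: true when a and b agree up to a relative tolerance.
// rel_tol is an assumed value, chosen to allow a few float ULPs of drift.
bool almost_equal(float a, float b, float rel_tol = 1e-6f) {
    float scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= rel_tol * scale;
}
```

With this check, the value pairs listed above would be accepted, while the grossly different values seen in a real data corruption would still be rejected.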

Distributed emulation stuck with >= 12 ranks (2+ nodes)

I'm working on the integration of ACCL and OMPC. I am currently using the ACCL distributed emulation approach to start testing the offloading of computation to Alveo boards in a distributed system, with ACCL as the communication backend.

I've tried some scenarios:

  • 4 nodes: Whenever I use 3 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
  • 3 nodes: Whenever I use 4 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
  • 2 nodes: Whenever I use 6 or more emulated ACCL instances per node (12 or more in total), the application we run gets stuck after some time.
  • 1 node (not distributed): Tested with up to 20 ACCL instances; this works without problems.

Any scenario with 10 or fewer instances in total works fine (I can't test with 11 instances due to some integration constraints).

Specify data type when sending or receiving from/to streams

If stream_put reads from a stream and the ACCL:dummy_buffer is used, the data type of the operation is not defined. Currently, a buffer with the required data type needs to be created and passed to the operation, although no buffer is actually used.

The same holds when sending data from a stream with send, or when receiving data and forwarding it to a stream with recv.

Either we need an elegant way to define the data type over dummy buffers without having to allocate one manually, or we need separate function signatures for these kinds of operations.

Improve XRT test suite

The test script currently only tests a limited set of inputs, which sometimes results in bugs being merged into the project without us noticing. The script is also growing quite large and crashes on the first error. We should look into switching to a unit test framework like GoogleTest.

Improvements

  • Test all data types and multiple data sizes
  • Use different types of buffers for all tests (unaligned, aligned, bo buffer)
  • Continue testing on error
  • Switch to a unit test framework (e.g. GoogleTest)

Interface with non-FPGA hosts

To have FPGAs talk to non-FPGA hosts we need a software implementation of the ACCL collectives protocol on top of TCP initially, then on top of RDMA.

Separate tag and stream ID in API calls

Currently, we only have a single parameter to specify both the source/destination tag of a message and the source/destination stream_id. This means that tag and stream_id have to match for every call.

However, the stream_id may also be used in applications to identify different user kernels. Forcing a mapping to a tag artificially reduces the number of valid tags for a message. Also, the tag range is larger than the stream_id range, which may lead to undefined behaviour in some cases.

Possible solutions could be:

  • Use the stream_id that is sent by the user kernel with the data and forward it to the remote.
  • Allow a separate stream_id to be specified in recv and stream_put calls, so that the destination user kernel can be selected when the data comes from a buffer.

Harden DMA command segmentation

A large part of the complexity of the firmware (and of the time spent in execution) goes into managing the issuing and acknowledgement of DMA transfer segments. We should evaluate the feasibility and benefits of offloading this functionality to an HLS IP, which would reduce the firmware complexity and also improve latency, especially for small messages.

Implement barrier

Implement a barrier with ACCL, to avoid having to go to the system MPI for this functionality. Possibly reuse a ring collective with a small dummy payload.

Simulator memory allocation

Currently, the simulator can run out of memory because memory cannot be reused.
This leads to failing tests for the C++ driver, because there are too many small allocations. A proper allocation scheme is required in the C++ driver to prevent this issue in a sustainable way, both in the C++ tests and in applications using the C++ driver.

Unintuitive result order of reduce_scatter operation

Currently the results of reduce_scatter are ordered unintuitively. If the result array has n elements and there are n processes, rank i receives the element at index (i + 1) % n. It would be more intuitive if rank i received index i instead.
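
The two mappings can be written down as follows. This is a sketch of the index arithmetic described above only; the function names are illustrative and do not exist in ACCL:

```cpp
// Index of the reduced element that rank `rank` ends up with today,
// as described in this issue: (rank + 1) % n.
unsigned current_chunk_index(unsigned rank, unsigned n) {
    return (rank + 1) % n;
}

// Proposed, more intuitive mapping: rank i owns index i.
unsigned proposed_chunk_index(unsigned rank, unsigned /*n*/) {
    return rank;
}

// Inverse of the current mapping: the rank whose output holds index `chunk`.
// Callers could use this as a workaround until the order is changed.
unsigned rank_holding_chunk(unsigned chunk, unsigned n) {
    return (chunk + n - 1) % n;
}
```

For n = 4, for example, rank 3 currently receives index 0, so index 0 has to be fetched from rank 3.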

Improve C++ API for data transfers to and from streams

Currently, all ACCL calls like send/recv require a reference to a BaseBuffer object as a mandatory argument. When a stream is used as input or output, a BaseBuffer still has to be created and passed, although it is not used at all.

CCLO *ACCL::send(BaseBuffer &srcbuf, unsigned int count, unsigned int dst, unsigned int tag, communicatorId comm_id, bool from_fpga, streamFlags stream_flags, dataType compress_dtype, bool run_async, std::vector<CCLO *> waitfor)

There are different options to change/extend the API to overcome this issue:

  • Pass the buffers as pointers. This would allow a nullptr to be used for the argument. This would already be handled by ACCL::prepare_call, but it would change the signatures of all calls.
  • Create specific overloads of the calls for streaming that do not use the input buffer and implicitly set the input streaming flag.
  • Introduce additional calls ACCL::stream_send, ACCL::stream_recv ... that explicitly support streaming and only take the required arguments as input.

I personally would favor the last option because it is most explicit. However, I am not sure if it is flexible enough to support further enhancements of ACCL in the future since we would basically remove control over the streaming flags by the user.

I would also argue for a stream_recv call because the current way of handling communication via streams as one-sided communication is ambiguous. What happens in cases, where two ranks send data to the same destination rank? The order the messages will arrive is not defined and there is no way to find out the current sender on the receiving side. With that, we need to use barriers to enforce the correct order. With the stream_recv calls we would be able to define the order how the messages are received and handled by the user kernel.

I think all this needs further discussion to find the best solution. Maybe there are also technical restrictions I am not aware of.

Enable calls from Vitis kernels

Currently ACCL can only be called from the host. It's useful in some scenarios for Vitis kernels to call ACCL directly. For this, two things are required:

  • adding a separate call interface (pair of external command/response streams, appropriate changes to control plane and firmware)
  • specification of control handshake and HLS library for primitives and collectives

Additional Constructor for ACCL XRT Driver

In the XRT driver, the ACCL constructor currently requires the user to give the cclo and control kernel as xrt::ip and xrt::kernel objects.
But in some cases it may be easier for the user to assume default names for these kernels to allow reduction of the code required for initialization.

Therefore, an additional ACCL constructor taking the xrt::uuid instead of the kernels could be provided. The instantiation of the kernels is moved to the constructor itself. Optionally, the kernel names may be given as std::strings to the constructor.

Compilation fails with u250

Hi all,

I recently tried to compile a bitstream for u250 (xilinx_u250_gen3x16_xdma_4_1_202210_1) and it fails during routing. The build was based on the last commit of the new_tcp branch (which is now merged into the dev branch).

The error message is the following:

ERROR: [Constraints 18-1000] Routing results verification failed due to partially-conflicted nets (Up to first 10 of violated nets): level0_i/level1/level1_i/ulp/arith_0/inst/dadd_64ns_64ns_64_5_full_dsp_1_U39/reduce_ops_dadd_64ns_64ns_64_5_full_dsp_1_ip_u/inst/i_synth/ADDSUB_OP.ADDSUB/LOGIC_SPEED.OP/B_IP_DELAY/i_pipe/b_frac_del[20] level0_i/level1/level1_i/ulp/arith_0/inst/dadd_64ns_64ns_64_5_full_dsp_1_U39/reduce_ops_dadd_64ns_64ns_64_5_full_dsp_1_ip_u/inst/i_synth/ADDSUB_OP.ADDSUB/LOGIC_SPEED.OP/B_IP_DELAY/i_pipe/b_frac_del[5] level0_i/level1/level1_i/ulp/arith_0/inst/fadd_32ns_32ns_32_7_full_dsp_1_U11/reduce_ops_fadd_32ns_32ns_32_7_full_dsp_1_ip_u/inst/i_synth/ADDSUB_OP.ADDSUB/DSP.OP/A_IP_DELAY/i_pipe/opt_has_pipe.first_q_reg[22]_0[5] level0_i/level1/level1_i/ulp/arith_0/inst/fadd_32ns_32ns_32_7_full_dsp_1_U11/reduce_ops_fadd_32ns_32ns_32_7_full_dsp_1_ip_u/inst/i_synth/ADDSUB_OP.ADDSUB/DSP.OP/DSP48E1_BODY.ALIGN_ADD/SUM_DELAY/i_pipe/opt_has_pipe.first_q_reg[25]_0 level0_i/level1/level1_i/ulp/arith_0/inst/fadd_32ns_32ns_32_7_full_dsp_1_U11/reduce_ops_fadd_32ns_32ns_32_7_full_dsp_1_ip_u/inst/i_synth/ADDSUB_OP.ADDSUB/DSP.OP/DSP48E1_BODY.ALIGN_ADD/SUM_DELAY/i_pipe/mant_norm[5] level0_i/level1/level1_i/ulp/memory_subsystem/inst/interconnect/interconnect_m00_axi_mem00/inst/m00_exit_pipeline/m00_exit/inst/aw_reg/skid_buffer[1144]_i_1__0_n_0 level0_i/level1/level1_i/ulp/memory_subsystem/inst/interconnect/interconnect_s06_axi/inst/m00_exit_pipeline/m00_exit/inst/ar_reg/Q[134] level0_i/level1/level1_i/ulp/arith_0/inst/dadd_64ns_64ns_64_5_full_dsp_1_U40/reduce_ops_dadd_64ns_64ns_64_5_full_dsp_1_ip_u/inst/i_synth/ADDSUB_OP.ADDSUB/LOGIC_SPEED.OP/ALIGN_BLK/FRAC_ADDSUB/DSP_ADD.FRAC_ADDSUB/i_no_versal_es1_workaround.DSP48E1_ADD.DSP48E1_ADD/i_no_versal_es1_workaround.DSP/P[26] 
level0_i/level1/level1_i/ulp/arith_0/inst/dadd_64ns_64ns_64_5_full_dsp_1_U40/reduce_ops_dadd_64ns_64ns_64_5_full_dsp_1_ip_u/inst/i_synth/ADDSUB_OP.ADDSUB/LOGIC_SPEED.OP/ALIGN_BLK/FRAC_ADDSUB/DSP_ADD.FRAC_ADDSUB/i_no_versal_es1_workaround.DSP48E1_ADD.DSP48E1_ADD/i_no_versal_es1_workaround.DSP/P[25] level0_i/level1/level1_i/ulp/arith_0/inst/dadd_64ns_64ns_64_5_full_dsp_1_U36/reduce_ops_dadd_64ns_64ns_64_5_full_dsp_1_ip_u/inst/i_synth/ADDSUB_OP.ADDSUB/LOGIC_SPEED.OP/ALIGN_BLK/FRAC_ADDSUB/DSP_ADD.FRAC_ADDSUB/i_no_versal_es1_workaround.DSP48E1_ADD.DSP48E1_ADD/i_no_versal_es1_workaround.DSP_7

Enable RX Fan-In

Currently the receive pipeline supports fan-in == 1 and all collectives are ring-based as a result. Adding support for fan-in > 1 would enable tree collectives.

Send/Recv to same rank

Currently it is not possible to have the same source and destination rank for send/recv.
A possible workaround may be to copy the data from or to a stream in order to buffer it manually.

Add support for in-place operators

It would be useful to add support for in-place operators in the reduce collectives, so that you can pass the same buffer for input and output.

Document which collectives support streaming

For some collectives it's impossible to use streaming input/output, for example reduce can't use a stream as output. For new users it might not be easy to find out where this feature is supported. It would be useful to have a table on readthedocs that shows for each argument of all collectives whether it supports streaming or not.

Application stuck on receive

Application b_eff works in emulation and simulation as expected.
ACCL XRT tests succeed with the bitstream (except streaming send/recv which is not used by the application).
Application hangs on first receive when executed on hardware.

Run run_mpi.sh in following directory: /proj/xlabs_t3/users/mariusm/runs/2022-10-20-b_eff_accl_test_u55c

Application repo: https://gitenterprise.xilinx.com/mariusm/HPCC_FPGA.git
Bitstream: /proj/xlabs_t3/users/mariusm/synth/benchmarks/b_eff/u55c_pl/build

stream_put uses inconsistent stream_id

stream_put requires a stream_id to be passed. This stream_id needs to be >= 9. In the C++ driver, 9 is subtracted from this id in this line, for non-obvious reasons.

As a result, if stream_put is called with stream_id=9 and the source and destination rank are the same, the data is sent to stream_id=0.
The behavior is described in this test, which in my view should pass.

For send/recv, the stream_id/tag is used as expected, as can be seen in this test.

In my opinion, the behavior should be the same for both operations.
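
The inconsistency can be modeled in a few lines. This is a simplified sketch based only on the description in this issue, not the driver code; the function names and the exception are hypothetical:

```cpp
#include <stdexcept>

// Offset taken from the issue text: stream_put expects stream_id >= 9
// and the C++ driver subtracts 9 before issuing the call.
constexpr unsigned kStreamIdOffset = 9;

// Hypothetical model of the id that stream_put ends up using.
unsigned stream_put_effective_id(unsigned stream_id) {
    if (stream_id < kStreamIdOffset)
        throw std::invalid_argument("stream_id must be >= 9");
    return stream_id - kStreamIdOffset;
}

// Hypothetical model of send/recv, where the id is used unchanged.
unsigned send_recv_effective_id(unsigned stream_id) {
    return stream_id;
}
```

In this model, stream_id 9 maps to 0 for stream_put but stays 9 for send/recv, which is the mismatch reported here.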

No way of de-allocating buffers in simulator/emulator

Buffers get allocated sequentially and never get de-allocated, which can create problems especially with the simulator, which has very little memory. There needs to be a method to delete buffers and reuse the freed memory for new buffers (i.e. a real allocator)
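
One possible shape for such an allocator is a first-fit free list that merges neighboring free blocks on release. The sketch below is illustrative only: the class name, its methods, and the choice of first-fit are assumptions and are not taken from the ACCL simulator or driver.

```cpp
#include <cstdint>
#include <iterator>
#include <map>

// Minimal first-fit allocator over the address range [0, size).
class SimAllocator {
public:
    explicit SimAllocator(std::uint64_t size) {
        if (size > 0) free_[0] = size;
    }

    // Stores the start address of a block of `size` bytes in `addr`.
    // Returns false if no free block is large enough.
    bool allocate(std::uint64_t size, std::uint64_t &addr) {
        if (size == 0) return false;
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->second < size) continue;
            addr = it->first;
            std::uint64_t rest = it->second - size;
            free_.erase(it);
            if (rest > 0) free_[addr + size] = rest;  // keep the remainder
            used_[addr] = size;
            return true;
        }
        return false;
    }

    // Releases a block returned by allocate() and merges it with adjacent
    // free blocks. Returns false for unknown addresses.
    bool deallocate(std::uint64_t addr) {
        auto u = used_.find(addr);
        if (u == used_.end()) return false;
        std::uint64_t size = u->second;
        used_.erase(u);
        auto next = free_.lower_bound(addr);
        // Merge with the following free block if it starts right after us.
        if (next != free_.end() && addr + size == next->first) {
            size += next->second;
            next = free_.erase(next);
        }
        // Merge with the preceding free block if it ends right before us.
        if (next != free_.begin()) {
            auto prev = std::prev(next);
            if (prev->first + prev->second == addr) {
                prev->second += size;
                return true;
            }
        }
        free_[addr] = size;
        return true;
    }

private:
    std::map<std::uint64_t, std::uint64_t> free_;  // start address -> size
    std::map<std::uint64_t, std::uint64_t> used_;  // start address -> size
};
```

Because freed neighbors are merged, a sequence of many small allocations followed by their release leaves the full range available again for a large buffer.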

Implement alltoall

Provide alltoall collective in ACCL. Explore ring-based implementation for fanin==1 and broadcast-based implementations for fanin >= 1.
