ERROR: task loop loop thread failed: Got nonfinite for policy sum about katago HOT 11 OPEN

ntkylin commented on September 15, 2024

ERROR: task loop loop thread failed: Got nonfinite for policy sum

from katago.

Comments (11)

jojobm commented on September 15, 2024

Maybe #900 will help you?

lightvector commented on September 15, 2024

@ntkylin - is this the same machine or GPU as in #908?

It might be the case that your GPU has a heating, power, or other internal issue that causes it to malfunction or produce incorrect values under sustained high load. In that case, I would advise simply not running contribute, at least to maintain GPU health, but also because if the error checks in contribute fail to catch bad data and some slips through, it might not be great for KataGo training.

Have you ever seen the same error happen when doing extended amounts of analysis using KataGo just for personal use, when not running contribute?

ntkylin commented on September 15, 2024

@lightvector, yes, this issue is on the same GPU as in #908.

I haven't used the GPU for personal use such as analyzing my own games, only for contribution. Last year the dGPU ran the contribution routine continuously and successfully for over 3 months, without any overclocking or raised power limits.

One other question: is there any way to run only training games and not rating games?

lightvector commented on September 15, 2024

Yes, that's possible, you can set maxRatingMatches to 0. It's not advertised heavily because if everyone does it, we won't have any rating games.

How are things going now, and how many times have errors reoccurred?

If your GPU continues to fail often, I would request that you not continue to run contribute with it. And if you don't find a way to run the GPU that fundamentally prevents the problem, I would request that you not try workarounds such as putting a loop around the contribute process to restart it when it fails.

This is because I'm concerned that in some proportion of games the failure may not be bad enough to crash the process but may still produce incorrect results and bad moves, and that this might be harmful to training. There was at least one case of a user in the past whose GPU commonly produced bad data, and we eventually had to filter all of their data out.

ntkylin commented on September 15, 2024

@lightvector, after installing the latest driver on Ubuntu 22.04, the contribution routine runs stably; so far it has held up for over 30 hours and about 800 training games. But on Windows 11 it cannot get as far as 100 games. It seems there are some issues with AMD's Windows driver.

It should still be tested for a longer period under Ubuntu. Could you please show me how to check the quality of the training games, or how to identify incorrect or bad moves in the training results? Thanks in advance.

lightvector commented on September 15, 2024

Try using KataGo a lot for personal analysis while the GPU is also otherwise under heavy load. See if it suggests nonsensical moves occasionally or if the eval swings a lot or other things like that. Also you could try running exclusively a few rating games for contribute and see if the games look normal compared to other rating games.

This might not be easy, and might require expertise at Go (I'm not sure what your personal experience level is) and it's still possible that if errors are rare on a per-move basis, you might not be able to tell, yet still the games may be harmful for training.

Based on your observation, I would recommend you entirely avoid running it on Windows, since it seems to fail after a short time there. Please do NOT do further testing via contributing any training games on Windows, so as to avoid the risk of contaminating the training data even while testing.

lightvector commented on September 15, 2024

Anyways, if it continues to seem stable under the Ubuntu driver, feel free to keep running it there. And of course, thanks again for raising the issue and working through all the trouble and for generously contributing the compute power to self-play in the first place. Let me know if there is anything else I can help with!

lightvector commented on September 15, 2024

By the way, if you do want to keep testing other parameters and driver configurations on Windows, and you need a convenient way to use KataGo to heavily load the GPU outside of live contribute, one option besides personal analysis is to have KataGo play a ton of games locally. See match_example.cfg (https://github.com/lightvector/KataGo/blob/master/cpp/configs/match_example.cfg), which is included with precompiled releases and is also in the repo. Set numGameThreads to a decently large number, set the other settings below it to configure which neural net model file to use, the rules and board sizes, etc., then run it (via ./katago match) and it will play a bunch of games locally and write out the sgfs.

ntkylin commented on September 15, 2024

Thanks for your help!
I have stopped contributing and am trying to check whether the GPU has any damage. With the attached match_example.cfg, the run failed to continue, as shown below:

ntkylin@fest:~/katago-v1.14.1-opencl-linux-x64$ ./katago match -config match_example.cfg -log-file match.log -sgf-output-dir ./temp
2024-03-13 14:37:27+0800: Running with following config:
allowResignation = true
bSizeRelProbs = 90,5,5
bSizes = 19,13,9
botName = FOO
chosenMoveTemperature = 0.20
chosenMoveTemperatureEarly = 0.60
handicapCompensateKomiProb = 1.0
handicapProb = 0.0
hasButtons = false,true
koRules = SIMPLE,POSITIONAL,SITUATIONAL
komiAuto = True
logGamesEvery = 50
logMoves = false
logSearchInfo = false
logToStdout = true
maxMovesPerGame = 1200
maxVisits = 500
multiStoneSuicideLegals = false,true
nnCacheSizePowerOfTwo = 21
nnMaxBatchSize = 32
nnModelFile = PATH_TO_MODEL
nnMutexPoolSizePowerOfTwo = 17
nnRandomize = true
numBots = 1
numGameThreads = 16
numGamesTotal = 100
numNNServerThreadsPerModel = 1
numSearchThreads = 16
resignConsecTurns = 6
resignThreshold = -0.95
scoringRules = AREA,TERRITORY
taxRules = NONE,SEKI,ALL

2024-03-13 14:37:27+0800: Match Engine starting...
2024-03-13 14:37:27+0800: Git revision: f2dc582f98a79fefeb11b2c37de7db0905318f4f
2024-03-13 14:37:27+0800: Loaded neural net
terminate called after throwing an instance of 'StringError'
  what():  MatchPairer: no matchups specified
Aborted (core dumped)

So what should I do then?
match_example.cfg.txt

ntkylin commented on September 15, 2024

@lightvector, by the way, could you please show me how to record errors to a log file when they break the routine? Are there any command parameters for that?

lightvector commented on September 15, 2024

Sorry for the confusion. It looks like there's an oversight in the match code: right now it requires at least 2 bots to function, and it has no special case for playing a single bot against itself.

So you can fix this by having
numBots=2
botName0 = A
botName1 = B

And then A and B should be able to play each other, with identical parameters.
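
Since the single-bot case currently aborts with "MatchPairer: no matchups specified", a quick pre-flight check before launching can catch it early. The following is only a sketch: `has_two_bots` is a hypothetical helper, not part of KataGo, and it assumes the config uses plain `key = value` lines with no included files.

```shell
# Hypothetical helper, not part of KataGo: succeeds only if the last
# "numBots = N" line in the given config file has N >= 2.
# Assumes plain "key = value" lines; included files are not followed.
has_two_bots() {
  n=$(sed -n 's/^[[:space:]]*numBots[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1)
  [ -n "$n" ] && [ "$n" -ge 2 ]
}

# Usage, with the command line from this thread:
# has_two_bots match_example.cfg && ./katago match -config match_example.cfg -log-file match.log -sgf-output-dir ./temp
```

Commented-out lines such as `# numBots = 2` are ignored, because the pattern only allows whitespace before the key.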

If you're asking about how to make files that also contain stderr, take a look at https://askubuntu.com/questions/868335/how-do-i-save-a-shells-stderr-and-stdout-to-a-file-while-still-having-it-output for example. Does that answer your question? Note that it will need to be a separate txt file from the log that KataGo outputs by itself.
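
One common way to capture stderr together with stdout while still seeing both on the terminal is to merge the two streams and pipe them through `tee`. This is a minimal sketch: `run_logged` and the file name `console.txt` are made up for illustration, and the file should be different from the one passed to `-log-file`.

```shell
# Hypothetical helper: run a command, show its stdout and stderr on the
# terminal, and append both to a file.
# Note: the exit status returned is tee's, not the command's.
run_logged() {
  outfile=$1
  shift
  "$@" 2>&1 | tee -a "$outfile"
}

# Usage, with the command line from this thread:
# run_logged console.txt ./katago match -config match_example.cfg -log-file match.log -sgf-output-dir ./temp
```

With this, a message such as the `terminate called after throwing an instance of 'StringError'` abort above, which is printed to the console and may not reach KataGo's own log, would end up in `console.txt`.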
