Comments (11)
Maybe #900 will help you?
from katago.
@ntkylin - is this the same machine or GPU as in #908?
It might be the case that your GPU has a heating, power, or other internal issue that causes it to malfunction or produce incorrect values under sustained high load. In that case, I would advise simply not running contribute, at least to maintain GPU health, but also because if the error check in contribute fails to catch bad data and some slips through, it might not be great for KataGo training.
Have you ever seen the same error happen when doing extended amounts of analysis using KataGo just for personal use, when not running contribute?
from katago.
@lightvector, yes, this issue is the same GPU as used in #908.
I haven't used the GPU for personal use such as analyzing my own games, only for contribution. The dGPU ran the contribution routine successfully and continuously for over 3 months last year, without any overclocking or power-limit increase.
One other question: is there any way to run only training games, not rating games?
from katago.
Yes, that's possible: you can set maxRatingMatches to 0. It's not advertised heavily because if everyone does it, we won't have any rating games.
How are things going now, and how many times have errors recurred?
If your GPU continues to fail commonly, I would request that you not continue to attempt to run contribute using it. And if you don't find a way to run the GPU that fundamentally prevents the problem, I would request that you not try workarounds such as putting a loop around the contribute process to restart it when it fails.
I'm concerned that this may also result in some proportion of games where the failure isn't bad enough to crash the process but does produce incorrect results and bad moves, which might be harmful to training. There was at least one case in the past of a user whose GPU commonly produced bad data, and we eventually had to filter all of their data out.
from katago.
@lightvector, after installing the latest driver on Ubuntu 22.04, the contribution routine runs stably; at the moment it has held up for over 30 hours and about 800 training games. On Windows 11, however, it cannot get as far as 100 games. It seems that there are some issues with AMD's Windows driver.
It should still be tested for a longer period under Ubuntu. Could you please show me how to check the quality of the training games, or how to identify incorrect or bad moves in them? Thanks in advance.
from katago.
Try using KataGo a lot for personal analysis while the GPU is also otherwise under heavy load. See if it suggests nonsensical moves occasionally or if the eval swings a lot or other things like that. Also you could try running exclusively a few rating games for contribute and see if the games look normal compared to other rating games.
This might not be easy, and might require expertise at Go (I'm not sure what your personal experience level is) and it's still possible that if errors are rare on a per-move basis, you might not be able to tell, yet still the games may be harmful for training.
Based on your observation, I would recommend you entirely avoid running it on Windows, since it seems to fail after a short time there. Please do NOT do further testing via contributing any training games on Windows, so as to avoid the risk of contaminating the training data even while testing.
from katago.
Anyways, if it continues to seem stable under the Ubuntu driver, feel free to keep running it there. And of course, thanks again for raising the issue and working through all the trouble and for generously contributing the compute power to self-play in the first place. Let me know if there is anything else I can help with!
from katago.
By the way, if you do want to keep testing other parameters and driver configurations on Windows, and you need a convenient way to use KataGo to heavily load the GPU outside of live contribute, one option besides personal analysis is to use KataGo to play a ton of games locally. See match_example.cfg, which is included with precompiled releases or in the repo here: https://github.com/lightvector/KataGo/blob/master/cpp/configs/match_example.cfg. Set numGameThreads to a decently large number and set the other settings below it to configure which neural net model file to use, the rules and board sizes, etc. Then you can run it (via ./katago match) and it will play a bunch of games locally and write out the SGFs.
from katago.
Thanks for your help!
I have stopped contributing and am trying to check whether the GPU has suffered any damage. With the attached match_example.cfg, the run failed to continue, as shown below:
ntkylin@fest:~/katago-v1.14.1-opencl-linux-x64$ ./katago match -config match_example.cfg -log-file match.log -sgf-output-dir ./temp
2024-03-13 14:37:27+0800: Running with following config:
allowResignation = true
bSizeRelProbs = 90,5,5
bSizes = 19,13,9
botName = FOO
chosenMoveTemperature = 0.20
chosenMoveTemperatureEarly = 0.60
handicapCompensateKomiProb = 1.0
handicapProb = 0.0
hasButtons = false,true
koRules = SIMPLE,POSITIONAL,SITUATIONAL
komiAuto = True
logGamesEvery = 50
logMoves = false
logSearchInfo = false
logToStdout = true
maxMovesPerGame = 1200
maxVisits = 500
multiStoneSuicideLegals = false,true
nnCacheSizePowerOfTwo = 21
nnMaxBatchSize = 32
nnModelFile = PATH_TO_MODEL
nnMutexPoolSizePowerOfTwo = 17
nnRandomize = true
numBots = 1
numGameThreads = 16
numGamesTotal = 100
numNNServerThreadsPerModel = 1
numSearchThreads = 16
resignConsecTurns = 6
resignThreshold = -0.95
scoringRules = AREA,TERRITORY
taxRules = NONE,SEKI,ALL
2024-03-13 14:37:27+0800: Match Engine starting...
2024-03-13 14:37:27+0800: Git revision: f2dc582f98a79fefeb11b2c37de7db0905318f4f
2024-03-13 14:37:27+0800: Loaded neural net
terminate called after throwing an instance of 'StringError'
what(): MatchPairer: no matchups specified
Aborted (core dumped)
So what should I do then?
match_example.cfg.txt
from katago.
@lightvector, by the way, could you please show me how to record errors in a log file when they break the routine? Are there any command-line parameters for that?
from katago.
Sorry for the confusion - it looks like there's an oversight in the match code: right now it requires at least 2 bots to function and has no special case for playing a single bot against itself.
So you can fix this by having
numBots=2
botName0 = A
botName1 = B
And then A and B should be able to play each other, with identical parameters.
If you're asking about how to make files that also contain stderr, take a look at https://askubuntu.com/questions/868335/how-do-i-save-a-shells-stderr-and-stdout-to-a-file-while-still-having-it-output , for example. Does that answer your question? Note that it will need to be a separate text file from the log that KataGo outputs by itself.
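As a rough illustration of the stderr capture described above, here is a minimal shell sketch. The helper name run_logged and the file name console_output.txt are hypothetical and not part of KataGo; the helper only relies on standard shell redirection and tee.

```shell
#!/bin/sh
# Hypothetical helper (not part of KataGo): run any command, show its
# output live in the terminal, and save stdout and stderr together in a file.
run_logged() {
    logfile=$1
    shift
    # 2>&1 merges stderr into stdout; tee prints the merged stream
    # and writes a copy to "$logfile" (overwriting any previous contents).
    "$@" 2>&1 | tee "$logfile"
}

# Example with the command from this thread (console_output.txt is an assumed name,
# kept separate from the match.log that KataGo writes by itself):
# run_logged console_output.txt ./katago match -config match_example.cfg -log-file match.log -sgf-output-dir ./temp
```

With this, a message such as "terminate called after throwing an instance of 'StringError'" should end up in console_output.txt even though it is printed on stderr. Note that the exit status reported is that of tee, not of the wrapped command.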
from katago.