Comments (19)
In other words, clCreateBuffer succeeds even for a size beyond get_global_memory_size(gpu_id), but then the kernel invocation fails later.
Nvidia supports a "Unified Memory" (spanning the CPU's and GPU's memories) that you can use with CUDA (and OpenCL too). On my laptop GPU we get the best performance using all GPU memory and a little of system memory. I am not sure what the specific requirements for this "feature" are, and GPU memory management in general is very finicky. I think it changes quite a lot with drivers.
from john.
Why would that be, @alainesp?
It is a problem with a very small MAX_KEYS_PER_CRYPT in this code:
// TODO: Optimize GWS/LWS, or select a safe bet
size_t gws_copy[] = {MAX_KEYS_PER_CRYPT * 2, lanes};
size_t lws_copy[] = {64, 1};
MAX_KEYS_PER_CRYPT in your example with GWS=512 is GWS / THREADS_PER_LANE / lanes = 16. So gws_copy[0] == 32 while lws_copy[0] == 64, and a global work size smaller than (and not a multiple of) the local work size is invalid.
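As a toy check of those numbers (a host-side sketch; THREADS_PER_LANE = 32 is an assumption here, matching the usual one-warp-per-lane layout):

```c
#include <stddef.h>

/* Toy check of the numbers above. THREADS_PER_LANE = 32 is an assumption
 * (one warp per Argon2 lane), matching the usual value in this format. */
enum { THREADS_PER_LANE = 32 };

/* MAX_KEYS_PER_CRYPT as derived above: GWS / THREADS_PER_LANE / lanes */
static size_t max_keys_per_crypt(size_t gws, size_t lanes)
{
	return gws / THREADS_PER_LANE / lanes;
}

/* gws_copy[0] for the copy kernel: MAX_KEYS_PER_CRYPT * 2 */
static size_t copy_gws(size_t gws, size_t lanes)
{
	return max_keys_per_crypt(gws, lanes) * 2;
}
```

With GWS=512 and one lane this gives 16 keys and gws_copy[0] == 32, below the hardcoded lws_copy[0] == 64.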
We can do something like:
size_t lws_copy[] = {MIN(get_kernel_preferred_multiple(gpu_id, pre_processing_kernel), MAX_KEYS_PER_CRYPT * 2 * lanes), 1};
but we have the same problem if we increase memory beyond the minimum LWS supported (is there an OpenCL minimum LWS?). The better option is to add a param to pre_processing_kernel that caps the number of keys to process.
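One way to realize that: pad the global size up to a multiple of the LWS and pass the real key count as a kernel argument, with out-of-range work items bailing out. A plain-C stand-in for the kernel (illustrative names, not the actual code):

```c
#include <stddef.h>

/* Plain-C stand-in for the suggested fix: pass the real key count to the
 * pre-processing step and let out-of-range work items bail out, so the
 * global size can be padded up to any LWS. Names are illustrative only. */
static size_t preprocess(size_t padded_gws, size_t max_keys, int *processed)
{
	size_t i, done = 0;

	for (i = 0; i < padded_gws; i++) {
		size_t key = i / 2;    /* two work items per key, as in gws_copy */
		if (key >= max_keys)   /* the new guard: skip padding items */
			continue;
		processed[key] = 1;
		done++;
	}
	return done;
}
```

With 16 real keys and the global size padded to 64, only the first 32 work items do real work and the rest are no-ops.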
Thank you for the comments, @alainesp! Do you intend to try and implement some improvements here?
> Do you intend to try and implement some improvements here?
Yes, at least the pre-processing kernel error. The global memory problem, I am not sure.
A solution to the problems mentioned here is at: https://github.com/alainesp/john/tree/argon2_opencl
One issue that may or may not be problematic is that we now use a non-power-of-two MAX_KEYS_PER_CRYPT by default, because we try to use the GPU memory fully and the number reported by OpenCL is a very small amount below a power of two.
Thank you, Alain! How did you test this?
> Thank you, Alain! How did you test this?
Very light testing on my laptop and super.
Testing this on super, it often doesn't actually "use all GPU memory" because the attempted allocation is too close to the maximum and fails, in which case it now uses slightly less than half the memory. This happens for smaller m_cost. With larger m_cost, the rounding is such that the allocation is just far enough from the maximum to succeed.
On Titan Kepler, this reduced the --test speed from ~5k at 4 GiB with the old code to ~4k at ~3 GB with Alain's current code. However, fixing that to allow it to allocate ~6 GB (by using 31/32 of the maximum) also results in a speed of ~4k.
- // Use all GPU memory by default
- MAX_KEYS_PER_CRYPT = get_global_memory_size(gpu_id) / max_memory_size;
+ // Use almost all GPU memory by default
+ MAX_KEYS_PER_CRYPT = get_global_memory_size(gpu_id) * 31 / 32 / max_memory_size;
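The headroom computation can be reproduced host-side as a sanity check (a sketch; the 31/32 factor is the one from the diff above, and the sizes are made up for illustration):

```c
#include <stdint.h>

/* Host-side sketch of the headroom computation: reserve 1/32 of global
 * memory so the big allocation doesn't land too close to the device
 * maximum and fail. The 31/32 factor is the one from the diff above. */
static uint64_t keys_with_headroom(uint64_t global_mem, uint64_t per_key)
{
	return global_mem * 31 / 32 / per_key;
}
```

E.g., a 4 GiB device with 16 MiB per key yields 248 keys instead of the full 256, leaving 128 MiB of slack.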
In general, also on other devices, I am seeing either no speed change or slowdowns, sometimes by up to 3x, when going for maximum memory usage even for low m_cost hashes. So the "bug" where the initial allocation attempt would fail for those is actually beneficial.
I guess the slowdowns are because the "excessive" concurrency hurts GPU cache hit rate when (re-)accessing the previous/reference/current blocks. The way Argon2 is designed, at least the previous block is supposed to still be in cache, and the current block being rewritten should stay in cache during this process (between the read and write). BTW, maybe we could improve performance by using local memory for some of these (or even for their portions).
So it looks like we need to determine MAX_KEYS_PER_CRYPT not only by available memory, but also by available cache. For the latter, it can be something like total device L2 cache size divided by 1, 2, or 3 KiB (for this many Argon2 blocks fitting in cache). Alain, would you try that?
For high m_cost, Alain's changes so far seem to work well.
Hmm, no, my cache theory is at least incomplete. At the MAX_KEYS_PER_CRYPT figures we use, we're still very far from L2 cache size for those blocks. It could be an L1 cache thing, but then testing some more I see that the slowdown on Titan Kepler is actually from it auto-tuning to LWS=32, vs. 64 that it had before. Forcing --lws=64 makes it run fast again, even at ~6 GB.
Alain, would you look into this LWS issue? (Instead of what I had said about the cache.)
> but then testing some more I see that the slowdown on Titan Kepler is actually from it auto-tuning to LWS=32, vs. 64 that it had before. Forcing --lws=64 makes it run fast again, even at ~6 GB. Alain, would you look into this LWS issue? (Instead of what I had said about the cache.)
Yes, the problem is that we autotune LWS only if MAX_KEYS_PER_CRYPT is a power of two. It can be changed to something more general, but I think the simpler option is to use a multiple of two, given that get_kernel_preferred_multiple(gpu_id, kernels[type]) is normally a power of two.
> Yes, the problem is that we autotune LWS only if MAX_KEYS_PER_CRYPT is a power of two.
Why don't we do that in other cases?
> I think the simpler option is to use a multiple of two
Do you mean doubling the LWS value?
> Why don't we do that in other cases?
To solve the problem generally, we would need to factorize MAX_KEYS_PER_CRYPT and then test which combination of these factors gives us the best performance. That may be overkill for a small performance boost.
> I think the simpler option is to use a multiple of two
> Do you mean doubling the LWS value?
No, I mean to test only LWS values that are multiples of two, until we use up all the factors of 2 in MAX_KEYS_PER_CRYPT. That is easy to implement, and MAX_KEYS_PER_CRYPT should have some factors of 2 in many cases.
For non-tiny MAX_KEYS_PER_CRYPT, we can round it down to the previous even number (unless it's already even). Then we should be able to try at least LWS 32 and 64, as well as 128 and on if there happen to be more 2's in the factorization. Would you try that?
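A sketch of that search, assuming THREADS_PER_LANE = 32 as elsewhere in this format and that the LWS candidates are the power-of-two divisors of the total work size (illustrative code, not the actual autotune):

```c
#include <stddef.h>

/* Sketch of the LWS search: round MAX_KEYS_PER_CRYPT down to even, then
 * take the largest power-of-two LWS (up to lws_max) dividing the total
 * work size. THREADS_PER_LANE = 32 as elsewhere in this format. */
enum { THREADS_PER_LANE = 32 };

static size_t best_pow2_lws(size_t max_keys, size_t lws_max)
{
	size_t total, lws = THREADS_PER_LANE;

	max_keys &= ~(size_t)1;               /* round down to even */
	total = max_keys * THREADS_PER_LANE;  /* work items per crypt */
	while (lws * 2 <= lws_max && total % (lws * 2) == 0)
		lws *= 2;                         /* consume another factor of 2 */
	return lws;
}
```

Rounding to even guarantees at least LWS 64 is reachable; extra factors of 2 in MAX_KEYS_PER_CRYPT unlock 128 and beyond.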
> For non-tiny MAX_KEYS_PER_CRYPT, we can round it down to the previous even number (unless it's already even). Then we should be able to try at least LWS 32 and 64, as well as 128 and on if there happen to be more 2's in the factorization. Would you try that?
Yes, I was already trying that. Testing now on super.
> Yes, I was already trying that. Testing now on super.
Testing on super, the change alone doesn't improve performance. What improved performance was to use almost all GPU memory (for real) AND the change. We may want to do something like:
MAX_KEYS_PER_CRYPT = get_global_memory_size(gpu_id)*4/5 / (max_memory_size + ARGON2_PREHASH_DIGEST_LENGTH);
to use only 80% of the GPU memory. The best ratio may be different from 4/5 though; we need to test it.
There is a new version with the discussed changes on the same branch as before. Testing on super, the defaults give the best performance I can easily check, except for TITAN X. There using 9728 MB -> 6589 c/s and now using 11392 MB -> 6221 c/s, so in some cases using most of the GPU memory isn't advantageous.
> except for TITAN X. There using 9728 MB -> 6589 c/s and now using 11392 MB -> 6221 c/s, so in some cases using most of the GPU memory isn't advantageous.
Thanks, Alain! I confirm this, on super's devices other than Titan X Maxwell everything is good now, but on this one at low to moderate m_costs we got slowdowns. In fact, at our test 16 MiB hash $argon2d$v=19$m=16384,t=3,p=1$c2hvcnRfc2FsdA$TLSTPihIo+5F67Y1vJdfWdB9 the slowdown from our currently merged code (which uses 4 GiB by default) is as bad as 2x. I wonder why that would be.
Reading up on Maxwell's caches (which I had read about years ago, but forgot), it has 48 KiB of L1 per SMM and 24 SMMs. We currently auto-tune to LWS=256 GWS=22784. Going from the GWS, that's something like 22784/32/24 = ~30 Argon2 instances per SMM, or ~30 KB per SMM per block. I can see how this can be too much when we're working with the previous/reference/current blocks simultaneously: it's something like 90 KB vs. 48 KiB of cache. However, the slowdown starts to be seen already at much lower GWS - basically, anything above 8192 is slower. 8192 means about 32 KiB per SMM for the 3 blocks, which is still only 2/3 of the 48 KiB cache. I don't see why caching a few more than 3 blocks would matter for Argon2. Maybe it's other demand for the cache, or its limited associativity. I also wonder why similar isn't seen on our other GPUs.
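The back-of-envelope arithmetic above in code form (the 24-SMM and 32-threads-per-instance figures are the ones discussed; integer division rounds the ~30/~32 estimates down slightly):

```c
/* Back-of-envelope from the discussion above: concurrent Argon2 instances
 * per SMM at a given GWS (32 threads per instance, 24 SMMs on this GPU),
 * and their L1 working set assuming 1 KiB per Argon2 block. */
static unsigned instances_per_smm(unsigned gws, unsigned smms)
{
	return gws / 32 / smms;
}

static unsigned working_set_kib(unsigned gws, unsigned smms, unsigned blocks)
{
	return instances_per_smm(gws, smms) * blocks;
}
```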
@alainesp Please send us a PR with your changes. I will then add mine on top of yours. Most likely the below, which works well for me:
+++ b/src/opencl_argon2_fmt_plug.c
@@ -510,13 +510,27 @@ static void reset(struct db_main *db)
//----------------------------------------------------------------------------------------------------------------------------
// Create OpenCL objects
//----------------------------------------------------------------------------------------------------------------------------
- // Use all GPU memory by default
+ // Use almost all GPU memory by default
+ unsigned int warps = 6, limit, target;
if (gpu_amd(device_info[gpu_id])) {
- MAX_KEYS_PER_CRYPT = get_max_mem_alloc_size(gpu_id) / max_memory_size;
+ limit = get_max_mem_alloc_size(gpu_id) / max_memory_size;
} else {
- MAX_KEYS_PER_CRYPT = get_global_memory_size(gpu_id) * 15 / 16 / (max_memory_size + ARGON2_PREHASH_DIGEST_LENGTH);
+ if (gpu_nvidia(device_info[gpu_id])) {
+ unsigned int major = 0, minor = 0;
+ get_compute_capability(gpu_id, &major, &minor);
+ if (major == 5) /* NVIDIA Maxwell */
+ warps = 2;
+ }
+ limit = get_global_memory_size(gpu_id) * 31 / 32 / (max_memory_size + ARGON2_PREHASH_DIGEST_LENGTH);
}
- MAX_KEYS_PER_CRYPT -= MAX_KEYS_PER_CRYPT & (MAX_KEYS_PER_CRYPT > 128 ? 3 : 1); // Make it even or multiple of 4
+ do {
+ target = (get_processors_count(gpu_id) * warps + THREADS_PER_LANE - 1) / THREADS_PER_LANE;
+ } while (target > limit && --warps > 1);
+ if (target > limit)
+ target = limit;
+ if (target > 16)
+ target -= target & (target > 128 ? 3 : 1); // Make it even or multiple of 4
+ MAX_KEYS_PER_CRYPT = target;
// Load GWS from config/command line
opencl_get_user_preferences(FORMAT_NAME);
if (global_work_size && !self_test_running) {
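The targeting loop in the diff can be modeled standalone like this (a host-side sketch with illustrative names; THREADS_PER_LANE = 32, and the processor count is made up for the example):

```c
#include <stddef.h>

/* Standalone sketch of the targeting logic in the diff: aim for a number
 * of warps per processor, backing off while the memory limit is exceeded,
 * then round for LWS-friendliness. This is a host-side model, not the
 * actual format code. */
static size_t pick_max_keys(size_t processors, unsigned warps, size_t limit)
{
	size_t target;

	do {
		target = (processors * warps + 31) / 32;  /* THREADS_PER_LANE = 32 */
	} while (target > limit && --warps > 1);
	if (target > limit)
		target = limit;
	if (target > 16)
		target -= target & (target > 128 ? 3 : 1);  /* even / multiple of 4 */
	return target;
}
```

E.g., with 3072 processors and 6 warps the target is 576 keys; a tight memory limit walks the warp count down before falling back to the limit itself.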
Alain's portion of changes is now merged via #5404.