
Comments (19)

alainesp commented on September 21, 2024

In other words, clCreateBuffer succeeds even for a size beyond get_global_memory_size(gpu_id), but then the kernel invocation fails later.

Nvidia supports "Unified Memory" (spanning CPU and GPU memory) that you can use with CUDA (and OpenCL too). On my laptop GPU we get the best performance using all GPU memory plus a little system memory. I am not sure what the specific requirements for this "feature" are, and GPU memory management in general is very finicky; I think it changes quite a lot between driver versions.
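
As a rough illustration (not code from john), the CUDA side of that Unified Memory oversubscription can look like the sketch below; the 1.5x factor is an arbitrary example, and paging between system and GPU memory only works where the GPU and driver support it:

#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
	size_t free_mem, total_mem;
	cudaMemGetInfo(&free_mem, &total_mem);

	/* Ask for more than the physical GPU memory; with Unified Memory the
	   driver pages between system and GPU memory on demand. */
	void *buf;
	cudaError_t err = cudaMallocManaged(&buf, total_mem + total_mem / 2,
	                                    cudaMemAttachGlobal);
	if (err != cudaSuccess)
		printf("oversubscription unsupported: %s\n", cudaGetErrorString(err));
	else
		cudaFree(buf);
	return 0;
}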


alainesp commented on September 21, 2024

Why would that be, @alainesp?

It is a problem with a very small MAX_KEYS_PER_CRYPT in this code:

// TODO: Optimize GWS/LWS, or select a safe bet
size_t gws_copy[] = {MAX_KEYS_PER_CRYPT * 2, lanes};
size_t lws_copy[] = {64, 1};

With GWS=512 in your example, MAX_KEYS_PER_CRYPT = GWS / THREADS_PER_LANE / lanes = 16. So gws_copy[0]==32 while lws_copy[0]==64, which is invalid: OpenCL requires each global work size to be a multiple of the corresponding local work size.

We can do something like:

size_t lws_copy[] = {MIN(get_kernel_preferred_multiple(gpu_id, pre_processing_kernel), MAX_KEYS_PER_CRYPT * 2 * lanes), 1};

but we have the same problem if memory use per hash grows enough that MAX_KEYS_PER_CRYPT * 2 falls below the minimum supported LWS (is there an OpenCL minimum LWS?). The better option is to add a parameter to pre_processing_kernel giving the maximum number of keys to process.
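
For illustration, a minimal host-side sketch of the failure and of the clamp (queue[] and the exact call site follow john's OpenCL code only by assumption):

/* With GWS=512, THREADS_PER_LANE=32, lanes=1: MAX_KEYS_PER_CRYPT = 16, so
   the first global dimension (32) is smaller than the local one (64). */
size_t gws_copy[] = {MAX_KEYS_PER_CRYPT * 2, lanes}; /* {32, 1} */
size_t lws_copy[] = {64, 1};

/* Fails with CL_INVALID_WORK_GROUP_SIZE: each global work size must be a
   multiple of the corresponding local work size. */
cl_int err = clEnqueueNDRangeKernel(queue[gpu_id], pre_processing_kernel,
                                    2, NULL, gws_copy, lws_copy, 0, NULL, NULL);

/* The MIN() clamp fixes this case but breaks again once
   MAX_KEYS_PER_CRYPT * 2 * lanes drops below any usable LWS, hence the
   proposal to pass a maximum key count into the kernel instead. */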


solardiz commented on September 21, 2024

Thank you for the comments, @alainesp! Do you intend to try and implement some improvements here?


alainesp commented on September 21, 2024

Do you intend to try and implement some improvements here?

Yes, at least the pre-processing kernel error. As for the global memory problem, I am not sure.


alainesp commented on September 21, 2024

A solution to the problems mentioned here is at: https://github.com/alainesp/john/tree/argon2_opencl

One issue that may or may not be problematic is that by default we now use a non-power-of-two MAX_KEYS_PER_CRYPT, because we try to use GPU memory fully and the size reported by OpenCL is a small amount below a power of two.
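
A worked example of that rounding (the byte figures are made up for illustration):

/* A nominally 8 GiB GPU typically reports somewhat less than 2^33 bytes: */
size_t global_mem = 8480882688UL;   /* vs. 8589934592 (8 GiB exactly) */
size_t max_memory_size = 16 << 20;  /* 16 MiB of Argon2 memory per key */

/* 8480882688 / 16777216 = 505 -- not the 512 that a power-of-two
   assumption would give. */
MAX_KEYS_PER_CRYPT = global_mem / max_memory_size;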


solardiz commented on September 21, 2024

Thank you, Alain! How did you test this?


alainesp commented on September 21, 2024

Thank you, Alain! How did you test this?

Very light testing on my laptop and on super.


solardiz commented on September 21, 2024

Testing this on super, it often doesn't actually "use all GPU memory", because the attempted allocation is too close to the maximum and fails, in which case it now uses slightly less than half the memory. This happens for smaller m_cost. With larger m_cost, the rounding is such that the allocation is just far enough from the maximum to succeed.

On Titan Kepler, this reduced the --test speed from ~5k c/s at 4 GiB with the old code to ~4k c/s at ~3 GB with Alain's current code. However, fixing that to allow allocating ~6 GB (by using 31/32 of the maximum) also results in a speed of ~4k.

-       // Use all GPU memory by default
-       MAX_KEYS_PER_CRYPT = get_global_memory_size(gpu_id) / max_memory_size;
+       // Use almost all GPU memory by default
+       MAX_KEYS_PER_CRYPT = get_global_memory_size(gpu_id) * 31 / 32 / max_memory_size;

In general, also on other devices, I am seeing either no speed change or slowdowns, sometimes by up to 3x, when going for maximum memory usage even for low m_cost hashes. So the "bug" where the initial allocation attempt would fail for those is actually beneficial.

I guess the slowdowns are because the "excessive" concurrency hurts GPU cache hit rate when (re-)accessing the previous/reference/current blocks. The way Argon2 is designed, at least the previous block is supposed to still be in cache, and the current block being rewritten should stay in cache during this process (between the read and write). BTW, maybe we could improve performance by using local memory for some of these (or even for their portions).

So it looks like we need to determine MAX_KEYS_PER_CRYPT not only from available memory, but also from available cache. For the latter, it can be something like the total device L2 cache size divided by 1, 2, or 3 KiB (for that many 1 KiB Argon2 blocks per instance fitting in cache). Alain, would you try that?
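
A minimal sketch of that suggestion, assuming CL_DEVICE_GLOBAL_MEM_CACHE_SIZE (which typically reports L2 on GPUs) is a close enough proxy; devices[] and MIN() follow john's OpenCL code by assumption, and the next comment revisits this idea:

/* Limit concurrency so ~3 Argon2 blocks (1 KiB each) per in-flight
   instance fit in the device's global memory cache (typically L2). */
cl_ulong cache_size = 0;
clGetDeviceInfo(devices[gpu_id], CL_DEVICE_GLOBAL_MEM_CACHE_SIZE,
                sizeof(cache_size), &cache_size, NULL);

size_t by_memory = get_global_memory_size(gpu_id) * 31 / 32 / max_memory_size;
size_t by_cache = cache_size / (3 * 1024);
MAX_KEYS_PER_CRYPT = MIN(by_memory, by_cache);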

For high m_cost, Alain's changes so far seem to work well.


solardiz commented on September 21, 2024

Hmm, no, my cache theory is at least incomplete. At the MAX_KEYS_PER_CRYPT figures we use, we're still very far from L2 cache size for those blocks. It could be an L1 cache thing, but then testing some more I see that the slowdown on Titan Kepler is actually from it auto-tuning to LWS=32, vs. 64 that it had before. Forcing --lws=64 makes it run fast again, even at ~6 GB.

Alain, would you look into this LWS issue? (Instead of what I had said about the cache.)


alainesp commented on September 21, 2024

but then testing some more I see that the slowdown on Titan Kepler is actually from it auto-tuning to LWS=32, vs. 64 that it had before. Forcing --lws=64 makes it run fast again, even at ~6 GB.

Alain, would you look into this LWS issue? (Instead of what I had said about the cache.)

Yes, the problem is that we autotune LWS only if MAX_KEYS_PER_CRYPT is a power of two. It can be changed to something more general, but I think the simpler option is to use a multiple of two, given that get_kernel_preferred_multiple(gpu_id, kernels[type]) is normally a power of two.


solardiz commented on September 21, 2024

Yes, the problem is that we autotune LWS only if MAX_KEYS_PER_CRYPT is a power of two.

Why don't we do that in other cases?

I think the simpler option is to use a multiple of two

Do you mean doubling the LWS value?


alainesp commented on September 21, 2024

Why don't we do that in other cases?

To solve the problem generally, we need to factorize MAX_KEYS_PER_CRYPT and then use these factors to test what combination gives us the best performance. That may be overkill for a small performance boost.

I think the simpler option is to use a multiple of two

Do you mean doubling the LWS value?

No, I mean to test only LWS values that are multiples of two, until we use up all the factors of two in MAX_KEYS_PER_CRYPT. That is easy to implement, and MAX_KEYS_PER_CRYPT should have some factors of two in many cases.


solardiz commented on September 21, 2024

For non-tiny MAX_KEYS_PER_CRYPT, we can round it down to the previous even number (unless it's already even). Then we should be able to try at least LWS 32 and 64, as well as 128 and up if there happen to be more 2's in the factorization. Would you try that?
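
A sketch of that tuning loop; time_kernel_at_lws() is a hypothetical stand-in for the format's existing benchmarking:

/* Round down to even so the factorization contains at least one 2. */
if (MAX_KEYS_PER_CRYPT > 16)
	MAX_KEYS_PER_CRYPT &= ~(size_t)1;

size_t gws = MAX_KEYS_PER_CRYPT * THREADS_PER_LANE * lanes;
size_t max_lws = 1024; /* placeholder for CL_DEVICE_MAX_WORK_GROUP_SIZE */
size_t lws, best_lws = 32;
double best_time = DBL_MAX; /* <float.h> */

/* Try LWS = 32, 64, 128, ... for as long as it divides the global size. */
for (lws = 32; lws <= max_lws && gws % lws == 0; lws *= 2) {
	double t = time_kernel_at_lws(lws);
	if (t < best_time) {
		best_time = t;
		best_lws = lws;
	}
}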


alainesp commented on September 21, 2024

For non-tiny MAX_KEYS_PER_CRYPT, we can round it down to the previous even number (unless it's already even). Then we should be able to try at least LWS 32 and 64, as well as 128 and up if there happen to be more 2's in the factorization. Would you try that?

Yes, I was already trying that. Testing now on super.


alainesp commented on September 21, 2024

Yes, I was already trying that. Testing now on super.

Testing on super, the change alone doesn't improve performance. What improved performance was using almost all the GPU memory (for real) AND the change. We may want to do something like:

MAX_KEYS_PER_CRYPT = get_global_memory_size(gpu_id)*4/5 / (max_memory_size + ARGON2_PREHASH_DIGEST_LENGTH);

to use only 80% of the GPU memory. The best ratio may be different from 4/5, though; we need to test it.


alainesp commented on September 21, 2024

There is a new version with the discussed changes on the same branch as before. Testing on super, the defaults now give the best performance I can easily check, except for TITAN X: there, 9728 MB gave 6589 c/s before and 11392 MB gives 6221 c/s now, so in some cases using most of the GPU memory isn't advantageous.


solardiz commented on September 21, 2024

except for TITAN X: there, 9728 MB gave 6589 c/s before and 11392 MB gives 6221 c/s now, so in some cases using most of the GPU memory isn't advantageous.

Thanks, Alain! I confirm this, on super's devices other than Titan X Maxwell everything is good now, but on this one at low to moderate m_costs we got slowdowns. In fact, at our test 16 MiB hash $argon2d$v=19$m=16384,t=3,p=1$c2hvcnRfc2FsdA$TLSTPihIo+5F67Y1vJdfWdB9 the slowdown from our currently merged code (which uses 4 GiB by default) is as bad as 2x. I wonder why that would be.

Reading up on Maxwell's caches (which I had read about years ago, but forgot), it has 48 KiB of L1 per SMM and 24 SMMs. We currently auto-tune to LWS=256 GWS=22784. Going from the GWS, that's 22784/32/24 = ~30 concurrent Argon2 instances per SMM, or ~30 KB per SMM for each block being worked on. I can see how this can be too much when we're working with previous/reference/current blocks simultaneously: it's something like 90 KB vs. 48 KiB of cache. However, the slowdown starts to be seen already at much lower GWS - basically, anything above 8192 is slower. 8192 means about 32 KiB per 3 blocks, which is still only 2/3 of the 48 KiB cache. I don't see why caching a few more than 3 blocks would matter for Argon2. Maybe it's other demand for the cache or its limited associativity. I also wonder why similar isn't seen on our other GPUs.
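
Spelling out that arithmetic with the figures above:

/* Titan X Maxwell: 24 SMMs, 48 KiB L1 each; THREADS_PER_LANE = 32. */
unsigned int per_smm = 22784 / 32 / 24;    /* ~30 instances per SMM */
unsigned int working_set_kb = per_smm * 3; /* 3 x 1 KiB blocks: ~90 KB */

/* At GWS = 8192: 8192 / 32 / 24 = ~11 instances per SMM, about 32 KiB for
   3 blocks, still within the 48 KiB L1, yet the slowdown already shows. */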


solardiz commented on September 21, 2024

@alainesp Please send us a PR with your changes. I will then add mine on top of yours. Most likely the below, which works well for me:

+++ b/src/opencl_argon2_fmt_plug.c
@@ -510,13 +510,27 @@ static void reset(struct db_main *db)
        //----------------------------------------------------------------------------------------------------------------------------
        // Create OpenCL objects
        //----------------------------------------------------------------------------------------------------------------------------
-       // Use all GPU memory by default
+       // Use almost all GPU memory by default
+       unsigned int warps = 6, limit, target;
        if (gpu_amd(device_info[gpu_id])) {
-               MAX_KEYS_PER_CRYPT = get_max_mem_alloc_size(gpu_id) / max_memory_size;
+               limit = get_max_mem_alloc_size(gpu_id) / max_memory_size;
        } else {
-               MAX_KEYS_PER_CRYPT = get_global_memory_size(gpu_id) * 15 / 16 / (max_memory_size + ARGON2_PREHASH_DIGEST_LENGTH);
+               if (gpu_nvidia(device_info[gpu_id])) {
+                       unsigned int major = 0, minor = 0;
+                       get_compute_capability(gpu_id, &major, &minor);
+                       if (major == 5) /* NVIDIA Maxwell */
+                               warps = 2;
+               }
+               limit = get_global_memory_size(gpu_id) * 31 / 32 / (max_memory_size + ARGON2_PREHASH_DIGEST_LENGTH);
        }
-       MAX_KEYS_PER_CRYPT -= MAX_KEYS_PER_CRYPT & (MAX_KEYS_PER_CRYPT > 128 ? 3 : 1); // Make it even or multiple of 4
+       do {
+               target = (get_processors_count(gpu_id) * warps + THREADS_PER_LANE - 1) / THREADS_PER_LANE;
+       } while (target > limit && --warps > 1);
+       if (target > limit)
+               target = limit;
+       if (target > 16)
+               target -= target & (target > 128 ? 3 : 1); // Make it even or multiple of 4
+       MAX_KEYS_PER_CRYPT = target;
        // Load GWS from config/command line
        opencl_get_user_preferences(FORMAT_NAME);
        if (global_work_size && !self_test_running) {


solardiz commented on September 21, 2024

Alain's portion of changes is now merged via #5404.

