Comments (13)

tromp commented on August 28, 2024

Each thread needs an additional 42 MB, which is mostly sizeof(yzbucket).
I don't know why yours would have 70 MB per thread.
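
For intuition, a minimal C++ sketch (hypothetical type and sizes, not the solver's actual yzbucket layout) of why buffer memory scales linearly with thread count: each worker stages edges into private buckets so the hot path needs no locking.

#include <cstdint>
#include <memory>
#include <vector>

// Illustrative stand-in for the solver's per-thread staging buckets.
struct YZBucket {
    uint32_t slots[(42u << 20) / sizeof(uint32_t)]; // ~42 MB of 4-byte slots
};

int main() {
    const int nthreads = 64;
    // One private bucket set per worker: lock-free writes on the hot path,
    // but memory grows linearly -- 64 threads stage ~2.6 GB of buckets here.
    std::vector<std::unique_ptr<YZBucket>> buckets;
    for (int i = 0; i < nthreads; i++)
        buckets.push_back(std::make_unique<YZBucket>());
    // ... thread i would hash its share of edges into *buckets[i] ...
}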

kachind commented on August 28, 2024

Thanks, that's helpful. How do CUDA / OpenCL implementations avoid multiple 42 MB allocations?

I've noticed some other bizarre behavior at high thread counts, particularly that around 32 threads the GPS (graphs per second) stops increasing. If I start two 32-thread instances, the combined GPS is roughly double the GPS of one 64-thread instance. I can open a new issue for that if you like.

tromp commented on August 28, 2024

You have limited memory bandwidth, so once it's exhausted, additional threads just slow things down with increased contention for it.

kachind commented on August 28, 2024

That doesn't explain why two separate instances break the bottleneck, though? It's the same number of total threads.

The Xeon Phi has approximately 500 GiB/s of memory bandwidth to the MCDRAM, so I'm skeptical that's the issue.

tromp commented on August 28, 2024

That's sequential bandwidth. But your accesses are not all that sequential, causing a lot of memory latency as well. So I think of it as exhausting the latency bandwidth.
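
A minimal C++ microbenchmark sketch (not from the solver; sizes are illustrative) makes the distinction concrete: a streaming pass runs at DRAM bandwidth, while a dependent pointer chase over the same buffer is bounded by memory latency and typically runs an order of magnitude slower.

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t N = size_t(1) << 26;          // 64 Mi entries = 512 MiB
    std::vector<uint64_t> a(N);
    std::iota(a.begin(), a.end(), 0);          // a is a permutation of 0..N-1
    std::shuffle(a.begin(), a.end(), std::mt19937_64{42});

    // Sequential pass: hardware prefetchers stream at full DRAM bandwidth.
    auto t0 = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (size_t i = 0; i < N; i++) sum += a[i];
    auto t1 = std::chrono::steady_clock::now();

    // Dependent random pass: each load's address comes from the previous
    // load, so throughput is bounded by memory latency, not bandwidth.
    uint64_t idx = 0;
    for (size_t i = 0; i < N; i++) idx = a[idx];
    auto t2 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> seq = t1 - t0, rnd = t2 - t1;
    // Print sum and idx so the compiler can't optimize the loops away.
    std::printf("sequential %.0f ms, random %.0f ms (%llu %llu)\n",
                seq.count(), rnd.count(),
                (unsigned long long)sum, (unsigned long long)idx);
}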

kachind commented on August 28, 2024

Does this AVX implementation produce much less sequential memory access than the OpenCL/CUDA implementations do? Otherwise I would expect GPUs to suffer from the same bandwidth issues, yet they clearly achieve significantly better performance.

Sorry for all the questions, I'm not familiar with the details of this algorithm.

timolson commented on August 28, 2024

500 GiB/sec of memory bandwidth

Bandwidth numbers are completely misleading unless you adjust for CAS latency. You can only get 500 GiB/s streaming straight through DRAM, with zero random access. Even if the implementation of Cuckoo Cycle caches partition writes into blocks of 16 edges at a time, typical DDR4-3000 will slow down by more than 50%. Using MCDRAM instead of DDR DIMMs does not help with latency at all.
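
The batching being described looks roughly like this C++ sketch (made-up sizes and names, not the solver's actual code): edges are staged per bucket in cache-resident blocks, and DRAM is only touched with full cache-line bursts.

#include <cstdint>
#include <cstring>

constexpr int NBUCKETS   = 256;  // partition by the high 8 bits of the edge
constexpr int BLOCKEDGES = 16;   // 16 x 4 bytes = one 64-byte cache line

struct PartitionWriter {
    uint32_t  block[NBUCKETS][BLOCKEDGES]; // cache-resident staging blocks
    int       fill[NBUCKETS] = {};         // occupancy of each block
    uint32_t* out[NBUCKETS];               // per-bucket DRAM write cursors
                                           // (setup omitted)
    void write(uint32_t edge) {
        int b = edge >> 24;                // bucket index (illustrative)
        block[b][fill[b]++] = edge;
        if (fill[b] == BLOCKEDGES) {
            // Flush one full 64-byte line to DRAM in a single burst.
            std::memcpy(out[b], block[b], sizeof(block[b]));
            out[b] += BLOCKEDGES;
            fill[b] = 0;
        }
    }
};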

That being said, I find it dubious that running two solvers simultaneously would help this situation, unless they are somehow writing to the same pages in RAM. Allocating separate memory for each solver would only exacerbate the latency problem, so something else must be going on.

timolson commented on August 28, 2024

Otherwise I would expect GPUs to suffer from the same bandwidth issues

They absolutely do suffer from this problem. Nvidia GPUs can only batch up to 32 words of 32 bits each. GDDR5 has a much higher transfer rate than DDR4, however, and GPUs compute siphash way faster than a CPU: 2^29 siphashes on a 1080 Ti take about 30-35 ms. I don't have a Xeon Phi, but I'm guessing it takes seconds.
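
For context, the hash being counted here is SipHash-2-4 of a 64-bit edge index under a per-header key, which Cuckoo Cycle uses to derive node endpoints. A scalar C++ sketch of the reference rounds (the solvers run many of these in parallel per AVX or CUDA lane):

#include <cstdint>

// One SipHash round (SIPROUND from the reference implementation).
static inline void sipround(uint64_t& v0, uint64_t& v1,
                            uint64_t& v2, uint64_t& v3) {
    v0 += v1; v1 = (v1 << 13) | (v1 >> 51); v1 ^= v0; v0 = (v0 << 32) | (v0 >> 32);
    v2 += v3; v3 = (v3 << 16) | (v3 >> 48); v3 ^= v2;
    v0 += v3; v3 = (v3 << 21) | (v3 >> 43); v3 ^= v0;
    v2 += v1; v1 = (v1 << 17) | (v1 >> 47); v1 ^= v2; v2 = (v2 << 32) | (v2 >> 32);
}

// SipHash-2-4 of a single 64-bit nonce under key (k0, k1):
// 2 compression rounds plus 4 finalization rounds per hash.
uint64_t siphash24(uint64_t k0, uint64_t k1, uint64_t nonce) {
    uint64_t v0 = k0 ^ 0x736f6d6570736575ULL;
    uint64_t v1 = k1 ^ 0x646f72616e646f6dULL;
    uint64_t v2 = k0 ^ 0x6c7967656e657261ULL;
    uint64_t v3 = k1 ^ 0x7465646279746573ULL ^ nonce;
    sipround(v0, v1, v2, v3); sipround(v0, v1, v2, v3);
    v0 ^= nonce;
    v2 ^= 0xff;
    sipround(v0, v1, v2, v3); sipround(v0, v1, v2, v3);
    sipround(v0, v1, v2, v3); sipround(v0, v1, v2, v3);
    return v0 ^ v1 ^ v2 ^ v3;
}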

tromp commented on August 28, 2024

Siphashes on a Xeon Phi would benefit from AVX512 instructions, but those aren't supported in my solvers, and they would still not make the Phi competitive with GPUs.

kachind commented on August 28, 2024

Yeah, here are the test details. I have to use the c29 algorithm, since otherwise I run out of memory.

(figures are plus or minus a couple of percent, since the solve time varies slightly between graphs)


[[mining.miner_plugin_config]]
plugin_name = "cuckaroo_cpu_avx2_29"
[mining.miner_plugin_config.parameters]
nthreads = 64

GPS: 0.24


[[mining.miner_plugin_config]]
plugin_name = "cuckaroo_cpu_avx2_29"
[mining.miner_plugin_config.parameters]
nthreads = 32
device = 0

[[mining.miner_plugin_config]]
plugin_name = "cuckaroo_cpu_avx2_29"
[mining.miner_plugin_config.parameters]
nthreads = 32
device = 1

Device 1 GPS: 0.18
Device 2 GPS: 0.18


I was hoping for a simple solution, but it sounds like that's unlikely.

In your opinion, if the CUDA implementation could be ported over to OpenMP or similar, would the performance still be poor? The Xeon Phi KNL architecture has L1 and L2 caches, which I had hoped might mitigate the MCDRAM access latency, but if all of the accesses are totally unpredictable/random then it's obviously hopeless, and there's no practical motivation to fix this issue.

kachind commented on August 28, 2024

As an aside... KNL does support a unique family of 512-bit gather-prefetch instructions as part of AVX512-PF, and it's a deeply out-of-order design with 4 threads per physical core.

https://www.felixcloutier.com/x86/vgatherpf1dps:vgatherpf1qps:vgatherpf1dpd:vgatherpf1qpd

This would permit a thread to assemble a full cache line from non-sequential data, provided it's all located in the same 4 KiB page, and the processor would not stall since it could move to a different hyperthread. Is that relevant here? Is data generally dispersed within pages, between pages, or both?
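
For concreteness, a C++ sketch of what using those instructions might look like via intrinsics (hypothetical usage, not from the solver; needs a flag like -mavx512pf and only executes on Knights Landing):

#include <immintrin.h>
#include <cstdint>

// Queue prefetches for 16 scattered 32-bit slots with one instruction
// (vgatherpf1dps, the L2-hint form linked above). While the lines are in
// flight, the core can run one of its other three hardware threads.
void prefetch_slots(const uint32_t* base, const int32_t idx[16]) {
    __m512i vindex = _mm512_loadu_si512(idx);
    _mm512_prefetch_i32gather_ps(vindex, base, sizeof(uint32_t), _MM_HINT_T1);
}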

tromp commented on August 28, 2024

So the Phi performs worse than an i7 on the avx2 solver, and even if a Phi expert were to devote significant effort to optimizing the solver, they might quadruple the performance and it would still fall far short of a GPU...

kachind commented on August 28, 2024

Fair enough, closing issue.
