Comments (13)

tromp commented on August 28, 2024

Each thread needs an additional 42 MB, which is mostly sizeof(yzbucket).
I don't know why yours would have 70 MB per thread.
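
For intuition, a minimal C++ sketch (hypothetical type and sizes, not the solver's actual yzbucket layout) of why buffer memory scales linearly with thread count: each worker stages edges into private buckets so the hot path needs no locking.

#include <cstdint>
#include <memory>
#include <vector>

// Illustrative stand-in for the solver's per-thread staging buckets.
struct YZBucket {
    uint32_t slots[(42u << 20) / sizeof(uint32_t)]; // ~42 MB of 4-byte slots
};

int main() {
    const int nthreads = 64;
    // One private bucket set per worker: lock-free writes on the hot path,
    // but memory grows linearly -- 64 threads stage ~2.6 GB of buckets here.
    std::vector<std::unique_ptr<YZBucket>> buckets;
    for (int i = 0; i < nthreads; i++)
        buckets.push_back(std::make_unique<YZBucket>());
    // ... thread i would hash its share of edges into *buckets[i] ...
}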

kachind commented on August 28, 2024

Thanks, that's helpful. How do CUDA / OpenCL implementations avoid multiple 42 MB allocations?

I've noticed some other bizarre behavior at high thread counts, particularly that around 32 threads the GPS (graphs per second) stops increasing. If I start two 32-thread instances, the combined GPS is roughly double the GPS of one 64-thread instance. I can open a new issue for that if you like.

tromp commented on August 28, 2024

You have limited memory bandwidth, so once it's exhausted, additional threads just slow things down with increased contention for it.

kachind commented on August 28, 2024

That doesn't explain why two separate instances break the bottleneck, though? It's the same number of total threads.

The Xeon Phi has approximately 500 GiB/s of memory bandwidth to the MCDRAM, so I'm skeptical that's the issue.

tromp commented on August 28, 2024

That's sequential bandwidth. But your accesses are not all that sequential, causing a lot of memory latency as well. So I think of it as exhausting the latency bandwidth.
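
A minimal C++ microbenchmark sketch (not from the solver; sizes are illustrative) makes the distinction concrete: a streaming pass runs at DRAM bandwidth, while a dependent pointer chase over the same buffer is bounded by memory latency and typically runs an order of magnitude slower.

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const size_t N = size_t(1) << 26;          // 64 Mi entries = 512 MiB
    std::vector<uint64_t> a(N);
    std::iota(a.begin(), a.end(), 0);          // a is a permutation of 0..N-1
    std::shuffle(a.begin(), a.end(), std::mt19937_64{42});

    // Sequential pass: hardware prefetchers stream at full DRAM bandwidth.
    auto t0 = std::chrono::steady_clock::now();
    uint64_t sum = 0;
    for (size_t i = 0; i < N; i++) sum += a[i];
    auto t1 = std::chrono::steady_clock::now();

    // Dependent random pass: each load's address comes from the previous
    // load, so throughput is bounded by memory latency, not bandwidth.
    uint64_t idx = 0;
    for (size_t i = 0; i < N; i++) idx = a[idx];
    auto t2 = std::chrono::steady_clock::now();

    std::chrono::duration<double, std::milli> seq = t1 - t0, rnd = t2 - t1;
    // Print sum and idx so the compiler can't optimize the loops away.
    std::printf("sequential %.0f ms, random %.0f ms (%llu %llu)\n",
                seq.count(), rnd.count(),
                (unsigned long long)sum, (unsigned long long)idx);
}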

kachind commented on August 28, 2024

Does this AVX implementation produce much less sequential memory access than the OpenCL/CUDA implementations do? Otherwise I would expect GPUs to suffer from the same bandwidth issues, yet they clearly achieve significantly better performance.

Sorry for all the questions, I'm not familiar with the details of this algorithm.

timolson commented on August 28, 2024

500 GiB/sec of memory bandwidth

Bandwidth numbers are completely misleading unless you adjust for CAS latency. You can only get 500 GiB/s streaming straight through DRAM, with zero random access. Even if the implementation of Cuckoo Cycle caches partition writes into blocks of 16 edges at a time, typical DDR4-3000 will slow down by more than 50%. Using MCDRAM instead of DDR DIMMs does not help with latency at all.
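
The batching being described looks roughly like this C++ sketch (made-up sizes and names, not the solver's actual code): edges are staged per bucket in cache-resident blocks, and DRAM is only touched with full cache-line bursts.

#include <cstdint>
#include <cstring>

constexpr int NBUCKETS   = 256;  // partition by the high 8 bits of the edge
constexpr int BLOCKEDGES = 16;   // 16 x 4 bytes = one 64-byte cache line

struct PartitionWriter {
    uint32_t  block[NBUCKETS][BLOCKEDGES]; // cache-resident staging blocks
    int       fill[NBUCKETS] = {};         // occupancy of each block
    uint32_t* out[NBUCKETS];               // per-bucket DRAM write cursors
                                           // (setup omitted)
    void write(uint32_t edge) {
        int b = edge >> 24;                // bucket index (illustrative)
        block[b][fill[b]++] = edge;
        if (fill[b] == BLOCKEDGES) {
            // Flush one full 64-byte line to DRAM in a single burst.
            std::memcpy(out[b], block[b], sizeof(block[b]));
            out[b] += BLOCKEDGES;
            fill[b] = 0;
        }
    }
};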

That being said, I find it dubious that running two solvers simultaneously would help this situation, unless they are somehow writing to the same pages in RAM. Allocating separate memory for each solver would only exacerbate the latency problem, so something else must be going on.

timolson commented on August 28, 2024

Otherwise I would expect GPUs to suffer from the same bandwidth issues

They absolutely do suffer from this problem. Nvidia GPUs can only batch up to 32 words of 32 bits each. GDDR5 has a much higher transfer rate than DDR4, however, and GPUs compute siphash way faster than a CPU: 2^29 siphashes on a 1080 Ti take about 30-35 ms. I don't have a Xeon Phi, but I'm guessing it takes seconds.
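
For context, the hash being counted here is SipHash-2-4 of a 64-bit edge index under a per-header key, which Cuckoo Cycle uses to derive node endpoints. A scalar C++ sketch of the reference rounds (the solvers run many of these in parallel per AVX or CUDA lane):

#include <cstdint>

// One SipHash round (SIPROUND from the reference implementation).
static inline void sipround(uint64_t& v0, uint64_t& v1,
                            uint64_t& v2, uint64_t& v3) {
    v0 += v1; v1 = (v1 << 13) | (v1 >> 51); v1 ^= v0; v0 = (v0 << 32) | (v0 >> 32);
    v2 += v3; v3 = (v3 << 16) | (v3 >> 48); v3 ^= v2;
    v0 += v3; v3 = (v3 << 21) | (v3 >> 43); v3 ^= v0;
    v2 += v1; v1 = (v1 << 17) | (v1 >> 47); v1 ^= v2; v2 = (v2 << 32) | (v2 >> 32);
}

// SipHash-2-4 of a single 64-bit nonce under key (k0, k1):
// 2 compression rounds plus 4 finalization rounds per hash.
uint64_t siphash24(uint64_t k0, uint64_t k1, uint64_t nonce) {
    uint64_t v0 = k0 ^ 0x736f6d6570736575ULL;
    uint64_t v1 = k1 ^ 0x646f72616e646f6dULL;
    uint64_t v2 = k0 ^ 0x6c7967656e657261ULL;
    uint64_t v3 = k1 ^ 0x7465646279746573ULL ^ nonce;
    sipround(v0, v1, v2, v3); sipround(v0, v1, v2, v3);
    v0 ^= nonce;
    v2 ^= 0xff;
    sipround(v0, v1, v2, v3); sipround(v0, v1, v2, v3);
    sipround(v0, v1, v2, v3); sipround(v0, v1, v2, v3);
    return v0 ^ v1 ^ v2 ^ v3;
}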

tromp commented on August 28, 2024

Siphashes on a Xeon Phi would benefit from AVX512 instructions, but those aren't supported in my solvers, and they would still not make the Phi competitive with GPUs.

kachind commented on August 28, 2024

Yeah, here are the test details. I have to use the c29 algorithm, since otherwise I run out of memory.

(figures are plus or minus a couple of percent, since the solve time varies slightly between graphs)


[[mining.miner_plugin_config]]
plugin_name = "cuckaroo_cpu_avx2_29"
[mining.miner_plugin_config.parameters]
nthreads = 64

GPS: 0.24


[[mining.miner_plugin_config]]
plugin_name = "cuckaroo_cpu_avx2_29"
[mining.miner_plugin_config.parameters]
nthreads = 32
device = 0

[[mining.miner_plugin_config]]
plugin_name = "cuckaroo_cpu_avx2_29"
[mining.miner_plugin_config.parameters]
nthreads = 32
device = 1

Device 1 GPS: 0.18
Device 2 GPS: 0.18


I was hoping for a simple solution, but it sounds like that's unlikely.

In your opinion, if the CUDA implementation could be ported over to OpenMP or similar, would the performance still be poor? The Xeon Phi KNL architecture has L1 and L2 caches, which I had hoped might mitigate the MCDRAM access latency, but if all of the accesses are totally unpredictable/random then it's obviously hopeless, and there's no practical motivation to fix this issue.

kachind commented on August 28, 2024

As an aside... KNL does support a unique family of 512-bit gather-prefetch instructions as part of AVX512-PF, and it's a deeply out-of-order design with 4 threads per physical core.

https://www.felixcloutier.com/x86/vgatherpf1dps:vgatherpf1qps:vgatherpf1dpd:vgatherpf1qpd

This would permit a thread to assemble a full cache line from non-sequential data, provided it's all located in the same 4 KiB page, and the processor would not stall since it could move to a different hyperthread. Is that relevant here? Is data generally dispersed within pages, between pages, or both?
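
For concreteness, a C++ sketch of what using those instructions might look like via intrinsics (hypothetical usage, not from the solver; needs a flag like -mavx512pf and only executes on Knights Landing):

#include <immintrin.h>
#include <cstdint>

// Queue prefetches for 16 scattered 32-bit slots with one instruction
// (vgatherpf1dps, the L2-hint form linked above). While the lines are in
// flight, the core can run one of its other three hardware threads.
void prefetch_slots(const uint32_t* base, const int32_t idx[16]) {
    __m512i vindex = _mm512_loadu_si512(idx);
    _mm512_prefetch_i32gather_ps(vindex, base, sizeof(uint32_t), _MM_HINT_T1);
}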

tromp commented on August 28, 2024

So the Phi performs worse than an i7 on the avx2 solver, and even if a Phi expert were to devote significant effort to optimizing the solver, they might quadruple the performance and it would still fall far short of a GPU...

kachind commented on August 28, 2024

Fair enough, closing issue.
