
Comments (16)

scott-gray commented on August 17, 2024

If you're looking to benchmark the performance of a long-running kernel, turning off the boost clock will give you the most accurate results:

sudo nvidia-smi -i 1 --auto-boost-default=0

You can also adjust clocks more directly with:

sudo nvidia-smi -i 1 -ac 3505,1392
sudo nvidia-smi -i 1 -ac 3505,1000

This lets you run CUDA at the full memory clock (with and without boost), but I'm not sure that's wise without ECC.

But benchmarking with autoboost enabled is still useful, as you can see exactly where the kernel becomes power-limited. The factor that matters most for power limits is the amount of DDR access, so the more you can keep data in L2 or below, the less power you'll draw (and the easier the chip will be to cool).

I like to set the power limit at 275 with full clocks and benchmark while running this:

sudo nvidia-smi -i 1 -pl 275
nvidia-smi -i 1 --loop-ms=333 --format=csv,noheader --query-gpu=power.draw,clocks.gr,temperature.gpu,fan.speed,clocks_throttle_reasons.sw_power_cap,clocks_throttle_reasons.hw_slowdown

This gives me a really good sense of the power profile of the kernel.

from convnet-benchmarks.

ozabluda commented on August 17, 2024

I like to set the power limit at 275

But Titan-X doesn't allow setting power limits (275 W is the default). The only way I found to meaningfully benchmark Titan-X is to watch the temperature and, once it reaches 84 degrees, crank up the fan manually (through a GUI; the command line has a bug and doesn't work), since the stock BIOS inexplicably won't do it. A hacked BIOS is probably the way to go for multiple Titan-X cards in servers.

Overclocking my Titan-X, I was able to get 10-14% faster timings than those reported by Soumith. Without overclocking, I reproduce his timings to within 1%.


ozabluda commented on August 17, 2024

BTW, see commit d6177f9 for how Soumith does warmup for both nervana and cudnn, and its effect.


soumith commented on August 17, 2024

The thermal behavior on these GPUs is very very interesting.

While benchmarking, I constantly monitor the card to make sure it is at a stable clock rate (and the same clock all across).

Some more interesting things one would want to know:

  • Nervana's Neon kernels can't sustain the boost clock over training time. They actually clock down if you run them for long enough. These kernels push the GPU to an absolute extreme.
  • CuDNN kernels don't push the GPUs this hard overall. They went for lower power draw + FFT instead, to get the same performance.
  • There seem to be power and speed optimizations for special cases of zero. If you send in a uniformly distributed input, it will run slightly slower than an all-zero input. @scott-gray observed the same. This is quite interesting, especially in the context of ReLU nets.

Also, a quote from @scott-gray while we were discussing the benchmarks in an email thread (I wanted to make sure I was doing things right).

The GPU has a very active power sensor and dynamically changes the clock depending on power draw (independent of temperature). This happens on the millisecond time scale. My fprop and bprop kernels run at 7.2 TFLOPS when the input is all ones (or any other low-entropy data). Switching to random data they top out at 6.6 TFLOPS or so. One of the reasons that fp16 is faster (aside from reduced bandwidth) is that after converting to fp32 for compute only 10 bits of the mantissa are populated. You can compare the difference if you truncate the inputs to fp16 and then convert back again to fp32 prior to sending the data to the fp32 kernels.
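The truncation trick Scott describes can be sketched in pure Python (my own hypothetical helper, not code from either library): zero out the 13 low-order mantissa bits of an fp32 value, leaving only the 10 mantissa bits an fp16 would keep (exponent range and rounding are ignored for simplicity).

```python
import struct

def truncate_to_fp16_mantissa(x: float) -> float:
    # Reinterpret the float as its 32-bit pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # fp32 has a 23-bit mantissa, fp16 only 10: clear the extra 13 bits.
    bits &= 0xFFFFE000
    # Reinterpret the masked pattern back as a float.
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(truncate_to_fp16_mantissa(1.0))    # values already exact in fp16 pass through unchanged
print(truncate_to_fp16_mantissa(1 / 3))  # only the top 10 mantissa bits survive
```

Feeding such pre-truncated data to an fp32 kernel lets you measure how much of the fp16 speedup comes from sparsely populated mantissas rather than from bandwidth.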


ozabluda commented on August 17, 2024

There seem to be power and speed optimizations depending on special cases of zero.

Very interesting. This must be in hardware. I wonder if the hardware automatically uses less power when many bits are zero, or if it's an explicit hardware optimization. Maybe all the mantissa bits must be zero for that?

My fprop and bprop kernels run at 7.2 TFlops when the input is all 1 ones (or any other low entropy data).

Really, any other low-entropy data? Or data with lots of (or even all) zero bits? What if the bits are all 1s? Maybe it'll be even slower than random?

Switching to random data they top out at 6.6 Tflops or so.


ozabluda commented on August 17, 2024

FWIW, the following paper shows a data-dependent (all-zero vs. all-one) 7% difference in power consumption for integer matrix multiplication on AVR. Titan-X is likely to show the same effect, IMO, even without explicit hardware optimizations, if those exist at all.

"Data dependent energy modelling for worst case energy consumption analysis" (2015) Pallister et al
http://arxiv.org/abs/1505.03374


scott-gray commented on August 17, 2024

My TitanX defaults to a power limit of 250. For power draw it's the toggling of bits that matters (particularly over long wires). So all ones will be almost as fast as all zeros. It's more random data patterns that draw the most power.

One thing I want to point out is that I'll have a completely new set of kernels out soonish, and these do a much better job of keeping data in L2 and using larger tiles when possible. This keeps the power levels significantly lower allowing the clock to run at full boost. I'll also have everything working at small minibatches across the board. This should make them much easier to scale with multiple gpus.
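Scott's point about bit toggling can be illustrated with a toy model (my own sketch, nothing from the actual kernels): count how many bits flip between consecutive words in a data stream. Constant streams, whether all zeros or all ones, toggle nothing, while random data flips about half of the 32 bits per word.

```python
import random

def toggle_count(words):
    """Total number of bits that flip between consecutive 32-bit words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

rng = random.Random(0)
zeros = [0x00000000] * 1000
ones = [0xFFFFFFFF] * 1000
rand = [rng.getrandbits(32) for _ in range(1000)]

print(toggle_count(zeros))  # 0: no transitions, minimal dynamic power
print(toggle_count(ones))   # 0: a constant high line is as quiet as a constant low one
print(toggle_count(rand))   # ~16 bits flip per word on average
```

In CMOS, dynamic power scales with exactly these transitions, which is consistent with all-ones data running nearly as fast as all-zeros while random data hits the power limit first.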


hughperkins commented on August 17, 2024

My TitanX defaults to a power limit of 250. For power draw it's the toggling of bits that matters (particularly over long wires). So all ones will be almost as fast as all zeros. It's more random data patterns that draw the most power.

You mean, because all 1s or all 0s will basically just carry DC along the cables, while alternating 1s and 0s will start radiating EM radiation?


scott-gray commented on August 17, 2024

I'm not an expert on these matters, but it's clear that the more things are toggling on the chip, the more power it draws. It's also possible that additional power is saved when an all-zero condition is met: portions of the logic might be dynamically disabled. No idea if the GPU does this.


ozabluda commented on August 17, 2024

One thing I want to point out is that I'll have a completely new set of kernels out soonish, and these do a much better job of keeping data in L2 and using larger tiles when possible. This keeps the power levels significantly lower allowing the clock to run at full boost. I'll also have everything working at small minibatches across the board. This should make them much easier to scale with multiple gpus.

Super awesome.

My TitanX defaults to a power limit of 250.

Oops, sorry, my mistake. 250 W is the default; it can be raised to a max of 275 W. Here is how I overclock my Titan-X:

#increase "application clock" and power limit
nvidia-smi -i 0 -pm 1
nvidia-smi -i 0 -ac 3505,1392 --power-limit=275

#enable coolbits
user@dnn1:~$ cat /etc/X11/xorg.conf
[…]
Section "Device"
    Identifier "nvidia"
    Driver "nvidia"
    BusID "PCI:1@0:0:0"
    Option "ConstrainCursor" "off"
    Option "Coolbits" "31"
EndSection
[…]

#set PowerMizer mode to "Prefer Maximum Performance"
DISPLAY=:0 nvidia-settings -a [gpu:0]/GPUPowerMizerMode=1

#overvolt
DISPLAY=:0 nvidia-settings -a [gpu:0]/GPUOverVoltageOffset=112399

#from the nvidia-settings GUI (can't be done from the command line) set "Graphics Clock Offset" to 300 MHz and "Memory Transfer Rate Offset" to 800


gujunli commented on August 17, 2024

Thanks Scott Gray and Soumith for sharing the info. It is very interesting H/W behavior.

On Mon, Dec 7, 2015 at 9:11 PM, Scott Gray [email protected] wrote:

My TitanX defaults to a power limit of 250. For power draw it's the toggling of bits that matters (particularly over long wires). So all ones will be almost as fast as all zeros. It's more random data patterns that draw the most power.

One thing I want to point out is that I'll have a completely new set of kernels out soonish, and these do a much better job of keeping data in L2 and using larger tiles when possible. This keeps the power levels significantly lower allowing the clock to run at full boost. I'll also have everything working at small minibatches across the board. This should make them much easier to scale with multiple gpus.


Reply to this email directly or view it on GitHub
#71 (comment)
.


Junli Gu--谷俊丽
Coordinated Science Lab
University of Illinois at Urbana-Champaign



ozabluda commented on August 17, 2024

It's clear the more things are toggling on the chip the more power it draws. It's also possible that additional power is saved when an all zero condition is met. Portions of the logic might be dynamically disabled. No idea if the gpu does this.

I don't know what NVIDIA does, but a chip can indeed detect a 0.0, 1.0, etc. situation and turn portions off, as described here:
https://en.wikipedia.org/wiki/Clock_gating


hughperkins commented on August 17, 2024

Presumably we need to water-cool the GPUs so they can dissipate 250-300 watts under continuous operation? What is common practice for doing this?


ozabluda commented on August 17, 2024

In a workstation it's a standard thing; you can find it in gamers' forums and magazines, for example http://www.maximumpc.com/a-beginners-guide-to-liquid-cooling/. In a standard rack-mounted server there is no room for it. The NVIDIA DIGITS workstation is air-cooled.


hughperkins commented on August 17, 2024

Nice link and pictures. Thanks! :-)


hughperkins commented on August 17, 2024

Out of curiosity, I had a play with looking at the effect of load on the nimbix instances. I ran https://gist.github.com/hughperkins/6194efd67ad7fcbf5678b1285cc45327 with no arguments (except -gpu 1 for one of the GPUs) on a dual Titan X instance, with one process on each GPU. It just runs vgg model 'a' forward, with a batchsize of 128, on cudnnv4, with cudnn.fastest = true set.

When cold, the forward time was ~0.524

After running for ~10 minutes or so:

  • GPU 1 was stable at 67C, forward time 0.539
  • GPU 2 was stable at 75C, forward time 0.535

In other words:

  • the difference in perf, on these GPUs, between cold and hot was ~2.8% for GPU1 and ~2% for GPU2, which seems fairly small compared to the differences in benchmark results that we're mostly concerned with
  • these GPUs are running pretty cold, nowhere near 85 celsius
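The percentages above follow directly from the quoted timings (assuming the times are in consistent units, e.g. seconds per forward pass):

```python
cold = 0.524      # forward time when cold
gpu1_hot = 0.539  # GPU 1 stable at 67 C
gpu2_hot = 0.535  # GPU 2 stable at 75 C

# Relative slowdown of the hot runs versus the cold run.
slow1 = (gpu1_hot - cold) / cold * 100
slow2 = (gpu2_hot - cold) / cold * 100
print(f"GPU 1 slowdown: {slow1:.2f}%")  # GPU 1: 2.86%
print(f"GPU 2 slowdown: {slow2:.2f}%")  # GPU 2: 2.10%
```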

[screenshot: nvidia-smi output for the two GPUs on the Nimbix instance]

