
Comments (10)

abergeron commented:

This is more of a Theano issue than a libgpuarray issue.

But I'm not sure why this happens exactly. If I had to take a guess, I would say that the transfer back to CPU memory at the end is the culprit. Probably the second device (cuda1) has a slower PCIe bus and thus less bandwidth for the transfer, which slows down the function.

As for the difference between the cuda and gpuarray backends, I will have to investigate.
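
One way to test this guess is Theano's per-function profiler, which reports time per op, including the HostFromGpu transfers. A minimal sketch (the matrix sizes and names are illustrative):

    import numpy
    import theano
    import theano.tensor as T

    x = T.fmatrix('x')
    y = T.fmatrix('y')
    # profile=True records per-op timings for this function
    f = theano.function([x, y], T.dot(x, y), profile=True)

    a = numpy.random.random((1024, 1024)).astype('float32')
    b = numpy.random.random((1024, 1024)).astype('float32')
    for _ in range(100):
        f(a, b)
    f.profile.summary()  # separates GpuDot22 time from HostFromGpu transfer time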


nouiz commented:

Can you do a theano.printing.debugprint(f) for all the compiled versions?
Just to be sure there isn't something strange.
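
For reference, a function of the kind being benchmarked here can be built by pinning shared variables to one context each, following the Theano multi-GPU pattern. This is only a sketch; the names and matrix sizes are illustrative:

    import numpy
    import theano
    import theano.tensor as T

    # dev0/dev1 come from THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1"
    v01 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'), target='dev0')
    v02 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'), target='dev0')
    v11 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'), target='dev1')
    v12 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'), target='dev1')

    # One dot per device; both results are copied back to the host
    f = theano.function([], [T.dot(v01, v02), T.dot(v11, v12)])
    theano.printing.debugprint(f)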


avostryakov commented:

Can you do a theano.printing.debugprint(f) for all the compiled versions?

Yes, I'll send it on Monday; right now I don't have access to the server with the GPUs.

Probably the second device (cuda1) has a slower PCIe bus and thus less bandwidth for the transfer which slows down the function.

I'm sure that both video cards are on the same kind of PCIe bus (x16). Moreover, they can communicate directly with each other; I don't know exactly what it's called, some kind of bridge/link between the GPUs.


abergeron commented:

The SLI bus doesn't matter for compute. It is exclusively used for graphics.

Also, an SLI configuration could mess with compute performance so you might want to try on a machine that doesn't have that.

Unless you were speaking about NVLink, but then I'd wonder how you came back from the future.


avostryakov commented:

I didn't mean SLI. I meant just PCIe communication between the video cards :)


avostryakov commented:

Sorry for the delay. Here is the output of theano.printing.debugprint(f) for all three versions:

  1. THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1", two dot operations on two different GPUs:
     Mapped name dev0 to device cuda0: GeForce GTX TITAN X
     Mapped name dev1 to device cuda1: GeForce GTX TITAN X
     HostFromGpu(gpuarray) [id A] ''   3
      |GpuDot22 [id B] ''   1
      | |<GpuArrayType(float32, (False, False))> [id C]
      | |<GpuArrayType(float32, (False, False))> [id D]
     HostFromGpu(gpuarray) [id E] ''   2
      |GpuDot22 [id F] ''   0
      | |<GpuArrayType(float32, (False, False))> [id G]
      | |<GpuArrayType(float32, (False, False))> [id H]
  2. THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1", two dot operations on a single GPU:
     HostFromGpu(gpuarray) [id A] ''   3
      |GpuDot22 [id B] ''   1
      | |<GpuArrayType(float32, (False, False))> [id C]
      | |<GpuArrayType(float32, (False, False))> [id D]
     HostFromGpu(gpuarray) [id E] ''   2
      |GpuDot22 [id F] ''   0
      | |<GpuArrayType(float32, (False, False))> [id G]
      | |<GpuArrayType(float32, (False, False))> [id H]
  3. THEANO_FLAGS="device=gpu", old GPU backend:
     Using gpu device 0: GeForce GTX TITAN X (CNMeM is enabled)
     HostFromGpu [id A] ''   3
      |GpuDot22 [id B] ''   1
      | |<CudaNdarrayType(float32, matrix)> [id C]
      | |<CudaNdarrayType(float32, matrix)> [id D]
     HostFromGpu [id E] ''   2
      |GpuDot22 [id F] ''   0
      | |<CudaNdarrayType(float32, matrix)> [id G]
      | |<CudaNdarrayType(float32, matrix)> [id H]


nouiz commented:

Did you enable cnmem on the old back-end? If so, can you redo your timing while not timing the first call to the function? The two back-ends cache their cudaMalloc calls differently, and the new back-end does this during the first call.

Other reasons could also make the first call slower, so in any case, can you benchmark while ignoring the first call?
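
A minimal way to benchmark while excluding the first call, assuming f is the compiled function from the sketch above:

    import time

    f()  # warm-up call; the new back-end does its cudaMalloc caching here

    n = 100
    t0 = time.time()
    for _ in range(n):
        f()
    print("avg per call: %.6f s" % ((time.time() - t0) / n))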


avostryakov commented:

Yes, I enabled cnmem for the old backend, and I did as you described. Now the old backend runs at the same speed as the new one on one GPU. The new backend with each dot operation on its own GPU works a little faster, but not by much.
Can you guess why the two parallel operations are not twice as fast?


abergeron commented:

Because the function synchronizes on the transfer to the host, and you do that at the end of each dot in the function.
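
One way around that synchronization is to keep each result on its own device instead of returning it, e.g. with updates into per-device shared variables. A sketch reusing the illustrative multi-GPU setup from above (whether the two GEMMs then actually overlap still depends on the backend's stream handling):

    import numpy
    import theano
    import theano.tensor as T

    def shared_mat(target):
        return theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                             target=target)

    v01, v02, r0 = shared_mat('dev0'), shared_mat('dev0'), shared_mat('dev0')
    v11, v12, r1 = shared_mat('dev1'), shared_mat('dev1'), shared_mat('dev1')

    # No outputs, so no HostFromGpu is inserted; each dot result lands in a
    # shared variable living on the same device as its inputs.
    f = theano.function([], [],
                        updates=[(r0, T.dot(v01, v02)),
                                 (r1, T.dot(v11, v12))])
    f()                         # results stay on the GPUs
    host_copy = r0.get_value()  # transfer only when explicitly requested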


avostryakov commented:

Ok. I'll close this issue.
