
Comments (10)

abergeron commented:

This is more of a Theano issue than a libgpuarray issue.

But I'm not sure why this happens exactly. If I had to take a guess, I would say that the transfer back to CPU memory at the end is the culprit. Probably the second device (cuda1) has a slower PCIe bus and thus less bandwidth for the transfer, which slows down the function.

As for the difference between the cuda and gpuarray backends, I will have to investigate.
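
One way to test this guess is Theano's per-function profiler, which reports time per op, including the HostFromGpu transfers. A minimal sketch (the matrix sizes and names are illustrative):

    import numpy
    import theano
    import theano.tensor as T

    x = T.fmatrix('x')
    y = T.fmatrix('y')
    # profile=True records per-op timings for this function
    f = theano.function([x, y], T.dot(x, y), profile=True)

    a = numpy.random.random((1024, 1024)).astype('float32')
    b = numpy.random.random((1024, 1024)).astype('float32')
    for _ in range(100):
        f(a, b)
    f.profile.summary()  # separates GpuDot22 time from HostFromGpu transfer time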


nouiz commented:

Can you do a theano.printing.debugprint(f) for all the compiled versions?
Just to be sure there isn't something strange.
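
For reference, a function of the kind being benchmarked here can be built by pinning shared variables to one context each, following the Theano multi-GPU pattern. This is only a sketch; the names and matrix sizes are illustrative:

    import numpy
    import theano
    import theano.tensor as T

    # dev0/dev1 come from THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1"
    v01 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'), target='dev0')
    v02 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'), target='dev0')
    v11 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'), target='dev1')
    v12 = theano.shared(numpy.random.random((1024, 1024)).astype('float32'), target='dev1')

    # One dot per device; both results are copied back to the host
    f = theano.function([], [T.dot(v01, v02), T.dot(v11, v12)])
    theano.printing.debugprint(f)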


avostryakov commented:

Can you do a theano.printing.debugprint(f) for all the compiled versions?

Yes, I'll send it on Monday; right now I don't have access to the server with the GPUs.

Probably the second device (cuda1) has a slower PCIe bus and thus less bandwidth for the transfer which slows down the function.

I'm sure that both video cards are on the same kind of PCIe bus (x16). Moreover, they can communicate directly with each other; I don't know exactly what it's called, some kind of bridge/link between the GPUs.


abergeron commented:

The SLI bus doesn't matter for compute. It is exclusively used for graphics.

Also, an SLI configuration could mess with compute performance so you might want to try on a machine that doesn't have that.

Unless you were speaking about NVLink, but then I'd wonder how you came back from the future.


avostryakov commented:

I didn't mean SLI. I meant just PCIe communication between the video cards :)


avostryakov commented:

Sorry for the delay. Here is the output of theano.printing.debugprint(f) for all three versions:

  1. THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1", two dot operations on two different GPUs:
     Mapped name dev0 to device cuda0: GeForce GTX TITAN X
     Mapped name dev1 to device cuda1: GeForce GTX TITAN X
     HostFromGpu(gpuarray) [id A] ''   3
      |GpuDot22 [id B] ''   1
      | |<GpuArrayType(float32, (False, False))> [id C]
      | |<GpuArrayType(float32, (False, False))> [id D]
     HostFromGpu(gpuarray) [id E] ''   2
      |GpuDot22 [id F] ''   0
      | |<GpuArrayType(float32, (False, False))> [id G]
      | |<GpuArrayType(float32, (False, False))> [id H]
  2. THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1", two dot operations on a single GPU:
     HostFromGpu(gpuarray) [id A] ''   3
      |GpuDot22 [id B] ''   1
      | |<GpuArrayType(float32, (False, False))> [id C]
      | |<GpuArrayType(float32, (False, False))> [id D]
     HostFromGpu(gpuarray) [id E] ''   2
      |GpuDot22 [id F] ''   0
      | |<GpuArrayType(float32, (False, False))> [id G]
      | |<GpuArrayType(float32, (False, False))> [id H]
  3. THEANO_FLAGS="device=gpu", old GPU backend:
     Using gpu device 0: GeForce GTX TITAN X (CNMeM is enabled)
     HostFromGpu [id A] ''   3
      |GpuDot22 [id B] ''   1
      | |<CudaNdarrayType(float32, matrix)> [id C]
      | |<CudaNdarrayType(float32, matrix)> [id D]
     HostFromGpu [id E] ''   2
      |GpuDot22 [id F] ''   0
      | |<CudaNdarrayType(float32, matrix)> [id G]
      | |<CudaNdarrayType(float32, matrix)> [id H]


nouiz commented:

Did you enable cnmem on the old back-end? If so, can you redo your timing while not timing the first call to the function? The two back-ends cache their cudaMalloc calls differently, and the new back-end does this during the first call.

Other reasons could also make the first call slower, so in any case, can you benchmark while ignoring the first call?
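
A minimal way to benchmark while excluding the first call, assuming f is the compiled function from the sketch above:

    import time

    f()  # warm-up call; the new back-end does its cudaMalloc caching here

    n = 100
    t0 = time.time()
    for _ in range(n):
        f()
    print("avg per call: %.6f s" % ((time.time() - t0) / n))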


avostryakov commented:

Yes, I enabled cnmem for the old backend, and I did as you described. Now the old backend runs at the same speed as the new one on one GPU. The new backend with each dot operation on its own GPU works a little faster, but not by much.
Can you guess why the two parallel operations are not twice as fast?


abergeron commented:

Because the function synchronizes on the transfer to the host, and you do that at the end of each dot in the function.
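
One way around that synchronization is to keep each result on its own device instead of returning it, e.g. with updates into per-device shared variables. A sketch reusing the illustrative multi-GPU setup from above (whether the two GEMMs then actually overlap still depends on the backend's stream handling):

    import numpy
    import theano
    import theano.tensor as T

    def shared_mat(target):
        return theano.shared(numpy.random.random((1024, 1024)).astype('float32'),
                             target=target)

    v01, v02, r0 = shared_mat('dev0'), shared_mat('dev0'), shared_mat('dev0')
    v11, v12, r1 = shared_mat('dev1'), shared_mat('dev1'), shared_mat('dev1')

    # No outputs, so no HostFromGpu is inserted; each dot result lands in a
    # shared variable living on the same device as its inputs.
    f = theano.function([], [],
                        updates=[(r0, T.dot(v01, v02)),
                                 (r1, T.dot(v11, v12))])
    f()                         # results stay on the GPUs
    host_copy = r0.get_value()  # transfer only when explicitly requested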


avostryakov commented:

Ok. I'll close this issue.
