Comments (10)
This is more of a Theano issue than a libgpuarray issue.
But I'm not sure why this happens exactly. If I had to take a guess, I would say that the transfer back to CPU memory at the end is the culprit. Probably the second device (cuda1) has a slower PCIe bus and thus less bandwidth for the transfer, which slows down the function.
As for the difference between the cuda and gpuarray backends, I will have to investigate.
from libgpuarray.
Can you do a theano.printing.debugprint(f) for all the compiled versions?
Just to be sure there isn't something strange.
> Can you do a theano.printing.debugprint(f) for all the compiled versions?
Yes, I'll provide it on Monday. Right now I don't have access to the server with the GPUs.
> Probably the second device (cuda1) has a slower PCIe bus and thus less bandwidth for the transfer which slows down the function.
I'm sure that both video cards have the same PCIe bus: x16. Moreover, they can communicate directly with each other. I don't know exactly what it is called, some kind of bridge/link between the GPUs.
from libgpuarray.
The SLI bus doesn't matter for compute. It is exclusively used for graphics.
Also, an SLI configuration could mess with compute performance, so you might want to try on a machine that doesn't have one.
Unless you were speaking about NVLink, but then I'd wonder how you came back from the future.
I didn't mean SLI. I meant just PCIe communication between the video cards :)
Sorry for the delay. Here is the output of theano.printing.debugprint(f) for all three versions:
- THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1", two dot operations on two different gpus:
Mapped name dev0 to device cuda0: GeForce GTX TITAN X
Mapped name dev1 to device cuda1: GeForce GTX TITAN X
HostFromGpu(gpuarray) [id A] '' 3
|GpuDot22 [id B] '' 1
|<GpuArrayType(float32, (False, False))> [id C]
|<GpuArrayType(float32, (False, False))> [id D]
HostFromGpu(gpuarray) [id E] '' 2
|GpuDot22 [id F] '' 0
|<GpuArrayType(float32, (False, False))> [id G]
|<GpuArrayType(float32, (False, False))> [id H]
- THEANO_FLAGS="contexts=dev0->cuda0;dev1->cuda1", two dot operations on a single gpu:
HostFromGpu(gpuarray) [id A] '' 3
|GpuDot22 [id B] '' 1
|<GpuArrayType(float32, (False, False))> [id C]
|<GpuArrayType(float32, (False, False))> [id D]
HostFromGpu(gpuarray) [id E] '' 2
|GpuDot22 [id F] '' 0
|<GpuArrayType(float32, (False, False))> [id G]
|<GpuArrayType(float32, (False, False))> [id H]
- THEANO_FLAGS="device=gpu", old gpu backend:
Using gpu device 0: GeForce GTX TITAN X (CNMeM is enabled)
HostFromGpu [id A] '' 3
|GpuDot22 [id B] '' 1
|<CudaNdarrayType(float32, matrix)> [id C]
|<CudaNdarrayType(float32, matrix)> [id D]
HostFromGpu [id E] '' 2
|GpuDot22 [id F] '' 0
|<CudaNdarrayType(float32, matrix)> [id G]
|<CudaNdarrayType(float32, matrix)> [id H]
from libgpuarray.
Did you enable cnmem on the old back-end? If so, can you redo your timing while not timing the first call to the function? The two back-ends cache calls to cudaMalloc in different ways, and the new back-end does this during the first call.
Other reasons could also make the first call slower, so in any case, can you benchmark while ignoring the first call?
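A minimal sketch of that benchmarking methodology in plain Python (the lambda below is a stand-in workload; with Theano you would pass the compiled function `f` instead):

```python
import time

def bench(f, n_warmup=1, n_runs=10):
    """Average the runtime of f(), discarding warm-up calls.

    The first call can carry one-time costs (e.g. the new
    back-end performs its cached cudaMalloc during the first
    call), so it is excluded from the average.
    """
    for _ in range(n_warmup):
        f()                      # warm-up calls: not timed
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        f()
        times.append(time.perf_counter() - t0)
    return sum(times) / len(times)

# Stand-in workload; replace with the compiled Theano function.
avg = bench(lambda: sum(i * i for i in range(10000)))
```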
Yes, I enabled cnmem for the old backend. I did as you described. Now the old backend runs at the same speed as the new one on one GPU. The new backend with each dot operation on its own GPU is a little faster, but not by much.
Can you guess why the two parallel operations are not twice as fast?
Because the function synchronizes on the transfer to the host, and you do that at the end of each dot in the function.
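The effect can be modeled with a toy sketch (plain Python, not Theano code, and the single lock is only an illustrative assumption): the two dot products themselves could overlap, but each branch ends in a host-side synchronization point, so the branches partly serialize and the overall speedup stays below 2x.

```python
import threading

# Toy model of the graph above: two branches that could run in
# parallel, but each must finish with a "transfer to host" step
# that is serialized through a single lock (the sync point).
host_lock = threading.Lock()
results = {}

def branch(name, n):
    partial = sum(range(n))      # stands in for GpuDot22 on one device
    with host_lock:              # stands in for HostFromGpu: serializes
        results[name] = partial

threads = [threading.Thread(target=branch, args=(d, 1000))
           for d in ("dev0", "dev1")]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Both branches complete, but the final steps run one at a time, which is why doubling the devices does not halve the total wall-clock time.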
Ok. I'll close this issue.