maaars / cudpp
Automatically exported from code.google.com/p/cudpp
License: Other
Purpose of code changes on this branch:
Add tridiagonal solvers to cudpp.
When reviewing my code changes, please focus on:
After the review, I'll merge this branch into:
/trunk
Original issue reported on code.google.com by [email protected]
on 13 Feb 2010 at 12:34
Reproduction of the problem:
1. Download files from http://www.ilab.sztaki.hu/~erikbodzsar/cudpp/
2. Compile test.cu
3. Run ./a.out <error.txt
The test program runs a segmented min-scan on the input data contained in
error.txt. CUDPP computes some elements of the result incorrectly (the test
program prints the first wrong element, along with a few preceding and
following elements).
I'm using CUDPP 1.1 on a 64-bit Debian 5.0.2 system, with
CUDA 2.2 and g++/gcc 4.1.3.
Original issue reported on code.google.com by [email protected]
on 5 Aug 2009 at 10:53
Hi,
I just built CUDPP and ran cudpp_testrig, which failed
(all previous tests passed) with:
Running a sort of 1048581 unsigned int key-value pairs
Unordered key[1048576]:4294966923 > key[1048577]:0
Incorrectly sorted value[1048577] (0) 3530798281 != 0
GPU test FAILED
Average execution time: 2.586515 ms
Running a sort of 2097152 unsigned int key-value pairs
Unordered key[1048576]:4294966923 > key[1048577]:0
Incorrectly sorted value[1048577] (0) 3530798281 != 0
GPU test FAILED
Average execution time: 0.000000 ms
Running a sort of 4194304 unsigned int key-value pairs
Unordered key[1048576]:4294966923 > key[1048577]:0
Incorrectly sorted value[1048577] (0) 3530798281 != 0
GPU test FAILED
Average execution time: 0.000000 ms
Running a sort of 8388608 unsigned int key-value pairs
Unordered key[1048576]:4294966923 > key[1048577]:0
Incorrectly sorted value[1048577] (0) 3530798281 != 0
GPU test FAILED
Average execution time: 0.000000 ms
My gpu card is a Tesla C1060.
Original issue reported on code.google.com by [email protected]
on 19 Jul 2009 at 10:07
When CUDPP is in a path that has the name "cudpp" in it twice, for example,
the way I keep branches:
~/src/idav/branches/proj/cudpp/release1.1/cudpp/
cudpp_testrig -rand fails to find its files. This is because cutupPath
starts from the root of the path above and matches the first /cudpp it
encounters. It should instead work backwards up the tree, so that it finds
the closest instance of "startDir" rather than the farthest -- I think
that is what users will expect.
I think the correct way to do this is not with strtok, but with
chdir() to traverse up the tree until either startDir is found or the
root is hit. I find it hard to believe that each OS doesn't have a built-in
function for this, but a quick Google search turns up nothing easy...
This needs to be fixed. However I think we can leave it until after the
release.
Original issue reported on code.google.com by [email protected]
on 29 Jun 2009 at 7:42
What steps will reproduce the problem?
1. Build cudpp release or debug
What is the expected output? What do you see instead?
Expect no errors, warnings, or advisories. Instead I get lots of these:
src/cta/segmented_scan_cta.cu(868): Advisory: Removed dead synchronization intrinsic from function _Z14segmentedScan4If19SegmentedScanTraitsIfL13CUDPPOperator2ELb0ELb0ELb0ELb0ELb1ELb0EEEvPT_PKS3_PKjjS4_PjS9_
Suggested fix:
I realize that removing this __syncthreads() causes failure. I believe
though that the compiler is only removing it from some calls to the
function that includes it, not all. So instead of putting the
syncthreads() inside this function, put it right before the call to the
function, only where it is needed.
Please use labels and text to provide additional information.
Original issue reported on code.google.com by [email protected]
on 17 Jun 2009 at 1:38
The release libcudpp.a on OS X is over 110 MB now. On linux it is over 34 MB.
What is causing this?
Original issue reported on code.google.com by [email protected]
on 25 Jun 2009 at 12:15
As an example, if you pass a handle to a scan plan to cudppSegmentedScan, you
get a segmentation fault. This should instead raise an error at runtime.
This is also related to issue 27.
Original issue reported on code.google.com by [email protected]
on 12 Jul 2009 at 9:58
There is no bug here; this is a feature request.
I wish I had a MEX wrapper that would let me use CUDPP from M-code. This
(CUDA) MEX file could be compiled on first use, and would afterwards allow
people like me, who don't know much C, to use GPGPU very easily in MATLAB.
On my side, sort seems very interesting, provided it returns indices (as
needed for sortrows).
At the moment there are two different toolboxes available for using CUDA in
MATLAB: AccelerEyes' Jacket and GPUmat (from gp-you.org). Neither of them
allows a sortrows, while Jacket's sort, on the other hand, is very slow.
Original issue reported on code.google.com by [email protected]
on 2 Sep 2009 at 2:04
Right now the only way to invoke the random number generator is through the API.
Provide a version which can use device level routines.
Original issue reported on code.google.com by [email protected]
on 19 Aug 2009 at 6:28
Purpose of code changes on this branch:
Review the changes I've made to the 1.1.1 branch before we release it.
When reviewing my code changes, please focus on:
Changes to scan_cta.cu, segmented_scan_cta.cu, radixsort_cta.cu,
radixsort_app.cu, cudpp_util.h, cudpp_globals.h
Original issue reported on code.google.com by [email protected]
on 11 Mar 2010 at 5:16
We get a lot of questions about supported size limitations. We need to
document all limitations in the CUDPP docs.
Original issue reported on code.google.com by [email protected]
on 11 Jun 2009 at 8:22
What steps will reproduce the problem?
1.
2.
3.
What is the expected output? What do you see instead?
What version of the product are you using? On what operating system?
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 12 Aug 2009 at 7:10
What steps will reproduce the problem?
1. Compile test_rand.cu on blaze (with makefile)
What is the expected output? What do you see instead?
I expect a clean build. Instead I get:
[jowens@blaze cudpp_testrig]$ make
test_rand.cu(63): error: pointer to incomplete class type is not allowed
test_rand.cu(64): error: pointer to incomplete class type is not allowed
2 errors detected in the compilation of "/tmp/tmpxft_00005b19_00000000-4_cudpp_testrig.cpp1.ii".
Original issue reported on code.google.com by [email protected]
on 23 Jun 2009 at 1:24
The summary says it all. This is internal code and the fix is easy, but not
crucial.
Original issue reported on code.google.com by [email protected]
on 12 Oct 2009 at 9:30
CUDA 2.2 doesn't support 7.1 or earlier.
Original issue reported on code.google.com by [email protected]
on 16 Jun 2009 at 7:02
Not high priority, but this needs to be fixed.
Original issue reported on code.google.com by [email protected]
on 25 Jun 2009 at 12:26
Hi,
1) cudpp ships with a precompiled libcutil.a, but compiling the example apps
fails because it is in the wrong format. I guess it was compiled for a 32-bit
host, and I am running 64-bit. So the cudpp make, or some other make,
should rebuild it. Running make in common/ fails with:
./../common/inc/cmd_arg_reader.h: In member function ‘const T* CmdArgReader::getArgHelper(const std::string&)’:
./../common/inc/cmd_arg_reader.h:416: error: must #include <typeinfo> before using typeid
./../common/inc/cmd_arg_reader.h:432: error: must #include <typeinfo> before using typeid
src/cmd_arg_reader.cpp: In destructor ‘CmdArgReader::~CmdArgReader()’:
src/cmd_arg_reader.cpp:101: error: must #include <typeinfo> before using typeid
src/cmd_arg_reader.cpp:106: error: must #include <typeinfo> before using typeid
src/cmd_arg_reader.cpp:111: error: must #include <typeinfo> before using typeid
src/cmd_arg_reader.cpp:116: error: must #include <typeinfo> before using typeid
src/cmd_arg_reader.cpp:121: error: must #include <typeinfo> before using typeid
make: *** [obj/release/cmd_arg_reader.cpp_o] Error 1
2) If I change anything in the kernels I have to manually delete the compiled
.o files; make doesn't recreate them automatically.
3) Building testrig fails with
In file included from spmvmult_gold.cpp:13:
sparse.h: In constructor ‘MMMatrix::MMMatrix(unsigned int, unsigned int, unsigned int)’:
sparse.h:46: error: ‘malloc’ was not declared in this scope
spmvmult_gold.cpp: In function ‘void readMatrixMarket(MMMatrix*, const char*)’:
spmvmult_gold.cpp:94: error: ‘exit’ was not declared in this scope
spmvmult_gold.cpp:122: error: ‘qsort’ was not declared in this scope
make: *** [obj/release/spmvmult_gold.cpp_o] Error 1
Original issue reported on code.google.com by [email protected]
on 22 Jul 2009 at 1:38
Added tools.h and tools.cpp into the trunk. Once the code is accepted, I
will update the code in testrig
I checked the file-finding code in the cutil library and that only searches
./data/ and ../../../projects/<executable_name>/data/, which is not general
enough for our purposes.
Based on John's SPMVMult testrig and my rand testrig, I've written two
types of file searching: finding a directory and finding a file. I use the
directory finding to find the data/ directory while it seems that John's
needs one to find a specific filename. The idea for both is that the
function will ascend to a parent directory and do a recursive search down
its children from there.
So, for example, if I were looking for the data directory and I am in, say,
/cudpp/bin/, then I'd call findDir("cudpp", "data", output), where output is
a character array. At the end of the function, output will contain
"../apps/data". Note that the recursive search skips .svn directories
(I don't think any data file would be put in there...).
Ditto for the findFile function.
The code for both file and directory finding uses OS-dependent calls and
libraries. Linux / Mac uses the dirent.h and unistd.h to find the files
while the Windows version uses io.h and direct.h to find the files. Right
now I have only checked the two files tools.cpp and tools.h into the trunk
and once they are accepted I will check in the revised testrig files. I
have already tried the code on Blaze and this does fix Issue 3 (works as
well in Windows on my laptop). I've tried running the code from various
directories and it finds the regression files no sweat.
Stanley
Original issue reported on code.google.com by [email protected]
on 24 Jun 2009 at 12:27
What steps will reproduce the problem?
1. tar -xzvf sort_test.tar.gz
2. cd sort_test
3. make
4. ./testsort 1000000
What is the expected output? What do you see instead?
expected :
before sort
radix sort : 0.00833379 s 1000000 elements
what I see :
before sort
radix sort : 0.00833379 s 1000000 elements
sort error 4 720476 541723
What version of the product are you using? On what operating system?
Using device 0: Quadroplex 2200 S4
Quadroplex 2200 S4; global mem: 4294705152B; compute v1.3; clock: 1296000
kHz
cudpp 1.1.1
CUDA SDK 2.3
on linux 2.6 kernel
$uname -a
Linux tesla 2.6.18-128.1.1.el5 #1 SMP Tue Feb 10 11:36:29 EST 2009 x86_64
x86_64 x86_64 GNU/Linux
$cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 190.53 Wed Dec 9
15:29:46 PST 2009
GCC version: gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)
Please provide any additional information below.
This is a simple test to use cudppSort.
For a small array, it passes the test, but for a large array, it fails.
It also fails in cudpp_testrig as follows.
$./cudpp_testrig -sort -n=1000000
Using device 0: Quadroplex 2200 S4
Quadroplex 2200 S4; global mem: 4294705152B; compute v1.3; clock: 1296000
kHz
Running a sort of 1000000 unsigned int key-value pairs
Unordered key[3]:746051 > key[4]:16173
Incorrectly sorted value[0] (583160) 1153083146 != 460036
GPU test FAILED
Average execution time: 8.024296 ms
1 tests failed
If this is a driver version mismatch, please let me know which driver
version is needed. Thank you.
Original issue reported on code.google.com by [email protected]
on 30 Mar 2010 at 12:06
Attachments:
We have already run into name conflicts in the CUDA SDK samples between
functions used in cudppRadixSort and functions in the samples. Add a cudpp
namespace to avoid this.
Original issue reported on code.google.com by [email protected]
on 12 Oct 2009 at 9:29
In vector_kernel, the function vectorSegmentedAddUniformToRight4 has an
argument "d_minIndices" but the documentation for that argument refers to
an "array of maximum indices". Please make the documentation and argument
name match the functionality.
Original issue reported on code.google.com by [email protected]
on 17 Jun 2009 at 10:48
What steps will reproduce the problem?
1. run cudpp_testrig -all
What is the expected output? What do you see instead?
It should test backward segmented scans (all ops, options), but it doesn't.
Original issue reported on code.google.com by [email protected]
on 17 Jun 2009 at 1:35
Request from a user:
As far as I understand cudpp currently only supports the use of single
precision floating numbers by specifying the data type CUDPP_FLOAT.
Is the support for double precision planned in a future version of
cudpp?
When will such a version be available?
What kind of loss in performance do you expect?
Original issue reported on code.google.com by [email protected]
on 25 Jun 2009 at 10:00
We need to get the faster version of radix sort from the CUDA SDK into
CUDPP. Mark is working on this.
Original issue reported on code.google.com by [email protected]
on 11 Jun 2009 at 8:41
What steps will reproduce the problem?
1. Build and run the satGL sample app
What is the expected output? What do you see instead?
Correct output can be seen by running the device emulation version. In
release or debug builds, instead the results are a green and blue smear.
Original issue reported on code.google.com by [email protected]
on 17 Jun 2009 at 1:57
Just duplicate the Visual Studio 8 projects. Perhaps we should look into
CMake.
Original issue reported on code.google.com by [email protected]
on 16 Jun 2009 at 7:02
We have key-value sorts, but sometimes you want to sort keys and multiple
value arrays along with them. Can we make this general and efficient?
This is not a definite CUDPP feature to add, it's something that should be
considered first.
Original issue reported on code.google.com by [email protected]
on 24 Jun 2009 at 9:58
What steps will reproduce the problem?
1. Run random tests
2. Look at output
What is the expected output? What do you see instead?
I see:
128
number of elements: 128, devOutputSize: 32
number of blocks: 1 blocksize: 32 devOutputsize = 32
number of threads: 32
What I want to see is something more like:
Generating 128 random numbers (1 block, 32 threads) ...
GPU test FAILED (x/y correct)
or something like that. (Look at the other ones.)
Also make sure the -q (quiet) option works, as Mark has previously described.
Original issue reported on code.google.com by [email protected]
on 17 Jun 2009 at 1:53
What version of the product are you using? On what operating system?
CUDPP 1.1, gcc 4.3.3-5ubuntu4 on Ubuntu 9.04 x64
Please provide any additional information below.
Building CUDPP 1.1 did not work out-of-the-box for me, and I believe
that some includes that should have been there are missing:
cudpp_1.1$ cd common
common$ make
[...]
./../common/inc/cmd_arg_reader.h:417: error: must #include <typeinfo>
before using typeid
[...]
src/cutil.cpp:620: error: ‘strlen’ was not declared in this scope
[...]
To fix these, add
#include <typeinfo>
in cmd_arg_reader.h, and
#include <cstring>
in cutil.cpp.
Rebuilding then gives:
common$ make
[...]
./../common/inc/exception.h:89: error: ‘EXIT_FAILURE’ was not declared in this scope
[...]
To fix, add
#include <cstdlib>
in exception.h.
Original issue reported on code.google.com by [email protected]
on 13 Aug 2009 at 9:52
What steps will reproduce the problem?
- unzip gtc_to_sort_test.tar.gz in NVIDIA_CUDA_SDK project folder
1. make (~/NVIDIA_CUDA_SDK/projects/gtc_to_sort_test/)
2. execution (~/NVIDIA_CUDA_SDK/bin/linux/release)
(e.g.: ./gtc_sort_test
~/NVIDIA_CUDA_SDK/projects/gtc_to_sort_test/input/1.txt 5 30 32)
3.
What is the expected output? What do you see instead?
[expected output]
Finished reading input file.
mi: 161795, mgrid: 32449
Sorting : Success
0.0425751 s Checksum: 0.000000
Sorting : Success
0.0424822 s Checksum: 0.000000
Sorting : Success
0.0425396 s Checksum: 0.000000
Sorting : Success
0.0425428 s Checksum: 0.000000
Sorting : Success
0.0427907 s Checksum: 0.000000
Sorting : Success
0.042537 s Checksum: 0.000000
Sorting : Success
0.0425729 s Checksum: 0.000000
Sorting : Success
0.0425132 s Checksum: 0.000000
Sorting : Success
0.0426874 s Checksum: 0.000000
Sorting : Success
0.0428964 s Checksum: 0.000000
=== Performance summary: BENCH_GPU A0 5057 blocks 32 threads/block ===
0.0286377 Gflops
Min: 0.0424822 s -- 0.674 Gflop/s
Mean: 0.0426137 s -- 0.672 Gflop/s
Max: 0.0428964 s -- 0.668 Gflop/s
Stddev: 0.000134837 s (+/- 0.3164%)
[output]
Finished reading input file.
mi: 161795, mgrid: 32449
Sorting : Success
0.0426965 s Checksum: 0.000000
Sorting : Success
0.0425468 s Checksum: 0.000000
Sorting : Success
0.0426379 s Checksum: 0.000000
Sorting : Success
0.0425811 s Checksum: 0.000000
Sorting : Success
0.0426666 s Checksum: 0.000000
Unordered key[983]: 138 > key[984]: 27
Sorting : FAIL
0.0436186 s Checksum: 0.000000
Unordered key[45]: 6392 > key[46]: 6384
Sorting : FAIL
0.0434239 s Checksum: 0.000000
Unordered key[147]: 3 > key[148]: 0
Sorting : FAIL
0.0435097 s Checksum: 0.000000
Unordered key[210]: 218 > key[211]: 0
Sorting : FAIL
0.0436116 s Checksum: 0.000000
Unordered key[132]: 14 > key[133]: 0
Sorting : FAIL
0.0435575 s Checksum: 0.000000
=== Performance summary: BENCH_GPU A0 5057 blocks 32 threads/block ===
0.0286377 Gflops
Min: 0.0425468 s -- 0.673 Gflop/s
Mean: 0.043085 s -- 0.665 Gflop/s
Max: 0.0436186 s -- 0.657 Gflop/s
Stddev: 0.000488756 s (+/- 1.134%)
What version of the product are you using? On what operating system?
GTX280
Ubuntu 8.04
cuda 2.2
Please provide any additional information below.
Sorting errors occur intermittently, as in the output example above.
The same error also occurs in the cudpp 1.1 and cudpp 1.1.1 test programs.
Original issue reported on code.google.com by [email protected]
on 19 Feb 2010 at 7:17
Attachments:
Right now we have:
enum CUDPPAlgorithm
{
    CUDPP_SCAN,
    CUDPP_SEGMENTED_SCAN,
    CUDPP_COMPACT,
    CUDPP_REDUCE,
    CUDPP_SORT_RADIX,
    CUDPP_SPMVMULT,          /**< Sparse matrix-dense vector multiplication */
    CUDPP_RAND_MD5,          /**< Pseudo-random number generator using the MD5 hash algorithm */
    CUDPP_ALGORITHM_INVALID, /**< Placeholder at end of enum */
};
I didn't catch this in time for release1.1.
Original issue reported on code.google.com by [email protected]
on 1 Jul 2009 at 8:02
Purpose of code changes on this branch:
To add Mark's optimizations to radix sort from the CUDA SDK.
When reviewing my code changes, please focus on:
radixsort_*.cu
cudpp_plan.*
cudpp_plan_manager.*
cudpp_maximal_launch.*
test_radixsort.cu (cudpp_testrig)
Use SVN diff to guide your review.
Original issue reported on code.google.com by [email protected]
on 16 Jun 2009 at 6:52
We need a way to regress SpMV. I don't think it's been tested for the 1.1
release.
Original issue reported on code.google.com by [email protected]
on 29 Jun 2009 at 7:49
This will make building (and adding files) much easier on windows. Get the
latest cuda.rules file from the CUDA SDK.
Original issue reported on code.google.com by [email protected]
on 16 Jun 2009 at 7:06
Request from a user:
is there an effort to support the "unsigned long long int" data type in
CUDPPDatatype? I want to use this data type for computing the integral
image of the square of the image matrix, like a SAT. With high-resolution
images, *unsigned int* cannot contain the values...
Original issue reported on code.google.com by [email protected]
on 25 Jun 2009 at 9:36
Current rand seed generation only uses a basic seed XOR'd with the
threadIdx and blockIdx. A more clever way would be to use an LCG.
Original e-mail suggesting the change:
From Thomas Bradley:
The threadIdx and blockIdx are 16-bit quantities and even fewer bits will
actually be non-zero, therefore you are really only changing the low bits
of your seed. It may be more robust to use an LCG to generate the “input”
fields, for example a=69069 m=32 is easy and not a bad LCG:
state = (state * 69069) & 0xffffffffUL; return state;
Where state is initialized to the seed (combined somehow with the threadIdx
and blockIdx).
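A host-side sketch of the suggested LCG step. How the seed is combined with the thread and block indices is an assumption; mix_seed and its shift are illustrative only:

```c
#include <stdint.h>

/* One multiplicative LCG step with a = 69069, m = 2^32, as suggested in
   the e-mail; the & 0xffffffffUL is implicit in 32-bit arithmetic. */
static uint32_t lcg_next(uint32_t state) {
    return state * 69069u;
}

/* Hypothetical seed mixing: fold the (small) block and thread indices into
   the seed, then run the LCG a couple of steps so the change propagates
   into the high bits instead of only the low ones. */
uint32_t mix_seed(uint32_t seed, uint32_t tid, uint32_t bid) {
    uint32_t state = seed ^ (bid << 16) ^ tid;
    state = lcg_next(state);
    state = lcg_next(state);
    return state;
}
```

With the plain XOR scheme, two threads whose indices differ only in the low bits produce seeds differing only in the low bits; the LCG steps spread that difference across the whole word.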
Original issue reported on code.google.com by [email protected]
on 13 Jul 2009 at 2:57
What steps will reproduce the problem?
1. Run cudpp_testrig -rand on 8800 GT GPU
What is the expected output? What do you see instead?
Expect pass, get fail.
I know Stanley already knows about this issue, but I wanted to file it to
make sure it's covered.
Original issue reported on code.google.com by [email protected]
on 17 Jun 2009 at 11:33
What steps will reproduce the problem?
1. compile cutil (succeeds)
2. compile cudpp
What is the expected output? What do you see instead?
I expect cudpp to compile successfully. Instead I get:
gauguin:/raid/filipe/cudpp_1.1/cudpp> make verbose=1
nvcc -o obj/release/segmented_scan_app.cu_o -c src/app/segmented_scan_app.cu --host-compilation=C --compiler-options -fno-strict-aliasing -I./ -I./include/ -Isrc/ -Isrc/app/ -Isrc/kernel/ -Isrc/cta/ -I. -I/opt/cuda/include -I./../common/inc -DUNIX -O
In file included from /tmp/tmpxft_00004836_00000000-1_segmented_scan_app.cudafe1.stub.c:6,
                 from src/app/segmented_scan_app.cu:247:
/opt/cuda/bin/../include/crt/host_runtime.h:178: warning: 'struct surfaceReference' declared inside parameter list
/opt/cuda/bin/../include/crt/host_runtime.h:178: warning: its scope is only this definition or declaration, which is probably not what you want
In file included from src/app/segmented_scan_app.cu:247:
/tmp/tmpxft_00004836_00000000-1_segmented_scan_app.cudafe1.stub.c: In function '__sti____cudaRegisterAll_53_tmpxft_00004836_00000000_4_segmented_scan_app_cpp1_ii_999fefc3':
/tmp/tmpxft_00004836_00000000-1_segmented_scan_app.cudafe1.stub.c:11623: error: '__fatDeviceText' undeclared (first use in this function)
/tmp/tmpxft_00004836_00000000-1_segmented_scan_app.cudafe1.stub.c:11623: error: (Each undeclared identifier is reported only once for each function it appears in.)
make: *** [obj/release/segmented_scan_app.cu_o] Error 255
What version of the product are you using? On what operating system?
cudpp 1.1 with CUDA 2.3 on Debian sid with both gcc 4.3 and 4.1.
Please provide any additional information below.
I tried compiling with emu=1 and this did work. I also tried dbg=1, but this
didn't seem to make any difference. This might well be a CUDA bug and not
cudpp's fault.
Original issue reported on code.google.com by [email protected]
on 27 Sep 2009 at 7:13
What steps will reproduce the problem?
1. Compile cudpp_testrig in Linux. You will see this new warning:
test_rand.cu(246): warning: variable "memSize" was declared but never
referenced
What is the expected output? What do you see instead?
No warnings.
Please use labels and text to provide additional information.
Original issue reported on code.google.com by [email protected]
on 22 Jun 2009 at 6:27
What steps will reproduce the problem?
I've included a function that runs the CUDPP multiScan and checks it
against what I think should appear using the CPU. For me it fails with a
datasize of 50000 and 100 rows.
What is the expected output? What do you see instead?
I would expect the test to pass. If you use the cudppScan inside the For
loop instead the test passes.
What version of the product are you using? On what operating system?
Using CUDPP 1.1 on Vista 64-bit with Visual Studio 2008. I'm using CUDA
2.2. I won't have time to test it using 2.3 as I'm about to leave the
country so maybe someone can confirm that it's still a fault with 2.3.
Please provide any additional information below.
I believe the problem lies in the scan_cta.cu file. In the scanCTA
function, a __syncthreads() is called only in emulation mode for backward
scans. I think this needs to be called in device mode as well; at least,
that fixed the problem for me. It looks like there would be race conditions
for large numbers of threads.
Original issue reported on code.google.com by [email protected]
on 29 Sep 2009 at 12:00
Attachments:
Stanley, when you get a chance, please document all the functions in
rand_cat.cu. The other files have good docs, but not this one.
Not marking for Release 1.1 because it's not crucial.
Original issue reported on code.google.com by [email protected]
on 26 Jun 2009 at 5:01
What steps will reproduce the problem?
==============================================
1) Extract attachment
2) Open cudpp/cudpp.sln in Visual Studio 2008
3) Rebuild Debug solution
4) Open apps/simpleCUDPP_openMP/simpleCUDPP.sln
5) Rebuild Debug solution
6) Start debugging simpleCUDPP
What is the expected output?
==============================================
All tests should pass
What do you see instead?
==============================================
---------
- Run 1:
---------
Windows has triggered a breakpoint in simpleCUDPP.exe.
This may be due to a corruption of the heap, which indicates a bug in
simpleCUDPP.exe or any of the DLLs it has loaded.
This may also be due to the user pressing F12 while simpleCUDPP.exe has focus.
The output window may have more diagnostic information.
---------
- Run 2:
---------
Error destroying CUDPPPlan
---------
- Run 3:
---------
Unhandled exception at 0x006ca87e (cudpp32d.dll) in simpleCUDPP.exe:
0xC0000005: Access violation writing location 0xddddddf1.
---------
- Run 4:
---------
Unhandled exception at 0x007c32e4 (cudpp32d.dll) in simpleCUDPP.exe:
0xC0000005: Access violation reading location 0xfeeefee8.
---------
- Run 5:
---------
Error creating CUDPPPlan
What version of the product are you using? On what operating system?
==============================================
CUDPP 1.1 with CUDA 2.3 beta, Windows XP 32 bit, Visual Studio 2008, GTX 295
Please provide any additional information below.
==============================================
- Running with OpenMP on 2 GPUs. It will occasionally work but generally fails.
- See the original query at http://groups.google.co.uk/group/cudpp/browse_thread/thread/507fba92fac36b1e?hl=en
Original issue reported on code.google.com by [email protected]
on 29 Jul 2009 at 11:01
Attachments:
Segmented scans and scans do not work on very large array sizes.
Original issue reported on code.google.com by [email protected]
on 13 Nov 2009 at 5:45
Compile time continues to get longer as we add more functionality. CUDA is
really slow at compiling template functions with multiple parameters, and we
use a lot of them. There are something like 384 different scan kernels, for
example, and a similar number for segmented scan.
How can we reduce this code explosion? Can we give feedback to the CUDA
compiler team? (Emulation mode compiles much faster, for example.)
Original issue reported on code.google.com by [email protected]
on 25 Jun 2009 at 12:17
On line 302 of cudpp_util.h, the CUDPP_MIN identity for unsigned ints is
defined as INT_MAX. This is incorrect: the maximum unsigned integer is
2*INT_MAX + 1, i.e. UINT_MAX.
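A minimal host-side illustration of why the identity matters (umin and is_min_identity are stand-ins, not CUDPP code): an identity e for min must satisfy min(e, x) == x for every unsigned x, which fails for e = INT_MAX as soon as x exceeds it.

```c
#include <limits.h>

/* Stand-in for the unsigned CUDPP_MIN operator. */
static unsigned umin(unsigned a, unsigned b) {
    return a < b ? a : b;
}

/* An identity e for min must satisfy umin(e, x) == x for all x. With
   e = INT_MAX this breaks for any x above INT_MAX; e = UINT_MAX works. */
int is_min_identity(unsigned e, unsigned x) {
    return umin(e, x) == x;
}
```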
Original issue reported on code.google.com by [email protected]
on 4 Aug 2009 at 3:23
See issue 17 for more information. It is not efficient to sort multiple
value arrays inside CUDPP -- one can sort key-index pairs and then use the
sorted indices to shuffle/gather the multiple arrays. This is more efficient
and more general, but it may not be obvious to users how to do it. So we
should provide an example in the "apps" directory.
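A host-side sketch of the pattern such an example could demonstrate; qsort stands in for cudppSort here, and the helper names are illustrative:

```c
#include <stdlib.h>
#include <string.h>

typedef struct { unsigned key; unsigned idx; } Pair;

static int cmpKey(const void *a, const void *b) {
    unsigned ka = ((const Pair *)a)->key;
    unsigned kb = ((const Pair *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Sort (key, index) pairs once, then gather each value array through the
   sorted index array. Repeat the gather loop once per value array. */
void sortByKeyAndGather(unsigned *keys, unsigned *vals, size_t n) {
    Pair *p = malloc(n * sizeof *p);
    unsigned *tmp = malloc(n * sizeof *tmp);
    for (size_t i = 0; i < n; i++) {
        p[i].key = keys[i];
        p[i].idx = (unsigned)i;
    }
    qsort(p, n, sizeof *p, cmpKey);        /* stands in for cudppSort */
    for (size_t i = 0; i < n; i++) {
        keys[i] = p[i].key;
        tmp[i]  = vals[p[i].idx];          /* gather through sorted indices */
    }
    memcpy(vals, tmp, n * sizeof *tmp);
    free(tmp);
    free(p);
}
```

The one sort amortizes over any number of value arrays, which is the efficiency argument made above.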
Original issue reported on code.google.com by [email protected]
on 2 Sep 2009 at 9:43
When trying to build a shared library using cudpp on Linux (x86_64), the
version of cudpp delivered with the NVIDIA SDK (any version), as well as
every version of cudpp up to and including 1.1, results in:
relocation R_X86_64_32 against `a local symbol' can not be used when making
a shared object; recompile with -fPIC
The problem is known and has been discussed/solved at
http://forums.nvidia.com/lofiversion/index.php?t63748.html
It would be a good idea to include the changes (see attached patch for
version 1.1) in future releases of cudpp as well as the version shipped
with the NVidia SDK.
Original issue reported on code.google.com by [email protected]
on 4 Sep 2009 at 9:12
Attachments:
What steps will reproduce the problem?
Compile and run the following in emulation mode:

#include <stdio.h>
#include <cudpp/cudpp.h>
#include <cuda_runtime.h>
#include <cutil_inline.h>

typedef unsigned int uint;

#define N 12

uint keys[N]   = {111, 37, 430, 433, 431, 357, 6190, 6193, 6191, 6117, 6837, 6911};
uint values[N] = {37, 111, 433, 430, 357, 431, 6193, 6190, 6117, 6191, 6911, 6837};

int main() {
    cudaSetDevice(0);
    int* keys_dev = 0;
    int* vals_dev = 0;
    cutilSafeCall(cudaMalloc((void**)&keys_dev, sizeof(uint) * N));
    cutilSafeCall(cudaMalloc((void**)&vals_dev, sizeof(uint) * N));

    CUDPPConfiguration sortConfig;
    sortConfig.algorithm = CUDPP_SORT_RADIX;
    sortConfig.datatype  = CUDPP_UINT;
    sortConfig.op        = CUDPP_ADD;
    sortConfig.options   = CUDPP_OPTION_KEY_VALUE_PAIRS;

    CUDPPHandle sortPlan;
    cudppPlan(&sortPlan, sortConfig, 100 /* num elements */, 1 /* num rows */, 100 /* pitch */);

    printf("Before\n");
    for (uint i = 0; i < N; i++) {
        printf("(%d,\t%d)\n", keys[i], values[i]);
    }

    cutilSafeCall(cudaMemcpy(keys_dev, keys, sizeof(uint) * N, cudaMemcpyHostToDevice));
    cutilSafeCall(cudaMemcpy(vals_dev, values, sizeof(uint) * N, cudaMemcpyHostToDevice));

    cudppSort(sortPlan, keys_dev, vals_dev, 32, N);

    cutilSafeCall(cudaMemcpy(keys, keys_dev, sizeof(uint) * N, cudaMemcpyDeviceToHost));
    cutilSafeCall(cudaMemcpy(values, vals_dev, sizeof(uint) * N, cudaMemcpyDeviceToHost));

    printf("After\n");
    for (uint i = 0; i < N; i++) {
        printf("(%d,\t%d)\n", keys[i], values[i]);
    }
}
What is the expected output? What do you see instead?
The output should be a list of sorted keys + values. Instead:
Before
(111, 37)
(37, 111)
(430, 433)
(433, 430)
(431, 357)
(357, 431)
(6190, 6193)
(6193, 6190)
(6191, 6117)
(6117, 6191)
(6837, 6911)
(6911, 6837)
After
(37, 111)
(111, 37)
(357, 431)
(357, 431)
(357, 431)
(430, 433)
(6117, 6191)
(6117, 6191)
(6117, 6191)
(6190, 6193)
(6837, 6911)
(6911, 6837)
(6911, 6837)
Key/value pairs are indeed sorted; however, some pairs have been duplicated
while others have been deleted.
What version of the product are you using? On what operating system?
Using the version bundled with the CUDA Toolkit v3.0 beta 1, on both Mac OS X 10.6 and Ubuntu 9.04.
Please provide any additional information below.
Works correctly when run on the device.
Original issue reported on code.google.com by [email protected]
on 20 Nov 2009 at 9:46
Attachments:
What steps will reproduce the problem?
1. Run "cudpp_testrig -all" from the command line
What is the expected output? What do you see instead?
Expect "test passed" for all of them, but instead get a warning about a
regression file not being found. Also, when this happens, the tests still
pass! These should be test failures.
Original issue reported on code.google.com by [email protected]
on 16 Jun 2009 at 6:44
We need to finally add parallel reductions.
Note that there are two types: one for associative and commutative
operators, which can be optimized more aggressively than one for operators
that are only associative and not commutative.
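The distinction can be sketched on the host: a pairwise tree reduction (mirroring a GPU block's access pattern) needs only associativity, because neighbouring elements keep their relative order, while commutative operators additionally permit reordered schedules. A hypothetical sketch, assuming n is a power of two; treeReduce is illustrative, not CUDPP code:

```c
#include <stddef.h>

typedef unsigned (*BinOp)(unsigned, unsigned);

static unsigned opAdd(unsigned a, unsigned b) { return a + b; }

/* In-place pairwise tree reduction over n = 2^k elements. Each pass
   combines element i with element i + stride, preserving left-to-right
   order, so only associativity is required of op. */
unsigned treeReduce(unsigned *d, size_t n, BinOp op) {
    for (size_t stride = 1; stride < n; stride *= 2)
        for (size_t i = 0; i + stride < n; i += 2 * stride)
            d[i] = op(d[i], d[i + stride]);
    return d[0];
}
```

A commutative operator would also allow the interleaved-stride schedule GPUs prefer for coalescing; a merely associative one is restricted to order-preserving schedules like the one above.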
Original issue reported on code.google.com by [email protected]
on 25 Jun 2009 at 10:02