Thrust's Introduction

Thrust: Code at the speed of light

Thrust is a parallel algorithms library which resembles the C++ Standard Template Library (STL). Thrust's high-level interface greatly enhances programmer productivity while enabling performance portability between GPUs and multicore CPUs. Interoperability with established technologies (such as CUDA, TBB, and OpenMP) facilitates integration with existing software. Develop high-performance applications rapidly with Thrust!

Examples

Thrust is best explained through examples. The following source code generates random numbers serially and then transfers them to a parallel device where they are sorted.

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/copy.h>
#include <algorithm>
#include <cstdlib>

int main(void)
{
  // generate 32M random numbers serially
  thrust::host_vector<int> h_vec(32 << 20);
  std::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer data to the device
  thrust::device_vector<int> d_vec = h_vec;

  // sort data on the device (846M keys per second on GeForce GTX 480)
  thrust::sort(d_vec.begin(), d_vec.end());

  // transfer data back to host
  thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

  return 0;
}

This code sample computes the sum of 100 random numbers in parallel:

#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/reduce.h>
#include <thrust/functional.h>
#include <algorithm>
#include <cstdlib>

int main(void)
{
  // generate random data serially
  thrust::host_vector<int> h_vec(100);
  std::generate(h_vec.begin(), h_vec.end(), rand);

  // transfer to device and compute sum
  thrust::device_vector<int> d_vec = h_vec;
  int x = thrust::reduce(d_vec.begin(), d_vec.end(), 0, thrust::plus<int>());
  return 0;
}

Thrust's People

Contributors

nwhitehead

Thrust's Issues

Test Thrust with --device-emulation

We should build a separate testing pass for --device-emulation.  The data
sizes need not be so large.

Original issue reported on code.google.com by jaredhoberock on 17 Sep 2009 at 11:12

add iterator support to raw_pointer_cast()

For simplicity, raw_pointer_cast() should accept (device) iterators.

There is a unit test for this case in /testing/device_ptr.cu

Original issue reported on code.google.com by wnbell on 17 Jul 2009 at 3:30

Thrust should be strict about checking whether it is safe to #include device-specific code

The following code, compiled with gcc, produces undefined references at
link time, rather than an error at compile time:

// bug.cpp:
#include <thrust/device_vector.h>
#include <thrust/sequence.h>

int main(void)
{
  unsigned int N = 10;
  thrust::device_vector<int> v(N);
  thrust::sequence(v.begin(), v.end());
  return 0;
}

Thrust should issue an error if an entry point cannot be compiled with the
given compiler/backend combination, as in this case:  gcc cannot compile
code that launches CUDA kernels, so don't try.


Original issue reported on code.google.com by jaredhoberock on 18 Jun 2009 at 1:47

clean up device radix_sort dispatch

1) Dispatch on ValueIterator
   - is_trivial<ValueIterator> & pod<ValueType> & sizeof(ValueType) == 4
       cast ValueIterator to unsigned int
   - is_trivial<ValueIterator> & pod<ValueType> & sizeof(ValueType) <  4
       embed ValueTypes in temporary unsigned int array
   - everything else
       use temporary unsigned int array to permute values

Now ValueIterator is an unsigned int *

2) Dispatch on KeyIterator
   - !pod<KeyType>
       generate error
   - !is_trivial<KeyIterator>
       use trivial_sequence<KeyIterator> to copy data
   - Continue with existing key dispatch

Original issue reported on code.google.com by wnbell on 4 Sep 2009 at 4:30

implement binary search functions

Thrust needs implementations of 
  * lower_bound
  * upper_bound
  * binary_search

In addition to the standard STL versions of these functions, which perform a
single query, we require vectorized versions that perform many queries.  For
example, the prototype for lower_bound should be:

template <class ForwardIterator, class InputIterator, class OutputIterator>
OutputIterator lower_bound(ForwardIterator begin, ForwardIterator end,
                InputIterator queries_begin, InputIterator queries_end,
                OutputIterator output_begin);

The returned OutputIterator is the end of the output range, which is usually:
  output_begin + distance(queries_begin, queries_end)
However, like transform(), it is necessary to return it since an
OutputIterator can't be advanced arbitrarily.


Original issue reported on code.google.com by wnbell on 2 Jun 2009 at 6:22

CUDA 2.3: MSVC9: version.cu example generates "nonstandard extension" warning when C runtime DLL used

This happens for me on WinXP 32-bit, but I suspect other Windows versions
behave the same.

What steps will reproduce the problem?
1. Create a new MSVC project.
2. Use the following command line to compile (/MDd means "Multithreaded 
Debug DLL" runtime library. Without this there are no warnings).  

"F:\CUDA\bin\nvcc.exe"    -arch sm_10 -ccbin "f:\Program Files\Microsoft 
Visual Studio 9.0\VC\bin"    -Xcompiler "/EHsc /W3 /nologo /Od /Zi   /MDd  
" -I"F:\CUDA\include/thrust" -maxrregcount=32  --compile -o 
"Debug\version.cu.obj" "f:\CUDA\include\thrust\examples\version.cu" 

(Note: my Thrust is from SVN and the Thrust include directory is
F:\CUDA\include\thrust\thrust, so I use -I$(CUDA_INC_PATH)/thrust in the
project settings.)

Alternatively use the CUDA SDK 2.3 cuda.rules file and add "Multithreaded 
Debug DLL" option under "Hybrid CUDA/C++ Options".

3. Compile.

What is the expected output? What do you see instead?

Expected: no errors, no warnings.
I get this output:

1>Compiling with CUDA Build Rule...
1>"F:\CUDA\bin\nvcc.exe"    -arch sm_10 -ccbin "f:\Program Files\Microsoft 
Visual Studio 9.0\VC\bin"    -Xcompiler "/EHsc /W3 /nologo /Od /Zi   /MDd  
" -I"F:\CUDA\include/thrust" -maxrregcount=32  --compile -o 
"Debug\version.cu.obj" "f:\CUDA\include\thrust\examples\version.cu" 
1>version.cu
1>tmpxft_00000d18_00000000-3_version.cudafe1.gpu
1>tmpxft_00000d18_00000000-8_version.cudafe2.gpu
1>tmpxft_00000d18_00000000-3_version.cudafe1.cpp
1>tmpxft_00000d18_00000000-12_version.ii
1>f:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\istream(1139) : 
warning C4231: nonstandard extension used : 'extern' before template 
explicit instantiation
1>f:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\istream(1141) : 
warning C4231: nonstandard extension used : 'extern' before template 
explicit instantiation
1>f:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\istream(1143) : 
warning C4231: nonstandard extension used : 'extern' before template 
explicit instantiation
1>f:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\istream(1145) : 
warning C4231: nonstandard extension used : 'extern' before template 
explicit instantiation
1>f:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\istream(1147) : 
warning C4231: nonstandard extension used : 'extern' before template 
explicit instantiation
1>f:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\istream(1149) : 
warning C4231: nonstandard extension used : 'extern' before template 
explicit instantiation
1>f:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\istream(1151) : 
warning C4231: nonstandard extension used : 'extern' before template 
explicit instantiation
1>f:\Program Files\Microsoft Visual Studio 9.0\VC\INCLUDE\istream(1153) : 
warning C4231: nonstandard extension used : 'extern' before template 
explicit instantiation
1>Build log was saved at 
"file://f:\CUDA\include\thrust\examples\version\Debug\BuildLog.htm"
1>version - 0 error(s), 8 warning(s)


What version of the product are you using? On what operating system?

Thrust from SVN r521.

Original issue reported on code.google.com by [email protected] on 5 Oct 2009 at 2:44

host_vector will not compile if the value_type uses CUDA __align__

What steps will reproduce the problem?
1. Define a struct using CUDA's __align__ syntax and create a host_vector 
of that type:

struct __align__(16) MyVector { float x, y, z, w; };
thrust::host_vector<MyVector> vectors;

What is the expected output? What do you see instead?

I get this compilation error:

1>F:\CUDA\include\thrust/detail/vector_base.h(139) : error C2719: 'unnamed-
parameter': formal parameter with __declspec(align('16')) won't be aligned

Original issue reported on code.google.com by [email protected] on 2 Oct 2009 at 4:48

The iterator space tags need to be renamed

We need to rename the iterator space tags to match the traversal tags:

space::device -> device_space_tag
space::host   -> host_space_tag
space::any    -> any_space_tag

Original issue reported on code.google.com by jaredhoberock on 31 Aug 2009 at 7:07

Add missing vector_base methods

What steps will reproduce the problem?
1. Insert anything into an empty device vector.

What is the expected output? What do you see instead?

terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDA error: invalid device pointer


What version of the product are you using? On what operating system?

Trunk: Head: Revision: 136

Please provide any additional information below.

In your implementation of this function, you shift down the contents of the
vector by n elements by copying the range from the insertion position to
the end. 

 thrust::copy(pos, this->end(), pos + n);

However, you resize the vector beforehand.

  resize(size() + n);

So you are copying n extra elements and run off the end of the vector.  I
think that changing

 thrust::copy(pos, this->end(), pos + n);

to 

 thrust::copy(pos, this->end() - n, pos + n);

will fix the problem.

Original issue reported on code.google.com by [email protected] on 4 Jun 2009 at 11:00

dispatch radix_sort when comp=greater<T> is used

With the appropriate transformations we can use radix_sort for sorting with
descending keys.  This is a natural extension of the enhancement in issue #27.

Original issue reported on code.google.com by wnbell on 31 Aug 2009 at 11:51

implement pair

Thrust needs an implementation of pair with __host__ __device__ 
constructors.

Original issue reported on code.google.com by jaredhoberock on 3 Jun 2009 at 2:37

gather's interface is defined incorrectly

gather's interface is

template <typename ForwardIterator, typename InputIterator,
          typename RandomAccessIterator>
void thrust::gather(ForwardIterator first, ForwardIterator last,
                    InputIterator map,
                    RandomAccessIterator input);

The output range should come last.  gather's interface should be:

template <typename InputIterator, typename RandomAccessIterator,
          typename OutputIterator>
OutputIterator thrust::gather(InputIterator map_first,
                              InputIterator map_last,
                              RandomAccessIterator input,
                              OutputIterator result);

I don't remember if there's a reason we defined it the way we did.  I think
we should break the current API and correct it.  I propose we stick the
current interface in thrust::deprecated:: for v1.2 and the correct
interface in thrust::experimental:: and finally remove the old interface in
v1.3 and promote the new interface to thrust:: in v1.3.

Original issue reported on code.google.com by jaredhoberock on 1 Nov 2009 at 7:27

ensure device algorithms work for pointer-containing types

Device algorithms may be invoked on types containing device pointers or
device iterators that cannot be dereferenced on the host.  It is therefore
incorrect to perform certain portions of those operations on the host, as
issue #28 demonstrated.

We should add unit tests for the fundamental algorithms to ensure that they
compute entirely on the device.  These functions are scan, reduce,
for_each, and sort.

Original issue reported on code.google.com by wnbell on 10 Sep 2009 at 7:41

scans should return OutputIterator

Like transform(), all scan variants should return an OutputIterator
corresponding to output + (last - first).

Original issue reported on code.google.com by wnbell on 23 Jun 2009 at 8:03

make algorithms work for large data types

Many Thrust algorithms, particularly those that use shared memory, will
fail for large data types.  For Thrust v1.1 we should ensure that such
algorithms address this limitation.  In some intermediate-size cases we can
adjust BLOCK_SIZE (or similar) parameters to limit the number of elements
stored in shared memory.  In the uncommon case that this strategy fails
(e.g. sizeof(T) > 100) then we should introduce a workaround, such as
copying the data set to the host, performing the operation, and copying the
results back to the device.

The set of affected algorithms includes:
- reduce
- scan & segmented_scan
- sort

Original issue reported on code.google.com by wnbell on 30 Jul 2009 at 9:12

implement quantification functions all_of(), any_of(), none_of()

Proposal:
   http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2666.pdf

Function Prototypes:

template <class InputIterator, class Predicate>
  bool all_of(InputIterator first, InputIterator last, Predicate pred);
     Returns: true if pred(*i) is true for every iterator i in the range
     [first, last), and false otherwise.
     Complexity: At most last - first applications of the predicate.


template <class InputIterator, class Predicate>
  bool any_of(InputIterator first, InputIterator last, Predicate pred);
     Returns: true if there exists an iterator i in the range [first, last)
     such that pred(*i) is true, and false otherwise.
     Complexity: At most last - first applications of the predicate.


template <class InputIterator, class Predicate>
  bool none_of(InputIterator first, InputIterator last, Predicate pred);
     Returns: true if pred(*i) is false for every iterator i in the range
     [first, last), and false otherwise.
     Complexity: At most last - first applications of the predicate.

Original issue reported on code.google.com by wnbell on 11 Jun 2009 at 8:46

Adapt the Boost System Library

The Boost System Library [1] provides a standard & portable way to report
errors and exceptional behavior of underlying system-dependent operations.
Thrust should adapt this library to report exceptional behavior of its
backends.

Original issue reported on code.google.com by jaredhoberock on 8 Sep 2009 at 5:40

Update Tutorial for v1.1

Update the Tutorial with links and references to v1.1 features.

Original issue reported on code.google.com by wnbell on 2 Sep 2009 at 9:25

permutation_iterator needs unit tests

The problem of dispatching gather/scatter to and from different spaces
needs to be solved comprehensively with copy + permutation_iterator.

Here's a prototypical implementation:

// all three spaces match
gather(Space result,
       Space map,
       Space input)
{
  copy(make_permutation_iterator(map, input),
       result)
}

// spaces differ in at least one place
gather(Space1 result,
       Space2 map,
       Space3 input)
{
  // bring the map & input to Space1
  // one of these is potentially a no-op
  bring_range_to_space<Space1> map1(map);
  bring_range_to_space<Space1> input1(input);

  gather(result, map1, input1);
}

Original issue reported on code.google.com by jaredhoberock on 31 Oct 2009 at 7:38

document reverse_iterator

We could implement a lot of functions with reverse_iterator [1].  For
example, reverse() could be implemented with transform() on
zip_iterator(iter, reverse_iter).  More importantly, reverse_iterator would
immediately provide reversed scan and segmented scan implementations
(although not necessarily optimal ones, due to mis-alignment).  Reversed
sequences would likely arise in many user codes as well.

[1]
http://www.boost.org/doc/libs/1_39_0/libs/iterator/doc/reverse_iterator.html

Original issue reported on code.google.com by wnbell on 22 Aug 2009 at 5:57

dispatch radix_sort even when comp=less<T> is explicitly provided

Currently we correctly dispatch radix_sort when sort(first, last) is
invoked with an implicit less<T> comparison method.  However, when the user
explicitly specifies any comparison, even the default less<T>, we dispatch
merge_sort.  We should intercept this usage and handle it correctly.

Original issue reported on code.google.com by wnbell on 31 Aug 2009 at 8:29

Errors on Snow Leopard

Thrust v1.1 examples don't compile on Snow Leopard.

Original issue reported on code.google.com by jaredhoberock on 13 Sep 2009 at 6:20

make algorithms work for arbitrarily large data types

Thrust v1.1 will support large, but not necessarily arbitrarily large, types
[1].  This enhancement extends support to types of any size, up to whatever
limits are imposed by the device.

[1] #16

Original issue reported on code.google.com by wnbell on 12 Aug 2009 at 11:18

Add a CUDA-specific pinned memory allocator

Details in this thread:

http://groups.google.com/group/thrust-users/browse_thread/thread/f09c6afeb2ecc49

Original issue reported on code.google.com by jaredhoberock on 22 Jul 2009 at 6:57

Large temporary objects can't be passed to thrust entry points

What steps will reproduce the problem?

device_new<T> with large sizeof(T) will produce the error message described
in this thread:

http://groups.google.com/group/thrust-users/browse_thread/thread/5fa902815f404488

What is the expected output? What do you see instead?
Error: Formal parameter space overflowed in function

Options:
1. do nothing
2. provide some sort of feedback to the user as to why the compile failed
3. provide a WAR for a few of the special cases such as device_new with n = 1

Original issue reported on code.google.com by jaredhoberock on 9 Jun 2009 at 7:39

Compiler Warning C4099 on MSVC 9.0 and CUDA 2.3

What steps will reproduce the problem?
1. Create a simple visual studio project with the following code in a .cu file.

#include <thrust/host_vector.h>
#include <thrust/version.h>

int main(int argc, char **argv)
{
    int major = THRUST_MAJOR_VERSION;
    int minor = THRUST_MINOR_VERSION;

    std::cout << "Thrust v" << major << "." << minor << std::endl;

    return 0;
}

2.  Compile debug mode using the Cuda.rules file v2.3.0 included with the CUDA 
SDK 
version 2.3

3.  Get the following output.

1>------ Rebuild All started: Project: cudaGjkTest, Configuration: Release 
Win32 ------
1>Deleting intermediate and output files for project 'cudaGjkTest', 
configuration 
'Release|Win32'
1>Compiling with CUDA Build Rule...
1>"F:\CUDA\bin\nvcc.exe"    -arch sm_10 -ccbin "f:\Program Files\Microsoft 
Visual 
Studio 9.0\VC\bin"    -Xcompiler "/EHsc /W3 /nologo /O2 /Zi   /MT  "  
-maxrregcount=32  
--compile -o "Release\cudaGjkTest.cu.obj" 
"f:\p4\physx\experimental\cudaGjkTest\cudaGjkTest.cu" 
1>cudaGjkTest.cu
1>tmpxft_00000f6c_00000000-3_cudaGjkTest.cudafe1.gpu
1>tmpxft_00000f6c_00000000-8_cudaGjkTest.cudafe2.gpu
1>tmpxft_00000f6c_00000000-3_cudaGjkTest.cudafe1.cpp
1>tmpxft_00000f6c_00000000-12_cudaGjkTest.ii
1>F:\CUDA\include\thrust/device_reference.h(798) : warning C4099: 
'thrust::device_reference' : type name first seen using 'struct' now seen using 
'class'
1>        F:\CUDA\include\thrust/device_reference.h(45) : see declaration of 
'thrust::device_reference'
1>        F:\CUDA\include\thrust/device_reference.h(802) : see reference to 
class 
template instantiation 'thrust::device_reference<T>' being compiled
1>Linking...
1>LINK : /LTCG specified but no code generation required; remove /LTCG from the 
link 
command line to improve linker performance
1>Embedding manifest...
1>Build log was saved at 
"file://f:\p4\physx\experimental\cudaGjkTest\Release\BuildLog.htm"
1>cudaGjkTest - 0 error(s), 1 warning(s)
========== Rebuild All: 1 succeeded, 0 failed, 0 skipped ==========


What is the expected output? What do you see instead?

Expected output is zero warnings, zero errors.


What version of the product are you using? On what operating system?

Windows XP 32-bit
MS Visual Studio 2008 (aka v9.0)
CUDA 2.3
Thrust v1.1.0

Original issue reported on code.google.com by [email protected] on 21 Sep 2009 at 6:12

Implement host_vector::swap(device_vector&) and vice-versa

I believe this is all that's necessary to implement a release consistency
model efficiently with Thrust.  For example, this function will do the
right thing whether the input vector (any_vec) lives on the host or device.

template <typename Vector>
void generic_function(Vector &any_vec)
{
   typedef typename Vector::value_type T;

   thrust::device_vector<T> d_vec;

   d_vec.swap(any_vec); // device owns the data

   .... // compute on device

   d_vec.swap(any_vec); // caller owns the data
}

Original issue reported on code.google.com by wnbell on 8 Sep 2009 at 6:43

Thrust should distinguish between two separate use cases for allocating temporary storage in algorithm implementations

We have two separate use cases for allocating temporary storage in our
algorithm implementations.

1. We need temporary storage that may or may not be contiguous
 - e.g. we need to allocate a buffer for the 2nd level of a reduction

2. We need temporary storage whose iterators match something fancy
 - e.g. we are merge sorting zip_iterator, ping pong storage should be
zip_iterator, not value_type *
 - e.g. we are copying zip_iterator from host to device

If we ignore fancy iterators and implement the use cases of case 2 with 1,
we will suffer from poor performance.
 - zip_iterator's value_type will be serialized to a buffer whose elements
will not coalesce.

We'll need a way to allocate temporary storage that matches the type of an
iterator, possibly with a generic fallback mechanism that just allocates
contiguous storage if the iterator isn't recognized.

For the solution to 1, it will be necessary to control the way copies
interact with this buffer.  In most cases, we will probably want to do
trivial bit copies to limit invocations of assignment operators & weird
things that may happen when a fancy iterator gets dereferenced.  The idea
is to avoid unintended consequences of executing arbitrary user code
encapsulated in functors, non-primitive types, and fancy iterators.

Original issue reported on code.google.com by jaredhoberock on 4 Sep 2009 at 3:11

1.1's new additions require documentation

This non-exhaustive list of non-experimental new additions to Thrust in 1.1
need documentation:

1. binary search functions
2. tuple
3. pair
4. iterators
5. segmented scan functions

Original issue reported on code.google.com by jaredhoberock on 30 Aug 2009 at 7:59

Bug in radix_sort in SVN release

Hello !

I'm trying to sort data using a radix sort, and I experience a bug: the
data are not always sorted.

By contrast, when I use a merge sort, everything is OK.

Moreover, when I use this code with the 1.0 release of Thrust, the radix
sort seems to work.


You'll find a small test code enclosed.

P.S.: I'm using an NVIDIA GeForce 8800 with CUDA 2.3 under Linux.

Original issue reported on code.google.com by [email protected] on 18 Aug 2009 at 8:37

implement minmax_element

minmax_element combines the output of min_element and max_element into a
single thrust::pair<>

minmax_element is proposed here:
http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2005/n1840.pdf

Original issue reported on code.google.com by wnbell on 16 Jul 2009 at 10:48

Ensure that we can copy SoA to AoS and vice versa with zip_iterator

Debug the SoA to AoS & AoS to SoA patterns and add a zip_iterator unit test.

There are 8 combinations:

host SoA to host AoS
host AoS to host SoA
host SoA to device AoS
host AoS to device SoA

device SoA to device AoS
device AoS to device SoA
device SoA to host AoS
device AoS to host SoA

Original issue reported on code.google.com by jaredhoberock on 9 Aug 2009 at 7:56

segmented_scan's unit tests non-deterministically fail on G80

Original issue reported on code.google.com by jaredhoberock on 23 Aug 2009 at 6:32

Iterator semantics should be well-codified

We currently don't have any good ways to query the following qualities of
iterators:

1. In which memory space does the iterator live?  host/device?
2. Which backend implements the iterator? cuda/openmp/etc?
3. Is the iterator mutable?
4. What type do I get if I dereference on the host vs. device?
5. What is the traversal category of the iterator?

One immediate symptom of this problem is the inability to dispatch based on
an iterator's host/device-ness.  Instead, we have to dispatch on the entire
tag at once, which necessitates separate paths for input/output iterator,
even though the implementations are identical.

Another symptom of this problem is the inability to query an iterator for
const-ness: this makes distinguishing between iterator & const_iterator
impossible at the moment.  device_ptr::value_type incorrectly inherits
const due to this problem. This problem also causes Komrade issue #3 [1].

We should port & extend Boost's comprehensive solution to these problems
[2,3].  We should also extensively test Thrust's entry points on the STL's
host-side containers to ensure these issues are solved.

[1] http://code.google.com/p/komrade/issues/detail?id=3
[2]
http://www.boost.org/doc/libs/1_39_0/libs/iterator/doc/iterator_archetypes.html
[3]
http://www.boost.org/doc/libs/1_39_0/libs/iterator/doc/iterator_concepts.html

Original issue reported on code.google.com by jaredhoberock on 11 Jun 2009 at 12:15

improve radix_sort keyBits optimization

Instead of just finding the maximum and sorting keys in [0, max], we should
also compute the minimum and sort only [0, max - min] by subtracting min
from all the keys.

This process can be achieved with essentially no cost, since the min
reduction can be fused with the max reduction and the key - min
transformation can be fused with the current preprocess/postprocess steps.

This will provide an especially significant speedup for signed integers
with negative values near 0.

Original issue reported on code.google.com by wnbell on 4 Sep 2009 at 5:08

No way to overload host and device versions of a predicate's method.

Steps to reproduce the problem.
1. Compile example.cu
2. Run program

The expected output is
----- 8< -----
Original vector:
 1 2 4 0 4 4 3 3 2 4 0 0 1 2 1 1 0 2 2 1

Filtered vector:
 1 2 4 4 4 3 3 2 4 1 2 1 1 2 2 1
----- >8 -----

The output is actually as follows:
----- 8< -----
Original vector:
 1 2 4 0 4 4 3 3 2 4 0 0 1 2 1 1 0 2 2 1

Filtered vector:
 1 2 4 4 4 3 3 2 4 1 2 1 1 2 2
----- >8 -----
(last element of the filtered vector is missing).

I'm using thrust 1.0 on WinXP SP3. CUDA SDK version 2.3 and Visual Studio
2005 are installed.

The problem is as follows. I am using the thrust::remove_if() function, the
version taking a predicate as its third argument. An instance of the
if_xn_less_than_level structure is passed as the predicate. The
if_xn_less_than_level::operator() method accepts an element of the vector
and returns true if the element is to be removed and false if not. To make
the decision, another vector is needed; a device pointer to it is stored in
the if_xn_less_than_level::xd member variable (the parameter of operator()
is the index into xd).

The problem is that operator() is not only called on the device, but also
on the host (thrust/remove.h, line 144). However, dereferencing the
if_xn_less_than_level::xd pointer is only valid on the device, not on the
host. I have also found no way to provide both __host__ and __device__
overloads of the if_xn_less_than_level::operator() method. My question is
simple: how do I get this to work?

Original issue reported on code.google.com by [email protected] on 31 Aug 2009 at 8:38

Matrix Support

It would be nice to support device and host matrices, as CUDA has native
support for them.

Original issue reported on code.google.com by [email protected] on 19 Jul 2009 at 12:29

Dispatch advance() and distance() on traversal tag instead of memory space

Currently advance() and distance() dispatch on memory space and assume that
device iterators are random access.  We should instead dispatch based on
traversal tags to ensure all iterator types are handled correctly in all
spaces.

Original issue reported on code.google.com by wnbell on 9 Sep 2009 at 9:57
