
Comments (7)

vyasr avatar vyasr commented on August 16, 2024

CC @kkraus14 @jorisvandenbossche @paleolimbot @jrhemstad @davidwendt

from cudf.

jrhemstad avatar jrhemstad commented on August 16, 2024

A new type is fine, but I'm not clear on what you're envisioning for where/how this cudf::arrow_column would be used.


vyasr avatar vyasr commented on August 16, 2024

The API I'm envisioning would be

std::unique_ptr<cudf::arrow_column> from_arrow(ArrowSchema const* schema, ArrowDeviceArray* input, rmm::cuda_stream_view stream, rmm::device_async_resource_ref mr);

The object would look roughly like this:

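class arrow_column {
public:
  arrow_column(ArrowDeviceArray* input) : arr{std::make_shared<ArrowDeviceArray>()} {
    // Take ownership of the array; the device metadata is plain data and is copied.
    ArrowArrayMove(&input->array, &arr->array);
    arr->device_id   = input->device_id;
    arr->device_type = input->device_type;
    arr->sync_event  = input->sync_event;
  }
  ~arrow_column() {
    // Invoke the producer's release callback to free the underlying buffers.
    if (arr->array.release != nullptr) { arr->array.release(&arr->array); }
  }
  column_view view();
  mutable_column_view mutable_view();
private:
  // Using a shared_ptr to potentially allow re-export in the future, but that would require extra machinery to get right.
  std::shared_ptr<ArrowDeviceArray> arr;
};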
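
To make the ownership idea concrete, here is a minimal, self-contained sketch of one way the release could be tied to the last owner instead of to a single object's destructor. It is only an illustration under stated assumptions: the two structs are local copies of the Arrow C Data / C Device Data Interface layouts so that the example compiles without Arrow or nanoarrow headers, and `shared_arrow_array`, `counting_release`, and `run_shared_demo` are hypothetical names that are not part of cudf.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Local copies of the Arrow C (Device) Data Interface structs, for illustration only.
// Real code would include the Arrow or nanoarrow headers instead.
struct ArrowArray {
  std::int64_t length;
  std::int64_t null_count;
  std::int64_t offset;
  std::int64_t n_buffers;
  std::int64_t n_children;
  const void** buffers;
  ArrowArray** children;
  ArrowArray* dictionary;
  void (*release)(ArrowArray*);
  void* private_data;
};

struct ArrowDeviceArray {
  ArrowArray array;
  std::int64_t device_id;
  std::int32_t device_type;
  void* sync_event;
  std::int64_t reserved[3];
};

// Hypothetical owner: the release callback runs when the last shared_ptr goes away.
class shared_arrow_array {
 public:
  explicit shared_arrow_array(ArrowDeviceArray* input)
    : arr_(new ArrowDeviceArray(*input), [](ArrowDeviceArray* p) {
        if (p->array.release != nullptr) { p->array.release(&p->array); }
        delete p;
      })
  {
    // Mark the source as released, which is how the C Data Interface expresses a move.
    input->array.release = nullptr;
  }

  // Hand out another owner of the same underlying array.
  std::shared_ptr<ArrowDeviceArray> share() const { return arr_; }

 private:
  std::shared_ptr<ArrowDeviceArray> arr_;
};

// Test producer: counts how often the release callback is invoked.
static int g_release_calls = 0;
void counting_release(ArrowArray* a)
{
  ++g_release_calls;
  a->release = nullptr;  // required by the spec: a released array has a null callback
}

struct shared_demo_result {
  bool source_moved;
  int calls_while_shared;
  int calls_after_last_owner;
};

shared_demo_result run_shared_demo()
{
  g_release_calls = 0;
  ArrowDeviceArray input{};
  input.array.length  = 3;
  input.array.release = &counting_release;

  std::shared_ptr<ArrowDeviceArray> extra;
  bool moved = false;
  {
    shared_arrow_array col(&input);
    moved = (input.array.release == nullptr);
    extra = col.share();
  }  // col is destroyed here, but extra still keeps the array alive
  int const while_shared = g_release_calls;
  extra.reset();  // last owner gone, so the release callback fires now
  return {moved, while_shared, g_release_calls};
}
```

Calling `run_shared_demo()` exercises the case where the column object goes away while another owner is still alive: the callback is deferred until the final owner is dropped and runs exactly once.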


jrhemstad avatar jrhemstad commented on August 16, 2024

Got it, so the idea would be to just introduce this type as a new return type for from_arrow that preserves the shared ownership semantics. Other cudf APIs would be unaffected and we wouldn't expect to update other APIs to also try and preserve shared ownership semantics.

Makes sense to me.


davidwendt avatar davidwendt commented on August 16, 2024

This reminds me a bit of contiguous-split, where a struct is returned that contains device memory along with a view into that data, though the details do not quite match here.
Perhaps it would help to think of this in those terms.
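
For readers less familiar with that pattern, a host-only sketch of "owning storage plus a view into it" might look like the following. This is an analogy only: `int_view`, `packed_ints`, `make_packed`, and `sum_view` are hypothetical names, not cudf APIs, and real cudf code would hold device memory where this uses a `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <numeric>
#include <utility>
#include <vector>

// Non-owning view: a pointer and a length, nothing more.
struct int_view {
  const std::int32_t* data;
  std::size_t size;
};

// Owning storage bundled with a view into it. The storage lives on the heap
// behind a unique_ptr, so moving the struct does not invalidate the view.
struct packed_ints {
  std::unique_ptr<std::vector<std::int32_t>> storage;
  int_view view;
};

// Build storage holding 0, 1, ..., n-1 and a view over all of it.
packed_ints make_packed(std::size_t n)
{
  auto storage = std::make_unique<std::vector<std::int32_t>>(n);
  std::iota(storage->begin(), storage->end(), 0);
  int_view const v{storage->data(), storage->size()};
  return packed_ints{std::move(storage), v};
}

// Consumers only ever see the view.
std::int64_t sum_view(int_view v)
{
  return std::accumulate(v.data, v.data + v.size, std::int64_t{0});
}
```

The design point is that the view stays valid for exactly as long as the struct that carries the storage stays alive, which is the same lifetime contract a returned `arrow_column` would offer for its `column_view`.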


vyasr avatar vyasr commented on August 16, 2024

Whoops also obviously CC @zeroshade


vyasr avatar vyasr commented on August 16, 2024

Here is a more complete sketch of what I'm imagining. I haven't thought all the way through how the to_arrow side of things should look, but here's one proposal:

// Class to manage lifetime semantics and allow re-export.
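struct arrow_array_container {
  // Held by value so that the container controls the struct's lifetime.
  ArrowDeviceArray arr{};
  // Question: When the input data was host data, we could presumably release
  // immediately. Do we care? If so, how should we implement that?
  ~arrow_array_container() {
    // A null release callback marks an already released (or moved-from) array.
    if (arr.array.release != nullptr) { arr.array.release(&arr.array); }
  }
};

class arrow_column {
public:
  arrow_column(ArrowDeviceArray* input) : container{std::make_shared<arrow_array_container>()} {
    // Take ownership of the array; the device metadata is plain data and is copied.
    ArrowArrayMove(&input->array, &container->arr.array);
    container->arr.device_id   = input->device_id;
    container->arr.device_type = input->device_type;
    container->arr.sync_event  = input->sync_event;
  }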
  cudf::column_view view();
  cudf::mutable_column_view mutable_view();

  // Create Array whose private_data contains a shared_ptr to this->container
  // The output should be consumer-allocated, see
  // https://arrow.apache.org/docs/format/CDataInterface.html#member-allocation
  // Note: May need stream/mr depending on where we put what logic.
  void to_arrow(ArrowDeviceArray *output);
private:
  // Using a shared_ptr allows re-export via to_arrow
  std::shared_ptr<arrow_array_container> container;
};

class arrow_table {
public:
  arrow_table(std::vector<std::shared_ptr<arrow_column>> columns) : columns{std::move(columns)} {}
  cudf::table_view view();
  cudf::mutable_table_view mutable_view();
  // Create Array whose private_data contains shared_ptrs to all the underlying arrow_array_containers
  void to_arrow(ArrowDeviceArray *output);
private:
  // Would allow arrow_columns being in multiple arrow_tables
  std::vector<std::shared_ptr<arrow_column>> columns;
};

// ArrowArrayStream and ArrowArray overloads (they can be overloads now instead
// of separate functions) are trivial wrappers around this function. Also need versions
// of all three that return an arrow_column instead of an arrow_table.
std::unique_ptr<arrow_table> from_arrow(
  ArrowSchema const* schema,
  ArrowDeviceArray *input,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);

// Produce an ArrowDeviceArray and then create an arrow_table around it.
std::unique_ptr<arrow_table> to_arrow(
  // Question: Do we really need a column_view overload? If we're going this
  // route, I think it's OK to always require a transfer of ownership to the
  // arrow_table, but there is potentially some small overhead there.
  std::unique_ptr<cudf::table> input,
  rmm::cuda_stream_view stream      = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource());
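
As a rough illustration of the re-export idea in the `to_arrow` member above, the sketch below stores a heap-allocated `shared_ptr` to the container in the exported array's `private_data` and drops it in the release callback. It is a sketch under several assumptions, not the proposed implementation: it uses a local copy of the host `ArrowArray` struct (not `ArrowDeviceArray`) to stay short, it only handles flat arrays with no children or dictionary, and `array_container`, `export_array`, `release_exported`, and `run_export_demo` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Local copy of the Arrow C Data Interface struct, for illustration only.
struct ArrowArray {
  std::int64_t length;
  std::int64_t null_count;
  std::int64_t offset;
  std::int64_t n_buffers;
  std::int64_t n_children;
  const void** buffers;
  ArrowArray** children;
  ArrowArray* dictionary;
  void (*release)(ArrowArray*);
  void* private_data;
};

// Hypothetical stand-in for arrow_array_container: owns the imported array
// and calls the producer's release callback exactly once.
struct array_container {
  ArrowArray arr;
  explicit array_container(ArrowArray* input) : arr(*input) { input->release = nullptr; }
  ~array_container()
  {
    if (arr.release != nullptr) { arr.release(&arr); }
  }
  array_container(array_container const&)            = delete;
  array_container& operator=(array_container const&) = delete;
};

// Release callback installed on exported arrays: drops this export's reference.
void release_exported(ArrowArray* out)
{
  delete static_cast<std::shared_ptr<array_container>*>(out->private_data);
  out->private_data = nullptr;
  out->release      = nullptr;
}

// Fill a consumer-allocated struct that shares buffers with the container.
// Assumption: flat arrays only (no children, no dictionary).
void export_array(std::shared_ptr<array_container> const& owner, ArrowArray* out)
{
  *out              = owner->arr;  // shallow copy of lengths and buffer pointers
  out->private_data = new std::shared_ptr<array_container>(owner);
  out->release      = &release_exported;
}

// Test producer: counts how often the original release callback runs.
static int g_producer_releases = 0;
void producer_release(ArrowArray* a)
{
  ++g_producer_releases;
  a->release = nullptr;
}

struct export_demo_result {
  int after_owner_dropped;
  int after_first_release;
  int after_second_release;
};

export_demo_result run_export_demo()
{
  g_producer_releases = 0;
  ArrowArray input{};
  input.length  = 4;
  input.release = &producer_release;

  auto owner = std::make_shared<array_container>(&input);
  ArrowArray out1{};
  ArrowArray out2{};
  export_array(owner, &out1);
  export_array(owner, &out2);

  owner.reset();  // the importing side lets go; the exports keep the data alive
  int const a = g_producer_releases;
  out1.release(&out1);
  int const b = g_producer_releases;
  out2.release(&out2);  // last reference: the producer's callback finally runs
  return {a, b, g_producer_releases};
}
```

Supporting nested types would need more care, since consumers are allowed to move child arrays out of a parent; that is presumably part of the "extra machinery" mentioned earlier in the thread.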
