
Comments (10)

yyy1000 commented on July 20, 2024

I'd like to look into this and figure out how to do it. First I'll check whether Unnest can do something similar. :)

from arrow-datafusion.

yyy1000 commented on July 20, 2024

I have some updates to share:

The with_column implementation in DataFusion can't add a new column that comes from outside the DataFrame. It behaves the same as Spark's implementation, whose documentation (https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.withColumn.html) says:

Returns a new DataFrame by adding a column or replacing the existing column that has the same name.

The column expression must be an expression over this DataFrame; attempting to add a column from some other DataFrame will raise an error.
So using unnest is not a solution IMO.

When I tried to implement a new method, I got stuck on how to retrieve the data from a DataFrame. I think a DataFrame in Polars consists of a set of columns (see https://docs.rs/polars-core/0.38.3/src/polars_core/frame/mod.rs.html#134), so it looks more like a physical representation. But a DataFrame in DataFusion wraps a LogicalPlan, so I think it may be different and looks more like a logical one? 🤔

Correct me if I'm wrong, I'm not very familiar with this. :)
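To make the physical-versus-logical distinction above concrete, here is a minimal std-only sketch. `EagerFrame`, `LazyFrame`, and `Plan` are hypothetical toy types invented for illustration; they are not the Polars or DataFusion types, and real plans carry far more information.

```rust
/// Eager frame in the style described for Polars: the values sit in memory,
/// so the row count is known immediately.
struct EagerFrame {
    columns: Vec<(String, Vec<i64>)>,
}

impl EagerFrame {
    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |(_, values)| values.len())
    }
}

/// Toy logical plan: a description of work to do, holding no row data.
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    Scan { path: String },
    Project { input: Box<Plan>, exprs: Vec<String> },
}

/// Lazy frame in the style described for DataFusion: it only wraps a plan.
struct LazyFrame {
    plan: Plan,
}

impl LazyFrame {
    fn scan(path: &str) -> Self {
        LazyFrame { plan: Plan::Scan { path: path.to_string() } }
    }

    /// Adding a derived column only wraps the plan in a projection.
    /// No file is opened and no rows are read here.
    fn with_column(self, expr: &str) -> Self {
        LazyFrame {
            plan: Plan::Project {
                input: Box::new(self.plan),
                exprs: vec![expr.to_string()],
            },
        }
    }
}
```

The eager frame can answer `num_rows()` directly, while the lazy frame has no rows to count (or to line a new column up against) until its plan is executed.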

yyy1000 commented on July 20, 2024

Agreed. I think adding the method the user asked for would not be a good choice, because it would have to execute the DataFrame.

alamb commented on July 20, 2024

> But a DataFrame in DataFusion wraps a LogicalPlan, so I think it may be different and looks more like a logical one? 🤔

That does sound correct.

Or maybe the only way to implement "add_column" would be to actually execute the DataFrame (i.e. call collect, https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.collect) and then append the column's data to the resulting record batch.
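As a rough illustration of that "execute, then append" idea, the following std-only sketch models a materialized batch and appends a column to it. `Batch` and `append_column` are hypothetical stand-ins, not Arrow's RecordBatch or any DataFusion API, and the sketch assumes a single batch of `i32` columns.

```rust
/// Toy stand-in for one materialized result batch (not Arrow's RecordBatch).
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    names: Vec<String>,
    columns: Vec<Vec<i32>>,
}

impl Batch {
    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |c| c.len())
    }

    /// Returns a new batch with `values` appended as column `name`.
    /// Fails if the row counts differ or the name is already taken.
    fn append_column(&self, name: &str, values: Vec<i32>) -> Result<Batch, String> {
        if !self.columns.is_empty() && values.len() != self.num_rows() {
            return Err(format!(
                "row count mismatch: batch has {}, new column has {}",
                self.num_rows(),
                values.len()
            ));
        }
        if self.names.iter().any(|n| n == name) {
            return Err(format!("column {:?} already exists", name));
        }
        let mut out = self.clone();
        out.names.push(name.to_string());
        out.columns.push(values);
        Ok(out)
    }
}
```

The row-count check is only possible because the data has already been materialized, which is why such a method would force execution of the plan.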

However, that API sounds somewhat specialized -- and I am not sure it makes sense.

So maybe this request doesn't make sense and we should close the issue 🤔

Omega359 commented on July 20, 2024

My guess is that the idea for the Polars implementation came from Pandas where to add a column you do something like

df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)

which returns a new df

My first thought for this was to look into the cast issue with the first solution and to see if there was something there that could be adjusted to make it work.

yyy1000 commented on July 20, 2024

> > But a DataFrame in DataFusion wraps a LogicalPlan, so I think it may be different and looks more like a logical one? 🤔
>
> That does sound correct.
>
> Or maybe the only way to implement "add_column" would be to actually execute the DataFrame (i.e. call collect, https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.collect) and then append the column's data to the resulting record batch.
>
> However, that API sounds somewhat specialized -- and I am not sure it makes sense.
>
> So maybe this request doesn't make sense and we should close the issue 🤔

Yeah, I also think we can only append the column to the RecordBatch, so closing this issue makes sense to me.

yyy1000 commented on July 20, 2024

> My first thought for this was to look into the cast issue with the first solution and to see if there was something there that could be adjusted to make it work.

Do you mean using the with_column method in DataFusion? I don't think that works, since it can only add a column computed from existing ones, by adding a projection. 🤔

alamb commented on July 20, 2024

> Do you mean using the with_column method in DataFusion? I don't think that works, since it can only add a column computed from existing ones, by adding a projection. 🤔

As you point out, DataFusion's dataframe can already add a new column as a derived expression (by using project).

What it can't do is append a new column to an existing DataFrame:

// Read 100 rows in
let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;

// Create a new column with 100 integers
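let new_column = RecordBatch::try_from_iter(vec![
  // try_from_iter expects (name, ArrayRef) pairs, hence the cast
  ("foo", Arc::new(Int32Array::from_iter_values(0..100)) as ArrayRef)
]).unwrap();

// Append the new column to the dataframe
// (hypothetical API: append_column does not exist today;
// it would error if the row counts don't match)
let df = df.append_column(new_column).await?;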

However, I am not sure how useful this feature would be

Omega359 commented on July 20, 2024

Is this the first request for this? If so, I'd say that while it seems useful, its absence isn't actually a problem for most users.

alamb commented on July 20, 2024

Sounds good -- closing the ticket for now, and we can reopen / revisit if it is requested again.

Thanks @Omega359 and @yyy1000
