Comments (10)
I'd like to look at this and find out how to do it. First I will try to see whether Unnest could do a similar thing. :)
from arrow-datafusion.
I have some updates to share:
The with_column implementation in DataFusion can't add a new column of independent data. It behaves the same as Spark's implementation, which is documented at https://spark.apache.org/docs/3.1.3/api/python/reference/api/pyspark.sql.DataFrame.withColumn.html as follows:
Returns a new DataFrame by adding a column or replacing the existing column that has the same name.
The column expression must be an expression over this DataFrame; attempting to add a column from some other DataFrame will raise an error.
So using unnest is not a solution IMO.
When I tried to implement a new method, I got stuck on how to retrieve the data from a dataframe. I think a DataFrame in Polars consists of a set of columns (see https://docs.rs/polars-core/0.38.3/src/polars_core/frame/mod.rs.html#134), so it looks more like a physical one. But a dataframe in DataFusion is built on a LogicalPlan, so I think it may be different and looks more like a logical one? 🤔
Correct me if I'm wrong, I'm not very familiar with this. :)
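To make the physical vs. logical distinction concrete, here is a minimal toy sketch in plain Rust. It uses neither Polars nor DataFusion: EagerFrame, Plan, with_column and collect are hypothetical names invented for illustration, and only loosely mirror the real types. The eager frame owns its column data, while the plan is just a description that produces data when collect is called.

```rust
use std::collections::BTreeMap;

/// Eager, "physical" frame: it owns the column data directly.
/// (Hypothetical toy type, not the Polars DataFrame.)
#[derive(Debug, Clone, PartialEq)]
struct EagerFrame {
    columns: BTreeMap<String, Vec<i64>>,
}

/// Lazy, "logical" frame: only a description of how to produce data.
/// (Hypothetical toy type, not DataFusion's LogicalPlan.)
enum Plan {
    /// Leaf node standing in for a table scan.
    Scan(EagerFrame),
    /// Add (or replace) a column computed from an existing column.
    WithColumn {
        input: Box<Plan>,
        name: String,
        source: String,
        f: fn(i64) -> i64,
    },
}

impl Plan {
    /// Building a plan node touches no data; it only wraps the input plan.
    fn with_column(self, name: &str, source: &str, f: fn(i64) -> i64) -> Plan {
        Plan::WithColumn {
            input: Box::new(self),
            name: name.to_string(),
            source: source.to_string(),
            f,
        }
    }

    /// Data is only materialized here, by walking the plan bottom-up.
    fn collect(&self) -> Result<EagerFrame, String> {
        match self {
            Plan::Scan(frame) => Ok(frame.clone()),
            Plan::WithColumn { input, name, source, f } => {
                let mut frame = input.collect()?;
                let values: Vec<i64> = frame
                    .columns
                    .get(source)
                    .ok_or_else(|| format!("unknown column: {source}"))?
                    .iter()
                    .map(|v| f(*v))
                    .collect();
                frame.columns.insert(name.clone(), values);
                Ok(frame)
            }
        }
    }
}
```

In this toy model a derived column fits naturally into the plan, because it is an expression over the plan's own output. A column of unrelated data has no row count to check against until the plan has been executed, which is roughly the difficulty described above.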
from arrow-datafusion.
Agree, I think adding the method the user needed would not be a good choice, because it will execute the dataframe.
from arrow-datafusion.
But a dataframe in DataFusion is built on a LogicalPlan, so I think it may be different and looks more like a logical one? 🤔
That does sound correct.
Or maybe the only way to implement "add_column" would be to actually execute the DataFrame (i.e. call https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.collect) and then append the column's data to the resulting record batch.
However, that API sounds somewhat specialized -- and I am not sure it makes sense.
So maybe this request doesn't make sense and we should close the issue 🤔
from arrow-datafusion.
My guess is that the idea for the Polars implementation came from Pandas, where to add a column you do something like:
df1 = df1.assign(e=pd.Series(np.random.randn(sLength)).values)
which returns a new df.
My first thought for this was to look into the cast issue with the first solution and to see if there was something there that could be adjusted to make it work.
from arrow-datafusion.
But a dataframe in DataFusion is built on a LogicalPlan, so I think it may be different and looks more like a logical one? 🤔
That does sound correct.
Or maybe the only way to implement "add_column" would be to actually execute the DataFrame (i.e. call https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.collect) and then append the column's data to the resulting record batch.
However, that API sounds somewhat specialized -- and I am not sure it makes sense.
So maybe this request doesn't make sense and we should close the issue 🤔
Yeah, I also think the only option is to append the column to the RecordBatch after execution, so closing this issue makes sense to me.
from arrow-datafusion.
My first thought for this was to look into the cast issue with the first solution and to see if there was something there that could be adjusted to make it work.
Do you mean using the with_column method in DF? I think that doesn't make sense, since it can only add a column derived from existing columns, by adding a projection. 🤔
from arrow-datafusion.
Do you mean using the with_column method in DF? I think that doesn't make sense, since it can only add a column derived from existing columns, by adding a projection. 🤔
As you point out, DataFusion's dataframe can already add a new column as a derived expression (by using project).
What it can't do is append a new column of independent data to an existing DataFrame:
// Read 100 rows in
let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
// Create a new column with 100 integers
let new_column = RecordBatch::try_from_iter(vec![
    ("foo", Arc::new(Int32Array::from_iter_values(0..100)) as ArrayRef),
]).unwrap();
// Append the new column to the dataframe
// (append_column is hypothetical and does not exist in DataFusion;
// it would error if the row counts don't match)
let df = df.append_column(new_column).await?;
However, I am not sure how useful this feature would be.
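For reference, the row-count check such a method would need can be sketched on already-materialized data. This is a toy model in plain Rust, assuming a simplified Batch type with i32 columns only; it is not the Arrow RecordBatch API, and append_column here is a hypothetical helper.

```rust
/// Toy stand-in for a materialized batch: named, equal-length columns.
/// (Hypothetical type for illustration, not Arrow's RecordBatch.)
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    columns: Vec<(String, Vec<i32>)>,
}

impl Batch {
    /// Number of rows, taken from the first column (0 for an empty batch).
    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |(_, values)| values.len())
    }

    /// Append a column of independent data.
    /// Fails if the row counts differ or the name is already taken.
    fn append_column(mut self, name: &str, values: Vec<i32>) -> Result<Batch, String> {
        if !self.columns.is_empty() && values.len() != self.num_rows() {
            return Err(format!(
                "row count mismatch: batch has {} rows, new column has {}",
                self.num_rows(),
                values.len()
            ));
        }
        if self.columns.iter().any(|(existing, _)| existing == name) {
            return Err(format!("duplicate column name: {name}"));
        }
        self.columns.push((name.to_string(), values));
        Ok(self)
    }
}
```

The check is only possible because the batch already holds its data. On a lazy DataFrame the row count is unknown until the plan runs, so an equivalent method would have to execute the plan first.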
from arrow-datafusion.
Is this the first request for this? If so I'd say that while it seems useful it's not actually something that is an issue in general.
from arrow-datafusion.
Sounds good -- closing the ticket for now, and we can reopen / revisit if it is requested again.
from arrow-datafusion.