Can you post the error message/stack trace that the UDF is giving you? It's hard to say where your issue might be coming from without the error message.
It is important to note the usage of array_construct and array_agg here: array_construct creates a Snowflake array of the selected column values for each row, whereas array_agg then aggregates those per-row arrays, collecting all of the corresponding data from those columns into a single array (a list-of-lists) in one Snowpark DataFrame column. In the UDF, pandas takes that list-of-lists (the data) plus a list of column names to construct a pandas dataframe. In your final code block, you're trying to construct a dataframe from a Snowpark function, which will fail. But I'd like to understand the error you're seeing in the UDF to better answer your question, as I'm not totally following the problem you're encountering.
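For reference, the pandas side of this is just the standard list-of-lists constructor. A minimal local sketch (plain pandas, no Snowpark; the station data below is made up for illustration):

```python
import pandas as pd

# Hypothetical data in the shape the UDF receives after
# array_agg(array_construct(...)): a list of rows, each row a list of values.
historical_data = [
    ['2020-01-01', 519, 12.0],
    ['2020-01-02', 519, 15.0],
    ['2020-01-03', 519, 9.0],
]
historical_column_names = ['DATE', 'STATION_ID', 'COUNT']

# pandas splits each inner list across the named columns.
df = pd.DataFrame(historical_data, columns=historical_column_names)
print(df.shape)  # → (3, 3)
```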
from sfguide-citibike-ml-snowpark-python.
Hey there, thanks for the response. I wrote a TDS article about the errors I found when attempting to use the methods from this section: https://github.com/Snowflake-Labs/sfguide-citibike-ml-snowpark-python/blob/main/03_ML_Engineering.ipynb.
The TDS article is:
https://towardsdatascience.com/using-python-udfs-and-snowflake-s-snowpark-to-build-and-deploy-machine-learning-models-part-2-2fe40d382ae7
The specific problem this issue is about is converting the array construct to a dataframe within the UDF. It doesn't work for me: it produces a one-column dataframe and doesn't split the data into separate columns, as 03_ML_Engineering.ipynb suggests I should be able to do using:
%%writefile dags/station_train_predict.py
def station_train_predict_func(historical_data: list,
                               historical_column_names: list,
                               target_column: str,
                               cutpoint: int,
                               max_epochs: int,
                               forecast_data: list,
                               forecast_column_names: list,
                               lag_values: list):
    from torch import tensor
    import pandas as pd
    from pytorch_tabnet.tab_model import TabNetRegressor
    from datetime import timedelta
    import numpy as np

    feature_columns = historical_column_names.copy()
    feature_columns.remove('DATE')
    feature_columns.remove(target_column)
    forecast_steps = len(forecast_data)
    df = pd.DataFrame(historical_data, columns=historical_column_names)
where historical_data here is:
historical_df = historical_df.group_by(F.col('STATION_ID'))\
    .agg(F.array_agg(F.array_construct(*historical_column_list))\
    .alias('HISTORICAL_DATA'))
and the UDF is called with:
historical_df.join(forecast_df)\
    .select(F.col('STATION_ID'),
            F.call_udf('station_train_predict_udf',
                       F.col('HISTORICAL_DATA'),
                       F.lit(historical_column_names),
                       F.lit(target_column),
                       F.lit(cutpoint),
                       F.lit(max_epochs),
                       F.col('FORECAST_DATA'),
                       F.lit(forecast_column_names),
                       F.lit(lag_values_array)).alias('PRED_DATA'))\
    .write.mode('overwrite')\
    .save_as_table('PRED_test')
To me, this suggests that an array construct can be fed in and converted seamlessly to a pandas dataframe within a UDF that is uploaded to Snowpark. Not sure if I am misunderstanding. Let me know your thoughts. The company I work for has an account with you guys, and I would be happy to jump on a call and walk you through each and every error I got, including this one.
Thanks,
Chris
The issue is that your example code is using .collect(), which brings Row objects back locally, and in the case of your testing, the Row object turns the array into a string.
In the quickstart, the purpose of using array_construct and array_agg is, first, to consolidate several columns in a Snowpark DataFrame into a single column with type ARRAY, and second, to collapse all records for a given station into a single row so that you can pass all records for a particular station into a single invocation of the UDF. This is fundamentally different from what you appear to be trying to test locally. See the snippets below, which follow the same pattern and do work. You have to include an array_agg statement in the input to your UDF to create an array of arrays, which can then be used to construct a pandas dataframe for training. Consider this sample Snowpark table with 570 records:
To perform model training in a UDF, I need to collapse all records in this table into a single row, using df.group_by().agg(F.array_agg(F.array_construct(*columns)).alias('INPUT_DATA')) as shown here. Notice this query returns a single row, with a single column consisting of an array of arrays:
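That aggregated shape can be mimicked locally with plain pandas to see what the UDF will receive. A sketch using a toy two-station table (the column names and values here are made up, not from the quickstart):

```python
import pandas as pd

df = pd.DataFrame({
    'STATION_ID': [1, 1, 2],
    'DATE': ['2020-01-01', '2020-01-02', '2020-01-01'],
    'COUNT': [10.0, 12.0, 7.0],
})

# Rough local analogue of
#   group_by('STATION_ID').agg(array_agg(array_construct('DATE', 'COUNT'))):
# one entry per station, whose single value is a list of [DATE, COUNT] lists.
agg = {station_id: g[['DATE', 'COUNT']].values.tolist()
       for station_id, g in df.groupby('STATION_ID')}

print(agg[1])  # → [['2020-01-01', 10.0], ['2020-01-02', 12.0]]
```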
Now I will write an example dummy UDF that takes this vectorized array input, constructs a pandas dataframe from it, and returns the number of rows in that dataframe (recall this should be 570 based on the previous .count() query from Snowpark):
# pandas (pd) and the `columns` list are captured from the enclosing notebook scope
@udf(name="pandas_test", packages=["pandas==1.3.5"],
     is_permanent=True, replace=True, stage_location="@my_stage",
     input_types=[ArrayType()],
     return_type=FloatType())
def test_pandas(old_data: list) -> float:
    return pd.DataFrame(old_data, columns=columns).shape[0]
The reason your existing code is not splitting the data properly is that, when you test locally, you are using .collect() and trying to construct a dataframe from a list of strings, as opposed to a list of lists. Additionally, UDF instances process only one row of data at a time, which is why you must use array_agg to collapse all rows into a single row in order to create a meaningful dataframe in the UDF.
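To illustrate the .collect() pitfall: Snowflake ARRAY values come back to the client as JSON-formatted strings, so a locally collected value must be parsed before pandas can split it into columns. A sketch with a hand-written string standing in for the collected row (the column names are made up for illustration):

```python
import json
import pandas as pd

# Stand-in for what an ARRAY column looks like after .collect():
# it arrives as a JSON string, not a Python list.
collected = '[["2020-01-01", 10.0], ["2020-01-02", 12.0]]'

# Feeding the raw string to pandas yields the "one column dataframe" symptom,
# because pandas sees a single scalar, not a list of rows.
# Parsing it first restores the list-of-lists the UDF expects:
rows = json.loads(collected)
df = pd.DataFrame(rows, columns=['DATE', 'COUNT'])
print(df.shape)  # → (2, 2)
```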
It is also worth noting that in that particular demo, we perform the array_agg and array_construct because we are training hundreds of models simultaneously via UDFs. For training a single model, you are better served using a Python stored procedure with to_pandas() on a high-memory Snowflake virtual warehouse.
Hey, thanks for the response. If you look closer, I am not using .collect() within the UDF, only before or after, when I am testing it outside of the UDF. If you look farther down my example code, I tried that exact process where I collapse it down to an array construct before feeding it into the UDF, then within the UDF try to convert it to a pandas dataframe. The issue for me lies in converting it to a pandas dataframe within the UDF. It doesn't split the data into separate columns when calling pd.DataFrame(array_construct, columns=columns) within the UDF on the un-collected array construct.
I would like to be able to do the array_agg and array_construct because I can see the utility of training hundreds of models simultaneously.
Hey thanks!! Super glad we got to this where I could get help with this error. Stoked to try it out. Been driving me nuts for days why I couldn't figure this out. I will try this out. Thanks.
ROFL, I was training N independent models each on a single record. No wonder all my probabilities were (0.5, 0.5)
Chris, can I close this issue? If you have corrected the issue, can you please also modify the blog post to reflect the corrections?
Thanks for the explanation. So the advice when training one model on lots of data is: use a stored procedure and bigger hardware?
@RichardScottOZ that's correct, currently Snowpark only supports single-node style training of a single model (vs. parallel training of many models) via Stored Procedures. Snowpark-optimized warehouses (currently in Private Preview) will increase the warehouse resources available to single-node operations dramatically for exactly these kinds of use-cases (large data, single model).