
Comments (12)

sfc-gh-cbaechtold commented on June 26, 2024

Can you post the error message/stack trace that the UDF is giving you? It's hard to say where your issue might be coming from without the error message.

It is important to note that array_construct creates, for each row, a Snowflake array of the values from the listed columns, whereas array_agg then collapses those per-row arrays into a single list-of-lists holding all of the corresponding data in one Array column of the Snowpark DataFrame.

In the UDF, pandas takes that list-of-lists (the data) plus a list of column names to construct a pandas dataframe. In your final code block, you're trying to construct a dataframe directly from a Snowpark function, which will fail. But I'd like to understand the error you're seeing in the UDF to better answer your question, as I'm not totally following the problem you're encountering.
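As a minimal illustration of the UDF side (the values below are made up for the example), pandas splits a list-of-lists into rows and columns when given matching column names:

import pandas as pd

# What the UDF receives after array_agg(array_construct(...)): a list of lists
historical_data = [[1, '2022-01-01', 10.5],
                   [2, '2022-01-02', 11.0]]
historical_column_names = ['STATION_ID', 'DATE', 'COUNT']

df = pd.DataFrame(historical_data, columns=historical_column_names)
print(df.shape)  # (2, 3): each inner list becomes one row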

chriskuchar commented on June 26, 2024

Hey there, thanks for the response. I wrote a TDS article about the errors I found when attempting to use the methods from this notebook: https://github.com/Snowflake-Labs/sfguide-citibike-ml-snowpark-python/blob/main/03_ML_Engineering.ipynb.

The TDS article is:
https://towardsdatascience.com/using-python-udfs-and-snowflake-s-snowpark-to-build-and-deploy-machine-learning-models-part-2-2fe40d382ae7

The specific problem this issue is about is converting the array construct to a dataframe within the UDF. It doesn't work for me: it converts the array construct to a one-column dataframe and doesn't split the data as 03_ML_Engineering.ipynb suggests I should be able to, using:

%%writefile dags/station_train_predict.py
def station_train_predict_func(historical_data:list, 
                               historical_column_names:list, 
                               target_column:str,
                               cutpoint: int, 
                               max_epochs: int, 
                               forecast_data:list,
                               forecast_column_names:list,
                               lag_values:list):
    
    from torch import tensor
    import pandas as pd
    from pytorch_tabnet.tab_model import TabNetRegressor
    from datetime import timedelta
    import numpy as np
    
    feature_columns = historical_column_names.copy()
    feature_columns.remove('DATE')
    feature_columns.remove(target_column)
    forecast_steps = len(forecast_data)
    
    df = pd.DataFrame(historical_data, columns=historical_column_names)  # in my tests this produces a one-column dataframe instead of splitting the data


The historical data here is:

historical_df = historical_df.group_by(F.col('STATION_ID'))\
                             .agg(F.array_agg(F.array_construct(*historical_column_list))\
                                  .alias('HISTORICAL_DATA'))

and the UDF is called with:

historical_df.join(forecast_df)\
             .select(F.col('STATION_ID'),
                     F.call_udf('station_train_predict_udf',
                                F.col('HISTORICAL_DATA'),
                                F.lit(historical_column_names),
                                F.lit(target_column),
                                F.lit(cutpoint),
                                F.lit(max_epochs),
                                F.col('FORECAST_DATA'),
                                F.lit(forecast_column_names),
                                F.lit(lag_values_array)).alias('PRED_DATA'))\
             .write.mode('overwrite')\
             .save_as_table('PRED_test')

To me, this suggests that an array construct can be fed in and converted seamlessly to a pandas dataframe within a UDF that is uploaded to Snowpark. Not sure if I am misunderstanding; let me know your thoughts. The company I work for has an account with you, and I would be happy to jump on a call and walk you through each and every error I got, including this one.

Thanks,
Chris

sfc-gh-cbaechtold commented on June 26, 2024

The issue is that your example code is using .collect(), which brings Row objects back locally; in the case of your testing, the Row object turns the array into a String.
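As a rough sketch of what that local test sees (using the historical_df from your snippet; json.loads is one way to recover the structure locally):

import json

# .collect() brings back Row objects; an ARRAY column arrives as a JSON string
rows = historical_df.select(F.col('HISTORICAL_DATA')).collect()
raw = rows[0]['HISTORICAL_DATA']   # a str, not a list of lists
data = json.loads(raw)             # parsing the JSON recovers the list of lists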

In the quickstart, the purpose of using array_construct and array_agg is, first, to consolidate several columns in a Snowpark DF into a single column of type array, and second, to collapse all records for a given station into a single row, so that all records for a particular station can be passed into a single invocation of the UDF. This is fundamentally different from what you appear to be testing locally. See the snippets below, which follow the same pattern and do work. You have to include an array_agg in the input to your UDF to create an array of arrays, which can then be used to construct a pandas dataframe for training. Consider this sample Snowpark table with 570 records:
[Screenshot: sample Snowpark table, with .count() showing 570 records]

To perform model training in a UDF, I need to collapse all records in this table into a single row, using df.group_by().agg(F.array_agg(F.array_construct(*columns)).alias('INPUT_DATA')) as shown here. Notice this query returns a single row, with a single column consisting of arrays of arrays:

[Screenshot: grouped query result, a single row whose one column holds an array of arrays]

Now I will write an example dummy UDF that takes this aggregated input array, constructs a pandas dataframe from it, and then returns the number of rows in that dataframe (recall this should be 570 based on the previous .count() query from Snowpark):

import pandas as pd
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import ArrayType, FloatType

@udf(name="pandas_test", packages=["pandas==1.3.5"],
     is_permanent=True, replace=True, stage_location="@my_stage",
     input_types=[ArrayType()],
     return_type=FloatType())
def test_pandas(old_data: list) -> float:
    # old_data is the array of arrays built by array_agg(array_construct(...));
    # columns is the list of column names captured from the enclosing scope
    return pd.DataFrame(old_data, columns=columns).shape[0]
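For completeness, the call that produces the output below might look like this (assuming the single-row table built by the group_by above is named df_agg):

df_agg.select(F.call_udf('pandas_test', F.col('INPUT_DATA'))).show()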

Notice the output:
[Screenshot: UDF output showing 570]

The reason your existing code is not splitting the data properly is that, when you test locally, you are using .collect() and trying to construct a dataframe from a list of strings, as opposed to a list of lists.

Additionally, UDF instances only process one row of data at a time, which is why you must use array_agg to collapse all rows into a single row in order to create a meaningful dataframe inside the UDF.

sfc-gh-cbaechtold commented on June 26, 2024

It is also worth noting that in that particular demo, we perform the array_agg and array_construct because we are training hundreds of models simultaneously via UDFs. For training a single model, you are better served using a Python stored procedure with to_pandas() on a high-memory Snowflake virtual warehouse.
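For reference, a minimal sketch of that stored-procedure pattern (the procedure name, stage, and table name are illustrative, and the actual model fitting is elided):

from snowflake.snowpark import Session
from snowflake.snowpark.functions import sproc

@sproc(name='train_single_model', is_permanent=True, replace=True,
       stage_location='@my_stage',
       packages=['snowflake-snowpark-python', 'pandas'])
def train_single_model(session: Session, table_name: str) -> str:
    # to_pandas() pulls the whole table into the procedure's memory,
    # hence the recommendation to run on a high-memory warehouse
    df = session.table(table_name).to_pandas()
    # ... fit a single model on df here ...
    return f'trained on {df.shape[0]} rows'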

chriskuchar commented on June 26, 2024

Hey, thanks for the response. If you look closer, I am not using .collect() within the UDF, only before or after, when I am testing it outside of the UDF. If you look farther down my example code, I tried that exact process: I collapse it down to an array construct before feeding it into the UDF, then within the UDF try to convert it to a pandas dataframe. The issue for me lies in converting it to a pandas dataframe within the UDF. It doesn't split the data into separate columns when calling pd.DataFrame(array_construct, columns=columns) within the UDF on the un-collected array_construct.

I would like to be able to do the array_agg and array_construct because I can see the utility of training hundreds of models simultaneously.

sfc-gh-cbaechtold commented on June 26, 2024

sfc-gh-cbaechtold commented on June 26, 2024

chriskuchar commented on June 26, 2024

Hey, thanks!! Super glad we got to the point where I could get help with this error. It's been driving me nuts for days why I couldn't figure this out. Stoked to try it out. Thanks.

chriskuchar commented on June 26, 2024

ROFL, I was training N independent models each on a single record. No wonder all my probabilities were (0.5, 0.5)

sfc-gh-cbaechtold commented on June 26, 2024

Chris, can I close this issue? If you have corrected the issue, can you please also modify the blog post to reflect the corrections?

RichardScottOZ commented on June 26, 2024

Thanks for the explanation. So the advice, if training one model on lots of data, is to use a stored procedure and hardware to solve it?

sfc-gh-cbaechtold commented on June 26, 2024

@RichardScottOZ that's correct. Currently Snowpark only supports single-node training of a single model (vs. parallel training of many models) via stored procedures. Snowpark-optimized warehouses (currently in Private Preview) will dramatically increase the warehouse resources available to single-node operations, for exactly these kinds of use cases (large data, single model).
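Once available, provisioning one might look like this from Snowpark (a sketch using Snowflake's warehouse DDL; the warehouse name is illustrative):

session.sql("""
    CREATE WAREHOUSE IF NOT EXISTS snowpark_opt_wh
    WITH WAREHOUSE_SIZE = 'MEDIUM'
         WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
""").collect()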
