
Comments (12)

sfc-gh-cbaechtold commented on June 26, 2024

Can you post the error message/stack trace that the UDF is giving you? It's hard to say where your issue might be coming from without the error message.

It is important to note that array_construct creates, for each row, a Snowflake array of the values from the listed columns, whereas array_agg then collapses those per-row arrays into a single list-of-lists holding all of the corresponding data in one Array column of the Snowpark DataFrame.

In the UDF, pandas takes that list-of-lists (the data) plus a list of column names to construct a pandas dataframe. In your final code block, you're trying to construct a dataframe directly from a Snowpark function, which will fail. But I'd like to understand the error you're seeing in the UDF to better answer your question, as I'm not totally following the problem you're encountering.
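As a minimal illustration of the UDF side (the values below are made up for the example), pandas splits a list-of-lists into rows and columns when given matching column names:

import pandas as pd

# What the UDF receives after array_agg(array_construct(...)): a list of lists
historical_data = [[1, '2022-01-01', 10.5],
                   [2, '2022-01-02', 11.0]]
historical_column_names = ['STATION_ID', 'DATE', 'COUNT']

df = pd.DataFrame(historical_data, columns=historical_column_names)
print(df.shape)  # (2, 3): each inner list becomes one row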

chriskuchar commented on June 26, 2024

Hey there, thanks for the response. I wrote a TDS article about the errors I found when attempting to use the methods from this notebook: https://github.com/Snowflake-Labs/sfguide-citibike-ml-snowpark-python/blob/main/03_ML_Engineering.ipynb.

The TDS article is:
https://towardsdatascience.com/using-python-udfs-and-snowflake-s-snowpark-to-build-and-deploy-machine-learning-models-part-2-2fe40d382ae7

The specific problem this issue is about is converting the array construct to a dataframe within the UDF. It doesn't work for me: it converts the array construct to a one-column dataframe and doesn't split the data as 03_ML_Engineering.ipynb suggests I should be able to, using:

%%writefile dags/station_train_predict.py
def station_train_predict_func(historical_data:list, 
                               historical_column_names:list, 
                               target_column:str,
                               cutpoint: int, 
                               max_epochs: int, 
                               forecast_data:list,
                               forecast_column_names:list,
                               lag_values:list):
    
    from torch import tensor
    import pandas as pd
    from pytorch_tabnet.tab_model import TabNetRegressor
    from datetime import timedelta
    import numpy as np
    
    feature_columns = historical_column_names.copy()
    feature_columns.remove('DATE')
    feature_columns.remove(target_column)
    forecast_steps = len(forecast_data)
    
    df = pd.DataFrame(historical_data, columns=historical_column_names)  # in my tests this produces a one-column dataframe instead of splitting the data


The historical data here is:

historical_df = historical_df.group_by(F.col('STATION_ID'))\
                             .agg(F.array_agg(F.array_construct(*historical_column_list))\
                                  .alias('HISTORICAL_DATA'))

and the UDF is called with:

historical_df.join(forecast_df)\
             .select(F.col('STATION_ID'),
                     F.call_udf('station_train_predict_udf',
                                F.col('HISTORICAL_DATA'),
                                F.lit(historical_column_names),
                                F.lit(target_column),
                                F.lit(cutpoint),
                                F.lit(max_epochs),
                                F.col('FORECAST_DATA'),
                                F.lit(forecast_column_names),
                                F.lit(lag_values_array)).alias('PRED_DATA'))\
             .write.mode('overwrite')\
             .save_as_table('PRED_test')

To me, this suggests that an array construct can be fed in and converted seamlessly to a pandas dataframe within a UDF that is uploaded to Snowpark. Not sure if I am misunderstanding; let me know your thoughts. The company I work for has an account with you, and I would be happy to jump on a call and walk you through each and every error I got, including this one.

Thanks,
Chris

sfc-gh-cbaechtold commented on June 26, 2024

The issue is that your example code is using .collect(), which brings Row objects back locally; in the case of your testing, the Row object turns the array into a String.
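As a rough sketch of what that local test sees (using the historical_df from your snippet; json.loads is one way to recover the structure locally):

import json

# .collect() brings back Row objects; an ARRAY column arrives as a JSON string
rows = historical_df.select(F.col('HISTORICAL_DATA')).collect()
raw = rows[0]['HISTORICAL_DATA']   # a str, not a list of lists
data = json.loads(raw)             # parsing the JSON recovers the list of lists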

In the quickstart, the purpose of using array_construct and array_agg is, first, to consolidate several columns in a Snowpark DF into a single column of type array, and second, to collapse all records for a given station into a single row, so that all records for a particular station can be passed into a single invocation of the UDF. This is fundamentally different from what you appear to be testing locally. See the snippets below, which follow the same pattern and do work. You have to include an array_agg in the input to your UDF to create an array of arrays, which can then be used to construct a pandas dataframe for training. Consider this sample Snowpark table with 570 records:
[Screenshot: sample Snowpark table, with .count() showing 570 records]

To perform model training in a UDF, I need to collapse all records in this table into a single row, using df.group_by().agg(F.array_agg(F.array_construct(*columns)).alias('INPUT_DATA')) as shown here. Notice this query returns a single row, with a single column consisting of arrays of arrays:

[Screenshot: grouped query result, a single row whose one column holds an array of arrays]

Now I will write an example dummy UDF that takes this aggregated input array, constructs a pandas dataframe from it, and then returns the number of rows in that dataframe (recall this should be 570 based on the previous .count() query from Snowpark):

import pandas as pd
from snowflake.snowpark.functions import udf
from snowflake.snowpark.types import ArrayType, FloatType

@udf(name="pandas_test", packages=["pandas==1.3.5"],
     is_permanent=True, replace=True, stage_location="@my_stage",
     input_types=[ArrayType()],
     return_type=FloatType())
def test_pandas(old_data: list) -> float:
    # old_data is the array of arrays built by array_agg(array_construct(...));
    # columns is the list of column names captured from the enclosing scope
    return pd.DataFrame(old_data, columns=columns).shape[0]
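For completeness, the call that produces the output below might look like this (assuming the single-row table built by the group_by above is named df_agg):

df_agg.select(F.call_udf('pandas_test', F.col('INPUT_DATA'))).show()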

Notice the output:
[Screenshot: UDF output showing 570]

The reason your existing code is not splitting the data properly is that, when you test locally, you are using .collect() and trying to construct a dataframe from a list of strings, as opposed to a list of lists.

Additionally, UDF instances only process one row of data at a time, which is why you must use array_agg to collapse all rows into a single row in order to create a meaningful dataframe inside the UDF.

sfc-gh-cbaechtold commented on June 26, 2024

It is also worth noting that in that particular demo, we perform the array_agg and array_construct because we are training hundreds of models simultaneously via UDFs. For training a single model, you are better served using a Python stored procedure with to_pandas() on a high-memory Snowflake virtual warehouse.
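For reference, a minimal sketch of that stored-procedure pattern (the procedure name, stage, and table name are illustrative, and the actual model fitting is elided):

from snowflake.snowpark import Session
from snowflake.snowpark.functions import sproc

@sproc(name='train_single_model', is_permanent=True, replace=True,
       stage_location='@my_stage',
       packages=['snowflake-snowpark-python', 'pandas'])
def train_single_model(session: Session, table_name: str) -> str:
    # to_pandas() pulls the whole table into the procedure's memory,
    # hence the recommendation to run on a high-memory warehouse
    df = session.table(table_name).to_pandas()
    # ... fit a single model on df here ...
    return f'trained on {df.shape[0]} rows'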

chriskuchar commented on June 26, 2024

Hey, thanks for the response. If you look closer, I am not using .collect() within the UDF, only before or after, when I am testing it outside of the UDF. If you look farther down my example code, I tried that exact process: I collapse it down to an array construct before feeding it into the UDF, then within the UDF try to convert it to a pandas dataframe. The issue for me lies in converting it to a pandas dataframe within the UDF. It doesn't split the data into separate columns when calling pd.DataFrame(array_construct, columns=columns) within the UDF on the un-collected array_construct.

I would like to be able to do the array_agg and array_construct because I can see the utility of training hundreds of models simultaneously.

sfc-gh-cbaechtold commented on June 26, 2024

sfc-gh-cbaechtold commented on June 26, 2024

chriskuchar commented on June 26, 2024

Hey, thanks!! Super glad we got to the point where I could get help with this error. It's been driving me nuts for days why I couldn't figure this out. Stoked to try it out. Thanks.

chriskuchar commented on June 26, 2024

ROFL, I was training N independent models each on a single record. No wonder all my probabilities were (0.5, 0.5)

sfc-gh-cbaechtold commented on June 26, 2024

Chris, can I close this issue? If you have corrected the issue, can you please also modify the blog post to reflect the corrections?

RichardScottOZ commented on June 26, 2024

Thanks for the explanation. So the advice, if training one model on lots of data, is to use a stored procedure and hardware to solve it?

sfc-gh-cbaechtold commented on June 26, 2024

@RichardScottOZ that's correct. Currently Snowpark only supports single-node training of a single model (vs. parallel training of many models) via stored procedures. Snowpark-optimized warehouses (currently in Private Preview) will dramatically increase the warehouse resources available to single-node operations, for exactly these kinds of use cases (large data, single model).
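Once available, provisioning one might look like this from Snowpark (a sketch using Snowflake's warehouse DDL; the warehouse name is illustrative):

session.sql("""
    CREATE WAREHOUSE IF NOT EXISTS snowpark_opt_wh
    WITH WAREHOUSE_SIZE = 'MEDIUM'
         WAREHOUSE_TYPE = 'SNOWPARK-OPTIMIZED'
""").collect()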
