Predictions with Linear Regression

Two different Linear Regression models are built: one using Snowflake's Snowpark and the other using Python's scikit-learn library.


This is my first attempt at building a Linear Regression model using Snowflake's Snowpark. I built two different Linear Regression models.

One uses Snowflake's Snowpark and the other uses Python's scikit-learn library.

I then compare the predictions from the Snowpark model against those from the model I built in Python.

Overview

The objective is to predict inflation based on historical data from Snowflake's data cloud. We will process data with Snowpark, develop a simple ML model and create a Python User Defined Function (UDF) in Snowflake.

Getting The Data

We shall get the data from the Snowflake Marketplace listing FINANCIAL__ECONOMIC_ESSENTIALS. The table we will be using is Financial_Fred_Timeseries.

Create a New Database

Once we have access to the data source, we need to create a database that will store our User Defined Function (UDF). We will create a database named SUMMIT_HOL.

-- Setup database, need to be logged in as accountadmin role
-- Set role and warehouse (compute)
USE ROLE accountadmin;
USE WAREHOUSE compute_wh;

--Create database and stage for the Snowpark Python UDF
CREATE DATABASE IF NOT EXISTS summit_hol;
USE DATABASE summit_hol;
CREATE STAGE IF NOT EXISTS udf_stage;

Explore Data with Jupyter Notebook

I first need to initialize the notebook, import the required libraries, and create a Snowflake connection.
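A minimal sketch of that setup, where the connection parameters are placeholders you would replace with your own account details:

import pandas as pd
from snowflake.snowpark import Session

# Placeholder connection parameters - replace with your own account details
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "role": "ACCOUNTADMIN",
    "warehouse": "COMPUTE_WH",
    "database": "SUMMIT_HOL",
    "schema": "PUBLIC",
}

# Create the Snowpark session used throughout the notebook
session = Session.builder.configs(connection_parameters).create()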

Creating Features for ML Training

We want to predict the Personal Consumption Expenditures (PCE) price index, so we need to create a Pandas DataFrame that can be used to train the scikit-learn Linear Regression model. The .to_pandas() function pushes the query to Snowflake and returns the results as a Pandas DataFrame.
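A rough sketch of how that feature DataFrame might be built; the fully qualified table name, the VARIABLE_NAME filter value, and the column names are assumptions rather than the exact query from the original notebook:

from snowflake.snowpark.functions import col, year, avg

# Assumed table path and column names from the marketplace share
snow_df_pce = (
    session.table("FINANCIAL__ECONOMIC_ESSENTIALS.CYBERSYN.FINANCIAL_FRED_TIMESERIES")
    .filter(col("VARIABLE_NAME") == "Personal Consumption Expenditures: Chain-type Price Index")
    .group_by(year(col("DATE")).alias("YEAR"))
    .agg(avg(col("VALUE")).alias("PCE"))
    .sort(col("YEAR"))
)

# Push the query down to Snowflake and pull the results back as a Pandas DataFrame
pd_df_pce = snow_df_pce.to_pandas()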

In the example above, we build the DataFrame by expressing the query with DataFrame-style programming. When you run it, you get the same data back as the equivalent SQL statement; it is simply expressed through DataFrame operations.

Train the Linear Regression Model

Once we have created the features, we can train the model. We transform the Pandas DataFrame containing the features into NumPy arrays, fit the model, and then display a prediction.
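A sketch of that training step, assuming the Pandas DataFrame from the previous step has YEAR and PCE columns:

import numpy as np
from sklearn.linear_model import LinearRegression

# Convert the feature and target columns to NumPy arrays
x = pd_df_pce["YEAR"].to_numpy().reshape(-1, 1)
y = pd_df_pce["PCE"].to_numpy()

# Fit the linear regression model and display a prediction for 2023
lr_model = LinearRegression().fit(x, y)
print(lr_model.predict([[2023]]))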

The actual value for 2023 is 119.01, while the predicted value was 113.47.

Creating a User Defined Function within Snowflake to do the scoring there

Here we deploy the UDF to Snowflake, storing it in the udf_stage stage we created earlier, and we specify which packages are needed to run the function.
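A hedged sketch of how the registration might look; the UDF name predict_pce_udf is an assumption, and lr_model is the model fitted in the training sketch above:

from snowflake.snowpark.types import IntegerType, FloatType

def predict_pce(input_year: int) -> float:
    # Score with the trained model; Snowpark serialises the captured model object
    return float(lr_model.predict([[input_year]])[0])

# Register the function as a permanent UDF stored in the udf_stage stage
session.udf.register(
    func=predict_pce,
    name="predict_pce_udf",
    input_types=[IntegerType()],
    return_type=FloatType(),
    packages=["pandas", "scikit-learn"],
    is_permanent=True,
    stage_location="@udf_stage",
    replace=True,
)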

Test the UDF using a SQL command in Python

We can then leverage this function in future SQL queries. At this point, we are done building a simple model using Snowpark. In the section below, I build a separate Linear Regression model in Python using the same dataset.
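For example, assuming the UDF was registered as predict_pce_udf, it can be invoked from SQL directly in the notebook:

# Call the UDF through a SQL statement executed from Python
session.sql("SELECT predict_pce_udf(2023) AS predicted_pce").show()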


Linear Regression Model

Using Scikit-learn

I built the following Linear Regression model using scikit-learn and then compared the results of the 2 models.

Step 1: Import the necessary Libraries and read the dataset.
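A minimal sketch, assuming the PCE data was exported to a CSV file named pce_data.csv with Year and Value columns (both the file name and the column names are assumptions):

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics

# Assumed file name and column layout
df = pd.read_csv("pce_data.csv")
print(df.head())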

Step 2: Prepare the Dataset

Split the dataset into the feature X (Year) and the target variable y (Value); in other words, extract the independent variable (Year) and the dependent variable (Value).
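In code, this might look like the following, using the assumed Year and Value column names:

# Independent variable (Year) as a 2D array, dependent variable (Value) as a 1D array
X = df[["Year"]].values
y = df["Value"].values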

Step 3: Split the Dataset

The dataset is split into training and testing sets using scikit-learn’s ‘train_test_split()’ function. 
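A sketch of the split; the 80/20 ratio and the random_state value are assumptions:

# Hold back 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)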

Step 4: Fitting Linear Regression Model

We create an instance of the linear regression model using ‘LinearRegression’ from scikit-learn and fit it to the training data using the ‘fit()’ method.

# Create a Linear Regression object
model = LinearRegression()

# Train the model using the training sets
model.fit(X_train, y_train)

Step 5: Making Predictions

The model is trained using the training data (X_train and y_train). We then use the trained model to make predictions on both the training and testing sets.
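Continuing from the fitted model above:

# Predictions on the training and testing sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)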

Step 6: Evaluating the Model

We calculate the Mean Absolute Error (MAE), Mean Squared Error (MSE), and Root Mean Squared Error (RMSE) on the test set to evaluate the performance of the model. Lower error values indicate better model performance.
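A sketch of how the metrics reported below might be computed, assuming they are calculated on the test set:

# Error metrics on the test set
mae = metrics.mean_absolute_error(y_test, y_test_pred)
mse = metrics.mean_squared_error(y_test, y_test_pred)
rmse = np.sqrt(mse)

print("Mean Absolute Error (MAE):", mae)
print("Mean Squared Error (MSE):", mse)
print("Root Mean Squared Error (RMSE):", rmse)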

Mean Absolute Error (MAE): 3.6429708092429847
Mean Squared Error (MSE): 21.783922761054384
Root Mean Squared Error (RMSE): 4.6673250112944125

Step 7: Plot Existing Dataset

This is just to visualise what the dataset looks like before we make any predictions.
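A simple matplotlib sketch of that plot, again assuming the Year and Value column names:

import matplotlib.pyplot as plt

# Scatter plot of the raw dataset
plt.scatter(df["Year"], df["Value"])
plt.xlabel("Year")
plt.ylabel("Value")
plt.title("PCE value by year")
plt.show()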

Step 8: Run the model to predict future values

We can use our trained model to predict the value for the Year 2023.
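For example:

# Predict the value for a single year
prediction_2023 = model.predict([[2023]])
print(f"Predicted value for the year 2023: {prediction_2023}")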

Predicted value for the year 2023: [111.25635893]

Using the model, let's predict the values for the years 2020 through 2026.
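A sketch of the loop that produces the values below:

# Predict a value for each year from 2020 to 2026
for future_year in range(2020, 2027):
    predicted_value = model.predict([[future_year]])[0]
    print(f"Year: {future_year}, Value: {predicted_value}")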

Year: 2020, Value: 106.28741227408727
Year: 2021, Value: 107.94372782544497
Year: 2022, Value: 109.60004337680266
Year: 2023, Value: 111.25635892816035
Year: 2024, Value: 112.91267447951805
Year: 2025, Value: 114.56899003087574
Year: 2026, Value: 116.22530558223343

Step 9: Plot Actual & Predicted Values

We can plot both the actual and predicted values for the entire dataset to visualize how well our model fits the data.
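A sketch of that comparison plot, reusing the X and y arrays from earlier:

import matplotlib.pyplot as plt

# Actual values as points, the fitted regression line as the predicted values
plt.scatter(X, y, label="Actual")
plt.plot(X, model.predict(X), color="red", label="Predicted")
plt.xlabel("Year")
plt.ylabel("Value")
plt.legend()
plt.show()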


Compare the predicted results

The table below compares the actual values with the predictions from the model run with Snowflake's Snowpark and the scikit-learn model in Python.

Year    Actual     Run using Snowpark    Python Model
2020    104.458    108.2                 106.287
2021    106.145    109.96                107.943
2022    112.829    111.71                109.600
2023    119.011    113.47                111.256
2024    121.906    115.23                112.912
2025    -          116.98                114.568
2026    -          118.74                116.225
2027    -          120.49                117.881

Thanks for reading