Comparing Model Performance on Training Data vs. Test Data


OkCupid Dataset

Artificial Intelligence
Published: Nov 23, 2024 Edited: Nov 23, 2024

At this point in our machine learning project, we have completed several key steps in preparing our model for deployment:

  • Training the model on the training dataset
  • Performing cross-validation
  • Selecting the best model (in our case, a linear regression model)

After these steps, we've arrived at the critical phase of testing the model's performance on unseen data. This article will examine how the model performs on the training data versus the test data.

A Quick Recap: Preprocessing the Test Data

Before running our trained model on the test dataset, several preprocessing steps were applied to ensure the data was in the right format for prediction. Below are the key transformations and actions we performed, which were essential for preparing the data:

  • Raw Data Edits: The raw test data was loaded and cleaned by removing missing values. In particular, rows with NA values in the mating_success column (our target variable) were dropped, ensuring the data was consistent and usable for model testing.
  • Cleaning: The test data was cleaned to produce the preprocessed test data.
  • Model Persistence: The model trained on the training data was saved so it could be reloaded for evaluation.
  • ID Removal: The raw test data was filtered to remove entries whose persistent_ids were not present in the preprocessed (cleaned) test data.
  • Feature Verification and Consistency: We verified that the features in the raw test data matched those in the preprocessed test data (dating_data_prepared), ensuring consistency across datasets.
  • Label Checks: We ensured that every entry in both the cleaned and non-cleaned test data carried a label.

The persistent_ids themselves serve only as unique identifiers and carry no predictive value, so they are dropped before prediction to avoid introducing noise into the model.
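The ID filtering and feature-alignment checks described above can be illustrated with a short sketch. This is a minimal example, assuming both test sets carry a persistent_id column; the file paths and the raw_test/clean_test names are placeholders rather than the exact ones used in the project.

import pandas as pd

# Minimal sketch of the ID filtering and alignment checks (placeholder paths/names)
raw_test = pd.read_csv("raw_test_set.csv")        # raw test data
clean_test = pd.read_csv("cleaned_test_set.csv")  # preprocessed test data

# Keep only the raw rows whose persistent_id survived cleaning
raw_test = raw_test[raw_test["persistent_id"].isin(clean_test["persistent_id"])]

# Verify that the feature columns line up between the two test sets
missing = set(clean_test.columns) - set(raw_test.columns)
assert not missing, f"Columns missing from the raw test data: {missing}"

# Confirm every remaining entry still has a label
assert raw_test["mating_success"].notna().all()
assert clean_test["mating_success"].notna().all()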

Once these preprocessing steps were completed, the model was ready to be tested on the unseen data (the test dataset). This process ensures the model performs under conditions similar to real-world applications, where it encounters new, previously unseen data.

Testing the Model Against the Test Set

Feature and Label Separation: After cleaning, we separated both the raw test data and the preprocessed test data into features (the input variables) and labels (the target variable, mating_success). This is a crucial step: the model uses only the features to make predictions, while the labels are used to evaluate the model's performance.

We run the model to make predictions, compare them to the actual labels, and calculate the RMSE.

import os

import joblib
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

# Load the trained model from the file
model_filename = "./chapter2/data2/test/dating_model.pkl"
lin_reg = joblib.load(model_filename)

print(f"Model loaded from {model_filename}")


# Define the path to the dataset
DATING_PATH = "./chapter2/datasets/dating/copies6"

# Function to load the raw data
def load_dating_data(dating_path=DATING_PATH):
    csv_path = os.path.join(dating_path, "7_classified_features_needed_label_stratified_test_set.csv_filtered.csv")
    return pd.read_csv(csv_path)

# Function to load the prepared (cleaned) data
def load_cleaned_dating_data(dating_path=DATING_PATH):
    csv_path = os.path.join(dating_path, "7_cleaned_classified_features_needed_label_stratified_test_set.csv")
    return pd.read_csv(csv_path)

# Load the raw data
dating_data = load_dating_data()
print(f"Rows after loading raw data: {dating_data.shape[0]}")

# Load the cleaned data (prepared data)
dating_data_prepared = load_cleaned_dating_data()
print(f"Rows after loading cleaned data: {dating_data_prepared.shape[0]}")

# Separate features and labels
features = dating_data.drop('mating_success', axis=1)  # Remove the target variable from features
labels = dating_data['mating_success']  # The target variable
print(f"\n\nRows after separating features and labels (raw data): {features.shape[0]}, {labels.shape[0]}")

# Separate features and labels for prepared (cleaned) data
features_prepared = dating_data_prepared.drop('mating_success', axis=1)  # Remove the target variable from features
print(f"Rows after separating features and labels (prepared data): {features_prepared.shape[0]}")

# Drop the persistent ID column(s) -- assuming the column name is 'persistent_id' or something similar
features = features.drop(columns=['persistent_id'], errors='ignore')  # 'errors=ignore' ensures no error if the column doesn't exist
features_prepared = features_prepared.drop(columns=['persistent_id'], errors='ignore')  # 'errors=ignore' ensures no error if the column doesn't exist
print(f"\n\nRows after dropping 'persistent_id' columns (raw and prepared): {features.shape[0]}, {features_prepared.shape[0]}")



# Example: Make predictions on the first 5 rows of the prepared data
some_data_prepared = features_prepared.iloc[:5]  # First 5 prepared data points
some_labels = labels.iloc[:5]  # First 5 labels

# Make predictions
predictions = lin_reg.predict(some_data_prepared)

# Output predictions and actual labels
print("Predictions:", predictions)
print("Labels:", list(some_labels))

# Calculate the Mean Squared Error and RMSE for the whole dataset
predictions_all = lin_reg.predict(features_prepared)
mse = mean_squared_error(labels, predictions_all)
rmse = np.sqrt(mse)

print("RMSE:", rmse)

Model Evaluation: Training Data vs. Test Data

Now that our trained model is ready and the test dataset has been preprocessed, let's evaluate how well the model performs on both the training and test data. We'll focus on two things: the predictions themselves and the Root Mean Squared Error (RMSE).
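For reference, the side-by-side comparison can be expressed as a small helper that computes the RMSE on any prepared dataset. This is a sketch only: the rmse_on helper and the prepared training-set path are assumptions for illustration, not part of the project's actual scripts; only the model and test-set paths come from the code above.

import joblib
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error

def rmse_on(csv_path, model, target="mating_success", id_col="persistent_id"):
    """Load a prepared dataset and return the model's RMSE on it."""
    data = pd.read_csv(csv_path)
    features = data.drop(columns=[target, id_col], errors="ignore")
    labels = data[target]
    predictions = model.predict(features)
    return np.sqrt(mean_squared_error(labels, predictions))

lin_reg = joblib.load("./chapter2/data2/test/dating_model.pkl")

# The training-set path below is a placeholder, not a real project file
train_rmse = rmse_on("prepared_training_set.csv", lin_reg)
test_rmse = rmse_on("./chapter2/datasets/dating/copies6/7_cleaned_classified_features_needed_label_stratified_test_set.csv", lin_reg)

print(f"Training RMSE: {train_rmse:.4f}")
print(f"Test RMSE:     {test_rmse:.4f}")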

Performance on the Training Data

First, let's look at the model's performance on the training data. After training the linear regression model, we tested it on the first five rows of the prepared training data (features_prepared). The output showed the following predictions and corresponding actual labels:

Predictions on Training Data:

Predictions: [0.2398392  1.08243193 2.84553505 0.5059264  0.99843327]  
Labels: [0.0, 0.0, 4.0, 0.0, 0.0]

RMSE on Training Data:

RMSE: 1.1439084247515616

As the predictions show, the model can generally produce mating-success values reasonably close to the actual labels. However, there is some error in the predictions, particularly for higher values of the mating_success label, such as 4.0, which the model predicted as 2.8455.
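One way to see where that error concentrates is to print per-row residuals for the sampled rows. The sketch below continues the script above and reuses its lin_reg, features_prepared, and labels objects; it is illustrative, not part of the original pipeline.

import numpy as np
import pandas as pd

# Per-row residuals on the five sample rows (reusing objects from the script above)
sample_predictions = lin_reg.predict(features_prepared.iloc[:5])
sample_labels = labels.iloc[:5].to_numpy()

residuals = pd.DataFrame({
    "label": sample_labels,
    "prediction": np.round(sample_predictions, 4),
    "absolute_error": np.round(np.abs(sample_predictions - sample_labels), 4),
})
print(residuals)  # the largest absolute errors tend to fall on the higher labels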

Performance on the Test Data

Next, we tested the same model on the unseen test data. This is where the model's generalization ability is truly put to the test. We evaluated the model on the first five rows of the prepared test data (features_prepared), and the results were as follows:

Predictions on Test Data:

Predictions: [0.41243245 0.56738681 1.55097905 1.20148879 0.1961678 ]  
Labels: [0.0, 0.0, 4.0, 2.0, 2.0]

RMSE on Test Data:

RMSE: 1.1107366363418567

Comparing Model Performance

Now that we have the predictions and RMSE values for both the training and test datasets, let's compare the two:

Training Data:

The model's predictions on the training data were relatively close to the actual values. The RMSE for the training data was 1.1439, which is a reasonable value given the nature of the problem (a regression task predicting continuous values).

Test Data:

The RMSE improved slightly to 1.1107 on the test data. This indicates that the model performed marginally better on the test set than on the training data, suggesting that it generalizes well to new, unseen data.

The key takeaway is that the RMSE values on the training and test datasets are close, which is a good sign. It suggests that the model is not overfitting the training data; in other words, it can generalize well to unseen data, as indicated by its performance on the test dataset.
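As a quick sanity check on how close the two scores are, the gap can be expressed as a fraction of the training RMSE using the two values reported above:

# Gap between the two reported RMSE values
train_rmse = 1.1439084247515616
test_rmse = 1.1107366363418567

gap = train_rmse - test_rmse
print(f"Absolute gap: {gap:.4f}")               # about 0.033
print(f"Relative gap: {gap / train_rmse:.1%}")  # about 2.9%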

Conclusion

In this section of the project, we explored how the trained linear regression model performed on the training and test datasets. After performing the essential preprocessing steps on the test data (cleaning, feature separation, and alignment), we evaluated the model by comparing its predictions and RMSE values on the two datasets.

While the RMSE on both datasets is similar, the slight improvement in the test data suggests that the model generalizes well to new data. This is an encouraging result, as it indicates that the model has not overfitted to the training data and can likely make accurate predictions on new, unseen examples.

Links to other parts of the project