Exploring OkCupid Dataset with Machine Learning: Predicting Mating Success for Straight Males


From Data Preparation to Feature Engineering for Accurate Predictions

Artificial Intelligence
Published: Oct 17, 2024 Edited: Nov 23, 2024

Re-adjusting the project proposal

Following the previous blog article, OkCupid Dataset Analysis for Machine Learning, we have decided to revise our choice of features.

This project aims to create a predictive model by focusing exclusively on straight male entries from the dataset.

The target label will combine the "offspring" and "status" columns, which together serve as a proxy for mating success for males.

For the model's primary features, we will focus on "body_type," "ethnicity," and "height" as they may provide significant insights.

Additionally, secondary features such as "drinks," "drugs," "smokes," and "religion" will be incorporated to capture lifestyle and personal habits.

Finally, "income" may also be included as a tertiary feature to explore its potential influence, though it will have a lower priority in the analysis.

The aim is to carefully prepare and optimize this feature set to train a robust predictive model.

A screenshot of a sample dataset from the OkCupid dataset showing various features like body type, diet, drinks, drugs, age, status, sex, orientation, education, ethnicity, height, income, pets, religion, location, offspring, and multiple essay sections.
  • Blue: labels
  • Red: primary features
  • Orange: secondary features
  • Yellow: tertiary features
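
One way to keep the label and the feature tiers explicit in code is a small mapping from tier to column names. The sketch below is illustrative only: the FEATURE_TIERS name, the select_columns helper, and the two sample rows are assumptions and not part of the project code; only the column names come from the proposal above.

```python
import pandas as pd

# Column groups taken from the proposal; the dict name itself is hypothetical.
FEATURE_TIERS = {
    "label": ["offspring", "status"],
    "primary": ["body_type", "ethnicity", "height"],
    "secondary": ["drinks", "drugs", "smokes", "religion"],
    "tertiary": ["income"],
}

def select_columns(data, tiers):
    """Return only the columns belonging to the requested tiers, in tier order."""
    wanted = [col for tier in tiers for col in FEATURE_TIERS[tier]]
    return data[wanted]

# Two made-up rows with placeholder values, only to show the call.
sample = pd.DataFrame({
    "body_type": ["fit", "average"], "ethnicity": ["a", "b"], "height": [70.0, 68.0],
    "drinks": ["socially", "rarely"], "drugs": ["never", "never"],
    "smokes": ["no", "no"], "religion": ["x", "y"], "income": [-1, 50000],
    "offspring": ["p", "q"], "status": ["single", "single"], "age": [30, 41],
})

subset = select_columns(sample, ["label", "primary"])
print(list(subset.columns))
```

Keeping the tiers in one place makes it easy to train with and without the lower-priority columns later on.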

Adjusting the data

Keep only sex=male, and orientation=straight entries in the dataset

The OkCupid dataset contains entries with different genders and orientations. Since our project focuses exclusively on mating success for straight males, the first step is to explore the values present in the sex and orientation columns and keep only the entries labelled as male ("m") and straight.

We will then store the filtered data in a new DataFrame and save it as a CSV file for further analysis.

The following Python code demonstrates how to inspect the available values in the sex and orientation columns and filter the data to keep only entries where the sex is "m" (male) and the orientation is "straight."

./chapter2/data2/data_manipulations2/step1.py

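Inspecting the two columns in an interactive session, with dating_data already loaded as in the script further down, gives:

dating_data["sex"].value_counts()
sex
m    35829
f    24117
Name: count, dtype: int64

dating_data["orientation"].value_counts()
orientation
straight    51606
gay          5573
bisexual     2767
Name: count, dtype: int64

The filtering script itself:
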
import os
import pandas as pd

# Define the path to the dating dataset folder
DATING_PATH = os.path.join("datasets", "dating")

# Function to load the dating CSV file
def load_dating_data(dating_path=DATING_PATH):
    csv_path = os.path.join(dating_path, "okcupid_profiles.csv")  # Adjust the filename if necessary
    return pd.read_csv(csv_path)

# Load the dating data
dating_data = load_dating_data()

# Filter the dataset to keep only male entries
male_data = dating_data[dating_data['sex'] == 'm']

# Further filter the dataset to keep only entries with orientation 'straight'
male_straight_data = male_data[male_data['orientation'] == 'straight']

# Check the shape of the new filtered DataFrame to confirm the number of entries
print(f"Number of male and straight entries: {male_straight_data.shape[0]}")

# Save the filtered DataFrame to a new CSV file
output_path = "/Users/Zouhir/Documents/UQO_AUT2024/INF6333_code/book_aurelien_geron/chapter2/datasets/dating/step1_male_straight_data.csv"
male_straight_data.to_csv(output_path, index=False)

print(f"Filtered male and straight entries saved to: {output_path}")


Number of male and straight entries: 31073
Filtered male and straight entries saved to: /Users/Zouhir/Documents/UQO_AUT2024/INF6333_code/book_aurelien_geron/chapter2/datasets/dating/step1_male_straight_data.csv
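
The two filtering steps can also be written as a single boolean mask. This is a minimal sketch on a tiny made-up DataFrame (the rows are illustrative, not taken from the dataset); the column names and the values "m" and "straight" are the ones shown in the output above.

```python
import pandas as pd

# Four made-up rows, only to demonstrate the mask.
toy = pd.DataFrame({
    "sex": ["m", "f", "m", "m"],
    "orientation": ["straight", "straight", "gay", "straight"],
})

# Combine both conditions with &; each condition needs its own parentheses.
mask = (toy["sex"] == "m") & (toy["orientation"] == "straight")

# .copy() gives an independent DataFrame, which avoids pandas'
# SettingWithCopyWarning if columns are added to the result later.
male_straight = toy.loc[mask].copy()

print(len(male_straight))
```

A single mask avoids the intermediate male_data DataFrame, but the result is the same as filtering in two steps.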

Create a Test Set

When working with machine learning, we typically divide the dataset into train and test sets:

  • Train Set: Used to teach the model patterns in the data.
  • Test Set: Used to evaluate the model on unseen data to estimate how well it generalizes.

However, datasets evolve. New entries might be added (e.g., more data collected), or some rows might be removed (e.g., duplicates or incorrect entries). If you don't have a stable way to split the data, your train and test sets could change between runs, leading to inconsistent results.

If your split is based on persistent IDs, deleting entries from the training set will not change the composition of the test set. The remaining test entries will stay the same as they were, ensuring no data leakage or unintended reassignment.

If new entries are added to the dataset (with their persistent IDs), they will be assigned to either the train or test set based on the same split logic. This ensures that the test set remains consistent, with only new data getting appropriately assigned.

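This works because each row's assignment depends only on the hash of its own persistent ID, not on its position in the file or on which other rows are present. That is what keeps the split consistent across dataset modifications.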
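
The stability of an ID-based split can be checked on a small example. The sketch below is an illustration under assumptions: the "profile-N" IDs and the helper names are hypothetical, and it uses plain Python sets instead of a DataFrame. It applies the same CRC32 rule used for the split later in this article.

```python
from zlib import crc32

def in_test_set(identifier, test_ratio=0.2):
    """True if the ID's CRC32 hash falls in the lowest `test_ratio` share of the 32-bit range."""
    return crc32(identifier.encode()) & 0xffffffff < test_ratio * 2**32

def ids_in_test(ids, test_ratio=0.2):
    """Return the subset of IDs assigned to the test set."""
    return {i for i in ids if in_test_set(i, test_ratio)}

# Hypothetical IDs, used only for illustration.
original = [f"profile-{n}" for n in range(1000)]
before = ids_in_test(original)

# Simulate dataset changes: drop the first 100 rows and append 200 new ones.
modified = original[100:] + [f"profile-{n}" for n in range(1000, 1200)]
after = ids_in_test(modified)

# Every surviving original ID keeps the assignment it had before.
survivors = set(original[100:])
print((before & survivors) == (after & survivors))
```

A split based on shuffled row positions would not have this property, because removing or adding rows changes every position after the edit.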

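Let's generate persistent IDs for our dataset. Note that uuid.uuid4() produces random values, so the IDs are persistent only as long as the saved file is reused; regenerating them would assign new IDs and change the split.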

./chapter2/data2/data_manipulations2/step2.py

import os
import pandas as pd
import uuid  # To generate unique persistent IDs

# Define the path to the dataset folder
DATING_PATH = os.path.join("datasets", "dating")

# Function to load the dataset (male-only, straight)
def load_dating_data(dating_path=DATING_PATH):
    csv_path = os.path.join(dating_path, "step1_male_straight_data.csv")  # Ensure correct file extension
    return pd.read_csv(csv_path)

# Load the male-straight-only dataset
male_data = load_dating_data()

# Create a persistent ID for each row if it doesn't already exist
if 'persistent_id' not in male_data.columns:
    male_data['persistent_id'] = [str(uuid.uuid4()) for _ in range(len(male_data))]
    print("Persistent IDs generated and assigned to all entries.")

# Save the updated dataset with the new persistent IDs
output_path = os.path.join(DATING_PATH, "male_straight_data_with_persistent_id.csv")
male_data.to_csv(output_path, index=False)

print(f"Dataset with persistent IDs saved to: {output_path}")

Persistent IDs generated and assigned to all entries.
Dataset with persistent IDs saved to: datasets/dating/male_straight_data_with_persistent_id.csv

Now that each entry has its own unique persistent ID, we can run the code below to split the data into 80% training data and 20% testing data:

./chapter2/data2/train_test_data/script3.py

from zlib import crc32  # Import CRC32 hash function for consistency
import numpy as np
import pandas as pd
import os

# Define the path to the dataset folder
DATING_PATH = os.path.join("datasets", "dating")

# Function to load the dataset (male-only, straight, with persistent IDs)
def load_dating_data(dating_path=DATING_PATH):
    csv_path = os.path.join(dating_path, "male_straight_data_with_persistent_id.csv")
    return pd.read_csv(csv_path)

# Load the male-only data
male_straight_data_with_persistent_id = load_dating_data()

# Function to check if an identifier belongs in the test set based on CRC32 hash
def test_set_check(identifier, test_ratio):
    # Calculate hash value and assign based on the test ratio
    return crc32(identifier.encode()) & 0xffffffff < test_ratio * 2**32

# Function to split the dataset using the 'persistent_id' column
def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column].astype(str)  # Convert IDs to string if necessary
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))  # Apply hash check
    return data.loc[~in_test_set], data.loc[in_test_set]  # Return train and test sets

# Split the male-only data into 80% train and 20% test sets using 'persistent_id'
train_set, test_set = split_train_test_by_id(male_straight_data_with_persistent_id, 0.2, "persistent_id")

# Display the sizes of the train and test sets
print(f"Training set size: {train_set.shape[0]}")
print(f"Test set size: {test_set.shape[0]}")

# Save the train and test sets as CSV files
train_set.to_csv(os.path.join(DATING_PATH, "male_straight_data_with_persistent_id_train_set.csv"), index=False)
test_set.to_csv(os.path.join(DATING_PATH, "male_straight_data_with_persistent_id_test_set.csv"), index=False)

print("Train and test sets have been successfully saved.")
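
A quick sanity check on a hash-based split is to confirm that the two parts do not overlap and that the test share is close to the requested ratio. The sketch below assumes seeded stand-in UUIDs rather than the real persistent IDs, so the numbers it prints are not the project's actual set sizes.

```python
import random
import uuid
from zlib import crc32

def in_test_set(identifier, test_ratio):
    """Same rule as test_set_check: compare the CRC32 hash with the threshold."""
    return crc32(identifier.encode()) & 0xffffffff < test_ratio * 2**32

# Deterministic stand-in IDs (seeded), so the check is repeatable.
rng = random.Random(42)
ids = [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(5000)]

test_part = [i for i in ids if in_test_set(i, 0.2)]
train_part = [i for i in ids if not in_test_set(i, 0.2)]

# The two parts must not overlap and must cover every ID.
assert not set(train_part) & set(test_part)
assert len(train_part) + len(test_part) == len(ids)

# The test share is only approximately 20%, since it depends on the hashes.
share = len(test_part) / len(ids)
print(f"test share: {share:.3f}")
```

Because assignment depends on hash values, the split is close to 80/20 but rarely exact.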

Conclusion

Through this deep dive into the OkCupid dataset, we've laid the groundwork for predicting mating success for straight males by carefully refining our feature set and using persistent IDs for stable train-test splits. We've explored essential concepts, from filtering relevant data to calculating CRC32 hashes, and applied these methods to ensure consistent results across multiple runs.

Our next steps will focus on applying machine learning models to uncover patterns between body type, ethnicity, height, and personal habits, aiming to predict which factors play the most significant role in dating success.


Appendix

Explanation of the splitting

Let's use an example UUID:

d8ea2885-a97e-4400-bc52-cf093c343e10

This UUID is written in hexadecimal (base-16). Each pair of hexadecimal characters represents one byte (eight bits), so the 32 hexadecimal digits of a UUID encode 16 bytes. Let's convert these values.

Hexadecimal to Decimal Conversion

Hexadecimal values use base-16, meaning each digit can be from 0 to F (where A = 10, B = 11, ..., F = 15).

d in hex = 13 (decimal)
8 in hex = 8 (decimal)

Expand Using Positional Values:

Each position in hexadecimal corresponds to a power of 16.

\[ d8_{16} = (13 \times 16^1) + (8 \times 16^0) \]

\[ d8_{16} = (13 \times 16) + (8 \times 1) = 208 + 8 = 216 \]

Thus, d8 in hex equals 216 in decimal.
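
The positional expansion can be written as a short function. This is a minimal sketch; the hex_to_decimal helper is a hypothetical name used for illustration, and Python's built-in int(text, 16) does the same job.

```python
def hex_to_decimal(hex_string):
    """Convert a hexadecimal string to an integer by summing digit * 16**position."""
    digits = "0123456789abcdef"
    total = 0
    # Walk the digits from right to left, so position 0 is the units digit.
    for position, char in enumerate(reversed(hex_string.lower())):
        total += digits.index(char) * 16**position
    return total

print(hex_to_decimal("d8"))  # 216
print(int("d8", 16))         # 216, using the built-in parser
```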

Converting the Entire UUID to Bytes

Let's now convert all parts of the UUID into decimal values and byte (binary) values.

A table showing the conversion of hexadecimal pairs to decimal, binary, and byte string equivalents.
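
The conversion table can be regenerated for the example UUID with the standard uuid module. This is a sketch for illustration; the column layout is an assumption and may not match the original table exactly.

```python
import uuid

example = uuid.UUID("d8ea2885-a97e-4400-bc52-cf093c343e10")

# Iterating over a bytes object yields integers from 0 to 255.
rows = [(f"{v:02x}", v, f"{v:08b}", bytes([v])) for v in example.bytes]

print(f"{'hex':<5}{'decimal':<9}{'binary':<10}byte")
for hex_pair, decimal, binary, byte in rows:
    print(f"{hex_pair:<5}{decimal:<9}{binary:<10}{byte!r}")
```

Bytes that correspond to printable ASCII characters, such as 0x28, are displayed as the character itself, here "(", which is why the byte string mixes \x escapes with ordinary characters.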

Resulting Byte String

Once converted, the UUID becomes the following 16-byte string:

b'\xd8\xea(\x85\xa9~D\x00\xbcR\xcf\t<4>\x10'

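The CRC32 hashing function operates on bytes, so a byte string like this one is the kind of input it needs. Note that test_set_check in the split script calls identifier.encode(), which hashes the 36 ASCII characters of the UUID text rather than these 16 raw bytes. The resulting hash value differs, but the comparison against the threshold works the same way.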
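
To see how the 16 raw bytes differ from the 36-byte text encoding that identifier.encode() produces in the split script, both can be built from the example UUID. This is a sketch for illustration; it prints the two CRC32 values without asserting what they are.

```python
import uuid
from zlib import crc32

text = "d8ea2885-a97e-4400-bc52-cf093c343e10"

raw_bytes = uuid.UUID(text).bytes  # the 16 raw bytes of the UUID
text_bytes = text.encode()         # the 36 ASCII characters, hyphens included

print(len(raw_bytes), len(text_bytes))
print(raw_bytes)

# Different inputs, so the two CRC32 values will generally differ.
print(crc32(raw_bytes), crc32(text_bytes))
```

Either representation gives a stable split, as long as the same one is used every time the split is computed.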

CRC32 Hashing Process

Now, let's compute the CRC32 hash of the byte string step-by-step.

What is CRC32?

CRC32 is a 32-bit Cyclic Redundancy Check: a checksum algorithm based on polynomial division that produces a 32-bit integer hash from input data (in our case, the UUID byte string).

The output is a 32-bit hash, ranging from 0 to 2^32 - 1 (4,294,967,295).

from zlib import crc32

# Compute CRC32 hash of the byte string
byte_string = b'\xd8\xea(\x85\xa9~D\x00\xbcR\xcf\t<4>\x10'
hash_value = crc32(byte_string)
print(hash_value)  # Example output: 2441647152

Ensure 32-bit Unsigned Integer with & 0xffffffff

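In Python 3, zlib.crc32 always returns an unsigned integer between 0 and 2^32 - 1. In Python 2, it could return a signed 32-bit integer (which can be negative), so a mask was needed to obtain the unsigned value. Keeping the mask is harmless and makes the code behave the same way on both versions.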

Bitwise AND Operation:

unsigned_hash = hash_value & 0xffffffff

0xffffffff is the largest 32-bit unsigned integer:

0xffffffff = 4,294,967,295 (decimal)

The & (bitwise AND) operation keeps only the lower 32 bits of the hash, ensuring it's treated as unsigned.

In our example:

2441647152 & 0xffffffff = 2441647152

Since the hash is already positive, this operation doesn't change its value.
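
The effect of the mask can be checked directly. In this sketch, the input b"example" is arbitrary, and the negative number is the signed 32-bit reading of the example hash, chosen here for illustration.

```python
from zlib import crc32

hash_value = crc32(b"example")

# In Python 3 the result is already in [0, 2**32 - 1], so the mask is a no-op.
print(hash_value == hash_value & 0xffffffff)

# For a negative signed 32-bit value, the mask yields the unsigned equivalent.
signed = -1853320144
unsigned = signed & 0xffffffff
print(unsigned)  # 2441647152, i.e. signed + 2**32
```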

Compare the Hash with the Test Ratio Threshold

Let's assume the test ratio is 0.2 (20%).

Calculate the Threshold:

\[ \text{Threshold} = 0.2 \times 2^{32} = 0.2 \times 4,294,967,296 = 858,993,459.2 \]

We compare the hash value with the threshold:

2441647152 < 858993459 # False

Since 2441647152 is greater than the threshold, this entry does not belong to the test set. It will be assigned to the training set.
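
The threshold comparison can be wrapped in a small function. This is a minimal sketch; belongs_to_test_set is a hypothetical name, and the second hash value is an arbitrary number chosen to fall below the threshold.

```python
def belongs_to_test_set(hash_value, test_ratio):
    """Compare an unsigned 32-bit hash against the test-ratio threshold."""
    threshold = test_ratio * 2**32
    return hash_value < threshold

print(belongs_to_test_set(2441647152, 0.2))   # False: assigned to the training set
print(belongs_to_test_set(500_000_000, 0.2))  # True: assigned to the test set
```

Since hash values are spread roughly evenly over the 32-bit range, about 20% of them fall below the threshold, which is what yields the approximate 20% test share.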