OkCupid Dataset Analysis for Machine Learning

Artificial Intelligence

Published: Sep 30, 2024 Edited: Nov 23, 2024

Academic Machine Learning Project on Dating Data

This blog article is part 1 of a series. The whole series constitutes a university machine learning project.

Here are the other parts of the series:

Dataset Selection

Kaggle is a website that hosts a series of Datasets.

For my project, I found this dataset that I will be using:

Source: Kaggle
URL: https://www.kaggle.com/datasets/andrewmvd/okcupid-profiles
Dataset Name: OkCupid Profiles
Updated: 4 years ago
Made by: Larxel
More about the author: https://www.kaggle.com/andrewmvd
Description: Explore dating app profiles looking for love

I download the ZIP from this URL:

https://www.kaggle.com/datasets/andrewmvd/okcupid-profiles?resource=download

After I execute the code below:

./chapter2/data2/load2.py

import os
import zipfile

# Define the path to the zip file and where to extract
dating_zip_path = os.path.join("datasets", "dating", "dating.zip")
extract_path = os.path.join("datasets", "dating")

# Function to extract zip file
def extract_zip(zip_path, extract_to):
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        zip_ref.extractall(extract_to)
    print(f"Extracted to {extract_to}")

# Call the function to extract dating.zip
extract_zip(dating_zip_path, extract_path)

import os
import pandas as pd

# Define the path to the dating dataset folder
DATING_PATH = os.path.join("datasets", "dating")

# Function to load the dating CSV file
def load_dating_data(dating_path=DATING_PATH):
    csv_path = os.path.join(dating_path, "okcupid_profiles.csv")  # Adjust the filename if necessary
    return pd.read_csv(csv_path)

# Load the dating data
dating_data = load_dating_data()

# Display the first few rows of the dataset
print(dating_data.head())

The code generates the CVS in this directory:

./datasets/dating/okcupid_profiles.csv

OkCupid Dataset analysis

To better understand our data, we will run a series of commands using Python 3 and a Jupiter notebook.

Below, we can see the first 3 rows of the data and the column names.

Screenshot of an OKCupid dataset displaying user profiles with attributes such as age, status, body type, education, height, income, and various essays filled by users

Below are the columns with the number of valid entries for each. We can also see the data type that each column contains:

Pandas dataframe information showing columns and non-null counts for OKCupid dataset

Note that 'status,' for example, is a categorical attribute:

We execute the code below to count the total number of entries that exist per category in the 'status' column:

Status column value counts for OKCupid dataset showing distribution of relationship statuses

To retrieve statistics about columns that are of numerical type, we can execute the following:

Descriptive statistics for age, height, and income in OKCupid dataset

So, the units for the 3 columns of numerical type are:

Age: years
Height: inches
Income: US dollars

Now we plot the histograms of the numeric columns (age, height and income (features)):

On the notebook run

%matplotlib inline
import matplotlib.pyplot as plt
dating_data.hist(bins=50, figsize=(20,15))
plt.show()

Histogram plot of age distribution in the OKCupid dataset

Histogram plot of height distribution in the OKCupid dataset

Histogram plot of income distribution in the OKCupid dataset

Project proposal

In this project, I am building a machine learning model to predict a combined label of relationship status and whether an individual has offspring. The goal is to understand how various demographic and physical characteristics contribute to relationship and parental status, specifically for males. The current features used for prediction, including ethnicity, body type, and height, are key personal attributes that may influence these outcomes, making them crucial to the model's accuracy and relevance to our study.

Additional features, such as drinking habits, drug use, smoking status, religion, and languages spoken, could be incorporated to further enhance the model. By extending the feature set, the model could gain more insight into lifestyle and cultural factors that affect relationship dynamics and family structure. This would provide a more comprehensive understanding of the interactions between physical, demographic, and behavioural traits in predicting life choices related to family and relationships.

Below the colour legend shows how the data will be dealt with:

Colored summary table for age, sex, orientation, drinks, and other attributes for OKCupid profiles

Green: will be separated based on category
Blue: labels
Red: important features
Orange: features of medium importance
Yellow: features of low importance

Information about other features

essay0 - My self summary
essay1 - What I'm doing with my life
essay2 - I'm really good at
essay3 - The first thing people usually notice about me
essay4 - Favorite books, movies, shows, music, and food
essay5 - The six things I could never do without
essay6 - I spend a lot of time thinking about
essay7 - On a typical Friday night, I am
essay8 - The most private thing I am willing to admit
essay9 - You should message me if…