Titanic Dataset

The Titanic dataset, widely used in machine learning, contains information about the passengers aboard the RMS Titanic, which sank in 1912. It is most often used for binary classification tasks, where the goal is to predict whether a passenger survived the disaster based on features such as age, sex, and passenger class. It also serves as a benchmark dataset for testing and comparing the performance of different models.

Note

Learn more about this dataset on Kaggle.


Key Features

The table below describes the key features of the Titanic dataset.

Feature       Description
PassengerId   A unique identifier for each passenger.
Survived      Survival status (0 = No, 1 = Yes), which is the target variable.
Pclass        Passenger class (1 = 1st, 2 = 2nd, 3 = 3rd).
Name          Name of the passenger.
Sex           Gender of the passenger.
Age           Age of the passenger in years.
SibSp         Number of siblings/spouses aboard the Titanic.
Parch         Number of parents/children aboard the Titanic.
Ticket        Ticket number.
Fare          Passenger fare.
Cabin         Cabin number.
Embarked      Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
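
As a rough sketch of how these columns look once loaded, the snippet below reads a Titanic-style CSV with pandas. The three rows are made up for illustration (they are not real passenger records), and `load_titanic` is a hypothetical helper name. In practice you would pass the path of the CSV file downloaded from Kaggle instead of the in-memory sample.

```python
import io

import pandas as pd

# Made-up rows in the column layout described above (NOT real passenger records).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Doe, Mr. John",male,22,1,0,T-0001,7.25,,S
2,1,1,"Roe, Mrs. Jane",female,38,1,0,T-0002,71.28,C85,C
3,1,3,"Poe, Miss. Ann",female,,0,0,T-0003,7.92,,Q
"""


def load_titanic(source):
    """Read a Titanic-style CSV from a path or file-like object.

    Hypothetical helper: PassengerId is used as the index because it is
    an identifier rather than a predictive feature.
    """
    return pd.read_csv(source, index_col="PassengerId")


df = load_titanic(io.StringIO(SAMPLE_CSV))

# Inspect the column types and the number of missing values per column.
print(df.dtypes)
print(df.isna().sum())
```

Empty CSV fields are read as missing values, so the per-column counts printed at the end give a quick view of which features will need cleaning.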

This dataset is popular for the following reasons:

Small Size

The Titanic dataset is relatively small and easy to understand, making it ideal for those new to machine learning.

Data Cleaning Practice

It includes missing values (for example, in the Age and Cabin columns) and therefore requires preprocessing, which makes it a good exercise in the data cleaning that real-world datasets demand.
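
One possible cleaning pass is sketched below, assuming the data is held in a pandas DataFrame. The strategy (median age, most frequent port, and a cabin-known flag) is only one common choice, not a prescribed method, and `clean_titanic`, `HasCabin`, and the toy rows are illustrative names and values rather than part of the dataset.

```python
import pandas as pd


def clean_titanic(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with simple imputations applied (illustrative strategy)."""
    out = df.copy()
    # Replace missing ages with the median of the known ages.
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Replace missing ports with the most frequent port.
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode().iloc[0])
    # Cabin is mostly missing, so keep only a flag for whether it was recorded.
    out["HasCabin"] = out["Cabin"].notna().astype(int)
    return out.drop(columns=["Cabin"])


# Made-up rows for demonstration only.
toy = pd.DataFrame(
    {
        "Age": [22.0, None, 38.0, 30.0],
        "Embarked": ["S", "S", None, "C"],
        "Cabin": [None, "C85", None, None],
    }
)

cleaned = clean_titanic(toy)
print(cleaned)
```

When the cleaned data feeds a model, the median and mode should be computed on the training split only and then reused on the test split, so that no information leaks from the test data.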

Feature Engineering

The dataset offers opportunities to create new features or transform existing ones to improve model performance.
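
A few derived features that are often tried on this dataset are sketched below. The column names `FamilySize`, `IsAlone`, and `Title` are illustrative choices, the rows are made up, and the title extraction assumes names follow the "Surname, Title. Given names" pattern.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with three derived columns (illustrative names)."""
    out = df.copy()
    # Family size counts siblings/spouses, parents/children, and the passenger.
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    # Flag passengers travelling without any family members.
    out["IsAlone"] = (out["FamilySize"] == 1).astype(int)
    # Capture the text between the comma and the first period, e.g. "Mr".
    out["Title"] = (
        out["Name"].str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
    )
    return out


# Made-up rows for demonstration only.
toy = pd.DataFrame(
    {
        "Name": ["Doe, Mr. John", "Roe, Mrs. Jane", "Poe, Miss. Ann"],
        "SibSp": [1, 1, 0],
        "Parch": [0, 2, 0],
    }
)

featured = add_features(toy)
print(featured[["FamilySize", "IsAlone", "Title"]])
```

Whether any of these features actually improves a model is an empirical question, so each one is best checked with cross-validation before it is kept.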

Classification Tasks

It is suitable for applying and comparing various classification algorithms like logistic regression, decision trees, and support vector machines.
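
The comparison could be set up roughly as follows with scikit-learn, using cross-validated accuracy. The helper name `compare_classifiers`, the hyperparameters, and the synthetic stand-in data are all assumptions made for this sketch. With the real dataset, `X` would be the cleaned numeric feature matrix and `y` the Survived column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def compare_classifiers(X, y, cv=5):
    """Return the mean cross-validated accuracy for three classifiers."""
    models = {
        # Scaling helps logistic regression and SVMs; trees do not need it.
        "logistic_regression": make_pipeline(
            StandardScaler(), LogisticRegression(max_iter=1000)
        ),
        "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
        "svm": make_pipeline(StandardScaler(), SVC()),
    }
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in models.items()
    }


# Synthetic stand-in data (NOT the Titanic dataset), used only to run the sketch.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] > 0).astype(int)

scores = compare_classifiers(X_demo, y_demo)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
```

Placing the scaler inside each pipeline means it is refit on every training fold, which keeps the validation folds out of the preprocessing step.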

Real-World Dataset

The dataset has a mild class imbalance (fewer passengers survived than died), which is a common issue in real-world datasets.
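
A quick way to check the class balance is to look at the share of each label in the target column. The sketch below assumes the labels are held in a pandas Series. `class_distribution` is a hypothetical helper, and the five labels are made up, so the proportions shown are not those of the real dataset.

```python
import pandas as pd


def class_distribution(labels: pd.Series) -> pd.Series:
    """Return the fraction of rows belonging to each class, ordered by label."""
    return labels.value_counts(normalize=True).sort_index()


# Made-up labels for demonstration only (0 = did not survive, 1 = survived).
toy_labels = pd.Series([0, 0, 0, 1, 1], name="Survived")

dist = class_distribution(toy_labels)
print(dist)
```

When the classes are uneven, passing `stratify=y` to scikit-learn's `train_test_split` keeps the same class ratio in the training and test splits.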