Overview
In this exercise, you will use a Kubeflow pipeline to orchestrate the complete machine learning workflow for the Titanic dataset: data preparation and preprocessing, model training, model registration, and prediction.
Note
You will use TensorFlow to train the model and TensorBoard to visualize the model and its metrics.
Each step in the process is modularized into an individual component, which can be reused and extended. The pipeline also integrates with MLflow for tracking and managing the trained models, providing a full-fledged MLOps workflow.
The Challenge¶
The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.
In this challenge, you will build a predictive model that answers the following question:
What sorts of people were more likely to survive? You will use passenger data (i.e., name, age, gender, socio-economic class, etc.) to make this prediction.
Step-by-Step Description¶
The flowchart below describes the steps in the pipeline at a high level.
Data Preparation¶
In the `prepare_data` component, you will download the Titanic dataset (both the train and eval splits) from a public storage bucket and save it locally at the specified `data_path`.
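As an illustration, the component might look like the following minimal sketch, assuming KFP v2; the bucket URLs are placeholders for whatever public storage your lab uses:

```python
from kfp import dsl


@dsl.component(base_image="python:3.10", packages_to_install=["pandas"])
def prepare_data(data_path: str):
    import os

    import pandas as pd

    os.makedirs(data_path, exist_ok=True)

    # Placeholder URLs; substitute the public bucket used in your lab.
    train_url = "https://storage.googleapis.com/<bucket>/titanic/train.csv"
    eval_url = "https://storage.googleapis.com/<bucket>/titanic/eval.csv"

    # pandas can read CSVs directly over HTTP; persist them to the shared path.
    pd.read_csv(train_url).to_csv(os.path.join(data_path, "train.csv"), index=False)
    pd.read_csv(eval_url).to_csv(os.path.join(data_path, "eval.csv"), index=False)
```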
Data Processing¶
In this step, you will read the downloaded CSV files, perform preprocessing (such as one-hot encoding of categorical variables and normalization of numeric features), and split the data back into train and evaluation sets.
The purpose of this step is to make sure that the data is cleaned, transformed, and ready to be fed into the model for training.
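A minimal sketch of such preprocessing with pandas is shown below; the column names (`age`, `fare`, `sex`, `class`, `embark_town`) are assumptions based on a common Titanic schema and may differ in your dataset:

```python
import pandas as pd

NUMERIC = ["age", "fare"]                      # assumed numeric columns
CATEGORICAL = ["sex", "class", "embark_town"]  # assumed categorical columns


def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Fill missing numeric values, then normalize to zero mean / unit variance.
    # In a real pipeline, compute these statistics on the train split only
    # and reuse them for the eval split to avoid leakage.
    df[NUMERIC] = df[NUMERIC].fillna(df[NUMERIC].median())
    df[NUMERIC] = (df[NUMERIC] - df[NUMERIC].mean()) / df[NUMERIC].std()
    # One-hot encode the categorical variables.
    return pd.get_dummies(df, columns=CATEGORICAL)


train_df = preprocess(pd.read_csv("train.csv"))
eval_df = preprocess(pd.read_csv("eval.csv"))
```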
Train Model¶
In the `train_model` component, you will read the preprocessed training data and train a TensorFlow model using a basic architecture. You will log training metrics using callbacks so that they can be visualized in TensorBoard. The trained model is saved to the specified `data_path`.
The purpose of this step is to train a simple binary classification model to predict whether a passenger survived the Titanic disaster.
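For illustration, a basic Keras binary classifier with a TensorBoard callback could look like this; the layer sizes, file names, and epoch count are placeholders rather than the lab's exact settings:

```python
import pandas as pd
import tensorflow as tf

# Load the preprocessed training data (file name is a placeholder).
train = pd.read_csv("train_processed.csv")
X = train.drop(columns=["survived"]).astype("float32")
y = train["survived"].astype("float32")

# A basic architecture: two hidden layers and a sigmoid output for
# binary classification (survived vs. not survived).
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(X.shape[1],)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# The TensorBoard callback writes metrics to log_dir for later visualization.
tensorboard_cb = tf.keras.callbacks.TensorBoard(log_dir="logs")
model.fit(X.values, y.values, epochs=10, validation_split=0.2,
          callbacks=[tensorboard_cb])

model.save("model.keras")
```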
Register Model¶
In the `register_model` component, you will load the trained Keras model and register it in MLflow. You will log the model along with relevant parameters and register it with the model registry.
The purpose of this step is to manage the model within an experiment tracking system, enabling traceability and version control of the trained model.
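A sketch of the registration step is shown below, assuming a reachable MLflow tracking server; the tracking URI, experiment name, and registered model name are placeholders:

```python
import mlflow
import tensorflow as tf

mlflow.set_tracking_uri("http://mlflow-server:5000")  # placeholder URI
mlflow.set_experiment("titanic")

model = tf.keras.models.load_model("model.keras")

with mlflow.start_run():
    mlflow.log_param("epochs", 10)  # example parameter
    # Log the Keras model and register it in the MLflow Model Registry.
    mlflow.tensorflow.log_model(
        model,
        artifact_path="model",
        registered_model_name="titanic-model",
    )
```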
Data Prediction¶
In the `predict_on_test_data` component, you will load the trained model from MLflow, apply it to the evaluation data, and generate predictions. The results are saved as a CSV file.
The purpose of this step is to evaluate the model on unseen data and generate predictions to assess its performance.
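A minimal prediction sketch follows; the model name, file names, and use of the `models:/` URI scheme to fetch the latest registered version are assumptions:

```python
import mlflow
import pandas as pd

mlflow.set_tracking_uri("http://mlflow-server:5000")  # placeholder URI

# Load the latest registered version via the models:/ URI scheme.
model = mlflow.pyfunc.load_model("models:/titanic-model/latest")

eval_df = pd.read_csv("eval_processed.csv")  # placeholder file name
X_eval = eval_df.drop(columns=["survived"], errors="ignore").astype("float32")

# Generate predictions and persist them for downstream review.
preds = model.predict(X_eval)
pd.DataFrame({"prediction": preds.ravel()}).to_csv("predictions.csv", index=False)
```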
Data Storage & Sharing¶
A PersistentVolumeClaim (PVC) is created for shared storage between the components. Each task mounts the same PVC, so every component can read the intermediate outputs (such as the preprocessed data and the trained model) generated by previous tasks.
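In KFP v2, this wiring can be done with the kfp-kubernetes extension; the sketch below uses placeholder names, sizes, and storage class, and shows only the first task being wired up:

```python
from kfp import dsl, kubernetes


@dsl.pipeline(name="titanic-pipeline")
def titanic_pipeline(data_path: str = "/data"):
    # Create a PVC shared by all components (size/storage class are placeholders).
    pvc = kubernetes.CreatePVC(
        pvc_name_suffix="-titanic-shared",
        access_modes=["ReadWriteOnce"],
        size="1Gi",
        storage_class_name="standard",
    )

    prep = prepare_data(data_path=data_path)
    kubernetes.mount_pvc(prep, pvc_name=pvc.outputs["name"], mount_path=data_path)

    # Downstream tasks (process, train, register, predict) mount the same PVC
    # at the same path so each can read the previous task's outputs.
```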