Overview

This section walks through the end-to-end machine learning operations (MLOps) workflow. The overall MLOps workflow can be split into two major phases:

Training
Inference

Important

Users who are new to the concepts in Kubeflow can use the Getting Started guides that are developed and maintained by the Rafay team. Please contact us if you would like us to add guides for other use cases.

Training

Data Exploration

Integrated notebooks are just one click away for data scientists. The notebooks come with turnkey support for a large list of popular frameworks and libraries.

Data Preparation

For machine learning (ML) algorithms to be effective, the traditional ETL (Extract, Transform, Load) method can be applied to raw data to ensure that the quality of the data is suitable for the models.
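As a rough illustration of the ETL steps described above, the following sketch uses only the Python standard library. The column names (`age`, `income`, `label`), the sample rows, and the rule of dropping invalid rows are all assumptions made for this example; they are not part of the platform.

```python
import csv
import io

# Hypothetical raw data; in practice this would come from a file or database.
RAW_CSV = """age,income,label
34,52000,1
,61000,0
45,not_available,1
29,48000,0
"""

def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: cast each field to a numeric type, dropping rows that fail."""
    clean = []
    for row in rows:
        try:
            clean.append({
                "age": int(row["age"]),
                "income": float(row["income"]),
                "label": int(row["label"]),
            })
        except ValueError:
            # Skip rows with missing or non-numeric values instead of guessing.
            continue
    return clean

def load(rows):
    """Load: serialize the cleaned rows to CSV text for downstream steps."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=["age", "income", "label"], lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()

cleaned = transform(extract(RAW_CSV))
cleaned_csv = load(cleaned)
```

Dropping invalid rows is only one possible policy; imputing missing values or routing bad rows to a review queue may suit a given dataset better.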

Feature Engineering

Feature engineering allows you to transform raw data into features that can be used for ML model development. Users can leverage Feast as an integrated Feature Store in the platform.

Model Training

Once you develop a model, you can use the integrated training operators to train it. The available training operators are:

TFJob (TensorFlow)
PyTorchJob (PyTorch)
MXJob (Apache MXNet)
XGBoostJob (XGBoost)
MPIJob (MPI)

By employing these operators, you can effectively manage the model training process, monitor progress, and perform experiments to identify the best algorithm for your use case.
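To make the operator model more concrete, the sketch below builds a PyTorchJob manifest as a Python dictionary and serializes it to JSON. The field names follow the `kubeflow.org/v1` PyTorchJob schema as commonly documented for the Kubeflow Training Operator; verify them against the operator version installed in your cluster. The job name, image, and worker count are hypothetical placeholders.

```python
import json

def pytorch_job(name, image, workers, namespace="default"):
    """Build a PyTorchJob manifest with one master and `workers` workers."""

    def replica(count):
        # Each replica type shares the same (hypothetical) training image.
        return {
            "replicas": count,
            "restartPolicy": "OnFailure",
            "template": {
                "spec": {"containers": [{"name": "pytorch", "image": image}]}
            },
        }

    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "PyTorchJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "pytorchReplicaSpecs": {
                "Master": replica(1),
                "Worker": replica(workers),
            }
        },
    }

# Placeholder values for illustration only.
manifest = pytorch_job(
    "mnist-train", "registry.example.com/mnist-train:latest", workers=2
)
manifest_json = json.dumps(manifest, indent=2)
```

The resulting JSON could be saved to a file and submitted with `kubectl apply -f`, assuming the training operator is installed in the target cluster.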

Model Tuning

Hyperparameters are the variables that control the model training process. Examples of hyperparameters include:

The learning rate in a neural network
The number of layers and nodes in a neural network
Regularization
Type of loss function

Hyperparameter tuning is the process of optimizing the hyperparameter values to maximize model metrics, such as accuracy, during the validation phase.

The platform provides a turnkey integration with Katib to automate the hyperparameter tuning process. Katib optimizes the target metric that you specify in the configuration. It offers search algorithms such as random search, grid search, and Bayesian optimization to evaluate candidate hyperparameters and find the best-performing set for the given model.
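The idea behind random search can be sketched in a few lines of plain Python. This is not Katib's implementation or API; the search space, the `objective` function (a stand-in for a real training and validation run), and the trial count are all assumptions chosen for illustration.

```python
import random

# Hypothetical search space: one continuous and one discrete hyperparameter.
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-1),  # continuous range (low, high)
    "num_layers": [1, 2, 3, 4],     # discrete choices
}

def objective(params):
    """Stand-in for validation accuracy; a real trial would train a model."""
    lr_penalty = abs(params["learning_rate"] - 0.01) * 5
    layer_penalty = abs(params["num_layers"] - 3) * 0.05
    return 0.95 - lr_penalty - layer_penalty

def sample(rng):
    """Draw one random hyperparameter combination from the search space."""
    low, high = SEARCH_SPACE["learning_rate"]
    return {
        "learning_rate": rng.uniform(low, high),
        "num_layers": rng.choice(SEARCH_SPACE["num_layers"]),
    }

def random_search(trials, seed=0):
    """Run `trials` random trials and keep the best-scoring combination."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(trials):
        params = sample(rng)
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

best_params, best_score = random_search(trials=50)
```

Grid search would replace `sample` with an exhaustive walk over a fixed grid, while Bayesian optimization would use the scores of earlier trials to decide where to sample next.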

Model Validation

Users can leverage Experiments and Runs to compare the metrics of a model. For instance, the same model may be trained on different datasets, or two models with different hyperparameters may be trained on the same dataset. Users can automate these processes to report whether a model runs smoothly or encounters problems.
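The kind of comparison described above can be sketched as follows. The run records, metric values, and the 0.90 accuracy threshold are made up for illustration and do not come from a real experiment; in practice the metrics would be read from the recorded Runs.

```python
# Illustrative, made-up metrics for three hypothetical runs.
RUNS = [
    {"run": "run-a", "dataset": "v1", "learning_rate": 0.01, "accuracy": 0.91},
    {"run": "run-b", "dataset": "v1", "learning_rate": 0.10, "accuracy": 0.87},
    {"run": "run-c", "dataset": "v2", "learning_rate": 0.01, "accuracy": 0.93},
]

def best_run(runs, metric="accuracy"):
    """Return the run with the highest value for the given metric."""
    return max(runs, key=lambda run: run[metric])

def failing_runs(runs, metric="accuracy", threshold=0.90):
    """Return the names of runs whose metric falls below the threshold."""
    return [run["run"] for run in runs if run[metric] < threshold]

def format_report(runs, metric="accuracy"):
    """Return one summary line per run, ranked from best to worst."""
    ranked = sorted(runs, key=lambda run: run[metric], reverse=True)
    return [
        f'{run["run"]}: {metric}={run[metric]:.2f} (dataset {run["dataset"]})'
        for run in ranked
    ]

report = format_report(RUNS)
```

A check such as `failing_runs` could be run automatically after each training job to flag models that do not meet the required metrics.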

Model Registry

The MLflow-based integrated model registry holds model-specific data (classes) or weights. Its purpose is to hold trained models for fast retrieval by other applications. Without the model registry, the model classes and weights would have to be saved in the source code repository, where they are hard to retrieve and process.
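The core idea of a registry, looking up a trained model by name and version rather than digging through a source repository, can be sketched with a small in-memory class. This is a toy illustration and not the MLflow API; the class, the model name, and the artifact URIs are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class ModelRegistry:
    """Toy in-memory registry mapping a model name to its versioned artifacts."""

    _models: dict = field(default_factory=dict)

    def register(self, name, artifact_uri):
        """Store a new version of the model and return its version number."""
        versions = self._models.setdefault(name, [])
        versions.append(artifact_uri)
        return len(versions)  # versions are 1-based

    def latest(self, name):
        """Return (version, artifact_uri) for the newest registered version."""
        versions = self._models[name]
        return len(versions), versions[-1]

    def get(self, name, version):
        """Return the artifact URI for a specific 1-based version."""
        return self._models[name][version - 1]

# Hypothetical model name and artifact locations.
registry = ModelRegistry()
registry.register("churn-classifier", "s3://example-bucket/churn/1/model.pt")
registry.register("churn-classifier", "s3://example-bucket/churn/2/model.pt")
version, uri = registry.latest("churn-classifier")
```

A real registry additionally persists this mapping and stores metadata alongside each version, but the lookup pattern is the same.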

Model Inference (Model Serving)

Once a model that meets the required metrics has been selected during validation, users may wish to deploy it to production. The model can then behave as a service that handles prediction requests from the application. A seamless integration with KServe allows users to easily deploy the model using Seldon Core, TFServe and KFServe.

A model-as-data approach is recommended. This provides the means to swap between model frameworks as seamlessly as possible. For example, users can train the model using PyTorch or TensorFlow. When the model is served, the underlying serving remains consistent with the user's APIs. Customers can also use specialized hardware (e.g. GPUs) for serving the model to deliver better performance.
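The consistency of the serving API can be illustrated with a prediction request. The sketch below assumes the served model exposes the KServe V1 inference protocol (`POST /v1/models/<name>:predict` with an `instances` payload); the host name, model name, and input values are placeholders, and the request is built but not sent.

```python
import json
import urllib.request

def build_predict_request(base_url, model_name, instances):
    """Build a V1-protocol prediction request without sending it."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Placeholder endpoint, model name, and feature values.
request = build_predict_request(
    "http://models.example.com", "sklearn-iris", [[6.8, 2.8, 4.8, 1.4]]
)

# Sending is left commented out because the endpoint above is a placeholder:
# with urllib.request.urlopen(request) as response:
#     predictions = json.load(response)["predictions"]
```

The request shape stays the same whether the model behind the endpoint was trained with PyTorch or TensorFlow, which is what allows the framework to be swapped without changing client code.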

The solution also abstracts the infrastructure complexities associated with model monitoring, scaling, and revisioning during model serving. Hosted models can be quickly updated to the latest version or rolled back to a prior version if required.

Shared Storage

Users may need a way to share data across the various stages of the pipeline. The platform provides seamless integrations for various data storage options:

Block Storage (from host k8s cluster)
Object Storage (GCS for GCP, S3 for AWS and MinIO for On-Premises)