Overview
Just as centralized version control and code persistence give developers an easy way to iterate, review changes, deploy at scale, and roll back in case of failure, a model registry is a critical tool in your ML arsenal for elevating the machine learning development cycle.
A model registry is a centralized repository designed to tackle the specific challenges posed by ML model development. Unlike traditional software, ML models have multiple components that extend beyond just the code of the model - training data, hyperparameters, model weights, and the environment required for running the model.
Model registries accelerate the journey from research to production by providing a consolidated platform for secure model storage as well as the metrics required for evaluating model performance, allowing you to easily tune parameters and select the best model variation. Model registries also offer a seamless transition from training to deployment, enabling faster development and greater flexibility in inference experimentation.
The image below shows the typical workflow used at organizations, illustrating the collaboration and handoff between "data scientists" and "MLOps engineers" for models.
Integrated Model Registry¶
Rafay's MLOps Platform, based on Kubeflow, uses MLflow as its Model Registry service. MLflow is automatically deployed and configured, giving users a seamless, integrated experience right in their dashboard.
Note
Users just need to click on the MLflow menu on the left to view the integrated model registry. Review instructions for additional details.
The model registry is configured to use multiple storage services. For example, on Google Cloud, the following services are provisioned and used.

| Resources | Service/System |
|---|---|
| Model Artifacts | Google Cloud Storage (Object Store) |
| Experiments & Runs | Google Cloud SQL (RDBMS) |
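For illustration, here is a minimal sketch of how a client interacts with this split storage layout using the MLflow Python API. The tracking URI, experiment name, and artifact file below are placeholders rather than values provisioned by the platform; in the integrated environment the tracking endpoint is available directly from the dashboard.

```python
import mlflow

# Point the client at the MLflow tracking server (placeholder URI).
mlflow.set_tracking_uri("http://mlflow.example.internal:5000")
mlflow.set_experiment("demo-experiment")  # hypothetical experiment name

with mlflow.start_run(run_name="storage-demo"):
    # Parameters, metrics, and tags are persisted in the backend store
    # (e.g. Cloud SQL); files logged as artifacts land in the configured
    # artifact store (e.g. a Cloud Storage bucket).
    mlflow.log_param("example_param", 42)
    mlflow.log_artifact("model_card.md")  # assumes this file exists locally
```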
Why Model Registry?¶
Your data scientists and ML engineers already store all their code in Git.
Why not just store the ML model in Git?
While Git-based repositories should certainly be used in conjunction with model registries for semantic versioning, PR review, and team-based collaboration, machine learning development involves more than just code management. When you create a version of a machine learning model or run a training, there are several key aspects of that training that you need to keep track of:
| Component | Description |
|---|---|
| Code | Actual execution code of the model |
| Environment | Python version, pip libraries, and driver configurations used throughout the build |
| Training Data | Specific input data used during the training (dates, columns, filters) |
| Hyperparameters | External configuration settings used to control the learning process |
| Metrics | Evaluation criteria used to measure the performance of the training |
| Image | Container used to replicate the training environment for inference deployment |
A Git-based system can manage some of the requirements listed above, but not all of them. A model registry is purpose-built for exactly these requirements.
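As a rough sketch (not the platform's prescribed workflow), the example below logs each of these components for a simple scikit-learn model via the MLflow Python API. The experiment, model name, and data reference are hypothetical.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

params = {"C": 0.5, "max_iter": 200}  # hyperparameters

with mlflow.start_run(run_name="iris-baseline"):
    # Training data reference: record where the data came from and how it
    # was filtered, so the run can be reproduced later.
    mlflow.set_tag("training_data", "sklearn.datasets.load_iris (all rows)")

    model = LogisticRegression(**params).fit(X_train, y_train)

    mlflow.log_params(params)  # hyperparameters
    mlflow.log_metric("accuracy",
                      accuracy_score(y_test, model.predict(X_test)))  # metrics

    # Logs the serialized model along with an inferred environment spec
    # (conda.yaml / requirements.txt) and registers it in the registry.
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="iris-classifier")
```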
Benefits¶
The integrated and centralized model registry offers a number of benefits.
Centralized Model Mgmt¶
All models and their versions are stored and managed in a centralized system.
Version Control¶
When an issue is identified with a model in production, you may need to roll back to a model from 30 days prior. Having historical records of your model variations and builds is necessary for creating a stable, consistent production environment and allows you to revert changes instantly.
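A minimal sketch of such a rollback with the MLflow client is shown below; the model name and version number are hypothetical.

```python
import mlflow
from mlflow import MlflowClient

client = MlflowClient()

# Inspect the version history of a registered model
# ("iris-classifier" is a placeholder name).
for mv in client.search_model_versions("name = 'iris-classifier'"):
    print(mv.version, mv.creation_timestamp, mv.current_stage)

# Load a specific historical version (e.g. version 12 from ~30 days ago)
# and serve it in place of the problematic latest version.
model = mlflow.pyfunc.load_model("models:/iris-classifier/12")
```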
Data Consistency and Security¶
Users require consistency and control over the data used for training. In ad hoc training scenarios, datasets can float from machine to machine, which presents concerns around security, data governance, and integrity. With a model registry, users know the exact data that was used for an experiment and can ensure it always comes from a safe, secure location without any unintentional persistence.
Scalable Infrastructure¶
As the scale of data increases, training on local machines or static compute instances can present challenges around resource utilization. Having a dedicated environment that can dynamically scale up and down cloud instances of any size will allow for larger, more efficient training.
Model Performance Tracking¶
It is a critical requirement to define metrics and evaluation parameters that gauge the performance of your model's predictions. These metrics vary significantly depending on the type of model. For each training execution, these metrics need to be stored so you can compare changes from run to run and ensure that the changes you make to a model actually improve performance.
The integrated and centralized model registry allows users to attach the evaluation metrics of the models directly to the training build. This allows users to easily compare across different variations, visualize metrics across experiments, and identify when models need to be retrained or updated.
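For example, a sketch like the following (the experiment name and metric are assumptions) pulls all runs of an experiment into a DataFrame so they can be ranked and compared:

```python
import mlflow

# Fetch every run of an experiment and sort by the evaluation metric.
runs = mlflow.search_runs(
    experiment_names=["demo-experiment"],
    order_by=["metrics.accuracy DESC"],
)
print(runs[["run_id", "params.C", "metrics.accuracy"]].head())
```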
CI/CD and Scheduled Trainings¶
Once a machine learning model is built and in production, teams will want to retrain it on new incoming data on a regular cadence or schedule. Without a centralized model registry, this process becomes extremely difficult because it requires users to stitch together different tools and services such as code repositories, scheduling tools, and image repositories.
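A minimal sketch of such a scheduled retraining job is shown below; the data-loading hook and model name are hypothetical, and the scheduling itself would be handled by an external trigger such as a recurring Kubeflow Pipelines run or a cron job.

```python
import mlflow
import mlflow.sklearn
from sklearn.linear_model import LogisticRegression

def retrain(load_latest_data):
    """Retraining job intended to be invoked on a schedule."""
    X, y = load_latest_data()  # hypothetical data-loading hook
    model = LogisticRegression(max_iter=200).fit(X, y)

    with mlflow.start_run(run_name="scheduled-retrain"):
        mlflow.log_param("max_iter", 200)
        # Each scheduled run registers a new version of the same model,
        # so the registry retains the full retraining history.
        mlflow.sklearn.log_model(model, artifact_path="model",
                                 registered_model_name="iris-classifier")
```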