Inference Endpoints
Inference refers to the process of using a trained AI/ML model to make predictions based on new input data. It plays a crucial role in machine learning workflows, as it enables models to provide valuable insights and automate decision-making in real-time or batch scenarios. Inference can be applied across various use cases, such as predicting customer behavior, classifying images, or detecting anomalies, depending on the specific business problem being addressed.
For users, the ability to deploy and manage inference endpoints is essential when they need to operationalize their models and scale AI/ML services. Users typically rely on this feature to deploy trained models into production environments for real-time inference, test models in development settings, or adjust the scale of inference services to meet growing demand. By centralizing the deployment and configuration process, Rafay simplifies the operationalization of machine learning workflows, ensuring consistency, scalability, and security.
In the Rafay console, the Inference Endpoint Configuration screen makes the process of setting up and managing inference endpoints seamless. It provides a centralized interface to configure essential parameters, such as compute instances, API credentials, and cluster details. This eliminates the need for multiple tools and manual scripts, allowing users to define all necessary configurations in one place. The console also integrates with existing clusters and cloud environments, optimizing resource utilization and enabling automation for tasks like blueprint application, scaling, and securing access via API keys. This functionality is ideal for ML engineers and data scientists who need a simple, efficient way to deploy, test, and scale their models across various environments.
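For reference, the sketch below shows how a client might call a deployed inference endpoint over HTTPS using an API key. The endpoint URL, header name, and payload schema are illustrative assumptions; use the actual values shown on the endpoint's details page in the console.

```python
# Hypothetical example of invoking a deployed inference endpoint over HTTPS.
# The URL, auth header, and payload schema are placeholders, not Rafay-specific values.
import requests

ENDPOINT_URL = "https://inference.example.com/v1/models/my-model:predict"  # placeholder
API_KEY = "REPLACE_WITH_YOUR_API_KEY"                                      # placeholder

payload = {"inputs": [[5.1, 3.5, 1.4, 0.2]]}  # example feature vector

response = requests.post(
    ENDPOINT_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```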
Create Inference Endpoints¶
To create an inference endpoint, access the Developer Hub and navigate to the home page. The page provides options to create and manage inference endpoints, which are predefined configurations designed to simplify and accelerate the deployment of machine learning models as APIs, enabling rapid prototyping and real-time predictions. On the Developer Hub home page, users can either click View All to access the Inference Endpoints page or click New Inference Endpoint to create a new inference endpoint. Users can also click the Inference Endpoints menu on the left to go directly to the Inference Endpoints page.
New Inference Endpoint¶
To create a new Inference Endpoint,
- Select Inference Endpoints from the menu on the left of the console
- Click on New Inference Endpoint
- Select the inference service profile that suits your requirements
- Provide a name for the inference endpoint with an optional description
- Select the compute instance you would like to deploy the inference endpoint to from the drop down
- Provide the remaining details and click on Deploy
It can take a few minutes for the inference and associated software components to be deployed and ready for use.
Info
Users can deploy multiple inferences on an instance. The only constraint is whether the underlying instance has the resources required for all of the inferences.
- The instance initially displays a status of In Progress. Upon successful deployment, the status updates to Success
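Because deployment is not instantaneous, it can be convenient to script a readiness check that retries a small test request until the endpoint responds. The sketch below is a generic example, not a Rafay API; the endpoint URL, authentication header, and test payload are assumptions and should be replaced with the values from the endpoint's details page.

```python
# Hypothetical readiness check: poll the endpoint with a small test request
# until it answers, since deployment can take a few minutes.
import time
import requests

ENDPOINT_URL = "https://inference.example.com/v1/models/my-model:predict"  # placeholder
API_KEY = "REPLACE_WITH_YOUR_API_KEY"                                      # placeholder

def wait_until_ready(retries: int = 30, delay_seconds: int = 20) -> bool:
    """Return True once the endpoint answers a test request successfully."""
    for _ in range(retries):
        try:
            response = requests.post(
                ENDPOINT_URL,
                json={"inputs": [[0.0]]},  # minimal test payload (placeholder)
                headers={"Authorization": f"Bearer {API_KEY}"},
                timeout=10,
            )
            if response.ok:
                return True
        except requests.RequestException:
            pass  # endpoint not reachable yet; keep waiting
        time.sleep(delay_seconds)
    return False

if __name__ == "__main__":
    print("ready" if wait_until_ready() else "timed out")
```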
View Inferences¶
Clicking on the Inference Endpoints menu lists all the inferences the user has access to. Note that inferences may span different workspaces and different instances. To view details about a specific inference, users just need to click on the name of the inference.
Delete Inference¶
To delete an Inference, users should click on the ellipses on the far right of the selected Inference and select Delete.
Info
Once deletion has been initiated, it cannot be stopped or reversed. Users can create a new Inference if required.