Troubleshooting
Issue 1: Model Deployment Fails on Certain GPU Instances
A model deployment can fail on GPU-enabled EC2 instances whose GPU does not support FlashAttention v2, which requires CUDA compute capability >= 8.0 (Ampere or newer). The detailed error is visible in the vLLM pod logs.
Example log (reproduced verbatim from vLLM):

ERROR [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
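To see whether a given GPU is affected, the decision can be sketched as a small helper. The function name `needs_enforce_eager` is illustrative, not part of vLLM; in practice the capability tuple can be obtained from `torch.cuda.get_device_capability()`.

```python
def needs_enforce_eager(compute_capability):
    """Return True if the GPU cannot run FlashAttention v2.

    FA2 requires CUDA compute capability >= 8.0 (Ampere or newer).
    """
    major, _minor = compute_capability
    return major < 8

# NVIDIA T4 (used by g4dn instances) has compute capability 7.5 -> flag needed
print(needs_enforce_eager((7, 5)))  # True
# NVIDIA A100 (used by p4d instances) has compute capability 8.0 -> FA2 works
print(needs_enforce_eager((8, 0)))  # False
```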
ResolutionΒΆ
In the model deployment workflow, go to Inference Engine > Extra Engine Arguments.
Add the following argument:
--enforce-eager
This forces eager-mode execution (disabling CUDA graph capture), allowing the model to deploy successfully on GPUs that lack FlashAttention v2 support.
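For reference, passing the same argument when launching vLLM directly looks like the following sketch; the model name is a placeholder, and the deployment workflow's Extra Engine Arguments field forwards the flag to this underlying command.

```
vllm serve <model-name> --enforce-eager
```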
