Troubleshooting
Issue 1: Model Deployment Fails on Certain GPU Instances
A model deployment can fail on GPU-enabled EC2 instances whose GPU does not support FlashAttention v2, which requires CUDA compute capability >= 8.0 (Ampere or newer). The detailed error is visible in the vLLM pod logs.
Example log (reproduced verbatim from vLLM):

ERROR [fa_utils.py:82] Cannot use FA version 2 is not supported due to FA2 is only supported on devices with compute capability >= 8
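To see whether a given GPU is affected, the decision can be sketched as a small helper. The function name `needs_enforce_eager` is illustrative, not part of vLLM; in practice the capability tuple can be obtained from `torch.cuda.get_device_capability()`.

```python
def needs_enforce_eager(compute_capability):
    """Return True if the GPU cannot run FlashAttention v2.

    FA2 requires CUDA compute capability >= 8.0 (Ampere or newer).
    """
    major, _minor = compute_capability
    return major < 8

# NVIDIA T4 (used by g4dn instances) has compute capability 7.5 -> flag needed
print(needs_enforce_eager((7, 5)))  # True
# NVIDIA A100 (used by p4d instances) has compute capability 8.0 -> FA2 works
print(needs_enforce_eager((8, 0)))  # False
```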
ResolutionΒΆ
In the model deployment workflow, go to Inference Engine > Extra Engine Arguments.
Add the following argument:
--enforce-eager
This forces eager-mode execution (disabling CUDA graph capture), allowing the model to deploy successfully on GPUs that lack FlashAttention v2 support.
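For reference, passing the same argument when launching vLLM directly looks like the following sketch; the model name is a placeholder, and the deployment workflow's Extra Engine Arguments field forwards the flag to this underlying command.

```
vllm serve <model-name> --enforce-eager
```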
