Best Practices
Apache Airflow, when scaled using KEDA (Kubernetes Event-Driven Autoscaler), has a few considerations and limitations users should be aware of. These mostly arise from Airflow's architectural assumptions and the nature of event-driven autoscaling.
Best Practices
- Use the cooldownPeriod and pollingInterval settings in the ScaledObject to avoid flapping (see the sketch after this list).
- Configure resource limits to avoid starvation of scheduler or webserver.
- Always test autoscaling behavior under load with representative DAGs.
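As a concrete starting point, a minimal ScaledObject sketch for Celery workers is shown below. The Deployment name (airflow-worker), namespace, Redis address, queue name, and replica bounds are assumptions and will differ per installation.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: airflow-worker
  namespace: airflow                 # assumed namespace
spec:
  scaleTargetRef:
    name: airflow-worker             # assumed name of the Celery worker Deployment
  pollingInterval: 15                # seconds between metric checks
  cooldownPeriod: 300                # wait after the last active trigger before scaling to zero (only relevant when minReplicaCount is 0)
  minReplicaCount: 1                 # baseline capacity, also softens cold starts (see below)
  maxReplicaCount: 10
  triggers:
    - type: redis                    # assumes a Redis-backed CeleryExecutor
      metadata:
        address: airflow-redis:6379  # assumed Redis service for the Celery broker
        listName: default            # Celery queue name
        listLength: "10"             # target queued tasks per worker replica
```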
Considerations
1. Worker Lifecycle Awareness
Airflow workers spun up via KEDA may not always shut down gracefully, especially if they are scaled down while a task is still running. This can result in task retries or missed heartbeats when workers are terminated before finishing their assigned DAG tasks.
Mitigation
Set terminationGracePeriodSeconds long enough for in-flight tasks to finish, and use KEDA's cooldownPeriod to prevent premature scale-down (see the sketch below).
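A minimal sketch of the worker Deployment side, assuming the same airflow-worker name; the grace period, image, and resource figures are illustrative, not recommendations:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: airflow-worker                      # assumed worker Deployment name
  namespace: airflow
spec:
  selector:
    matchLabels:
      component: worker
  template:
    metadata:
      labels:
        component: worker
    spec:
      terminationGracePeriodSeconds: 600    # give running tasks up to 10 minutes to finish after SIGTERM
      containers:
        - name: worker
          image: apache/airflow:2.9.2       # assumed image and version
          args: ["celery", "worker"]
          resources:                        # explicit requests/limits also protect the scheduler and webserver from starvation
            requests:
              cpu: "1"
              memory: 2Gi
            limits:
              cpu: "2"
              memory: 4Gi
```

A long grace period only helps if the worker process performs a warm shutdown on SIGTERM, which the Celery worker does by default, finishing its current tasks before exiting.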
2. Task Queue Visibility
Airflow uses the CeleryExecutor, KubernetesExecutor, or CeleryKubernetesExecutor. KEDA typically monitors external queues (like Redis or RabbitMQ) for scaling signals. As a result, KEDA may not see task queue depth in real time unless it is configured with a scaler compatible with Airflow's executor backend.
Mitigation
Use a KEDA scaler tailored to the executor's backend, e.g. the Redis scaler against the Celery broker queue, a PostgreSQL scaler against the Airflow metadata database, or a Prometheus scaler against exported queue metrics (see the sketch below).
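One approach, similar in spirit to the KEDA support in the official Airflow Helm chart, is to drive scaling from the metadata database rather than the broker. A hedged sketch of such a trigger; the connection env var name and the worker concurrency of 16 are assumptions:

```yaml
triggers:
  - type: postgresql
    metadata:
      connectionFromEnv: AIRFLOW_METADATA_CONN   # assumed env var holding the metadata DB connection string
      targetQueryValue: "1"                      # scale so each replica covers one "worker's worth" of tasks
      # 16 below = assumed Celery worker_concurrency, so the query returns the number of replicas needed
      query: >-
        SELECT ceil(COUNT(*)::decimal / 16)
        FROM task_instance
        WHERE state IN ('running', 'queued')
```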
3. No Native Awareness of DAG State
KEDA scales based on external metrics (e.g., queue length, CPU usage). It does not understand Airflow-specific concepts like DAG concurrency limits, task dependencies, or execution windows.
Impact
Over-aggressive scaling can overload the system or exhaust resources, especially if DAGs are not parallelizable.
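KEDA itself cannot enforce these limits, but one common guard is to bound both sides: cap maxReplicaCount in the ScaledObject and keep Airflow's own concurrency settings consistent with it. A sketch using environment variables on the Airflow pods; the values are illustrative only:

```yaml
env:
  - name: AIRFLOW__CORE__PARALLELISM               # global cap on simultaneously running task instances
    value: "64"
  - name: AIRFLOW__CORE__MAX_ACTIVE_TASKS_PER_DAG  # per-DAG concurrency limit
    value: "16"
  - name: AIRFLOW__CELERY__WORKER_CONCURRENCY      # tasks each worker replica runs at once
    value: "8"
```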
4. Cold Start Overhead
Scaling from zero or very few replicas introduces latency (cold start) before tasks start executing. This can affect SLA guarantees, especially for short-lived or high-frequency DAGs.
Mitigation
Set minReplicaCount > 0 to keep baseline capacity available at all times (see the snippet below).
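In the ScaledObject sketched earlier, this is just a matter of the replica bounds; the numbers here are placeholders:

```yaml
spec:
  minReplicaCount: 2    # warm pool so short or frequent DAG runs don't pay pod start-up latency
  maxReplicaCount: 10
```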
5. Metrics Availability
KEDA relies on metrics for autoscaling decisions. If metrics (like queue depth) are delayed or noisy, scaling can be slow or inaccurate.
Mitigation
Use stable, low-latency metric sources and fine-tune polling intervals (see the sketch below).
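If the executor's queue is not directly observable, a Prometheus-based trigger with a short polling interval is one option. A sketch, where the Prometheus address and the metric name (assumed to be exported from Airflow's StatsD metrics via an exporter) are assumptions:

```yaml
spec:
  pollingInterval: 10                                           # check the metric every 10 seconds
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus.monitoring.svc:9090    # assumed Prometheus endpoint
        query: sum(airflow_executor_queued_tasks)               # assumed exported queue-depth metric
        threshold: "10"                                         # target queued tasks per worker replica
```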