Optimize Argo Workflows Performance and Resilience
When working with Argo Workflows, it's essential to ensure that workflows are efficient, reliable, and make good use of available resources. Techniques that help include enforcing time limits with activeDeadlineSeconds, configuring retry strategies for error handling, and optimizing resource allocation.
This documentation provides an overview of key concepts and strategies to optimize Argo Workflows. Incorporating them into a workflow design can ensure that workflows run smoothly, recover from transient issues, and make the most of their resources.
To improve the reliability of workflows or steps, we recommend implementing a retry strategy. This strategy helps handle transient errors or failures by automatically retrying the failed step. Here's our recommended retry strategy:
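A minimal sketch of a retryStrategy block, assembled from the field values explained in this section. Where it is attached (the workflow spec or an individual template) is an assumption left to the reader:

```yaml
# Sketch only: values taken from the field descriptions in this section.
retryStrategy:
  limit: "2"              # retry a failed step at most 2 times
  retryPolicy: "Always"   # retry regardless of the failure reason
  backoff:
    duration: "15s"       # wait 15 seconds before the first retry
    factor: "2"           # double the wait for each subsequent retry
  affinity:
    nodeAntiAffinity: {}  # avoid the node used by the previous failed attempt
```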
Explanation of the Retry Strategy Fields
limit: The maximum number of times to retry a failed step. In this example, the failed step will be retried up to 2 times.
retryPolicy: Determines the conditions under which the step should be retried. In this example, the Always policy means that the step will be retried regardless of the failure reason.
backoff: Configures the backoff strategy for retries, which determines the waiting time between retries.
duration: The initial duration to wait before retrying a failed step. In this example, the first retry will be attempted after 15 seconds.
factor: The multiplier applied to the duration for each subsequent retry. In this example, the factor is set to 2, which means that the waiting time doubles with each retry (15s before the first retry, 30s before the second).
affinity: Configures the affinity rules for the pod during retries.
nodeAntiAffinity: Defines a node anti-affinity rule, which prevents the pod from being scheduled on the same node as the previous failed attempt. This can help avoid recurring issues caused by node-specific problems.
Incorporating this retry strategy in workflows or steps increases their resilience to failures and ensures that steps hit by transient issues are retried automatically.
To improve the efficiency of workflows and prevent steps from taking an unreasonably long time to finish, it's recommended to set the activeDeadlineSeconds field for each step. This configuration should be applied to individual steps rather than the entire workflow, allowing steps to retry while still enforcing a time limit on their execution.
The activeDeadlineSeconds field sets a duration after which a step is considered to have failed and is terminated. This helps to prevent steps from running indefinitely in case of issues or unexpected circumstances.
Here's an example of how to set activeDeadlineSeconds for a step:
- name: example-step
  activeDeadlineSeconds: 300
In this example, activeDeadlineSeconds is set to 300 seconds (5 minutes). If the step does not complete within this time, it is terminated and marked as failed.
Combining with Retry Strategy
When combined with a retry strategy, activeDeadlineSeconds ensures that each retry attempt of a step has a time limit, preventing the step from taking too long to complete. This is particularly useful when handling external services or resources that may be temporarily unavailable or slow to respond.
Using activeDeadlineSeconds in conjunction with a retry strategy bounds the execution time of workflow steps and ensures that they don't consume excessive resources due to unforeseen issues.
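To illustrate how the two settings fit together, here is a hedged sketch of a complete Workflow that places both on the same template. The workflow name, image, and command are hypothetical placeholders; only the activeDeadlineSeconds and retryStrategy values come from the text above.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: deadline-retry-   # hypothetical name
spec:
  entrypoint: example-step
  templates:
    - name: example-step
      # Template-level deadline: applies to each attempt of this step.
      activeDeadlineSeconds: 300
      retryStrategy:
        limit: "2"
        retryPolicy: "Always"
        backoff:
          duration: "15s"
          factor: "2"
        affinity:
          nodeAntiAffinity: {}
      container:
        image: alpine:3.19        # hypothetical image
        command: [sh, -c]
        # Placeholder for a call to a slow or flaky external service.
        args: ["echo calling external service; sleep 10"]
```

With these values, the worst case is roughly three attempts of up to 300 seconds each plus 15 and 30 seconds of backoff, or about 945 seconds in total.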
Workflows often have outputs that are expensive to compute. Step-level memoization reduces cost and workflow execution time by storing the outputs of a template in a specified cache under a variable key, so that a later run with the same key reuses the cached outputs instead of executing the step again.
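A minimal sketch of a memoized template, assuming a ConfigMap-backed cache. The template name, parameter name, cache name, image, and maxAge value are hypothetical choices made for illustration:

```yaml
# Fragment of spec.templates in a Workflow manifest.
- name: expensive-step              # hypothetical template name
  inputs:
    parameters:
      - name: input
  memoize:
    # Cache key varies with the input, so each distinct input is computed once.
    key: "{{inputs.parameters.input}}"
    maxAge: "1h"                    # assumed freshness window for cache entries
    cache:
      configMap:
        name: expensive-step-cache  # hypothetical ConfigMap name
  container:
    image: alpine:3.19              # hypothetical image
    command: [sh, -c]
    args: ["echo {{inputs.parameters.input}} > /tmp/result.txt"]
  outputs:
    parameters:
      # The outputs are what gets stored in the cache and replayed on a hit.
      - name: result
        valueFrom:
          path: /tmp/result.txt
```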
Follow the Argo Workflows cost optimization recommendations:
- Set resource requests for workflow pods.
- Use node selectors to schedule pods on more cost-effective instances.
- Consider alternative storage solutions such as Volume Claim Templates or Volumes instead of Artifacts.
- Limit the total number of workflows and pods to manage resource usage.
Also consider the project's best practices for Argo Workflows operators, such as setting appropriate resource requests and limits and configuring executor resource requests, to ensure efficient use of infrastructure resources. Following these guidelines helps balance performance and cost for Argo Workflows deployments.
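Several of these user-side recommendations can be expressed directly in a Workflow manifest. The sketch below is illustrative only: the node label, resource sizes, parallelism value, image, and names are assumptions, and suitable values depend on the cluster and workload.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: cost-aware-         # hypothetical name
spec:
  entrypoint: main
  parallelism: 4                    # cap on pods running at once in this workflow
  nodeSelector:
    node-pool: spot                 # hypothetical label for cheaper nodes
  volumeClaimTemplates:             # scratch volume used instead of artifacts
    - metadata:
        name: workdir
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
  templates:
    - name: main
      container:
        image: alpine:3.19          # hypothetical image
        command: [sh, -c]
        args: ["echo hello > /work/out.txt"]
        resources:
          requests:                 # assumed sizes; tune per workload
            cpu: 100m
            memory: 64Mi
          limits:
            memory: 128Mi
        volumeMounts:
          - name: workdir
            mountPath: /work
```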
Configure the TTL strategy for workflows to automatically delete completed workflows and release resources after a specified amount of time. This helps prevent resource exhaustion and keeps the system running smoothly.
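As a hedged sketch, a TTL strategy is set under ttlStrategy in the workflow spec. The durations below are arbitrary example values, not recommendations from this guide:

```yaml
# Fragment of a Workflow manifest.
spec:
  ttlStrategy:
    secondsAfterCompletion: 3600   # delete any completed workflow after 1 hour
    secondsAfterSuccess: 600       # delete successful workflows after 10 minutes
    secondsAfterFailure: 86400     # keep failed workflows for 1 day for debugging
```

Keeping failed workflows longer than successful ones is one possible trade-off: it leaves time to inspect failures while still reclaiming resources quickly in the common case.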
For more information, please see these Argo Workflows resources: