Reliability response

Configure Optimize Live to respond to OOM events with a temporary memory bump-up

As it collects metrics, Optimize Live detects and reports out-of-memory (OOM) events on all workload types. You can configure Optimize Live to respond to OOM events by applying a temporary memory bump-up, and you can specify when the bump-up is applied.

Out-of-memory events response

Defines the values that Optimize Live uses to calculate the temporary memory bump-up added to memory requests after an OOM event, and indicates when to apply the bump-up.

Annotation Default value
live.stormforge.io/reliability.oom.memory-bump-up.period "P4D"
live.stormforge.io/reliability.oom.memory-bump-up.percent "0"
live.stormforge.io/reliability.oom.memory-bump-up.min "0Mi"
live.stormforge.io/reliability.oom.memory-bump-up.max "2Gi"
live.stormforge.io/reliability.oom.memory-bump-up.apply-immediately "Never"
Description

To prevent unexpected results, this feature is disabled by default.

When a container in a workload is OOMKilled, Optimize Live immediately generates a recommendation using the configured memory bump-up percent (default value is "0") and, if configured to do so, applies this recommendation immediately (default value is "Never").

These two default values disable memory bump-ups in order to prevent unexpected results or downtime from applying recommendations outside of a schedule.

The live.stormforge.io/reliability.oom.memory-bump-up.* annotations can be configured based on the workload resource kind, using the following syntax:

"100Mi,resource:daemonsets=0Mi"

and

"Always,resource:jobs=Never"

The bump-up is calculated for—and applied to—OOMKilled containers only. The calculated bump-up respects the optimization policy and limits-to-requests ratio (LRR) configured for containers in that workload.
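
The resource-kind syntax described above can be illustrated with a short Python sketch. It assumes that the entry without a resource: prefix is the default and that each resource:<kind>=<value> entry overrides the default for that kind; this reading is inferred from the syntax examples, and resolve_for_kind is a hypothetical helper, not part of Optimize Live.

```python
from typing import Optional


def resolve_for_kind(annotation_value: str, resource_kind: str) -> Optional[str]:
    """Return the value that applies to resource_kind (for example "daemonsets").

    Assumption: the entry without a "resource:" prefix is the default, and
    "resource:<kind>=<value>" entries override it for that kind only.
    """
    default: Optional[str] = None
    overrides = {}
    for part in annotation_value.split(","):
        part = part.strip()
        if part.startswith("resource:"):
            # Split "resource:daemonsets=0Mi" into kind and value.
            kind, _, value = part[len("resource:"):].partition("=")
            overrides[kind.strip().lower()] = value.strip()
        elif part:
            default = part
    return overrides.get(resource_kind.lower(), default)
```

Under this reading, "100Mi,resource:daemonsets=0Mi" resolves to "0Mi" for DaemonSets and to "100Mi" for every other resource kind.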

How to configure bump-ups

You can configure memory bump-ups at the cluster, namespace, or workload level.

To configure bump-ups, complete these high-level steps:

  1. Set live.stormforge.io/reliability.oom.memory-bump-up.percent to an integer value greater than 0.
  2. Set live.stormforge.io/reliability.oom.memory-bump-up.apply-immediately to Always or IfAutoDeployEnabled (see Valid values below).
  3. Optional: Set the bump-up min and max values.
  4. At this time, the following additional configuration is required. You must:
    • Disable memory bump-ups for DaemonSets.
    • Disable auto-deploy for DaemonSets and StatefulSets.

A recommended default configuration is provided in the examples section later in this topic.
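
As a rough aid, the following Python sketch checks a set of annotations against steps 1, 2, and the DaemonSet part of step 4. It is not an Optimize Live tool: unmet_steps and value_for_kind are hypothetical helpers, the override parsing assumes the resource:<kind>=<value> syntax described earlier, and the auto-deploy settings from step 4 are not checked because their annotations are not covered in this topic.

```python
PREFIX = "live.stormforge.io/reliability.oom.memory-bump-up."


def value_for_kind(raw, kind=None, fallback=None):
    """Resolve "default,resource:<kind>=<value>" for one kind (assumed semantics)."""
    default, overrides = fallback, {}
    for part in raw.split(","):
        part = part.strip()
        if part.startswith("resource:"):
            name, _, value = part[len("resource:"):].partition("=")
            overrides[name.strip().lower()] = value.strip()
        elif part:
            default = part
    return overrides.get(kind, default)


def unmet_steps(annotations):
    """Return a list of configuration steps that the annotations do not satisfy."""
    problems = []
    # Missing annotations fall back to the documented defaults.
    percent = annotations.get(PREFIX + "percent", "0")
    apply_mode = annotations.get(PREFIX + "apply-immediately", "Never")

    if int(value_for_kind(percent, fallback="0")) <= 0:
        problems.append("step 1: percent must be an integer greater than 0")
    if value_for_kind(apply_mode, fallback="Never") not in ("Always", "IfAutoDeployEnabled"):
        problems.append("step 2: apply-immediately must be Always or IfAutoDeployEnabled")
    if int(value_for_kind(percent, "daemonsets", fallback="0")) != 0:
        problems.append("step 4: memory bump-ups are not disabled for DaemonSets")
    return problems
```

An empty list means that the checked steps are satisfied; each entry otherwise names the step that still needs attention.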

Bump-up period

The bump-up period defines how long Optimize Live calculates and applies memory bump-ups for each OOMKilled container in a workload. When a container in a workload is OOMKilled, the container’s bump-up period begins and recommendations contain the memory bump-up values. The container’s bump-up period resets on subsequent OOMKills.

When a container’s bump-up period ends, bump-ups are no longer calculated and recommendations are based on the observed resource usage.

There is typically no need to change this value. You might extend the bump-up period if you want to be particularly conservative about memory, or shorten it for workloads where you expect occasional OOM activity because of intentional design or known issues.
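
To make the timing concrete, here is a minimal Python sketch of the period check. It assumes that the period is measured from the container's most recent OOMKill (which is how the reset on subsequent OOMKills is modeled here) and that the period ends exactly when the duration has elapsed. Python's standard library has no ISO-8601 duration parser, so the sketch handles only week, day, hour, minute, and second components. The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Supports a subset of ISO-8601 durations such as "P4D", "P1W", or "P1DT12H".
_DURATION = re.compile(
    r"P(?:(\d+)W)?(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?"
)


def parse_duration(text: str) -> timedelta:
    """Convert a simple ISO-8601 duration string into a timedelta."""
    match = _DURATION.fullmatch(text)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {text!r}")
    weeks, days, hours, minutes, seconds = (int(g) if g else 0 for g in match.groups())
    return timedelta(weeks=weeks, days=days, hours=hours, minutes=minutes, seconds=seconds)


def bump_up_active(last_oom_kill: datetime, now: datetime, period: str = "P4D") -> bool:
    """Return True while the bump-up period that started at last_oom_kill is in effect."""
    return now - last_oom_kill < parse_duration(period)
```

With the default period of "P4D", a container that was last OOMKilled three days ago is still within its bump-up period, and one that was last OOMKilled five days ago is not.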

Reporting and tracking

All OOM events are tracked and counted, regardless of the workload type.

You can view OOM events as follows:

  • By estate and cluster: On the Reports page, see the OOM Events graph. You can filter by cluster and time period at the top of the page.

  • By container: On the workload details page, click the Recommendation tab and then click a container name. OOM events are shown as timeline markers (accurate to within 5 minutes) along the x-axis of the container’s Average Memory Usage, Requests, and Limits graph.

    After a bump-up is applied, you can see the immediate increase in the recommended and current requests on the container’s Average Memory Usage, Requests, and Limits graph on the workload details page. If Optimize Live is configured to apply the bump-up immediately after an OOM event, this increase also lines up with the OOM event marker on the graph’s x-axis.

Valid values
  • live.stormforge.io/reliability.oom.memory-bump-up.period
    • ISO-8601 duration string defining how long Optimize Live adds a memory bump-up to recommendations after an OOM event. Typically measured in days; default is “P4D”. There is typically no need to change this value (see Bump-up period above).
  • live.stormforge.io/reliability.oom.memory-bump-up.percent
    • String representing the percentage by which to increase the memory setting in response to an OOM event
    • Valid values: “0” to “100”
  • live.stormforge.io/reliability.oom.memory-bump-up.min and live.stormforge.io/reliability.oom.memory-bump-up.max
    • A non-negative quantity with a Kubernetes memory unit, such as "2Gi"
    • Setting a minimum greater than 0 reduces pod churn
  • live.stormforge.io/reliability.oom.memory-bump-up.apply-immediately
    • Never: Memory bump-ups are never applied immediately. If auto-deploy is enabled for the workload, the bump-up is included in the next scheduled recommendation (as long as the bump-up period is in effect) to prevent unexpected downtime during peak hours.
    • IfAutoDeployEnabled: Memory bump-ups are applied immediately if auto-deploy is enabled for the workload. If auto-deploy is not enabled, any recommendation applied on demand within the bump-up period has the bump-up percentage added to it.
    • Always: Memory bump-ups are always applied immediately and automatically after OOM events. This ensures the workload recovers quickly and decreases the possibility of future OOM events.
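
The three apply-immediately values can be summarized as a small decision function. This Python sketch only restates the descriptions above; bump_up_timing and its return strings are hypothetical and do not correspond to an Optimize Live API.

```python
def bump_up_timing(apply_immediately: str, auto_deploy_enabled: bool) -> str:
    """Describe when a bump-up takes effect while the bump-up period is in effect."""
    if apply_immediately == "Always":
        return "immediately"
    if apply_immediately == "IfAutoDeployEnabled":
        # Without auto-deploy, the bump-up is added to recommendations applied on demand.
        return "immediately" if auto_deploy_enabled else "on demand"
    if apply_immediately == "Never":
        # With auto-deploy, the bump-up waits for the next scheduled recommendation.
        return "next scheduled recommendation" if auto_deploy_enabled else "on demand"
    raise ValueError(f"unknown apply-immediately value: {apply_immediately!r}")
```

For example, a workload with auto-deploy enabled and the value set to Never receives the bump-up with its next scheduled recommendation rather than right after the OOM event.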
Examples
  • Recommended default OOM response: Bump up memory by 20%, minimum increase 100Mi, maximum increase 2Gi, apply the bump-up immediately if auto-deploy is enabled for the workload:

    • If auto-deploy is not enabled for the workload, the bump-up is applied the next time a recommendation is applied on demand.
    live.stormforge.io/reliability.oom.memory-bump-up.percent: "20,resource:daemonsets=0"
    live.stormforge.io/reliability.oom.memory-bump-up.min: "100Mi,resource:daemonsets=0Mi"
    live.stormforge.io/reliability.oom.memory-bump-up.max: "2Gi"
    live.stormforge.io/reliability.oom.memory-bump-up.apply-immediately: "IfAutoDeployEnabled,resource:daemonsets=Never,resource:statefulsets=Never"
    live.stormforge.io/reliability.oom.memory-bump-up.period: "P4D"
    
  • Bump up memory by 20%, minimum increase 100Mi, maximum increase 2Gi, never apply the new recommendation immediately:

    • If auto-deploy is enabled, the next scheduled recommendation will contain the bump-up values if the bump-up period is still in effect.
    • If auto-deploy is not enabled and the recommendation is applied on demand during the bump-up period, the bump-up values are applied. If the bump-up period is over, recommended values are based on observed resource usage.
    live.stormforge.io/reliability.oom.memory-bump-up.percent: "20,resource:daemonsets=0"
    live.stormforge.io/reliability.oom.memory-bump-up.min: "100Mi,resource:daemonsets=0Mi"
    live.stormforge.io/reliability.oom.memory-bump-up.max: "2Gi"
    live.stormforge.io/reliability.oom.memory-bump-up.apply-immediately: "Never,resource:daemonsets=Never,resource:statefulsets=Never"
    live.stormforge.io/reliability.oom.memory-bump-up.period: "P4D"
    
  • Disable bump-ups

    live.stormforge.io/reliability.oom.memory-bump-up.percent: "0"
    live.stormforge.io/reliability.oom.memory-bump-up.min: "0Mi"
    live.stormforge.io/reliability.oom.memory-bump-up.max: "2Gi"
    live.stormforge.io/reliability.oom.memory-bump-up.apply-immediately: "Never"
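
To make the percent, min, and max values in these examples concrete, the following Python sketch computes the size of a bump-up. It rests on assumptions: the percentage is taken of the container's memory request, the result is clamped between min and max, and a percent of 0 means no bump-up at all. The real calculation also respects the optimization policy and LRR, which this sketch ignores. The helper names are hypothetical, and only plain byte counts and the binary suffixes Ki, Mi, Gi, and Ti are parsed.

```python
_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_memory(quantity: str) -> int:
    """Convert an integer quantity such as "100Mi" or "2Gi" to bytes."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain byte count, for example "0"


def bump_up_bytes(request: str, percent: int, minimum: str, maximum: str) -> int:
    """Return the assumed memory increase, in bytes, after an OOM event."""
    if percent <= 0:
        return 0  # assumption: a percent of 0 disables the bump-up entirely
    raw = parse_memory(request) * percent // 100
    # Clamp the percentage-based increase between the configured min and max.
    return max(parse_memory(minimum), min(raw, parse_memory(maximum)))
```

With a 20% bump-up, a 100Mi minimum, and a 2Gi maximum, a 256Mi request would receive the 100Mi minimum (20% is only about 51Mi), and a 16Gi request would be capped at the 2Gi maximum (20% is about 3.2Gi).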
    
Last modified November 20, 2024