Concepts

Understand the concepts and terminology used in Optimize Live.

Cluster

Before Optimize Live can produce recommendations for workloads, the StormForge Agent must be installed on the Kubernetes cluster where those workloads are deployed and then registered with the StormForge Optimize platform.

Workloads

A workload is a component that runs inside one or more Kubernetes pods. In the context of Optimize Live, a workload is a named Workload resource from a namespace in a cluster. Optimize Live observes the resource utilization of workloads in order to produce recommendations for them.

Optimize Live can produce recommendations for the following Workload types: DaemonSet, Deployment, ReplicaSet, ReplicationController, and StatefulSet.
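To see which workloads in a cluster fall under these types, a plain kubectl query is enough. The following is a minimal sketch, not part of Optimize Live: `OPTIMIZABLE_KINDS` and `list_optimizable_workloads` are hypothetical names, and the output shows what exists in the cluster, not what the Agent has discovered.

```shell
# Workload types listed above, as a comma-separated kubectl resource list.
OPTIMIZABLE_KINDS="daemonsets,deployments,replicasets,replicationcontrollers,statefulsets"

# Hypothetical helper: list every workload of a supported type.
# With no argument it searches all namespaces; otherwise only the one given.
list_optimizable_workloads() {
  if [ -n "${1:-}" ]; then
    kubectl get "$OPTIMIZABLE_KINDS" --namespace "$1"
  else
    kubectl get "$OPTIMIZABLE_KINDS" --all-namespaces
  fi
}
```

Run `list_optimizable_workloads` for the whole cluster, or pass a namespace name to narrow the listing.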

StormForge Agent

To minimize the footprint in a cluster, Optimize Live requires only the StormForge Agent to be installed on a cluster. By default, the Agent is installed in the stormforge-system namespace.

The StormForge Agent leverages the Kubernetes view role, granting read-only permissions on all resources in the cluster.

StormForge Applier

The StormForge Applier is a Kubernetes controller that manages request and limit settings for each workload and applies the recommendations generated by StormForge’s machine learning. The Applier uses the same StormForge credentials file as the Agent and runs in the same namespace.

Install the Applier if you plan to:

  • Deploy recommendations automatically on a schedule of your choosing. This option enables you to skip manually reviewing the recommended settings and ensures your settings track closely to actual CPU and memory use.
  • Deploy recommendations on demand. For example, you can apply a single recommendation in any environment as you experiment with recommendations or if you need to quickly deploy a recommendation outside of a schedule.

Applying recommendations and validating rollout

Applying recommendations

The Applier applies recommendations as patches. As soon as a recommendation is applied, the recommendation status is updated to either Applied or FailedToApply.

Icon    Status          Validation   Rolled back
✔️      Applied         --           --
❗️      FailedToApply   --           --

If all patches were applied successfully, StormForge then attempts to validate the rollout.

Validating the rollout

After successfully applying all patches in a recommendation, StormForge monitors the workload status for 5 minutes (default value). If the workload enters an error state while being monitored, StormForge rolls back the applied recommendation and sets the recommendation status to FailedToApply (❗️).

During this monitoring period, StormForge checks for conditions such as CPU throttling, OOMKills, CrashLoopBackOff, and pod start-up time.
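For comparison, a rough manual analogue of this monitoring window can be run with plain kubectl. This sketch is an assumption for illustration only: it is not how StormForge implements validation, `check_rollout` is a hypothetical helper, and `kubectl rollout status` reports rollout progress only (it does not look at CPU throttling or OOMKills). The 5m default mirrors the default monitoring period mentioned above.

```shell
# Hypothetical helper: wait for a rollout to finish, giving up after a timeout.
# Usage: check_rollout KIND NAME NAMESPACE [TIMEOUT]
# Returns non-zero if the rollout fails or does not complete in time.
check_rollout() {
  kubectl rollout status "$1/$2" --namespace "$3" --timeout="${4:-5m}"
}
```

For example, `check_rollout deployment my-app default` waits up to 5 minutes for the (hypothetical) `my-app` Deployment to finish rolling out.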

Monitoring terminates when one of the following validation conditions is met:

  • Timed out
    The recommendation was applied but rollout validation didn’t complete within the monitoring period. No pods are in CrashLoopBackOff, so changes have not been rolled back.
  • Workload Became Ready
    The recommendation is applied and the rollout is validated (all pods are healthy after applying the recommendation) within the monitoring period.
  • Workload Became Unhealthy
    The rollout didn’t complete within the monitoring period. StormForge detected that either the rollout stalled with an error (such as ProgressDeadlineExceeded) or at least one pod was in CrashLoopBackOff. If the workload became unhealthy, by default StormForge rolls back its changes.

Icon    Status          Validation                  Rolled back
✔️      Applied         Timed out                   No
✔️✔️    Applied         Workload Became Ready       No
❗️      FailedToApply   Workload Became Unhealthy   Yes

What does “rollback” mean?

Before StormForge applies a recommendation, it takes an in-memory snapshot of the resources it is about to modify. If StormForge determines it’s necessary to roll back the changes, StormForge force-applies that in-memory snapshot of the resource, reverting it to exactly the state it was in before the recommendation was applied.

You can disable this rollback feature. If rollback is disabled, you might see the following icon and status in lieu of FailedToApply.

Icon    Status    Validation                  Rolled back
⚠️      Applied   Workload Became Unhealthy   No
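The status tables in this section can be summarized as a small lookup. The sketch below only restates those tables; `recommendation_status` and its input labels (`timed-out`, `ready`, `unhealthy`) are hypothetical names chosen for illustration, not StormForge identifiers.

```shell
# Map a validation result to the Status / Validation / Rolled back values
# shown in the tables above.
# Usage: recommendation_status VALIDATION [ROLLBACK_ENABLED]
#   VALIDATION:       timed-out | ready | unhealthy   (hypothetical labels)
#   ROLLBACK_ENABLED: true (default) | false
# Prints: STATUS|VALIDATION|ROLLED_BACK
recommendation_status() {
  case "$1" in
    timed-out) echo "Applied|Timed out|No" ;;
    ready)     echo "Applied|Workload Became Ready|No" ;;
    unhealthy)
      if [ "${2:-true}" = "true" ]; then
        echo "FailedToApply|Workload Became Unhealthy|Yes"
      else
        # Rollback disabled: the change stays in place.
        echo "Applied|Workload Became Unhealthy|No"
      fi
      ;;
    *) echo "unknown validation: $1" >&2; return 1 ;;
  esac
}
```

For example, `recommendation_status unhealthy false` prints `Applied|Workload Became Unhealthy|No`, the row shown when rollback is disabled.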

Reconciling workload drift

You can control whether the Applier automatically reconciles workload drift, ensuring that recommended settings are maintained and not overwritten during CI/CD or deployment activity on the cluster. For details, see Continuous reconciliation in the Applier configuration topic.

Permissions

The Applier leverages the Kubernetes edit role, enabling it to update and patch all optimizable workloads (and HPA, if enabled). You can grant additional permissions by specifying additional RBAC in the Helm install command.

Recommendations

An Optimize Live recommendation is the set of resource requests and limits that the machine learning algorithm has determined to be optimal for a workload, based on historical utilization observations.

During the initial 7-day metrics collection period that follows installation, you can view preliminary recommendations based on the metrics collected so far:

  • 10 minutes after installation, you’ll see the first preliminary recommendation.
  • For the first 24 hours after installation, metrics collection continues and preliminary recommendations are generated hourly. You might notice the recommendations becoming more refined.
  • On days 2 through 7 after installation, metrics collection continues and preliminary recommendations are generated once daily.
  • On day 7, complete recommendations are available to apply, and continue to be generated on the schedule of your choosing (or once daily by default if you don’t set a schedule).
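The schedule above can be restated as a function of elapsed time since installation. This is a minimal sketch under stated assumptions: `recommendation_phase` is a hypothetical helper, and the exact boundaries (24 hours = 1440 minutes, day 7 = 10080 minutes) are one reading of the list, not values confirmed by StormForge.

```shell
# Hypothetical helper: describe which kind of recommendation to expect.
# Usage: recommendation_phase MINUTES_SINCE_INSTALL
recommendation_phase() {
  minutes=$1
  if [ "$minutes" -lt 10 ]; then
    echo "none yet"                 # first preliminary arrives at ~10 minutes
  elif [ "$minutes" -lt 1440 ]; then
    echo "preliminary, hourly"      # first 24 hours
  elif [ "$minutes" -lt 10080 ]; then
    echo "preliminary, daily"       # days 2 through 7 (assumed < 7 * 1440 minutes)
  else
    echo "complete"                 # day 7 onward
  fi
}
```

For example, `recommendation_phase 90` prints `preliminary, hourly`, and `recommendation_phase 10080` prints `complete`.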

It typically takes about 7 days’ worth of metrics to generate a recommendation that you can apply, often referred to as a complete recommendation. For this reason, we recommend waiting 7 days before applying recommendations.

How we generate recommendations

Optimize Live generates recommendations using our patent-pending machine learning. Our machine learning examines the metrics collected* (including CPU and memory requests and usage) and monitors usage patterns and scaling behavior to determine the optimal settings for:

  • CPU requests and limits
  • memory requests and limits
  • HPA target utilization, if a workload is scaling on the HPA

When generating a recommendation, the machine learning generates 3 candidate recommendations, one for each possible “optimization goal”:

  • savings (most aggressive candidate)
  • reliability (least aggressive)
  • balanced (default, falls between the other 2 candidates)

The more data that we collect, the better the recommendation that we generate, and our machine learning weights recent data more heavily. The recommendation schedule defines how often a recommendation is deployed and for how long that recommendation is considered “not stale.” For example, a recommendation with a daily schedule (the default value and best practice) should be deployed daily.

Our machine learning detects spikes in a workload and considers them when generating recommendations. To realize the most savings, consider deploying recommendations frequently (again, a best practice). When the machine learning detects that a workload has been scaled down to zero replicas, it does not provide a recommendation for that workload.

*For the full list of metrics, run:

helm show readme oci://registry.stormforge.io/library/stormforge-agent \
| grep "## Workload Metrics" -A 18

Last modified November 1, 2024