Optimize Live
Agent version 2.7.0
Added
- OpenShift support
You can now optimize workloads on clusters managed by OpenShift. When you install the StormForge Agent on a cluster, include the --set openshift=true argument. For details, see Install on Red Hat OpenShift Container Platform in the product docs.
- Support for automatic rightsizing of workloads scaled by a KEDA-owned HPA
Optimize Live can now provide recommendations and patch workloads that have an HPA that is managed by KEDA. Previously, Optimize Live could only make recommendations for these workloads.
Note: To apply KEDA patches, you must grant the StormForge Applier additional RBAC permissions. You can get the corresponding YAML file and instructions in the Rightsize workloads scaled by a KEDA-owned HPA guide in the product docs.
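For the OpenShift option above, the same setting can be kept in a Helm values file instead of on the command line. This is a minimal sketch that relies only on Helm's standard mapping between --set KEY=VALUE and values-file keys; the file name values-openshift.yaml is hypothetical.

```yaml
# values-openshift.yaml (hypothetical file name)
# Equivalent to passing --set openshift=true at install time.
openshift: true
```

Pass the file to helm install with -f values-openshift.yaml alongside your usual installation arguments.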
Applier version 2.1.0
Added
- Improved patch rollback and workload health checks
The Applier now performs additional health checks to ensure that a workload is in a healthy state before applying patches. If a patch fails, the Applier can now roll back all patches to reach the previous healthy state.
We also improved the Applier logging, making it easier to understand what a patch is doing, patch application progress, and the health of a workload after applying patches.
- Support for automatic rightsizing of workloads scaled by a KEDA-owned HPA
Although this is a new Agent (version 2.7.0) feature, you must grant the StormForge Applier additional RBAC permissions. You can get the corresponding YAML file and instructions in the Rightsize workloads scaled by a KEDA-owned HPA guide in the product docs.
Agent version 2.6.0
Added
- Optimize Live 30-day free trial
If you’re not already using Optimize Live, sign up at app.stormforge.io/signup and within minutes, you’ll have Optimize Live running on the Kubernetes cluster you specify.
Need to see it to believe it? This 3-min Getting Started video walks you through setup (which takes less than 2 minutes) and gives you a quick overview of the insights you’ll get in just 1 hour after installation.
- New StormForge Agent installation wizard
If you’re just starting out with Optimize Live, you can now install the StormForge Agent by using the Get Started wizard. Log in to app.stormforge.io with your StormForge login, and in the left navigation, click Overview.
- Recommendation schedules now support Cron format
When you configure workloads using Kubernetes annotations, you can now set a schedule using Cron format. Previously, only macros and ISO 8601 Duration strings were supported.
Examples:
  - Once daily (default value and best practice):
    live.stormforge.io/schedule: "H H * * *"
  - Every morning at approximately 0800h (exact time is not guaranteed):
    live.stormforge.io/schedule: "00 08 * * *"
To learn more about using annotations, check out the Configure by using annotations guide in the product docs.
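As a further illustration of the Cron format, the fragment below spells out the five standard Cron fields. The weekday range is an assumption: it is ordinary Cron syntax, but these notes confirm only the two expressions shown above, so check the Configure by using annotations guide before relying on it.

```yaml
# Standard five-field Cron order: minute, hour, day of month, month, day of week.
# Assumed example (not taken from the release notes): around 02:30, Monday through Friday.
live.stormforge.io/schedule: "30 02 * * 1-5"
```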
Applier version 2.0.6
Changed
- Applier Helm chart updates
We updated the Applier’s Helm chart to use new Agent secret names and values. No action is required unless you have changed Agent secret values.
Agent version 2.5.1
Fixed
- We fixed an encoding problem with the manageAuthSecret feature released in version 2.5.0.
Agent version 2.5.0
Changed
- Helm installation: --set stormforge.clusterName=CLUSTER_NAME must be changed to --set clusterName=CLUSTER_NAME.
- We now consume the prometheus image from quay.io instead of Docker Hub. To fall back to Docker Hub, pass the following parameter as part of the Helm installation: --set prom.image.repository=prom/prometheus.
- To manage or rotate authorization credentials outside of Helm, set the Helm value manageAuthSecret to false. If you set this value to false, make sure that the stormforge-agent-auth secret exists before installing or upgrading.
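The three changes above can also be expressed in a Helm values file. The sketch below assumes only Helm's standard mapping from --set paths to nested keys; the file name is hypothetical, and the last two settings are optional.

```yaml
# values.yaml (hypothetical file name) for Agent 2.5.0 or later
clusterName: CLUSTER_NAME        # replaces the old stormforge.clusterName path

prom:
  image:
    repository: prom/prometheus  # optional: fall back to Docker Hub instead of quay.io

manageAuthSecret: false          # only if the stormforge-agent-auth secret already exists
```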
Known Issues
- In certain environments, changing namespace-level StormForge annotations might not trigger reconciliation. In these scenarios, the annotations take effect only after the next Agent restart.
Agent version 2.4.1
Fixed
- The controller no longer crashes when reconciling WorkloadOptimizer custom resources.
Agent version 2.4.0
Added
- Support for setting default workload values at the cluster level
You can now use a configuration file or command line arguments as part of the helm install command to set default values for all workloads in a cluster. See the examples below.
Configuration file: Create a .yaml file and set the workload- and container-level values.
clusterDefaultConfig:
  schedule: P1D
  containersCpuRequestsMin: 100m,istio-proxy=50m
Command line arguments:
--set clusterDefaultConfig.schedule=P1D \
--set clusterDefaultConfig.containersCpuRequestsMin=100m,istio-proxy=50m
- Support for setting default workload values at the namespace level
You can now use annotations to set default workload values at the namespace level. Namespace-level values override cluster-level values.
Add the annotations to the metadata.annotations section of the namespace values file.
apiVersion: v1
kind: Namespace
metadata:
  annotations:
    live.stormforge.io/schedule: "P1D"
    live.stormforge.io/containers.cpu.requests.min: "100m,istio-proxy=50m"
  creationTimestamp: "2023-07-20T17:28:42Z"
  labels:
    kubernetes.io/metadata.name: kube-public
  name: kube-public
  resourceVersion: "9"
  uid: <UID>
spec:
  ...
Changed
- Updated the Helm installation to simplify proxy configuration
  - You can now add --set proxyUrl=http://proxy.example.com to the helm install command to specify a proxy server.
  - The corresponding no_proxy variable is now seeded as follows with the loopback range and the RFC 1918 private address ranges:
    no_proxy:127.0.0.0/8,10.0.0.0/8,172.16.0.0/12,192.168.0.0/16
If these seeded values meet your organization’s needs, you no longer have to create a separate proxy.yaml file as described in this advanced installation scenario in the product docs.
- Updated the Helm installation to expose Prometheus scrape interval
In typical installations, you don’t need to change this interval. By default, Prometheus scrapes metrics every 15s.
If StormForge Support requests that you change this interval, you can change it by adding --set prom.scrapeInterval=INTERVAL to the helm install command, replacing INTERVAL with the suggested frequency.
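The proxy and scrape-interval options above can also be kept in a Helm values file. This is a minimal sketch that assumes only Helm's standard mapping from --set paths to nested keys; the file name is hypothetical and the 30s interval is an illustrative value, not a recommendation.

```yaml
# values-proxy.yaml (hypothetical file name)
proxyUrl: http://proxy.example.com   # same as --set proxyUrl=http://proxy.example.com

prom:
  scrapeInterval: 30s                # illustrative only; the default is 15s
```

Leave prom.scrapeInterval out entirely unless StormForge Support asks you to change it.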
Fixed
- We now handle workload updates more gracefully, correcting an issue that caused controller panic.
Applier version 2.0.5
This release contained internal enhancements only. No action is required.
Agent version 2.3.0
Added
- Support for ingesting workload labels
By default, Optimize Live now ingests labels on Kubernetes workload objects, ensuring that StormForge workload labels match your workload labels and making it easier for you to search for workloads.
To change this default behavior, set collectLabels to false in a helm install or helm upgrade command:
--set collectLabels=false
- Support for setting container default values by using annotations
You can now set default values for all containers in a workload by using annotations in pod objects.
In the Pod template metadata (spec.template.metadata.annotations) of a Deployment object, use the following syntax:
live.stormforge.io/containers.CONTAINER_PARAMETER: "DEFAULT_VALUE"
Example: To set cpu.requests.min to 20m for all containers in a workload:
live.stormforge.io/containers.cpu.requests.min: "20m"
For the complete list of all annotations, see the Configure by using annotations guide in the product docs.
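To show where the container-default annotation above sits in a full manifest, here is a minimal Deployment sketch. The workload name, labels, container name, and image are hypothetical placeholders; only the annotation key and value come from these notes.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical workload name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
      annotations:
        # Default minimum CPU request for every container in this workload
        live.stormforge.io/containers.cpu.requests.min: "20m"
    spec:
      containers:
        - name: web            # hypothetical container
          image: nginx:1.25    # placeholder image
```

The value is quoted so that YAML treats it as a string, which Kubernetes requires for annotation values.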
Changed
- Prometheus agent version bumped to version 2.45 (Prometheus Long-Term Support release)
Applier version 2.0.4
- Additional custom Prometheus metrics
You can now get more Applier performance details with the following Prometheus metrics:
  - The following new metrics are vectored and have workload_namespace and workload_resource labels:
    - sf_patches_processed_total
    - sf_patches_failed_total
    - sf_patches_rolled_back_total
    - sf_patches_rollback_failures_total
  - The following new metric does not have the workload_namespace and workload_resource labels because it applies only to the Applier, not a specific workload:
    - sf_applier_api_disconnections_total
Agent version 2.2.0
- Workload garbage collection
We now run garbage collection (by default, every hour) to ensure that when workloads are removed from the cluster, they’re also removed from Optimize Live. To change this reconciliation interval, edit the workload.workloadSyncInterval value in the workload’s values.yaml file.
- Regex support
We added regex support for the optional allowNamespaces and denyNamespaces parameters that you can provide when you install the StormForge agent.
Applier version 2.0.3
- Patch rollback support
We added rollback logic: If an error occurs when applying a patch for a recommendation, successful patches are rolled back.
Version 2.0
Be sure to check out our press release!
Added
- New install method: Helm chart for stormforge-agent
You no longer need to download the StormForge CLI to install. Get up and running within minutes, and download the CLI later to run StormForge commands to manage your cluster.
- Optional Applier installation using a single Helm command
We separated the Agent and the Applier to simplify the permissions required at install.
If you plan to apply configurations on demand (outside of a regular schedule) or automatically outside of a CI/CD workflow, be sure to install the Applier component.
- Control workload metrics collection with allow and deny lists
By default, Optimize Live collects metrics on all the workloads in the cluster.
To restrict metrics collection to specific namespaces, you can provide a namespace allow list or deny list when you install the agent.
- New UI workflows
Workload List View: See all workloads and the recommendations for those workloads, view cost estimates, and search against workload name, namespace, or cluster.
Workload Detail View: View the impact of the recommendations on the workload, drill into the recommendation details, and get container-specific details. Export or download the patch from the UI.
Removed
- In-cluster components
We no longer install the recommender or time-series database (TSDB).
Version 0.7.8
Fixed
- Permissions issue during upgrade
This release fixes a permissions issue that sometimes caused the TSDB to crash when upgrading an existing Optimize Live installation.
Version 0.7.7
Added
- All components now run as non-root
Individual components (TSDB, Applier, Recommender, Grafana) now run with runAsNonRoot: true set in their PodSecurityContext. The Controller continues to run as non-root by default. This feature is helpful if you deploy Optimize Live in clusters that have security policies that require all containers to run as non-root.
- Improved handling of Datadog rate limit errors
The TSDB now gracefully handles HTTP 429 responses from the Datadog API. If Datadog is your metrics provider, you’ll see better performance when the Datadog rate limit is reached.
Version 0.7.6
Controller
Added
- Support for DaemonSet optimization
Optimize Live can now optimize DaemonSets in workloads, resulting in even more resource savings.
Fixed
- You can now specify any Grafana image or version
The Controller can now install Grafana using the image repository and tag that you specify in the Helm chart values.yaml file. Previously, the Controller installed the latest version of Grafana from the official registry only.
In the values.yaml file, use this format:
grafana:
  image:
    repository: docker.io/grafana/grafana
    pullPolicy: IfNotPresent
    tag: 8.2.0
Version 0.7.5
Recommender
Added
- Support for workloads that scale based on custom metrics in the HorizontalPodAutoscaler
Optimize Live now produces a recommendation to size the workload to best align with the currently configured HorizontalPodAutoscaler custom metric. Previously, CPU utilization metrics were the only supported HorizontalPodAutoscaler metric.
Version 0.7.4
Recommender and Controller
Updated
- Show recommendations even if some workloads in an application fail
Optimize Live now, by default, shows recommendations even if it couldn’t generate recommendations for all discovered workloads (for example, when workloads crash or fail, or when new workloads don’t yet have enough metrics data).
Previously, recommendations were shown only if they were computed for all discovered workloads. To restore the previous behavior, set FF_ONLY_COMPLETE_RECOMMENDATIONS=true in the extraEnvVars section of the Helm chart.
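As a rough illustration of the setting above, the fragment below shows one way the extraEnvVars section might look. The list-of-name/value layout and the top-level placement are assumptions based on a common Helm chart convention, not something these notes confirm; check the chart's values schema for the actual shape.

```yaml
# Assumed shape: a list of name/value pairs under extraEnvVars.
extraEnvVars:
  - name: FF_ONLY_COMPLETE_RECOMMENDATIONS
    value: "true"   # quoted because environment variable values are strings
```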
UI enhancements
- Launch from the left navigation
Launch or switch between Optimize Live and Optimize Pro from the left navigation rather than from the tabs within an application. This update takes you to your applications and recommendations faster.
Version 0.7.3
Controller
Added
- Deleting a Live object now deletes the corresponding application
When you delete a Live object from your cluster, Optimize Live now also deletes the application from the UI and the API. To restore the original behavior (in which the application isn’t deleted from the UI and API), label the Live object by running this command:
kubectl label -n stormforge-system live/my-app live.optimize.stormforge.io/skipSync=skip
- Grafana cleanup when uninstalling Optimize Live
When you uninstall Optimize Live, we now ensure all Grafana processes are also deleted.
TSDB
Fixed
- Backfill duration of 0s now kicks off metrics collection
We now start collecting metrics when you configure the TSDB to skip backfilling (TSDB_BACKFILL_DURATION=0s). In previous releases, this setting didn’t kick off metrics collection.
Version 0.7.2
TSDB
Added
- Expose recommendation count, recommendation tx/timestamp metrics
The following Optimize Live metrics are available via the /metrics endpoint:
  - optimize_live_recommendation_count, which displays a count of the most recent number of recommendations received
  - optimize_live_recommendation_timestamp, which displays a timestamp of when the last set of recommendations was made
  - optimize_live_tsdb_series_timestamp, which displays a timestamp for each top level metric we ingest (limits, requests, usage, etc.)
Fixed
- Limit Datadog query length when querying HPA metrics
We now ensure that queries sent via the Datadog API don’t exceed Datadog’s maximum query length of 8000 characters. This check was missing when we added support for HPA recommendations.
Controller
Added
- Support for pvc-less TSDB
You can now configure the TSDB to run without a PV/PVC by setting TSDB_NO_PVC="true". Because this makes the TSDB data ephemeral, you should do it only in specific situations. The TSDB_PVC_SIZE setting can still be used to set a size limit when there is no PVC.
- Support for limitRequestRatio configuration parameter
You can now configure how much headroom to add to the request recommendation for the limit. As the name suggests, this is a ratio between the limit and request. By default, this ratio is set to 1.2, which means that the limit recommendation is set to the requests recommendation plus 20%. For example, a CPU request recommendation of 500m yields a limit recommendation of 600m.
- Reducing the number of reconciliations via a feature flag
In large environments, you might choose to reduce the number of watches on the API server. To configure the controller to no longer watch components that it owns, set the FF_NO_OWNS environment variable. When this is set, the controller no longer watches for events from the TSDB, recommender, or applier resources.
- Add diff on tsdb and recommender ConfigMaps when debug mode is enabled
When DEBUG is enabled, you’ll see a diff of the tsdb and recommender ConfigMaps in the logs, making it easier to discover what was changed during a reconcile.
Fixed
- Sort discovered HPAs
When multiple HPAs are configured for a target, we now sort this list to prevent unnecessary configuration churn.
- One lookup for CPU and Memory targets
We now do one lookup for both CPU and memory targets. Previously, we did separate lookups for CPU and memory targets, which could leave the CPU and memory recommendations matched to different sets of targets.
Misc
- Change the log level to error when no targets found
For easier troubleshooting, we now set the log level to error when no targets are found. Previously, because we would query again, we would log this at the info level.
- Use Interval instead of deprecated UpdateInterval
We now use Interval only.
Version 0.7.1
Controller and Applier
Added
- Support for reducing the resources used by a cluster
If you have many applications (for example, upwards of a couple hundred) and apply recommendations conservatively (for example, every few days), you can set FF_PATCHER="true" in the extraEnvVar section of your Helm chart. This consolidates and simplifies the cluster component stack and does not negatively affect cluster performance.
- Support for persisting patches to ConfigMaps
When you set FF_PATCHER="true", you can now have the Controller write a patch to a ConfigMap by setting FF_PERSIST_PATCH="true". Writing a patch to a ConfigMap is useful for troubleshooting cluster resource use.
- Improved HPA logging
If no HPA targets are discovered, this information is now logged at the info level. Previously, it was logged at the error level.
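The FF_PATCHER and FF_PERSIST_PATCH flags described above might be set together as sketched here. The name/value list layout is an assumption based on a common Helm chart convention, and the key name simply follows the extraEnvVar wording used above; confirm both against your chart's values schema before applying.

```yaml
# Assumed shape and key name; these notes refer to it as the extraEnvVar section.
extraEnvVar:
  - name: FF_PATCHER
    value: "true"
  - name: FF_PERSIST_PATCH   # only meaningful when FF_PATCHER is "true"
    value: "true"
```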
Workaround
- No recommendations in an HPA setup
In some HPA setups, the Recommender might not discover HPA targets and therefore cannot generate recommendations. Sometimes this scenario occurs because the Kubernetes version and kube-state-metrics version are not compatible.
Workaround: Try downgrading your kube-state-metrics version.
Version 0.7.0
- Updated Helm chart value: DEBUG=false, and set DEBUG to Boolean in the values schema
- Grafana updates simplify the information you see on dashboards
UI enhancements
On the Configure Recommendations page:
- In the Optional Settings section, you can now specify the CPU target utilization of the HPA recommendation.
- In the Advanced Settings section, you can choose either of the following:
  - Enable Guaranteed Quality of Service.
  - Exclude Memory Limits, CPU limits, or both from recommendations.
To access the Advanced Settings, contact your StormForge sales rep.
Version 0.6.0
- Added support for HPA constraints for min and max target CPU utilization
- Added support for collecting min and max replica metrics to provide better recommendations
Version 0.5.0 (HPA support)
Applier
Added
- Support for jsonpath custom patches
- Support for generating HPA patches
Controller
Added
- Support for bidimensional autoscaling
- Support for providing target utilization recommendations alongside CPU and memory
- Enabled HPA lookup by default
- Added recommendation labels to the dashboard to better filter results
Fixed
- We now correctly look up existing Live resources when syncing from the API
UI enhancements
- Added a progress bar that shows the progress of the TSDB backfill
- Added support for maximum CPU and memory limits
- Added a clusters list page
Version 0.4.0
Applier
Added
- Support for custom patches:
- You can now create a Live custom resource definition (CRD) to provision and configure a new Optimize Live instance
- You can now apply recommendations via a Live object
- Support for pods with multiple containers
- Support for arm64 architecture
- Updated Grafana dashboards:
  - You can choose low, medium, or high risk tolerance for both CPU and memory when viewing recommendation summaries
  - You can now see HPA-related data
- The Recommender now provides recommendations that honor the maximum limits that you specify
- You can now specify the following values in a Live object:
  - Maximum bound for CPU and memory requests
  - Minimum and maximum CPU and memory limits
TSDB
Added
- Significantly reduced TSDB ConfigMap size, which now allows 700+ targets per Live object (previously 100+ targets). For testing and troubleshooting, you can still add raw queries to the Controller’s configuration file, but you must add them manually.
Version 0.3.0
UI enhancements
- You can now set CPU and memory minimum limits when you configure recommendations
- Deleting an application in the UI now also deletes the application from the cluster
- A progress bar now displays data backfilling progress
- New search capability helps you to find your applications faster
Version 0.2.2
- Qualify Datadog metrics with cluster name
- Suppress log messages during backfill of data
- Fixed bug that could cause the recommender to stall
Version 0.2.1
- Added support for non-standard replicaset owners (e.g., rollouts)
- The Grafana dashboard has been updated to highlight the containers’ maximum usage
- The recommender now supports varying number of replicas
- The TSDB allows for customization of the persistent volume
- Added DEBUG log level for all the components
- The Controller supports proxies
- Beta support for Datadog as a metrics provider
Version 0.1.6
Optimize Live Launch
- Controller deploys the TSDB, the recommender, the applier and Grafana deployment
- Support for metrics stores in Prometheus