Monitoring the health of Optimize Live

View status metrics and log data for Optimize Live health insights


Status Metrics

StormForge Agent components expose health and status metrics that are valuable for monitoring the health of Optimize Live. Health and status metrics are different from the workload metrics Optimize Live needs to create request and limit recommendations. (See the Security FAQ for the list and descriptions of the workload metrics.)

Health and status metrics for each StormForge component are listed in the tables below, grouped by the component that produces them.

Workload Controller Metrics	Description
sf_workloads_failed_processed_total	Total number of workloads for which recommendations failed to generate
sf_workloads_inactive_deletion_total	Total number of inactive workloads that have been deleted from StormForge
sf_workloads_inactive_failed_total	Total number of inactive workloads that have failed to be deleted from StormForge
sf_workloads_processed_namespace	Number of workloads processed by StormForge per namespace
sf_workloads_processed_namespace_resource	Number of StormForge workloads processed per namespace per resource
sf_workloads_processed_resource	Number of StormForge workloads processed per resource
sf_workloads_processed_total	Number of StormForge workloads processed

Applier Metrics	Description
sf_applier_api_disconnections_total	Total number of times the applier loses connection to the API
sf_applier_patches_expired_total	Total number of StormForge-generated patches that have expired
sf_applier_patches_failed_total	Total number of StormForge-generated patches that have failed to apply
sf_applier_patches_processed_total	Total number of StormForge-generated patches
sf_applier_patches_rolled_back_total	Total number of StormForge-generated patches that have been rolled back to the previous resource values
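The `_total` suffix suggests these metrics are monotonically increasing counters, so the change between two scrapes is usually more informative than the raw value. The following is a minimal sketch of deriving a patch failure ratio from two sets of samples. It assumes that `sf_applier_patches_processed_total` counts failed patches as well as successful ones, which the descriptions above do not state outright; the function names are hypothetical.

```python
PROCESSED = "sf_applier_patches_processed_total"
FAILED = "sf_applier_patches_failed_total"


def counter_increase(previous, current):
    """Return how much a counter grew between two scrapes."""
    # A counter only goes up, so a lower value means the process restarted
    # and the counter began again from zero.
    if current < previous:
        return current
    return current - previous


def patch_failure_ratio(before, after):
    """Return failed / processed patches between two scrapes, or None.

    `before` and `after` map metric names to values. Assumption: the
    processed total includes failed patches.
    """
    processed = counter_increase(before.get(PROCESSED, 0.0), after.get(PROCESSED, 0.0))
    failed = counter_increase(before.get(FAILED, 0.0), after.get(FAILED, 0.0))
    if processed == 0:
        return None  # No patches were processed in this interval.
    return failed / processed
```

A ratio that stays above whatever threshold suits your environment is one possible signal to raise an alert on; the same pattern applies to the other counters in the tables.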

You can scrape these metrics in OpenMetrics format from the following component pods and endpoints:

Component	Pod Selector Labels	Port	Endpoint
Workload Controller	`app.kubernetes.io/name: stormforge-agent` and `app.kubernetes.io/component: agent`	8080	/metrics
Applier	`app.kubernetes.io/name: stormforge-agent` and `app.kubernetes.io/component: applier`	8080	/metrics
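In most clusters a Prometheus-compatible collector would scrape these endpoints, but for a quick manual check the output can also be read directly. The following is a rough sketch that fetches the endpoint and keeps only the `sf_` samples. It assumes that port 8080 of the pod has been made reachable at `localhost:8080` (for example, through a port-forward) and that label values do not contain a `}` character; the function names and the `namespace` label in the usage note are hypothetical.

```python
import urllib.request


def parse_metrics(text, prefix="sf_"):
    """Parse text-format metrics into a {series: value} dictionary."""
    samples = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines, comments (# HELP, # TYPE, # EOF), and other metrics.
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        name_end = line.find(" ")
        brace = line.find("{")
        if brace != -1 and (name_end == -1 or brace < name_end):
            # Labeled series: the key runs through the closing brace.
            close = line.find("}", brace)
            if close == -1:
                continue
            series, rest = line[:close + 1], line[close + 1:]
        elif name_end != -1:
            series, rest = line[:name_end], line[name_end:]
        else:
            continue
        fields = rest.split()
        if not fields:
            continue
        try:
            # The first field is the value; an optional timestamp is ignored.
            samples[series] = float(fields[0])
        except ValueError:
            continue
    return samples


def scrape(url="http://localhost:8080/metrics", timeout=5.0):
    """Fetch the metrics endpoint (assumed URL) and parse its sf_ samples."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling `scrape()` returns a dictionary keyed by series, such as `sf_workloads_processed_total` or, if the metric carries labels, a key like `sf_workloads_processed_namespace{namespace="default"}`.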

The Agent and Applier are implemented as Kubernetes Deployments:

The Agent consists of two Deployments: stormforge-agent-workload-controller and stormforge-agent-metrics-forwarder
The Applier consists of one Deployment: stormforge-applier

Because they run as Deployments, output and errors from the Agent and Applier workloads are available in the pod logs. To view the logs, complete the steps in the appropriate section of the Troubleshooting topic.
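The Troubleshooting topic remains the authoritative source for those steps. As a rough illustration only, the sketch below builds a `kubectl logs` command for each Deployment named above. The `stormforge-system` namespace and the one-hour window are assumptions to adjust for your installation, and the helper names are hypothetical.

```python
import subprocess

# Deployment names as listed in this topic.
DEPLOYMENTS = (
    "stormforge-agent-workload-controller",
    "stormforge-agent-metrics-forwarder",
    "stormforge-applier",
)


def logs_command(deployment, namespace="stormforge-system", since="1h"):
    """Build the kubectl argument list for reading a Deployment's recent logs."""
    # The namespace default is an assumption; use the namespace where the
    # StormForge components are installed in your cluster.
    return [
        "kubectl",
        "logs",
        f"deployment/{deployment}",
        f"--namespace={namespace}",
        f"--since={since}",
    ]


def fetch_logs(deployment, **options):
    """Run kubectl and return the log text; raises if kubectl exits non-zero."""
    result = subprocess.run(
        logs_command(deployment, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Note that `kubectl logs deployment/<name>` reads from a single pod of the Deployment, so a Deployment scaled to several replicas would need a per-pod query.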

The Troubleshooting topic also describes several of the warnings and errors to watch for in the logs, which can be ingested into a SIEM tool for analysis and event-driven automation.

Last modified November 15, 2024