Monitoring the health of Optimize Live

View status metrics and log data for Optimize Live health insights

Status Metrics

StormForge Agent components expose health and status metrics that are valuable for monitoring the health of Optimize Live. Health and status metrics are different from the workload metrics Optimize Live needs to create request and limit recommendations. (See the Security FAQ for the list and descriptions of the workload metrics.)

Health and status metrics for each StormForge component are listed in the tables below, under the component that produces them.

Workload Controller Metrics                Description
sf_workloads_failed_processed_total        Total number of workloads for which recommendations failed to generate
sf_workloads_inactive_deletion_total       Total number of inactive workloads that have been deleted from StormForge
sf_workloads_inactive_failed_total         Total number of inactive workloads that have failed to be deleted from StormForge
sf_workloads_processed_namespace           Number of workloads processed by StormForge per namespace
sf_workloads_processed_namespace_resource  Number of StormForge workloads processed per namespace per resource
sf_workloads_processed_resource            Number of StormForge workloads processed per resource
sf_workloads_processed_total               Number of StormForge workloads processed
   
Applier Metrics                       Description
sf_applier_api_disconnections_total   Total number of times the applier loses connection to the API
sf_applier_patches_expired_total      Total number of StormForge-generated patches that have expired
sf_applier_patches_failed_total       Total number of StormForge-generated patches that have failed to apply
sf_applier_patches_processed_total    Total number of StormForge-generated patches
sf_applier_patches_rolled_back_total  Total number of StormForge-generated patches that have been rolled back to the previous resource values
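Because the `_total` metrics appear to be cumulative counters (an assumption based on their names and descriptions), the useful signal is usually how much they grew between two scrapes rather than their absolute value. The following is a minimal sketch of that calculation; the helper names and the sample values are hypothetical and are not part of StormForge.

```python
def counter_increase(previous: float, current: float) -> float:
    """Return how much a cumulative counter grew between two scrapes.

    A drop in value is treated as a counter reset (for example, after a pod
    restart), in which case the current value is the growth since the reset.
    """
    if current < previous:
        return current
    return current - previous


def increases(prev_sample: dict, curr_sample: dict) -> dict:
    """Per-metric increase for every counter present in the current sample."""
    return {
        name: counter_increase(prev_sample.get(name, 0.0), value)
        for name, value in curr_sample.items()
    }


# Illustrative values only, not real output from a cluster.
earlier = {
    "sf_applier_patches_failed_total": 2.0,
    "sf_applier_patches_processed_total": 40.0,
}
later = {
    "sf_applier_patches_failed_total": 5.0,
    "sf_applier_patches_processed_total": 52.0,
}

print(increases(earlier, later))
```

In practice a monitoring system such as Prometheus performs this kind of calculation for you; the sketch only shows why a rising `sf_applier_patches_failed_total` matters more than its current value.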

 

You can scrape these metrics in OpenMetrics format from the following component pods and endpoints:

Component            Pod Selector Labels                                                                Port  Endpoint
Workload Controller  app.kubernetes.io/name: stormforge-agent and app.kubernetes.io/component: agent    8080  /metrics
Applier              app.kubernetes.io/name: stormforge-agent and app.kubernetes.io/component: applier  8080  /metrics
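For a quick manual check outside of a full monitoring stack, the endpoints above can be read with a few lines of Python. This is a rough sketch under stated assumptions: the pod's port 8080 is reachable locally (for example, through a port-forward), the `fetch_metrics` and `parse_metrics` helpers are hypothetical, the parser is deliberately simplified, and the sample text (including the `namespace` label) is illustrative rather than real output.

```python
import urllib.request


def fetch_metrics(url: str, timeout: float = 5.0) -> str:
    """Download the raw metrics text from a /metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


def parse_metrics(text: str, prefix: str = "sf_") -> dict:
    """Collect samples whose name starts with `prefix`.

    Simplified parser: comment lines are skipped, timestamps and exemplars
    are ignored, and each key is the sample name including any {labels}.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Drop an optional exemplar, which follows " # " in OpenMetrics.
        line = line.split(" # ", 1)[0]
        # The name ends after the closing "}" of the labels, or at the
        # first space when the sample has no labels.
        end = line.rindex("}") + 1 if "}" in line else line.find(" ")
        if end <= 0:
            continue
        name, rest = line[:end], line[end:].split()
        if not rest or not name.startswith(prefix):
            continue
        try:
            samples[name] = float(rest[0])
        except ValueError:
            continue
    return samples


# Illustrative input only; real output will differ.
SAMPLE = """\
# TYPE sf_applier_patches_processed counter
sf_applier_patches_processed_total 52
sf_applier_patches_failed_total 5
sf_workloads_processed_namespace{namespace="default"} 7
go_goroutines 31
# EOF
"""

for name, value in parse_metrics(SAMPLE).items():
    print(name, value)

# With port 8080 forwarded to localhost (an assumption), a live read would be:
# print(parse_metrics(fetch_metrics("http://localhost:8080/metrics")))
```

The prefix filter keeps only the `sf_` metrics listed in this topic and drops anything else the endpoint might expose.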

 

The Agent and Applier are implemented as Kubernetes Deployments:

  • The Agent consists of two Deployments: stormforge-agent-workload-controller and stormforge-agent-metrics-forwarder
  • The Applier consists of one Deployment: stormforge-applier

As such, output and errors related to the Agent and Applier workloads are available in the pod logs. To view the logs, complete the steps in the appropriate section of the Troubleshooting topic.

The Troubleshooting topic also describes several of the warnings and errors to watch for in the logs. You can ingest the logs into a SIEM tool for analysis and event-driven automation.
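As a rough sketch of collecting these logs from a script, the Deployment names listed above can be passed to `kubectl logs`. The helper functions and the `stormforge-system` namespace are assumptions made for illustration; use the namespace where the Agent and Applier are actually installed, and treat the Troubleshooting topic as the reference procedure.

```python
import subprocess

# Deployment names as listed in this topic.
DEPLOYMENTS = (
    "stormforge-agent-workload-controller",
    "stormforge-agent-metrics-forwarder",
    "stormforge-applier",
)

# Assumption: replace with the namespace used by your installation.
NAMESPACE = "stormforge-system"


def logs_command(deployment: str, namespace: str, tail: int = 200) -> list:
    """Build the kubectl command that prints recent logs for a Deployment."""
    return [
        "kubectl", "logs", f"deployment/{deployment}",
        "--namespace", namespace,
        f"--tail={tail}",
    ]


def fetch_logs(deployment: str, namespace: str, tail: int = 200) -> str:
    """Run kubectl and return the log text; raises if kubectl exits non-zero."""
    result = subprocess.run(
        logs_command(deployment, namespace, tail),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Print the commands rather than running them, so the sketch is safe to try
# without cluster access.
for name in DEPLOYMENTS:
    print(" ".join(logs_command(name, NAMESPACE)))
```

Calling `fetch_logs("stormforge-applier", NAMESPACE)` requires `kubectl` on the path and access to the cluster; the returned text could then be forwarded to whatever log pipeline or SIEM tool you use.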

Last modified November 15, 2024