Monitoring the health of Optimize Live

View status metrics and log data for Optimize Live health insights

Status Metrics

In addition to collecting workload-level metrics (the sources and descriptions of which can be found in the Security FAQ), the StormForge Agent exposes metrics that are valuable for monitoring the health of Optimize Live.

The Agent produces status metrics about the processing of workloads, and the Applier produces status metrics about the status of resource recommendation patches.

For example, a relatively low value of sf_workloads_processed_total might indicate that the Agent cannot see cluster data. Similarly, a relatively high value of sf_applier_patches_failed_total might point to workload or cluster issues, or to a need to update the Applier configuration.
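As a rough illustration of reading these values, the sketch below totals one metric from endpoint output that has already been saved or piped in. It assumes the endpoints emit the Prometheus text exposition format without timestamps, which this page does not confirm, and metric_sum is a hypothetical helper name rather than part of Optimize Live.

```shell
#!/usr/bin/env bash
# metric_sum <metric_name>
# Reads metrics text on stdin and prints the sum of every sample of <metric_name>.
# ASSUMPTION: input is Prometheus text format, one "name{labels} value" per line,
# with no trailing timestamp and no spaces inside label values.
metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      key = $1
      sub(/[{].*/, "", key)            # drop any {label="..."} suffix
      if (key == name) total += $NF    # last field is the sample value
    }
    END { printf "%d\n", total }       # prints 0 when the metric is absent
  '
}

# Example (metrics.txt is a hypothetical file holding saved endpoint output):
#   metric_sum sf_applier_patches_failed_total < metrics.txt
```

Comparing the printed totals over time, rather than reading a single value in isolation, is what makes "relatively low" or "relatively high" meaningful.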

The following tables list the metrics available at each endpoint.

Agent Metrics Endpoint                     Description
sf_workloads_failed_processed_total        Total number of workloads for which recommendations failed to generate
sf_workloads_inactive_deletion_total       Total number of inactive workloads that have been deleted from StormForge
sf_workloads_inactive_failed_total         Total number of inactive workloads that have failed to be deleted from StormForge
sf_workloads_processed_namespace           Number of workloads processed by StormForge per namespace
sf_workloads_processed_namespace_resource  Number of StormForge workloads processed per namespace per resource
sf_workloads_processed_resource            Number of StormForge workloads processed per resource
sf_workloads_processed_total               Number of StormForge workloads processed
   
Applier Metrics Endpoint              Description
sf_applier_api_disconnections_total   Total number of times the applier loses connection to the API
sf_applier_patches_expired_total      Total number of StormForge-generated patches that have expired
sf_applier_patches_failed_total       Total number of StormForge-generated patches that have failed to apply
sf_applier_patches_processed_total    Total number of StormForge-generated patches
sf_applier_patches_rolled_back_total  Total number of StormForge-generated patches that have been rolled back to the previous resource values

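To view these metrics, forward the metrics port of the Agent or the Applier to your local machine with kubectl port-forward, as shown below. Both commands bind local port 8080, so run them one at a time or change the local port in one of them: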

# View agent metrics
kubectl port-forward deploy/stormforge-agent-workload-controller 8080:8080 -n stormforge-system

# View applier metrics
kubectl port-forward deploy/stormforge-applier 8080:8080 -n stormforge-system
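While a port-forward is running, the metrics still have to be requested over HTTP. The following is a minimal sketch of doing both steps in one call. The /metrics path is an assumption based on common convention and is not confirmed by this page, and fetch_metrics and METRICS_PATH are hypothetical names introduced only for illustration.

```shell
#!/usr/bin/env bash
# ASSUMPTION: metrics are served at /metrics; override METRICS_PATH if they are not.
METRICS_PATH="${METRICS_PATH:-/metrics}"

# fetch_metrics <deployment> [local_port]
# Opens a temporary port-forward to container port 8080, prints the endpoint
# output, then closes the port-forward again.
fetch_metrics() {
  local deploy="$1" local_port="${2:-8080}" pf_pid status=0
  kubectl port-forward "deploy/${deploy}" "${local_port}:8080" \
    -n stormforge-system >/dev/null 2>&1 &
  pf_pid=$!
  sleep 2                                  # crude wait for the tunnel to come up
  curl -fsS "http://localhost:${local_port}${METRICS_PATH}" || status=$?
  kill "$pf_pid" 2>/dev/null || true       # stop the background port-forward
  wait "$pf_pid" 2>/dev/null || true
  return "$status"
}

# Example: use distinct local ports so the two calls cannot collide.
#   fetch_metrics stormforge-agent-workload-controller 8080
#   fetch_metrics stormforge-applier 8081
```

The fixed two-second wait keeps the sketch short. A more careful version would poll the local port until it accepts connections.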

Logs

To view output and errors related to the Agent workloads, run:

kubectl logs -n stormforge-system -l app.kubernetes.io/component=agent --tail=-1

To describe the Pod and Events for the Agent, run:

kubectl describe pod -n stormforge-system -l app.kubernetes.io/component=agent

To view output and errors related to the Applier, run:

kubectl logs -n stormforge-system -l app.kubernetes.io/component=applier --tail=-1

To describe the Pod and Events for the Applier, run:

kubectl describe pod -n stormforge-system -l app.kubernetes.io/component=applier
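When the full log output is long, it can help to narrow it to lines that look like problems. The sketch below is a plain text filter. It assumes that problem lines contain the words "error" or "warn" in some letter case, which depends on the actual log format and is not confirmed by this page. log_problems is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# log_problems
# Reads log lines on stdin and prints only those that mention "error" or "warn",
# ignoring case. Always succeeds, even when nothing matches.
# ASSUMPTION: severity appears as plain text in each log line.
log_problems() {
  grep -iE 'error|warn' || true
}

# Example:
#   kubectl logs -n stormforge-system -l app.kubernetes.io/component=agent --tail=-1 | log_problems
```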

The Troubleshooting documentation describes several warnings and errors that you might see in the logs. You can also ingest the logs into a SIEM tool for analysis and event-driven automation.
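How logs reach a SIEM tool depends on the tool, and many clusters already ship container logs through a log collector. As a hedged illustration of a manual hand-off only, the sketch below writes one component's logs to a file with timestamps and pod names attached. export_logs and the output file name are hypothetical choices, not part of Optimize Live.

```shell
#!/usr/bin/env bash
# export_logs <component> [output_dir]
# Writes all available logs for one component (agent or applier) to a file and
# prints the file path. --timestamps and --prefix add the time and the source
# pod to each line, which most ingestion pipelines need.
export_logs() {
  local component="$1" out_dir="${2:-.}"
  local out_file="${out_dir}/stormforge-${component}.log"   # hypothetical file name
  kubectl logs -n stormforge-system \
    -l "app.kubernetes.io/component=${component}" \
    --tail=-1 --timestamps --prefix > "$out_file" || return 1
  echo "$out_file"
}

# Example:
#   export_logs agent /tmp
#   export_logs applier /tmp
```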

Last modified October 31, 2024