Monitoring the health of Optimize Live
Status Metrics
In addition to collecting workload-level metrics (the sources and descriptions of which can be found in the Security FAQ), the StormForge Agent can expose metrics that are valuable for monitoring the health of Optimize Live.
The Agent produces status metrics related to the processing of workloads, and the Applier produces status metrics related to the status of resource recommendation patches.
For example, a relatively low value of sf_workloads_processed_total might indicate an inability to see cluster data. Similarly, a relatively high value of sf_applier_patches_failed_total might point to workload or cluster issues, or possibly the need to update the Applier configuration.
The following tables list the metrics available at the Agent and Applier metrics endpoints.
Agent Metrics Endpoint | Description |
---|---|
sf_workloads_failed_processed_total | Total number of workloads for which recommendations failed to generate |
sf_workloads_inactive_deletion_total | Total number of inactive workloads that have been deleted from StormForge |
sf_workloads_inactive_failed_total | Total number of inactive workloads that have failed to be deleted from StormForge |
sf_workloads_processed_namespace | Number of workloads processed by StormForge per namespace |
sf_workloads_processed_namespace_resource | Number of StormForge workloads processed per namespace per resource |
sf_workloads_processed_resource | Number of StormForge workloads processed per resource |
sf_workloads_processed_total | Number of StormForge workloads processed |

Applier Metrics Endpoint | Description |
---|---|
sf_applier_api_disconnections_total | Total number of times the applier loses connection to the API |
sf_applier_patches_expired_total | Total number of StormForge-generated patches that have expired |
sf_applier_patches_failed_total | Total number of StormForge-generated patches that have failed to apply |
sf_applier_patches_processed_total | Total number of StormForge-generated patches |
sf_applier_patches_rolled_back_total | Total number of StormForge-generated patches that have been rolled back to the previous resource values |
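To view these metrics, port-forward to the relevant deployment in Kubernetes, as shown below. Both commands bind local port 8080, so run one at a time or change the local port (the first number in 8080:8080) for one of them: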
# View agent metrics
kubectl port-forward deploy/stormforge-agent-workload-controller 8080:8080 -n stormforge-system
# View applier metrics
kubectl port-forward deploy/stormforge-applier 8080:8080 -n stormforge-system
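Once a port-forward is running, the metrics can be read from the forwarded local port. The sketch below assumes the endpoint serves the standard Prometheus text exposition format at the conventional /metrics path. Neither detail is stated above, so verify both against your installation. The helper name sf_metric_sum is hypothetical. It adds up every sample of one metric, which helps when a metric is reported once per label set.

```shell
# Hypothetical helper: sum all samples of one metric read from stdin.
# Assumes Prometheus text exposition format: name{labels} value [timestamp]
sf_metric_sum() {
  awk -v name="$1" '
    /^#/ { next }                       # skip HELP and TYPE comment lines
    {
      line = $0
      sub(/[{].*[}]/, "", line)         # drop the label set, if present
      n = split(line, field, " ")
      if (n >= 2 && field[1] == name) { sum += field[2]; found = 1 }
    }
    END { if (!found) exit 1; print sum }
  '
}

# Usage, with the agent port-forward running in another terminal
# (the /metrics path is an assumption):
#   curl -s http://localhost:8080/metrics | sf_metric_sum sf_workloads_processed_total
```

The function exits with a non-zero status when the metric is absent, so a missing metric is not mistaken for a value of zero.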
Logs
To view output and errors related to the Agent workloads, run:
kubectl logs -n stormforge-system -l app.kubernetes.io/component=agent --tail=-1
To describe the Pod and Events for the Agent, run:
kubectl describe pod -n stormforge-system -l app.kubernetes.io/component=agent
To view output and errors related to the Applier, run:
kubectl logs -n stormforge-system -l app.kubernetes.io/component=applier --tail=-1
To describe the Pod and Events for the Applier, run:
kubectl describe pod -n stormforge-system -l app.kubernetes.io/component=applier
The Troubleshooting documentation describes several warnings and errors that you might see in the logs. You can also ingest the logs into a SIEM tool for analysis and event-driven automation.
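To skim the logs for warnings and errors before forwarding them elsewhere, the output of the log commands above can be filtered. This is a minimal sketch: the helper name sf_log_problems is hypothetical, and matching the words "error" and "warn" case-insensitively is an assumption, since the log format is not described here.

```shell
# Hypothetical helper: keep only lines that mention an error or warning.
# The pattern is an assumption about how log lines are worded.
sf_log_problems() {
  grep -iE 'error|warn'
}

# Usage:
#   kubectl logs -n stormforge-system -l app.kubernetes.io/component=agent --tail=-1 | sf_log_problems
```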