Monitoring the health of Optimize Live
Status Metrics
StormForge Agent components expose health and status metrics that are valuable for monitoring the health of Optimize Live. Health and status metrics are different from the workload metrics Optimize Live needs to create request and limit recommendations. (See the Security FAQ for the list and descriptions of the workload metrics.)
Health and status metrics for each StormForge component are listed in the table below, under the component that produces them.
Workload Controller Metrics | Description |
---|---|
sf_workloads_failed_processed_total | Total number of workloads for which recommendations failed to generate |
sf_workloads_inactive_deletion_total | Total number of inactive workloads that have been deleted from StormForge |
sf_workloads_inactive_failed_total | Total number of inactive workloads that have failed to be deleted from StormForge |
sf_workloads_processed_namespace | Number of workloads processed by StormForge per namespace |
sf_workloads_processed_namespace_resource | Number of StormForge workloads processed per namespace per resource |
sf_workloads_processed_resource | Number of StormForge workloads processed per resource |
sf_workloads_processed_total | Number of StormForge workloads processed |

Applier Metrics | Description |
---|---|
sf_applier_api_disconnections_total | Total number of times the applier loses connection to the API |
sf_applier_patches_expired_total | Total number of StormForge-generated patches that have expired |
sf_applier_patches_failed_total | Total number of StormForge-generated patches that have failed to apply |
sf_applier_patches_processed_total | Total number of StormForge-generated patches |
sf_applier_patches_rolled_back_total | Total number of StormForge-generated patches that have been rolled back to the previous resource values |
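
Because the metrics whose names end in _total appear to be cumulative totals, a single reading says little on its own; the change between two scrapes is usually more informative. The following is a minimal Python sketch of that comparison, not part of the StormForge product. It assumes that these metrics are monotonically increasing counters that restart from zero when a pod restarts, and the choice of which metrics count as "failure" signals is an illustrative judgment. The function names are hypothetical.

```python
# Metrics from the tables above that this sketch treats as failure signals.
# The selection is an assumption for illustration; adjust it to your needs.
FAILURE_COUNTERS = (
    "sf_workloads_failed_processed_total",
    "sf_workloads_inactive_failed_total",
    "sf_applier_api_disconnections_total",
    "sf_applier_patches_failed_total",
    "sf_applier_patches_rolled_back_total",
)


def counter_increase(previous, current):
    """Return how much a counter grew between two scrapes.

    If the value went down, assume the counter was reset (for example by a
    pod restart) and treat the current value as the increase.
    """
    if current < previous:
        return current
    return current - previous


def new_failures(prev_scrape, curr_scrape):
    """Map each failure counter that grew to the size of its increase.

    Both arguments are dicts of metric name to value; a missing metric is
    treated as 0.
    """
    increases = {}
    for name in FAILURE_COUNTERS:
        delta = counter_increase(
            prev_scrape.get(name, 0.0), curr_scrape.get(name, 0.0)
        )
        if delta > 0:
            increases[name] = delta
    return increases
```

An empty result from new_failures means none of the selected counters grew between the two scrapes. A monitoring system with built-in rate functions would normally do this calculation for you; the sketch only shows the idea.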
You can scrape these metrics in OpenMetrics format from the following component pods and endpoints:
Component | Pod Selector Labels | Port | Endpoint |
---|---|---|---|
Workload Controller | app.kubernetes.io/name: stormforge-agent and app.kubernetes.io/component: agent | 8080 | /metrics |
Applier | app.kubernetes.io/name: stormforge-agent and app.kubernetes.io/component: applier | 8080 | /metrics |
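
For a quick manual check without a full monitoring stack, the endpoint can be read with a short script. The following is a minimal Python sketch, not an official client. It assumes the endpoint has been made reachable at http://localhost:8080/metrics (for example, by forwarding port 8080 of a component pod to your machine), and the function names are hypothetical. The parser is deliberately simplified: it ignores timestamps and exemplars, assumes label values contain no "}" character, and sums the samples of a metric across all of its label sets.

```python
import re
import urllib.request

# Matches "name value" or "name{labels} value". Simplified: ignores
# timestamps and exemplars and assumes no "}" inside label values.
SAMPLE_RE = re.compile(r"^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{[^}]*\})?\s+(\S+)")


def parse_metrics(text, prefix="sf_"):
    """Sum sample values per metric name for names starting with prefix."""
    totals = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks, HELP/TYPE comments, and the "# EOF" marker
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line this simple parser understands
        name, raw_value = match.groups()
        if not name.startswith(prefix):
            continue
        try:
            value = float(raw_value)
        except ValueError:
            continue
        # Samples that differ only by labels are added together.
        totals[name] = totals.get(name, 0.0) + value
    return totals


def scrape(url="http://localhost:8080/metrics", timeout=5.0):
    """Fetch and parse one scrape. The default URL is an assumption."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

Calling scrape() returns a dictionary keyed by metric name, such as sf_workloads_processed_total, containing only the sf_-prefixed metrics. For production monitoring, a scraper that implements the full exposition format is a better fit than this sketch.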
The Agent and Applier are implemented as Kubernetes Deployments:
- The Agent consists of two Deployments: stormforge-agent-workload-controller and stormforge-agent-metrics-forwarder
- The Applier consists of one Deployment: stormforge-applier
As such, output and errors related to the Agent and Applier workloads are available in the pod logs. To view logs, complete the steps in the appropriate section of the Troubleshooting topic:
- View the Agent logs
- Describe the Agent Pod and Events
- View the Applier logs
- Describe the Applier Pod and Events
The Troubleshooting topic also describes several of the warnings and errors to watch for in the logs, which can be ingested into a SIEM tool for analysis and event-driven automation.