Troubleshooting

Get help with common questions, errors, and warnings

Most troubleshooting involves checking the Agent status, the Agent logs, or the ConfigMap, as described under Key points below.

To help you troubleshoot faster, this page answers some common questions and explains specific errors and warnings.

Key points:

Most troubleshooting involves the commands listed below.

To view high-level information such as Agent status and cluster connection status:
helm test stormforge-agent -n stormforge-system --logs

Examine the output to determine:

  • Whether the cluster name is valid
  • Whether the application can connect to the api.stormforge.io endpoint
  • Whether the Agent is able to start
To check whether the Agent is running:
kubectl get pods -n stormforge-system -l app.kubernetes.io/component=agent
To view the Agent logs:
kubectl logs -n stormforge-system -l app.kubernetes.io/component=agent --tail=-1
To describe the Pod and Events for the Agent:
kubectl describe pod -n stormforge-system -l app.kubernetes.io/component=agent

Examine the output:

  • Check the pod status, state and reason, conditions, and events.
  • For each container in the Containers section of the output, check the arguments, state, and the liveness and readiness probes.
To view the ConfigMaps:
kubectl get configmaps -n stormforge-system -l app.kubernetes.io/name=stormforge-agent -o yaml

Scan the relevant ConfigMaps for any unusual or unexpected settings. If you need to make any changes, you may edit the ConfigMaps for a quick fix, but manual edits will be overwritten on upgrade. It is therefore preferable to identify the correct Helm values needed to change the ConfigMap and use them by running helm upgrade.
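
For example, if the ConfigMap setting you changed is backed by a chart value, the upgrade might look like this sketch (the logLevel value is hypothetical; substitute the Helm value that corresponds to your change):

helm upgrade stormforge-agent oci://registry.stormforge.io/library/stormforge-agent \
  -n stormforge-system \
  --reuse-values \
  --set logLevel=debug    # logLevel is a hypothetical value name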

Restart the agent

In some scenarios, restarting the Agent solves the problem. Remember that this will result in a few moments of downtime while new pods scale up.

kubectl rollout restart deployment stormforge-agent -n stormforge-system

A successful restart generates this message:
deployment.apps/stormforge-agent restarted

Common questions

A recommendation wasn’t applied to a workload. What can I check?

Check the status field of the recommendation. Check the Applier logs.

The Agent installation failed. What can I check?

If the Agent didn’t install correctly, work through the commands listed under Key points above:

  • Run the helm test command to verify the cluster name and the connection to the api.stormforge.io endpoint.
  • Check whether the Agent pod is running.
  • View the Agent logs, describe the Agent pod, and scan the output for errors.

The Applier installation failed. What can I check?

The Applier uses the secret provided by the Agent. Check for the existence of the StormForge Agent and its secret by running:

kubectl get deployment,secret stormforge-agent -n stormforge-system

The expected output is similar to this:

NAME                               READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/stormforge-agent   1/1     1            1           31d

NAME                      TYPE     DATA   AGE
secret/stormforge-agent   Opaque   5      31d

If one or both of the Agent deployment or the secret are missing, reinstall the Agent and then try installing the Applier again.
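
For reference, a reinstall might look like this sketch, assuming your access-credentials file is named credentials.yaml and the chart accepts a clusterName value (use the file name and values from your original installation):

helm install stormforge-agent oci://registry.stormforge.io/library/stormforge-agent \
  -n stormforge-system \
  --create-namespace \
  --values credentials.yaml \
  --set clusterName=CLUSTER_NAME    # clusterName value name is an assumption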

How do I troubleshoot the Agent after it’s installed?

If you know the Agent is installed but something doesn’t quite seem right, this might be because:

  • The cluster you installed on might not have any workloads (this is common).
  • The Agent doesn’t have Kubernetes permissions to view workloads.

Check whether there are any workloads on the cluster
Run kubectl get deployments (include the -n NAMESPACE argument if you’re not checking the default namespace). If there are no deployments listed in the output, Optimize Live won’t collect any metrics.

  • If you need a sample workload, you can download the Optimize Live sample application, which installs one workload on your cluster. From a command line, run:
    helm install optlive-showcase-app oci://registry.stormforge.io/examples/optlive-showcase-app
    
    Learn more about the sample application in this guide.

I don’t see the workloads (or namespaces) that I expect on the dashboard. What can I check?

Reasons why you might not see some workloads or namespaces include:

  • The namespace exists but is empty: no pods, no built-in workload types.

  • The namespace contains only batch jobs (which are not supported) or pods without an owner reference (this is rare).

  • The namespace has no workloads in a healthy state: for example, a DaemonSet isn’t fully operational because not all of its pods are healthy.

  • The allowNamespaces or denyNamespaces Helm chart values constrain which namespaces are optimized.

    Note: By default, workloads in the kube-system and openshift-* namespaces are in the denyNamespaces list, and you won’t see these namespaces or their workloads on the dashboard.

There are a few things you can check: pods on the cluster, Agent logs, and the allow/deny namespaces lists.

Get a list of all running pods on the cluster
Run:

kubectl get all -A

You might choose to modify this command or add other arguments, based on your familiarity with Kubernetes.

In the output:

  • In the NAME column, look at the pod list (pod/POD_NAME rows) and make sure the stormforge-agent-POD_SUFFIX pod and your application pods in other namespaces are running.
  • In the NAME column, look for batch types (job.batch and cronjob.batch rows): if a namespace contains only these, make sure it also contains other Kubernetes types (deployment.apps, replicaset.apps, daemonset.apps).
  • For each item in the STATUS column with a status of Running, check the READY column and make sure all the pods are ready. For example, 7/7 means the workload is healthy (all 7 pods are running); 3/7 is unhealthy.
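
To surface unhealthy pods faster on a busy cluster, you can also filter by phase:

kubectl get pods -A --field-selector=status.phase!=Running

This lists only pods that aren’t in the Running phase. An empty result means every pod is at least scheduled and running; it doesn’t guarantee readiness, so still check the READY column as described above.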

Check the Agent logs

  1. Get the Agent logs as described above.
  2. Scan the output for errors and any unusual or unexpected settings.
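
If the logs are long, a quick filter can help surface problems (the search terms are only a starting point):

kubectl logs -n stormforge-system -l app.kubernetes.io/component=agent --tail=-1 | grep -iE 'error|warn'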

Check whether the allowNamespaces or denyNamespaces constraints are too restrictive

  1. Get the Optimize Live ConfigMap as described above.
  2. Find workload-agent.yml in the output, and look for the allowNamespaces or denyNamespaces setting.
    • If either list is present and doesn’t match what you expect, adjust it as needed; see the sketch after this list. If both are present, allowNamespaces takes precedence and denyNamespaces is ignored.
    • If neither list appears in this spec, then neither list has been set, and something else is causing the problem.
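
For reference, here is a minimal sketch of how these lists might look in your Helm values file, assuming the chart exposes them as top-level values (the namespace name is a placeholder):

allowNamespaces:
  - my-app-namespace
denyNamespaces:
  - kube-system

As noted under Key points, apply any change with helm upgrade rather than by editing the ConfigMap directly.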

How do I grant RBAC permissions on a patch target?

To apply or export the recommended workload settings, Optimize Live must:

  • Support the patch target type.
  • Have RBAC access permissions on the patch target.

When these permissions are missing, you’ll see a message similar to this:

please grant rbac access permissions on name/namespace: %s/%s, kind: %s, apiversion: %s
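
For example, a filled-in message for a hypothetical Argo Rollouts patch target might look like this:

please grant rbac access permissions on name/namespace: my-app/production, kind: Rollout, apiversion: argoproj.io/v1alpha1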

To grant RBAC access permissions:

  1. Find the stormforge-agent values.yaml file from the Helm chart that was used to install the StormForge Agent.

    • If you don’t have the file:
      1. Print the chart’s default values:

        helm show values oci://registry.stormforge.io/library/stormforge-agent
        

        If you use a proxy server or private container registry, follow the steps in Install the StormForge Agent - advanced scenarios.

      2. Copy the .rbac section into a new file named rbac-update.yaml. Proceed to the next step.

  2. In the rbac.additional section, add an entry that covers the resources listed in the message.

    • Replace API_GROUP and KIND_PLURAL below with values derived from the please grant rbac access permissions ... message: API_GROUP is the group portion of the reported apiversion (for example, apps for apps/v1), and KIND_PLURAL is the lowercase plural form of the reported kind (for example, deployments for Deployment).

      rbac:
        additional:
          - apiGroups:
              - API_GROUP
            resources:
              - KIND_PLURAL
            verbs:
              - get
              - list
              - watch
      
  3. Apply the changes by running one of the following commands:

    • If you edited the values.yaml file, run:

      helm upgrade stormforge-agent oci://registry.stormforge.io/library/stormforge-agent \
        -n stormforge-system \
        -f values.yaml
      
    • If you created a separate file named rbac-update.yaml, run:

      helm upgrade stormforge-agent oci://registry.stormforge.io/library/stormforge-agent \
        -n stormforge-system \
        --reuse-values \
        -f rbac-update.yaml
      
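
Continuing the hypothetical Rollout message shown earlier, the finished rbac-update.yaml would map the reported apiversion argoproj.io/v1alpha1 to the group argoproj.io and the kind Rollout to the plural resource rollouts:

# rbac-update.yaml (hypothetical example for an Argo Rollouts patch target)
rbac:
  additional:
    - apiGroups:
        - argoproj.io    # group portion of the apiversion from the message
      resources:
        - rollouts       # lowercase plural form of the kind from the message
      verbs:
        - get
        - list
        - watch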

I didn’t get a recommendation for a workload. What can I check?

Check the basics:

  • Look for more information in a message near the top of the page in the Optimize Live UI.
  • On the Config tab on the workload details page, check that the recommendation schedule is set to the frequency that you expect.

If an HPA is enabled on the workload, you won’t get a recommendation in these scenarios:

  • The HPA scales on CPU and memory metrics.
  • The workload scales down to 0 replicas for more than 75% of the time during a 7-day period. Sufficient metrics aren’t available when a workload scales down for this much time.
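
To check which metrics an HPA scales on, describe it and look at the Metrics section (replace the placeholders with your HPA’s name and namespace):

kubectl describe hpa HPA_NAME -n NAMESPACE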

Errors and warnings

Failed to send batch, retrying

Reason: Post "https://in.stormforge.io/prometheus/write": oauth2: "too_many_attempts" We have detected suspicious login behavior and further attempts will be blocked. Please contact the administrator.

Explanation: If enough incorrect clientID/clientSecret combinations originate from one IP address, any further attempts to connect to Optimize Live are blocked.

Troubleshooting steps:

  1. Determine which authentication setup you’re using in your estate. An administrator typically makes this decision, usually before installing Optimize Live on an initial cluster.

    • One clientID/clientSecret pair used by all clusters
    • One clientID/clientSecret pair per cluster
  2. On each cluster, locate your access credentials file (the .yaml file that contains your clientID and clientSecret). This file was created when you either:

    • Added a cluster by clicking Add Cluster in the Optimize Live UI
    • Ran stormforge auth create CREDENTIAL_NAME > CREDENTIAL_FILE from the command line before you installed the StormForge Agent
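
To confirm which credentials a cluster is actually using, you can compare the clientID in its credentials file against the secret installed on the cluster. The key name below is an assumption; list the secret’s keys first and adjust as needed:

kubectl get secret stormforge-agent -n stormforge-system -o jsonpath='{.data}'
kubectl get secret stormforge-agent -n stormforge-system \
  -o jsonpath='{.data.STORMFORGE_CLIENT_ID}' | base64 -d    # key name is an assumption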

Applied patch but timed out waiting for workload to be ready

If a workload doesn’t restart or function as expected after a recommendation is applied, the Optimize Live Applier rolls back the workload to its last known good state and logs one of the following messages in the Applier logs.

  • Workload was rolled back and returned to ready:

      applied patch but timed out waiting for workload to become ready
      rolled back to previous state
      workload returned to ready
    
  • Workload was rolled back but did not return to ready:

      applied patch but timed out waiting for workload to become ready
      rolled back to previous state
      timed out waiting for pods to be ready
    

While rollback is sometimes a desirable safety feature, at scale and in large environments these errors often occur because the Applier’s health check doesn’t wait long enough for workloads to become ready before it considers the change a failure and rolls back to the previous state. If this is happening frequently, disable the rollback feature.

To disable the rollback feature:

  1. Add the following code to a .yaml file, for example, applier-values.yaml.
    # applier-values.yaml
    extraEnvVars:
    - name: STORMFORGE_SKIP_ROLLBACK
      value: "true"
    
  2. Upgrade the applier using the new values:
    helm upgrade stormforge-applier oci://registry.stormforge.io/library/stormforge-applier \
      -n stormforge-system \
      -f applier-values.yaml
    
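
To verify that the variable reached the Applier, inspect the Deployment’s container env. This sketch assumes the Deployment shares the release name stormforge-applier and runs a single container:

kubectl get deployment stormforge-applier -n stormforge-system \
  -o jsonpath='{.spec.template.spec.containers[0].env}'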

If you see the following message, the workload wasn’t healthy even before applying any changes, and you should investigate the workload directly.

  • Workload is unhealthy:
    applied patch but timed out waiting for workload to become ready
    rolled back to previous state
    not waiting for workload to become ready, workload was unhealthy prior to patch
    

Start by describing the pods that belong to the workload and examining the Events section for clues.

kubectl describe pod POD_NAME -n NAMESPACE

You can learn more about troubleshooting common issues with Kubernetes resources in the Kubernetes docs.
