Troubleshooting

A guide to diagnosing common errors and answering common questions.

Common Errors & Questions

How do I access logs?

What is a failed trial?

Where do I get a load test?

How do I troubleshoot login or authentication?

Why is my experiment status idle?

Why is the experiment showing me “no results”?

The experiments keep failing. What do I do?

Why did the baseline fail?

Why did the metric or parameter fail?

How do I ensure that the experiment is fully deleted?

How do I check whether I wrote the experiment correctly?

What does “pods unschedulable” mean?

Is this a role-based access control (RBAC) issue or a service account issue?

List of troubleshooting utility tools


How do I access logs?

Many issues can be debugged using logs. Logs can be retrieved from three sources: the load test, the controller, and the application being optimized.

To access load test logs:

  • If you are using StormForge load tests, see the Debugging Test Cases guide.

  • If you are using a different load test, visit the test’s documentation for more information.

To access controller logs:

  • Run: kubectl logs -f -l app.kubernetes.io/name=optimize-pro -n stormforge-system
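When attaching logs to a support request, a bounded snapshot is often easier to work with than a live stream. The following is a minimal sketch that reuses the label selector and namespace from the command above; the function name controller_logs and the default of 200 lines are illustrative assumptions, not part of the StormForge tooling.

```shell
# Hypothetical helper: print the most recent controller log lines.
# Usage: controller_logs [line-count]   (defaults to 200 lines)
controller_logs() {
  kubectl logs \
    -l app.kubernetes.io/name=optimize-pro \
    -n stormforge-system \
    --tail="${1:-200}"
}

# Example: save the last 500 lines to a file.
# controller_logs 500 > controller.log
```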

To access application logs:

  • This is dependent on the application.

What is a failed trial?

Failed trials are a normal part of the experiment lifecycle and do not necessarily indicate a problem with the experiment. Failures are useful signals to the machine learning service and help it generate better suggestions as it learns. However, if most or all trials in the experiment are failing, the experiment may need adjustments.

Good Failures: If only a few trials fail over the course of the experiment, little or nothing needs to change. Failures occur normally under these circumstances:

  • The ML algorithm is exploring and learning, especially at the beginning of an experiment. It may take some time to produce successful trials.
  • A trial violates a metric constraint because the range that was set is too narrow.

How do I troubleshoot? Run stormforge check optimize-pro to help identify the error.

Bad Failures: If most or all of an experiment’s trials are failing, the experiment may need to be modified.

To start troubleshooting, here are two options:

Option 1: Check the logs of the trial pod:

  1. Run stormforge check optimize-pro to check the controller.
  2. Run kubectl get trials to list the trials.
  3. Run kubectl get pods to list the pods.
  4. Run kubectl logs -l stormforge.io/trial=<trial-name> to get the logs for the trial pod.

Option 2:

Run the linter: stormforge check experiment /path/to/experiment.yaml
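
The pod and log lookups from Option 1 can be wrapped in a small helper. This is a sketch under stated assumptions: the function name trial_logs is hypothetical, the stormforge.io/trial label is taken from the command in Option 1, and any arguments after the trial name (for example -n <namespace>) are passed straight to kubectl.

```shell
# Hypothetical helper: list the pods for one trial, then print their logs.
# Usage: trial_logs <trial-name> [extra kubectl args, e.g. -n my-namespace]
trial_logs() {
  if [ -z "${1:-}" ]; then
    echo "usage: trial_logs <trial-name> [kubectl args]" >&2
    return 1
  fi
  selector="stormforge.io/trial=$1"
  shift
  # Show which pods belong to the trial and what state they are in.
  kubectl get pods -l "$selector" "$@"
  # Print the last 100 log lines from every container in those pods.
  kubectl logs -l "$selector" --all-containers --tail=100 "$@"
}

# Example:
# trial_logs <trial-name> -n <namespace>
```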


Where do I get a load test?

A load test is usually (but not always) needed to run application performance experiments with StormForge Optimize Pro. Check out these resources to learn how to create a load test.

StormForge Performance Testing (Recommended)

Locust


How do I troubleshoot login or authentication?

First, start the login process by running stormforge login. This redirects you to the login page for authentication and, once you have logged in, populates the configuration file (run stormforge config -h to learn more about it). After logging in, run stormforge install optimize-pro again. This creates a new set of credentials for the controller and redeploys the controller.

You might have to de-register the cluster’s old credentials when you reinstall. You can review existing cluster registrations by running stormforge get clusters, and if necessary, remove the registrations by running stormforge delete cluster <name>.

If you encounter issues when you run stormforge install optimize-pro, run stormforge ping or stormforge check connect to verify the connection from the CLI.


Why is my experiment status idle?

Start by checking the controller logs for the message, “Experiments API is unavailable.” This indicates that the connection is idle. When the controller cannot access the Experiments API, it will not automatically suggest trials.

This can also happen if stormforge install optimize-pro ran to completion without valid credentials. Log in and run stormforge install optimize-pro to update the secret with new API credentials (this is the secret in the stormforge-system namespace).
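
To check for the message without reading the logs by hand, the controller log output can be piped through grep. A minimal sketch: the function name api_unavailable is hypothetical, and the match is case-insensitive because the exact wording in your controller version may differ slightly from the message quoted above.

```shell
# Hypothetical helper: succeed if the log text on stdin contains the
# "Experiments API is unavailable" message (case-insensitive match).
api_unavailable() {
  grep -qi "Experiments API is unavailable"
}

# Example, using the controller log command from earlier in this guide:
# kubectl logs -l app.kubernetes.io/name=optimize-pro -n stormforge-system --tail=500 \
#   | api_unavailable && echo "Controller cannot reach the Experiments API"
```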


Why is the experiment showing me “no results”?

The experiment is most likely still running. To check the status of the experiment, run kubectl get experiments and kubectl get trials in the appropriate namespaces.


The experiments keep failing. What do I do?

Check the following:

  1. Is the test file written correctly?
  2. Are all of the prerequisites set up?
  3. Is there connectivity between the controller and the metric source?
  4. Is the trial within the set constraints?
    • Resources: Did the memory or CPU run out?
    • Metric: Is the metric within the allowed set range?
    • Parameter: Are the parameters within an appropriate range?

Why did the baseline fail?

  1. The baseline value is missing from the parameter.

  2. The baseline is outside the range of the parameter (for example, baseline: 100, min: 25, max: 75).

  3. The previous experiment might not have been fully deleted when the new experiment was created, resulting in an incorrect baseline. Be sure to delete both the Kubernetes artifacts and the StormForge experiment to start over.

     kubectl delete experiment <experiment>
     stormforge delete experiment <experiment>
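
The range check in item 2 is simple arithmetic and can be verified before the experiment is applied. A minimal sketch, assuming integer parameters and inclusive bounds; the function name baseline_in_range is hypothetical.

```shell
# Hypothetical helper: succeed if min <= baseline <= max (integers only).
# Usage: baseline_in_range <baseline> <min> <max>
baseline_in_range() {
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# The out-of-range example from item 2: baseline 100, min 25, max 75.
if baseline_in_range 100 25 75; then
  echo "baseline ok"
else
  echo "baseline out of range"
fi
```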
    

Why did the metric or parameter fail?

  1. The result is most likely outside of the allowed set range. This can happen when using max or min with metrics.

  2. There may be an issue with the metrics query.

  3. There may have been an issue fetching or collecting the metrics:

    • Check controller logs.

    • Look at the trial state by running a command such as:

        kubectl get trial <trial> -o jsonpath='{.status.conditions} { .spec.values}' | jq '.'
      

How do I ensure that the experiment is fully deleted?

Delete both the Kubernetes artifacts and the StormForge experiment:

kubectl delete experiment <experiment>
stormforge delete experiment <experiment>
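
Because a leftover copy on either side can affect the next run, it can be useful to run both deletions together and report which one failed. A sketch under stated assumptions: the function name delete_experiment is hypothetical, and kubectl is assumed to be pointed at the namespace that holds the experiment.

```shell
# Hypothetical helper: delete an experiment from the cluster and from StormForge.
# Usage: delete_experiment <experiment>
delete_experiment() {
  if [ -z "${1:-}" ]; then
    echo "usage: delete_experiment <experiment>" >&2
    return 1
  fi
  status=0
  # Attempt both deletions even if the first one fails.
  kubectl delete experiment "$1" \
    || { echo "cluster delete failed for $1" >&2; status=1; }
  stormforge delete experiment "$1" \
    || { echo "StormForge delete failed for $1" >&2; status=1; }
  return "$status"
}

# Example:
# delete_experiment <experiment>
```

A nonzero exit status means at least one of the two copies may still exist.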

How do I check whether I wrote the experiment correctly?

  1. Run the following command:

     stormforge check experiment /path/to/experiment.yaml
    
  2. Check that the baseline, metrics, and parameters are within bounds.


What does “pods unschedulable” mean?

A pod can be unschedulable when the cluster is too small for the defined parameter bounds. Check node taints, tolerations, and scheduler affinity, if defined in your environment.
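
Unschedulable pods stay in the Pending phase, and the scheduler records a FailedScheduling event that explains why (for example, insufficient CPU or an untolerated taint). A minimal sketch using standard kubectl field selectors; the function name pending_pods is hypothetical, and extra arguments such as -n <namespace> are passed through to kubectl.

```shell
# Hypothetical helper: list Pending pods, then the scheduler events
# that explain why pods could not be placed.
# Usage: pending_pods [extra kubectl args, e.g. -n my-namespace]
pending_pods() {
  kubectl get pods --field-selector=status.phase=Pending "$@"
  kubectl get events --field-selector=reason=FailedScheduling "$@"
}

# Example:
# pending_pods -n <namespace>
```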


Is this a role-based access control (RBAC) issue or a service account issue?

  1. Review controller logs for relevant errors.

  2. If interacting with a resource that is not a StatefulSet or Deployment, additional RBAC for the controller can be generated by running:

     stormforge rbac /path/to/experiment.yaml
    

    This is most commonly needed when working with custom resource definitions (CRDs).

  3. If the experiment uses a setup task, a new service account with elevated permissions may be needed. This may happen when trying to deploy a Helm chart with many different types of resources.

  4. The setup task service account may have been deleted before the rest of the resources that depend on it.

  5. If necessary, ask the admin to set up the correct permissions.
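
To confirm whether a service account is missing a permission, kubectl auth can-i can impersonate the account and answer yes or no for a single verb and resource. A sketch under stated assumptions: the function name sa_can is hypothetical, the service account name and namespace must be supplied by you (this guide does not state them), and impersonation requires that your own user is allowed to use --as.

```shell
# Hypothetical helper: ask whether a service account may perform an action.
# Usage: sa_can <sa-namespace> <sa-name> <verb> <resource> [extra kubectl args]
sa_can() {
  if [ "$#" -lt 4 ]; then
    echo "usage: sa_can <sa-namespace> <sa-name> <verb> <resource> [kubectl args]" >&2
    return 1
  fi
  sa_ns="$1"; sa_name="$2"; verb="$3"; resource="$4"
  shift 4
  kubectl auth can-i "$verb" "$resource" \
    --as="system:serviceaccount:${sa_ns}:${sa_name}" "$@"
}

# Example (placeholders must be filled in for your cluster):
# sa_can <sa-namespace> <sa-name> patch deployments -n <namespace>
```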


List of troubleshooting utility tools

  1. stormforge debug: Shows the available debug subcommands
  2. stormforge debug metric /path/to/experiment.yaml: Shows debug metrics
  3. stormforge check optimize-pro: Checks the controller
  4. stormforge check experiment /path/to/experiment.yaml: Checks the experiment

Don’t see your issue here? Contact us at support@stormforge.io

Last modified January 13, 2023