Troubleshooting Guide

A guide to diagnosing common errors and answering common questions


Common Errors & Questions

How do I access logs?

What is a failed trial?

Where do I get a load test?

How do I troubleshoot login or authentication?

Why is my experiment status idle?

Why is the experiment showing me “no results”?

The experiments keep on failing. What do I do?

Why did the baseline fail?

Why did the metric or parameter fail?

How do I ensure that the experiment is fully deleted?

How do I check if I wrote the experiment correctly?

What does “pods unschedulable” mean?

Is this an RBAC (role-based access control) or service account issue?

List of Troubleshooting Utility Tools


How do I access logs?

Many issues can be debugged using logs. Logs can be retrieved from three sources: the load test, the controller, and the application being optimized.

To access load test logs:

  • If you are using StormForge load tests, see the Debugging Test Cases guide.

  • If you are using a different load test, visit the test’s documentation for more information.

To access controller logs:

  • Get the logs using kubectl: kubectl logs -f -l app.kubernetes.io/name=optimize -n stormforge-system

To access application logs:

  • This depends on the application being optimized.

What is a failed trial?

Failed trials are a normal part of the experiment lifecycle and do not necessarily indicate a problem with the experiment. Failures are useful signals to the machine learning service and help it generate better suggestions as it learns. However, if most or all trials in the experiment are failing, the experiment may need adjustments.

Good Failures: If only a few trials fail over the course of the experiment, little or nothing likely needs to change. Failures occur normally under these circumstances:

  • The ML algorithm is exploring and learning, especially at the beginning of an experiment. It may take some time to produce successful trials.
  • Metric constraints are violated because the ranges that were set are insufficient.

To troubleshoot, run stormforge check controller to help identify the error.

Bad Failures: If most or all of an experiment’s trials are failing, it may need to be modified.

To start troubleshooting, here are two options:

Option 1: Start by checking the logs of the trial pod:

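  1. stormforge check controller - check the controller
  2. kubectl get trials - list the trials
  3. kubectl -n accounts get pods - list the pods (here in the accounts namespace)
  4. kubectl logs -n accounts :id --follow - stream the application logs, where :id is the pod name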

Option 2: Run the linter:

  1. stormforge check experiment -f experiment.yaml


Where do I get a load test?

A load test is necessary to use StormForge Optimize. There are several resources to learn how to create a load test.

StormForge Performance Testing (Recommended)

Locust


How do I troubleshoot login or authentication?

Start the login process by running stormforge login. This should redirect you to our login page for authentication, and a successful login populates ~/.config/stormforge/config. Next, run stormforge init, which creates a new set of credentials for the controller and deploys the controller to the appropriate clusters.

If there is an issue when running stormforge init, run stormforge ping or stormforge check config to verify the connection from the CLI.
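The two checks above could be combined in a small shell helper. This is only a sketch: the function name verify_cli_connection is hypothetical, and it assumes that stormforge ping exits with a non-zero status when the connection fails, which this guide does not confirm.

```shell
# Hypothetical helper: verify the CLI connection after a failed `stormforge init`.
# Assumption: `stormforge ping` exits non-zero when the connection fails.
verify_cli_connection() {
  if stormforge ping; then
    echo "CLI connection OK"
  else
    # Ping failed, so inspect the local CLI configuration next.
    echo "ping failed; checking the CLI configuration" >&2
    stormforge check config
    return 1
  fi
}
```

Calling verify_cli_connection runs the ping first and only falls back to the configuration check when the ping fails.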


Why is my experiment status idle?

Start by checking the controller logs for the message, “Experiments API is unavailable.” This indicates that the connection is idle. When the controller cannot access the Experiments API, it will not automatically suggest trials. Please create the trials manually using stormforge generate trial.

This can also happen if stormforge init was run successfully without logging in. Please log in and run stormforge authorize-cluster to update the secret with the API credentials (this is the secret in the stormforge-system namespace). If you do not have an account, the controller will require parameter inputs via stormforge generate trial.
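As a rough sketch, the log check could be scripted by searching the controller logs for the message quoted above. The function name controller_api_unavailable is hypothetical; the label selector and namespace are the ones used elsewhere in this guide.

```shell
# Hypothetical helper: succeeds if the controller logs contain the
# "Experiments API is unavailable" message.
# --tail=-1 requests all log lines; when a label selector is given,
# kubectl otherwise shows only the last 10 lines per pod.
controller_api_unavailable() {
  kubectl logs -l app.kubernetes.io/name=optimize -n stormforge-system --tail=-1 \
    | grep -q "Experiments API is unavailable"
}
```

For example, controller_api_unavailable && echo "controller cannot reach the Experiments API" prints the hint only when the message is found.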


Why is the experiment showing me “no results”?

The experiment is most likely still running. To check the status of the experiment, use kubectl get experiments and kubectl get trials in the appropriate namespace.


The experiments keep on failing. What do I do?

Check the following:

  1. Is the test file written correctly?
  2. Are all of the prerequisites set up?
  3. Is there connectivity between the controller and the metric source?
  4. Is the trial within the set constraints?
    • Resources: Did the memory or CPU run out?
    • Metric: Is the metric within the allowed set range?
    • Parameter: Are the parameters within an appropriate range?

Why did the baseline fail?

  1. The baseline value is missing from the parameter.
  2. The baseline is out of range of the parameter (ex. baseline: 100, min: 25, max: 75).
  3. The previous experiment may not have been fully deleted when the new experiment was created, resulting in an incorrect baseline. Make sure to delete both the Kubernetes artifacts and the StormForge experiment to start over.
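The range condition in item 2 can be checked by hand before creating the experiment. The following is a minimal sketch for integer values only; baseline_in_range is a hypothetical helper, not part of the StormForge CLI.

```shell
# Hypothetical helper: check that an integer baseline lies within [min, max].
# Usage: baseline_in_range BASELINE MIN MAX
baseline_in_range() {
  if [ "$1" -lt "$2" ] || [ "$1" -gt "$3" ]; then
    echo "baseline $1 is outside [$2, $3]" >&2
    return 1
  fi
}
```

With the values from the example above, baseline_in_range 100 25 75 reports the problem on standard error and returns a non-zero status, while baseline_in_range 50 25 75 succeeds silently.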

Why did the metric or parameter fail?

  1. The result is most likely outside of the allowed set range. This can happen when using max or min with metrics.

  2. There may be an issue with the metrics query.

  3. There may have been an issue fetching or collecting the metrics:

    • Check controller logs
    • Look at the trial state using the following command: kubectl get trial postgres-example-beamzilla-023 -o jsonpath='{.status.conditions} { .spec.values}' | jq '.'
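If jq is not available, a jsonpath range expression can print one condition per line instead. This sketch assumes the trial conditions follow the usual Kubernetes convention of type, status, and message fields, which this guide does not confirm; trial_conditions is a hypothetical helper name.

```shell
# Hypothetical helper: print each trial condition as "type<TAB>status<TAB>message".
# Assumption: conditions carry .type, .status, and .message fields.
# Usage: trial_conditions TRIAL_NAME
trial_conditions() {
  kubectl get trial "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
}
```

For the trial named above, this would be called as trial_conditions postgres-example-beamzilla-023.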

How do I ensure that the experiment is fully deleted?

Please delete both the Kubernetes artifacts and the StormForge experiment:

  • kubectl delete experiment <experiment>
  • stormforge delete experiment <experiment>
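To avoid forgetting one of the two deletions, both can be wrapped in a single function. This is a sketch under stated assumptions: delete_experiment_fully is a hypothetical name, the kubectl call acts on the current namespace, and --ignore-not-found is added so that the StormForge experiment is still removed when the Kubernetes resource is already gone.

```shell
# Hypothetical helper: delete the Kubernetes experiment resource and the
# StormForge experiment; returns non-zero if either deletion fails.
# Usage: delete_experiment_fully EXPERIMENT_NAME
delete_experiment_fully() {
  rc=0
  # --ignore-not-found keeps kubectl from failing when the resource is already gone.
  kubectl delete experiment "$1" --ignore-not-found || rc=1
  stormforge delete experiment "$1" || rc=1
  return "$rc"
}
```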


How do I check if I wrote the experiment correctly?

  1. Use this command to check if the experiment is written correctly: stormforge check experiment -f experiment.yaml
  2. Check that the baseline, metrics, and parameters are within bounds.

What does “pods unschedulable” mean?

A pod can be unschedulable when the cluster is too small for the defined parameter bounds. It may also be worth checking node taints, tolerations, and scheduler affinity if defined in your environment.
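Unschedulable pods stay in the Pending phase, so one way to find them is to filter on that phase and then read the events of a specific pod. The helper names pending_pods and describe_pod are hypothetical; the namespace and pod name are whatever applies in your environment.

```shell
# Hypothetical helper: list pods that are still Pending in a namespace.
# Usage: pending_pods NAMESPACE
pending_pods() {
  kubectl get pods -n "$1" --field-selector=status.phase=Pending
}

# Hypothetical helper: describe a pod; the Events section at the end of the
# output typically explains why scheduling failed (resources, taints, affinity).
# Usage: describe_pod NAMESPACE POD_NAME
describe_pod() {
  kubectl describe pod "$2" -n "$1"
}
```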


Is this an RBAC (role-based access control) or service account issue?

  1. Review the controller logs for relevant errors.
  2. If interacting with a resource that is not a StatefulSet or Deployment, additional RBAC for the controller can be generated via stormforge generate rbac -f experiment.yaml. This is most commonly needed when working with custom resource definitions (CRDs).
  3. If the experiment uses a setup task, a new service account with elevated permissions may be needed. This may happen when trying to deploy a Helm chart with many different types of resources.
  4. The setup task service account may have been deleted before the rest of the resources that depend on it.
  5. If necessary, ask the admin to set up the correct permissions.
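To confirm whether a permission is actually missing, kubectl auth can-i can impersonate a service account. The sketch below is illustrative only: sa_can_i is a hypothetical helper, and the service account name and namespaces must be replaced with the ones used in your cluster. Impersonation itself requires sufficient privileges.

```shell
# Hypothetical helper: ask whether a service account may perform VERB on RESOURCE
# in TARGET_NAMESPACE. kubectl prints "yes" or "no".
# Usage: sa_can_i SA_NAMESPACE SA_NAME VERB RESOURCE TARGET_NAMESPACE
sa_can_i() {
  kubectl auth can-i "$3" "$4" -n "$5" \
    --as="system:serviceaccount:$1:$2"
}
```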

List of Troubleshooting Utility Tools

  1. stormforge debug
  2. stormforge debug metrics - Debug metrics
  3. stormforge check controller - Checks the controller
  4. stormforge check experiment - Checks the experiment

Don’t see your issue here? Contact us at support@stormforge.io

Last modified August 18, 2021