The experiment lifecycle consists largely of creating an experiment and iterating over trials, each operating within its own namespace on the cluster. The number of concurrently running trials is determined by the availability of namespaces to run trials in.
An experiment and its trials run in several steps, coordinated by the StormForge Optimize Controller.
Note: If the StormForge Optimize Controller is not configured to use an API, you can generate trial configurations manually using redskyctl generate trial. The experiment process then starts with the Create Trial step instead of Reconcile Experiment. (redskyctl suggest can be used with the API configured, but the suggestion will be sent to the API to be queued.)
An experiment manifest is written and loaded into the cluster. When using the platform this will synchronize the cluster state with the API and begin requesting suggested parameter assignments; otherwise the system will be idle until suggestions are manually provided.
The definition of the experiment includes a trial template which will be combined with the parameter assignments to form a new trial resource in the cluster. Any failures during the remaining stages will cause the trial to be marked as failed.
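As a rough illustration of how a template and a set of assignments could be combined, the sketch below models both as plain Python dictionaries. The key names (trialTemplate, assignments) and the new_trial helper are assumptions made for this example, not the controller's actual implementation.

```python
import copy


def new_trial(experiment, assignments):
    """Combine the experiment's trial template with parameter assignments.

    Hypothetical helper: the dictionary layout is illustrative only.
    """
    # Copy the template so the experiment itself is never modified.
    trial = copy.deepcopy(experiment["trialTemplate"])
    trial.setdefault("spec", {})["assignments"] = [
        {"name": name, "value": value}
        for name, value in sorted(assignments.items())
    ]
    return trial


experiment = {
    "trialTemplate": {
        "metadata": {"labels": {"app": "demo"}},
        "spec": {"approximateRuntime": "2m"},
    }
}
trial = new_trial(experiment, {"memory": 1024, "cpu": 500})
print(trial["spec"]["assignments"])
```

Each call produces an independent trial, so several trials can be created from the same experiment without affecting one another.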
If the trial includes any setup tasks, a job is scheduled to run each setup task in individual containers. Setup tasks may incorporate parameter assignments, for example as a value in a Helm chart.
Using the patches from the experiment and the parameter assignments from the trial, an attempt is made to patch the cluster state. Empty patches are ignored. Parameter assignments that were already applied during setup tasks may also produce patch operations that do not change anything.
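The two cases described here (empty patches and patches that change nothing) can be sketched as follows. This is a simplified model, not the controller's code: it uses Python's string.Template with $name placeholders as a stand-in for the real patch templating, represents patches as JSON, and the helper names render_patches and changes_state are hypothetical.

```python
import json
from string import Template


def render_patches(patch_templates, assignments):
    """Render each patch template with the assignments, skipping empty results."""
    patches = []
    for text in patch_templates:
        rendered = Template(text).substitute(assignments).strip()
        if not rendered:
            continue  # empty patches are ignored
        patches.append(json.loads(rendered))
    return patches


def changes_state(current, patch):
    """Return True if merging the patch into current would alter any value."""
    for key, value in patch.items():
        if isinstance(value, dict) and isinstance(current.get(key), dict):
            if changes_state(current[key], value):
                return True
        elif current.get(key) != value:
            return True
    return False


templates = ['{"spec": {"replicas": $replicas}}', ""]
patches = render_patches(templates, {"replicas": 3})

already_set = {"spec": {"replicas": 3, "paused": False}}
needs_update = {"spec": {"replicas": 1, "paused": False}}
print(changes_state(already_set, patches[0]))
print(changes_state(needs_update, patches[0]))
```

In this model the second template renders to nothing and is dropped, and the remaining patch is a no-op against an object that already has the assigned value.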
Wait for Stabilization
For any deployment, stateful set or daemon set that was patched, a rollout status check will be performed. Once the patched objects are ready the trial can progress.
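A readiness check of this kind might look like the sketch below, which approximates the conditions a rollout status check evaluates for a Deployment. The objects are plain dictionaries shaped like Kubernetes resources, and the helper names deployment_ready and trial_can_progress are assumptions for illustration. The actual controller logic, and the equivalent checks for stateful sets and daemon sets, may differ.

```python
def deployment_ready(obj):
    """Approximate a rollout status check for a Deployment-shaped dict."""
    desired = obj["spec"].get("replicas", 1)
    status = obj.get("status", {})
    # The status must reflect the latest version of the spec.
    if status.get("observedGeneration", 0) < obj["metadata"].get("generation", 0):
        return False
    return (
        status.get("updatedReplicas", 0) == desired
        and status.get("replicas", 0) == desired
        and status.get("availableReplicas", 0) == desired
    )


def trial_can_progress(patched_objects):
    """The trial moves on only when every patched object is ready."""
    return all(deployment_ready(obj) for obj in patched_objects)


rolling = {
    "metadata": {"generation": 2},
    "spec": {"replicas": 3},
    "status": {"observedGeneration": 2, "replicas": 3,
               "updatedReplicas": 2, "availableReplicas": 2},
}
stable = {
    "metadata": {"generation": 2},
    "spec": {"replicas": 3},
    "status": {"observedGeneration": 2, "replicas": 3,
               "updatedReplicas": 3, "availableReplicas": 3},
}
print(trial_can_progress([rolling, stable]))
print(trial_can_progress([stable]))
```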
Run Trial Job
The trial resource includes a job template which will be used to schedule a new job.
If the container list of the job is empty, a container that performs a “sleep” will be injected (the amount of sleep time is determined by the approximateRuntime field on the trial).
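The injection rule can be sketched as below, with the job template modeled as a dictionary and the runtime given in whole seconds. The container name, the busybox image, and the ensure_containers helper are placeholders chosen for this example. The source does not say what the controller actually injects beyond a container that sleeps.

```python
def ensure_containers(job_template, approximate_runtime_seconds):
    """Inject a sleep container when the job template defines none."""
    pod_spec = (
        job_template.setdefault("spec", {})
        .setdefault("template", {})
        .setdefault("spec", {})
    )
    if not pod_spec.get("containers"):
        pod_spec["containers"] = [{
            "name": "trial-sleep",   # hypothetical name
            "image": "busybox",      # hypothetical image
            "command": ["sleep", str(approximate_runtime_seconds)],
        }]
    return job_template


empty_job = ensure_containers({}, 120)
custom_job = ensure_containers(
    {"spec": {"template": {"spec": {"containers": [{"name": "load-test"}]}}}},
    120,
)
print(empty_job["spec"]["template"]["spec"]["containers"][0]["command"])
```

A job template that already lists containers is left untouched, so the sleep only applies to trials that rely on elapsed time alone.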
The start and completion times of the job are recorded on the trial (the recorded start time will be adjusted by the value of the startTimeOffset field on the trial).
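The adjustment can be pictured with the short sketch below. It assumes the offset is added to the job's start time, for example to exclude warm-up from the measured window. The source only says the start time is "adjusted", so the direction is an assumption, as is the recorded_times helper.

```python
from datetime import datetime, timedelta, timezone


def recorded_times(job_start, job_completion, start_time_offset):
    """Return the times stored on the trial for a finished job.

    Assumption: the offset shifts the start time later.
    """
    return {
        "startTime": job_start + start_time_offset,
        "completionTime": job_completion,
    }


job_start = datetime(2021, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
job_completion = job_start + timedelta(minutes=5)
times = recorded_times(job_start, job_completion, timedelta(seconds=30))
print(times["startTime"].isoformat())  # 2021-01-01T12:00:30+00:00
```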
When the trial job completes, the metrics are collected according to their type. The metric values are recorded on the trial resource. For Prometheus metrics, a check is made to ensure a final scrape has been performed before metric collection. Once all metrics have been collected the trial is marked as finished.
After the trial job is completed and the metrics have been collected, you can view the data by inspecting the Kubernetes trial object via kubectl get trial.
Additionally, when using the Enterprise product, the metrics of finished trials are reported back to the API to improve the next round of suggested parameter assignments. The reported results can be viewed in the StormForge app.
If the trial included setup tasks, a job is scheduled to delete the objects created during setup.