The Optimize Pro Lifecycle
Overview
You always begin the StormForge Optimize Pro experiment process by asking an optimization question, such as:
- What are the best Kafka settings — such as compression, batch size, and linger time — to optimize for throughput and latency when running on AWS r4.xlarge instances?
- What are the optimal settings for the same application when running on d2.xlarge instances?
With your optimization question in mind, you can then craft an Optimize Pro Experiment resource and submit it for empirical evaluation.
StormForge Optimize Pro will initiate a progression of automated machine learning trials to discover an optimal solution for your question, leaving you free to focus on other work.
The following high-level steps describe what happens when you run an experiment.
Experiment Creation
The experiment is created when you run kubectl apply to create an Experiment resource.
The experiment definition includes the parameters you would like to experiment with, the metrics used to measure trial results, and a template for running trials.
The Optimize Pro controller will synchronize the parameter and metric information with the cloud API and begin requesting trial parameter values to run new trials.
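For orientation, an Experiment that asks the Kafka question above might look roughly like the sketch below. The apiVersion, field layout, metric query, and container image shown are illustrative assumptions; check the Experiment reference for your installed version of Optimize Pro.

```yaml
# Illustrative sketch only. Verify field names and the apiVersion against
# the CRD reference for your installed Optimize Pro version.
apiVersion: optimize.stormforge.io/v1beta2   # assumption; may differ by version
kind: Experiment
metadata:
  name: kafka-throughput
spec:
  parameters:                 # values the machine learning may vary between trials
  - name: batch_size
    min: 1000
    max: 100000
  - name: linger_ms
    min: 0
    max: 100
  metrics:                    # how each trial is scored
  - name: throughput
    minimize: false
    type: prometheus
    query: scalar(avg(kafka_messages_per_second))   # hypothetical query
  trialTemplate:              # template used to stamp out each trial (see below)
    spec:
      jobTemplate:
        spec:
          template:
            spec:
              containers:
              - name: load-test
                image: registry.example.com/kafka-load-test:latest   # hypothetical image
```

You submit a definition like this with kubectl apply, as described above.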
Trial Creation
The definition of the experiment includes a trial template, which will be combined with the parameter assignments to form a new trial resource in the cluster.
After a new trial resource is created, the controller will run it and collect the metrics results.
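To illustrate what that combination produces, a generated trial carries the concrete parameter assignments chosen for one run. The sketch below reuses the parameter names from the example above; the exact Trial schema may differ in your version.

```yaml
# Illustrative sketch of a generated trial; the field layout is an assumption.
apiVersion: optimize.stormforge.io/v1beta2   # assumption; matches the Experiment above
kind: Trial
metadata:
  name: kafka-throughput-042
spec:
  assignments:                # concrete values suggested for this trial
  - name: batch_size
    value: 16384
  - name: linger_ms
    value: 25
  jobTemplate: {}             # filled in from the experiment's trial template
```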
Running a Trial
Tens or even hundreds of trials might be run while searching for an optimal configuration, depending on the experimentBudget value.
The parameter values for each trial are different and are suggested by StormForge’s machine learning algorithms to quickly and efficiently identify an optimal configuration while avoiding an exhaustive search.
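The budget itself is configured on the experiment. One common shape for that setting is sketched below; where exactly experimentBudget lives can vary between versions, so verify it against your Experiment reference.

```yaml
# Sketch: cap the experiment at 40 trials. Confirm where experimentBudget
# belongs in your version's Experiment schema before using this.
spec:
  optimization:
  - name: experimentBudget
    value: "40"
```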
Each phase of running a trial is described below.
Run setupTasks Create Jobs
Setup tasks are built-in helpers that can make trials easier to run by creating ephemeral resources at the beginning of a trial and tearing them down after the trial completes.
If the trial template includes any setup tasks, run each setup task’s create job (for example, creating an ephemeral Prometheus and Pushgateway for trial jobs to send metrics to).
Some setup tasks might incorporate trial parameter assignments, for example, as values passed to a Helm chart setup task.
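For context, setup tasks are declared inside the trial template. The sketch below shows the general shape using the monitoring helper described above; the task name and arguments are assumptions to verify against your version.

```yaml
# Illustrative sketch; verify setup task names and arguments against your
# Optimize Pro version. $(MODE) is assumed to expand to "create" or "delete"
# depending on whether the setup job runs before or after the trial.
trialTemplate:
  spec:
    setupTasks:
    - name: monitoring
      args: ["prometheus", "$(MODE)"]   # e.g. ephemeral Prometheus and Pushgateway
```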
Apply patches
Many experiments patch, or modify, existing Kubernetes resources as part of a trial, for example by changing the value of an environment variable in an existing Deployment. Patches are templatized and might incorporate trial parameter assignments.
If the trial includes any patch templates, these patches are applied next.
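As a sketch, the patch below changes an environment variable on an existing Deployment and pulls the value from a trial parameter. The targetRef and patch field names, and the {{ .Values.* }} templating, are assumptions to confirm against your version's documentation.

```yaml
# Illustrative patch template; field names and templating syntax are
# assumptions to verify before use.
patches:
- targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: kafka-producer            # hypothetical workload
  patch: |
    spec:
      template:
        spec:
          containers:
          - name: producer
            env:
            - name: BATCH_SIZE
              value: "{{ .Values.batch_size }}"   # trial parameter assignment
```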
Wait for Stabilization and readinessGates
Stabilization and readinessGates ensure that resources in the cluster are ready and available before the trial’s Job is started.
For any deployment, stateful set, or daemon set that was patched, a rollout status check will be performed. This is called stabilization. The trial will proceed when the patched objects are ready.
If the trial includes any readinessGates, the trial will proceed only when each of their conditions is satisfied.
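A readiness gate might look roughly like the sketch below, which waits for a specific Deployment to report an Available condition; the gate fields shown are assumptions and vary by version.

```yaml
# Illustrative readiness gate; confirm the supported fields in your version.
trialTemplate:
  spec:
    readinessGates:
    - apiVersion: apps/v1
      kind: Deployment
      name: kafka-broker            # hypothetical workload to wait on
      conditionTypes: [Available]   # proceed once this condition is True
```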
Run Trial Job
The trial resource includes a jobTemplate, which will be used to schedule a new Job.
Typical kinds of trial jobs include:
- A batch job, which performs work directly and is measured on its performance.
- A load test job, which sends traffic to another application, measuring that application’s performance.
- An API client job, which calls out to external systems to do work and receives performance indicators back.
- A simple “sleep” job to give time for an external metrics system to collect data.
If no container is specified for the job in the jobTemplate, an implicit “sleep” container will be used that waits the amount of time specified in the trial’s approximateRuntime field.
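As a sketch of that last behavior, the trial template below declares no containers, so the implicit sleep container would run for the approximateRuntime; the duration format and field placement are assumptions.

```yaml
# Illustrative sketch: no container is specified, so an implicit "sleep"
# container waits for approximateRuntime while external systems collect data.
trialTemplate:
  spec:
    approximateRuntime: 5m          # assumed duration format
    jobTemplate:
      spec:
        template:
          spec: {}                  # no containers listed
```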
Collect Metrics
When the trial job completes, the trial metrics are collected according to their type, and the metric values are recorded on the trial resource.
For Prometheus metrics, a check is made to ensure a final scrape has been performed before metric collection.
After all metrics have been collected, the trial is marked as finished.
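For reference, metric definitions might look roughly like the sketch below, with a trial-duration metric and a Prometheus metric; the metric types, the duration template, and the query shown are assumptions.

```yaml
# Illustrative metric definitions; verify metric types and query syntax
# against your Optimize Pro version.
metrics:
- name: duration                    # elapsed time of the trial job
  minimize: true
  query: "{{ duration .StartTime .CompletionTime }}"   # assumed built-in template
- name: p95_latency
  minimize: true
  type: prometheus
  query: scalar(histogram_quantile(0.95, sum(rate(request_seconds_bucket[5m])) by (le)))
```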
Run setupTasks Delete Jobs
If the trial included any setup tasks, run each setup task’s delete job (for example, deleting the ephemeral Prometheus and Pushgateway).
Record Trial Results
After the trial job is completed and the metrics have been collected, you can view the data by inspecting the Kubernetes trial object with kubectl get trial. Additionally, the metrics of finished trials are reported back to the API to inform the machine learning’s next round of suggested trial parameter assignments.
The machine learning will continue to suggest new trials as it explores the problem space, until it has exhausted the experiment’s configurable experimentBudget number of trials.
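For example, inspecting a finished trial with kubectl get trial <name> -o yaml would show something like the abbreviated sketch below; the actual field names and status layout depend on your version.

```yaml
# Abbreviated, illustrative view of a finished trial; the real field names
# and layout depend on your Optimize Pro version.
spec:
  assignments:                      # the parameter values this trial ran with
  - name: batch_size
    value: 16384
status:
  phase: Completed                  # assumed phase value
  # the collected metric values are recorded on the trial as well
```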
Review Experiment Results
You can view the current experiment results at any time in the StormForge app. You’ll see all the completed trials, the parameters they used, and the metrics the trials returned.
The graphable trial results, viewed in aggregate, surface an optimal set of parameter values that balance the metrics you chose to optimize for. While the machine learning helps you get started by highlighting the set of values it believes to be the most balanced, you can use the views available in the application to select empirically tested parameters that best suit your desired outcomes.
Apply Optimal Configuration
The experiment results show you a specific combination of parameter values (such as partitions=7500 and log.segment.bytes=500MB) that the machine learning found to be optimal according to the experiment’s defined metrics and optimization goals (such as maximizing throughput or minimizing cost).
Armed with these specific optimal values, all you need to do is apply them in GitOps version control, a configuration management database, or wherever and however else your application is configured.
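For example, if your application's Kafka configuration lives in a ConfigMap managed through GitOps, the winning values might end up looking like the hypothetical sketch below; the byte value shown assumes 500MB means 500 MiB.

```yaml
# Hypothetical example of committing the winning values to GitOps-managed
# configuration; adapt to however your application is actually configured.
apiVersion: v1
kind: ConfigMap
metadata:
  name: kafka-broker-config         # hypothetical name
data:
  server.properties: |
    # values taken from the Optimize Pro experiment results
    num.partitions=7500
    log.segment.bytes=524288000
```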