Metrics
A metric is an expression to measure the results of a trial with a given set of parameter values. Trial metrics are collected immediately following completion of the trial run job, typically from a dedicated metric store such as Prometheus. A single 64-bit floating point number is collected for each metric defined on the experiment.
As a best practice, metrics should be selected with opposing goals. For example, choosing to minimize resource usage by itself will result in an application that does not start and therefore does not use any CPU or memory at all. An example of opposing goals is minimizing overall resource usage (a combined metric for both CPU and memory) while maximizing the throughput of some part of the application.
Metrics Spec
The metric spec must contain a unique `name` and a `query`. The query must evaluate to a number (integer or floating point). Optionally, the `type`, `min`/`max` values, `minimize`, and `optimize` may be specified.
For details, see the metric API reference.
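For illustration, a minimal sketch of a metric that spells out these optional fields (the values shown are illustrative, and only `name` and `query` are required):

```yaml
spec:
  metrics:
    - name: duration
      type: kubernetes   # optional; kubernetes is the default type
      minimize: true     # optional; prefer lower values for this metric
      optimize: true     # optional; set to false to record without optimizing
      min: "0"           # optional bounds; see the Example section below
      max: "700"
      query: "{{duration .StartTime .CompletionTime}}"
```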
Queries
Regardless of the query type, the `query` field is always preprocessed as a Go template, with Sprig template functions included to support a variety of operations.
For example, a PromQL query can be written to include a placeholder for the “range” (duration) of the trial run:
```yaml
spec:
  metrics:
    - name: sample duration query
      type: prometheus
      query: avg_over_time(up[{{ .Range }}])
```
The following variables are defined for use in query processing:
| Variable Name | Type | Description |
|---|---|---|
| `Trial.Name` | string | The name of the trial |
| `Trial.Namespace` | string | The namespace the trial ran in |
| `Values` | `map[string]interface{}` | The parameter assignments |
| `StartTime` | time | The adjusted start time of the trial run job |
| `CompletionTime` | time | The completion time of the trial run job |
| `Range` | string | The duration of the trial run job, e.g. `5s` |
| `Pods` | PodList | The list of pods in the trial namespace |
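For instance, a hedged sketch of a query that records a parameter assignment from `Values` (the parameter name `replicas` is hypothetical):

```yaml
spec:
  metrics:
    # Record the assigned "replicas" parameter alongside the trial results;
    # optimize: false keeps it out of the optimization itself
    - name: replicas
      optimize: false
      query: "{{ .Values.replicas }}"
```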
The following additional template functions are also available:
| Function | Usage | Example |
|---|---|---|
| percent | Returns the integer percentage. | `{{ percent 9 50 }}` |
| duration | Returns the number of seconds between two times. | `{{ duration .StartTime .CompletionTime }}` |
| resourceRequests | Returns the weighted sum of resource requests for matched labels. | `{{ resourceRequests .Pods "cpu=22,memory=3" }}` |
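For example, a sketch of a combined cost-style metric built on `resourceRequests`, in the spirit of the opposing-goals advice above (the weights `cpu=22,memory=3` are illustrative):

```yaml
spec:
  metrics:
    # Weighted sum of the CPU and memory requests of all pods in the
    # trial namespace (illustrative weights)
    - name: cost
      minimize: true
      query: '{{ resourceRequests .Pods "cpu=22,memory=3" }}'
```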
Metric Types
Metrics can be one of `kubernetes`, `prometheus`, `datadog`, or `jsonpath`.
Kubernetes Metric
A Kubernetes metric is evaluated against pod resources matched by the target selector; if no target is specified, the trial pod is used. It is typically used to measure the duration of the trial or the resource usage of a static application.
The following example metric measures the duration of the trial pod and indicates that this metric should be minimized (to achieve the lowest possible value).
```yaml
spec:
  metrics:
    - name: duration
      minimize: true
      query: "{{duration .StartTime .CompletionTime}}"
```
The following example metric highlights using a target to select the necessary objects. At minimum, the target selector must contain a valid `apiVersion` and `kind` for the resource, e.g. `v1` and `Pod`.
```yaml
spec:
  metrics:
    # Using a target selector
    - name: duration
      minimize: true
      type: kubernetes
      query: '{{duration .StartTime .CompletionTime}}'
      target:
        apiVersion: v1
        kind: Pod
        matchLabels:
          app: foo
```
A Kubernetes metric is the default metric type if one is not specified.
Prometheus Metric
A Prometheus metric treats the `query` field as a PromQL query to execute against a Prometheus instance. The `Range` template variable can be used when writing the PromQL to produce queries over the time interval during which the trial job was running, for example `[{{ .Range }}]`.
All Prometheus metrics must evaluate to a scalar, that is, a single floating point number. You will often need to write a query that produces a single-element instant vector and extract that value using the `scalar` function. Note that `scalar` produces a `NaN` result when the size of the instant vector is not 1, causing the trial to fail during metrics collection.
When using the Prometheus metric type, the `url` field is used to identify the Prometheus instance to query.
When using Prometheus metrics, the following additional template functions are available:
| Function | Usage | Example |
|---|---|---|
| cpuUtilization | Returns the average CPU utilization as a percentage. | `{{ cpuUtilization . "app=foo,component=bar" }}` |
| memoryUtilization | Returns the average memory utilization as a percentage. | `{{ memoryUtilization . "app=foo,component=bar" }}` |
| cpuRequests | Returns the average CPU requests in cores. | `{{ cpuRequests . "app=foo,component=bar" }}` |
| memoryRequests | Returns the average memory requests in bytes. | `{{ memoryRequests . "app=foo,component=bar" }}` |
| GB | Helper to format output as gigabytes. | `{{ memoryRequests . "app=foo,component=bar" \| GB }}` |
| MB | Helper to format output as megabytes. | `{{ memoryRequests . "app=foo,component=bar" \| MB }}` |
| KB | Helper to format output as kilobytes. | `{{ memoryRequests . "app=foo,component=bar" \| KB }}` |
| GiB | Helper to format output as gibibytes. | `{{ memoryRequests . "app=foo,component=bar" \| GiB }}` |
| MiB | Helper to format output as mebibytes. | `{{ memoryRequests . "app=foo,component=bar" \| MiB }}` |
| KiB | Helper to format output as kibibytes. | `{{ memoryRequests . "app=foo,component=bar" \| KiB }}` |
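For instance, a sketch using `memoryRequests` with the `MiB` helper (the label selector and Prometheus URL are assumptions):

```yaml
spec:
  metrics:
    # Average memory requested by pods matching app=foo, in mebibytes
    - name: memory requests
      minimize: true
      type: prometheus
      query: '{{ memoryRequests . "app=foo" | MiB }}'
      url: http://prometheus-server.default.svc.cluster.local:9090
```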
In the following example we calculate the sum of all CPU usage seconds.
```yaml
spec:
  metrics:
    - name: cpu seconds
      minimize: true
      type: prometheus
      query: |
        scalar(
          sum(
            process_cpu_seconds_total{job="prometheus"}
          )
        )
      url: http://prometheus-server.default.svc.cluster.local:9090
```
Datadog Metric
A Datadog metric can be used to execute metric queries against the Datadog API.
If you use a Datadog metric, additional setup is needed to authenticate against the Datadog API: the `DATADOG_API_KEY` and `DATADOG_APP_KEY` environment variables must be set on the manager deployment. You can populate these settings by passing the following values during Optimize Pro installation:
```sh
stormforge install optimize-pro \
  --set datadog.apiKey=xxx-yyy-zzz \
  --set datadog.appKey=aaa-bbb-ccc
```
Datadog metrics are subject to further aggregation (in addition to the aggregation method of the query), similar to the Query Value widget. By default, the `avg` aggregator is used; however, this can be overridden by setting the `scheme` field of the metric to any of the supported aggregator values (`avg`, `last`, `max`, `min`, `sum`). Datadog queries are automatically scoped to the time frame of the relevant trial job.
```yaml
spec:
  metrics:
    - name: p50
      minimize: true
      type: datadog
      query: "avg:trace.http.request.duration.by.resource_service.50p{env:stormforge,service:ples,resource_name:get_/ples}"
```
JSONPath Metric
A JSONPath metric fetches a JSON payload from an arbitrary HTTP endpoint and evaluates a Kubernetes JSONPath expression from the `query` field against it. The result of the JSONPath expression must be a numeric value (or a string that can be parsed as a floating point number). This typically means that the value of the metric `query` field should start and end with curly braces, as in `"{.example.foobar}"` (since the `$` operator is optional).
When using a JSONPath metric, the `selector` field is used to determine the HTTP endpoint to query; additionally, the `scheme`, `port`, and `path` fields can be used to refine the resulting URL (query parameters are allowed in the `path` field if necessary). In general, a request for the URL constructed from the template `{scheme}://{selectedServiceClusterIP}:{port}/{path}` is made with an `Accept: application/json` header to retrieve the JSON entity body.
JSONPath Example
```yaml
spec:
  metrics:
    - name: latency
      minimize: true
      type: jsonpath
      query: '{.current_response_time_percentile_95}'
      url: http://myjson.default.svc.cluster.local:8089/stats/requests
```
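For illustration, the endpoint above might return a payload along these lines (a hypothetical response; only the queried field matters), from which the metric extracts `42.5`:

```json
{
  "current_response_time_percentile_95": 42.5,
  "num_requests": 1024
}
```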
Example
We’ll make use of JSONPath metrics for this example, using them to measure request latency, throughput, and failure ratio from our load generator. We’ll optimize for latency and throughput, and track failures (via `optimize: false`); non-optimized metrics can add useful context when interpreting the results. We’ve also set a maximum allowed latency of 700ms, which means any trial that exceeds that threshold will be marked as failed (with the exception of the baseline).
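A minimal sketch of such a metrics spec, assuming a Locust-style load generator; the service name, port, stats path, and JSON field names are assumptions:

```yaml
spec:
  metrics:
    # Optimized: 95th percentile latency; trials above 700ms are marked
    # as failed (except the baseline)
    - name: latency
      minimize: true
      max: "700"
      type: jsonpath
      query: '{.current_response_time_percentile_95}'
      url: http://loadgen.default.svc.cluster.local:8089/stats/requests
    # Optimized: requests per second reported by the load generator
    - name: throughput
      type: jsonpath
      query: '{.total_rps}'
      url: http://loadgen.default.svc.cluster.local:8089/stats/requests
    # Recorded for context only, not optimized
    - name: failure ratio
      optimize: false
      type: jsonpath
      query: '{.fail_ratio}'
      url: http://loadgen.default.svc.cluster.local:8089/stats/requests
```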