Metrics

A metric is an expression used to measure the results of a trial with a given set of parameter values. Trial metrics are collected immediately after the trial run job completes, typically from a dedicated metric store like Prometheus. A single 64-bit floating point number is collected for each metric defined on the experiment.

As a general strategy, metrics should be selected with opposing goals. For example, minimizing resource usage by itself rewards an application that does not start, since it then uses no CPU or memory at all. A better pairing of opposing goals is to minimize overall resource usage (a combined metric for both CPU and memory) while maximizing the throughput of some part of the application.
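
For example, a minimal sketch of such a pairing might look like the following (the selectors, query values, and port are illustrative placeholders; omitting minimize, as in the full example at the end of this page, means the metric is maximized):

spec:
  metrics:
  # Opposing goal: drive combined CPU/memory requests down
  - name: resources
    minimize: true
    type: pods
    query: '{{resourceRequests .Pods "cpu=1,memory=1"}}'
    selector:
      matchLabels:
        app: example
  # Opposing goal: drive throughput up
  - name: throughput
    type: jsonpath
    query: '{.total_rps}'
    path: '/stats/requests'
    port: 80
    selector:
      matchLabels:
        app: loadgenerator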

Metrics Spec

The metric spec must contain a unique name and a query. The query must evaluate to a number (integer or floating point). Optionally, the type, min/max values, minimize, and optimize fields may be specified.
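
For example, a sketch of a metric spec that exercises the optional fields might look like this (the query, threshold, and selector values are illustrative placeholders):

spec:
  metrics:
  - name: latency             # required: unique name
    type: jsonpath            # optional: defaults to local
    minimize: true            # optional: lower values are better
    optimize: true            # optional: set false to record the value without optimizing for it
    max: "700"                # optional: trials exceeding this value are marked as failed
    query: '{.p95}'           # required: must evaluate to a number
    selector:
      matchLabels:
        app: example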

Additional details can be found in the metric reference.

Concepts

Queries

Regardless of the metric type, the query field is always preprocessed as a Go template, with Sprig template functions included to support a variety of operations.

For example, a PromQL query can be written to include a placeholder for the “range” (duration) of the trial run:

spec:
  metrics:
  - name: sample duration query
    type: prometheus
    query: avg_over_time(up[{{ .Range }}])

The following variables are defined for use in query processing:

Variable Name | Type | Description
Trial.Name | string | The name of the trial
Trial.Namespace | string | The namespace the trial ran in
Values | map[string]interface{} | The parameter assignments
StartTime | time | The adjusted start time of the trial run job
CompletionTime | time | The completion time of the trial run job
Range | string | The duration of the trial run job, e.g. "5s"
Pods | PodList | The list of pods in the trial namespace
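
For example, parameter assignments from the Values map can be interpolated directly into a query (this sketch assumes the experiment defines a parameter named replicas):

spec:
  metrics:
  - name: cpu per replica
    minimize: true
    type: prometheus
    query: |
      scalar(sum(process_cpu_seconds_total)) / {{ .Values.replicas }}
    selector:
      matchLabels:
        app: prometheus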

The following additional template functions are also available:

Function | Usage | Example
percent | Returns the integer percentage. | {{ percent 9 50 }}
duration | Returns the number of seconds between two times. | {{ duration .StartTime .CompletionTime }}
resourceRequests | Returns the weighted sum of resource requests for the given pods, using the supplied resource weights. | {{ resourceRequests .Pods "cpu=22,memory=3" }}

Metric Types

Metrics can be one of local, pods, prometheus, datadog, or jsonpath.

Local Metric

A local metric is evaluated against the trial pod and has access to the trial pod spec. It is typically used to measure the duration of the trial.

The following example metric measures the duration of the trial pod and indicates that this metric should be minimized (we want to achieve the lowest possible value).

spec:
  metrics:
    - name: duration
      minimize: true
      query: "{{duration .StartTime .CompletionTime}}"

A local metric is the default metric type if one is not specified.

Pods Metric

A pods metric is evaluated against pod resources that are matched from a given selector. It is similar to local metrics in that it has access to the pod spec for all pods matched by the selector. This metric type is typically used when measuring the resources of a static application.

The following example metric calculates the weighted resource requests for all pods with the label app=foo, using different weights for CPU and memory.

spec:
  metrics:
  - name: resources
    minimize: true
    type: pods
    query: '{{resourceRequests .Pods "cpu=22,memory=3"}}'
    selector:
      matchLabels:
        app: foo

Prometheus Metric

A prometheus metric treats the query field as a PromQL query to execute against a Prometheus instance identified using a service selector. The Range template variable can be used when writing the PromQL to produce queries over the time interval during which the trial job was running; e.g. [{{ .Range }}].

All Prometheus metrics must evaluate to a scalar, that is, a single floating point number. It is often necessary to write a query that produces a single-element instant vector and extract its value using the scalar function.

When using the Prometheus collection type, the selector field is used to determine the instance of Prometheus to use. A search is performed for services matching the selector in the trial namespace. If multiple services match, each service returned by the API is tried until the metric value is successfully captured.

Prometheus connection information can be further refined using the scheme (must be "https" or "http"; the latter is the default), the port (a port number or name specified on the service; this may be omitted if the service specifies only one port), and the path (the context root of the Prometheus API).
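
For example, a sketch that overrides all three connection fields (the values shown are placeholders for a hypothetical Prometheus deployment):

spec:
  metrics:
  - name: uptime
    type: prometheus
    query: scalar(avg(up))
    scheme: https        # defaults to http
    port: 9090           # may be omitted if the service specifies only one port
    path: /prometheus    # context root of the Prometheus API
    selector:
      matchLabels:
        app: prometheus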

When using Prometheus metrics, the following additional template functions are available:

Function | Usage | Example
cpuUtilization | Returns the average CPU utilization as a percentage. | {{ cpuUtilization . "app=foo,component=bar" }}
memoryUtilization | Returns the average memory utilization as a percentage. | {{ memoryUtilization . "app=foo,component=bar" }}
cpuRequests | Returns the average CPU requests in cores. | {{ cpuRequests . "app=foo,component=bar" }}
memoryRequests | Returns the average memory requests in bytes. | {{ memoryRequests . "app=foo,component=bar" }}
GB | Helper function to format output in gigabytes. | {{ memoryRequests . "app=foo,component=bar" | GB }}
MB | Helper function to format output in megabytes. | {{ memoryRequests . "app=foo,component=bar" | MB }}
KB | Helper function to format output in kilobytes. | {{ memoryRequests . "app=foo,component=bar" | KB }}
GiB | Helper function to format output in gibibytes. | {{ memoryRequests . "app=foo,component=bar" | GiB }}
MiB | Helper function to format output in mebibytes. | {{ memoryRequests . "app=foo,component=bar" | MiB }}
KiB | Helper function to format output in kibibytes. | {{ memoryRequests . "app=foo,component=bar" | KiB }}
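
For example, these helpers can stand in for hand-written PromQL; the following sketch reports the average memory requests of the matched pods in mebibytes (the label selector is a placeholder):

spec:
  metrics:
  - name: memory
    minimize: true
    type: prometheus
    query: '{{ memoryRequests . "app=foo" | MiB }}'
    selector:
      matchLabels:
        app: prometheus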

In the following example we calculate the sum of all cpu usage seconds.

spec:
  metrics:
  - name: cpu seconds
    minimize: true
    type: prometheus
    query: |
      scalar(
        sum(
          process_cpu_seconds_total{job="prometheus"}
        )
      )      
    selector:
      matchLabels:
        app: prometheus

Datadog Metric

A Datadog metric can be used to execute metric queries against the Datadog API.

Datadog metrics require additional setup to authenticate against the Datadog API: the DATADOG_API_KEY and DATADOG_APP_KEY environment variables must be set on the manager deployment. You can populate these environment variables during initialization by adding them to your configuration:

redskyctl config set controller.default.env.DATADOG_API_KEY xxx-yyy-zzz
redskyctl config set controller.default.env.DATADOG_APP_KEY xxx-yyy-zzz

Alternatively, you can manually edit your ~/.config/redsky/config configuration file to include the following snippet:

controllers:
  - name: default
    controller:
      env:
        - name: DATADOG_API_KEY
          value: xxx-yyy-zzz
        - name: DATADOG_APP_KEY
          value: xxx-yyy-zzz

Datadog metrics are subject to further aggregation (in addition to the aggregation method of the query); this is similar to the Query Value widget. By default, the avg aggregator is used; however, this can be overridden by setting the scheme field of the metric to any of the supported aggregator values (avg, last, max, min, sum).
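
For example, a sketch that selects the max aggregator (the Datadog query itself is a placeholder):

spec:
  metrics:
  - name: cpu
    minimize: true
    type: datadog
    scheme: max
    query: avg:system.cpu.user{app:foo}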

JSONPath Metric

A JSONPath metric fetches a JSON payload from an arbitrary HTTP endpoint and evaluates a Kubernetes JSONPath expression from the query field against it.

The result of the JSONPath expression must be a numeric value (or a string that can be parsed as a floating point number). This typically means the value of the metric query field should start and end with curly braces, e.g. "{.example.foobar}" (since the $ operator is optional).

When using a JSONPath metric, the selector field is used to determine the HTTP endpoint to query. Additionally, the scheme, port, and path fields can be used to refine the resulting URL; query parameters are allowed in the path field if necessary. In general, a request for the URL constructed from the template {scheme}://{selectedServiceClusterIP}:{port}/{path} is made with an Accept: application/json header to retrieve the JSON entity body.

JSONPath Example

spec:
  metrics:
  - name: latency
    minimize: true
    type: jsonpath
    query: '{.current_response_time_percentile_95}'
    path: '/stats/requests'
    port: 8089
    selector:
      matchLabels:
        component: locust

Example

This example uses JSONPath metrics to measure request latency, throughput, and failure ratio from our load generator.

We’ll optimize for latency and throughput, and track failures (via optimize: false). Non-optimized metrics do not influence the optimization, but they can add useful context when interpreting the results. We’ve also set a maximum allowed latency of 700ms: any trial that exceeds this threshold will be marked as failed (with the exception of the baseline).

apiVersion: redskyops.dev/v1beta1
kind: Experiment
metadata:
  name: shopping
spec:
  parameters:
  - name: frontendCpu
    min: 50
    max: 1000
    baseline: 100
  - name: frontendMemory
    min: 16
    max: 512
    baseline: 64
  - name: catalogCpu
    min: 50
    max: 1000
    baseline: 100
  - name: catalogMemory
    min: 16
    max: 512
    baseline: 64
  patches:
  - targetRef:
      kind: Deployment
      apiVersion: apps/v1
      name: frontend
    patch: |
      spec:
        template:
          spec:
            containers:
            - name: server
              resources:
                limits:
                  cpu: "{{ .Values.frontendCpu }}m"
                  memory: "{{ .Values.frontendMemory }}Mi"
                requests:
                  cpu: "{{ .Values.frontendCpu }}m"
                  memory: "{{ .Values.frontendMemory }}Mi"      
  - targetRef:
      kind: Deployment
      apiVersion: apps/v1
      name: productcatalogservice
    patch: |
      spec:
        template:
          spec:
            containers:
            - name: server
              resources:
                limits:
                  cpu: "{{ .Values.catalogCpu }}m"
                  memory: "{{ .Values.catalogMemory }}Mi"
                requests:
                  cpu: "{{ .Values.catalogCpu }}m"
                  memory: "{{ .Values.catalogMemory }}Mi"      
  metrics:
  - name: latency
    minimize: true
    type: jsonpath
    query: '{.current_response_time_percentile_95}'
    path: '/stats/requests'
    max: "700"
    port: 80
    selector:
      matchLabels:
        app: loadgenerator
  - name: throughput
    type: jsonpath
    query: '{.total_rps}'
    path: '/stats/requests'
    port: 80
    selector:
      matchLabels:
        app: loadgenerator
  - name: failures
    type: jsonpath
    optimize: false
    query: '{.fail_ratio}'
    path: '/stats/requests'
    port: 80
    selector:
      matchLabels:
        app: loadgenerator

Now that we’ve got our metrics defined, we can add our trial job to perform our load test.

