Metrics
A metric is an expression to measure the results of a trial with a given set of parameter values. Trial metrics are collected immediately following completion of the trial run job, typically from a dedicated metric store such as Prometheus. A single 64-bit floating point number is collected for each metric defined on the experiment.
As a best practice, metrics should be selected with opposing goals. For example, choosing to minimize resource usage by itself will result in an application that does not start and therefore does not use any CPU or memory at all. An example of opposing goals is minimizing overall resource usage (a combined metric for both CPU and memory) while maximizing the throughput of some part of the application.
Metrics Spec
The metric spec must contain a unique `name` and a `query`. The query must evaluate to a number (integer or floating point). Optionally, the `type`, `min`/`max` values, `minimize`, and `optimize` may be specified.
For details, see the metric API reference.
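For illustration, a minimal sketch of a metric that spells out these optional fields (the values shown are illustrative, and only `name` and `query` are required):

```yaml
spec:
  metrics:
    - name: duration
      type: kubernetes   # optional; kubernetes is the default type
      minimize: true     # optional; prefer lower values for this metric
      optimize: true     # optional; set to false to record without optimizing
      min: "0"           # optional bounds; see the Example section below
      max: "700"
      query: "{{duration .StartTime .CompletionTime}}"
```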
Queries
Regardless of the query type, the `query` field is always preprocessed as a Go template, with Sprig template functions included to support a variety of operations.
For example, a PromQL query can be written to include a placeholder for the “range” (duration) of the trial run:
```yaml
spec:
  metrics:
    - name: sample duration query
      type: prometheus
      query: avg_over_time(up[{{ .Range }}])
```
The following variables are defined for use in query processing:
| Variable Name | Type | Description |
|---|---|---|
| `Trial.Name` | string | The name of the trial |
| `Trial.Namespace` | string | The namespace the trial ran in |
| `Values` | `map[string]interface{}` | The parameter assignments |
| `StartTime` | time | The adjusted start time of the trial run job |
| `CompletionTime` | time | The completion time of the trial run job |
| `Range` | string | The duration of the trial run job, e.g. `5s` |
| `Pods` | PodList | The list of pods in the trial namespace |
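For instance, a hedged sketch of a query that records a parameter assignment from `Values` (the parameter name `replicas` is hypothetical):

```yaml
spec:
  metrics:
    # Record the assigned "replicas" parameter alongside the trial results;
    # optimize: false keeps it out of the optimization itself
    - name: replicas
      optimize: false
      query: "{{ .Values.replicas }}"
```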
The following additional template functions are also available:
| Function | Usage | Example |
|---|---|---|
| percent | Returns the integer percentage. | `{{ percent 9 50 }}` |
| duration | Returns the number of seconds between two times. | `{{ duration .StartTime .CompletionTime }}` |
| resourceRequests | Returns the weighted sum of resource requests for matched labels. | `{{ resourceRequests .Pods "cpu=22,memory=3" }}` |
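For example, a sketch of a combined cost-style metric built on `resourceRequests`, in the spirit of the opposing-goals advice above (the weights `cpu=22,memory=3` are illustrative):

```yaml
spec:
  metrics:
    # Weighted sum of the CPU and memory requests of all pods in the
    # trial namespace (illustrative weights)
    - name: cost
      minimize: true
      query: '{{ resourceRequests .Pods "cpu=22,memory=3" }}'
```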
Metric Types
Metrics can be one of `kubernetes`, `prometheus`, `datadog`, or `jsonpath`.
Kubernetes Metric
A Kubernetes metric is evaluated against pod resources matched by the target selector; if no target is specified, the trial pod is used. It is typically used to measure the duration of the trial or the resource usage of a static application.
The following example metric measures the duration of the trial pod and indicates that this metric should be minimized (to achieve the lowest possible value).
```yaml
spec:
  metrics:
    - name: duration
      minimize: true
      query: "{{duration .StartTime .CompletionTime}}"
```
The following example metric highlights using a target to select the necessary objects. At minimum, the target selector must contain a valid `apiVersion` and `kind` for the resource, e.g. `v1` and `Pod`.
```yaml
spec:
  metrics:
    # Using a target selector
    - name: duration
      minimize: true
      type: kubernetes
      query: '{{duration .StartTime .CompletionTime}}'
      target:
        apiVersion: v1
        kind: Pod
        matchLabels:
          app: foo
```
A Kubernetes metric is the default metric type if one is not specified.
Prometheus Metric
A Prometheus metric treats the `query` field as a PromQL query to execute against a Prometheus instance. The `Range` template variable can be used when writing the PromQL to produce queries over the time interval during which the trial job was running, for example `[{{ .Range }}]`.
All Prometheus metrics must evaluate to a scalar, that is, a single floating point number. You will often need to write a query that produces a single-element instant vector and extract that value using the `scalar` function. Note that `scalar` produces a `NaN` result when the size of the instant vector is not 1, causing the trial to fail during metrics collection.
When using the Prometheus metric type, the `url` field is used to identify the Prometheus instance to query.
When using Prometheus metrics, the following additional template functions are available:
| Function | Usage | Example |
|---|---|---|
| cpuUtilization | Returns the average CPU utilization as a percentage. | `{{ cpuUtilization . "app=foo,component=bar" }}` |
| memoryUtilization | Returns the average memory utilization as a percentage. | `{{ memoryUtilization . "app=foo,component=bar" }}` |
| cpuRequests | Returns the average CPU requests in cores. | `{{ cpuRequests . "app=foo,component=bar" }}` |
| memoryRequests | Returns the average memory requests in bytes. | `{{ memoryRequests . "app=foo,component=bar" }}` |
| GB | Helper to format output as gigabytes. | `{{ memoryRequests . "app=foo,component=bar" \| GB }}` |
| MB | Helper to format output as megabytes. | `{{ memoryRequests . "app=foo,component=bar" \| MB }}` |
| KB | Helper to format output as kilobytes. | `{{ memoryRequests . "app=foo,component=bar" \| KB }}` |
| GiB | Helper to format output as gibibytes. | `{{ memoryRequests . "app=foo,component=bar" \| GiB }}` |
| MiB | Helper to format output as mebibytes. | `{{ memoryRequests . "app=foo,component=bar" \| MiB }}` |
| KiB | Helper to format output as kibibytes. | `{{ memoryRequests . "app=foo,component=bar" \| KiB }}` |
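For instance, a sketch using `memoryRequests` with the `MiB` helper (the label selector and Prometheus URL are assumptions):

```yaml
spec:
  metrics:
    # Average memory requested by pods matching app=foo, in mebibytes
    - name: memory requests
      minimize: true
      type: prometheus
      query: '{{ memoryRequests . "app=foo" | MiB }}'
      url: http://prometheus-server.default.svc.cluster.local:9090
```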
In the following example we calculate the sum of all CPU usage seconds.
```yaml
spec:
  metrics:
    - name: cpu seconds
      minimize: true
      type: prometheus
      query: |
        scalar(
          sum(
            process_cpu_seconds_total{job="prometheus"}
          )
        )
      url: http://prometheus-server.default.svc.cluster.local:9090
```
Datadog Metric
A Datadog metric can be used to execute metric queries against the Datadog API.
If you use a Datadog metric, additional setup is needed to authenticate against the Datadog API: the `DATADOG_API_KEY` and `DATADOG_APP_KEY` environment variables must be set on the manager deployment. You can populate these settings by passing the following values during Optimize Pro installation:
```sh
stormforge install optimize-pro \
  --set datadog.apiKey=xxx-yyy-zzz \
  --set datadog.appKey=aaa-bbb-ccc
```
Datadog metrics are subject to further aggregation (in addition to the aggregation method of the query), similar to the Query Value widget. By default, the `avg` aggregator is used; however, this can be overridden by setting the `scheme` field of the metric to any of the supported aggregator values (`avg`, `last`, `max`, `min`, `sum`). Datadog queries are automatically scoped to the time frame of the relevant trial job.
```yaml
spec:
  metrics:
    - name: p50
      minimize: true
      type: datadog
      query: "avg:trace.http.request.duration.by.resource_service.50p{env:stormforge,service:ples,resource_name:get_/ples}"
```
JSONPath Metric
A JSONPath metric fetches a JSON payload from an arbitrary HTTP endpoint and evaluates a Kubernetes JSONPath expression from the `query` field against it. The result of the JSONPath expression must be a numeric value (or a string that can be parsed as a floating point number). This typically means that the value of the metric `query` field should start and end with curly braces, as in `"{.example.foobar}"` (since the `$` operator is optional).
When using a JSONPath metric, the `selector` field is used to determine the HTTP endpoint to query; additionally, the `scheme`, `port`, and `path` fields can be used to refine the resulting URL (query parameters are allowed in the `path` field if necessary). In general, a request for the URL constructed from the template `{scheme}://{selectedServiceClusterIP}:{port}/{path}` is made with an `Accept: application/json` header to retrieve the JSON entity body.
JSONPath Example
```yaml
spec:
  metrics:
    - name: latency
      minimize: true
      type: jsonpath
      query: '{.current_response_time_percentile_95}'
      url: http://myjson.default.svc.cluster.local:8089/stats/requests
```
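For illustration, the endpoint above might return a payload along these lines (a hypothetical response; only the queried field matters), from which the metric extracts `42.5`:

```json
{
  "current_response_time_percentile_95": 42.5,
  "num_requests": 1024
}
```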
Example
We’ll make use of JSONPath metrics for this example, using them to measure request latency, throughput, and failure ratio from our load generator. We’ll optimize for latency and throughput, and track failures (via `optimize: false`); non-optimized metrics can add useful context when interpreting the results. We’ve also set a maximum allowed latency of 700ms, which means any trial that exceeds that threshold will be marked as failed (with the exception of the baseline).
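A minimal sketch of such a metrics spec, assuming a Locust-style load generator; the service name, port, stats path, and JSON field names are assumptions:

```yaml
spec:
  metrics:
    # Optimized: 95th percentile latency; trials above 700ms are marked
    # as failed (except the baseline)
    - name: latency
      minimize: true
      max: "700"
      type: jsonpath
      query: '{.current_response_time_percentile_95}'
      url: http://loadgen.default.svc.cluster.local:8089/stats/requests
    # Optimized: requests per second reported by the load generator
    - name: throughput
      type: jsonpath
      query: '{.total_rps}'
      url: http://loadgen.default.svc.cluster.local:8089/stats/requests
    # Recorded for context only, not optimized
    - name: failure ratio
      optimize: false
      type: jsonpath
      query: '{.fail_ratio}'
      url: http://loadgen.default.svc.cluster.local:8089/stats/requests
```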