Metrics Analysis
As part of the analysis process, Flagger can validate service level objectives (SLOs) like availability, error rate percentage, average response time and any other objective based on app-specific metrics. If a drop in performance is noticed during the SLO analysis, the release will be automatically rolled back with minimum impact to end-users.

Builtin metrics

Flagger comes with two builtin metric checks: HTTP request success rate and duration.
analysis:
  metrics:
  - name: request-success-rate
    interval: 1m
    # minimum req success rate (non 5xx responses)
    # percentage (0-100)
    thresholdRange:
      min: 99
  - name: request-duration
    interval: 1m
    # maximum req duration P99
    # milliseconds
    thresholdRange:
      max: 500
For each metric you can specify a range of accepted values with thresholdRange and the window size of the time series with interval. The builtin checks are available for every service mesh / ingress controller and are implemented with Prometheus queries.

Custom metrics

The canary analysis can be extended with custom metric checks. Using a MetricTemplate custom resource, you configure Flagger to connect to a metric provider and run a query that returns a float64 value. The query result is used to validate the canary based on the specified threshold range.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: my-metric
spec:
  provider:
    type: # can be prometheus, datadog, etc
    address: # API URL
    insecureSkipVerify: # if set to true, disables the TLS cert validation
    secretRef:
      name: # name of the secret containing the API credentials
  query: # metric query
The following variables are available in query templates:
    name (canary.metadata.name)
    namespace (canary.metadata.namespace)
    target (canary.spec.targetRef.name)
    service (canary.spec.service.name)
    ingress (canary.spec.ingressRef.name)
    interval (canary.spec.analysis.metrics[].interval)
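For example, assuming a hypothetical canary that targets a deployment named podinfo in the test namespace and uses a 1m metric interval, a templated query and the form Flagger renders before running it would look roughly like this (the http_requests_total metric and pod label are illustrative, not part of this guide):

query: |
  sum(rate(http_requests_total{namespace="{{ namespace }}", pod=~"{{ target }}-.*"}[{{ interval }}]))

which would be rendered and executed as:

sum(rate(http_requests_total{namespace="test", pod=~"podinfo-.*"}[1m]))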
A canary analysis metric can reference a template with templateRef:
analysis:
  metrics:
  - name: "my metric"
    templateRef:
      name: my-metric
      # namespace is optional
      # when not specified, the canary namespace will be used
      namespace: flagger
    # accepted values
    thresholdRange:
      min: 10
      max: 1000
    # metric query time window
    interval: 1m

Prometheus

You can create custom metric checks targeting a Prometheus server by setting the provider type to prometheus and writing the query in PromQL.
Prometheus template example:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system:9090
  query: |
    100 - sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}",
              response_code!="404"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}"
            }[{{ interval }}]
        )
    ) * 100
Reference the template in the canary analysis:
analysis:
  metrics:
  - name: "404s percentage"
    templateRef:
      name: not-found-percentage
      namespace: istio-system
    thresholdRange:
      max: 5
    interval: 1m
The above configuration validates the canary by checking if the HTTP 404 req/sec percentage is below 5 percent of the total traffic. If the 404s rate reaches the 5% threshold, then the canary fails.
Prometheus gRPC error rate example:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: grpc-error-rate-percentage
  namespace: flagger
spec:
  provider:
    type: prometheus
    address: http://flagger-prometheus.flagger-system:9090
  query: |
    100 - sum(
        rate(
            grpc_server_handled_total{
              grpc_code!="OK",
              kubernetes_namespace="{{ namespace }}",
              kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            grpc_server_started_total{
              kubernetes_namespace="{{ namespace }}",
              kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
            }[{{ interval }}]
        )
    ) * 100
The above template is for gRPC services instrumented with go-grpc-prometheus.
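As with the HTTP example, the template can be referenced in the canary analysis. The sketch below follows the same pattern as the earlier reference examples; the metric name and the 5 percent threshold are illustrative values, not taken from this guide:

analysis:
  metrics:
  - name: "grpc error rate"
    templateRef:
      name: grpc-error-rate-percentage
      namespace: flagger
    thresholdRange:
      max: 5
    interval: 1m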

Prometheus authentication

If your Prometheus API requires basic authentication, you can create a secret in the same namespace as the MetricTemplate with the basic-auth credentials:
apiVersion: v1
kind: Secret
metadata:
  name: prom-basic-auth
  namespace: flagger
# stringData accepts plain-text values (Kubernetes stores them base64-encoded)
stringData:
  username: your-user
  password: your-password
Then reference the secret in the MetricTemplate:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: my-metric
  namespace: flagger
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
    secretRef:
      name: prom-basic-auth

Datadog

You can create custom metric checks using the Datadog provider.
Create a secret with your Datadog API credentials:
apiVersion: v1
kind: Secret
metadata:
  name: datadog
  namespace: istio-system
stringData:
  datadog_api_key: your-datadog-api-key
  datadog_application_key: your-datadog-application-key
Datadog template example:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: datadog
    address: https://api.datadoghq.com
    secretRef:
      name: datadog
  query: |
    100 - (
      sum:istio.mesh.request.count{
        reporter:destination,
        destination_workload_namespace:{{ namespace }},
        destination_workload:{{ target }},
        !response_code:404
      }.as_count()
      /
      sum:istio.mesh.request.count{
        reporter:destination,
        destination_workload_namespace:{{ namespace }},
        destination_workload:{{ target }}
      }.as_count()
    ) * 100
Reference the template in the canary analysis:
analysis:
  metrics:
  - name: "404s percentage"
    templateRef:
      name: not-found-percentage
      namespace: istio-system
    thresholdRange:
      max: 5
    interval: 1m

Amazon CloudWatch

You can create custom metric checks using the CloudWatch metrics provider.
CloudWatch template example:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: cloudwatch-error-rate
spec:
  provider:
    type: cloudwatch
    region: ap-northeast-1 # specify the region of your metrics
  query: |
    [
      {
        "Id": "e1",
        "Expression": "m1 / m2",
        "Label": "ErrorRate"
      },
      {
        "Id": "m1",
        "MetricStat": {
          "Metric": {
            "Namespace": "MyKubernetesCluster",
            "MetricName": "ErrorCount",
            "Dimensions": [
              {
                "Name": "appName",
                "Value": "{{ name }}.{{ namespace }}"
              }
            ]
          },
          "Period": 60,
          "Stat": "Sum",
          "Unit": "Count"
        },
        "ReturnData": false
      },
      {
        "Id": "m2",
        "MetricStat": {
          "Metric": {
            "Namespace": "MyKubernetesCluster",
            "MetricName": "RequestCount",
            "Dimensions": [
              {
                "Name": "appName",
                "Value": "{{ name }}.{{ namespace }}"
              }
            ]
          },
          "Period": 60,
          "Stat": "Sum",
          "Unit": "Count"
        },
        "ReturnData": false
      }
    ]
The query format documentation can be found here.
Reference the template in the canary analysis:
analysis:
  metrics:
  - name: "app error rate"
    templateRef:
      name: cloudwatch-error-rate
    thresholdRange:
      max: 0.1
    interval: 1m
Note that Flagger needs the AWS IAM permission cloudwatch:GetMetricData to use this provider.
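A minimal IAM policy sketch granting that permission could look like the following; the statement ID is an arbitrary placeholder and the policy is an assumption rather than an official example (cloudwatch:GetMetricData does not support resource-level restrictions, hence the wildcard resource):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "FlaggerCloudWatchRead",
      "Effect": "Allow",
      "Action": "cloudwatch:GetMetricData",
      "Resource": "*"
    }
  ]
}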

New Relic

You can create custom metric checks using the New Relic provider.
Create a secret with your New Relic Insights credentials:
apiVersion: v1
kind: Secret
metadata:
  name: newrelic
  namespace: istio-system
stringData:
  newrelic_account_id: your-account-id
  newrelic_query_key: your-insights-query-key
New Relic template example:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: newrelic-error-rate
  namespace: ingress-nginx
spec:
  provider:
    type: newrelic
    secretRef:
      name: newrelic
  query: |
    SELECT
      filter(sum(nginx_ingress_controller_requests), WHERE status >= '500') /
      sum(nginx_ingress_controller_requests) * 100
    FROM Metric
    WHERE metricName = 'nginx_ingress_controller_requests'
      AND ingress = '{{ ingress }}' AND namespace = '{{ namespace }}'
Reference the template in the canary analysis:
analysis:
  metrics:
  - name: "error rate"
    templateRef:
      name: newrelic-error-rate
      namespace: ingress-nginx
    thresholdRange:
      max: 5
    interval: 1m

Graphite

You can create custom metric checks using the Graphite provider.
Graphite template example:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: graphite-request-success-rate
spec:
  provider:
    type: graphite
    address: http://graphite.monitoring
  query: |
    target=summarize(
      asPercent(
        sumSeries(
          stats.timers.httpServerRequests.app.{{target}}.exception.*.method.*.outcome.{CLIENT_ERROR,INFORMATIONAL,REDIRECTION,SUCCESS}.status.*.uri.*.count
        ),
        sumSeries(
          stats.timers.httpServerRequests.app.{{target}}.exception.*.method.*.outcome.*.status.*.uri.*.count
        )
      ),
      {{interval}},
      'avg'
    )
Reference the template in the canary analysis:
analysis:
  metrics:
  - name: "success rate"
    templateRef:
      name: graphite-request-success-rate
    thresholdRange:
      min: 90
    interval: 1min

Graphite authentication

If your Graphite API requires basic authentication, you can create a secret in the same namespace as the MetricTemplate with the basic-auth credentials:
apiVersion: v1
kind: Secret
metadata:
  name: graphite-basic-auth
  namespace: flagger
stringData:
  username: your-user
  password: your-password
Then, reference the secret in the MetricTemplate:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: my-metric
  namespace: flagger
spec:
  provider:
    type: graphite
    address: http://graphite.monitoring
    secretRef:
      name: graphite-basic-auth

Google Cloud Monitoring (Stackdriver)

Enable Workload Identity on your cluster, create a service account key that has read access to the Cloud Monitoring API and then create an IAM policy binding between the GCP service account and the Flagger service account on Kubernetes. You can take a look at this guide.
Annotate the Flagger service account:

kubectl annotate serviceaccount flagger --namespace <flagger-namespace> \
  iam.gke.io/gcp-service-account=<gcp-service-account>@<project-id>.iam.gserviceaccount.com

Alternatively, you can download the JSON keys and add them to your secret with the key serviceAccountKey (this method is not recommended).
Create a secret that contains your project-id (and, if Workload Identity is not enabled on your cluster, your service account JSON, see https://cloud.google.com/docs/authentication/production#create_service_account):

kubectl create secret generic gcloud-sa --from-literal=project=<project-id>

Then reference the secret in the metric template.
Note: The particular MQL query used here works if Istio is installed on GKE (https://cloud.google.com/istio/docs/istio-on-gke/installing).
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: bytes-sent
  namespace: test
spec:
  provider:
    type: stackdriver
    secretRef:
      name: gcloud-sa
  query: |
    fetch k8s_container
    | metric 'istio.io/service/server/response_latencies'
    | filter
        (metric.destination_service_name == '{{ service }}-canary'
        && metric.destination_service_namespace == '{{ namespace }}')
    | align delta(1m)
    | every 1m
    | group_by [],
        [value_response_latencies_percentile:
           percentile(value.response_latencies, 99)]
The reference for the query language can be found here.
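Following the same pattern as the other providers, the template can be referenced in the canary analysis. The metric name and the 1000 threshold (milliseconds, matching the p99 latency the query returns) are illustrative values:

analysis:
  metrics:
  - name: "latency p99"
    templateRef:
      name: bytes-sent
      namespace: test
    thresholdRange:
      max: 1000
    interval: 1m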

InfluxDB

The InfluxDB provider uses the Flux scripting language.
Create a secret that contains your authentication token, which can be obtained from the InfluxDB UI.
kubectl create secret generic influx-token --from-literal=token=<token>
Then reference the secret in the metric template.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found
  namespace: test
spec:
  provider:
    type: influxdb
    secretRef:
      name: influx-token
  query: |
    from(bucket: "default")
      |> range(start: -2h)
      |> filter(fn: (r) => r["_measurement"] == "istio_requests_total")
      |> filter(fn: (r) => r["destination_workload_namespace"] == "{{ namespace }}")
      |> filter(fn: (r) => r["destination_workload"] == "{{ target }}")
      |> filter(fn: (r) => r["response_code"] == "500")
      |> count()
      |> yield(name: "count")
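The template can be referenced in the canary analysis in the same way. The metric name and the threshold (at most 10 HTTP 500 responses counted over the query range) are illustrative values:

analysis:
  metrics:
  - name: "internal errors"
    templateRef:
      name: not-found
      namespace: test
    thresholdRange:
      max: 10
    interval: 1m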

Dynatrace

You can create custom metric checks using the Dynatrace provider.
Create a secret with your Dynatrace token:
apiVersion: v1
kind: Secret
metadata:
  name: dynatrace
  namespace: istio-system
data:
  dynatrace_token: ZHQwYz...
Dynatrace metric template example:
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: response-time-95pct
  namespace: istio-system
spec:
  provider:
    type: dynatrace
    address: https://xxxxxxxx.live.dynatrace.com
    secretRef:
      name: dynatrace
  query: |
    builtin:service.response.time:filter(eq(dt.entity.service,SERVICE-ABCDEFG0123456789)):percentile(95)
Reference the template in the canary analysis:
analysis:
  metrics:
  - name: "response-time-95pct"
    templateRef:
      name: response-time-95pct
      namespace: istio-system
    thresholdRange:
      max: 1000
    interval: 1m