As part of the analysis process, Flagger can validate service level objectives (SLOs) like availability, error rate percentage, average response time and any other objective based on app-specific metrics. If a drop in performance is noticed during the SLO analysis, the release is automatically rolled back with minimal impact on end-users.

Flagger comes with two built-in metric checks: HTTP request success rate and duration.
```yaml
  analysis:
    metrics:
    - name: request-success-rate
      interval: 1m
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
    - name: request-duration
      interval: 1m
      # maximum req duration P99
      # milliseconds
      thresholdRange:
        max: 500
```
For each metric you can specify a range of accepted values with `thresholdRange` and the window size of the time series with `interval`. The built-in checks are available for every service mesh / ingress controller and are implemented with Prometheus queries.
The canary analysis can be extended with custom metric checks. Using a `MetricTemplate` custom resource, you configure Flagger to connect to a metric provider and run a query that returns a `float64` value. The query result is used to validate the canary based on the specified threshold range.
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: my-metric
spec:
  provider:
    type: # can be prometheus or datadog
    address: # API URL
    secretRef:
      name: # name of the secret containing the API credentials
  query: # metric query
```
The following variables are available in query templates:
* `name` (canary.metadata.name)
* `namespace` (canary.metadata.namespace)
* `target` (canary.spec.targetRef.name)
* `service` (canary.spec.service.name)
* `ingress` (canary.spec.ingressRef.name)
* `interval` (canary.spec.analysis.metrics[].interval)
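For illustration, the sketch below shows how these variables are interpolated into a query. The metric name `http_requests_total` and its labels are placeholders, not a Flagger default; for a canary named `podinfo` in the `test` namespace analysed with a `1m` interval, the placeholders would render as noted in the comments.

```yaml
# Hypothetical query fragment: for a canary named "podinfo" in namespace "test"
# with a 1m analysis interval, "{{ target }}" renders to "podinfo",
# "{{ namespace }}" to "test" and "{{ interval }}" to "1m".
query: |
  sum(
    rate(
      http_requests_total{
        namespace="{{ namespace }}",
        app="{{ target }}"
      }[{{ interval }}]
    )
  )
```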
A canary analysis metric can reference a template with `templateRef`:
```yaml
  analysis:
    metrics:
    - name: "my metric"
      templateRef:
        name: my-metric
        # namespace is optional
        # when not specified, the canary namespace will be used
        namespace: flagger
      # accepted values
      thresholdRange:
        min: 10
        max: 1000
      # metric query time window
      interval: 1m
```
You can create custom metric checks targeting a Prometheus server by setting the provider type to `prometheus` and writing the query in PromQL.
Prometheus template example:
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.istio-system:9090
  query: |
    100 - sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}",
              response_code!="404"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace="{{ namespace }}",
              destination_workload="{{ target }}"
            }[{{ interval }}]
        )
    ) * 100
```
Reference the template in the canary analysis:
```yaml
  analysis:
    metrics:
    - name: "404s percentage"
      templateRef:
        name: not-found-percentage
        namespace: istio-system
      thresholdRange:
        max: 5
      interval: 1m
```
The above configuration validates the canary by checking that HTTP 404 requests make up less than 5% of the total traffic. If the 404 rate reaches the 5% threshold, the canary fails.
Prometheus gRPC error rate example:
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: grpc-error-rate-percentage
  namespace: flagger
spec:
  provider:
    type: prometheus
    address: http://flagger-prometheus.flagger-system:9090
  query: |
    100 - sum(
        rate(
            grpc_server_handled_total{
              grpc_code!="OK",
              kubernetes_namespace="{{ namespace }}",
              kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            grpc_server_started_total{
              kubernetes_namespace="{{ namespace }}",
              kubernetes_pod_name=~"{{ target }}-[0-9a-zA-Z]+(-[0-9a-zA-Z]+)"
            }[{{ interval }}]
        )
    ) * 100
```
The above template is for gRPC services instrumented with go-grpc-prometheus.
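Like the other templates, it can be referenced from the canary analysis. A minimal sketch, assuming you want to fail the canary when more than 1% of gRPC calls return a non-OK code over a 1m window (the threshold and interval values are illustrative):

```yaml
  analysis:
    metrics:
    - name: "gRPC error rate"
      templateRef:
        name: grpc-error-rate-percentage
        # namespace where the template above was created
        namespace: flagger
      # illustrative threshold: fail if non-OK gRPC calls exceed 1%
      thresholdRange:
        max: 1
      interval: 1m
```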
If your Prometheus API requires basic authentication, you can create a secret in the same namespace as the `MetricTemplate` with the basic-auth credentials:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: prom-basic-auth
  namespace: flagger
data:
  # values under `data` must be base64 encoded
  username: your-user
  password: your-password
```
Then reference the secret in the `MetricTemplate`:
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: my-metric
  namespace: flagger
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
    secretRef:
      name: prom-basic-auth
```
You can create custom metric checks using the Datadog provider.
Create a secret with your Datadog API credentials:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: datadog
  namespace: istio-system
data:
  # values under `data` must be base64 encoded
  datadog_api_key: your-datadog-api-key
  datadog_application_key: your-datadog-application-key
```
Datadog template example:
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: not-found-percentage
  namespace: istio-system
spec:
  provider:
    type: datadog
    address: https://api.datadoghq.com
    secretRef:
      name: datadog
  query: |
    100 - (
      sum:istio.mesh.request.count{
        reporter:destination,
        destination_workload_namespace:{{ namespace }},
        destination_workload:{{ target }},
        !response_code:404
      }.as_count()
      /
      sum:istio.mesh.request.count{
        reporter:destination,
        destination_workload_namespace:{{ namespace }},
        destination_workload:{{ target }}
      }.as_count()
    ) * 100
```
Reference the template in the canary analysis:
```yaml
  analysis:
    metrics:
    - name: "404s percentage"
      templateRef:
        name: not-found-percentage
        namespace: istio-system
      thresholdRange:
        max: 5
      interval: 1m
```
You can create custom metric checks using the CloudWatch metrics provider.
CloudWatch template example:
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: cloudwatch-error-rate
spec:
  provider:
    type: cloudwatch
    region: ap-northeast-1 # specify the region of your metrics
  query: |
    [
      {
        "Id": "e1",
        "Expression": "m1 / m2",
        "Label": "ErrorRate"
      },
      {
        "Id": "m1",
        "MetricStat": {
          "Metric": {
            "Namespace": "MyKubernetesCluster",
            "MetricName": "ErrorCount",
            "Dimensions": [
              {
                "Name": "appName",
                "Value": "{{ name }}.{{ namespace }}"
              }
            ]
          },
          "Period": 60,
          "Stat": "Sum",
          "Unit": "Count"
        },
        "ReturnData": false
      },
      {
        "Id": "m2",
        "MetricStat": {
          "Metric": {
            "Namespace": "MyKubernetesCluster",
            "MetricName": "RequestCount",
            "Dimensions": [
              {
                "Name": "appName",
                "Value": "{{ name }}.{{ namespace }}"
              }
            ]
          },
          "Period": 60,
          "Stat": "Sum",
          "Unit": "Count"
        },
        "ReturnData": false
      }
    ]
```
The query follows the CloudWatch GetMetricData format (a JSON array of MetricDataQuery objects); see the AWS CloudWatch API documentation for details.
Reference the template in the canary analysis:
```yaml
  analysis:
    metrics:
    - name: "app error rate"
      templateRef:
        name: cloudwatch-error-rate
      thresholdRange:
        max: 0.1
      interval: 1m
```
Note that Flagger needs the `cloudwatch:GetMetricData` AWS IAM permission to use this provider.
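A minimal sketch of such a policy, written here as a CloudFormation-style YAML policy document (the statement ID is made up; attach the policy to whatever IAM role or service account Flagger runs under):

```yaml
# Hypothetical IAM policy document granting Flagger read access to CloudWatch
# metric data. GetMetricData does not support resource-level permissions,
# hence the "*" resource.
PolicyDocument:
  Version: "2012-10-17"
  Statement:
    - Sid: FlaggerCloudWatchMetrics
      Effect: Allow
      Action:
        - cloudwatch:GetMetricData
      Resource: "*"
```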
You can create custom metric checks using the New Relic provider.
Create a secret with your New Relic Insights credentials:
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: newrelic
  namespace: istio-system
data:
  # values under `data` must be base64 encoded
  newrelic_account_id: your-account-id
  newrelic_query_key: your-insights-query-key
```
New Relic template example:
```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: newrelic-error-rate
  namespace: ingress-nginx
spec:
  provider:
    type: newrelic
    secretRef:
      name: newrelic
  query: |
    SELECT
        filter(sum(nginx_ingress_controller_requests), WHERE status >= '500') /
        sum(nginx_ingress_controller_requests) * 100
    FROM Metric
    WHERE metricName = 'nginx_ingress_controller_requests'
    AND ingress = '{{ ingress }}' AND namespace = '{{ namespace }}'
```
Reference the template in the canary analysis:
```yaml
  analysis:
    metrics:
    - name: "error rate"
      templateRef:
        name: newrelic-error-rate
        namespace: ingress-nginx
      thresholdRange:
        max: 5
      interval: 1m
```