Canaries with Helm charts and GitOps

This guide shows you how to package a web app into a Helm chart, trigger canary deployments on Helm upgrade and automate the chart release process with Weave Flux.


You'll be using the podinfo chart. This chart packages a web app made with Go, its configuration, a horizontal pod autoscaler (HPA) and the canary configuration file.

├── Chart.yaml
├── templates
│   ├── NOTES.txt
│   ├── _helpers.tpl
│   ├── canary.yaml
│   ├── configmap.yaml
│   ├── deployment.yaml
│   ├── hpa.yaml
│   ├── service.yaml
│   └── tests
│       ├── test-config.yaml
│       └── test-pod.yaml
└── values.yaml

You can find the chart source here.


Create a test namespace with Istio sidecar injection enabled:

export REPO=
kubectl apply -f ${REPO}/artifacts/namespaces/test.yaml

Add Flagger Helm repository:

helm repo add flagger

Install podinfo with the release name frontend (replace with your own domain):

helm upgrade -i frontend flagger/podinfo \
--namespace test \
--set nameOverride=frontend \
--set backend=http://backend.test:9898/echo \
--set canary.enabled=true \
--set canary.istioIngress.enabled=true \
--set canary.istioIngress.gateway=public-gateway.istio-system.svc.cluster.local \

Flagger takes a Kubernetes deployment and a horizontal pod autoscaler (HPA), then creates a series of objects (Kubernetes deployments, ClusterIP services and Istio virtual services). These objects expose the application on the mesh and drive the canary analysis and promotion.

# generated by Helm
# generated by Flagger

When the frontend-primary deployment comes online, Flagger will route all traffic to the primary pods and scale the frontend deployment to zero.

Open your browser and navigate to the frontend URL:

Podinfo Frontend

Now let's install the backend release without exposing it outside the mesh:

helm upgrade -i backend flagger/podinfo \
--namespace test \
--set nameOverride=backend \
--set canary.enabled=true \
--set canary.istioIngress.enabled=false

Check if Flagger has successfully deployed the canaries:

kubectl -n test get canaries
backend Initialized 0 2019-02-12T18:53:18Z
frontend Initialized 0 2019-02-12T17:50:50Z

Click on the ping button in the frontend UI to trigger an HTTP POST request that will reach the backend app:

Jaeger Tracing

We'll use the /echo endpoint (same as the one the ping button calls) to generate load on both apps during a canary deployment.


First let's install a load testing service that will generate traffic during analysis:

helm upgrade -i flagger-loadtester flagger/loadtester \
--namespace=test

Install Flagger's Helm test runner in the kube-system namespace using the tiller service account:

helm upgrade -i flagger-helmtester flagger/loadtester \
--namespace=kube-system \
--set serviceAccountName=tiller

Enable the load tester and the Helm tester, then deploy a new frontend version:

helm upgrade -i frontend flagger/podinfo \
--namespace test \
--reuse-values \
--set canary.loadtest.enabled=true \
--set canary.helmtest.enabled=true \
--set image.tag=3.1.1

Flagger detects that the deployment revision changed and starts the canary analysis:

kubectl -n istio-system logs deployment/flagger -f | jq .msg
New revision detected! Scaling up frontend.test
Halt advancement frontend.test waiting for rollout to finish: 0 of 2 updated replicas are available
Starting canary analysis for frontend.test
Pre-rollout check helm test passed
Advance frontend.test canary weight 5
Advance frontend.test canary weight 10
Advance frontend.test canary weight 15
Advance frontend.test canary weight 20
Advance frontend.test canary weight 25
Advance frontend.test canary weight 30
Advance frontend.test canary weight 35
Advance frontend.test canary weight 40
Advance frontend.test canary weight 45
Advance frontend.test canary weight 50
Copying frontend.test template spec to frontend-primary.test
Halt advancement frontend-primary.test waiting for rollout to finish: 1 old replicas are pending termination
Promotion completed! Scaling down frontend.test
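
The log above shows the canary weight moving from 5 to 50 in increments of 5. As a rough illustration of that schedule (a sketch only, not Flagger's implementation), the function below prints the weights for a given step and maximum. The function name `canary_weights` and its two arguments are made up for this example.

```shell
# Print the traffic weights a stepped rollout passes through.
# $1 = step size, $2 = maximum weight (both illustrative arguments,
# not values read from a real Canary object).
canary_weights() {
  step=$1
  max=$2
  # A step of zero or less would never reach the maximum.
  [ "$step" -gt 0 ] || return 1
  weight=$step
  out=""
  while [ "$weight" -le "$max" ]; do
    # Append the weight, separated by a space after the first entry.
    out="$out${out:+ }$weight"
    weight=$((weight + step))
  done
  echo "$out"
}

canary_weights 5 50
```

With a step of 5 and a maximum of 50 this yields the same ten weights that appear in the log.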

You can monitor the canary deployment with Grafana. Open the Flagger dashboard, select test from the namespace dropdown, frontend-primary from the primary dropdown and frontend from the canary dropdown.

Flagger Grafana Dashboard

Now trigger a canary deployment for the backend app, but this time you'll change a value in the configmap:

helm upgrade -i backend flagger/podinfo \
--namespace test \
--reuse-values \
--set canary.loadtest.enabled=true \
--set canary.helmtest.enabled=true \
--set httpServer.timeout=25s

Generate HTTP 500 errors:

kubectl -n test exec -it flagger-loadtester-xxx-yyy sh
watch curl http://backend-canary:9898/status/500

Generate latency:

kubectl -n test exec -it flagger-loadtester-xxx-yyy sh
watch curl http://backend-canary:9898/delay/1

Flagger detects the config map change and starts a canary analysis. Flagger will pause the advancement when the HTTP success rate drops under 99% or when the average request duration in the last minute is over 500ms:

kubectl -n test describe canary backend
ConfigMap backend has changed
New revision detected! Scaling up backend.test
Starting canary analysis for backend.test
Advance backend.test canary weight 5
Advance backend.test canary weight 10
Advance backend.test canary weight 15
Advance backend.test canary weight 20
Advance backend.test canary weight 25
Advance backend.test canary weight 30
Advance backend.test canary weight 35
Halt backend.test advancement success rate 62.50% < 99%
Halt backend.test advancement success rate 88.24% < 99%
Advance backend.test canary weight 40
Advance backend.test canary weight 45
Halt backend.test advancement request duration 2.415s > 500ms
Halt backend.test advancement request duration 2.42s > 500ms
Advance backend.test canary weight 50
ConfigMap backend-primary synced
Copying backend.test template spec to backend-primary.test
Promotion completed! Scaling down backend.test

Flagger Grafana Dashboard
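
The halt messages in the events come from two comparisons: success rate against 99% and request duration against 500ms. The sketch below mimics that decision for a single check. It is an illustration under assumptions, not Flagger's code: the function name `check_metrics` is hypothetical, and the duration is passed in milliseconds (so the 2.415s from the events becomes 2415).

```shell
# Decide whether one analysis check would advance or halt.
# $1 = success rate in percent, $2 = request duration in milliseconds,
# $3 = minimum success rate (default 99), $4 = maximum duration (default 500).
check_metrics() {
  awk -v rate="$1" -v dur="$2" -v min_rate="${3:-99}" -v max_dur="${4:-500}" 'BEGIN {
    if (rate + 0 < min_rate + 0) {
      printf "halt: success rate %.2f%% < %s%%\n", rate, min_rate
    } else if (dur + 0 > max_dur + 0) {
      printf "halt: request duration %sms > %sms\n", dur, max_dur
    } else {
      print "advance"
    }
  }'
}

check_metrics 62.50 120
check_metrics 99.9 2415
check_metrics 100 120
```

The success rate is checked first, so a check that violates both thresholds reports only the success rate.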

If the number of failed checks reaches the canary analysis threshold, the traffic is routed back to the primary, the canary is scaled to zero and the rollout is marked as failed.
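
To make the threshold rule concrete, here is a small sketch that walks a sequence of check results and reports the outcome. It assumes failed checks are counted cumulatively over the analysis. The function name `rollout_outcome` and the "ok"/"fail" encoding are invented for illustration.

```shell
# Report whether a rollout would be promoted or rolled back.
# $1 = number of failed checks that triggers a rollback,
# remaining arguments = check results, each "ok" or "fail".
rollout_outcome() {
  threshold=$1
  shift
  failed=0
  for result in "$@"; do
    if [ "$result" = "fail" ]; then
      failed=$((failed + 1))
      # Stop as soon as the failed check count reaches the threshold.
      if [ "$failed" -ge "$threshold" ]; then
        echo "failed: rolled back after $failed failed checks"
        return 0
      fi
    fi
  done
  echo "succeeded: promoted with $failed failed checks"
}

rollout_outcome 5 ok fail fail ok fail fail ok
rollout_outcome 2 ok fail fail ok
```

The first call tolerates four failed checks and still promotes, which matches the backend events where four halts did not prevent promotion.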

kubectl -n test get canary
backend Succeeded 0 2019-02-12T19:33:11Z
frontend Failed 0 2019-02-12T19:47:20Z
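
If you want to act on this listing from a script, one option is to filter it on the status column. The sketch below assumes the column order shown above (name, status, weight, timestamp). The helper name `failed_canaries` is hypothetical, and the sample input is copied from the listing.

```shell
# Print the names of canaries whose second column reads "Failed".
failed_canaries() {
  awk '$2 == "Failed" { print $1 }'
}

# Feed the two lines from the listing above through the filter.
printf '%s\n' \
  'backend Succeeded 0 2019-02-12T19:33:11Z' \
  'frontend Failed 0 2019-02-12T19:47:20Z' | failed_canaries
```

On a live cluster you could pipe `kubectl -n test get canary` into the same function.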

If you've enabled the Slack notifications, you'll receive an alert with the reason why the canary promotion failed.

GitOps automation

Instead of using the Helm CLI from a CI tool to perform the install and upgrade, you could use a Git-based approach. GitOps is a way to do Continuous Delivery that uses Git as the source of truth for declarative infrastructure and workloads. In the GitOps model, any change to production must be committed to source control before it is applied to the cluster. This way, Git provides rollback and audit logs.

Helm GitOps Canary Deployment

To apply the GitOps pipeline model to Flagger canary deployments, you'll need a Git repository with your workload definitions in YAML format, a container registry where your CI system pushes immutable images, and an operator that synchronizes the Git repo with the cluster state.

Create a Git repository with the following content:

├── namespaces
│   └── test.yaml
└── releases
    └── test
        ├── backend.yaml
        ├── frontend.yaml
        ├── loadtester.yaml
        └── helmtester.yaml
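
A quick way to scaffold this layout is sketched below. It assumes the four release files sit under releases/test, and it works in a scratch directory created with `mktemp -d` so your working tree is not touched. The variable name `repo_dir` is arbitrary.

```shell
# Create the layout in a throwaway directory; point repo_dir at your
# own clone instead when you are ready to commit the files.
repo_dir=$(mktemp -d)

mkdir -p "$repo_dir/namespaces" "$repo_dir/releases/test"
touch "$repo_dir/namespaces/test.yaml"

# One empty manifest per release, to be filled in later.
for name in backend frontend loadtester helmtester; do
  touch "$repo_dir/releases/test/$name.yaml"
done

# List what was created.
find "$repo_dir" -type f | sort
```

The files start out empty. The HelmRelease definitions go into the files under releases/test.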

Define the frontend release using Flux HelmRelease custom resource:

kind: HelmRelease
metadata:
  name: frontend
  namespace: test
  annotations:
    fluxcd.io/automated: "true"
    filter.fluxcd.io/chart-image: semver:~3.1
spec:
  releaseName: frontend
  chart:
    ref: master
    path: charts/podinfo
  values:
    image:
      repository: stefanprodan/podinfo
      tag: 3.1.0
    backend: http://backend-podinfo:9898/echo
    canary:
      enabled: true
      istioIngress:
        enabled: true
        gateway: public-gateway.istio-system.svc.cluster.local
      loadtest:
        enabled: true
      helmtest:
        enabled: true

In the chart section I've defined the release source by specifying the Helm repository (hosted on GitHub Pages), chart name and version. In the values section I've overridden the defaults set in values.yaml.

With the annotations I instruct Flux to automate this release. When an image tag matching the semver range ~3.1 (3.1.0 or later, but below 3.2.0) is pushed to Docker Hub, Flux will upgrade the Helm release; Flagger will then pick up the change and start a canary deployment.
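
Flux does the tag matching itself. The sketch below only illustrates which tags a ~3.1 filter would admit, assuming the usual tilde rule (same major and minor version, any patch). The function name `in_tilde_range` is hypothetical, and pre-release tags are simply rejected to keep the example short.

```shell
# Succeed when a plain MAJOR.MINOR.PATCH tag has the given major and minor.
# $1 = image tag, $2 = major of the range, $3 = minor of the range.
in_tilde_range() {
  tag=$1
  range_major=$2
  range_minor=$3
  case $tag in
    # Reject anything that is not exactly three dot-separated numbers.
    *[!0-9.]*|*.*.*.*|.*|*.|*..*) return 1 ;;
    *.*.*) ;;
    *) return 1 ;;
  esac
  major=${tag%%.*}
  rest=${tag#*.}
  minor=${rest%%.*}
  [ "$major" -eq "$range_major" ] && [ "$minor" -eq "$range_minor" ]
}

for tag in 3.1.0 3.1.1 3.2.0; do
  if in_tilde_range "$tag" 3 1; then
    echo "$tag: inside ~3.1"
  else
    echo "$tag: outside ~3.1"
  fi
done
```

Under this rule a 3.1.1 patch release triggers an upgrade, while 3.2.0 is ignored until the filter is changed.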

Install Flux and its Helm Operator by specifying your Git repo URL:

helm repo add fluxcd
helm install --name flux \
--namespace fluxcd \
--set git.url=git@github.com:YOUR_USERNAME/YOUR_REPOSITORY \
fluxcd/flux

helm upgrade -i helm-operator fluxcd/helm-operator \
--namespace fluxcd \
--set git.ssh.secretName=flux-git-deploy

At startup Flux generates an SSH key and logs the public key. Find the SSH public key with:

kubectl -n fluxcd logs deployment/flux | grep identity.pub | cut -d '"' -f2
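
The last stage of that pipeline, `cut -d '"' -f2`, pulls out the first double-quoted value on a line. The sketch below shows the effect on a generic sample line. The line is invented for illustration and is not real Flux output.

```shell
# A made-up line with one double-quoted value.
sample_line='level=info key="the quoted value" trailing=text'

# Splitting on double quotes gives three fields:
#   1: everything before the opening quote
#   2: the quoted value
#   3: everything after the closing quote
echo "$sample_line" | cut -d '"' -f2
```

Because only field 2 is kept, the surrounding log metadata is dropped and the quoted value alone remains.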

In order to sync your cluster state with Git you need to copy the public key and create a deploy key with write access on your GitHub repository.

Open GitHub, navigate to your fork, go to Settings > Deploy keys, click Add deploy key, check Allow write access, paste the Flux public key and click Add key.

After a couple of seconds Flux will apply the Kubernetes resources from Git and Flagger will launch the frontend and backend apps.

A CI/CD pipeline for the frontend release could look like this:

  • cut a release from the master branch of the podinfo code repo with the git tag 3.1.1

  • CI builds the image and pushes the podinfo:3.1.1 image to the container registry

  • Flux scans the registry and updates the Helm release image.tag to 3.1.1

  • Flux commits and pushes the change to the cluster repo

  • Flux applies the updated Helm release on the cluster

  • Flux Helm Operator picks up the change and calls Tiller to upgrade the release

  • Flagger detects a revision change and scales up the frontend deployment

  • Flagger runs the helm test before routing traffic to the canary service

  • Flagger starts the load test and runs the canary analysis

  • Based on the analysis result the canary deployment is promoted to production or rolled back

  • Flagger sends a Slack or MS Teams notification with the canary result

If the canary fails, fix the bug and do another patch release (e.g. 3.1.2); the whole process will run again.

A canary deployment can fail due to any of the following reasons:

  • the container image can't be downloaded

  • the deployment replica set is stuck for more than ten minutes (e.g. due to a container crash loop)

  • the webhooks (acceptance tests, helm tests, load tests, etc.) are returning a non-2xx response

  • the HTTP success rate (non-5xx responses) metric drops under the threshold

  • the HTTP average duration metric goes over the threshold

  • the Istio telemetry service is unable to collect traffic metrics

  • the metrics server (Prometheus) can't be reached
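
The webhook condition in the list above comes down to a status code check. As a minimal sketch (the helper name `webhook_passed` is made up, and this is not how Flagger itself is written), a code counts as a pass only when it is in the 200 to 299 range:

```shell
# Succeed only for HTTP status codes from 200 to 299.
webhook_passed() {
  case $1 in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

for code in 200 204 404 500; do
  if webhook_passed "$code"; then
    echo "$code: check passed"
  else
    echo "$code: check failed"
  fi
done
```

Note that a 4xx response fails the webhook check even though it does not count against the HTTP success rate metric, which only looks at 5xx responses.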

If you want to find out more about managing Helm releases with Flux, here are two in-depth guides: gitops-helm and gitops-istio.