# Etcd Recovery

## Overview
Etcd pods for hosted clusters run as part of a StatefulSet (`etcd`). The StatefulSet relies on persistent storage to store etcd data per member. In the case of a HighlyAvailable control plane, the StatefulSet has 3 replicas and each member (`etcd-N`) has its own PersistentVolumeClaim (`data-etcd-N`).
## Automatic Recovery of Removed Members
In certain circumstances, an etcd member is removed from the cluster. This could be due to networking issues (SDN or DNS) on the management cluster. The HyperShift operator can automatically recover from this situation when automatic etcd recovery is enabled (`--enable-etcd-recovery`), which is set to true by default.
If recovery is enabled, the HyperShift operator attempts to recover the health of an etcd cluster when all of the following conditions are met (a quick way to check them is sketched after the list):
- The hosted cluster is configured to run HighlyAvailable (`spec.controllerAvailabilityPolicy = HighlyAvailable`)
- Etcd is managed by HyperShift (`spec.etcd.managementType = Managed`)
- Only one member of the etcd cluster is failing (quorum is not lost)
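To confirm whether a hosted cluster satisfies the first two conditions, you can inspect its spec directly. A minimal sketch, assuming the `CLUSTER_NAME` and `CLUSTER_NAMESPACE` variables described later in this document:

```
# Print the availability policy and the etcd management type of the hosted cluster
oc get -n ${CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} \
  -o jsonpath='{.spec.controllerAvailabilityPolicy}{"\n"}{.spec.etcd.managementType}{"\n"}'
```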
The recovery procedure consists of the following:
* If a member has been removed from the etcd cluster, re-add the missing member by executing the `member add` command
* Delete the etcd member's pod and PVC

Once this is done, the `reset-member` init container of the re-created pod should be able to complete the recovery.
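For reference, the manual equivalent of this procedure looks roughly like the following. This is a sketch only; it assumes the failed member is `etcd-2` and reuses the etcdctl certificate paths and peer URL form shown later in this document:

```
# Re-add the missing member from a healthy etcd pod (assumes etcd-2 was removed)
oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd etcd-0 -- env ETCDCTL_API=3 /usr/bin/etcdctl \
  --cacert /etc/etcd/tls/etcd-ca/ca.crt \
  --cert /etc/etcd/tls/client/etcd-client.crt \
  --key /etc/etcd/tls/client/etcd-client.key \
  --endpoints=https://localhost:2379 \
  member add etcd-2 --peer-urls=https://etcd-2.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380

# Delete the failed member's pod and PVC; the reset-member init container of the
# re-created pod completes the recovery
oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-2 pod/etcd-2 --wait=false
```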
To disable this default behavior, install HyperShift with `--enable-etcd-recovery=false`.
## Checking cluster health
Execute into a running etcd pod:

```
$ oc rsh -n ${CONTROL_PLANE_NAMESPACE} -c etcd etcd-0
```
Set up the etcdctl environment:

```
export ETCDCTL_API=3
export ETCDCTL_CACERT=/etc/etcd/tls/etcd-ca/ca.crt
export ETCDCTL_CERT=/etc/etcd/tls/client/etcd-client.crt
export ETCDCTL_KEY=/etc/etcd/tls/client/etcd-client.key
export ETCDCTL_ENDPOINTS=https://etcd-client:2379
```
Print out endpoint health for each cluster member:

```
etcdctl endpoint health --cluster -w table
```
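From the same shell, you can also list the current members, which makes it easier to spot one that has been removed from the cluster:

```
# List etcd members known to the cluster
etcdctl member list -w table
```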
## Single Node Recovery
If a single etcd member of a 3-node cluster has corrupted data, it will most likely start crash looping, as in:
```
$ oc get pods -l app=etcd -n ${CONTROL_PLANE_NAMESPACE}
NAME     READY   STATUS             RESTARTS     AGE
etcd-0   2/2     Running            0            64m
etcd-1   2/2     Running            0            45m
etcd-2   1/2     CrashLoopBackOff   1 (5s ago)   64m
```
To recover the etcd member, delete its PersistentVolumeClaim (`data-etcd-N`) as well as the pod (`etcd-N`):

```
oc delete pvc/data-etcd-2 pod/etcd-2 --wait=false
```
When the pod restarts, the member should get re-added to the etcd cluster and become healthy again:
```
$ oc get pods -l app=etcd -n $CONTROL_PLANE_NAMESPACE
NAME     READY   STATUS    RESTARTS   AGE
etcd-0   2/2     Running   0          67m
etcd-1   2/2     Running   0          48m
etcd-2   2/2     Running   0          2m2s
```
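To double-check that the recovered member is healthy, you can re-run the health check from the "Checking cluster health" section above, or run it in one shot from outside the pod. A sketch that reuses the same certificates and client endpoint:

```
# Report health of every member of the etcd cluster
oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd etcd-0 -- env ETCDCTL_API=3 /usr/bin/etcdctl \
  --cacert /etc/etcd/tls/etcd-ca/ca.crt \
  --cert /etc/etcd/tls/client/etcd-client.crt \
  --key /etc/etcd/tls/client/etcd-client.key \
  --endpoints=https://etcd-client:2379 \
  endpoint health --cluster -w table
```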
## Recovery from Quorum Loss
If multiple members of the etcd cluster have lost data or are in a crashloop state, then etcd must be restored from a snapshot. The following procedure requires downtime for the control plane while the etcd database is restored.
NOTE: The following instructions require the `oc` and `jq` binaries.
- Set up environment variables that point to your hosted cluster:

```
CLUSTER_NAME=my-cluster
CLUSTER_NAMESPACE=clusters
CONTROL_PLANE_NAMESPACE="${CLUSTER_NAMESPACE}-${CLUSTER_NAME}"
```
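As an optional sanity check that the variables resolve to an existing hosted cluster and its control plane namespace:

```
oc get -n ${CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME}
oc get namespace ${CONTROL_PLANE_NAMESPACE}
```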
- Pause reconciliation of the HostedCluster (CLUSTER_NAME and CLUSTER_NAMESPACE should already be set to values that correspond to your hosted cluster):

```
oc patch -n ${CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} -p '{"spec":{"pausedUntil":"true"}}' --type=merge
```
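If you want to confirm that reconciliation is paused before proceeding, reading the field back is a simple check:

```
# Should print "true"
oc get -n ${CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} -o jsonpath='{.spec.pausedUntil}{"\n"}'
```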
- Scale down API servers:

```
oc scale -n ${CONTROL_PLANE_NAMESPACE} deployment/kube-apiserver --replicas=0
oc scale -n ${CONTROL_PLANE_NAMESPACE} deployment/openshift-apiserver --replicas=0
oc scale -n ${CONTROL_PLANE_NAMESPACE} deployment/openshift-oauth-apiserver --replicas=0
```
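Before taking the snapshot, you may want to wait until the API server pods have terminated. One simple check, using the same deployment names as above (all three should report 0/0 ready replicas):

```
oc get -n ${CONTROL_PLANE_NAMESPACE} deployment/kube-apiserver deployment/openshift-apiserver deployment/openshift-oauth-apiserver
```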
- Take a snapshot of etcd data using one of the following methods:

  a. Use a previously backed up snapshot.

  b. Take a snapshot from a running etcd pod (PREFERRED, but requires an available etcd pod):

```
# List etcd pods
oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd

# If a pod is available:
# 1. Take a snapshot of its database and save it locally.
#    Set ETCD_POD to the name of the pod that is available.
ETCD_POD=etcd-0
oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_POD} -- env ETCDCTL_API=3 /usr/bin/etcdctl \
  --cacert /etc/etcd/tls/etcd-ca/ca.crt \
  --cert /etc/etcd/tls/client/etcd-client.crt \
  --key /etc/etcd/tls/client/etcd-client.key \
  --endpoints=https://localhost:2379 \
  snapshot save /var/lib/snapshot.db

# 2. Verify that the snapshot is good
oc exec -n ${CONTROL_PLANE_NAMESPACE} -c etcd -t ${ETCD_POD} -- env ETCDCTL_API=3 /usr/bin/etcdctl -w table snapshot status /var/lib/snapshot.db

# 3. Make a local copy of the snapshot
oc cp -c etcd ${CONTROL_PLANE_NAMESPACE}/${ETCD_POD}:/var/lib/snapshot.db /tmp/etcd.snapshot.db
```
  c. Make a copy of the snapshot db from etcd persistent storage:

```
# List etcd pods
oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd

# Find a pod that is running and set its name as the value of ETCD_POD
ETCD_POD=etcd-0

# Copy the snapshot db from it
oc cp -c etcd ${CONTROL_PLANE_NAMESPACE}/${ETCD_POD}:/var/lib/data/member/snap/db /tmp/etcd.snapshot.db
```
- Scale down the etcd statefulset:

```
oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=0
```
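Wait for the etcd pods to terminate before touching the volumes; for example:

```
# Should eventually return no pods
oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd
```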
- Delete volumes for 2nd and 3rd members:

```
oc delete -n ${CONTROL_PLANE_NAMESPACE} pvc/data-etcd-1 pvc/data-etcd-2
```
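Only the first member's claim (`data-etcd-0`) should remain afterwards; you can confirm with:

```
oc get -n ${CONTROL_PLANE_NAMESPACE} pvc
```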
- Create a pod to access the first etcd member's data:

```
# Save etcd image
ETCD_IMAGE=$(oc get -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd -o jsonpath='{ .spec.template.spec.containers[0].image }')

# Create a deployment whose pod will allow access to etcd data:
cat << EOF | oc apply -n ${CONTROL_PLANE_NAMESPACE} -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd-data
spec:
  replicas: 1
  selector:
    matchLabels:
      app: etcd-data
  template:
    metadata:
      labels:
        app: etcd-data
    spec:
      containers:
      - name: access
        image: $ETCD_IMAGE
        volumeMounts:
        - name: data
          mountPath: /var/lib
        command:
        - /usr/bin/bash
        args:
        - -c
        - |-
          while true; do
            sleep 1000
          done
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: data-etcd-0
EOF
```
- Clear previous data and restore the snapshot:

```
# Wait for the etcd-data pod to start running
oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd-data

# Get the name of the etcd-data pod
DATA_POD=$(oc get -n ${CONTROL_PLANE_NAMESPACE} pods --no-headers -l app=etcd-data -o name | cut -d/ -f2)

# Copy local snapshot into the pod
oc cp /tmp/etcd.snapshot.db ${CONTROL_PLANE_NAMESPACE}/${DATA_POD}:/var/lib/restored.snap.db

# Remove old data
oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- rm -rf /var/lib/data
oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- mkdir -p /var/lib/data

# Restore snapshot
oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- etcdutl snapshot restore /var/lib/restored.snap.db \
  --data-dir=/var/lib/data --skip-hash-check \
  --name etcd-0 \
  --initial-cluster-token=etcd-cluster \
  --initial-cluster etcd-0=https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-1=https://etcd-1.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380,etcd-2=https://etcd-2.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380 \
  --initial-advertise-peer-urls https://etcd-0.etcd-discovery.${CONTROL_PLANE_NAMESPACE}.svc:2380

# Remove snapshot from etcd-0 data directory
oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- rm /var/lib/restored.snap.db
```
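Optionally, before tearing down the access pod, you can confirm that the restore produced a member directory. A sketch, assuming the restored data directory contains the usual `snap` and `wal` subdirectories:

```
oc exec -n ${CONTROL_PLANE_NAMESPACE} ${DATA_POD} -- ls /var/lib/data/member
```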
- Delete the data access deployment:

```
oc delete -n ${CONTROL_PLANE_NAMESPACE} deployment/etcd-data
```
- Scale up the etcd cluster:

```
oc scale -n ${CONTROL_PLANE_NAMESPACE} statefulset/etcd --replicas=3
```

Wait for all etcd member pods to come up and report available:

```
oc get -n ${CONTROL_PLANE_NAMESPACE} pods -l app=etcd -w
```
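If you prefer to block until all members are ready rather than watching, something like `oc wait` works as well (a sketch; the timeout value is arbitrary):

```
oc wait -n ${CONTROL_PLANE_NAMESPACE} --for=condition=Ready pod -l app=etcd --timeout=5m
```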
- Scale the API servers back up:

```
oc scale -n ${CONTROL_PLANE_NAMESPACE} deployment/kube-apiserver --replicas=3
oc scale -n ${CONTROL_PLANE_NAMESPACE} deployment/openshift-apiserver --replicas=3
oc scale -n ${CONTROL_PLANE_NAMESPACE} deployment/openshift-oauth-apiserver --replicas=3
```
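To wait for the API servers to become available again, running `oc rollout status` against each deployment is one option:

```
oc rollout status -n ${CONTROL_PLANE_NAMESPACE} deployment/kube-apiserver
oc rollout status -n ${CONTROL_PLANE_NAMESPACE} deployment/openshift-apiserver
oc rollout status -n ${CONTROL_PLANE_NAMESPACE} deployment/openshift-oauth-apiserver
```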
- Remove the hosted cluster pause:

```
oc patch -n ${CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME} -p '{"spec":{"pausedUntil":""}}' --type=merge
```
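Finally, once reconciliation resumes, a quick way to watch the hosted cluster settle back into a healthy state is to check its status:

```
oc get -n ${CLUSTER_NAMESPACE} hostedclusters/${CLUSTER_NAME}
```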