Potential Failure Scenarios and Their Mitigations

In Step 3 above, the nodes are drained. If any deployed entity prevents a node from draining successfully, the upgrade may fail with a timeout.

The AppViewX-deployed components that can cause these failures, and their mitigations, are as follows:

PDB (Pod Disruption Budget)

Failure Cause: AppViewX infrastructure components are deployed with Pod Disruption Budgets (PDBs) so that a minimum number of replicas of each critical component remains available, which maintains application availability. These PDBs are configured with a strict policy and have an “allowed disruption” of 1, so during node draining they may block the process if the required number of replicas is not yet online. The default PDBs are as follows:
Namespace  Name                         Min Available  Max Unavailable  Allowed Disruption
avx-kafka  avx-kafka-cluster-kafka      2              NA               1
avx-kafka  avx-kafka-cluster-zookeeper  2              NA               1
avx        consul-consul-server         NA             1                1
avx        vault / openbao              NA             1                1
Note: Depending on the AppViewX version, some of the PDBs in the table above (e.g., consul) may not be present. Ignore any that are not present.
Mitigation: To mitigate this potential upgrade issue, temporarily set Min Available to 1 or Max Unavailable to 2, whichever field the PDB uses. This raises the allowed disruption to 2, which gives the draining process extra room to complete instead of timing out while it waits for enough replicas to come online. Execute the command:
kubectl edit pdb <pdb-name> -n <namespace>
In the editor that opens, set spec.minAvailable or spec.maxUnavailable to the value prescribed above, then save and exit.
Note: Revert the values to their defaults once the cluster upgrade is complete.
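As a non-interactive alternative to kubectl edit, the same change can be sketched with kubectl patch. The helper name set_pdb_field and the KUBECTL override variable are assumptions made for illustration and are not part of AppViewX; the PDB name and namespace are taken from the table above. It may be worth confirming with kubectl get pdb -n <namespace> that the change took effect before starting the upgrade.

```shell
#!/bin/bash
# Hypothetical helper (illustrative only): patch one field of a PDB spec.
# Usage: set_pdb_field <namespace> <pdb-name> <minAvailable|maxUnavailable> <value>
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the command
# without touching the cluster.
set_pdb_field() {
  ns=$1; name=$2; field=$3; value=$4
  ${KUBECTL:-kubectl} patch pdb "$name" -n "$ns" --type merge \
    -p "{\"spec\":{\"$field\":$value}}"
}

# Before the upgrade (uncomment to run):
# set_pdb_field avx-kafka avx-kafka-cluster-kafka minAvailable 1
# After the upgrade, restore the default from the table:
# set_pdb_field avx-kafka avx-kafka-cluster-kafka minAvailable 2
```

Keeping the relax and revert steps as two calls to the same helper makes it harder to forget the revert, since both values sit side by side in one script.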

Pods with Istio Sidecar Termination Seconds Configured

Failure Cause: AppViewX deploys specific pods configured to keep the sidecar container alive for up to 1.5 hours after pod termination. This setup ensures that certain jobs complete successfully, even when pods are terminated due to external events like scaling. However, this behavior can delay the node draining process and may lead to upgrade timeouts.

Mitigation: To mitigate this, keep track of pods that are in the Terminating state and forcefully delete them.
  • To check the terminating pods, execute the command
    kubectl get pods -n avx | grep -i terminating
  • To delete the terminating pods, execute the command
    kubectl delete pod $(kubectl get pods -n avx | grep -i terminating | awk '{print $1}') -n avx --force --grace-period=0

Monitor the upgrade process. When a node is being drained, i.e., the node is in the SchedulingDisabled state, run the above commands to find and forcefully delete the terminating pods.
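To spot the moment a node enters the scheduling-disabled state, the node list can be filtered on its STATUS column. The following is a minimal sketch; the helper name cordoned_nodes is an assumption for illustration, and it expects the default column layout of kubectl get nodes (NAME first, STATUS second).

```shell
#!/bin/bash
# Hypothetical helper (illustrative only): read `kubectl get nodes --no-headers`
# output on stdin and print the names of nodes whose STATUS column
# contains SchedulingDisabled.
cordoned_nodes() {
  awk '$2 ~ /SchedulingDisabled/ {print $1}'
}

# Usage (uncomment to run against the cluster):
# kubectl get nodes --no-headers | cordoned_nodes
```

An empty result means no node is currently cordoned, so there is nothing to clean up yet.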

Alternatively, run the following bash script while the upgrade is in progress; it periodically force-deletes the terminating pods. Stop the script once the upgrade is complete.
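#!/bin/bash

# Every 30 seconds, force-delete any pod in the avx namespace that is
# stuck in the Terminating state. Stop with Ctrl+C after the upgrade.
while true; do
  # Collect the names of pods whose listing line mentions Terminating
  pods=$(kubectl get pods -n avx | grep -i Terminating | awk '{print $1}')
  if [ -n "$pods" ]; then
    for pod in $pods; do
      echo "Force deleting pod: $pod"
      # Skip the grace period so the sidecar cannot hold the pod open
      kubectl delete pod "$pod" -n avx --force --grace-period=0
    done
  fi
  sleep 30
done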
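The grep -i Terminating filter used above matches the word anywhere in the line, so it would also pick up a pod whose name happens to contain "terminating". A slightly stricter filter, sketched below, compares only the STATUS column. The helper name terminating_pods is an assumption for illustration, and it relies on the default column order of kubectl get pods (NAME, READY, STATUS).

```shell
#!/bin/bash
# Hypothetical helper (illustrative only): read `kubectl get pods --no-headers`
# output on stdin and print the names of pods whose STATUS column
# (third field) is exactly Terminating.
terminating_pods() {
  awk '$3 == "Terminating" {print $1}'
}

# Usage (uncomment to run against the cluster):
# kubectl get pods -n avx --no-headers | terminating_pods
```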