Potential Failure Scenarios and their Mitigations
In Step 3 above, the nodes are drained. If any deployed workload prevents a node from draining successfully, the upgrade may fail with a timeout.
The AppViewX-deployed components that can cause these failures, and their mitigations, are as follows:
PDB (Pod Disruption Budget)
| Namespace | Name | Min Available | Max Unavailable | Allowed Disruption |
|---|---|---|---|---|
| avx-kafka | avx-kafka-cluster-kafka | 2 | NA | 1 |
| avx-kafka | avx-kafka-cluster-zookeeper | 2 | NA | 1 |
| avx | consul-consul-server | NA | 1 | 1 |
| avx | vault / openbao | NA | 1 | 1 |
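
Before the upgrade, it can help to confirm that none of these PDBs is currently blocking evictions. The following is a minimal sketch, not an AppViewX-provided tool: the helper name `blocking_pdbs` is hypothetical, and it assumes the default table output of `kubectl get pdb -A`, where the fifth column of each data row is the allowed disruption count.

```shell
#!/bin/bash
# Sketch: list PDBs that currently allow zero disruptions.
# Reads the default table output of `kubectl get pdb -A` on stdin.
# Assumed data-row columns:
#   namespace, name, min available, max unavailable, allowed disruptions, age
blocking_pdbs() {
  # Skip the header row; print namespace/name when column 5 is 0.
  awk 'NR > 1 && $5 == "0" { print $1 "/" $2 }'
}

# Typical use (requires cluster access):
#   kubectl get pdb -A | blocking_pdbs
```

Any PDB the helper prints would refuse an eviction at that moment, so a drain of a node hosting one of its pods is likely to stall until the budget is adjusted or the pods become healthy.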
To edit a PDB, execute the command
kubectl edit pdb <pdb-name> -n <namespace>
Then set Min Available or Max Unavailable to the prescribed values.
Pods with Istio Sidecar Termination Seconds Configured
Failure Cause: AppViewX deploys specific pods configured to keep the sidecar container alive for up to 1.5 hours after pod termination. This setup ensures that certain jobs complete successfully, even when pods are terminated due to external events like scaling. However, this behavior can delay the node draining process and may lead to upgrade timeouts.
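
To see which pods may be affected, one option is to look for unusually long termination grace periods. This is a sketch under an assumption: that the 1.5-hour window shows up in each pod's `terminationGracePeriodSeconds` (Kubernetes force-kills containers once that period elapses, so it should be at least 5400 seconds for such pods). The helper name `long_grace_pods` is hypothetical.

```shell
#!/bin/bash
# Sketch: list pods whose termination grace period is at least a threshold.
# Reads two-column output (NAME, GRACE) on stdin, as produced by:
#   kubectl get pods -n avx \
#     -o custom-columns=NAME:.metadata.name,GRACE:.spec.terminationGracePeriodSeconds
# The threshold defaults to 5400 seconds (1.5 hours); this default is an assumption.
long_grace_pods() {
  awk -v min="${1:-5400}" 'NR > 1 && ($2 + 0) >= (min + 0) { print $1, $2 }'
}

# Typical use (requires cluster access):
#   kubectl get pods -n avx \
#     -o custom-columns=NAME:.metadata.name,GRACE:.spec.terminationGracePeriodSeconds \
#     | long_grace_pods
```

Pods listed here are the ones most likely to linger in the Terminating state while a node is being drained.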
- To check the terminating pods, execute the command
  kubectl get pods -n avx | grep -i terminating
- To delete the terminating pods, execute the command
  kubectl delete pod $(kubectl get pods -n avx | grep -i terminating | awk '{print $1}') -n avx --force
Monitor the upgrade process. When a node is being drained (that is, the node is in the SchedulingDisabled state), run the above commands to find and force-delete the terminating pods.
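
To spot the moment a node enters that state, the node list can be filtered on its STATUS column. The following is a minimal sketch; the helper name `cordoned_nodes` is hypothetical, and it assumes the default table output of `kubectl get nodes`, where the second column is the status (for example `Ready,SchedulingDisabled`).

```shell
#!/bin/bash
# Sketch: print the names of nodes that are cordoned (scheduling disabled).
# Reads the default table output of `kubectl get nodes` on stdin.
cordoned_nodes() {
  # Skip the header row; match the STATUS column.
  awk 'NR > 1 && $2 ~ /SchedulingDisabled/ { print $1 }'
}

# Typical use (requires cluster access): show terminating avx pods
# on each node that is currently being drained.
#   for node in $(kubectl get nodes | cordoned_nodes); do
#     kubectl get pods -n avx -o wide --field-selector spec.nodeName="$node" \
#       | grep -i terminating
#   done
```

Restricting the check to cordoned nodes keeps the focus on pods that are actually holding up the drain.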
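To automate the force deletion, the following script can be left running during the upgrade. It checks the avx namespace every 30 seconds and force-deletes any terminating pods:
#!/bin/bash
# Every 30 seconds, force-delete pods stuck in Terminating in the avx namespace.
while true; do
  pods=$(kubectl get pods -n avx | grep -i Terminating | awk '{print $1}')
  if [ -n "$pods" ]; then
    for pod in $pods; do
      echo "Force deleting pod: $pod"
      kubectl delete pod "$pod" -n avx --force --grace-period=0
    done
  fi
  sleep 30
done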
