Disaster Recovery Plan

This section outlines the disaster recovery options available for AppViewX deployments. It describes recovery approaches designed to protect data and restore services in the event of a complete outage, where the primary data center or the deployed virtual machines cannot be recovered.

The procedures described in this document apply to scenarios involving unrecoverable infrastructure failures. For temporary outages, where services can be restored by bringing the existing virtual machines or infrastructure back online, the approaches described here are typically not required.

In addition to full cluster outages, this document also addresses partial infrastructure failures, such as operating system corruption, node disk failure, or scenarios where an individual cluster node becomes unrecoverable. Recovery methods for these situations depend on the availability of virtual machine snapshots and application-level backups. The approaches described here cover both infrastructure recovery and application data restoration to support service continuity.

Note: In Kubernetes-based environments, node-specific information—including certificates, runtime metadata, and configuration files—is stored on the virtual machine filesystem. If both the VM disk and its snapshot are lost, reconstructing the node state using only application-level backups may not be reliable. In such cases, rebuilding the cluster and restoring application data from backups provides a more predictable and consistent recovery approach.

Approach 1 - VM Snapshot Based Recovery

  1. As part of this disaster recovery approach, periodic snapshots of the virtual machines hosting the AppViewX nodes will be taken at a configurable interval, with a recommended default of every 2 hours.
  2. The snapshot interval can be customized by the customer based on business requirements, infrastructure capacity, and acceptable data loss tolerance.
  3. In the event of a cluster failure or complete outage, the most recent successful snapshot can be used to restore the system and recover the cluster to a previously stable state. In this approach, VM snapshots must be taken for all nodes in the AppViewX cluster and restored together during a disaster recovery scenario.
  4. Changes made after the most recent snapshot are not captured, so data loss of up to one snapshot interval is possible (for example, up to 2 hours when snapshots are taken every 2 hours).
  5. This approach provides the fastest recovery of the backup-based options, because restoring the snapshots brings the AppViewX cluster back online without reinstallation or data restore steps.
  6. This approach can also be used to recover from partial VM failures, such as node disk corruption, operating system failure, or situations where an individual node becomes unavailable. In such cases, the affected virtual machine can be restored from the most recent snapshot, allowing the node to return to its previous operational state and rejoin the AppViewX cluster.
  7. The backup and restore mechanism is implemented at the virtual machine level. The customer’s infrastructure team is responsible for configuring, monitoring, and maintaining the snapshot schedules, retention, storage, and restoration procedures.
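
Because the snapshots of all nodes must be restored together, the restore point is the newest snapshot time that exists for every node. The following is a minimal sketch of that selection and of the resulting data loss window. It assumes each snapshot is tagged with its scheduled time; the node names, timestamps, and function names are hypothetical and not part of any AppViewX tooling.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def latest_common_snapshot(
    snapshots_by_node: Dict[str, List[datetime]],
) -> Optional[datetime]:
    """Return the newest snapshot time that exists for every node, or None."""
    if not snapshots_by_node:
        return None
    # Only a timestamp present on all nodes can be restored as one set.
    common = set.intersection(*(set(times) for times in snapshots_by_node.values()))
    return max(common) if common else None


def data_loss_window(snapshot_time: datetime, failure_time: datetime) -> timedelta:
    """Data written between the snapshot and the failure is lost on restore."""
    return failure_time - snapshot_time


# Hypothetical three-node cluster; node2 missed the 04:00 snapshot.
t0, t2, t4 = (datetime(2024, 1, 1, h) for h in (0, 2, 4))
snapshots = {"node1": [t0, t2, t4], "node2": [t0, t2], "node3": [t0, t2, t4]}

restore_point = latest_common_snapshot(snapshots)
print(restore_point)
print(data_loss_window(restore_point, datetime(2024, 1, 1, 4, 30)))
```

In this example the restore point falls back to 02:00 because one node has no 04:00 snapshot, so a failure at 04:30 loses 2.5 hours of data rather than 30 minutes. This is why monitoring snapshot success on every node matters as much as the interval itself.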

Approach 2 - VM Snapshot with Application-Level Backup Restore

Prerequisites
  1. Regular application-level backups of MongoDB and Vault must be configured using scheduled cron jobs. The customer can choose and configure the backup frequency based on their requirements.
  2. The backup files for MongoDB and Vault must be securely transferred and stored in a customer-managed remote location to ensure availability during disaster recovery scenarios.
  3. The remote location should be accessible during recovery and have sufficient retention and storage capacity as per customer requirements.
Recovery Approach
  1. In this approach, virtual machine snapshots are taken at longer intervals (for example, weekly, monthly, or after any patch or upgrade), or a single snapshot of the cluster in a known stable state is retained for restoration purposes.
  2. In the event of a cluster failure or complete outage, the VM snapshot can be used to restore the cluster to a working but potentially outdated state.
  3. To recover the cluster with the latest available data, the stored backup files of MongoDB and Vault are restored on top of the recovered cluster. This process brings the system back to the state captured by the most recent backups; changes made after the last backup are not recovered.
  4. This approach can also be applied in scenarios involving partial VM failures, where a node can be restored from the available snapshot and the latest MongoDB and Vault backups can be applied to recover application data that may have been generated after the snapshot was taken.
  5. VM snapshot backups and remote storage must be managed by the customer’s infrastructure team, while MongoDB and Vault restoration should be performed using the AppViewX utilities.
    Note: In this approach, VM snapshots must be taken for all nodes in the AppViewX cluster and restored together during a disaster recovery scenario.
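
The recovery point for this approach can be estimated from the snapshot time and the backup times. The sketch below is illustrative only. It makes the conservative assumption that the older of the two backups (MongoDB or Vault) bounds the point at which both data sets are available, and all names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import NamedTuple


class RecoveryEstimate(NamedTuple):
    recovery_point: datetime  # point in time the restored data reflects
    data_loss: timedelta      # gap between that point and the failure
    backups_useful: bool      # False if the backups are not newer than the snapshot


def estimate_recovery(
    snapshot_time: datetime,
    mongo_backup_time: datetime,
    vault_backup_time: datetime,
    failure_time: datetime,
) -> RecoveryEstimate:
    # Assumption: the older of the two backups bounds the point at which
    # MongoDB and Vault data are both available.
    backup_point = min(mongo_backup_time, vault_backup_time)
    # Restoring backups only helps when they are newer than the snapshot.
    backups_useful = backup_point > snapshot_time
    recovery_point = backup_point if backups_useful else snapshot_time
    return RecoveryEstimate(
        recovery_point, failure_time - recovery_point, backups_useful
    )


# Hypothetical timeline: weekly snapshot, backups on the morning of the failure.
estimate = estimate_recovery(
    snapshot_time=datetime(2024, 1, 7, 0, 0),
    mongo_backup_time=datetime(2024, 1, 10, 9, 5),
    vault_backup_time=datetime(2024, 1, 10, 9, 0),
    failure_time=datetime(2024, 1, 10, 10, 0),
)
print(estimate.recovery_point, estimate.data_loss)
```

With these example values the snapshot alone would lose more than three days of data, while applying the backups reduces the loss to one hour.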

Approach 3 - New Cluster Rebuild with Data Restore

Prerequisites
  1. Regular application-level backups of MongoDB and Vault must be configured using scheduled cron jobs. The customer can choose and configure the backup frequency based on their requirements.
  2. The backup files for MongoDB and Vault must be securely transferred and stored in a customer-managed remote location to ensure availability during disaster recovery scenarios.
  3. The remote location should be accessible during recovery and have sufficient retention and storage capacity as per customer requirements.
Recovery Approach
  1. In the event of a disaster where the existing cluster is unavailable or unrecoverable, a new AppViewX cluster is installed. The steps for performing a fresh installation are detailed in the AppViewX Installation Guide.
  2. After the new cluster is successfully deployed and verified to be operational, the MongoDB and Vault backups from the remote location are restored to recover application data.
  3. Since this is a fresh installation, all customer-specific configurations and customizations including ConfigMaps, environment variables, integrations, certificates, and any other environment-specific changes must be reapplied manually.
  4. This approach involves a full rebuild and reconfiguration of the environment and therefore has the highest turnaround time among the disaster recovery options.
  5. This approach must be used in scenarios involving IP migration or infrastructure changes, where AppViewX is deployed on nodes with new IP addresses, making snapshot-based recovery approaches unsuitable.
  6. If VM snapshots are unavailable or node filesystems are lost due to disk or infrastructure failures, manually provisioning new nodes and attempting to rejoin them to the existing Kubernetes cluster using only application-level backups may not result in a reliable recovery. Rejoining nodes in this way can introduce inconsistencies, particularly for nodes running stateful or critical components such as MongoDB or secret management services like Vault/OpenBao, because node-specific configurations, certificates, and runtime metadata are stored on the underlying filesystem. In such situations, performing a fresh cluster installation and then restoring application backups is generally the recommended approach. A fresh installation provides a clean and consistent environment and is often faster and more reliable than attempting to reconstruct and reattach partially failed nodes.
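
The choice among the three backup-based approaches described so far can be summarized as a small decision function. This is a simplified sketch of the rules stated in this document, not an AppViewX utility. The function name, parameters, and return strings are hypothetical.

```python
def choose_recovery_approach(
    ip_addresses_changed: bool,
    snapshots_for_all_nodes: bool,
    app_backups_available: bool,
    backups_newer_than_snapshot: bool = False,
) -> str:
    """Pick a recovery approach from the conditions described in this plan."""
    # New IP addresses or missing snapshots rule out snapshot-based recovery.
    if ip_addresses_changed or not snapshots_for_all_nodes:
        if app_backups_available:
            return "Approach 3: new cluster rebuild with data restore"
        return "Fresh installation only: application data cannot be restored"
    # Snapshots exist for every node; add backups only if they hold newer data.
    if app_backups_available and backups_newer_than_snapshot:
        return "Approach 2: VM snapshot with application-level backup restore"
    return "Approach 1: VM snapshot based recovery"


# Example: one node's disk and snapshot are both lost, backups exist remotely.
print(choose_recovery_approach(
    ip_addresses_changed=False,
    snapshots_for_all_nodes=False,
    app_backups_available=True,
))
```

The second branch of the first condition highlights why the remote backups are a prerequisite: without snapshots and without backups, only an empty installation can be produced.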

Approach 4 - Active-Standby Cluster with MongoSync Utility

Prerequisites
  1. An active-standby AppViewX cluster setup must be deployed across two separate data centers (DCs).
  2. A primary (active) AppViewX cluster and a secondary (standby) AppViewX cluster must be provisioned, with network connectivity established between them.
  3. The MongoSync utility must be installed and configured to synchronize data from the active MongoDB cluster to the standby MongoDB cluster.
  4. Required network ports and firewall rules must be opened between the two DCs.
Recovery Approach
  1. In this approach, AppViewX runs in an active-standby mode across two data centers, with the active cluster serving production traffic and the standby cluster maintained in a ready state.
  2. The MongoSync utility continuously synchronizes data from the active MongoDB cluster to the standby MongoDB cluster, ensuring the standby environment remains up to date.
  3. Vault is configured once as part of the initial setup, and the same Vault key configuration is used in both data centers. No Vault key rotation or reinitialization is required during failover.
  4. In the event of a disaster impacting the active data center, traffic can be switched to the standby cluster, which uses the synchronized MongoDB data and the same Vault configuration.
  5. This approach offers the lowest data loss of the available options. The actual data loss depends on the MongoSync configuration and on the synchronization lag at the time of the failure.
  6. Since the standby cluster is already deployed and running, the overall recovery time is significantly reduced compared to rebuilding or restoring from backups.
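
Before switching traffic, it can help to confirm that the standby is close enough to the active cluster. The sketch below checks a status record against a lag tolerance. It assumes the synchronization tool reports a state and a lag in seconds; the field names `state` and `lagTimeSeconds` and the value `RUNNING` are assumptions that should be verified against the status output of the deployed MongoSync version, and the function name is hypothetical.

```python
from typing import Any, Mapping


def standby_ready_for_failover(
    progress: Mapping[str, Any], max_lag_seconds: int
) -> bool:
    """True if the sync is running and the reported lag is within tolerance."""
    # Assumed field names; check them against the actual status output.
    lag = progress.get("lagTimeSeconds")
    if progress.get("state") != "RUNNING" or lag is None:
        return False
    return lag <= max_lag_seconds


# Hypothetical status records.
healthy = {"state": "RUNNING", "lagTimeSeconds": 4}
lagging = {"state": "RUNNING", "lagTimeSeconds": 900}
paused = {"state": "PAUSED", "lagTimeSeconds": 0}

for status in (healthy, lagging, paused):
    print(status, standby_ready_for_failover(status, max_lag_seconds=60))
```

The tolerance passed as `max_lag_seconds` corresponds to the acceptable data loss for the deployment; a standby that is paused or far behind would serve stale data after failover.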