Recover failing disk

This page documents the preview (v2.21) version. Preview includes features under active development and is for development and testing only. For production, use the stable (v2024.1) version.

YugabyteDB can be configured to use multiple storage disks by setting the --fs_data_dirs flag. This introduces the possibility of disk failure and recovery issues.

Cluster replication recovery

The YB-TServer service automatically detects disk failures and attempts to spread the data from the failed disk to other healthy nodes in the cluster. In a single-zone setup with a replication factor (RF) of 3: if you started with four nodes or more, then there would be at least three nodes left after one failed. In this case, re-replication is automatically started if a YB-TServer or disk is down for 10 minutes.

In a multi-zone setup with a replication factor (RF) of 3: YugabyteDB tries to keep one copy of data per zone. In this case, for automatic re-replication of data, a zone needs to have at least two YB-TServers so that if one fails, its data can be re-replicated to the other. Thus, this would mean at least a six-node cluster.

Failed disk replacement

To replace a failed disk, perform the following steps:

  1. Stop the YB-TServer node.
  2. Replace the disks that have failed.
  3. Restart the yb-tserver service.

On restart, the YB-TServer will see the new empty disk and start replicating tablets from other nodes.