Large-scale storage systems are designed with redundancy, often by replicating data, to prevent data loss after a disk failure. Redundancy is necessary because a large number of disks implies a short mean time to a single disk failure. Systems are often designed to withstand up to two temporally adjacent failures without losing data and to return to the proper level of redundancy as soon as possible after a failure. Replicated storage systems often triple-replicate all data to prevent data loss after two simultaneous failures, which results in high storage costs and greater energy and network usage. Other solutions reduce these overheads by storing only two replicas. However, even with fast re-replication after an initial failure, two replicas cannot prevent data loss when disk failures are simultaneous or temporally adjacent.

Our storage system comprises two data replicas and one small, inexpensive disk add-on per machine, which we call the LSTOR. The LSTOR interfaces with the I/O controller on behalf of the disk and communicates with the system even if its attached disk fails. The system offers failure tolerance in one of two operating modes. In the first, after a disk failure, the LSTOR can copy the now non-redundant data on its attached disk to itself, thereby creating an additional copy. In the second, each disk is divided into blocks, and the LSTOR stores the parity (XOR or erasure codes) of the data blocks, which can be used to reconstruct lost blocks after multiple simultaneous disk failures.

Each LSTOR resides between the disk and the I/O controller, so data transfers between the disk and the LSTOR do not burden the network. The LSTOR thus allows the storage system to withstand two simultaneous or temporally adjacent disk failures while minimizing overheads.
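The parity mode can be illustrated with a minimal sketch, assuming plain XOR parity over equal-sized blocks. The function names `xor_blocks` and `reconstruct_missing` are hypothetical and chosen for illustration only; the actual block size, parity layout, and choice between XOR and erasure codes in the LSTOR are not specified here.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the byte-wise XOR of a list of equal-length blocks."""
    if not blocks:
        raise ValueError("at least one block is required")
    size = len(blocks[0])
    if any(len(block) != size for block in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct_missing(parity, surviving_blocks):
    """Rebuild the single missing block from the parity and all other blocks.

    Assumption: exactly one block covered by this parity is missing.
    """
    return xor_blocks([parity, *surviving_blocks])


# Hypothetical example: three data blocks on one disk, parity kept on its LSTOR.
data = [b"abcd", b"efgh", b"ijkl"]
parity = xor_blocks(data)

# Suppose the middle block can no longer be read from any replica.
recovered = reconstruct_missing(parity, [data[0], data[2]])
```

XOR parity of this kind recovers only one missing block per parity group; tolerating more losses per group would require an erasure code, which this sketch does not cover.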
We also ensure that no two disks share more than one data block. Every disk that shares a block with a failed disk therefore needs to duplicate only one non-redundant block per failure, which maximizes parallelism and load balancing during recovery.
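One way to picture this placement constraint is the following sketch, which assumes two replicas per block and assigns each block to a distinct pair of disks. The helpers `place_blocks` and `recovery_sources` are hypothetical names, and the pair-enumeration strategy is an illustrative assumption, not necessarily the placement scheme used by the system.

```python
from itertools import combinations


def place_blocks(num_disks, num_blocks):
    """Map each block to a distinct pair of disks (its two replicas).

    Because every pair of disks is used at most once, no two disks
    share more than one block.
    """
    pairs = list(combinations(range(num_disks), 2))
    if num_blocks > len(pairs):
        raise ValueError("not enough distinct disk pairs for this many blocks")
    return {block: pairs[block] for block in range(num_blocks)}


def recovery_sources(placement, failed_disk):
    """Return, per surviving disk, the blocks it must re-replicate."""
    sources = {}
    for block, (first, second) in placement.items():
        if failed_disk in (first, second):
            survivor = second if first == failed_disk else first
            sources.setdefault(survivor, []).append(block)
    return sources


# Hypothetical example: 5 disks, 10 blocks (one block per disk pair).
placement = place_blocks(5, 10)
sources = recovery_sources(placement, failed_disk=0)
```

In this example each of the four surviving disks holds exactly one block that was shared with the failed disk, so the recovery work is spread evenly and can proceed in parallel.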
- Reduced costs; reduced storage and energy consumption; reduced network burden; a shorter time interval in which an additional failure can lead to data loss
Applications and Opportunities
- Data center/cloud storage systems