Abstract: To improve the reliability of SSD arrays, Redundant Array of Independent Disks (RAID) configurations are commonly employed. However, conventional reliability models for HDD RAID cannot be applied to SSD arrays, as the nature of failures in SSDs differs from that in HDDs. Previous studies on the reliability of SSD arrays rely on outdated SSD failure data and consider only a limited set of failure types, namely device failures and page failures caused by bit errors, while recent field studies have reported additional failure types, including bad blocks and bad chips, as well as a high correlation between failures. In this paper, we explore the reliability of SSD arrays using field storage traces and real-system implementations of conventional and emerging erasure codes. Reliability is evaluated by statistical fault injections that post-process the usage logs from the real-system implementation, while the fault/failure attributes are obtained from field data. As a case study, we examine conventional and emerging erasure codes in terms of both reliability and performance using Linux MD RAID and commercial SSDs. Our analysis shows that a) emerging erasure codes fail to replace RAID6 in terms of reliability, b) row-wise erasure codes are the most efficient choice for contemporary SSD devices, and c) previous models overestimate SSD array reliability by up to six orders of magnitude, as they focus on the coincidence of bad pages and bad chips, which accounts for only a minority of Data Loss (DL) events in SSD arrays. Our experiments show that the combination of bad chips with bad blocks is the major source of DL in RAID5 and emerging codes (contributing more than 54% and 90% of DL events, respectively), while RAID6 remains robust under these failure combinations. Finally, the fault injection results show that SSD array reliability, as well as the failure breakdown, is significantly correlated with the SSD type.
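For illustration, a minimal Monte Carlo sketch of the kind of statistical fault injection described above: each trial draws device-, chip-, block-, and page-level faults for every member of a stripe and declares Data Loss when more members are lost than the erasure code tolerates (one for RAID5, two for RAID6). The probabilities, stripe width, and the independence assumption below are hypothetical placeholders, not the field-derived failure attributes used in the paper.

import random

# Hypothetical per-member fault probabilities (placeholders, not field data).
P_DEVICE_FAIL = 1e-3   # whole-SSD failure
P_CHIP_FAIL   = 5e-3   # bad chip
P_BLOCK_FAIL  = 1e-2   # bad block overlapping the stripe
P_PAGE_FAIL   = 2e-2   # page failure due to uncorrectable bit errors

def member_has_fault():
    """Return True if a stripe member is unreadable in this trial."""
    return (random.random() < P_DEVICE_FAIL or
            random.random() < P_CHIP_FAIL or
            random.random() < P_BLOCK_FAIL or
            random.random() < P_PAGE_FAIL)

def data_loss_probability(stripe_width=8, tolerance=2, trials=200_000):
    """Monte Carlo estimate of the probability that more stripe members
    fail than the erasure code can tolerate (tolerance=1 for RAID5,
    tolerance=2 for RAID6)."""
    losses = 0
    for _ in range(trials):
        failed = sum(member_has_fault() for _ in range(stripe_width))
        if failed > tolerance:
            losses += 1
    return losses / trials

if __name__ == "__main__":
    for code, t in [("RAID5", 1), ("RAID6", 2)]:
        print(code, data_loss_probability(tolerance=t))

In the paper's methodology, the per-member fault events would instead come from post-processed usage logs and field statistics rather than independent draws; the sketch only shows how per-trial outcomes aggregate into an array-level DL estimate.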
Abstract: In recent years, the high availability and reliability of Data Storage Systems (DSS) have been significantly threatened by soft errors occurring in storage controllers. Due to their specific functionality and hardware-software stack, error propagation and manifestation in DSS differ considerably from those in general-purpose computing architectures. To our knowledge, no previous study has examined the system-level effects of soft errors on the availability and reliability of data storage systems. In this paper, we first analyze how soft errors occurring in the server processors of storage controllers affect the dependability of the entire storage system. To this end, we implemented the major functions of a typical data storage system controller, running on a full storage system software stack, and developed a framework to perform fault injection experiments using a full system simulator. We then propose a new metric, the Storage System Vulnerability Factor (SSVF), to accurately capture the impact of soft errors on storage systems. Extensive experiments reveal that, depending on the controller configuration, up to 40% of the cache memory contains end-user data, where any unrecoverable soft error results in irreversible Data Loss (DL). In contrast, soft errors in the rest of the cache memory, which is occupied by the Operating System (OS) and storage applications, result in Data Unavailability (DU) at the storage system level. Our analysis also shows that Detectable Unrecoverable Errors (DUEs) in the cache data field are the major cause of DU in storage systems, while Silent Data Corruptions (SDCs) in the cache tag and data fields are the main cause of DL.
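As a rough illustration of the outcome classification suggested above, the following sketch injects simulated soft errors into a controller cache model and labels each as benign, Data Unavailability (DU), or Data Loss (DL), then computes an AVF-style vulnerability factor as the fraction of injections with a user-visible outcome. The masking and DUE probabilities, the 40% user-data fraction used as a parameter, and this particular SSVF definition are assumptions for demonstration, not the paper's actual framework or metric definition.

import random
from collections import Counter

# Hypothetical parameters (placeholders, not the paper's measurements):
USER_DATA_FRACTION = 0.40  # abstract: up to 40% of cache may hold end-user data
P_MASKED = 0.50            # assumed fraction of flips with no architectural effect
P_DUE = 0.30               # assumed fraction of unmasked errors detected as DUEs

def classify_injection() -> str:
    """Classify one simulated soft-error injection into the controller cache.

    Assumed outcome model (an illustration, not the paper's exact rules):
      - masked flips have no user-visible effect ("benign");
      - DUEs are detected, so data is not silently corrupted, but service
        is interrupted -> Data Unavailability (DU);
      - SDCs hitting the end-user-data region corrupt user data -> Data Loss (DL);
      - SDCs hitting OS/application state are assumed to crash or hang the
        controller -> DU.
    """
    if random.random() < P_MASKED:
        return "benign"
    if random.random() < P_DUE:
        return "DU"
    return "DL" if random.random() < USER_DATA_FRACTION else "DU"

def ssvf(outcomes) -> float:
    """AVF-style vulnerability factor (assumed definition): the fraction of
    injections that end in a user-visible failure (DL or DU)."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return (counts["DL"] + counts["DU"]) / total if total else 0.0

if __name__ == "__main__":
    outcomes = [classify_injection() for _ in range(100_000)]
    print(Counter(outcomes))
    print("SSVF =", ssvf(outcomes))

In the paper, outcomes are obtained from fault injection in a full system simulator running the storage software stack rather than from probability draws; the sketch only shows how per-injection DL/DU labels could be aggregated into a single vulnerability figure.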