Survival analysis of hard disk drive failure data.
Ross Lazarus, February 2016
Using a well-established, objective analysis and presentation method designed for right censored failure data provides insights which simple descriptive statistics or charts do not. Kaplan-Meier statistics and plots are recommended for routine use with hard drive failure data, and their use is illustrated here with 30M data points from the Backblaze public data.
Hard disk drives are widely used for mass storage in servers, network attached storage devices, laptops and desktop computers. Familiar and convenient as they are, these complex electro-mechanical devices are prone to sudden catastrophic failure, which can lead to very unpleasant consequences such as loss of data which was not securely backed up elsewhere. Selecting drive manufacturers and models for home or for commercial applications is complicated by the problem that objective and reliable measurements of the reliability of specific drive models or manufacturers are hard to find.
Subjective experience of individual consumers who purchase a few drives at a time is readily available in on-line product reviews at the larger retailers like Newegg or Amazon. These reviews are likely to be biased by negative reviews from those unlucky owners of a drive which happened to fail quickly – satisfied owners are less likely to take the time to share their experiences compared to unhappy owners who have just lost precious data.
Large commercial purchasers such as Google or Amazon probably do their own in-house testing, but rarely share their hard won findings or raw data. As drive capacities grow, new models are released on a regular basis but it takes at least 2 or 3 years of observation of a large number of sample drives under typical field operation conditions before robust conclusions can be drawn on the reliability over time for each new model.
The most recent analysis, covering about 50,000 hard disks deployed over nearly 3 years in a commercial online storage facility run by Backblaze, is one of the largest published studies and can be viewed at https://www.backblaze.com/blog/hard-drive-reliability-q3-2015/ and in other Backblaze blogs. Simple statistics, tables and bar graphs derived from 30,301,566 observations are presented and discussed. That’s a lot of data and the Backblaze engineers have done their best to make sense of it. Unfortunately I’m not sure that you can really see what’s going on from their presentation. For example, time is split into 3 year-long intervals in the main table, which makes trends over time hard to follow, and the summary bar charts hide an awful lot of interesting detail.
Failure time (or survival) analysis:
Part of the challenge in interpreting this type of data is that at any point during the observation period, one or more drives (or patients or, more generally, units of analysis) may fail, and one or more drives may be removed from further study before failure because of firmware diagnostics or planned maintenance. In statistical terms, this is called right censoring because no further information is available after a drive is removed. Right censoring must be taken into account to calculate failure rates correctly: at each point in time, the rate depends on how many drives are still under observation, which excludes both the drives that have already failed and the drives removed before they failed.
Epidemiologists and statisticians have established valid and robust methods for handling right censored data in the context of survival analysis, which are applicable to the Backblaze data. Survival rates are the complement of failure rates (the fraction surviving is one minus the fraction failed), so survival and failure analysis are mathematically equivalent, two sides of the same technical coin, although failure time analysis predominates in engineering circles whereas the survival analysis paradigm predominates in biology.
One popular method is the Kaplan-Meier (KM) plot and KM statistics, widely used to compare (for example) survival time after diagnosis for patients with the same cancer but different treatments. This kind of data is similar to the hard drive failure data because the reality is that it is almost inevitable that some patients in any clinical study will be lost to further follow up after a visit at which they were clearly alive. Those right censored patients, like the drives removed before failure, contribute no more information to the study, but do contribute useful information for the whole time they are being observed. Some details on where the data came from and how the analysis was performed are provided at the end of this article.
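For readers who want the formula behind the plots, the standard Kaplan-Meier estimate of the fraction surviving to time t can be written as below. The symbols are introduced here for illustration only: t_i are the distinct times at which at least one failure was observed, d_i is the number of failures at t_i, and n_i is the number of units still under observation just before t_i.

```latex
\hat{S}(t) = \prod_{i \,:\, t_i \le t} \left( 1 - \frac{d_i}{n_i} \right)
```

Right censored units never contribute to d_i; they only reduce n_i at all later times, which is how they keep contributing information for as long as they were observed.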
Application of survival analysis to hard disk drive failure data:
Here’s a KM plot showing the survival of each drive by the manufacturer.
The vertical axis represents the fraction of drives which survived at any given point in time and the horizontal axis represents days since time zero. Each individual disk drive’s history over time is "lined up" so the first day of observation is always at the far left, at time zero, like a race where each competitor starts at the same point, although in the raw data, drives were introduced to the pool continuously over the entire study period. Each manufacturer’s drives are grouped together and their survival in service over time is plotted as a single line. When one or more drives fail, there is a small downward step in the curve. Each cross on each line represents a right censored observation removed from further study. Note that a right censored drive does not itself cause a step in the curve; it simply reduces the denominator (the number of drives still under observation) for later failure or, conversely, survival rate calculations.
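The way such a curve is built up can be sketched in a few lines of Python. This is a minimal illustration of the standard Kaplan-Meier calculation, not the code used for the plots here; the function name `kaplan_meier` and the toy data are invented for the example.

```python
from collections import Counter


def kaplan_meier(durations, failed):
    """Return [(day, surviving_fraction)] at each day with at least one failure.

    durations: days each drive was under observation
    failed:    True if observation ended in failure, False if right censored
    """
    failures = Counter(t for t, f in zip(durations, failed) if f)
    exits = Counter(durations)  # failures and censorings both leave the pool
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = failures.get(t, 0)
        if d:
            # A step down: d of the at_risk drives failed on day t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        # Censored drives cause no step; they only shrink the denominator.
        at_risk -= exits[t]
    return curve


# Toy data, invented for illustration (not Backblaze data).
durations = [5, 8, 8, 12, 20, 20]
failed = [True, False, True, True, False, False]
curve = kaplan_meier(durations, failed)
for day, fraction in curve:
    print(day, round(fraction, 4))
```

In this toy example the drive censored on day 8 produces no step of its own, but the failure on day 12 is counted against 3 remaining drives rather than 4.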
Here is the Backblaze summary chart linked from their report at https://www.backblaze.com/blog/hard-drive-reliability-q3-2015/
To my eye, the KM curve provides a much more detailed and arguably more accurate summary of what happened during the observation period. Note the curve for ST500LM012, which is an obvious anomaly arising from an aberrant manufacturer string in the data ("ST500LM012 HN") where the two space delimited components in the data field are reversed (see below) compared to the majority of the data, where the model follows the manufacturer abbreviation. This does not seem to have been noticed in the Backblaze analysis but the KM plot makes it obvious. No attempt has been made to correct this anomaly because it is not clear whether the model number means that the "HN" is wrong and should be replaced by "ST". I’ll leave that for the Backblaze engineers to figure out and fix!
One example of a feature that was not at all obvious from the Backblaze analysis, but is clear from the KM plot, is the crossover in failure rate between ST (Seagate) and WDC (Western Digital). Initially, the WDC family failed slightly faster but the Seagate family of samples failed more quickly after about the first year of operation.
The statistical test which accompanies the KM plot estimates the number of failures expected in each group if all groups shared the same failure rate, using the number of units under observation at each time point. As shown below, it suggests that drive survival is significantly different between manufacturers, with some (eg HGST) having far fewer observed failures than expected and others (eg ST) having far more than expected. The global chi-squared value of 2535 is extremely unlikely to have arisen by chance alone (the reported p= 0 is a p-value too small to display):
Group                    N  Observed  Expected  (O-E)^2/E  (O-E)^2/V
manufact=HGST        10424       100    515.21   3.35e+02   4.08e+02
manufact=Hitachi     13244       385   1533.11   8.60e+02   1.53e+03
manufact=ST          32714      3266   1798.14   1.20e+03   2.21e+03
manufact=ST500LM012    377        22      8.89   1.93e+01   1.94e+01
manufact=TOSHIBA       254         9      9.15   2.59e-03   2.59e-03
manufact=WDC          3753       298    215.49   3.16e+01   3.34e+01
Chisq= 2535 on 5 degrees of freedom, p= 0
The pattern in the KM plot seems much easier to understand, and is not at all obvious from the table or bar graphs shown in the original article.
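To make the Observed and Expected columns in the table above less mysterious, here is a rough Python sketch of how a log-rank style comparison arrives at them. It assumes the table comes from a standard log-rank test, which the output format suggests but the analysis does not state explicitly. The function names and toy records are invented for illustration, and the sketch does not compute the variance-based (O-E)^2/V column or the global chi-squared value.

```python
from collections import defaultdict


def observed_expected(records):
    """records: iterable of (group, days_observed, failed).

    Returns {group: (observed_failures, expected_failures)}, where the
    expected count assumes every group shares the same failure rate.
    """
    records = list(records)
    groups = sorted({g for g, _, _ in records})
    observed = defaultdict(float)
    expected = defaultdict(float)
    for t in sorted({dur for _, dur, f in records if f}):
        at_risk = defaultdict(int)
        failures = defaultdict(int)
        for g, dur, f in records:
            if dur >= t:  # still under observation on day t
                at_risk[g] += 1
                if f and dur == t:
                    failures[g] += 1
        total_at_risk = sum(at_risk.values())
        total_failures = sum(failures.values())
        for g in groups:
            observed[g] += failures[g]
            # Share out the day's failures in proportion to drives at risk.
            expected[g] += total_failures * at_risk[g] / total_at_risk
    return {g: (observed[g], expected[g]) for g in groups}


def chi_term(o, e):
    """One entry of the (O-E)^2/E column."""
    return (o - e) ** 2 / e


# Toy records, invented for illustration (not Backblaze data).
toy = [("A", 5, True), ("A", 8, False), ("A", 12, True),
       ("B", 3, True), ("B", 5, True), ("B", 9, True)]
result = observed_expected(toy)
for group, (o, e) in result.items():
    print(group, o, round(e, 2), round(chi_term(o, e), 3))
```

As a check on the arithmetic, `chi_term(100, 515.21)` gives roughly 335, which matches the 3.35e+02 reported for HGST.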
For individual drive models, the KM curves are complex but even more revealing:
The KM curves show that one particular Seagate model failed at an unusually high rate over the entire period, whereas the curves at the top of the plot show a group of very reliable drive models which had very few failures over the entire period of observation. These individual drive model curves are made from the same data as the manufacturer curves but reveal a great deal of interesting variation within each manufacturer’s offerings – again suggesting that descriptive and summary statistics presented in the Backblaze blogs hide a lot of important and interesting complexity.
Again, the KM statistics show that the differences between models seen in the KM plot are statistically significant and unlikely to have arisen by chance alone.
Group                              N  Observed  Expected  (O-E)^2/E  (O-E)^2/V
model=HGST HMS5C4040ALE640      7168        73     285.7   1.58e+02   1.83e+02
model=HGST HMS5C4040BLE640      3115        21     194.6   1.55e+02   1.67e+02
model=Hitachi HDS5C3030ALA630   4662        98     519.4   3.42e+02   4.09e+02
model=Hitachi HDS5C4040ALE630   2719        63     298.9   1.86e+02   2.05e+02
model=Hitachi HDS722020ALA330   4774       175     530.7   2.38e+02   2.86e+02
model=Hitachi HDS723030ALA640   1048        45     115.3   4.29e+01   4.45e+01
model=ST3000DM001               4707      1705     305.5   6.41e+03   7.06e+03
model=ST31500341AS               787       216      45.1   6.47e+02   6.55e+02
model=ST31500541AS              2188       392     199.1   1.87e+02   1.98e+02
model=ST4000DM000              21671       695    1025.8   1.07e+02   1.52e+02
model=ST6000DX000               1906        26      27.6   9.20e-02   9.51e-02
model=WDC WD10EADS               550        53      54.7   5.38e-02   5.47e-02
model=WDC WD30EFRX              1267       114      73.6   2.22e+01   2.27e+01
Chisq= 8587 on 12 degrees of freedom, p= 0
More complex models:
The KM plot is a robust, non-parametric method which is attractive because it makes very few assumptions about the data. More sophisticated methods such as Cox proportional hazards models require additional assumptions (in the Cox model, that the ratio of failure rates between groups stays constant over time), but allow adjustment for additional variables such as the kind of storage pod (see the Backblaze blogs), drive capacity, number of platters or other factors of interest. My view is that such models are unlikely to be useful until a lot more data becomes available.
Other than as a consumer, I don’t have any particular expertise on hard disk drives but I have made a successful career out of interpreting large scale data sets using appropriate statistical methods. I find the KM analysis much clearer and easier to interpret than the simple descriptive statistics presented by Backblaze and I hope they use more appropriate methods going forward. I’m happy to help if anyone cares to ask.
Technical details and data source:
The Backblaze folk have done a great service to the community by making their data freely available for anyone willing to poke at it at https://www.backblaze.com/hard-drive-test-data.html. The data release which includes the third quarter of 2015 was downloaded in early February 2016 and is reported here.
Here’s a small sample of the 30,301,566 rows of raw data available from Backblaze. There’s a separate CSV format file for each day of each year. These are stored in three directories, one per year (eg 2013). This is from the start of "2013/2013-04-10.csv"
Since I don’t trust the SMART (drive self-monitoring) stats, I threw all those columns away and split out the manufacturer code and model from the "model" field.
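The exact splitting rule is not spelled out here, so the following Python sketch is only a guess that happens to reproduce the groupings seen in the tables above. It assumes that a two-part field is "manufacturer model", and that a single-part field starting with "ST" is a Seagate model with no separate manufacturer token. The function name `split_model` is invented for the example.

```python
def split_model(model_field):
    """Split a raw "model" field into (manufacturer_code, model).

    Assumed rule, for illustration only:
      "HGST HMS5C4040ALE640" -> ("HGST", "HMS5C4040ALE640")
      "ST4000DM000"          -> ("ST", "ST4000DM000")
    """
    parts = model_field.split()
    if len(parts) >= 2:
        # First token is taken as the manufacturer, the rest as the model.
        return parts[0], " ".join(parts[1:])
    token = parts[0]
    if token.startswith("ST"):
        # Seagate models appear without a separate manufacturer token.
        return "ST", token
    # Fallback for anything else: no manufacturer can be inferred.
    return token, token


for raw in ["HGST HMS5C4040ALE640", "ST4000DM000", "WDC WD30EFRX", "ST500LM012 HN"]:
    print(raw, "->", split_model(raw))
```

Under this rule the string "ST500LM012 HN" is split into a "manufacturer" of ST500LM012 and a "model" of HN, which is exactly the kind of anomaly that shows up as its own curve in the manufacturer plot.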
The Kaplan-Meier plot and test statistics are available in most worthwhile statistical packages and I used the npsurv function (from the R rms package, which wraps survfit from the R survival package) for the plots and statistics reported here. In order to improve the reliability of the model curves, drives with fewer than 500 observations were dropped.
A Python script was used to read all the files, keeping track of the appearance and disappearance of each unique drive as defined by a combination of model and serial_number, while processing each day’s data in sequence. No database was needed: Python easily handles this data as an in-memory dictionary, after dropping all the SMART columns. After reading all 30 million rows, the script wrote a summary file containing a single row for each unique drive with the date it first appeared, the number of days it was under observation and a code indicating whether it failed or not. That script processed about 30,000 CSV rows a second on my oldish desktop, taking about 17 minutes for the entire dataset. The R script takes only a few seconds to perform the KM analysis and generate plots.
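The bookkeeping described above can be sketched roughly as follows. This is not the original script, just a minimal illustration under stated assumptions: the column names `date` and `failure` (with failure coded as "1") are assumed in addition to the `model` and `serial_number` fields mentioned above, days under observation are counted inclusively, and the serial numbers in the sample are invented.

```python
import csv
import io
from datetime import date


def summarise_drives(daily_rows):
    """Collapse daily rows (in date order) into one summary row per drive.

    Each row is a dict with keys: date, model, serial_number, failure.
    Returns a list of (model, serial_number, first_seen, days_observed, failed).
    """
    drives = {}  # in-memory dictionary keyed by (model, serial_number)
    for row in daily_rows:
        key = (row["model"], row["serial_number"])
        day = date.fromisoformat(row["date"])
        rec = drives.get(key)
        if rec is None:
            rec = drives[key] = {"first": day, "last": day, "failed": 0}
        rec["last"] = day
        if row["failure"] == "1":  # assumed coding of the failure flag
            rec["failed"] = 1
    return [
        (model, serial, rec["first"].isoformat(),
         (rec["last"] - rec["first"]).days + 1, rec["failed"])
        for (model, serial), rec in drives.items()
    ]


# Invented sample standing in for several daily CSV files read in sequence.
sample = """date,serial_number,model,failure
2013-04-10,SERIAL_A,HGST HMS5C4040ALE640,0
2013-04-10,SERIAL_B,ST3000DM001,0
2013-04-11,SERIAL_A,HGST HMS5C4040ALE640,0
2013-04-11,SERIAL_B,ST3000DM001,1
2013-04-12,SERIAL_A,HGST HMS5C4040ALE640,0
"""
summary = summarise_drives(csv.DictReader(io.StringIO(sample)))
for line in summary:
    print(line)
```

A drive which simply stops appearing without a failure flag ends up with failed = 0, which is how right censored drives are represented in the summary.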