Hard Drive Failure Prediction in Ceph
Authors: Karan Chauhan (email@example.com), Michael Clifford (firstname.lastname@example.org)
Date Created: 11 December 2020
Date Updated: 17 December 2020
Tags: supervised learning, time series, classification, ceph
Many large-scale distributed storage systems, such as Ceph, use mirroring or erasure-coded redundancy to provide fault tolerance. Because of this, scaling storage up can potentially be resource-intensive. In this project we aim to mitigate this issue using machine learning. Our goal is to build an ML model to predict if a hard drive will fail within a predefined future time interval. These predictions can then be used by Ceph (or other similar systems) to create or destroy replicas accordingly, thus making it more resource efficient. The open source models we built outperformed the ProphetStor model currently used as default in Ceph. Additionally, we framed the problem in a Kaggle kernel format to catalyze community participation in this task.
More than a million terabytes of data gets generated every day, and every bit of it is valuable. That’s why modern storage systems need to be reliable, scalable, and efficient. To provide fault tolerance and system availability, Ceph uses policy based data replication. It tries to store replicas on hard disks whose failure is likely not correlated, based on heuristics like physical proximity, shared power sources, and shared networks.
However, we want to go beyond heuristics for determining failure. The probability of data loss is proportional to the probability of multiple concurrent device failures. So if we could accurately anticipate failure and add additional redundancy in the system before it occurs, we could increase data durability by up to an order of magnitude. Thus, our goal in this project is to build machine learning models to predict when a hard drive would fail.
Using the right data is critical for creating meaningful models. With that in mind, we used SMART metrics collected from hard drives as our training data. SMART is a monitoring system that reports various indicators of drive reliability such as read error rate, uncorrectable sector count, etc. We used the SMART metrics dataset that Backblaze has curated and made publicly available. In addition to SMART metrics, this dataset also contains labels that specify whether or not a drive failed on a given date. Having these labels is incredibly powerful as it opens the door for training supervised machine learning models.
|date||serial no.||smart 1 raw||smart 1 normalized||…||smart 250 normalized||failure|
Table 1: Illustrative rendition of the Backblaze dataset
Previous work done in this domain seemed to suggest that SMART metric behavior may vary across drive vendors. Also, only a handful of vendors collectively make up the majority of drives being used. For these two reasons, we decided to train one prediction model per vendor.
The first step in the training pipeline was to clean the data. We will omit explanations for the data cleaning methods we used, but curious readers can check it out here. Next, to determine which metrics to use as input features, we looked at the correlations of each metric with failure, and also referred to previous research done.
Then, to determine what kind of models to build, we focused our attention to the fact that these models are intended to be first-class citizens in Ceph. So we needed to be mindful of any constraints involved with upstream integration. Ceph already had a health prediction module that consumed 6-12 days of SMART metrics and used ProphetStor’s sample model to estimate drive health as “bad” (fail in 2 weeks), “warning” (fail in 2-6 weeks), or “good” (fail in >6 weeks). With this, our requirements boiled down to the following
- Model output should be “good”, “warning”, or “bad”
- At most 6 days of data can be guaranteed to be available at run time
Figure 1: Sample model input (only one SMART metric shown)
Figure 2: Sample model output
Considering all these facts, we decided that building multiclass time series classification models was the way to go. Our approach for training models can be summarized as follows: for each 6-day sliding window, create a feature vector characterizing the multidimensional time series, and then feed this feature vector to a classification model. The label for each sliding window would be good/warning/bad based on how many days away from failure the drive was on the 6th (i.e. most recent) day.
Figure 3: Creating feature vectors from each sliding window
To evaluate models, we first created train and test sets. We ensured that data in these two sets was from completely different drives, so that there is no “leakage” across the two sets. We trained and evaluated a few classical models such as decision trees, GBDTs, random forest, etc. Note that to ensure ease in downstream packaging, we limited ourselves to use only the python packages available in EPEL. Under this constraint, we found that a random forest classifier performed the best, and outperformed the SVM based sample model currently used as default on Ceph. Specifically, the results were as follows
Table 2a: Recall score on each class for two models
Table 2b: Precision score on each class for two models
In this blog we discussed an important practical use case of data science in the operations domain. Specifically we described how we used machine learning to predict hard drive failure for Ceph, to make it more efficient. As seen in Fig 3, the approach we used seems to outperform the sample ProphetStor model. Still, we believe there is a lot of room for improvement. This work is intended to serve a catalyst for community involvement. To drive that home, we have made the code available[8,9] and also created a short “getting started” jupyter notebook similar to a Kaggle kernel. Please feel free to explore this problem in detail, and open issues or submit PRs on our repo. If you’re a Ceph user, you can also contribute in improving the models by providing real usage data from your Ceph clusters via telemetry.
 GitHub repo
 Ceph Telemetry
- Visit https://jupyterhub-opf-jupyterhub.apps.zero.massopen.cloud/hub/login
- Login with your google account
- In this Spawn screen, select ceph-drive-failure:latest
- Once your server starts, go into the directory named ceph-drive-failure-yyyy-mm-dd-hh-mm
- Go to
notebooks. You can either run the step-by-step notebooks one after the other, or run the end-to-end notebooks.
- Github Repository: https://github.com/aicoe-aiops/ceph_drive_failure
- You can open up a PR on the Git Repository highlighting the feature or issue, and we will address it.
- You can also reach out to email@example.com for any questions.