Can machine learning predict HDD failure?

Hard disk drives (HDDs) are well suited to long-term storage because they offer large capacity at a low price, but they are vulnerable to shock and heat, and because they contain many precision parts, physical failure is a real possibility. Backblaze, an online storage service provider, has drawn attention by walking through a technical research paper that uses machine learning on hard disk status data to predict the likelihood of future failures.

Backblaze collects data such as HDD model number, serial number, and SMART attributes every day from its data centers around the world, and has accumulated more than 266 million records since April 2013. As of September 30, 2021, 191,000 HDDs were reportedly sending data to Backblaze.

SMART, the HDD self-diagnostic function, records values such as data transfer speed, power-on hours, temperature, seek error rate, and the number of spindle motor starts and stops. Attempts to predict HDD failure from SMART data date back to the 1990s. For example, studies published by Backblaze in 2014 and 2016 and a study published by Google in 2007 analyzed how five SMART attributes correlate with HDD failure: 05 (reallocated sectors), BB (reported uncorrectable errors), BC (command timeouts), C5 (sectors pending reallocation), and C6 (uncorrectable sectors).
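As a rough illustration of how these five attributes might be watched, the sketch below flags any of them whose raw value is above zero in a single daily record. The `smart_<id>_raw` key names follow the column naming of Backblaze's published drive data; the function name, the sample record, and the "above zero" rule are assumptions made for this example, not a method described in the studies.

```python
# Decimal IDs of the five SMART attributes named above (hex 05, BB, BC, C5, C6).
WATCHED = {
    5: "reallocated sectors",
    187: "reported uncorrectable errors",
    188: "command timeouts",
    197: "sectors pending reallocation",
    198: "uncorrectable sectors",
}

def flagged_attributes(record):
    """Return the names of watched attributes whose raw value is above zero.

    `record` is a dict for one drive on one day. Missing attributes are skipped.
    The "above zero" threshold is an assumption for illustration only.
    """
    flagged = []
    for attr_id, name in WATCHED.items():
        raw = record.get(f"smart_{attr_id}_raw")
        if raw is not None and raw > 0:
            flagged.append(name)
    return flagged

# Hypothetical record, not real data.
sample = {"serial_number": "EXAMPLE123", "smart_5_raw": 0,
          "smart_187_raw": 2, "smart_197_raw": 8}
print(flagged_attributes(sample))
# ['reported uncorrectable errors', 'sectors pending reallocation']
```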

The paper Backblaze highlighted this time was published by a research team at the AI company Interpretable AI. The team analyzed SMART information collected daily from the first quarter of 2017 to the first quarter of 2020 from more than 35,000 units of the ST12000NM0007, a helium-filled HDD made by Seagate. They then calculated the remaining life of each HDD and used the data to build survival trees for failure prediction, showing how SMART attributes affect remaining life.
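The article does not say exactly how remaining life was calculated, so the following is only a minimal sketch of one plausible approach. It assumes each daily record is a `(serial, day, failed)` tuple and that a failed drive's last recorded day is its failure day; drives that never failed are marked as not observed (censored), since their true remaining life is unknown. All names here are hypothetical.

```python
from datetime import date

def remaining_life(records):
    """Map each (serial, day) to (days_left, observed).

    days_left counts days from that record to the drive's last recorded day.
    observed is True only if the drive actually failed; otherwise days_left
    is just a lower bound on the true remaining life.
    """
    last_day = {}
    failed_serials = set()
    for serial, day, failed in records:
        if serial not in last_day or day > last_day[serial]:
            last_day[serial] = day
        if failed:
            failed_serials.add(serial)

    result = {}
    for serial, day, _ in records:
        days_left = (last_day[serial] - day).days
        result[(serial, day)] = (days_left, serial in failed_serials)
    return result

# Hypothetical records: drive A fails on January 3, drive B is still running.
records = [
    ("A", date(2020, 1, 1), False),
    ("A", date(2020, 1, 2), False),
    ("A", date(2020, 1, 3), True),
    ("B", date(2020, 1, 1), False),
    ("B", date(2020, 1, 2), False),
]
life = remaining_life(records)
print(life[("A", date(2020, 1, 1))])  # (2, True)
print(life[("B", date(2020, 1, 1))])  # (1, False)
```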

The survival tree for long-term prediction, on a scale of years, checks 05 (reallocated sectors) at node 1, the top of the tree. If the value is less than 1.5, the drive moves to node 2, which checks 03 (spin-up time). If the value is 1.5 or more, it moves to node 15, which checks C5 (sectors pending reallocation). The tree keeps branching in this way on the result of each check until the drive reaches a bottom node, which gives the prediction.
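The branching described above can be sketched as a small lookup table plus a loop. Only the root split (attribute 05 against 1.5) comes from the text; the thresholds at nodes 2 and 15 are not given, so they are left as stubs where the walk stops. The key names such as `smart_5` are hypothetical.

```python
# Only node 1's threshold is stated in the article. Nodes 2 and 15 record
# which attribute they check, but their thresholds are unknown (None).
TREE = {
    1: {"attribute": "smart_5", "threshold": 1.5, "below": 2, "at_or_above": 15},
    2: {"attribute": "smart_3", "threshold": None},
    15: {"attribute": "smart_197", "threshold": None},
}

def walk(tree, smart, node=1):
    """Follow splits from `node` until reaching a node with no known threshold.

    Returns the list of node numbers visited, in order.
    """
    path = [node]
    while tree[node]["threshold"] is not None:
        spec = tree[node]
        if smart[spec["attribute"]] < spec["threshold"]:
            node = spec["below"]
        else:
            node = spec["at_or_above"]
        path.append(node)
    return path

print(walk(TREE, {"smart_5": 0}))  # [1, 2]
print(walk(TREE, {"smart_5": 8}))  # [1, 15]
```

Storing the tree as data rather than nested `if` statements means further splits could be added as dictionary entries without changing the loop.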

For example, node 18 at the bottom predicts that at least half of the HDDs that reach it will not fail within two years. Conversely, HDDs that end up at node 11 are predicted to fail within 50 days.
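A statement such as "at least half will not fail within two years" is a statement about median survival time. One standard way to estimate a median when many drives have not failed yet is the Kaplan-Meier estimator; whether the paper's tree uses it at each bottom node is an assumption here. The sketch below is a plain-Python version run on made-up durations.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival_probability)] at each distinct failure time.

    durations: days each drive was followed.
    observed:  True if the drive failed at that time, False if it was
               still working when last seen (censored).
    """
    samples = sorted(zip(durations, observed))
    at_risk = len(samples)
    survival = 1.0
    curve = []
    i = 0
    while i < len(samples):
        t = samples[i][0]
        deaths = 0
        leaving = 0
        # Group all drives that share the same time t.
        while i < len(samples) and samples[i][0] == t:
            if samples[i][1]:
                deaths += 1
            leaving += 1
            i += 1
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve

def median_survival(curve):
    """First time at which estimated survival drops to 0.5 or below, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

# Made-up data: four drives, the second one censored at day 20.
curve = kaplan_meier([10, 20, 30, 40], [True, False, True, True])
print(curve)                   # [(10, 0.75), (30, 0.375), (40, 0.0)]
print(median_survival(curve))  # 30
```

If the curve never drops to 0.5 within the observation window, `median_survival` returns `None`, which corresponds to the "at least half have not failed yet" situation.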

In the survival tree for short-term prediction over a 90-day range, HDDs that end up at bottom nodes 21 and 24 are predicted to fail almost certainly within 90 days. HDDs that end up at nodes 12 and 15, on the other hand, are said to be unlikely to fail within 90 days.

In carrying out the long-term prediction, the research team used three years of data from 2017 to 2020, and then limited the data to the one year from 2019 to 2020, reducing the number of observations to 557,936. The AI model was trained on observations randomly sampled from the first data set, and the remaining observations were used for testing.
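A random split of this kind can be sketched as follows. The article does not state what fraction of observations went to training or whether sampling was with replacement, so `train_fraction`, the fixed seed, and the sampling without replacement are all assumptions for illustration.

```python
import random

def split_observations(observations, train_fraction, seed=0):
    """Shuffle observations and split them into (train, test) lists.

    Sampling is without replacement, so every observation lands in
    exactly one of the two lists.
    """
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(observations)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical usage with 100 dummy observation IDs and an assumed 50/50 split.
train, test = split_observations(range(100), train_fraction=0.5)
print(len(train), len(test))  # 50 50
```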

Backblaze says that drive failure can be predicted, but that the prediction is clearly not perfect. It therefore stresses that what matters most is a backup strategy. Related information can be found here.



