A little about SMART and monitoring utilities

There is plenty of information online about SMART and the meaning of its attributes. However, I have not seen any mention of several important points that I learned from people who study storage media professionally.



When I once again found myself explaining to a friend why SMART readings should not be trusted unconditionally, and why it is better not to keep the classic “SMART monitors” running all the time, it occurred to me to write it all down as a set of theses with explanations. That way I can share a link instead of retelling it each time, and a wider audience can read it too.



1) Programs for automatic monitoring of SMART attributes should be used with great care.



What you know as SMART attributes are not stored in ready-made form; they are generated at the moment you request them. They are calculated from internal statistics that the drive firmware accumulates and uses during operation.



The device does not need some of this data for its basic functionality, so that data is not stored but generated each time it is requested. As a result, a request for SMART attributes makes the firmware launch a large number of processes to obtain the missing data.



These processes, however, coexist poorly with the procedures that run while the drive is busy with read and write operations.



In an ideal world this would not cause any problems. In reality, hard drive firmware is written by ordinary people, who can make mistakes and do make them. So if you query SMART attributes while the device is actively reading or writing, the chance of something going wrong rises sharply. For example, data in a user read or write buffer may be corrupted.



The claim about increased risk is not a theoretical conclusion but a practical observation. For example, there was a known bug in the firmware of the Samsung 103UI HDD in which user data was corrupted while a request for SMART attributes was being executed.



Therefore, do not configure automatic SMART attribute checks unless you know for certain that a Flush Cache command is issued before each one. If you cannot do without them, run the checks as rarely as possible. In many monitoring programs the default interval between checks is about 10 minutes, which is too frequent. In any case, such checks are no panacea against sudden disk failure (the only panacea is redundancy). Once a day is, I think, quite sufficient.
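
The "as rarely as possible" advice can be sketched as a simple throttle for a monitoring script you control. This is only an illustration: the class name SmartPollThrottle and its parameter min_interval are hypothetical, and the one-day default simply mirrors the suggestion above. It does not issue Flush Cache or talk to the drive at all; it only decides whether a full attribute query is due.

```python
from datetime import datetime, timedelta
from typing import Optional


class SmartPollThrottle:
    """Allow a full SMART attribute query at most once per min_interval.

    Hypothetical helper: it only tracks time and never touches the drive.
    """

    def __init__(self, min_interval: timedelta = timedelta(days=1)) -> None:
        self.min_interval = min_interval
        self.last_poll: Optional[datetime] = None

    def should_poll(self, now: datetime) -> bool:
        """Return True if no poll was recorded yet or the interval has elapsed."""
        if self.last_poll is None:
            return True
        return now - self.last_poll >= self.min_interval

    def record_poll(self, now: datetime) -> None:
        """Remember when the last full attribute query was made."""
        self.last_poll = now
```

A monitoring loop would call should_poll() before each attribute query and record_poll() right after it, so a loop that wakes up every 10 minutes still touches the attributes only once a day.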



A temperature request does not start the attribute calculation processes and can be performed frequently, because, if implemented correctly, it goes through the SCT protocol. SCT returns only what is already known, and that data is updated automatically in the background.
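
As a rough sketch of reading the temperature through SCT, the snippet below assumes smartmontools is installed and uses its "smartctl -l scttempsts" option, which prints the SCT temperature status. The function names are hypothetical, and the parser assumes the usual "Current Temperature: NN Celsius" line in the output. Whether a particular drive and tool combination really avoids the attribute recalculation is the claim made above, not something this code verifies.

```python
import re
import subprocess
from typing import List, Optional


def sct_temperature_command(device: str) -> List[str]:
    """Build the smartctl command that asks for the SCT temperature status."""
    return ["smartctl", "-l", "scttempsts", device]


def parse_sct_temperature(output: str) -> Optional[int]:
    """Extract the current temperature in Celsius, or None if it is absent."""
    match = re.search(
        r"^Current Temperature:\s+(-?\d+)\s+Celsius", output, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def read_sct_temperature(device: str) -> Optional[int]:
    """Run smartctl (usually needs root) and return the parsed temperature."""
    # smartctl uses its exit status as a bit mask, so a non-zero status
    # is not treated as a failure here.
    result = subprocess.run(
        sct_temperature_command(device),
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_sct_temperature(result.stdout)
```

Keeping the parser separate from the subprocess call makes it possible to check the parsing logic against saved output without touching a real drive.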



2) SMART attribute data is often unreliable.



The hard drive firmware shows you what it sees fit, not what is actually happening. The most obvious example is attribute 5, the reallocated sector count. Data recovery specialists know well that a hard drive can report zero reallocations in attribute 5 even though reallocated sectors exist and continue to appear.



I put a question to a specialist who studies hard drives and examines their firmware: by what principle does the firmware decide when to hide the fact that sectors have been reallocated, and when to report it through the SMART attributes?



He replied that there is no general rule by which devices show or hide the real picture, and that the logic of the programmers who write hard drive firmware sometimes looks very strange. Studying the firmware of different models, he found that the decision to "hide or show" is often based on a set of parameters whose relation to one another, and to the remaining life of the drive, is not at all clear.



3) The interpretation of SMART metrics is vendor-specific.



For example, on Seagate drives you should not worry about "bad" raw values of attributes 1 and 7 as long as the other attributes are normal. On drives from this manufacturer, the raw values of these attributes may increase during normal use.
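
One way a monitoring script could encode such vendor-specific exceptions is a small lookup table. The sketch below is an assumption-laden illustration: the table VENDOR_NOISY_RAW_ATTRS and the function filter_vendor_noise are hypothetical names, and the only entry (Seagate, attributes 1 and 7) is the one mentioned above. How a script decides that an attribute looks "suspicious" in the first place is left out.

```python
from typing import Dict, Set

# Hypothetical table of attributes whose raw values may grow in normal use.
# Only the Seagate entry comes from the text; other vendors would need
# their own, separately verified entries.
VENDOR_NOISY_RAW_ATTRS: Dict[str, Set[int]] = {
    "seagate": {1, 7},
}


def filter_vendor_noise(vendor: str, suspicious_ids: Set[int]) -> Set[int]:
    """Drop attribute IDs that are expected to look 'bad' for this vendor."""
    ignored = VENDOR_NOISY_RAW_ATTRS.get(vendor.strip().lower(), set())
    return suspicious_ids - ignored
```

With this table, a Seagate drive that only has growing raw values in attributes 1 and 7 produces an empty result, while a growing attribute 5 on the same drive is still reported.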



To assess the condition and remaining life of a hard disk, look first at attributes 5, 196, 197 and 198. It makes sense to focus on the absolute (raw) values rather than the normalized ones, because normalization can be performed in non-obvious ways that differ between algorithms and firmware versions.
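
To illustrate reading the raw values of these four attributes, here is a minimal sketch that parses the attribute table printed by "smartctl -A" from smartmontools. It assumes the usual ten-column layout with RAW_VALUE as the last column. The function name parse_key_raw_values is hypothetical, and SAMPLE is a hand-written fragment for illustration, not output from a real drive.

```python
from typing import Dict

# Attribute IDs the text recommends looking at first.
KEY_ATTRIBUTES = (5, 196, 197, 198)

# Hand-written sample in the layout of `smartctl -A`, not real drive data.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   067   053   000    Old_age   Always       -       33 (Min/Max 20/47)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def parse_key_raw_values(smartctl_output: str) -> Dict[int, int]:
    """Return {attribute_id: raw_value} for attributes 5, 196, 197 and 198."""
    raw_values: Dict[int, int] = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        # The tenth column is RAW_VALUE; skip it if it is not a plain integer.
        if attr_id in KEY_ATTRIBUTES and fields[9].isdigit():
            raw_values[attr_id] = int(fields[9])
    return raw_values
```

Calling parse_key_raw_values(SAMPLE) picks out the four raw values and ignores the normalized VALUE, WORST and THRESH columns entirely.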



In general, when storage media specialists talk about the value of an attribute, they usually mean the raw value.


