Diagnosis of a failing disk
Smartmontools is a set of tools that controls and monitors disks using the SMART standard (Self-Monitoring, Analysis, and Reporting Technology).
It consists of two parts:
- smartd, a daemon that periodically checks your hard drives
- smartctl, a command line tool to view the status of a hard disk
The tool supports the vast majority of modern hard drives.
Before you start
To complete the actions presented below, you must have:
- A Dedibox account logged into the console
- A Dedibox dedicated server
How to check a server with no RAID controller
- Log into your server using SSH.
- Run the following command from the root account (or precede it with sudo), replacing sdx with the name of your disk (for example, sda):
smartctl -a /dev/sdx
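When several disks are attached without a RAID controller, the same check can be repeated for each of them. The sketch below assumes that smartctl --scan prints one detected device per line with the device path in the first column. The helper names scan_devices and check_all_disks are illustrative and are not part of smartmontools.

```shell
# Print the device path (first column) of each line of `smartctl --scan`
# output read on stdin, e.g. "/dev/sda -d scsi # /dev/sda, SCSI device".
scan_devices() {
    awk '$1 ~ /^\/dev\// { print $1 }'
}

# Hypothetical wrapper: print the full SMART report of every detected disk.
# Run it as root, like the single-disk command.
check_all_disks() {
    smartctl --scan | scan_devices | while read -r dev; do
        echo "=== $dev ==="
        smartctl -a "$dev"
    done
}
```

Calling check_all_disks as root then prints one report per disk, each preceded by the device name.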
How to check a Dell multi-disk server
Dell PERC H200 controller
On these servers, the physical disks are referred to as sg* devices.
- Log into your server using SSH.
- Run the following commands:
smartctl -a -T permissive /dev/sg0
smartctl -a -T permissive /dev/sg1
smartctl -a -T permissive /dev/sg2
Dell PERC controller (H310, H700, H710, H730-P, LSI9361)
Two tools are available for this type of controller: megaclisas-status and megacli.
The first one displays the status of the RAID volume, whilst the second one lists the physical disks so that smartctl can read the SMART status of each of them.
- Log into your server using SSH.
- Update the APT package lists cache, and install the required packages:
apt update
apt install megaclisas-status megacli
- Run the following command to display the status of the RAID volume:
megaclisas-status
- Run the following little script to retrieve the SMART values of your disks:
DEVICE=/dev/sda
for i in $(megacli -pdlist -a0 | grep Id | cut -d":" -f2); do
    echo "============================== $i =============================="
    smartctl -s on -a -d megaraid,${i} ${DEVICE} -T permissive
done
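The loop above relies on grep Id to find the disk identifiers. A slightly stricter variant is sketched below, assuming that megacli -pdlist -a0 prints one line of the form "Device Id: N" per physical disk. The function names megaraid_ids and megaraid_smart_report are illustrative, not part of megacli or smartmontools.

```shell
# Print the value of every "Device Id: N" line found in `megacli -pdlist -a0`
# output read on stdin. Matching the full label at the start of the line
# avoids picking up other lines that merely contain "Id".
megaraid_ids() {
    awk -F': *' '/^Device Id:/ { print $2 }'
}

# Hypothetical wrapper: same report as the loop above, with the stricter filter.
# The optional first argument is the block device exposed by the controller.
megaraid_smart_report() {
    device=${1:-/dev/sda}
    megacli -pdlist -a0 | megaraid_ids | while read -r id; do
        echo "=== megaraid,$id ==="
        smartctl -s on -a -d "megaraid,$id" -T permissive "$device"
    done
}
```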
How to check an HP multi-disk server (P410, P420, P222)
- Log into your server using SSH.
- Run the following command to display the status of the RAID:
ssacli ctrl all show config
An output similar to the following example displays:
Smart Array P410 in Slot 1 (sn: PACCRID111003N3)
   array A (SATA, Unused Space: 0 MB)
      logicaldrive 1 (1.8 TB, RAID 1, OK)
      physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SATA, 2 TB, OK)
      physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SATA, 2 TB, OK)
- Run the following commands to display the SMART values of the disks, using one cciss index per physical disk:
smartctl -a -d cciss,0 /dev/sg0
smartctl -a -d cciss,1 /dev/sg0
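On servers with more than two disks, the list of cciss indexes can be derived from the ssacli output instead of being typed by hand. This is a minimal sketch: it assumes that indexes start at 0 and follow the order of the physicaldrive lines, and that the controller is reachable through /dev/sg0 as in the commands above. The function names cciss_indexes and hp_smart_report are illustrative.

```shell
# Print one cciss index (0, 1, 2, ...) per "physicaldrive" line found in
# `ssacli ctrl all show config` output read on stdin.
cciss_indexes() {
    awk 'BEGIN { n = 0 } /physicaldrive/ { print n; n++ }'
}

# Hypothetical wrapper: print the SMART report of every physical drive.
hp_smart_report() {
    ssacli ctrl all show config | cciss_indexes | while read -r idx; do
        echo "=== cciss,$idx ==="
        smartctl -a -d "cciss,$idx" /dev/sg0
    done
}
```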
How to use SMARTD to monitor your disks
smartd allows you to monitor your disks and, depending on the configuration, to be alerted by email in case of failure.
How to configure SMARTD
The example below configures smartd on a single-disk server running a Debian-based distribution.
- Log into your server using SSH.
- Enable basic SMART options:
smartctl -s on -o on -S on /dev/sda
- Check that the disk is healthy:
smartctl -H /dev/sda
- Edit the file /etc/smartd.conf to set up automated tests:
  - Start by commenting out the following line:
  DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner
  - Then add a line similar to the following example:
  /dev/sda -a -d sat -o on -S on -s (S/../.././01|L/../../1/03) -m root -M exec /usr/share/smartmontools/smartd-runner
  - The -s argument is a regular expression matched against a string of the form T/MM/DD/d/HH (test type, month, day of the month, day of the week, hour). The example above tests your hard disk as follows:
    - A short test (S) every day at 1am (01)
    - A long test (L) every Monday (1) at 3am (03)
- Activate the daemon by uncommenting the line start_smartd=yes in the file /etc/default/smartmontools.
- Start the daemon by running the following command:
service smartmontools start
If a problem is detected, smartd sends an email to the root user (-m root).
You can redirect the mail sent to the root user to your personal mailbox, or send the alerts directly to another address by changing the value of -m.
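To see how a -s schedule is interpreted, it can help to test it against a few dates by hand. smartd matches the regular expression against a string of the form T/MM/DD/d/HH (test type, month, day of the month, day of the week with 1 for Monday, hour). The sketch below mimics that with grep -E and assumes the expression has to match the whole string. The function schedule_matches and the variable SCHEDULE are illustrative names, not part of smartmontools.

```shell
# Return 0 if the smartd -s regular expression given as $1 matches the
# moment described by the remaining arguments, 1 otherwise.
# Arguments: pattern, test type (S or L), month, day of month,
# day of week (1 = Monday ... 7 = Sunday), hour (all zero-padded as shown).
schedule_matches() {
    printf '%s/%s/%s/%s/%s\n' "$2" "$3" "$4" "$5" "$6" | grep -Eqx -e "$1"
}

# Short test every day at 01:00, long test every Monday at 03:00.
SCHEDULE='(S/../.././01|L/../../1/03)'
```

For example, schedule_matches "$SCHEDULE" L 10 25 1 03 succeeds because the day of the week is 1 (Monday) and the hour is 03, whereas the same call with a day of the week of 2 fails.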
How to run tests manually
To run SMART tests manually, use the following commands:
- smartctl -t short /dev/sda to run a short test on your disk
- smartctl -t long /dev/sda to run a long test on your disk
Once the tests are completed, you can check the results with the following command:
smartctl -l selftest /dev/sda
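The result of the latest test can also be checked from a script. The following sketch assumes the ATA self-test log layout, in which the most recent entry is numbered "# 1" and a successful run is reported as "Completed without error". SCSI and SAS disks use a different layout. The function names latest_selftest and latest_selftest_ok are illustrative.

```shell
# Print the most recent entry (numbered "# 1") of a self-test log read on stdin.
latest_selftest() {
    awk '/^# *1 / { print }'
}

# Return 0 if the most recent self-test completed without error, 1 otherwise.
latest_selftest_ok() {
    latest_selftest | grep -q 'Completed without error'
}
```

A possible use is smartctl -l selftest /dev/sda | latest_selftest_ok || echo "last self-test failed or missing".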
How to report disk failures
If you notice any errors when running a SMART diagnosis on your disk, open a support ticket and ask for the disk to be replaced. Include the serial number of the disk and the output of the smartctl command:
=== START OF INFORMATION SECTION ===
Device Model: SAMSUNG HD103UJ
Serial Number: S13PJ1KQ513170 <----------------------- Serial Number
Firmware Version: 1AA01113
User Capacity: 1 000 204 886 016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: ATA-8-ACS revision 3b
Local Time is: Fri Oct 29 11:20:27 2010 CEST
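The serial number can be extracted directly rather than copied by hand. This is a minimal sketch, assuming the ATA information section shown above, where the line starts with "Serial Number:". The function name disk_serial is illustrative.

```shell
# Print the serial number found in `smartctl -i` (or `smartctl -a`) output
# read on stdin. The field separator absorbs the padding after the colon.
disk_serial() {
    awk -F': *' '/^Serial Number:/ { print $2 }'
}
```

For example, smartctl -i /dev/sda | disk_serial prints only the serial number, ready to paste into the support ticket.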