AbcLinuxu portal, 12 May 2025 15:41
Hi,
I have a problem: on a machine that serves as a Nagios server and runs CentOS 5.2 x86, the software RAID 1 falls apart from time to time (roughly once every 10 days).
I tested the disks with the manufacturer's utility and it found no problem. It usually happens to the larger array (about 77 GB), and
either the first or the second disk drops out, seemingly at random. Any idea what to do about it?
[root@nagios ~]# cat /proc/mdstat
Personalities : [raid1]
md0 : active raid1 hdc1[1] hda1[0]
200704 blocks [2/2] [UU]
md1 : active raid1 hda2[0]
77923200 blocks [2/1] [U_]
unused devices: <none>
What helps:
[root@nagios ~]# mdadm --re-add /dev/md1 /dev/hdc2
The disks resynchronize and it runs for another 7-14 days ...
Any idea what might be wrong?
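Since the array degrades silently between checks, it may help to detect the `[2/1] [U_]` state automatically. Below is a minimal Python sketch, assuming the `/proc/mdstat` layout shown above; the function name `degraded_arrays` and the embedded sample are illustrative only and not part of mdadm or any existing tool.

```python
import re


def degraded_arrays(mdstat_text):
    """Return {array_name: (expected, active)} for arrays missing members."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status line looks like "77923200 blocks [2/1] [U_]".
        status = re.search(r"\[(\d+)/(\d+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                degraded[current] = (expected, active)
            current = None
    return degraded


# Sample copied from the output posted above (an assumption about the format
# on other kernels; real use would read /proc/mdstat instead).
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 hdc1[1] hda1[0]
200704 blocks [2/2] [UU]
md1 : active raid1 hda2[0]
77923200 blocks [2/1] [U_]
unused devices: <none>
"""

print(degraded_arrays(SAMPLE))
```

On the real machine the text would come from `open("/proc/mdstat").read()`, and a non-empty result could trigger a Nagios alert. `mdadm --monitor` covers the same need natively, so this is only meant to show what the status line encodes.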
Yes, the system runs the whole time without a reboot.
The disks are on separate channels of the controller (IDE disks on SiI680: IDE controller at PCI ...)
-----------------------
SiI680: chipset revision 2
SiI680: BASE CLOCK == 133
SiI680: 100% native mode on irq 177
ide0: MMIO-DMA , BIOS settings: hda:pio, hdb:pio
ide1: MMIO-DMA , BIOS settings: hdc:pio, hdd:pio
Probing IDE interface ide0...
hda: IC35L090AVV207-0, ATA DISK drive
ide0 at 0xf8802080-0xf8802087,0xf880208a on irq 177
Probing IDE interface ide1...
hdc: IC35L090AVV207-0, ATA DISK drive
ide1 at 0xf88020c0-0xf88020c7,0xf88020ca on irq 177
-----------------------
I ran "smartctl -a /dev/hda" and, if I am reading it correctly, the disk does have bad sectors after all? (Error: UNC 56 sectors at LBA ... etc.)
I wonder why the manufacturer's utility found nothing; perhaps the situation keeps getting worse. Anyway, I'm off to make a backup, I would hate to lose those configurations ...
[root@nagios ~]# smartctl -a /dev/hda
smartctl version 5.36 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Model Family: IBM/Hitachi Deskstar GXP-180 family
Device Model: IC35L090AVV207-0
Serial Number: VNVC02G3GJEY6T
Firmware Version: V23OA69A
User Capacity: 80,000,000,000 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 6
ATA Standard is: ATA/ATAPI-6 T13 1410D revision 3a
Local Time is: Mon Mar 2 12:01:39 2009 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (2153) seconds.
Offline data collection
capabilities: (0x1b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 36) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 088 088 060 Pre-fail Always - 2097207
2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 233
3 Spin_Up_Time 0x0007 112 112 024 Pre-fail Always - 225 (Average 266)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 72
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 2
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 121 121 000 Pre-fail Offline - 38
9 Power_On_Hours 0x0012 096 096 000 Old_age Always - 29086
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 66
192 Power-Off_Retract_Count 0x0032 100 100 050 Old_age Always - 386
193 Load_Cycle_Count 0x0012 100 100 050 Old_age Always - 386
194 Temperature_Celsius 0x0002 229 229 000 Old_age Always - 24 (Lifetime Min/Max 18/39)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 2
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
ATA Error Count: 830 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 830 occurred at disk power-on lifetime: 27912 hours (1163 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 e0 00 ea ef
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
fe a2 01 e0 00 0a e0 02 00:04:14.300 [VENDOR SPECIFIC]
fe a1 00 00 00 00 a0 02 00:04:14.300 [VENDOR SPECIFIC]
fe a6 50 60 eb c0 a0 02 00:04:14.100 [VENDOR SPECIFIC]
fe a6 50 60 eb c0 a0 02 00:04:14.000 [VENDOR SPECIFIC]
fe a6 50 60 eb c0 a0 02 00:04:13.900 [VENDOR SPECIFIC]
Error 829 occurred at disk power-on lifetime: 27912 hours (1163 days + 0 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 01 e0 00 ea ef
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
fe a2 01 e0 00 0a e0 02 00:01:50.900 [VENDOR SPECIFIC]
fe a1 00 00 00 00 a0 02 00:01:50.900 [VENDOR SPECIFIC]
fe a6 50 60 eb c0 a1 02 00:01:50.700 [VENDOR SPECIFIC]
fe a6 50 60 eb c0 a1 02 00:01:50.600 [VENDOR SPECIFIC]
fe a6 50 60 eb c0 a1 02 00:01:50.400 [VENDOR SPECIFIC]
Error 828 occurred at disk power-on lifetime: 27910 hours (1162 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 63 b5 1d 6b e0 Error: UNC 99 sectors at LBA = 0x006b1db5 = 7019957
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 78 a0 1d 6b e0 00 00:15:29.100 READ DMA EXT
25 00 80 98 1d 6b e0 00 00:15:20.700 READ DMA EXT
25 00 80 18 1d 6b e0 00 00:15:20.400 READ DMA EXT
25 00 80 98 1c 6b e0 00 00:15:20.400 READ DMA EXT
25 00 80 18 1c 6b e0 00 00:15:20.400 READ DMA EXT
Error 827 occurred at disk power-on lifetime: 27910 hours (1162 days + 22 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 63 b5 1d 6b e0 Error: UNC 99 sectors at LBA = 0x006b1db5 = 7019957
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 80 98 1d 6b e0 00 00:15:20.700 READ DMA EXT
25 00 80 18 1d 6b e0 00 00:15:20.400 READ DMA EXT
25 00 80 98 1c 6b e0 00 00:15:20.400 READ DMA EXT
25 00 80 18 1c 6b e0 00 00:15:20.400 READ DMA EXT
25 00 80 98 1b 6b e0 00 00:15:20.400 READ DMA EXT
Error 826 occurred at disk power-on lifetime: 27909 hours (1162 days + 21 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 38 e0 7c 66 e0 Error: UNC 56 sectors at LBA = 0x00667ce0 = 6716640
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
25 00 70 a8 7c 66 e0 00 00:12:19.200 READ DMA EXT
25 00 78 a0 7c 66 e0 00 00:12:12.800 READ DMA EXT
25 00 80 98 7c 66 e0 00 00:12:06.500 READ DMA EXT
25 00 80 18 7c 66 e0 00 00:12:06.200 READ DMA EXT
25 00 80 98 7b 66 e0 00 00:12:06.200 READ DMA EXT
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 27912 -
# 2 Extended offline Completed without error 00% 0 -
Device does not support Selective Self Tests/Logging
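To make the "Error: UNC ... at LBA" lines above easier to interpret, here is a small Python sketch of how the register dump appears to map to those numbers. It assumes the 28-bit reading that matches what smartctl printed in this log (SN, CL, CH and the low nibble of DH form the LBA, SC is the sector count, bit 6 of ER is the uncorrectable-data flag). The function name `decode_error_registers` is made up for illustration.

```python
def decode_error_registers(line):
    """Decode the 'ER ST SC SN CL CH DH' hex bytes of a smartctl error entry."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in line.split()[:7])
    # 28-bit LBA: low nibble of DH, then CH, CL, SN (assumption: matches
    # the values smartctl itself printed in the log above).
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {
        "unc": bool(er & 0x40),  # bit 6 of the error register: uncorrectable data
        "sectors": sc,           # sector count register
        "lba": lba,
    }


# Register lines copied from errors 828 and 826 in the log above.
print(decode_error_registers("40 51 63 b5 1d 6b e0"))
print(decode_error_registers("40 51 38 e0 7c 66 e0"))
```

For the two lines taken from the log this gives LBA 7019957 with 99 sectors and LBA 6716640 with 56 sectors, the same values smartctl reported, so the reads really did fail with an uncorrectable-data error on the medium rather than on the cable (UDMA_CRC_Error_Count is 0).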
After I started smartctl -t long /dev/hda, the disk was kicked out of the RAID again after about 30 minutes. You are right that those 830 ATA errors are worrying, and both disks (they are in similar shape)
have served their time. I will replace them with others. Thanks for the help, I consider the problem solved.
After I started smartctl -t long /dev/hda, the disk was kicked out of the RAID again after about 30 minutes.
Then the cause is probably clear...
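For anyone wondering how a disk with 830 logged errors can still report "PASSED": as far as I understand it, the overall verdict generally reflects whether a normalized attribute value has dropped to its threshold, which none has here. A rough Python sketch that pulls the error count and the attribute table out of `smartctl -a` text follows; `smart_summary` is a hypothetical helper, and the column layout is assumed to match the smartctl 5.36 output posted in this thread.

```python
import re


def smart_summary(report):
    """Extract the ATA error count and attribute table from `smartctl -a` text."""
    match = re.search(r"ATA Error Count:\s*(\d+)", report)
    error_count = int(match.group(1)) if match else 0
    attrs = {}
    for line in report.splitlines():
        fields = line.split()
        # Attribute rows: ID, name, flag (0x....), value, worst, thresh,
        # type, updated, when_failed, raw. Assumes the raw value starts with
        # a plain integer, as it does for every row in this thread.
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            attrs[fields[1]] = {
                "value": int(fields[3]),
                "thresh": int(fields[5]),
                "raw": int(fields[9]),
            }
    return error_count, attrs


# A few rows copied from the report above.
SAMPLE = """\
1 Raw_Read_Error_Rate 0x000b 088 088 060 Pre-fail Always - 2097207
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 2
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 0
ATA Error Count: 830 (device log contains only the most recent five errors)
"""

error_count, attrs = smart_summary(SAMPLE)
below_threshold = [name for name, a in attrs.items() if a["value"] <= a["thresh"]]
print(error_count, attrs["Reallocated_Sector_Ct"]["raw"], below_threshold)
```

With the sample rows no attribute is at or below its threshold, yet the error count and the two reallocated sectors already hint at a wearing disk, which is why the error log is worth reading even when the health check passes.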
ISSN 1214-1267, (c) 1999-2007 Stickfish s.r.o.