Software RAID - disks dropping out of the array
: 30 October 2014, 01:49
author: rafsze
Hi,
At work I have a small web and database server running Debian for a web application. To improve reliability I installed it on two disks joined in RAID1 (mirror). For what is probably the third time now, a disk has dropped out of the array on its own. After I issue the appropriate commands the disk returns to the array, the system dutifully rebuilds it, and everything looks OK. (I don't remember how it went before, but today I had to reboot the system because it did not see the disk at all.)
I am wondering what causes this and whether it can be prevented.
Of course the server alerts me when this happens, but it is a bit of a nuisance to rebuild the array every 20 days or so (incidentally, it is always the same disk that drops out, /dev/sdb).
I should add that the server runs 24/7.
Any ideas?
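For anyone who wants to spot the degraded state themselves instead of waiting for the alert, here is a minimal Python sketch that looks for a missing RAID member in text formatted like /proc/mdstat. The sample text is made up for illustration (it is not output from the server in this thread), and the function name degraded_arrays is hypothetical.

```python
import re

# Illustrative /proc/mdstat content for a two-disk RAID1 with one failed member.
# ASSUMPTION: invented sample, not taken from the server discussed here.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sda1[0] sdb1[1](F)
      78148096 blocks super 1.2 [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: member_map} for arrays with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # The status line ends with e.g. "[2/1] [U_]"; "_" marks a missing member.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if current and status and "_" in status.group(3):
            degraded[current] = status.group(3)
    return degraded


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On a real machine the text would come from reading /proc/mdstat; the sketch only reports the state and does not re-add anything to the array.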
: 30 October 2014, 08:25
author: LordRuthwen
Yes: read the SMART data and see what is going on with that disk, because in my opinion it is damaged and you should think about replacing it.
: 30 October 2014, 13:45
author: rafsze
Hm, basically I did not check the disk after the first dropout. Having learned from experience with this server supplier (he had a knack for everything he delivered eventually breaking, a kind of anti-Midas king), I replaced the faulty disk with a new one and, as a precaution, the working one as well.
In any case, I will check these disks to be safe when I am at the office.
: 30 October 2014, 16:54
author: rafsze
I dropped by the office to run the tests and it turned out the disk had dropped out again, so after about 2 days...
As before, the system did not see the disk at all; only after a reboot could it be added back. The array is currently rebuilding.
I tested both disks (sdb is the one that drops out). Really the only parameter that puzzles me is seek_error_rate, but I could not find any reference values for it. What do you think?
Maybe the real problem is that the disk disappears from the system altogether, and that is why it disappears from the array.
/dev/sda
Code:
smartctl 5.41 2011-06-09 r3365 [i686-linux-3.2.0-4-686-pae] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10
Device Model: ST380815AS
Serial Number: 9RW4MDSW
Firmware Version: 4.ADA
User Capacity: 80,000,000,000 bytes [80,0 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Thu Oct 30 16:41:44 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 430) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 27) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0003   098   097   070    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always   -           16
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   063   060   030    Pre-fail  Always   -           2307929
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           858
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always   -           16
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   064   057   045    Old_age   Always   -           36 (Min/Max 33/36)
194 Temperature_Celsius     0x0022   036   043   000    Old_age   Always   -           36 (0 22 0 0)
195 Hardware_ECC_Recovered  0x001a   088   075   000    Old_age   Always   -           182934975
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -           0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline  -           0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always   -           0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
/dev/sdb
Code:
smartctl 5.41 2011-06-09 r3365 [i686-linux-3.2.0-4-686-pae] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10
Device Model: ST380815AS
Serial Number: 6RW3302V
Firmware Version: 4.ADA
User Capacity: 80,000,000,000 bytes [80,0 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Thu Oct 30 16:41:49 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 430) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 27) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0003   099   097   070    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always   -           1725
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   066   060   030    Pre-fail  Always   -           4622459
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           583
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always   -           1784
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   065   054   045    Old_age   Always   -           35 (Min/Max 29/35)
194 Temperature_Celsius     0x0022   035   046   000    Old_age   Always   -           35 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   071   056   000    Old_age   Always   -           22903739
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -           3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline  -           0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always   -           0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
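On the Seek_Error_Rate question: a commonly repeated but unofficial reading of the Seagate raw value is that it is a 48-bit number whose upper 16 bits count seek errors and whose lower 32 bits count total seeks. Treat that layout as an assumption, not as vendor documentation. Under that assumption, a rough Python sketch (the function name split_seek_error_rate is hypothetical) that splits the two raw values from the dumps above:

```python
def split_seek_error_rate(raw):
    """Split a raw value into (seek_errors, total_seeks).

    ASSUMPTION: upper 16 bits = seek errors, lower 32 bits = total seeks.
    This is a community interpretation, not a documented Seagate format.
    """
    errors = (raw >> 32) & 0xFFFF
    seeks = raw & 0xFFFFFFFF
    return errors, seeks


if __name__ == "__main__":
    # Raw values copied from the two smartctl dumps in this post.
    for label, raw in [("sda", 2307929), ("sdb", 4622459)]:
        errors, seeks = split_seek_error_rate(raw)
        print(f"{label}: {errors} seek errors in {seeks} seeks")
```

Both raw values are smaller than 2^32, so under this reading the error part is zero for both disks and the large number is simply the seek counter.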
-------------------------
I also went through the syslogs and found this fragment just before the md message about one of the disks being removed from the array.
Does anyone know what these entries mean?
Code:
Oct 29 14:47:38 maszyna kernel: [ 2946.835795] ata2: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Oct 29 14:47:38 maszyna kernel: [ 2946.835932] ata2: irq_stat 0x00400000, PHY RDY changed
Oct 29 14:47:38 maszyna kernel: [ 2946.836053] ata2: SError: { PHYRdyChg }
Oct 29 14:47:38 maszyna kernel: [ 2946.836138] ata2: hard resetting link
Oct 29 14:47:41 maszyna kernel: [ 2949.192033] ata2: COMRESET failed (errno=-32)
Oct 29 14:47:41 maszyna kernel: [ 2949.192121] ata2: reset failed (errno=-32), retrying in 8 secs
Oct 29 14:47:48 maszyna kernel: [ 2956.836052] ata2: limiting SATA link speed to 1.5 Gbps
Oct 29 14:47:48 maszyna kernel: [ 2956.836060] ata2: hard resetting link
Oct 29 14:47:55 maszyna kernel: [ 2962.856051] ata2: link is slow to respond, please be patient (ready=0)
Oct 29 14:47:59 maszyna kernel: [ 2966.888034] ata2: COMRESET failed (errno=-16)
Oct 29 14:47:59 maszyna kernel: [ 2966.888127] ata2: hard resetting link
Oct 29 14:48:04 maszyna kernel: [ 2972.652036] ata2: link is slow to respond, please be patient (ready=0)
Oct 29 14:48:34 maszyna kernel: [ 3001.940045] ata2: COMRESET failed (errno=-16)
Oct 29 14:48:34 maszyna kernel: [ 3001.940138] ata2: hard resetting link
Oct 29 14:48:39 maszyna kernel: [ 3006.968037] ata2: COMRESET failed (errno=-16)
Oct 29 14:48:39 maszyna kernel: [ 3006.968127] ata2: reset failed, giving up
Oct 29 14:48:39 maszyna kernel: [ 3006.968202] ata2.00: disabled
Oct 29 14:48:39 maszyna kernel: [ 3006.968221] ata2: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen t4
Oct 29 14:48:39 maszyna kernel: [ 3006.968353] ata2: irq_stat 0x00400040, connection status changed
Oct 29 14:48:39 maszyna kernel: [ 3006.968461] ata2: SError: { RecovComm PHYRdyChg CommWake DevExch }
Oct 29 14:48:39 maszyna kernel: [ 3006.968578] ata2: hard resetting link
Oct 29 14:48:39 maszyna kernel: [ 3006.972035] sd 1:0:0:0: rejecting I/O to offline device
Oct 29 14:48:39 maszyna kernel: [ 3006.972172] sd 1:0:0:0: [sdb] killing request
Oct 29 14:48:39 maszyna kernel: [ 3006.972216] sd 1:0:0:0: rejecting I/O to offline device
Oct 29 14:48:39 maszyna kernel: [ 3006.972321] md: super_written gets error=-5, uptodate=0
Oct 29 14:48:39 maszyna kernel: [ 3006.972330] md/raid1:md0: Disk failure on sdb1, disabling device.
Oct 29 14:48:39 maszyna kernel: [ 3006.972332] md/raid1:md0: Operation continuing on 1 devices.
Oct 29 14:48:39 maszyna kernel: [ 3006.975909] RAID1 conf printout:
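Read top to bottom, the log shows the SATA link on ata2 being reset, the resets failing, the port being disabled, and only then md marking sdb1 as failed. A small Python sketch, using patterns copied from the messages above, can tally those steps in a longer syslog. The names EVENTS and summarize are hypothetical, and the counts alone do not say whether the disk, the cable or the controller is at fault.

```python
import re

# Patterns taken from the kernel messages quoted in this post.
EVENTS = {
    "link_reset": re.compile(r"ata\d+: hard resetting link"),
    "reset_failed": re.compile(r"ata\d+: COMRESET failed \(errno=-?\d+\)"),
    "gave_up": re.compile(r"ata\d+: reset failed, giving up"),
    "md_failure": re.compile(r"md/raid1:md\d+: Disk failure on \w+, disabling device"),
}


def summarize(log_lines):
    """Count how often each known event appears in the given log lines."""
    counts = {name: 0 for name in EVENTS}
    for line in log_lines:
        for name, pattern in EVENTS.items():
            if pattern.search(line):
                counts[name] += 1
    return counts


# A few lines copied from the log fragment above.
LOG = [
    "Oct 29 14:47:38 maszyna kernel: [ 2946.836138] ata2: hard resetting link",
    "Oct 29 14:47:41 maszyna kernel: [ 2949.192033] ata2: COMRESET failed (errno=-32)",
    "Oct 29 14:47:41 maszyna kernel: [ 2949.192121] ata2: reset failed (errno=-32), retrying in 8 secs",
    "Oct 29 14:47:59 maszyna kernel: [ 2966.888127] ata2: hard resetting link",
    "Oct 29 14:48:39 maszyna kernel: [ 3006.968127] ata2: reset failed, giving up",
    "Oct 29 14:48:39 maszyna kernel: [ 3006.972330] md/raid1:md0: Disk failure on sdb1, disabling device.",
]

if __name__ == "__main__":
    print(summarize(LOG))
```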
: 03 November 2014, 08:20
author: franek4always
Here you can see a problem with sectors, a classic Seagate ailment:
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 3
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 3
If this is a server, the disk belongs in the bin. Run a disk test with the command:
: 03 November 2014, 15:06
author: rafsze
As for the disk dropping out and disappearing from the system entirely, the problem seems to have been solved by replacing the SATA data cable with a new one. From Thursday until today there has been no mention of it in the logs.
The disk is only about a month old, so the question is whether to return it under warranty and get the same model, or rather go with a different manufacturer?
Code:
smartctl 5.41 2011-06-09 r3365 [i686-linux-3.2.0-4-686-pae] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.10
Device Model: ST380815AS
Serial Number: 6RW3302V
Firmware Version: 4.ADA
User Capacity: 80,000,000,000 bytes [80,0 GB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Mon Nov 3 14:45:17 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 430) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 27) minutes.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always   -           0
  3 Spin_Up_Time            0x0003   099   097   070    Pre-fail  Always   -           0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always   -           1726
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x000f   067   060   030    Pre-fail  Always   -           5511102
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always   -           676
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always   -           0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always   -           1785
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always   -           0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always   -           0
190 Airflow_Temperature_Cel 0x0022   059   054   045    Old_age   Always   -           41 (Min/Max 32/43)
194 Temperature_Celsius     0x0022   041   046   000    Old_age   Always   -           41 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   069   056   000    Old_age   Always   -           194561909
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always   -           3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline  -           3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always   -           0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline  -           0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always   -           0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 676 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
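To keep an eye on the sector counters in dumps like the one above without reading the whole table each time, here is a rough Python sketch that pulls the raw values out of the attribute rows and reports the watched ones that are non-zero. The watch list and the names parse_attributes and suspicious are my own choices for illustration, not anything smartctl defines.

```python
# ASSUMPTION: a hand-picked list of attributes worth watching; adjust to taste.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "Reported_Uncorrect",
    "UDMA_CRC_Error_Count",
)


def parse_attributes(smart_text):
    """Map attribute name -> raw value (int) from smartctl attribute rows."""
    attrs = {}
    for line in smart_text.splitlines():
        fields = line.split()
        # A data row starts with a numeric ID, has a 0x flag and at least 10 columns.
        if (len(fields) >= 10 and fields[0].isdigit()
                and fields[2].startswith("0x") and fields[9].isdigit()):
            attrs[fields[1]] = int(fields[9])
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return {name: attrs[name] for name in WATCHED if attrs.get(name, 0) > 0}


# Rows copied from the /dev/sdb dump above.
SDB_ROWS = """\
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 3
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 3
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

if __name__ == "__main__":
    print(suspicious(parse_attributes(SDB_ROWS)))
```

The parser splits on whitespace, so it does not depend on how the columns are aligned; rows whose raw value is not a plain number are skipped.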
: 04 November 2014, 12:01
author: sethiel
Then hold off on replacing it and check in a few days whether it still drops out.
For my part, I also have some lousy 2 TB Seagate disks, and we have already sent them back and forth several times; the company is going downhill.