Software RAID - disks dropping out of the array

Post by **rafsze** » 30 October 2014, 01:49

Hi,

At work I have a small web and database server running Debian that hosts a web application. To improve reliability I installed it on two disks joined in RAID1 (mirror). For what I think is the third time now, a disk has dropped out of the array on its own. After I issue the appropriate commands the disk returns to the array, the system dutifully rebuilds it, and everything looks OK. (I don't remember how it went the previous times, but today I had to reboot the system because it did not see the disk at all.)

I'm wondering what causes this and whether it can be prevented.
The server does alert me when this happens, of course, but it is a bit annoying to rebuild the array every 20 days or so (incidentally, it is always the same disk that drops out, /dev/sdb).

I should add that the server runs 24/7.

Any ideas?
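
For anyone following along: the degraded state described above is visible in /proc/mdstat, where a two-disk RAID1 with one member missing shows `[2/1] [U_]` instead of `[2/2] [UU]`. The sketch below is only an illustration of reading that status line. The sample text, including the block count, is made up and is not output from this server, and `degraded_arrays` is a hypothetical helper name.

```python
import re

# Hypothetical /proc/mdstat contents for a degraded two-disk RAID1.
# The device names and block count are invented for illustration.
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sda1[0]
      78123968 blocks [2/1] [U_]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: member_map} for arrays with fewer active members than expected."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # "[2/1] [U_]" means 2 members expected, 1 active, second slot missing.
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            expected, active = int(status.group(1)), int(status.group(2))
            if active < expected:
                result[current] = status.group(3)
    return result


if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

The thread does not say which commands were used to bring the disk back. With mdadm this is typically done with its `--add` option on the array, but treat that as an assumption about this setup.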

Post by **LordRuthwen** » 30 October 2014, 08:25

Yes, read the SMART data and see what is going on with that disk, because in my opinion it is damaged and you should think about replacing it.

Post by **rafsze** » 30 October 2014, 13:45

Hm, I basically did not check the disk after the first dropout. Having learned from experience with the supplier of this server (he had a talent for making everything he delivered break eventually, a kind of anti-Midas king), I replaced the failed disk with a new one and, as a precaution, replaced the working one too.

In any case, I will check these disks to be safe when I am at the office.

Post by **rafsze** » 30 October 2014, 16:54

I dropped by the office to run the tests and it turned out the disk had dropped out again, so after about 2 days...
As before, the system did not see the disk at all; only after a reboot could it be added back. The array is rebuilding right now.
I tested both disks (sdb is the one that drops out). Basically the only parameter that puzzles me is Seek_Error_Rate, but I could not find any reference values for it. What do you think?

Maybe the real problem is that the disk disappears from the system entirely, and that is why it drops out of the array.

/dev/sda

Code:

smartctl 5.41 2011-06-09 r3365 [i686-linux-3.2.0-4-686-pae] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST380815AS
Serial Number:    9RW4MDSW
Firmware Version: 4.ADA
User Capacity:    80,000,000,000 bytes [80,0 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Oct 30 16:41:44 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (  430) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  27) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   098   097   070    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   063   060   030    Pre-fail  Always       -       2307929
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       858
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       16
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   064   057   045    Old_age   Always       -       36 (Min/Max 33/36)
194 Temperature_Celsius     0x0022   036   043   000    Old_age   Always       -       36 (0 22 0 0)
195 Hardware_ECC_Recovered  0x001a   088   075   000    Old_age   Always       -       182934975
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

/dev/sdb

Code:

smartctl 5.41 2011-06-09 r3365 [i686-linux-3.2.0-4-686-pae] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST380815AS
Serial Number:    6RW3302V
Firmware Version: 4.ADA
User Capacity:    80,000,000,000 bytes [80,0 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Oct 30 16:41:49 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (  430) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  27) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   099   097   070    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1725
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   066   060   030    Pre-fail  Always       -       4622459
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       583
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1784
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   065   054   045    Old_age   Always       -       35 (Min/Max 29/35)
194 Temperature_Celsius     0x0022   035   046   000    Old_age   Always       -       35 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   071   056   000    Old_age   Always       -       22903739
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

-------------------------
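
On the Seek_Error_Rate question: a widely repeated explanation, which as far as I know is not documented by the vendor, is that Seagate drives pack two counters into the 48-bit raw value, with the upper 16 bits holding the number of seek errors and the lower 32 bits the total number of seeks. The sketch below decodes the two raw values from the dumps above under that assumption; `split_seagate_seek_raw` is a hypothetical helper name.

```python
def split_seagate_seek_raw(raw):
    """Split a 48-bit raw value into (seek_errors, total_seeks).

    Assumes the commonly reported, unofficial Seagate layout:
    upper 16 bits = error count, lower 32 bits = seek count.
    """
    errors = (raw >> 32) & 0xFFFF
    seeks = raw & 0xFFFFFFFF
    return errors, seeks


# Raw Seek_Error_Rate values taken from the two dumps in this post.
for name, raw in (("sda", 2307929), ("sdb", 4622459)):
    errors, seeks = split_seagate_seek_raw(raw)
    print(f"{name}: {errors} seek errors in {seeks} seeks")
```

If that layout applies to this model, both raw values are below 2^32, so the error part is zero on both disks and the large number would only be the seek counter.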

I also went through the syslogs and found this fragment right before the mdstat message about one of the disks being removed from the array.

Does anyone know what these entries mean?

Code:

Oct 29 14:47:38 maszyna kernel: [ 2946.835795] ata2: exception Emask 0x10 SAct 0x0 SErr 0x10000 action 0xe frozen
Oct 29 14:47:38 maszyna kernel: [ 2946.835932] ata2: irq_stat 0x00400000, PHY RDY changed
Oct 29 14:47:38 maszyna kernel: [ 2946.836053] ata2: SError: { PHYRdyChg }
Oct 29 14:47:38 maszyna kernel: [ 2946.836138] ata2: hard resetting link
Oct 29 14:47:41 maszyna kernel: [ 2949.192033] ata2: COMRESET failed (errno=-32)
Oct 29 14:47:41 maszyna kernel: [ 2949.192121] ata2: reset failed (errno=-32), retrying in 8 secs
Oct 29 14:47:48 maszyna kernel: [ 2956.836052] ata2: limiting SATA link speed to 1.5 Gbps
Oct 29 14:47:48 maszyna kernel: [ 2956.836060] ata2: hard resetting link
Oct 29 14:47:55 maszyna kernel: [ 2962.856051] ata2: link is slow to respond, please be patient (ready=0)
Oct 29 14:47:59 maszyna kernel: [ 2966.888034] ata2: COMRESET failed (errno=-16)
Oct 29 14:47:59 maszyna kernel: [ 2966.888127] ata2: hard resetting link
Oct 29 14:48:04 maszyna kernel: [ 2972.652036] ata2: link is slow to respond, please be patient (ready=0)
Oct 29 14:48:34 maszyna kernel: [ 3001.940045] ata2: COMRESET failed (errno=-16)
Oct 29 14:48:34 maszyna kernel: [ 3001.940138] ata2: hard resetting link
Oct 29 14:48:39 maszyna kernel: [ 3006.968037] ata2: COMRESET failed (errno=-16)
Oct 29 14:48:39 maszyna kernel: [ 3006.968127] ata2: reset failed, giving up
Oct 29 14:48:39 maszyna kernel: [ 3006.968202] ata2.00: disabled
Oct 29 14:48:39 maszyna kernel: [ 3006.968221] ata2: exception Emask 0x10 SAct 0x0 SErr 0x4050002 action 0xe frozen t4
Oct 29 14:48:39 maszyna kernel: [ 3006.968353] ata2: irq_stat 0x00400040, connection status changed
Oct 29 14:48:39 maszyna kernel: [ 3006.968461] ata2: SError: { RecovComm PHYRdyChg CommWake DevExch }
Oct 29 14:48:39 maszyna kernel: [ 3006.968578] ata2: hard resetting link
Oct 29 14:48:39 maszyna kernel: [ 3006.972035] sd 1:0:0:0: rejecting I/O to offline device
Oct 29 14:48:39 maszyna kernel: [ 3006.972172] sd 1:0:0:0: [sdb] killing request
Oct 29 14:48:39 maszyna kernel: [ 3006.972216] sd 1:0:0:0: rejecting I/O to offline device
Oct 29 14:48:39 maszyna kernel: [ 3006.972321] md: super_written gets error=-5, uptodate=0
Oct 29 14:48:39 maszyna kernel: [ 3006.972330] md/raid1:md0: Disk failure on sdb1, disabling device.
Oct 29 14:48:39 maszyna kernel: [ 3006.972332] md/raid1:md0: Operation continuing on 1 devices.
Oct 29 14:48:39 maszyna kernel: [ 3006.975909] RAID1 conf printout:
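
Read in order, the excerpt above shows a link-level event on ata2 (PHYRdyChg), repeated hard resets whose COMRESET fails (errno 32 is EPIPE and 16 is EBUSY on Linux), the kernel giving up and disabling ata2.00, and only then md marking sdb1 as failed. That sequence looks more like the connection to the drive being lost than the drive reporting a bad read, though the log alone cannot prove it. As a rough sketch of how such a sequence could be picked out of a syslog automatically, the snippet below counts those events in a few lines copied from the excerpt; `summarize` and the pattern names are hypothetical.

```python
import re
from collections import Counter

# Lines copied verbatim from the syslog excerpt above.
LOG = """\
Oct 29 14:47:41 maszyna kernel: [ 2949.192033] ata2: COMRESET failed (errno=-32)
Oct 29 14:47:59 maszyna kernel: [ 2966.888034] ata2: COMRESET failed (errno=-16)
Oct 29 14:48:34 maszyna kernel: [ 3001.940045] ata2: COMRESET failed (errno=-16)
Oct 29 14:48:39 maszyna kernel: [ 3006.968037] ata2: COMRESET failed (errno=-16)
Oct 29 14:48:39 maszyna kernel: [ 3006.968127] ata2: reset failed, giving up
Oct 29 14:48:39 maszyna kernel: [ 3006.968202] ata2.00: disabled
Oct 29 14:48:39 maszyna kernel: [ 3006.972330] md/raid1:md0: Disk failure on sdb1, disabling device.
"""

COMRESET = re.compile(r"\b(ata\d+): COMRESET failed")
GAVE_UP = re.compile(r"\b(ata\d+): reset failed, giving up")
MD_FAIL = re.compile(r"md/raid\d+:(md\d+): Disk failure on (\w+)")


def summarize(log_text):
    """Return (failed resets per port, ports given up on, md member failures)."""
    failed_resets = Counter()
    gave_up = []
    md_failures = []
    for line in log_text.splitlines():
        match = COMRESET.search(line)
        if match:
            failed_resets[match.group(1)] += 1
            continue
        match = GAVE_UP.search(line)
        if match:
            gave_up.append(match.group(1))
            continue
        match = MD_FAIL.search(line)
        if match:
            md_failures.append((match.group(1), match.group(2)))
    return dict(failed_resets), gave_up, md_failures


if __name__ == "__main__":
    print(summarize(LOG))
```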

Post by **franek4always** » 03 November 2014, 08:20

You can see a problem with sectors here, a classic Seagate ailment:

197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 3
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 3

If this is a server, the disk belongs in the bin. Run a disk test with this command:

Code:

smartctl -t long /dev/sdb
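
The two attributes quoted above are the ones worth watching between tests. A minimal sketch of pulling the non-zero sector counters out of a smartctl attribute table follows. The sample rows are copied from the /dev/sdb dump earlier in the thread, while `nonzero_sector_counters` and the choice of watched IDs (5, 197, 198) are my own assumptions.

```python
# Attribute IDs treated as sector-health counters in this sketch (an assumption).
WATCHED_IDS = {5, 197, 198}

# Rows copied from the /dev/sdb attribute table posted earlier in the thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
"""


def nonzero_sector_counters(attr_table):
    """Return {attribute_name: raw_value} for watched attributes with a raw value above zero."""
    flagged = {}
    for line in attr_table.splitlines():
        fields = line.split()
        # Expected columns: ID, NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED, WHEN_FAILED, RAW_VALUE
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id, name, raw = int(fields[0]), fields[1], fields[9]
        if attr_id in WATCHED_IDS and raw.isdigit() and int(raw) > 0:
            flagged[name] = int(raw)
    return flagged


if __name__ == "__main__":
    print(nonzero_sector_counters(SAMPLE))
```

Once the long test has finished, its result appears in the drive's self-test log, which `smartctl -l selftest /dev/sdb` prints.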

Post by **rafsze** » 03 November 2014, 15:06

As for the disk dropping out and vanishing from the system entirely, that problem seems to have been solved by replacing the SATA data cable with a new one. From Thursday until today there has been no mention of it in the logs.

The disk is only about a month old, so the question is whether to return it under warranty and get the same model, or go with a different manufacturer instead?

Code:

smartctl 5.41 2011-06-09 r3365 [i686-linux-3.2.0-4-686-pae] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.10
Device Model:     ST380815AS
Serial Number:    6RW3302V
Firmware Version: 4.ADA
User Capacity:    80,000,000,000 bytes [80,0 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Nov  3 14:45:17 2014 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82)    Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0)    The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:         (  430) seconds.
Offline data collection
capabilities:              (0x5b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003)    Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01)    Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:      (   2) minutes.
Extended self-test routine
recommended polling time:      (  27) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   253   006    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   099   097   070    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   099   099   020    Old_age   Always       -       1726
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   067   060   030    Pre-fail  Always       -       5511102
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       676
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   099   099   020    Old_age   Always       -       1785
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   059   054   045    Old_age   Always       -       41 (Min/Max 32/43)
194 Temperature_Celsius     0x0022   041   046   000    Old_age   Always       -       41 (0 23 0 0)
195 Hardware_ECC_Recovered  0x001a   069   056   000    Old_age   Always       -       194561909
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       3
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 Data_Address_Mark_Errs  0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       676         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
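
To see what changed on /dev/sdb between the dump from 30 October and this one, the raw values can be compared directly. The sketch below does that for a handful of attributes copied from the two dumps; `raw_deltas` is a hypothetical helper and the selection of attributes is mine.

```python
# Raw values for /dev/sdb copied from the two dumps posted in this thread.
OCT_30 = {
    "Start_Stop_Count": 1725,
    "Power_On_Hours": 583,
    "Power_Cycle_Count": 1784,
    "Current_Pending_Sector": 3,
    "Offline_Uncorrectable": 3,
    "UDMA_CRC_Error_Count": 0,
}
NOV_3 = {
    "Start_Stop_Count": 1726,
    "Power_On_Hours": 676,
    "Power_Cycle_Count": 1785,
    "Current_Pending_Sector": 3,
    "Offline_Uncorrectable": 3,
    "UDMA_CRC_Error_Count": 0,
}


def raw_deltas(before, after):
    """Return {name: after - before} for attributes whose raw value changed."""
    return {
        name: after[name] - before[name]
        for name in before
        if name in after and after[name] != before[name]
    }


if __name__ == "__main__":
    print(raw_deltas(OCT_30, NOV_3))
```

By these numbers the disk ran for another 93 hours with one additional power cycle, while the pending and offline-uncorrectable counts stayed at 3 and the CRC error count stayed at 0, even though the extended self-test is logged as completed without error.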

Post by **sethiel** » 04 November 2014, 12:01

Then hold off on the replacement and check in a few days whether it still drops out.
For my part, I also have some lousy 2TB Seagate disks, and they have already gone back and forth on warranty a few times. The company is going downhill.
