Once again I have a problem with a cluster (2 machines) configured according to Clusters from Scratch. The problem started at 22:00, and in fact on two clusters at the same time (the previous failure was also on a Wednesday at 22:00).
It all starts with this:
Code:
Nov 25 22:01:06 [1289] clouster2 corosync notice [TOTEM ] A processor failed, forming new configuration.
Nov 25 22:01:06 [1340] clouster2 cib: info: crm_client_new: Connecting 0x7f8b5c938fc0 for uid=0 gid=0 pid=55986 id=4b073a6c-9d58-4ef6-ac37-c000e9cce483
Nov 25 22:01:06 [1340] clouster2 cib: info: crm_client_destroy: Destroying 0 events
Nov 25 22:01:08 [1289] clouster2 corosync notice [TOTEM ] A new membership (10.0.0.1:188) was formed. Members
Nov 25 22:02:13 [1345] clouster2 crmd: info: crm_timer_popped: PEngine Recheck Timer (I_PE_CALC) just popped (900000ms)
Nov 25 22:02:13 [1345] clouster2 crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED origin=crm_timer_popped ]
Nov 25 22:02:13 [1345] clouster2 crmd: info: do_state_transition: Progressed to state S_POLICY_ENGINE after C_TIMER_POPPED
Nov 25 22:02:13 [1340] clouster2 cib: info: cib_process_request: Completed cib_query operation for section 'all': OK (rc=0, origin=local/crmd/3673, version=0.165.2)
Code:
node $id="1" cluster1 \
attributes standby="off"
node $id="2" cluster2 \
attributes standby="off" maintenance="off"
primitive Lvm ocf:heartbeat:LVM \
params volgrpname="opt" \
op start timeout="30" interval="0" \
op stop timeout="30" interval="0" \
meta target-role="Started"
primitive courier-authdaemon lsb:courier-authdaemon \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s"
primitive courier-imap lsb:courier-imap \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s" \
meta target-role="Started"
primitive courier-pop lsb:courier-pop \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s"
primitive ip_flow ocf:heartbeat:IPaddr2 \
params ip="91.220.164.13" cidr_netmask="25" nic="eth1" \
op monitor interval="30s"
primitive ipsource lsb:ipsource.sh \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s" \
meta target-role="Started"
primitive mysql upstart:mysql
primitive opt-directory ocf:heartbeat:Filesystem \
params fstype="ext4" directory="/opt" device="/dev/opt/opt" \
op start interval="0" timeout="240s" \
op stop interval="0" timeout="300s"
primitive opt-drbd ocf:linbit:drbd \
params drbd_resource="opt" \
op monitor interval="60s" role="Master" \
op monitor interval="61s" role="Slave" \
op start timeout="240" interval="0" \
op stop timeout="100" interval="0"
primitive postfix lsb:postfix \
op start interval="0" timeout="120s" \
op stop interval="0" timeout="120s" \
meta target-role="Started"
group ip-group ipsource ip_flow
group poczta mysql courier-authdaemon courier-imap courier-pop postfix \
meta target-role="Started"
ms opt opt-drbd \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" globally-unique="false"
colocation Lvm-with-opt inf: Lvm opt:Master
colocation ip-group-with-opt inf: ip-group opt:Master
colocation opt-directory-with-Lvm inf: opt-directory Lvm
colocation poczta-with-opt-directory inf: poczta opt-directory
order Lvm-before-opt-directory inf: Lvm opt-directory
order opt-before-Lvm inf: opt:promote Lvm:start
order opt-before-ip-group inf: opt:promote ip-group
order poczata-before-opt-directory inf: opt-directory poczta
property $id="cib-bootstrap-options" \
dc-version="1.1.10-42f2063" \
cluster-infrastructure="corosync" \
stonith-enabled="false" \
no-quorum-policy="ignore" \
last-lrm-refresh="1448522891"
clouster1 claims that clouster2 is the master (because it switched over at 22:00) and that everything is OK.
clouster2 claims that the services are stopped:
Code:
crm(live)resource# show
Resource Group: poczta
mysql (upstart:mysql): Stopped
courier-authdaemon (lsb:courier-authdaemon): Stopped
courier-imap (lsb:courier-imap): Stopped
courier-pop (lsb:courier-pop): Stopped
postfix (lsb:postfix): Stopped
Master/Slave Set: opt [opt-drbd]
Slaves: [ clouster1 ]
Stopped: [ clouster2 ]
Lvm (ocf::heartbeat:LVM): Stopped
opt-directory (ocf::heartbeat:Filesystem): Stopped
Resource Group: ip-group
ipsource (lsb:ipsource.sh): Started
ip_flow (ocf::heartbeat:IPaddr2): Started
After that, a cleanup on opt and everything runs fine again. Over the last six months the problem has occurred 3 or 4 times. Does anyone have an idea what could be causing it?
Below are the files with the full logs.
http://filebin.ca/2NrWb30Xovhk/first5min.log
http://filebin.ca/2NrWwag0fjGo/corosunc.log.tar.gz
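Since both failures happened around 22:00, it may be worth checking whether the TOTEM events in the full logs always fall in the same hour. Below is a rough sketch of how that could be done. It only assumes the line format visible in the excerpt above; the function names `totem_events` and `failures_per_hour` are made up for illustration, and the sample lines are copied from the excerpt.

```python
import re
from collections import Counter

# Matches corosync TOTEM lines in the format seen in the log excerpt, e.g.
# "Nov 25 22:01:06 [1289] clouster2 corosync notice [TOTEM ] A processor failed, ..."
TOTEM_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+(?P<hour>\d{2}):\d{2}:\d{2}\s+"
    r".*\[TOTEM\s*\]\s+(?P<msg>A processor failed|A new membership)"
)


def totem_events(lines):
    """Return (month, day, hour, message) for each TOTEM failure/membership line."""
    events = []
    for line in lines:
        m = TOTEM_RE.match(line)
        if m:
            events.append((m["month"], int(m["day"]), int(m["hour"]), m["msg"]))
    return events


def failures_per_hour(events):
    """Count 'A processor failed' events per hour of the day."""
    return Counter(hour for _, _, hour, msg in events if msg == "A processor failed")


# Sample lines taken from the excerpt in this post.
sample = [
    "Nov 25 22:01:06 [1289] clouster2 corosync notice [TOTEM ] A processor failed, forming new configuration.",
    "Nov 25 22:01:06 [1340] clouster2 cib: info: crm_client_destroy: Destroying 0 events",
    "Nov 25 22:01:08 [1289] clouster2 corosync notice [TOTEM ] A new membership (10.0.0.1:188) was formed. Members",
]
events = totem_events(sample)
print(failures_per_hour(events))
```

To run it against a real log, pass an open file object to `totem_events` instead of `sample`. If nearly all failures cluster in one hour, that would point towards something scheduled (a cron job, a backup, network maintenance) rather than the cluster configuration itself, but that is only a guess on my part.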