Explanation
The cluster has an offline node and has determined
that one of the candidate nodes matches the characteristics of the
offline node. The cluster attempted to add the node back into the
cluster but failed, and it has stopped attempting to add the node
automatically.
If a node has incomplete
state data, it remains offline after it starts. This occurs if a
loss of power or a hardware failure prevented the node from writing
all of its state data to disk. A node in this state reports node
error 578.
If the cluster
has made three attempts to automatically add a matching candidate
node, but the node has not remained online for 24 hours, the cluster
stops automatic attempts to add the node and logs error
code 1194 "Automatic recovery of offline node failed".
This
error event can be logged in two possible scenarios:
- The node has failed without saving all of its state data. The
node has restarted, possibly after a repair, and shows node error
578 and is a candidate node for joining the cluster. The cluster attempts
to add the node into the cluster but does not succeed. After 15 minutes,
the cluster makes a second attempt to add the node into the cluster
and again does not succeed. After another 15 minutes, the cluster
makes a third attempt to add the node into the cluster and again does
not succeed. After another 15 minutes, the cluster logs error code
1194. The node never came online during the attempts to add it to
the cluster.
- The node has failed without saving all of its state data. The
node has restarted, possibly after a repair, and shows node error
578 and is a candidate node for joining the cluster. The cluster attempts
to add the node into the cluster, succeeds, and the node comes
online. Within 24 hours the node fails again without saving its state
data. The node restarts and shows node error 578 and is a candidate
node for joining the cluster. The cluster again attempts to add the
node into the cluster, succeeds, and the node becomes online; however,
the node again fails within 24 hours. The cluster attempts a third
time to add the node into the cluster, succeeds, and the node becomes
online; however, the node again fails within 24 hours. After another
15 minutes, the cluster logs error code 1194.
A combination of these scenarios is also possible.
Note:
If the node is manually removed from the cluster, the count of automatic
recovery attempts is reset to zero.
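The counting behavior described above can be sketched as a small model. This is only an illustration of the documented rules (three automatic attempts, then error code 1194, with the count reset by a manual removal); it is not the cluster's actual implementation. The class name, method names, and action strings are hypothetical, and the sketch does not model the 15-minute waits or the 24-hour window.

```python
from dataclasses import dataclass

# Values taken from the explanation above.
MAX_AUTO_ATTEMPTS = 3       # automatic add attempts before the cluster gives up
ERROR_CODE_GAVE_UP = 1194   # "Automatic recovery of offline node failed"


@dataclass
class AutoRecoveryCounter:
    """Hypothetical model of the automatic recovery count for one node."""

    attempts: int = 0
    error_logged: bool = False

    def next_action(self) -> str:
        """Return the action taken while the node is still not recovered."""
        if self.attempts < MAX_AUTO_ATTEMPTS:
            self.attempts += 1
            return "attempt_add"
        if not self.error_logged:
            self.error_logged = True
            return f"log_error_{ERROR_CODE_GAVE_UP}"
        return "no_action"

    def manual_remove(self) -> None:
        """Manual removal resets the count to zero.

        Clearing error_logged as well is an assumption of this sketch.
        """
        self.attempts = 0
        self.error_logged = False


counter = AutoRecoveryCounter()
actions = [counter.next_action() for _ in range(5)]
counter.manual_remove()
after_reset = counter.next_action()
```

In this model the first three calls attempt to add the node, the fourth logs the error, and later calls do nothing until the node is manually removed.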
User Response
- If the node has been continuously online in the cluster for more
than 24 hours, mark the error as fixed and go to the Repair Verification
MAP.
- Determine the history of events for this node by locating events
for this node name in the event log. The node ID changes each time
the node is added, so match on the WWNN and node name instead. Also
check the service records. In particular, note entries that indicate
any of these three events: 1) the node is missing from the cluster
(cluster error 1195, event 009052), 2) an attempt to automatically
recover the offline node is starting (event 980352), 3) the node has
been added to the cluster (event 980349).
- If the node has not been added to the cluster since the recovery
process started, there is probably a hardware problem. The node's
internal disk might be failing in a way that prevents the node from
changing its software level to match that of the cluster. If
you have not yet determined the root cause of the problem, you can
attempt to manually remove the node from the cluster and add the node
back into the cluster. Continuously monitor the status of the nodes
in the cluster while the cluster is attempting to add the node. Note:
If the node type is not supported by the software version of the cluster,
the node will not appear as a candidate node. Therefore, incompatible
hardware is not a potential root cause of this error.
- If the node was added to the cluster but failed again before it
has been online for 24 hours, investigate the root cause of the failure.
If no events in the event log indicate the reason for the node failure,
collect dumps and contact IBM technical support for assistance.
- When you have fixed the problem with the node, you must use either
the cluster console or the command line interface to manually remove
the node from the cluster and add the node into the cluster.
- Mark the error as fixed and go to the Repair Verification MAP.
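The event history check in the steps above can be sketched as a filter over exported event log records. The record layout (dictionaries with "time", "event_id", "node_name", and "wwnn" keys), the function names, and the sample values are all assumptions made for illustration; they do not reflect the real event log format. The sketch also assumes that a record belongs to the node if either the WWNN or the node name matches, and that the recovery process started at the first event 980352.

```python
# Event IDs named in the User Response steps.
EVENTS_OF_INTEREST = {
    "009052": "node missing from cluster (cluster error 1195)",
    "980352": "automatic recovery of offline node starting",
    "980349": "node added to cluster",
}


def node_history(events, node_name, wwnn):
    """Return the relevant events for one node, oldest first.

    The node ID is ignored on purpose because it changes between adds.
    """
    matches = [
        e for e in events
        if e["event_id"] in EVENTS_OF_INTEREST
        and (e["node_name"] == node_name or e["wwnn"] == wwnn)
    ]
    return sorted(matches, key=lambda e: e["time"])


def added_since_recovery_started(history):
    """True if the node was added at or after the first recovery attempt."""
    starts = [e["time"] for e in history if e["event_id"] == "980352"]
    if not starts:
        return False
    first_start = min(starts)
    return any(
        e["event_id"] == "980349" and e["time"] >= first_start
        for e in history
    )


# Hypothetical sample records; times are arbitrary increasing integers.
events = [
    {"time": 10, "event_id": "009052", "node_name": "node2", "wwnn": "WWNN-A"},
    {"time": 20, "event_id": "980352", "node_name": "node2", "wwnn": "WWNN-A"},
    {"time": 25, "event_id": "980352", "node_name": "node7", "wwnn": "WWNN-B"},
]
history = node_history(events, "node2", "WWNN-A")
never_added = not added_since_recovery_started(history)
```

If never_added is true, the history matches the case where the node has not been added since recovery started, which points toward a probable hardware problem.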
Possible Cause-FRUs or other:
None, although investigation might indicate a hardware failure.