Fix hardware errors

Before running a system recovery procedure, it is important to identify and fix the root cause of the hardware issues.

Fixing the root cause can be enough to recover the system if these faults are what is causing it to fail. Check for the following common issues, which can be easily resolved:
  • A node has been powered off or its power cords have been unplugged.
  • Check the node status of every node canister that is part of this system. Resolve all hardware errors except node error 578 or node error 550.
    • All nodes must be reporting either a node error 578 or a node error 550. These error codes indicate that the system has lost its configuration data. If any nodes report anything other than these error codes, do not perform a recovery. You can encounter situations where non-configuration nodes report other node errors, such as a node error 550. The 550 error can also indicate that a node is not able to join a system.
    • If any nodes show a node error 550, record the error data that is associated with the 550 error from the service assistant.
      • In addition to the node error 550, the report can show data that is separated by spaces in one of the following forms:
        • Node identifiers in the format: <enclosure_serial>-<canister slot ID> (7 characters, hyphen, 1 number), for example, 01234A6-2
        • Quorum drive identifiers in the format: <enclosure_serial>:<drive slot ID>[<drive 11S serial number>] (7 characters, colon, 1 or 2 numbers, open square bracket, 22 characters, close square bracket), for example, 01234A9:21[11S1234567890123456789]
        • Quorum MDisk identifier in the format: WWPN/LUN (16 hexadecimal digits followed by a forward slash and a decimal number), for example, 1234567890123456/12
      • If the error data contains a node identifier, ensure that the node that is referred to by the ID is showing node error 578. If the node is showing a node error 550, ensure that the two nodes can communicate with each other. Verify the SAN connectivity, and if the 550 error is still present, restart one of the two nodes from the service assistant by clicking Restart Node.
      • If the error data contains a quorum drive identifier, locate the enclosure with the reported serial number. Verify that the enclosure is powered on and that the drive in the reported slot is powered on and functioning. If the node canister that is reporting the fault is in the I/O group of the listed enclosure, ensure that it has SAS connectivity to the listed enclosure. If the node canister that is reporting the fault is in a different I/O group from the listed enclosure, ensure that the listed enclosure has SAS connectivity to both node canisters in the control enclosure in its I/O group. After verification, restart the node by clicking Restart Node from the service assistant.
      • If the error data contains a quorum MDisk identifier, verify the SAN connectivity between this node and that WWPN. Check the storage controller to ensure that the LUN referred to is online. After verification, if the 550 error is still present, restart the node from the service assistant by clicking Restart Node.
      • If there is no error data, the error occurs because there are insufficient connections between nodes over the Fibre Channel network. Each node must have at least two independent Fibre Channel logical connections, or logins, to every node that is not in the same enclosure. An independent connection is one where both physical ports are different. In this case, there is a connection between the nodes, but there is not a redundant connection. Wait 3 minutes for the SAN to initialize. Next, verify:
        • There are at least two Fibre Channel ports that are operational and connected on every node.
        • The SAN zoning allows every port to connect to every port on every other node.
        • All redundant SANs (if used) are operational.

        After verification, if the 550 error is still present, restart the node from the service assistant by clicking Restart Node.

      Note: If, after resolving all these scenarios, half or more of the nodes are reporting node error 578, it is appropriate to run the recovery procedure. Call the Lenovo Support Center for further assistance.
    • For any nodes that are reporting a node error 550, ensure that all the missing hardware that is identified by these errors is powered on and connected without faults. If you cannot contact the service assistant from any node, isolate the problems by using the LED indicators.
    • If you have not been able to restart the system, and if any node other than the current node is reporting node error 550 or 578, you must remove system data from those nodes. This action acknowledges the data loss and puts the nodes into the required candidate state.
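
When recording the error data for a node error 550, it can help to sort the space-separated items into the three identifier forms described above before deciding which checks to perform. The following Python sketch is an illustration only and is not part of the product or the service assistant. The function names (classify_token, classify_error_data, recovery_precondition_met) and the input shapes are hypothetical. The patterns assume that enclosure serial numbers and drive serial numbers are alphanumeric, which the text implies by its examples but does not state. The last function encodes one reading of the rule that every node must report only node error 578 or node error 550 before a recovery is considered.

```python
import re

# Patterns derived from the identifier formats described in the text.
# Assumption: "characters" means ASCII letters and digits.
NODE_ID = re.compile(r"[0-9A-Za-z]{7}-\d")                         # 01234A6-2
QUORUM_DRIVE = re.compile(r"[0-9A-Za-z]{7}:\d{1,2}\[[0-9A-Za-z]{22}\]")
QUORUM_MDISK = re.compile(r"[0-9A-Fa-f]{16}/\d+")                  # WWPN/LUN

# Node error codes that the text allows to remain before a recovery.
ALLOWED_CODES = {550, 578}


def classify_token(token):
    """Return the identifier form that a single error-data item matches."""
    if NODE_ID.fullmatch(token):
        return "node"
    if QUORUM_DRIVE.fullmatch(token):
        return "quorum_drive"
    if QUORUM_MDISK.fullmatch(token):
        return "quorum_mdisk"
    return "unknown"


def classify_error_data(error_data):
    """Split space-separated error data and classify each item."""
    return [(token, classify_token(token)) for token in error_data.split()]


def recovery_precondition_met(node_errors):
    """Check that every node reports only node error 550 or 578.

    node_errors is a hypothetical mapping of node name to the list of
    node error codes that were read manually from the service assistant.
    """
    if not node_errors:
        return False
    return all(
        codes and set(codes) <= ALLOWED_CODES
        for codes in node_errors.values()
    )


if __name__ == "__main__":
    sample = "01234A6-2 01234A9:21[11S1234567890123456789] 1234567890123456/12"
    for token, kind in classify_error_data(sample):
        print(f"{kind:13} {token}")
    print(recovery_precondition_met({"node1": [578], "node2": [550]}))
```

An item that is classified as unknown should be recorded exactly as shown and not interpreted. A True result from recovery_precondition_met means only that no other node errors were recorded. It does not replace the checks in this procedure or the guidance of the Lenovo Support Center.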