HyperSwap system configuration details

You can create an IBM HyperSwap® topology system configuration where each control enclosure that is used to access a HyperSwap volume is physically at a different site. When used with active-active relationships to create HyperSwap volumes, these configurations maintain access to data on the system when power failures or site-wide outages occur.

In a HyperSwap configuration, each site is defined as an independent failure domain. If one site experiences a failure, the other site can continue to operate without disruption. You must also configure a third site to host a quorum device or IP quorum application that provides an automatic tie-break if the link between the two main sites fails. The main sites can be in the same room, in different rooms of the same data center, in buildings on the same campus, or in buildings in different cities. Different kinds of sites protect against different types of failures.

Sites are within a single location
If each site is a different power phase within a single location or data center, the system can survive the failure of any single power domain. For example, one node can be placed in one rack and the other node in a second rack, with each rack considered a separate site that has its own power phase. If power is lost to one of the racks, the partner node in the other rack can continue to process requests, effectively providing access to data even when the other node is offline because of the power disruption.
Each site is at a separate location
If each site is a different physical location, the system can survive the failure of any single location. The sites can be a short distance apart, for example two sites in the same city, or they can be spread farther geographically, such as two sites in separate cities. If one site experiences a site-wide disaster, the other site remains available to process requests.

If configured properly, the system continues to operate after the loss of one site. The key prerequisite is that each site contains one of the control enclosures that are used to access copies of the HyperSwap volume.

In the management GUI, the Modify System Topology wizard simplifies setting up the HyperSwap system topology. After you configure the HyperSwap topology, you can use the Create Volumes wizard to create HyperSwap volumes with a copy at each site. The HyperSwap volume wizard also automatically creates the active-active relationships and change volumes that manage replication between sites. If you configure HyperSwap by using the command-line interface, you must configure the system topology, the volumes, and the active-active relationships as separate steps.
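
For example, the command-line steps might resemble the following sketch. The site, node, controller, pool, and volume names are placeholders, and the exact command names and parameters depend on the code level of your system, so verify them against the CLI reference for your release.

  # Name the three sites (sites 1 and 2 are the main sites; site 3 hosts the quorum device).
  chsite -name site_A 1
  chsite -name site_B 2
  chsite -name site_Q 3

  # Assign each node canister and each external storage controller to its site,
  # then switch the system to the HyperSwap topology.
  chnodecanister -site site_A node1
  chnodecanister -site site_B node2
  chcontroller -site site_A controller0
  chcontroller -site site_B controller1
  chsystem -topology hyperswap

  # Create one volume copy in a pool at each site, link the copies with an
  # active-active relationship, and attach a thin-provisioned change volume
  # at each site.
  mkvdisk -mdiskgrp pool_site_A -size 100 -unit gb -name vol1_master
  mkvdisk -mdiskgrp pool_site_B -size 100 -unit gb -name vol1_aux
  mkrcrelationship -master vol1_master -aux vol1_aux -cluster local_system -activeactive -name vol1_rel
  mkvdisk -mdiskgrp pool_site_A -size 100 -unit gb -rsize 0% -autoexpand -name vol1_master_cv
  mkvdisk -mdiskgrp pool_site_B -size 100 -unit gb -rsize 0% -autoexpand -name vol1_aux_cv
  chrcrelationship -masterchange vol1_master_cv vol1_rel
  chrcrelationship -auxchange vol1_aux_cv vol1_rel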

You must configure a HyperSwap system to meet the following requirements:
  • Directly connect each node to two or more SAN fabrics at the primary and secondary sites (2 - 8 fabrics are supported). Sites are defined as independent failure domains. A failure domain is a part of the system within a boundary. Any failure within that boundary (such as a power failure, fire, or flood) is contained within the boundary and does not affect any part that is outside of that boundary. Failure domains can be in the same room, in different rooms of the same data center, in buildings on the same campus, or in buildings in different towns. Different kinds of failure domains protect against different types of faults.
  • Use a third site to house a quorum disk on an external storage system or an IP quorum application on a server.
  • The storage system at the third site, if used, must support extended quorum disks. For more information, see the interoperability matrixes at the following website:

    http://support.lenovo.com/us/en/products/servers/lenovo-storage

  • Place independent storage systems at the primary and secondary sites, and use active-active relationships to mirror the host data between the two sites.
  • Connections can vary based on fibre type and small form-factor pluggable (SFP) transceiver (longwave and shortwave).
  • Nodes that have connections to switches that are longer than 100 meters (109 yards) must use longwave Fibre Channel connections. A longwave small form-factor pluggable (SFP) transceiver can be purchased as an optional component, and must be one of the longwave SFP transceivers that are listed at the following website:

    http://support.lenovo.com/us/en/products/servers/lenovo-storage

  • Avoid using inter-switch links (ISLs) in paths between nodes and external storage systems. If ISLs are unavoidable, do not oversubscribe them, because the Fibre Channel traffic across the ISLs is substantial. For most configurations, trunking is required. Because ISL problems are difficult to diagnose, collect switch-port error statistics and monitor them regularly to detect failures.
  • Using a single switch at the third site can lead to the creation of a single fabric rather than two independent and redundant fabrics. A single fabric is an unsupported configuration.
  • Ethernet port 1 on every node must be connected to the same subnet or subnets. Ethernet port 2 (if used) on every node must be connected to the same subnet, which might be a different subnet from port 1. The same principle applies to the other Ethernet ports.
  • Some service actions require physical access to all nodes in a system. If nodes in a HyperSwap system are separated by more than 100 meters, service actions might require multiple service personnel. Contact your service representative to inquire about multiple site support.
  • Use consistency groups to manage the volumes that belong to an application. This structure ensures that when a rolling disaster occurs, the out-of-date image is consistent and therefore usable for that application.
    • Use consistency groups to maintain data that is usable for disaster recovery for each application. Add relationships for each volume for an application to an appropriate consistency group.
    • You can add relationships to a consistency group only when they are in certain states, including states where both sites are accessible.
    • If you must add a volume to an application to give it more capacity at a time when only one site is accessible, note that you cannot create the HyperSwap relationship or add it to the consistency group at that time. Create the relationship and add it to the group as soon as possible after the failed site recovers, as shown in the sketch after this list.
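
As referenced in the list above, the consistency-group steps on the command line might resemble the following sketch. The group and relationship names are placeholders; verify the commands against the CLI reference for your code level.

  # Create a consistency group for the application, then add each
  # active-active relationship that belongs to that application.
  mkrcconsistgrp -name app_db_grp
  chrcrelationship -consistgrp app_db_grp vol1_rel
  chrcrelationship -consistgrp app_db_grp vol2_rel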

A HyperSwap system locates the active quorum disk at a third site. If communication is lost between the primary and secondary sites, the site with access to the active quorum disk continues to process transactions. If communication is lost to the active quorum disk, an alternative quorum disk at another site can become the active quorum disk.

A system of nodes can be configured to use up to three quorum disks. However, only one quorum disk can be elected to resolve a situation where the system is partitioned into two sets of nodes of equal size. The purpose of the other quorum disks is to provide redundancy if a quorum disk fails before the system is partitioned.
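
To check or adjust the quorum configuration from the command line, a sketch such as the following might apply. The MDisk and quorum IDs are placeholders, and command availability depends on your code level.

  # List the quorum devices and identify which one is currently active.
  lsquorum

  # Assign an MDisk on the third-site storage system as quorum candidate 2,
  # and make it the active quorum device.
  chquorum -mdisk 5 2
  chquorum -active 2

  # If you use an IP quorum application instead of a quorum disk, generate the
  # application on the system and run it on a server at the third site.
  mkquorumapp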

Restriction: Do not connect an external storage system in one site directly to a switch fabric in the other site.

An alternative configuration can use an extra Fibre Channel switch at the third site with connections from that switch to the primary site and to the secondary site.

A HyperSwap system configuration is supported only when the storage system that hosts the quorum disks supports extended quorum. Although the system can use other types of storage systems to provide quorum disks, access to those quorum disks is always through a single path.

For quorum disk configuration requirements, see the technote Guidance for Identifying and Changing Managed Disks Assigned as Quorum Disk Candidates.