The gmlinktolerance feature monitors the response times for Global Mirror relationships in
noncycling mode. You can use the chsystem CLI command or the management GUI to set the gmlinktolerance feature.
The gmlinktolerance feature represents the number of seconds that the primary clustered system tolerates slow response times from the secondary system.
If the poor response extends past the specified tolerance, a 1920 error is logged. Also,
one or more Global Mirror relationships are automatically stopped to protect the application hosts
at the primary site. During normal operation, application hosts see a minimal impact to response
times because the Global Mirror feature uses asynchronous replication. However, if
Global Mirror operations experience
degraded response times from the secondary system for an extended time, I/O operations queue at the
primary system. This situation results in an extended response time to application hosts. In this
case, the gmlinktolerance feature stops Global Mirror relationships and the response time for
application hosts returns to normal. After a 1920 error occurs, the Global Mirror auxiliary
volumes are no longer in the consistent_synchronized state until you fix the
cause of the error and restart your Global Mirror relationships. For this reason, ensure that you
monitor the system to track when this error occurs.
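For example, assuming a tolerance of 300 seconds (a commonly used default) is appropriate for your environment, the value can be set from the CLI with a command similar to the following sketch:
chsystem -gmlinktolerance 300
The value is specified in seconds; confirm the supported range for your code level before you change it.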
You can disable the gmlinktolerance feature by
setting the gmlinktolerance value to 0 (zero), as shown in the example after the following list. However, the gmlinktolerance feature cannot protect
applications from extended response times if it is disabled. It might be appropriate to disable the
gmlinktolerance feature in the following circumstances:
- During SAN maintenance windows, where degraded performance is expected from
SAN components and application hosts can withstand extended response times from Global Mirror
volumes.
- During periods when application hosts can tolerate extended response times, where it is expected
that the gmlinktolerance feature might stop the Global Mirror relationships. For example, if you are
testing by using an I/O generator that is configured to stress the back-end storage, the
gmlinktolerance feature might detect the high latency and stop the Global Mirror relationships.
Disabling the gmlinktolerance feature prevents the relationships from being stopped, at the risk of
exposing the test host to extended response times.
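For example, before a planned maintenance window, the feature can be disabled and later restored with commands similar to the following sketch (the restored value of 300 seconds is only an illustration; record your current setting first):
chsystem -gmlinktolerance 0
chsystem -gmlinktolerance 300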
Diagnosing and fixing 1920 errors
A 1920 error indicates that one
or more of the SAN components are unable to provide the performance that is required by the
application hosts. This error can be temporary (for example, a result
of maintenance activity) or permanent (for example, a result of a hardware failure or unexpected
host I/O workload).
If the 1920 error was preceded by informational event 985004,
Maximum replication delay has been exceeded, the system might not have found a path to
the disk in the remote system within the maximum replication delay timeout value. Investigate the
remote system to find and repair any degraded paths. You can also use the
lssystem command to view the maxreplicationdelay value. If the
value is too low, use the chsystem command to specify a new
maxreplicationdelay value.
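For example, the current setting can be checked and then changed with commands similar to the following sketch; the value of 300 seconds is only an illustration, not a recommendation:
lssystem
chsystem -maxreplicationdelay 300
In the lssystem output, look for the maximum replication delay field (typically shown as max_replication_delay) before you decide whether a change is needed.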
If you are experiencing other 1920 errors, set up a SAN performance
analysis tool, such as IBM Spectrum Control, and
make sure that it is correctly configured and monitoring statistics when the problem occurs. Set
your SAN performance analysis tool to the minimum available statistics collection interval.
For IBM Spectrum Control, the minimum interval
is 5 minutes. If several 1920 errors occur, diagnose the cause of the earliest error first. The
following questions can help you determine the cause of the
error:
- Was maintenance occurring at the time of the error?
Maintenance might include replacing a storage system physical disk, updating the firmware of the
storage system, or completing a code update on one of the systems.
Before you restart the Global Mirror relationships in
noncycling mode, you must wait until the maintenance procedure is complete. Otherwise, another 1920
error is issued because the system has not yet returned to a stable state with good performance.
- Were there any unfixed errors on either the source or target system?
If yes, analyze them to
determine whether they are the reason for the error. In particular, determine whether the errors
relate to the volumes or MDisks that are being used in the relationship or whether the
errors reduced the performance of the target system. Ensure that the errors are fixed before you
restart the Global Mirror relationship.
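For example, unfixed events can be listed from the CLI with a command similar to the following sketch (the -fixed filter is an assumption; check the lseventlog options at your code level):
lseventlog -fixed no
Run the recommended fix procedures for any returned events on both systems before you restart the relationship.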
- Is the long-distance link overloaded?
If your link is not capable of sustaining the
short-term peak Global Mirror workload, a 1920 error can occur. Complete the following checks to
determine whether the long-distance link is overloaded:
- Look at the total Global Mirror auxiliary volume write throughput before the
Global Mirror relationships were stopped. If this throughput is approximately equal to your link
bandwidth, your link might be overloaded. This issue might be due to application host I/O operations
or a combination of host I/O and background (synchronization) copy activities.
- Look at the total Global Mirror source volume write throughput before the
Global Mirror relationships were stopped. This value represents the I/O operations that are being
completed by the application hosts. If these operations are approaching the link's bandwidth,
reduce the I/O operations that the application is attempting to complete, or use Global Mirror to
copy fewer volumes. If the auxiliary disks show significantly more write I/O
operations than the source volumes, there is a high level of background copy.
Decrease the Global Mirror partnership's background copy rate parameter to bring the total
application I/O bandwidth and background copy rate within the link's capabilities (see the example
command after this list).
- Look at the total Global Mirror source volume write throughput after the
Global Mirror relationships were stopped. If write throughput increases by 30% or more when the
relationships are stopped, the application hosts are attempting to complete more I/O operations than
the link can sustain. While the Global Mirror relationships are active, the overloaded link causes
higher response times to the application host, which decreases the throughput it can achieve. After
the Global Mirror relationships stop, the application host sees lower response times. In this case,
the link bandwidth must be increased, the application host I/O rate must be decreased, or fewer
volumes must be copied by using Global Mirror.
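If a high level of background copy is contributing to the overload, the partnership's background copy rate can be reduced as described above. For example, assuming a partnership named site2 (a hypothetical name), a command similar to the following sketch lowers the background copy rate to 25 percent of the configured partnership bandwidth:
chpartnership -backgroundcopyrate 25 site2
Raise the rate again once the link has spare capacity so that synchronization can complete in a reasonable time.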
- Are the storage systems at the
secondary system overloaded?
If
application I/O operations cannot proceed at the rate that is needed by the application host because
one or more MDisks is providing poor service to the system, a 1920 error occurs.
If the back-end storage system requirements were followed, the error might be due to a decrease in
storage system performance.
Check the back-end write response time for each MDisk at the secondary
system. A response time for an individual MDisk that suddenly increased by 50 ms or more, or a
response time above 100 ms, indicates a
problem. Complete the following checks to determine whether the
storage systems are overloaded:
- Check the storage system for error
conditions such as media errors, a failed physical disk, or associated activity such as RAID
rebuilding. Fix any problems and then restart the Global Mirror relationships (see the example
command after this list).
- If there is no error, determine whether the secondary storage
system can process the required level
of application host I/O operations. It might be possible to improve the performance of the storage
system by adding more physical disks
to an array, changing the RAID level of the array, changing the cache settings of the storage
system, ensuring the cache battery
is operational, or changing other specific configuration parameters of the storage
system.
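For example, MDisks at the secondary system that are not fully online can be listed with a command similar to the following sketch (the status string shown is an assumption; adjust it for your code level):
lsmdisk -filtervalue status=degraded
Investigate and repair any MDisk that is reported as degraded or offline before you restart the Global Mirror relationships.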
- Are the storage systems at the
primary system overloaded?
Analyze the performance of the primary back-end storage by using the
same steps as for the secondary back-end storage. If performance is poor, limit the number of I/O
operations that can be completed by application hosts. Monitor the back-end storage at the primary
site even if the Global Mirror relationships were not
affected. If poor performance continues for a prolonged period, a 1920 error occurs and the Global
Mirror relationships are stopped.
- Is one of your systems overloaded?
Check the port-to-local-node send response time and the local-node send queue time.
If the total of these two statistics for either system is above 1 millisecond, the system might be
experiencing a high I/O load. Also, check the system node CPU utilization, as rates greater than
50% can also contribute to the problem. In either case, contact your IBM service representative for
further assistance.
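For example, recent per-node statistics, including CPU utilization, can be viewed from the CLI with a command similar to the following sketch (the cpu_pc statistic name is an assumption based on typical output; a SAN performance analysis tool is needed for the detailed port-to-local-node statistics):
lsnodestats
Look for sustained CPU utilization values greater than 50% on any node.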
- Do you have FlashCopy
operations in the prepared state at the secondary system?
If the Global Mirror auxiliary
volumes are the sources of a FlashCopy mapping and that mapping is in the prepared state for an extended time, performance
to those volumes can be impacted because the cache is disabled. Start the FlashCopy mapping to enable the cache and improve
performance for Global Mirror I/O operations.
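For example, assuming a prepared mapping named fcmap0 (a hypothetical name), it can be started with a command similar to the following sketch:
startfcmap fcmap0
Starting the mapping ends the prepared state, re-enables the cache on the source volumes, and allows Global Mirror write performance to recover.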