The gmlinktolerance feature monitors the response times
for Global Mirror relationships in noncycling mode. You can use the chsystem CLI command or the management GUI to set the gmlinktolerance feature. The gmlinktolerance value
represents the number of seconds that the primary Lenovo Storage V7000 clustered system tolerates slow response times from the secondary system.
If the poor response extends past the specified tolerance,
a 1920 error is logged; also, one or more Global Mirror relationships
are automatically stopped to protect the application hosts at the
primary site. During normal operation, application hosts see a minimal
impact to response times because the Global Mirror feature uses asynchronous
replication. However, if
Global Mirror operations experience degraded response times from the
secondary system for an extended time, I/O operations begin to queue
at the primary system. This results in an extended response time to
application hosts. In this situation, the gmlinktolerance feature
stops Global Mirror relationships, and the application hosts' response
time returns to normal. After a 1920 error occurs, the Global Mirror
auxiliary
volumes are no longer
in the consistent_synchronized state until you fix the cause of the
error and restart your Global Mirror relationships. For this reason,
ensure that you monitor the system to track when this occurs.
You
can disable the gmlinktolerance feature by setting the gmlinktolerance
value to 0 (zero). However, the gmlinktolerance feature cannot protect applications
from extended response times if it is disabled. It might be appropriate
to disable the gmlinktolerance feature in the following circumstances:
- During SAN maintenance windows, where degraded performance is
expected from SAN components and application hosts can withstand extended
response times from Global Mirror volumes.
- During periods when application hosts can tolerate extended response
times, where it is expected that the gmlinktolerance feature might
stop the Global Mirror relationships. For example, if you are testing
by using an I/O generator that is configured to stress the backend
storage, the gmlinktolerance feature might detect the high latency
and stop the Global Mirror relationships. Disabling gmlinktolerance
prevents this at the risk of exposing the test host to extended response
times.
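As a minimal sketch of the disable-and-restore pattern described above, the following Python helper builds the chsystem argument list for a given tolerance. The flag spelling `-gmlinktolerance` is assumed to match the parameter name used in this topic, the helper name `gmlinktolerance_command` is hypothetical, and 300 is only an illustrative value to restore; confirm the valid values in the CLI reference for your code level.

```python
from typing import List


def gmlinktolerance_command(seconds: int) -> List[str]:
    """Build the chsystem arguments that set gmlinktolerance.

    A value of 0 disables the feature. Any other range limits that the
    system enforces are not checked here.
    """
    if seconds < 0:
        raise ValueError("gmlinktolerance must be 0 or a positive number of seconds")
    # Assumption: the CLI flag has the same name as the feature.
    return ["chsystem", "-gmlinktolerance", str(seconds)]


# Disable the feature before a maintenance window, then restore it afterward.
disable_cmd = gmlinktolerance_command(0)
restore_cmd = gmlinktolerance_command(300)  # illustrative value only

print(" ".join(disable_cmd))
print(" ".join(restore_cmd))
```

Recording the restore command alongside the disable command makes it less likely that the feature is left disabled after the maintenance window ends.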
Diagnosing and fixing 1920 errors
A 1920 error indicates
that one or more of the SAN components are unable to provide the performance
that is required by the application hosts. This can be temporary (for
example, a result of maintenance activity) or permanent (for example,
a result of a hardware failure or unexpected host I/O workload).
If the 1920 error was preceded by
informational event 985004, "Maximum replication delay has
been exceeded", the system might have been unable to find a path to the disk
in the remote system within the maximum replication delay timeout
value. Investigate the remote system to find and repair any degraded
paths. You can also use the lssystem command to
view the maxreplicationdelay value. If the value
is too low, use the chsystem command to specify
a new maxreplicationdelay value.
If you
are experiencing other 1920 errors, set up a SAN performance analysis
tool, such as
Spectrum Control,
and make sure that it is correctly configured and monitoring statistics
when the problem occurs. Set your SAN performance analysis tool to
the minimum available statistics collection interval. For
Spectrum Control,
the minimum interval is 5 minutes. If several 1920 errors occur, diagnose
the cause of the earliest error first. The following questions can
help you determine the cause of the error:
- Was maintenance occurring at the time of the error?
This might
include replacing a storage system physical disk, updating the firmware of the storage system, or completing a code update on one of the Lenovo Storage V7000 systems. Before you restart the Global Mirror relationships in noncycling mode, you must wait until
the maintenance procedure is complete. Otherwise, you will receive
another 1920 error because the system has not yet returned to a stable
state with good performance.
- Were there any unfixed errors on either the source or target system?
If yes, analyze them to determine whether they are the reason for
the error. In particular, see whether they relate to the volume or MDisks that are being used in
the relationship, or whether they would have reduced the performance
of the target system. Ensure that the error is fixed before you restart
the Global Mirror relationship.
- Is the long-distance link overloaded?
If your link is not capable
of sustaining the short-term peak Global Mirror workload, a 1920 error
can occur. Complete the following checks to determine if the long-distance
link is overloaded:
- Look at the total Global Mirror auxiliary volume write throughput before the Global Mirror relationships
were stopped. If this throughput is approximately equal to your link bandwidth,
your link might be overloaded. This might be due to application host
I/O operations or a combination of host I/O and background (synchronization)
copy activities.
- Look at the total Global Mirror source volume write throughput before the Global Mirror relationships
were stopped. This value represents the I/O operations that are being
completed by the application hosts. If these operations are approaching
the link's bandwidth,
reduce the I/O operations that the application is attempting to complete,
or use Global Mirror to copy fewer volumes. If the auxiliary volumes show significantly more write I/O operations
than the source volumes, there
is a high level of background copy. Decrease the Global Mirror partnership's
background copy rate parameter to bring the total application I/O
bandwidth and background copy rate within the link's capabilities.
- Look at the total Global Mirror source volume write throughput after the Global Mirror relationships
were stopped. If write throughput increases by 30% or more when the
relationships are stopped, the application hosts are attempting to
complete more I/O operations than the link can sustain. While the
Global Mirror relationships are active, the overloaded link causes
higher response times to the application host, which decreases the
throughput it can achieve. After the Global Mirror relationships have
stopped, the application host sees lower response times. In this case,
the link bandwidth must be increased, the application host I/O rate
must be decreased, or fewer volumes must be copied using Global Mirror.
- Are the storage systems at the secondary system overloaded?
If application I/O operations cannot proceed at the rate
needed by the application host because one or more MDisks are providing
poor service to the system, a 1920 error will occur.
If the back-end
storage system requirements were followed, the error might have been
caused by a decrease in
storage system performance.
Check the back-end
write response time for each MDisk at the secondary system. A response time for an individual MDisk that suddenly
increases by 50 ms or more, or a response time above 100 ms, indicates
a problem. Complete the following checks to determine if the
storage systems are overloaded:
- Check the storage system for error conditions such as media errors, a failed physical
disk, or associated activity such as RAID rebuilding. If there is
an error, fix the problem and then restart the Global Mirror relationships.
- If there is no error, determine if the secondary storage system can process the required level of application host I/O
operations. It might be possible to improve the performance of the storage system by adding more physical disks to an array, changing the
RAID level of the array, changing the cache settings of the storage system, checking the cache battery to ensure it is operational,
or changing other specific configuration parameters of the storage system.
- Are the storage systems at the primary system overloaded?
Analyze the performance
of the primary back-end storage by using the same steps as for the
secondary back-end storage. If performance is poor, limit the number
of I/O operations that can be completed by application hosts. Monitor
the back-end storage at the primary site even if the Global Mirror relationships were not affected. If poor performance continues
for a prolonged period, a 1920 error occurs and the Global Mirror
relationships are stopped.
- Is one of your Lenovo Storage V7000 systems
overloaded?
Check the port-to-local-node
send response time and the local-node send queue time. If the total of these two statistics for either system is above 1
millisecond, the system might be experiencing a high I/O load. Also, check the Lenovo Storage V7000 node
CPU utilization. If this figure is above 50%, it might also be contributing
to the problem. In either case, contact your Lenovo service representative for further assistance.
- Do you have FlashCopy® operations in the prepared state at the secondary system?
If the Global Mirror auxiliary volumes are the sources of a FlashCopy mapping and that mapping is in the
prepared state for an extended time, performance to those volumes can be impacted because the cache
is disabled. Start the FlashCopy mapping to enable the cache and improve
performance for Global Mirror I/O operations.
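The numeric checks in the questions above can be sketched as a few Python helpers to run against statistics exported from a SAN performance analysis tool. This is a minimal sketch and not a supported diagnostic: the function names are hypothetical, and the `near` (90%) and `much_more` (120%) factors are assumptions that stand in for "approximately equal" and "significantly more". The 30%, 50 ms, 100 ms, 1 millisecond, and 50% thresholds come from the text.

```python
from typing import List


def link_findings(aux_write: float, src_write_before: float,
                  src_write_after: float, link_bandwidth: float,
                  near: float = 0.9, much_more: float = 1.2) -> List[str]:
    """Apply the long-distance link checks; all arguments use the same unit."""
    findings = []
    if aux_write >= near * link_bandwidth:
        findings.append("auxiliary write throughput is close to the link bandwidth")
    if src_write_before >= near * link_bandwidth:
        findings.append("host write throughput is approaching the link bandwidth")
    if aux_write >= much_more * src_write_before:
        findings.append("background copy level is high")
    if src_write_before > 0 and src_write_after >= 1.3 * src_write_before:
        findings.append("host writes rose 30% or more after the relationships stopped")
    return findings


def mdisk_suspect(previous_ms: float, current_ms: float) -> bool:
    """True if the response time jumped by 50 ms or more, or is above 100 ms."""
    return (current_ms - previous_ms) >= 50 or current_ms > 100


def node_overloaded(send_response_ms: float, send_queue_ms: float,
                    cpu_percent: float) -> bool:
    """True if the two send statistics total more than 1 ms or CPU is above 50%."""
    return (send_response_ms + send_queue_ms) > 1.0 or cpu_percent > 50


for finding in link_findings(aux_write=95, src_write_before=60,
                             src_write_after=85, link_bandwidth=100):
    print(finding)
print(mdisk_suspect(previous_ms=20, current_ms=75))
print(node_overloaded(send_response_ms=0.4, send_queue_ms=0.3, cpu_percent=30))
```

Each helper only flags a condition for investigation. The corrective actions (reducing host I/O, lowering the background copy rate, repairing the storage system, or contacting your service representative) remain as described in the corresponding question.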