This article describes the database-level health detection failover option for Always On availability groups.
In addition to the existing checks, the new implementation performs the following additional checks.
- The new implementation stores and uses a historical snapshot of the database state information to decide whether the AG needs to be marked in an error state. The health-check routine caches the database state and associated error information for the last three executions, and compares them with the state information from the current execution of the health detection routine. If the same error condition (for the error codes listed below) exists in four consecutive runs of the health detection routine, a failover is initiated. For example, if error 823 is raised during a run at 22:00:00 and the same error condition still exists during the subsequent runs at 22:00:10, 22:00:20, and 22:00:30, the AG is marked in an error state and control is passed to the cluster. This design provides a safeguard against transient errors and issues that can be fixed by the automatic page repair capability of availability groups.
- The new implementation checks for the following additional errors. The majority of these errors are indicative of a hardware issue on the server. Note that this is not an exhaustive list of errors that could impact database availability. There is an outstanding item to add error 824 to this list.
| Error | Cause |
| --- | --- |
| 605 | Page or allocation corruption. |
| 823 | Checkpoint failures. |
| 829 | Disk corruption. |
| 832 | Hardware or memory corruption. |
| 1101 | No disk space available in a filegroup. |
| 1105 | No disk space available in a filegroup. |
| 5102 | Missing filegroup ID requests. |
| 5180 | Wrong file ID requests. |
| 5515 | |
| 5534 | Log corruption due to a FILESTREAM operation log record. |
| 5535 | FILESTREAM data container corruption. |
| 9004 | Log corruption. |
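The cached-state comparison described above can be sketched in Python. This is a simplified model of the behavior the article describes, not SQL Server's actual implementation; the `HealthDetector` class and its names are invented for illustration:

```python
from collections import deque
from typing import Optional

# Error codes treated as failover-worthy (from the table above).
CRITICAL_ERRORS = {605, 823, 829, 832, 1101, 1105, 5102, 5180,
                   5515, 5534, 5535, 9004}

class HealthDetector:
    """Caches the last three health-check results and compares them with
    the current run: only four consecutive runs reporting the same critical
    error mark the AG as failed, which guards against transient errors."""

    def __init__(self, history_size: int = 3):
        # Error codes observed in the last three executions (None = healthy).
        self.history = deque(maxlen=history_size)

    def check(self, current_error: Optional[int]) -> bool:
        """Run one health-check cycle; return True if the AG should be
        marked in an error state and control passed to the cluster."""
        trigger = (
            current_error in CRITICAL_ERRORS
            and len(self.history) == self.history.maxlen
            and all(prev == current_error for prev in self.history)
        )
        self.history.append(current_error)
        return trigger
```

With 10-second check intervals, four consecutive `check(823)` calls span the ~30 seconds in the example above; a single healthy run in between resets the streak, so a transient 823 never triggers a failover.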
If we enable this feature (the DB_FAILOVER option of the availability group), we also need to change the default cluster failover policy to make sure the AG can fail over successfully.
The defaults are:
- “Maximum restarts in the specified period” = 1 (in 1 hour)
- “Maximum failures in the specified period” = 1 (in 6 hours)
With these settings, if error 823 is reported and cannot be repaired from the secondary replica:
- Error 823 is detected across consecutive health-check runs (about 30 seconds) and the AG is marked in an error state.
- WSFC receives the error state from the AG controller and restarts the AG resource. A database-level error does not prevent the AG resource restart, so the restart itself always succeeds.
- The failure count and the restart count each increase by 1.
- If error 823 is detected again, the AG goes offline. No failover occurs, because the failure count has already reached “maximum failures in the specified period” = 1, so no further action is taken.
Recommended setting: “maximum failures in the specified period” >= “maximum restarts in the specified period” + 1. That way, once all restart attempts have finished and the issue is still detected, the next detection triggers a failover.
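The interaction between the two counters can be illustrated with a small Python model. This is my own simplification of the WSFC behavior described above, not a cluster API; the function name and its decision rules are illustrative only:

```python
def cluster_response(persistent_error_runs: int,
                     max_restarts: int,
                     max_failures: int) -> str:
    """Toy model of how the cluster reacts while a persistent database
    error keeps re-triggering the health check. Each trigger counts as
    one failure: failures within the restart budget restart the resource
    in place, failures within the failure budget cause a failover, and
    anything beyond that leaves the resource offline."""
    action = "healthy"
    failures = 0
    for _ in range(persistent_error_runs):
        failures += 1
        if failures <= max_restarts:
            action = "restart"   # resource restarted on the same node
        elif failures <= max_failures:
            action = "failover"  # restart budget spent -> move the group
        else:
            action = "offline"   # failure budget spent -> stays offline
    return action
```

With the defaults (1 restart, 1 failure), the second trigger exceeds the failure budget and the AG stays offline; raising the failure budget to restarts + 1 turns that second trigger into a failover, which is the recommendation above.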