AlwaysOn
What is SQL Server AlwaysOn?
Answer –AlwaysOn is a term Microsoft has
used since SQL Server 2012 for high availability and disaster recovery
solutions. As of now, two features fall under the umbrella of AlwaysOn. These two features support high availability and
disaster recovery for SQL Server databases:
SQL Server AlwaysOn FCIs are SQL
Server clustered instances whereas AGs are the new features introduced in SQL
Server 2012 to support data high availability and disaster recovery. We can
group each set of databases into one unit and execute a failover at one time
with the help of the Availability Group.
What is the difference
between AlwaysOn Failover Cluster Instances and AlwaysOn Availability Groups (AOAG)?
Answer – Please have a look
at the main differences between both AlwaysOn
solutions:
Can you explain
Availability Group Listeners?
Answer - The Availability
Group Listener is a virtual network name that we use to make connections to the
databases whether it is running from a primary replica or secondary replica after
failover.
Can we configure
Availability Groups without configuring Availability Group Listeners?
Answer- Yes, we can
configure an Availability Group without configuring listeners. Listeners are
configured to make databases connections irrespective of their replica status.
When I tried to create a
new Availability Group in SQL Server Management Studio, I saw that "New
Availability Group Wizard" is grayed out and I am not able to proceed
further to configure Availability Group. What could be the reason for this?
Answer – The "New Availability Group
Wizard" option is disabled until you enable the Availability Group feature
from the SQL Server service property.
Answer – We can easily
add new databases to an existing Availability Group. First, we need to prepare
the secondary database by taking the full backup and subsequent transaction log
backup then restore it on the secondary replicas in no recovery mode. Then we
can right click on Availability Group name to launch the Add Database wizard.
We should follow all required steps to proceed with this wizard. Once
completed, your new database will be added to the identified Availability
Group.
Answer – Until SQL
Server 2014, the AlwaysOn Availability Group will not
initiate a failover process if anything goes wrong at the database level.
Microsoft introduced an option named Enhanced Database Failover in SQL Server 2016 to
trigger the failover in case any database participating in an Availability
Group loses the ability to write transactions. We also call it Database
Level Health Detection in an Availability Group. By default,
this option is not enabled. You need to configure it if you want to initiate a
failover if anything goes wrong at the database level.
Answer – No, we cannot
configure AlwaysOn Availability Group between
different Windows server failover cluster groups. All replicas must be part of
same Windows server failover group. This is a basic prerequisite for AOAG.
What are the differences
between a SQL Server Availability Group and Replication?
Answer - Below are the main
differences between an Availability Group and Replication.
Answer - Yes,
we can take a SQL Server Availability Group offline by running the ALTER statement
as shown below:
ALTER AVAILABILITY GROUP group_name OFFLINE
Answer – Automatic
Seeding is a term that is used for automatically initialization of Availability
Groups. This feature was introduced in SQL Server 2016. When you create an
Availability Group with automatic seeding, SQL Server automatically creates the
secondary replicas for every database in the group. You no longer have to manually backup and restore the secondary replicas.
How many secondary
replicas can we configure in a SQL Server AlwaysOn
Availability Group?
Answer – We can configure
eight secondary replicas for any Availability Group.
Can we add system
databases to Availability Group?
Answer – No, we cannot
configure SQL Server Availability Groups for system databases.
Can we configure
automatic failover for a SQL Server Availability Group with the asynchronous
mode of data transfer?
Answer – No, we must use synchronous
commit data transfer to configure SQL Server Availability Groups with automatic
failover.
Can we change failover
modes for SQL Server Availability Group replicas? If yes, how?
Answer – Yes, we can change
failover modes of a SQL Server Availability Group with these steps:
Can you explain how many
types of Availability Groups there are in SQL Server?
Answer – There are a few
variants of Availability Groups in SQL Server.
Answer – No, SQL Server
Agent Jobs and Logins will not be replicated automatically. We need to manually
replicate them to secondary replicas.
Answer - If we drop and
re-create the Windows Server Failover Cluster (WSFC), we must disable and
re-enable the AlwaysOn Availability Groups feature on
each instance of SQL Server that hosted an availability replica on the original
WSFC cluster.
Answer - By default SQL
Server compresses data where appropriate while replicating it to secondary
replica with SQL Server AlwaysOn. But this is not
always true. It depends on the failover mode or type of operation that we
choose to perform in AlwaysOn. The table below shows
when SQL Server uses compression for Availability Group log streams:
Setting |
Compression |
Asynchronous-commit
replica |
Compressed |
Synchronous- commit
replicas |
Not Compressed |
During automatic
seeding |
Not compressed |
Q. We have got an alert
“WSFC cluster service is offline”. What is your action plan?
Ans:
This alert is raised when the
WSFC cluster is offline or in the forced quorum state. All availability groups
hosted within this cluster are offline (a disaster recovery action is
required).
Possible Reasons:
This issue can be caused by a
cluster service issue or by the loss of the quorum in the cluster.
Possible Solutions:
Use the Cluster Administrator
tool to perform the forced quorum or disaster recovery workflow. Once WFSC is
started you must re-evaluate and reconfigure NodeWeight
values to correctly construct a new quorum before bringing other nodes back
online. Otherwise, the cluster may go back offline again.
Reestablishment may require if
there are any High Availability features (Alwayson
Availability Groups, Log Shipping, Database Mirroring) using on effected nodes.
Q. How to force a WSFC
(Windows Server Failover Cluster) Cluster to start without a quorum?
Ans:
This can be done using
§
Failover Cluster Manager
§
Net.exe
§
PowerShell
Here we’ll see how this can be
done using FCM.
Failover Cluster Manager
§
Open a Failover Cluster Manager and connect to the desired
cluster node to force online.
§
In the Actions pane, click Force Cluster Start, and then click
Yes – Force my cluster to start.
§
In the left pane, in the Failover Cluster Manager tree, click
the cluster name.
§
In the summary pane, confirm that the current Quorum
Configuration value is: Warning: Cluster is running in ForceQuorum
state.
Q. We have got an alert
“Availability group is offline”. Can you explain about this warning and your
action plan?
Ans:
This alert is raised when the
cluster resource of the availability group is offline
or the availability group does not have a primary replica.
Possible Reasons:
§
The availability group is not configured with automatic failover
mode. The primary replica becomes unavailable and the role of all replicas in
the availability group become RESOLVING.
§
The availability group is configured with automatic failover
mode and does not complete successfully.
§
The availability group resource in the cluster becomes offline.
§
There is an automatic, manual, or forced failover in progress
for the availability group.
Possible Solutions:
§
If the SQL Server instance of the primary replica is down,
restart the server and then verify that the availability group recovers to a
healthy state.
§
If the automatic failover appears to have failed, verify that
the databases on the replica are synchronized with the previously known primary
replica, and then failover to the primary replica. If the databases are not
synchronized, select a replica with a minimum loss of data, and then recover to
failover mode.
§
If the resource in the cluster is offline while the instances of
SQL Server appear to be healthy, use Failover Cluster Manager to check the
cluster health or other cluster issues on the server. You can also use the
Failover Cluster Manager to attempt to turn the availability group resource
online.
§
If there is a failover in progress, wait for the failover to
complete.
Q. We have got an alert
“Availability group is not ready for automatic failover”. Can you explain about
this warning and your action plan?
Ans:
This alert is raised when the
failover mode of the primary replica is automatic; however
none of the secondary replicas in the availability group are failover ready.
Possible Reasons:
The primary replica is
configured for automatic failover; however, the secondary replica is not ready
for automatic failover as it might be unavailable or
its data synchronization state is currently not SYNCHRONIZED.
Possible Solutions:
§
Verify that at least one secondary replica is configured as
automatic failover. If there is not a secondary replica configured as automatic
failover, update the configuration of a secondary replica to be the automatic
failover target with synchronous commit.
§
Use the policy to verify that the data is in a synchronization
state and the automatic failover target is SYNCHRONIZED, and then resolve the
issue at the availability replica.
Q. In your environment
data inserted on Primary replica but not able to see that on secondary replica.
When you check that Availability is in healthy state and in most cases data
reflects in a few minutes but in this case it’s didn’t
happen. Now you need to check for the bottleneck and fix the issue. Can you
explain your views and workaround in this situation?
Ans:
Possible Reasons:
§
Long-Running Active Transactions
§
High Network Latency or Low Network Throughput Causes Log
Build-up on the Primary Replica
§
Another Reporting Workload Blocks the Redo Thread from Running
§
Redo Thread Falls behind Due to
Resource Contention
Possible Workaround:
§
Use DBCC OPENTRAN and check if there are any oldest transactions
running on primary replica and see if they can be rolled back.
§
A high DMV (sys.dm_hadr_database_replica_states)
value log_send_queue_size can indicate logs being
held back at the primary replica. Dividing this value by log_send_rate
can give you a rough estimate on how soon data can be caught up on the
secondary replica.
§
Check two performance objects SQL Server:Availability Replica > Flow Control Time (ms/sec) and SQL Server:Availability
Replica > Flow control/sec. Multiplying these two values shows you in the
last second how much time was spent waiting for flow control to clear. The
longer the flow control wait time, the lower the send rate.
§
When the redo thread is blocked, an extended event called sqlserver.lock_redo_blocked is
generated. Additionally, you can query the DMV sys.dm_exec_request
on the secondary replica to find out which session is blocking the REDO thread,
and then you can take corrective action. You can let the reporting workload to
finish, at which point the redo thread is unblocked. You can unblock the redo
thread immediately by executing the KILL command on the blocking session ID.
The following query returns the session ID of the reporting workload that is
blocking the redo thread.
Q. You perform a forced
manual failover on an availability group to an asynchronous-commit secondary
replica, you find that data loss is more than your recovery point objective
(RPO). Or, when you calculate the potential data loss of an asynchronous-commit
secondary replica using the method in Monitor Performance for AlwaysOn Availability Groups, you find that it exceeds your
RPO. What are the possible reasons that causes data loss is more than your
recovery point objective?
Ans:
There are mainly two
reasons:
§
High Network Latency or Low Network Throughput Causes Log
Build-up on the Primary Replica. The primary replica activates
flow control on the log send when it has exceeded the maximum allowable number
of unacknowledged messages sent over to the secondary replica. Until some of
these messages have been acknowledged, no more log blocks can be sent to the
secondary replica. Since data loss can be prevented only when they have been
hardened on the secondary replica, the build-up of unsent log messages
increases potential data loss.
§
Disk I/O Bottleneck Slows Down Log Hardening on the Secondary
Replica. If
the log file and the data file are both mapped to the same hard disk, reporting
workload with intensive reads on the data file will consume the same I/O
resources needed by the log hardening operation. Slow log hardening can
translate to slow acknowledgement to the primary replica, which can cause
excessive activation of the flow control and long flow control wait times.
Q. Let’s say you have
configured Automatic failover on SQL server 2012 AlwaysOn
environment. An automatic failover triggered but unsuccessful in making
secondary replica as PRIMARY. How do you identify that failover is not
successful and what are the possible reasons that causes an unsuccessful
failover?
Ans:
If an automatic failover event
is not successful, the secondary replica does not successfully transition to
the primary role. Therefore, the availability replica will report that this
replica is in Resolving status. Additionally, the availability databases report
that they are in Not Synchronizing status, and applications cannot access these
databases.
Possible Reasons for
Unsuccessful Failover:
§
“Maximum Failures in the Specified Period” value is exhausted: The availability group
has Windows cluster resource properties, such as the Maximum Failures in the
Specified Period property. This property is used to avoid the indefinite
movement of a clustered resource when multiple node failures occur.
§
Insufficient NT Authority\SYSTEM account permissions: The SQL Server
Database Engine resource DLL connects to the instance of SQL Server that is
hosting the primary replica by using ODBC in order to
monitor health. The logon credentials that are used for this connection are the
local SQL Server NT AUTHORITY\SYSTEM login account. By default, this local
login account is granted the following permissions: 1.Alter
Any Availability Group, 2.Connect SQL, 3.View server state. If the NT
AUTHORITY\SYSTEM login account lacks any of these permissions on the automatic
failover partner (the secondary replica), then SQL Server cannot start health
detection when an automatic failover occurs. Therefore, the secondary replica
cannot transition to the primary role. To investigate and diagnose whether this
is the cause, review the Windows cluster log.
§
The availability databases are not in a SYNCHRONIZED state: In
order to automatically fail over, all availability databases that are
defined in the availability group must be in a SYNCHRONIZED state between the
primary replica and the secondary replica. When an automatic failover occurs,
this synchronization condition must be met in order to
make sure that there is no data loss. Therefore, if one availability database
in the availability group in the synchronizing or not synchronized state,
automatic failover will not successfully transition the secondary replica into
the primary role.
Q. Have you ever seen
the Error 41009?
Ans:
Yes! This error might occur when
you try to create multiple availability groups in a SQL Server 2012 AlwaysOn failover clustering environment. This issue can be
resolved by applying Cumulative Update Package 2.
Q. Let’s say you added a
new file to a database which is a part of AlwaysOn
Availability Groups. The add file operation succeeded on primary replica but
failed in secondary replica. What is the impact and how you troubleshoot?
Ans:
This might happens
due to a different file path between the systems that hosts primary and
secondary replica. Failed add-file operation will cause the secondary database
to be suspended. This, in turn, causes the secondary replica to enter the NOT
SYNCHRONIZING state.
Resolution:
§
Remove the secondary database from the availability group.
§
On the existing secondary database, restore a full backup of the
filegroup that contains the added file to the secondary database, using WITH
NORECOVERY and WITH MOVE (Specify the correct file path as per secondary).
§
Back up the transaction log that contains the add-file operation
on the primary database, and manually restore the log backup on the secondary
database using WITH NORECOVERY and WITH MOVE. Restore the last transaction log
file with NO RECOVERY.
§
Rejoin the secondary database to the availability group.
Q. Data synchronization
state for one of the availability database is not
healthy. Can you tell me the possible reasons?
Ans:
If this is an
asynchronous-commit availability replica, all availability databases should be
in the SYNCHRONIZING state. If this is a synchronous-commit availability
replica, all availability databases should be in the SYNCHRONIZED state. This
issue can be caused by the following:
§
The availability replica might be disconnected.
§
The data movement might be suspended.
§
The database might not be accessible.
§
There might be a temporary delay issue due to network latency or
the load on the primary or secondary replica.
Q. Let’s say we have a
premium production server and it is in AlwaysOn
Availability Group. You oberve that CPU utilization
is hitting top at a specific time in a day. You did an RCA and found that CPU
utilization reaches top and most CPU is from backup process due to backup
compression is on. Now what do you suggest? Do we have any features for backup
Ans:
Yes! There is an option to
perform backup from secondary replicas. We can set this from Availability Group
properties we can find “Backup Preferences” and from that we can choose one of
the option from:
Preferred Secondary: Backups
performed on Secondary if there is no secondary configured performed from
primary
Secondary Only: Backups should
be done from secondary only
Primary: Must occur on Primary
Replica
Any Replica: Can occur from any
replica in Availability Group
Q.Is there any specific
limitations if we need to perform auto backups from secondary backups?
Ans:
Yes! There are few:
§
Only Copy_Only backup allowd from secondary replica
§
Differential backups not allowed from secondary replica.
§
Log backups can be performed from different secondary replicas
but all these backups maintains a single log chain
(LSN sequence). It might help in some of the situations
Q. Have you ever applied
patches / CU / service packs on Alwayson Availability
Groups? Did you face any issues while applying?
Ans:
Yes! I have applied CU and
service packs on SQL Server 2012 SP2 Cumulative Update 4
I had a bad experience with Alwayson AG:
After CU4 applied we saw that AlwaysOn vailiabilty Gropus are in Non- Synchronizing state.
After RCA we found that there
was a huge blocking between user sessions and a
unknown session, CHECKPOINT with command running as “DB_STARTUP”.
Through of the MSDN SITE we
found that Microsoft declared it’s a bug and the solution chosen as below:
§
We had to open an outage:
§
Disable Automatic Failover
§
Restart the SQL Server on Primary Replica
§
Re-enable automatic failover.
§
This worked and fixed the issue.
Q. Can you explain any
difficult issue you have faced recently on High Availability Groups?
Ans:
Sure! We are configuring AlwaysOn AG on SQL server 2014.
We have taken backup from
Primary replica and restored on secondary replica
When we are trying to add
secondary replica to availability group to our surprise sql
server got shut down and we found the error message:
(Error: 3449, Severity:
21, State: 1.
SQL Server must shut down in order to recover a database (database ID 1). The database
is either a user database that could not be shut down or a system database.
Restart SQL Server. If the database fails to recover after another startup,
repair or restore. SQL Trace was stopped due to server shutdown. Trace ID =
‘1’. This is an informational message only; no user action is required. )
Cause:
We did RCA and found the below.
§
Service broker is enabled at Primary Replica
§
We have taken a full backup from
Primary Replica
§
Restored on Secondary Replica where Service Broker is not
enabled
§
When we try to add secondary replica to AG, Service Broker is
enabled, the same GUID on availability database is detected which causes an silent error 9772:
§
“The Service Broker in database “<dbname>”
cannot be enabled because there is already an enabled Service Broker with the
same ID”.
§
This results into error 3449 and shut down the sql server unexpectedly.
Solution:
This has been fixed by applying
the CU1 on SQL Server 2014.
Q. Replica is in
“resolving” status? What does it mean?
Ans:
A replica is into “RESOLVING”
state when a auto failover
is not successful.
Additionally the availability
databases reports that they are in non-synchronizing state and not accessible.
Q. What are the top
reasons that cause an unsuccessful failover?
Ans:
§
Auto failovers in a specific period may crossed the value
“Maximum Failures in the Specified Period”
§
Insufficient NT Authority\SYSTEM account permissions
§
The availability databases are not in a SYNCHRONIZED state
§ Q. Suppose primary
database became in suspect mode. Will AG have failover to secondary replica?
§ Ans:
§ Issues at the database
level, such as a database becoming suspect due to the loss of a data file,
deletion of a database, or corruption of a transaction log, do not cause an
availability group to failover.
§ Q. How many types of
Data synchronization preference options are available in Always ON?
§ Ans:
§ There are three options-
Full, Join only, or Skip initial data synchronization.
Q. What are the benefits
of Always on feature?
Ans:
§
Utilizing database mirroring for the data transfer over TCP/IP
§
providing a combination of Synchronous and Asynchronous
mirroring
§
providing a logical grouping of similar databases via
Availability Groups
§
Creating up to four readable secondary replicas
§
Allowing backups to be undertaken on a secondary replica
§
Performing DBCC statements against a secondary replica
§
Employing Built-in Compression & Encryption
Q. What all types of DB
backups are possible on Secondary Replicas?
Ans:
§
BACKUP DATABASE supports only copy-only full
backups of databases, files, or filegroups when it is executed on secondary
replicas. Note that copy-only backups do not impact the log chain or clear the
differential bitmap.
§
Differential backups are not supported on secondary replicas.
Q. Can we take
Transaction log backups on the secondary replicas?
Ans:
Yes, we can take transaction log
backups on the secondary replicas without COPY_ONLY option.
Q. How many types of
Failover are supported by Always ON?
Ans:
Three forms of failover
exist—automatic, manual, and forced (with possible data loss). The form or
forms of failover supported by a given secondary replica depends on its
availability mode.