SQL Server Always On – Monitor The Redo_Queue_Size

Cause sometimes you need to check why replication is slow or halted.
For example, like during an IO storm – no hurricane pun intended. Say you have transnational replication over an Always On (SQL Server Availability Groups) environment that’s a publisher.

use Jarvis
go
create or alter proc uspAlertRedoQueueSize
as
set nocount on;

declare @minrds int = 5242880, @rds bigint, @dbname nvarchar(128), @node varchar(128), @msg nvarchar(max), @to varchar(max) = 'tony.stark@avengers.com'

select top 1 @rds = coalesce(max(drs.redo_queue_size), 0), @dbname = db_name(drs.database_id), @node = ar.replica_server_name
from sys.dm_hadr_availability_replica_states ars
left join sys.availability_replicas ar on ar.replica_id = ars.replica_id
left join sys.availability_groups ag on ag.group_id = ars.group_id
left join sys.dm_hadr_database_replica_states drs on drs.replica_id = ars.replica_id
where ars.role_desc = 'secondary'
and drs.redo_queue_size > @minrds
group by db_name(drs.database_id), ar.replica_server_name
order by coalesce(max(drs.redo_queue_size), 0) desc;

if @rds >= 10485760
begin
	set @msg = null;
	select @msg = @node + '.' + @dbname + ': ' + convert(varchar(20), @rds);
	exec msdb.dbo.sp_send_dbmail @recipients = @to
	,@subject = 'Crit: Max redo_queue_size >= 10GB'
	,@body = @msg
	,@importance = 'high'
end

if @rds between @minrds and 10485760
begin
	select @msg = @node + '.' + @dbname + ': ' + convert(varchar(20), @rds);
	exec msdb.dbo.sp_send_dbmail @recipients = @to
	,@subject = 'Warn: Max redo_queue_size > 5GB'
	,@body = @msg
end
go

To see in the Dashboard just add the column, this is what it looks like when there’s a problem. If you’re wonder that number was 284.86GBs – yup, it was a very busy day.

AlwaysOnDashboardRedoQueueSize

Awesome PowerShell core calculationPowerShellCalc

So to automate things, add the alert monitor as a step to a recurring SQL Agent Job to all replicas where it runs only if the node is Primary and you’ll get a nice email.

if exists (select 1
	from  sys.dm_hadr_availability_replica_states as ars
	inner join sys.availability_group_listeners as agl on ars.group_id = agl.group_id
	inner join sys.availability_replicas as arcn on arcn.replica_id = ars.replica_id
	where ars.role_desc = 'PRIMARY'
	and ars.operational_state_desc = 'ONLINE'
	and agl.dns_name = 'STARK-LG'
	and arcn.replica_server_name = @@SERVERNAME)

begin
	--job step here:
	exec Jarvis.dbo.uspAlertRedoQueueSize
end

else begin
	print 'Server is not Primary for LG.'
end

P.S. Go get ’em tiger.

🙂 – emoji
Hiram

One thought on “SQL Server Always On – Monitor The Redo_Queue_Size

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.