Cause sometimes you need to check why replication is slow or halted.
For example, like during an IO storm – no hurricane pun intended. Say you have transnational replication over an Always On (SQL Server Availability Groups) environment that’s a publisher.
use Jarvis go create or alter proc uspAlertRedoQueueSize as set nocount on; declare @minrds int = 5242880, @rds bigint, @dbname nvarchar(128), @node varchar(128), @msg nvarchar(max), @to varchar(max) = 'tony.stark@avengers.com' select top 1 @rds = coalesce(max(drs.redo_queue_size), 0), @dbname = db_name(drs.database_id), @node = ar.replica_server_name from sys.dm_hadr_availability_replica_states ars left join sys.availability_replicas ar on ar.replica_id = ars.replica_id left join sys.availability_groups ag on ag.group_id = ars.group_id left join sys.dm_hadr_database_replica_states drs on drs.replica_id = ars.replica_id where ars.role_desc = 'secondary' and drs.redo_queue_size > @minrds group by db_name(drs.database_id), ar.replica_server_name order by coalesce(max(drs.redo_queue_size), 0) desc; if @rds >= 10485760 begin set @msg = null; select @msg = @node + '.' + @dbname + ': ' + convert(varchar(20), @rds); exec msdb.dbo.sp_send_dbmail @recipients = @to ,@subject = 'Crit: Max redo_queue_size >= 10GB' ,@body = @msg ,@importance = 'high' end if @rds between @minrds and 10485760 begin select @msg = @node + '.' + @dbname + ': ' + convert(varchar(20), @rds); exec msdb.dbo.sp_send_dbmail @recipients = @to ,@subject = 'Warn: Max redo_queue_size > 5GB' ,@body = @msg end go
To see in the Dashboard just add the column, this is what it looks like when there’s a problem. If you’re wonder that number was 284.86GBs – yup, it was a very busy day.
Awesome PowerShell core calculation
So to automate things, add the alert monitor as a step to a recurring SQL Agent Job to all replicas where it runs only if the node is Primary and you’ll get a nice email.
if exists (select 1 from sys.dm_hadr_availability_replica_states as ars inner join sys.availability_group_listeners as agl on ars.group_id = agl.group_id inner join sys.availability_replicas as arcn on arcn.replica_id = ars.replica_id where ars.role_desc = 'PRIMARY' and ars.operational_state_desc = 'ONLINE' and agl.dns_name = 'STARK-LG' and arcn.replica_server_name = @@SERVERNAME) begin --job step here: exec Jarvis.dbo.uspAlertRedoQueueSize end else begin print 'Server is not Primary for LG.' end
P.S. Go get ’em tiger.
🙂 – emoji
Hiram
[…] Source: SQL Server Always On – Monitor The Redo_Queue_Size […]
LikeLike