SQL Server 2016 AlwaysOn in Windows Server 2012R2 WFC – MultiSubnet Listener returns wrong IP address after a FlushDNS – Error Connection Timeout!

Issue: (Failover had not occurred, commands executed without pause).

ErrorAlwaysOnMultiSubnetTimeoutAnonymized

 

Fixed.

Key changes were RegisterAllProvidersIP=0, HostRecordTTL=300 (5 minutes instead of default 1200 (20 minutes)).

If using .Net Framework 4.5+, recommend adding to client connection strings: MutiSubnet=True; Ref SqlClient support for HA and DR.

Key steps were:

  1. Set RegisterAllProvidersIP=0 for the Cluster and LG.
  2. Set HostRecordTTL=300 (5 minutes instead of default 1200 (20 minutes)) for the Cluster and LG.
  3. Remove the dependency AG from the LG in the cluster.
  4. Restart the LG cluster resource.
  5. Force a DNS Register Update of the Cluster Resource.
  6. Then re-add the dependency AG to the LG in the cluster.
  7. Lastly, Verify the fix and settings.

This post helped me solve it. I didn’t need to run the T-SQL mentioned after the PS cmds because I only had 1 AG and LG.

If you want to verify this doesn’t happen to you. Run cmd.

ping YourListinerName
ipconfig /flushdns
ping YourListinerName
ipconfig /flushdns
ping YourListinerName

Enjoy,

Hiram

Full PowerShell Script:

#https://technet.microsoft.com/en-us/library/ee460984.aspx

Get-Cluster -Name "PCLUSQL09" |FT *SUBNET*
Get-Cluster -Name "PCLUSQL03" |FT *SUBNET*

Get-ClusterResource -Cluster "PCLUSQL09" | Get-ClusterParameter
Get-ClusterResource -Cluster "PCLUSQL03" | Get-ClusterParameter

Get-ClusterResource -Cluster "PCLUSQL09" -Name 'Cluster Name' | Get-ClusterParameter HostRecordTTL,RegisterAllProvidersIP
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Get-ClusterParameter HostRecordTTL,RegisterAllProvidersIP

Get-ClusterResource -Cluster "PCLUSQL03" -Name 'Cluster Name' | Get-ClusterParameter HostRecordTTL,RegisterAllProvidersIP
Get-ClusterResource -Cluster "PCLUSQL03" -Name 'PSQL03LG01' | Get-ClusterParameter HostRecordTTL,RegisterAllProvidersIP

#https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/create-or-configure-an-availability-group-listener-sql-server#HostRecordTTL
#Cluster PCLUSQL03 didnt have an issue. I just got the values to compare settings.
#Also, to compare I ran a Validate Cluster Report on both Clusters using the Failover Cluster Manager console.
#Needed to fix only Cluster PCLUSQL09 and Listener PSQL09LG01 TTLs and Registers for MultiSubnetFailover=True and Legacy clients.

#Update Cluster
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'Cluster Name' | Set-ClusterParameter RegisterAllProvidersIP 0
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'Cluster Name' | Set-ClusterParameter HostRecordTTL 300
#Update LG
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Set-ClusterParameter RegisterAllProvidersIP 0
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Set-ClusterParameter HostRecordTTL 300

#Restart Resources
Stop-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01'
Start-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01'
Start-Sleep -s 10
Stop-ClusterResource -Cluster "PCLUSQL09" -Name 'Cluster Name'
Start-ClusterResource -Cluster "PCLUSQL09" -Name 'Cluster Name'

####################### FIX #######################
#https://blogs.msdn.microsoft.com/alwaysonpro/2014/06/03/connection-timeouts-in-multi-subnet-availability-group/

Get-ClusterResource -Cluster "PCLUSQL09"

#4. Set HostRecordTTL and RegisterAllProvidersIP
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Set-ClusterParameter HostRecordTTL 300
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Set-ClusterParameter RegisterAllProvidersIP 0

#5. Temporarily remove dependency between the availability group resource and the listener name resource.
Get-ClusterResource -Cluster "PCLUSQL09" | Remove-ClusterResourceDependency -Resource 'PSQL09AG01' -Provider 'PSQL09AG01_PSQL09LG01'

#Restart Resources
Stop-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01'
Start-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01'

#To force updating DNS on Windows Server 2012 or 2012 R2:
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Update-ClusterNetworkNameResource

#7. Re-add the dependency of the AG resource on the Listener name resource.
Get-ClusterResource -Cluster "PCLUSQL09" |  Add-ClusterResourceDependency -Resource 'PSQL09AG01' -Provider 'PSQL09AG01_PSQL09LG01'

#8. VERIFY
Get-ClusterResource -Cluster "PCLUSQL09" | Get-ClusterResourceDependency -Name 'PSQL09AG01'
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Get-ClusterParameter HostRecordTTL,RegisterAllProvidersIP
Get-ClusterResource -Cluster "PCLUSQL09"

ping PSQL09LG01
ipconfig /flushdns
ping PSQL09LG01
ipconfig /flushdns
ping PSQL09LG01
Advertisements

2 thoughts on “SQL Server 2016 AlwaysOn in Windows Server 2012R2 WFC – MultiSubnet Listener returns wrong IP address after a FlushDNS – Error Connection Timeout!

  1. This is so pointless and I can not believe MS is telling this to customers. I discovered that you are able to get this working properly without this non sense TTL and register just a single IP. Let me explain:
    1. First of all obvious I don’t like this custom setups, long run my folks always need to be aware they can’t create a regular AG, they need parameters….long term will cause issues with new folks etc.
    2. Everything works properly if your Application client is on the same subnet with one of the SQL Multi-subnets nodes, doesn’t matter which one. So I am wondering how come MS is not capable of figure it out why as soon as I move my Application client to another subnet outside of the Nodes subnet it won’t work anymore.
    3. Challenge with lowerTTL let`s say 30 seconds. Good luck buddy. Your application client in Subnet A let`s say talks to DNS server A. Now your SQL Node B in Subnet B fail-over, updates the DNS server B in site B, no way in hell your Client A can now take the new IP of that record even with a TTL of 30 seconds, WHYYY ? Cause DNS Server B needs to replicate though AD replication most likely 15 minutes or 5 minutes whatever, so my point is .. Lower TTL for fail-over is just a story…

    George Popescu …

    Like

    1. Hi George, I don’t agree about it being pointless. I think it’s a good feature to have and I believe MS made improvements on it with Win Svr 2016 site-aware failover clusters.
      Regarding your #1. Don’t see it as a custom setup. For you, this may be custom, but for other environments, these parameters are requirements for their applications to be able to leverage the functionality across sites and they let them adapt the cluster config to their specific business needs or failover policies. New folks should be trained when onboarding about specific env configs. Assume nothing is setup as default.
      Regarding your #2. The application should be pointing to the LG. Try the MultiSubnetFailover=True on your connection string. More so, your network should allow traversing the required subnets if you want to independently failover different layers of your stack. However, I think you’ll experience latency or other performance issues depending on your env due to network reach. I recommend failover the stack together and that doesn’t have to be done automatically nor at the same time. A geo-failover typically means a bad case scenario, vs the typical case, would be failover locally meaning from a primary to an onsite secondary which would be on the same subnet for which a partial failover of the stack would suffice. In any case, MS did put out a newer version of the .Net framework and SqlClient to enable MultiSubnetFailover=True, but I think the issue you’re describing is related to the design or network of your specific env.
      Regading your #3. Why would ever set the TTL so low? lol, If that’s the case, don’t design it for multi-subnet failover. If your RPO/RTO window is so small then use sync replicas on the same subnet so the LG doesn’t need to change IPs during a failover and nothing gets updated in DNS. Anyways, Client A can use in its connection string the setting MultiSubnet=True and/or you can lower the AD replication setting down to 15 minutes like you mentioned (try cmds repadmin /showrepl, or repadmin /sync) but think about it…what if you simply need to do an svr reboot that takes less than 5 minutes, do you want to fail-over the whole stack across sites? I don’t think so.
      The point of all this is that it’s proposed more for a DR case scenario or 2nd-3rd tier failover in which case the organization might want some manual intervention or slight delay to avoid an automated attempt of resources failing back and forth across sites.
      Thx,
      Hiram

      Like

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s