SQL Server 2016 AlwaysOn in Windows Server 2012R2 WFC – MultiSubnet Listener returns wrong IP address after a FlushDNS – Error Connection Timeout!

Issue: (Failover had not occurred, commands executed without pause).

ErrorAlwaysOnMultiSubnetTimeoutAnonymized

 

Fixed.

Key changes were RegisterAllProvidersIP=0, HostRecordTTL=300 (5 minutes instead of default 1200 (20 minutes)).

If using .Net Framework 4.5+, recommend adding to client connection strings: MutiSubnet=True; Ref SqlClient support for HA and DR.

Key steps were:

  1. Set RegisterAllProvidersIP=0 for the Cluster and LG.
  2. Set HostRecordTTL=300 (5 minutes instead of default 1200 (20 minutes)) for the Cluster and LG.
  3. Remove the dependency AG from the LG in the cluster.
  4. Restart the LG cluster resource.
  5. Force a DNS Register Update of the Cluster Resource.
  6. Then re-add the dependency AG to the LG in the cluster.
  7. Lastly, Verify the fix and settings.

This post helped me solve it. I didn’t need to run the T-SQL mentioned after the PS cmds because I only had 1 AG and LG.

If you want to verify this doesn’t happen to you. Run cmd.

ping YourListinerName
ipconfig /flushdns
ping YourListinerName
ipconfig /flushdns
ping YourListinerName

Enjoy,

Hiram

Full PowerShell Script:

#https://technet.microsoft.com/en-us/library/ee460984.aspx

Get-Cluster -Name "PCLUSQL09" |FT *SUBNET*
Get-Cluster -Name "PCLUSQL03" |FT *SUBNET*

Get-ClusterResource -Cluster "PCLUSQL09" | Get-ClusterParameter
Get-ClusterResource -Cluster "PCLUSQL03" | Get-ClusterParameter

Get-ClusterResource -Cluster "PCLUSQL09" -Name 'Cluster Name' | Get-ClusterParameter HostRecordTTL,RegisterAllProvidersIP
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Get-ClusterParameter HostRecordTTL,RegisterAllProvidersIP

Get-ClusterResource -Cluster "PCLUSQL03" -Name 'Cluster Name' | Get-ClusterParameter HostRecordTTL,RegisterAllProvidersIP
Get-ClusterResource -Cluster "PCLUSQL03" -Name 'PSQL03LG01' | Get-ClusterParameter HostRecordTTL,RegisterAllProvidersIP

#https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/create-or-configure-an-availability-group-listener-sql-server#HostRecordTTL
#Cluster PCLUSQL03 didnt have an issue. I just got the values to compare settings.
#Also, to compare I ran a Validate Cluster Report on both Clusters using the Failover Cluster Manager console.
#Needed to fix only Cluster PCLUSQL09 and Listener PSQL09LG01 TTLs and Registers for MultiSubnetFailover=True and Legacy clients.

#Update Cluster
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'Cluster Name' | Set-ClusterParameter RegisterAllProvidersIP 0
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'Cluster Name' | Set-ClusterParameter HostRecordTTL 300
#Update LG
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Set-ClusterParameter RegisterAllProvidersIP 0
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Set-ClusterParameter HostRecordTTL 300

#Restart Resources
Stop-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01'
Start-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01'
Start-Sleep -s 10
Stop-ClusterResource -Cluster "PCLUSQL09" -Name 'Cluster Name'
Start-ClusterResource -Cluster "PCLUSQL09" -Name 'Cluster Name'

####################### FIX #######################
#https://blogs.msdn.microsoft.com/alwaysonpro/2014/06/03/connection-timeouts-in-multi-subnet-availability-group/

Get-ClusterResource -Cluster "PCLUSQL09"

#4. Set HostRecordTTL and RegisterAllProvidersIP
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Set-ClusterParameter HostRecordTTL 300
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Set-ClusterParameter RegisterAllProvidersIP 0

#5. Temporarily remove dependency between the availability group resource and the listener name resource.
Get-ClusterResource -Cluster "PCLUSQL09" | Remove-ClusterResourceDependency -Resource 'PSQL09AG01' -Provider 'PSQL09AG01_PSQL09LG01'

#Restart Resources
Stop-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01'
Start-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01'

#To force updating DNS on Windows Server 2012 or 2012 R2:
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Update-ClusterNetworkNameResource

#7. Re-add the dependency of the AG resource on the Listener name resource.
Get-ClusterResource -Cluster "PCLUSQL09" |  Add-ClusterResourceDependency -Resource 'PSQL09AG01' -Provider 'PSQL09AG01_PSQL09LG01'

#8. VERIFY
Get-ClusterResource -Cluster "PCLUSQL09" | Get-ClusterResourceDependency -Name 'PSQL09AG01'
Get-ClusterResource -Cluster "PCLUSQL09" -Name 'PSQL09AG01_PSQL09LG01' | Get-ClusterParameter HostRecordTTL,RegisterAllProvidersIP
Get-ClusterResource -Cluster "PCLUSQL09"

ping PSQL09LG01
ipconfig /flushdns
ping PSQL09LG01
ipconfig /flushdns
ping PSQL09LG01

8 thoughts on “SQL Server 2016 AlwaysOn in Windows Server 2012R2 WFC – MultiSubnet Listener returns wrong IP address after a FlushDNS – Error Connection Timeout!

  1. This is so pointless and I can not believe MS is telling this to customers. I discovered that you are able to get this working properly without this non sense TTL and register just a single IP. Let me explain:
    1. First of all obvious I don’t like this custom setups, long run my folks always need to be aware they can’t create a regular AG, they need parameters….long term will cause issues with new folks etc.
    2. Everything works properly if your Application client is on the same subnet with one of the SQL Multi-subnets nodes, doesn’t matter which one. So I am wondering how come MS is not capable of figure it out why as soon as I move my Application client to another subnet outside of the Nodes subnet it won’t work anymore.
    3. Challenge with lowerTTL let`s say 30 seconds. Good luck buddy. Your application client in Subnet A let`s say talks to DNS server A. Now your SQL Node B in Subnet B fail-over, updates the DNS server B in site B, no way in hell your Client A can now take the new IP of that record even with a TTL of 30 seconds, WHYYY ? Cause DNS Server B needs to replicate though AD replication most likely 15 minutes or 5 minutes whatever, so my point is .. Lower TTL for fail-over is just a story…

    George Popescu …

    Like

    1. Hi George, I don’t agree about it being pointless. I think it’s a good feature to have and I believe MS made improvements on it with Win Svr 2016 site-aware failover clusters.
      Regarding your #1. Don’t see it as a custom setup. For you, this may be custom, but for other environments, these parameters are requirements for their applications to be able to leverage the functionality across sites and they let them adapt the cluster config to their specific business needs or failover policies. New folks should be trained when onboarding about specific env configs. Assume nothing is setup as default.
      Regarding your #2. The application should be pointing to the LG. Try the MultiSubnetFailover=True on your connection string. More so, your network should allow traversing the required subnets if you want to independently failover different layers of your stack. However, I think you’ll experience latency or other performance issues depending on your env due to network reach. I recommend failover the stack together and that doesn’t have to be done automatically nor at the same time. A geo-failover typically means a bad case scenario, vs the typical case, would be failover locally meaning from a primary to an onsite secondary which would be on the same subnet for which a partial failover of the stack would suffice. In any case, MS did put out a newer version of the .Net framework and SqlClient to enable MultiSubnetFailover=True, but I think the issue you’re describing is related to the design or network of your specific env.
      Regading your #3. Why would ever set the TTL so low? lol, If that’s the case, don’t design it for multi-subnet failover. If your RPO/RTO window is so small then use sync replicas on the same subnet so the LG doesn’t need to change IPs during a failover and nothing gets updated in DNS. Anyways, Client A can use in its connection string the setting MultiSubnet=True and/or you can lower the AD replication setting down to 15 minutes like you mentioned (try cmds repadmin /showrepl, or repadmin /sync) but think about it…what if you simply need to do an svr reboot that takes less than 5 minutes, do you want to fail-over the whole stack across sites? I don’t think so.
      The point of all this is that it’s proposed more for a DR case scenario or 2nd-3rd tier failover in which case the organization might want some manual intervention or slight delay to avoid an automated attempt of resources failing back and forth across sites.
      Thx,
      Hiram

      Like

  2. how to Set RegisterAllProvidersIP=0 for the Cluster and LG and
    Set HostRecordTTL=300 (5 minutes instead of default 1200 (20 minutes)) for the Cluster and LG?
    thanks
    John

    Like

    1. Hi John, Its in the script above, just scroll to the right and you’ll see it in line 20-25. Then do the restart resources cmds as a well for it to take affect. Careful to schedule it during a maintenance window if you’re doing it in prod. Hope it helps. Thx, Hiram

      Like

  3. Hello, We have a similar setup with nodes in multi subnets and RegisterAllProvidersIP is set to 0. We are ending up with 2 dns entries after a failover. We have a F5 load balancer infront of DNS servers. If I point the cluster nodes to DNS server directly i am not seeing the issue . Load balancer is forwarding DNS traffic to DNS servers in different sites (prod & dr).

    Like

    1. Depending if you need it or not you can also lower the TTLs like mentioned above and use the multisubnet param in your app server’s connection string. So that the app svr resolves the latest IP for the LG in-case it fails-over across subnets, ie. like from a primary site to a DR site. Secondly, I don’t foresee it being an issue if your comment is about having that LB in-front of the DNS svrs affecting you. Personally, I haven’t tested that sort of DNS scenario and in my case my servers’ static IPs use for DNS servers actual DNS servers on the same subnet not an LB of a DNS farm.

      Like

      1. Thank you for the quick reply. Is there a way to force the server to register its interface in DNS after the Failover event.

        Like

      2. Hi,

        No problem, thx for your comment. That’s the purpose of RegisterAllProvidersIP=0. This will register the interface of the new primary to DNS.

        What we during a test is force a DNS update to propagate across the network immediately by doing something similar to an ipconfig /flushdns or /registerdns. This forces the app layer to immediately re-arp & pickup the new IP of the LG. If you lower the TTLs of your DNS svrs low enough it will do the same thing.

        There’s definitely a powershell script that you can use to force DNS replication across your servers, but its something infrastructure folks run because you’ll typically need admin access and it’ll trickle down to workstations too. Its not something I personally do, because its an uncommon thing for the database team.

        For example, check this out on Stack Overflow: https://stackoverflow.com/questions/35083798/flushdns-and-registerdns-on-multiple-machine-by-powershell

        Careful with what you put inside servers.txt. Try a few machines only so you can measure the impact on your network and DNS server.

        Good luck!

        Like

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.