Always On VPN IKEv2 Load Balancing and NAT

Over the last few weeks, I’ve worked with numerous organizations and individuals troubleshooting connectivity and performance issues associated with Windows 10 Always On VPN, and specifically connections using the Internet Key Exchange version 2 (IKEv2) VPN protocol. An issue that appears with some regularity is when Windows 10 clients fail to connect with error 809. In this scenario, the server will accept connections without issue for a period of time and then suddenly stop accepting requests. When this happens, existing connections continue to work without issue in most cases. Frequently this occurs with Windows Server Routing and Remote Access Service (RRAS) servers configured in a clustered array behind an External Load Balancer (ELB).

Network Address Translation

It is not uncommon to use Network Address Translation (NAT) when configuring Always On VPN. In fact, for most deployments the public IP address for the VPN server resides not on the VPN server, but on an edge firewall or load balancer connected directly to the Internet. The firewall/load balancer is then configured to translate the destination address to the private IP address assigned to the VPN server in the perimeter/DMZ or the internal network. This is known as Destination NAT (DNAT). Using this configuration, the client’s original source IP address is left intact. This configuration presents no issues for Always On VPN.

Source Address Translation

When troubleshooting these issues, the common denominator seems to be the use of Full NAT, which includes translating the source address in addition to the destination. As a result, VPN client requests arrive at the VPN server appearing to come not from the client’s original IP address, but from the IP address of the network device (firewall or load balancer) that is translating the request. Full NAT may be explicitly configured by an administrator, or in the case of many load balancers, configured implicitly because the load balancer is effectively proxying the connection.

Known Issues

IKEv2 VPN connections use IPsec for encryption, and by default, Windows limits the number of IPsec Security Associations (SAs) coming from a single IP address. When a NAT device is performing full NAT (translating the source address as well as the destination), the VPN server sees all inbound IKEv2 VPN requests as coming from the same IP address. When this happens, clients connecting using IKEv2 may fail to connect, most commonly when the server is under moderate to heavy load.
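
One way to check whether a VPN server is affected is to count its established main-mode SAs per remote address. The following is a minimal sketch, not a definitive diagnostic. It assumes an elevated PowerShell window on the RRAS server, and it assumes the objects returned by Get-NetIPsecMainModeSA expose a RemoteEndpoint property holding the peer’s IP address.

```powershell
# Count established IKEv2 main-mode SAs per remote (peer) address.
# Assumption: each SA object has a RemoteEndpoint property containing
# the IP address the connection appears to come from.
Get-NetIPsecMainModeSA |
    Group-Object -Property RemoteEndpoint -NoElement |
    Sort-Object -Property Count -Descending |
    Select-Object -First 10 -Property Count, Name
```

If nearly all SAs are grouped under one address, and that address belongs to the firewall or load balancer rather than to clients, the source address is most likely being translated.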

Resolution

The way to resolve this issue is to ensure that any load balancers or NAT devices are not translating the source address but are performing destination NAT only. The following is configuration guidance for F5, Citrix ADC (formerly NetScaler), and Kemp load balancers.

F5

On the F5 BIG-IP load balancer, navigate to the Properties > Configuration page of the IKEv2 UDP 500 virtual server and choose None from the Source Address Translation drop-down list. Repeat this step for the IKEv2 UDP 4500 virtual server.

Citrix ADC

On the Citrix ADC load balancer, navigate to System > Settings > Configure Modes and check the option to Use Subnet IP.

Next, navigate to Traffic Management > Load Balancing > Service Groups and select the IKEv2 UDP 500 service group. In the Settings section click edit and select Use Client IP. Repeat these steps for the IKEv2 UDP 4500 service group.

Kemp

On the Kemp LoadMaster load balancer, navigate to Virtual Services > View/Modify Services and click Modify on the IKEv2 UDP 500 virtual service. Expand Standard Options and select Transparency. Repeat this step for the IKEv2 UDP 4500 virtual service.

Caveat

Making the changes above may introduce routing issues in your environment. When configuring these settings, it may be necessary to configure the VPN server’s default gateway to use the load balancer to ensure proper routing. If this is not possible, consider implementing the workaround below.
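
To see where the VPN server currently sends return traffic, the default route can be inspected with the built-in Get-NetRoute command. This is only a quick check, run on the RRAS server. Which next hop is correct depends on your topology; the assumption here is that it should be the load balancer’s internal-facing address when the client’s source IP is preserved.

```powershell
# List the IPv4 default route(s) on the VPN server, showing the interface
# and next hop. With source addresses preserved, replies to VPN clients
# follow this route, so it typically needs to point at the load balancer.
Get-NetRoute -AddressFamily IPv4 -DestinationPrefix '0.0.0.0/0' |
    Select-Object -Property InterfaceAlias, NextHop, RouteMetric
```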

Workaround

To fully resolve this issue the above changes should be made to ensure the VPN server can see the client’s original source IP address. If that’s not possible for any reason, the following registry key can be configured to increase the number of established SAs from a single IP address. Be advised this is only a partial workaround and may not fully eliminate failed IKEv2 connections. There are other settings in Windows that can prevent multiple connections from a single IP address which are not adjustable at this time.

To implement this registry change, open an elevated PowerShell command window on the RRAS server and run the following commands. Repeat these commands on all RRAS servers in the organization.

New-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\IKEEXT\Parameters\' -Name IkeNumEstablishedForInitialQuery -PropertyType DWORD -Value 50000 -Force

Restart-Service IKEEXT -Force -PassThru
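
After making the change, it can be useful to confirm that the value was written and that the service came back up. The sketch below only reads state and changes nothing. The $path variable is a name chosen for illustration.

```powershell
# Read back the registry value set above. Nothing is returned if the
# value has not been created on this server.
$path = 'HKLM:\SYSTEM\CurrentControlSet\Services\IKEEXT\Parameters'
Get-ItemProperty -Path $path -Name IkeNumEstablishedForInitialQuery -ErrorAction SilentlyContinue |
    Select-Object -Property IkeNumEstablishedForInitialQuery

# Confirm the IKEEXT service is running again after the restart.
Get-Service -Name IKEEXT | Select-Object -Property Name, Status
```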

Additional Information

IPsec Traffic May Be Blocked When A Computer is Behind a Load Balancer

Windows 10 Always On VPN IKEv2 Load Balancing with Citrix NetScaler ADC

Windows 10 Always On VPN IKEv2 Load Balancing with F5 BIG-IP

Windows 10 Always On VPN IKEv2 Load Balancing with Kemp LoadMaster

60 Comments

  1. Matt Klein

     /  April 13, 2020

    We are experiencing this issue described above with both of our VPN servers not accepting new connections after 2-3 days. We typically have around 200 clients per day on the server.

    However, we are using direct NAT and DNS round robin so we can see the originating IP so the issue doesn’t appear the same – is it still worth applying this fix or is this possibly an entirely separate issue? We’ve had a ticket open with MS for about 2 weeks with not much movement

    • It wouldn’t hurt to enable that registry key just to see if it provides any relief. Are you using the Kemp LoadMaster load balancer by chance?

      • Matt Klein

         /  April 13, 2020

        We’re currently not using any LoadBalancer – just DNS Round Robin with 2 IPs that have NAT’s.

        We’re testing setting this up for F5 so we have more DR than DNS round Robin provides

      • Got it. If your VPN server sees the client’s original IP address then you won’t get any benefit from adding that registry setting.

      • Matt Klein

         /  April 13, 2020

        Hope I’m not derailing the point of this article – but are there any other IKEV2 limitations that might cause this exact same issue (all new connections initiate and then drop) after 2-3 days of connections? Despite having direct NAT?

      • Potentially, yes. Reach out to me directly via email and I’ll provide you with more detail.

  2. Luke Flack

     /  April 13, 2020

    Hi Richard, thank you so much for posting this. This issue caused me a great deal of trouble when initially setting up AOVPN behind our F5 load balancer and I now have an explanation as to why, where before I had none.

  3. Dave K

     /  April 13, 2020

    Thanks for the post, Richard. Another timely article for sure. In our implementation we’ve leveraged a device tunnel with user certificates. I see in some cases where during the ISAKMP negotiation the server is unable to send the server certificate back to the client and complete the negotiation. This results in an event ID 4652 and the server gives up trying after about the 3rd attempt. This is rare, however. We have over 1000 clients connected on two RRAS servers load balanced behind an F5. We’ve engaged both Microsoft support and F5 but not reached a resolution yet.

    • What version of Windows Server are you running?

      • Dave K

         /  April 14, 2020

        We’re running Server 2019 Datacenter edition.

      • Have you enabled IKEv2 fragmentation support on the server?

      • Dave K

         /  April 14, 2020

        Yes we have enabled fragmentation. We typically see 11-12 fragments during the ISAKMP negotiation. I blame the certificate size.

      • Ok, just checking. 🙂 What version of Windows 10 client? I think IKEv2 fragmentation support wasn’t added until 1803. If you’re using 1709 you’ll have problems.

      • Dave K

         /  April 14, 2020

        We’re using Windows 10 Enterprise build 1909. By and large our clients are connecting without issue, but we have a smattering that cannot connect or do not do so unless there’s a detected network change. We’ve been having people disconnect from their home ISP to a wifi hotspot and seen success, but consistency is the goal. 🙂

      • Great. I’m assuming of course you configured the F5 with a persistency group for UDP 500 and 4500 with “match across services” set as well?

    • Dave K

       /  April 14, 2020

      Indeed we have. Your site has been a great source of information for us when it came to configuring load balancing. F5 is investigating now and I hope to hear something back from them soon.

  4. Ben W

     /  April 15, 2020

    Hey Richard – we’ve been deploying AOVPN on a pilot set of users – moving toward going wider with it. But I’ve got a small thing bugging me I can’t figure out why.

    We’re doing Device based tunnels, behind an F5 using Server 2016 RAS Servers (With the persistence profiles and client IP’s passing through) – but I’ve noticed that when some clients connect – they create multiple tunnels at once – only one is actually live, but the other ones don’t seem to go away – they consume a few ports/ip’s for the duration the client is connected. Not everyone does it – only some.

    Is this normal behavior – something fixable?

    • Definitely not normal, but not sure what would be causing that. Have a look at the event logs on the client and see if there’s any indication as to why the client might have disconnected. Perhaps that will shed some light on what’s happening.

      • Ben W

         /  April 16, 2020

        Fragmented packets was the answer in the end. We were trying to use Server 2016 as 2019 hadn’t been certified for our environment yet, but in the end, after confirming fragmented packets, upgrading to 2019, and enabling server fragmentation, the issue has gone away and it’s looking a lot better.

      • Great to hear! 🙂

  5. Johan

     /  April 18, 2020

    Hi
    Great article, I have a question regarding the registry setting and what the impact will be.

    What is the default value of this registry setting

    And if you increase this value, will you just push the problem in front of you, or do we have some kind of reset after X hours?

    • The default is 10. If you increase this value you will likely see some benefits in certain scenarios. However, there are other limits in Windows which are hardcoded that may result in connectivity issues. The best way to completely resolve this issue is to configure the load balancer to forward the client’s original source IP address to the VPN server.

      • Johan

         /  April 19, 2020

        Thank you for the answer.

      • Hi Richard, I hope you are well.

        I am also interested in testing this registry key in one of our implementations.

        The scenario I have is that the F5 is load balancing in front of 5 RRAS servers. The source IP address of the laptops is visible to the RRAS servers, so no Source NAT is configured.

        I have the Device and User Tunnel deployed to the laptops using IKEv2.

        There are lots of couples who work for the company and are all now working from home. So it’s common that married couples are both connecting using their own company-provisioned laptops using AOVPN over the same broadband connection.

        I have had a scenario where only one person can successfully connect to the VPN at a time. It’s whoever booted up first and logged on. Then the second person can’t get a connection. If they alternate the boot-up order, the problem switches between the working and non-working user.

        Will two AOVPN-enabled laptops using two IKEv2 tunnels from a single IP address cause a scenario where the default limit of 10 security associations is triggered, or should I be looking somewhere else?

        Thanks in advance

        Dave

      • Hi Dave. I’m not sure if implementing that registry key will help, but give it a shot and let me know what you find. What I think is more likely is that it is an issue with their on-premises networking equipment (router, firewall, etc.). I’ve heard numerous reports of people having issues with IPsec VPNs behind various ISP equipment. Often times there is an IPsec bypass option which might help. I’d suggest having a look there to see what you can find too.

      • Hi Richard, thanks as always for the swift response. I will get a change raised to try this. I don’t think it can harm by increasing the parameter to a higher number. I will update with my findings.

        Regards,

        Dave

  6. We misconfigured our F5 (NAT) and started to get 809 errors.

    Corrected the issue passing source IPs.

    Odd thing now is some specific clients are still getting hit with 809. More puzzling still is on the affected devices, ethernet will work fine the error just occurs over wifi.

    Tried the reg entry mentioned, that didn’t seem to help.

    The issue persists through reimaging even. Completely repeatable, same profiles etc.

    IPSec errors I see on RAS are either negotiation timeouts or “Max number of established MM SAs to peer exceeded”

    Restarts of the server don’t seem to help and haven’t really found a way to clear SAs or find the offending device. Really appears to be an issue with the specific server, tried another one from our stage environment to figure out the F5 NAT issue and that would connect fine.

    Any thoughts on how to reset SAs? or find them?

    • On the VPN server you can view main-mode SAs using the Get-NetIPsecMainModeSA PowerShell command. You can clear them by piping the command to Remove-NetIPsecMainModeSA.

  7. Cheryl

     /  April 29, 2020

    Hi, Has anyone got this working with a F5 in a OneArm deployment with Source NAT used?

  8. Rich

     /  May 1, 2020

    Hi Richard, please can I express my deepest gratitude towards you and your posts. This really could not have come at a better time. Enabling transparency on the Load Master and reconfiguring the default gateway on the external NIC now means the clients Public IP addresses are being properly displayed.
    Thank you very much !!

  9. Paddy Berger

     /  May 6, 2020

    Hi Richard, I can see you have ticked “use client IP” in the UDP service group, however we are using services instead. What would be the setting within here?

  10. Hi Richard, I hope you are well.

    I was reading your article and interested in the workaround for increasing the number of SA’s from a single IP Address registry key.

    You mentioned that there are other settings within Windows that can stop multiple connections from a single IP Address that are not adjustable at this time. I am interested in what those settings are, are you able to elaborate on this?

    Regards,

    Dave

    • In addition to limiting the number of SAs from a single source IP address, Windows also limits the number of “in progress” main mode SAs from a single IP address to 35. So, while the registry entry I provided might provide *some* relief, it may not fully resolve the issue. The ultimate way to fix this problem is to ensure the VPN server sees the client’s original public source IP address.

  11. Hi Richard, One thing I have never seen covered in this is client IP addressing when load balanced across say 2 VPN servers. Would you typically use two separate IP pools for clients, one on each server, or use one large pool issued by a DHCP server? or are both acceptable.
    The platform team here are asking what subnets needs to be carved up in AWS for this to work and for the life of me I can’t find information on this from MS.

    Also, have you managed to load balance successfully in AWS using the elastic load balancer? I am skeptical it has the features we need to allow port following/persistency or sticky ports.

    • I’ve had a post in draft for quite some time that covers this topic. One of these days I’ll get it published, I promise! 😉 To answer your question, yes, unique IP address pools are required per server to ensure proper routing from the internal network. You’ll create routes internally to forward each VPN client IP subnet back to the respective VPN server.

      As for load balancing in AWS, you can use the ELB for SSTP, but not for IKEv2. It does not support session persistence across services as required for IKEv2. For that you’ll have to deploy a load balancing appliance such as Kemp or something else.

      • William

         /  May 20, 2020

        Hi Richard, thanks for the reply and congrats on the magazine feature!

        Can you clarify whether SSTP works in AWS without NAT for client IP addresses? I understand that AWS instances do not use ARP, so I am wondering about the Proxy ARP that RRAS usually uses to route L2 traffic. I am hoping that it would work with a standard 2 NIC configuration with no NAT involved and client IP addresses handed out from the LAN interface of the RRAS box, but then how would the traffic get routed without using proxy ARP?

        We have been testing for a while now and seem to run into routing issues that’s all.

      • It’s been a while since I’ve done RRAS in AWS, but I do remember getting it to work in the past without NAT. You’ll need to create a unique subnet for VPN clients, though, and configure routes to point the traffic back to RRAS.

      • William

         /  May 26, 2020

        Thank you Richard. In that instance where there is a separate Client subnet to the VPC do you know whether Client traffic will be routable from a VPC back to a datacenter via a direct connect?

        I have heard that a VPC may not be able to route traffic outside the VPC if it’s not in the VPC summary CIDR block, so we would not be able to route clients back to the datacenter, and NAT would be the only viable client option.

        Nevertheless, we will test, but it will take a while to reach that point.

      • I’m not certain about routing via direct connect to be honest. There are similar restrictions for routing in Azure so it sounds plausible.

      • William

         /  June 2, 2020

        Hi Richard, I have now found the answer to this, and it may be useful for the future.

        You cannot route a different Subnet to the VPC CIDR via a virtual gateway, the packets will just get dropped as they don’t originate from the VPC range.

        To get around this you must instead use an AWS Transit Gateway to connect your on-premises network. This is supported and will work according to AWS support!

        Best regards

      • Great to hear. Thanks for the update!

  12. Rick

     /  May 28, 2020

    Hi Richard thank you for the article we have been experiencing many of these issues.
    Just a question if we use the command Remove-NetIPsecMainModeSA
    will endpoints lose connection to VPN?

  13. Jason

     /  June 5, 2020

    Great resources Richard! I am running into the same issue with the IKEV2 Device Tunnel with machine cert. We have 2 VPN Servers running RRAS 2019 with Fragmentation enabled and clients are Windows 10 1909 … receiving error 809 when my attempts go through the External F5 LTM. If I bypass the External F5 with a local host entry and add in the External IP of the VPN box to my existing FW rule… directly it always works perfectly. For testing/troubleshooting I removed one of the VPN boxes from the F5 so currently I only have the one VPN box behind the F5 and I’m still getting the intermittent 809 error on the device tunnel. I followed all the advice and settings for source address translation to NONE and have the default GW of the External NIC of VPN box to be the FLOAT IP of the F5. Also have persistence of source address and match across service enabled. So just curious … I currently have 3 separate VIPs on the F5, there are the 2 UDP ones … the 500 and 4500 and the SSTP one of 443. I also have 3 separate matching pools with the same corresponding ports. Would it make sense to keep the 3 separate VIPs but perhaps combine the 500 & 4500 into the same pool? I also tried using wildcard on one single VIP that should allow all ports and all protocols in and did the same for the pool but that didn’t seem to help either. The User Tunnel seems to be fine. If I change the source address translation to AUTO MAP and fix the VPN server default GW back to the firewall I still get the intermittent connectivity and I obviously lose the Client IP info as all the connections are showing the IP of the F5 FLOAT. This is a new deployment and we would like to go into production soon…everything is working perfectly when bypassing the F5 so this is the last component to address. If you have any ideas on what I could try on the LTM it would be greatly appreciated. Thanks.

    • Looks like you’ve done everything right. Not sure why you are still getting 809 errors. I typically configure the F5 as you did, with three separate pools (one per port) and three individual VIPs. I’m assuming you’re not having any issues with SSTP? Just IKEv2?

  14. Prasanth Kuttasseri

     /  July 30, 2020

    Have a doubt. We have DA now. But the issue is that the client laptops through the DA shows single source IP (the DA gateway IP) which makes it difficult for the Proxy to identify the user and activities. Proxy is becoming crazy while it sees the source from single IP and with multiple credentials. Planning to test the “Always On VPN” .
    I have worked with an SSL remote access concentrator where the clients are assigned an IP from a pool and the firewall controls the access. Does Always On VPN use the same concept, or a different one? Can I get an individual IP address from a virtual pool for each client coming out of the VPN server, rather than a single source IP (i.e., the VPN gateway IP)?

    • Got it. You won’t have that problem with Always On VPN. Each VPN client is assigned a unique IP address which is routed through the VPN server. There is no NAT taking place when using Always On VPN.

      • Prasanth Kuttasseri

         /  August 3, 2020

        Sir Richard…thank you. You made the day as from security perspective this is essential to identify the client IP and the activities involved. Thank you

  15. Mahesh

     /  September 3, 2020

    Hi Richard,
    We are seeing some client disconnection issues where the only observation is system event IDs 868 and 631. The event log points to VPN DNS name resolution, but in practice we don’t see this issue. There are other users also connected at the time.
    Is there any way to check this? We already tested replacing the VPN profile and moving to a different VPN server as well…

    • These commonly occur because of an underlying connectivity issue. I’d have a close look at the network connection when the VPN disconnects to determine if it was interrupted at all.
