Hairpin NATs and Martian Packets

This is going to be a bit of a dry post unfortunately. I want to document a problem I encountered with my home network, and what I learned diagnosing and solving it. It relates to Wireguard, Network Address Translation, network interfaces and packet captures. So pre-warned, let’s get into it!

I use the Wireguard VPN to access my home network when I’m out and about on mobile data, mainly to reach Home Assistant for CCTV and things like that. Everything was working fine while I was out, but when I got home and joined my Wi-Fi network I’d be unable to access Home Assistant until I turned the Wireguard client off on my phone. Not the end of the world, and often the best solution is just to turn something off if that solves the problem. But it’s a bit inconvenient, and I had some spare time and wanted to figure out what was going on and hopefully learn something in the process!

My initial understanding of the problem was that my phone was configured with the Wireguard endpoint at my public IP address, so when my phone tried to access that IP address the traffic would have to “go out and back in”, and my router “wouldn’t allow it”. It turned out none of this was true!

The simplest way to think about the problem is this:

  • The Wireguard Android app on my phone has a set of “allowed IPs” - these are the IP ranges (or prefixes) which should tunnel over the VPN. Typically these would be private IP address ranges on your home network.

  • If I access (send packets to) any IP address in these ranges, then the Wireguard app will encrypt the IP packets destined to the internal ranges, and send the encrypted packet to the VPN endpoint address (which needs to be publicly accessible to be useful).

  • The Wireguard “server” software listens on a known port at the endpoint IP address. It decrypts the payload (the packet for the internal host) and routes it on the private network.
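
To make the “allowed IPs” idea concrete, here is a minimal Python sketch of the decision the client makes for each outgoing packet. The two prefixes are taken from the LANs described in this post, but treating them as my actual allowed IPs is an assumption, and `goes_through_tunnel` is a made-up name rather than anything from Wireguard itself.

```python
import ipaddress

# Assumed allowed IPs for illustration; a real config may differ.
ALLOWED_IPS = [
    ipaddress.ip_network("192.168.1.0/24"),
    ipaddress.ip_network("192.168.2.0/24"),
]

def goes_through_tunnel(destination: str) -> bool:
    """Return True if a packet to `destination` matches an allowed prefix."""
    addr = ipaddress.ip_address(destination)
    return any(addr in net for net in ALLOWED_IPS)

print(goes_through_tunnel("192.168.2.10"))  # True: encrypted and sent to the endpoint
print(goes_through_tunnel("8.8.8.8"))       # False: sent as normal traffic
```

Only the inner packet is matched against these prefixes. The encrypted outer packet is always addressed to the endpoint, which is why the endpoint’s reachability matters so much in what follows.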

My initial guess for why I couldn’t access any internal hosts when on my home Wi-Fi and with the Wireguard app active, was that somehow my public (external) IP address wasn’t “accessible” from within the LAN behind my router. Once I got access to an SSH session on my router I could see that packets destined for my public IP address were not even leaving my LAN.

Some background - my router has two network interfaces which are important here:

  • br0 - bound to 192.168.1.1 - this is the gateway address for my LAN
  • ppp0 - bound to my external public IP address allocated by my ISP - any traffic destined for the internet (not matching any prefix on my LAN) is handled by this interface, which will send the packets to the external fibre connection.

When running tcpdump on ppp0 I was not seeing any traffic to my external public IP. This was initially surprising, as I know I’m literally sending traffic to the very same IP address which ppp0 is bound to! One simple test to prove this was to ping my external IP address, and then run tcpdump on ppp0 and br0 respectively. I only saw the ICMP traffic on br0. A bit more reading explained why this is:

  • A network interface is normally used when packets either arrive at the host, or leave the host destined for another host
  • Another way to say this is that interfaces are the way in, or out, between the kernel and the outside world
  • The kernel decides what to do with an incoming packet. Either it is handled locally on this host, or it is forwarded to another host using another interface
  • The kernel sees that the destination IP address is bound to an interface on the current machine, so it should be handled locally
  • The packet never passes through ppp0, as that would only be needed if the packet had to be sent on to another host
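
The points above can be sketched as a toy routing decision in Python. This is a simplified model, not how the kernel is implemented: 203.0.113.7 is a documentation address standing in for my public IP, and `route` is a hypothetical helper.

```python
import ipaddress

# Addresses bound to the router's interfaces (public IP is a placeholder).
LOCAL_ADDRESSES = {
    "br0": ipaddress.ip_address("192.168.1.1"),
    "ppp0": ipaddress.ip_address("203.0.113.7"),
}
LAN = ipaddress.ip_network("192.168.1.0/24")

def route(destination: str) -> str:
    """Decide what happens to a packet that arrived at the router."""
    dst = ipaddress.ip_address(destination)
    if dst in LOCAL_ADDRESSES.values():
        # Bound to one of our own interfaces: no egress interface involved.
        return "deliver locally"
    if dst in LAN:
        return "forward via br0"
    return "forward via ppp0"

print(route("203.0.113.7"))   # deliver locally
print(route("192.168.1.50"))  # forward via br0
print(route("8.8.8.8"))       # forward via ppp0
```

The first case is the one that surprised me: a packet from the LAN addressed to the public IP arrives on br0 and is handled on the router itself, so tcpdump on ppp0 has nothing to show.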

So that explains the first part of the situation. It’s not true that somehow these packets are “going out and back in” and somehow get lost because of some kind of irrational loop behaviour (but this may just be foreshadowing something …)

That brings us onto the next step in the equation - DNAT, or Destination Network Address Translation.

I have my router set up to do port forwarding for the Wireguard port. This means that when I access the Wireguard port from outside my LAN, these packets will get forwarded to the internal host specified in the port forwarding rule. This means I can run a server on my public IP address, where the server is actually within my LAN.

Port forwarding really just means DNAT. Here’s how that works:

  • Incoming packets for the wireguard endpoint have destination IP of my public IP address, and destination port of the Wireguard port
  • My router has a DNAT rule set up to rewrite these packets to instead have the destination IP of the internal server running Wireguard
  • So these packets get forwarded to that internal host
  • This is destination-only NAT - there is no ‘masquerade’ (source rewriting) in this case - the source IP and port remain the same as they were
  • So replies from the Wireguard server would work in one of two possible ways, depending on whether my phone is on the public internet (e.g. 5G mobile internet), or on my internal LAN:
    • When on 5G: the source IP of the incoming packets (and so the destination of the replies) will be the public IP assigned by my network provider: this can be routed no problem through my router like any other internet traffic
    • When on my LAN: the source IP would be my phone’s internal private IP on my LAN: this can be routed no problem as my router already knows about that host
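
As a rough illustration of the port forwarding rule described above, here is a small Python model of DNAT. The addresses and the port are placeholders (51820 is Wireguard’s conventional default, not necessarily the port I use), and `Packet` and `dnat` are hypothetical names.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int

# Placeholder values for illustration only.
PUBLIC_IP = "203.0.113.7"
WG_PORT = 51820
WG_HOST = "192.168.1.2"  # internal host running Wireguard

def dnat(packet: Packet) -> Packet:
    """Rewrite only the destination IP of packets matching the rule."""
    if packet.dst_ip == PUBLIC_IP and packet.dst_port == WG_PORT:
        return replace(packet, dst_ip=WG_HOST)
    return packet

incoming = Packet("198.51.100.9", 40000, PUBLIC_IP, WG_PORT)
forwarded = dnat(incoming)
print(forwarded.dst_ip)  # 192.168.1.2
print(forwarded.src_ip)  # 198.51.100.9, unchanged
```

The source fields are untouched, which is what lets the Wireguard host address its replies to the original sender.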

So what’s the problem? Well, I’ve not mentioned this yet, but I actually have a second LAN behind the “regular” LAN provided by my internet router. So there is a second level of NAT!

At this point, to keep things relatively readable I’ll just try to outline the problem in all its detail.

My internet router manages a LAN on 192.168.1.0/24. I consider this my “legacy” LAN as it’s baked in to the proprietary internet router, which handles DHCP, NAT, wi-fi, and WAN concerns. To get more control I run another LAN off this one. To achieve this, I have a Raspberry Pi 4 with two network interfaces: one connected to an ethernet port on my internet router, and one connected to a Netgear switch. The second network interface is on my “new” LAN which has prefix 192.168.2.0/24. This “new” LAN has its own wireless AP, which my devices are generally connected to.

The Raspberry Pi is doing NAT between the two LANs. And this is where the problem crept in: the Pi was SNAT’ing packets on the way out to the legacy LAN (on their way to my public IP address), and the router’s port forwarding rule was then DNAT’ing them straight back to the Pi. This resulted in something strange: the packets had source and destination IP addresses belonging to the same machine (the Raspberry Pi). When the Pi received these packets it dropped them, on the basis that something had likely gone wrong. More specifically, where would replies be sent to? It doesn’t make sense to receive packets on an incoming network interface, when the reply address is … yourself.

The situation of SNAT’ing on the way out, and DNAT’ing on the way in is known loosely as ‘hairpin NAT’. And the problematic packets with source/destination the same are loosely known as Martian Packets. (I say loosely, as this concept mostly seems to relate to unroutable source addresses, but I’ve also seen the term used for the situation I’m seeing).

I had traced this problem all the way to tcpdump’ing on the Pi - seeing the packets arriving (destined for Wireguard, which also happens to be running on the Pi) and seeing no packets delivered to Wireguard itself. Initially very confusing.

To round off this post, this was the pathological situation I had got myself into, followed by the solution:

  • Packets destined for the Wireguard “server” are sent by my phone. Destination IP is my public IP, source IP is my phone’s private IP address on the new LAN (192.168.2.x)
  • These packets reached the Raspberry Pi, which is the gateway address for the new LAN. The Pi does NAT (masquerade) - rewriting the source IP to be its own address, and possibly the source port to something random. The outgoing connection gets tracked by conntrack so that replies can be translated back to the phone’s IP. Since the outgoing packets are destined for my public IP, the Pi knows to use its default gateway, which is reached via the interface connected to the legacy LAN. So the source IP of these packets will be the Pi’s address on the legacy LAN (192.168.1.x).
  • These packets now arrive at the router’s br0 interface and get DNAT’d (port forwarded) to the internal host for the Wireguard server. This happens to be the address of the interface on the Pi which is on the legacy LAN (192.168.1.x).
  • These packets get routed to the Pi, and this is where the martian packet situation creeps in: the kernel on the Pi drops the packets as the source and destination belong to the current host.
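
The four steps above can be traced with a short Python simulation. It is a deliberately simplified sketch: ports other than the destination port are ignored, every address is a placeholder, and `pi_accepts` only approximates the kernel’s check by rejecting any arriving packet whose source is one of the Pi’s own addresses.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    dst_port: int

# All addresses and the port are placeholders for illustration.
PUBLIC_IP = "203.0.113.7"
WG_PORT = 51820
PHONE_IP = "192.168.2.50"
PI_LEGACY_IP = "192.168.1.2"
PI_NEW_IP = "192.168.2.1"
PI_ADDRESSES = {PI_LEGACY_IP, PI_NEW_IP}

def pi_masquerade(p: Packet) -> Packet:
    """Step 2: the Pi rewrites the source to its legacy-LAN address."""
    return replace(p, src_ip=PI_LEGACY_IP)

def router_port_forward(p: Packet) -> Packet:
    """Step 3: the router DNATs Wireguard traffic to the Pi."""
    if p.dst_ip == PUBLIC_IP and p.dst_port == WG_PORT:
        return replace(p, dst_ip=PI_LEGACY_IP)
    return p

def pi_accepts(p: Packet) -> bool:
    """Step 4: reject arriving packets that claim one of our own addresses."""
    return p.src_ip not in PI_ADDRESSES

packet = Packet(PHONE_IP, PUBLIC_IP, WG_PORT)  # Step 1: sent by the phone
packet = router_port_forward(pi_masquerade(packet))
print(packet)
print("accepted" if pi_accepts(packet) else "dropped as martian")  # dropped as martian
```

After both rewrites the source and destination are the same address, so the final check fails. A packet from a phone on mobile data skips the masquerade step and keeps its own public source address, which is why the same port forward works fine from outside.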

So what was the solution? Well, the whole idea of having to route packets to an adjacent LAN and back to the current one is a bit silly. The fix was as simple as adding a DNAT rule on the Pi which rewrites a destination of my public IP to the address of the Pi on the new network, but only when the destination port is the Wireguard port. This ensures that Wireguard traffic stays on the same LAN rather than following the hairpin NAT route and getting DNAT’d and SNAT’d into oblivion.
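
For reference, here is one way such a rule could be expressed, assuming the Pi uses iptables (an assumption: nftables or another front end would look different). The Python below only builds and prints the command, it does not run it. The addresses and port are placeholders, and `build_dnat_rule` is a hypothetical helper. The rule matches UDP because Wireguard runs over UDP.

```python
import shlex

def build_dnat_rule(public_ip: str, pi_new_lan_ip: str, port: int) -> list[str]:
    """Build an iptables command that keeps Wireguard traffic on the Pi."""
    return [
        "iptables", "-t", "nat", "-A", "PREROUTING",
        "-d", public_ip,            # only packets addressed to the public IP
        "-p", "udp", "--dport", str(port),  # and only the Wireguard port
        "-j", "DNAT", "--to-destination", f"{pi_new_lan_ip}:{port}",
    ]

# Placeholder values; substitute the real public IP, Pi address and port.
rule = build_dnat_rule("203.0.113.7", "192.168.2.1", 51820)
print(shlex.join(rule))
```

Because the rewrite happens in PREROUTING, before the Pi makes its routing decision, the packet is delivered locally and never reaches the masquerade step or the legacy LAN.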

All this to avoid turning Wireguard on and off when I leave or arrive home.

This has probably been awfully dull, but I find it valuable to write up learning experiences like this, even if just as a reminder for future-me. A future post I’d like to write concerns how Wireguard really works at the network (layer 3) level. I found it interesting to learn about the “Tun” Linux interface, as well as finding the workings of Wireguard a bit counter-intuitive at first. But that will have to wait for a future post!