Weird behaviour with the fake ARP on AWS

Hi all,

I have a curious issue on AWS using an IPFire instance on EC2. It’s based on the aws-marketplace/IPFire 2.23 - Core Update 132-f1cc1e91-4677-4b62-8d48-0747fb1ddfda-ami-0b9012d247e056870.4 AMI (although upgraded over time to Core Update 156).

It’s supposed to be providing access to remote services via an IPSec tunnel. The tunnel works, but I am struggling to allow access to the services from most of my VPC. Here’s a (messy, sorry!) picture, showing the key parts of the architecture. I have anonymised the non-VPC IP addresses by replacing bits of them with letters:

I can ssh from “JB” in the App Pub B subnet to “TB” in the DMZ B subnet. I can then run curl a.b.85.37 and curl --proxy 172.31.130.20 a.b.85.37 on TB and get a sensible response from the remote server both times.

On JB, curl a.b.85.37 doesn’t connect to the remote service directly, which is expected, but neither does curl --proxy 172.31.130.20 a.b.85.37, and that’s the part I can’t explain. I’ve been banging my head against this issue for a while.

Finally, I groaned out loud and installed tshark on IPFire. I traced the green0 interface while attempting curl --proxy 172.31.130.20 a.b.85.37 from JB, and it showed me the following. The hypervisor, or whatever it is, is not coming back with an ARP response when asked for JB’s MAC address.

    1 0.000000000 0a:77:1f:64:1d:4e → 0a:e0:de:66:ea:3a ARP 42 Who has 172.31.130.17? Tell 172.31.130.20
    2 0.000120798 0a:e0:de:66:ea:3a → 0a:77:1f:64:1d:4e ARP 56 172.31.130.17 is at 0a:e0:de:66:ea:3a
    3 4.133435797 172.31.4.175 → 172.31.130.20 TCP 74 51396 → 800 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=2472235682 TSecr=0 WS=64
    4 4.133548172 0a:77:1f:64:1d:4e → Broadcast    ARP 42 Who has 172.31.4.175? Tell 172.31.130.20
    5 5.146706933 0a:77:1f:64:1d:4e → Broadcast    ARP 42 Who has 172.31.4.175? Tell 172.31.130.20
    6 5.156463463 172.31.4.175 → 172.31.130.20 TCP 74 [TCP Retransmission] 51396 → 800 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=2472236705 TSecr=0 WS=64
    7 6.160035867 0a:77:1f:64:1d:4e → Broadcast    ARP 42 Who has 172.31.4.175? Tell 172.31.130.20
    8 7.172433011 172.31.4.175 → 172.31.130.20 TCP 74 [TCP Retransmission] 51396 → 800 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=2472238721 TSecr=0 WS=64
    9 8.363067827 172.31.4.175 → 172.31.130.20 TCP 74 55006 → 22 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=2472239912 TSecr=0 WS=64
   10 8.363164255 0a:77:1f:64:1d:4e → Broadcast    ARP 42 Who has 172.31.4.175? Tell 172.31.130.20
   11 9.380513943 172.31.4.175 → 172.31.130.20 TCP 74 [TCP Retransmission] 55006 → 22 [SYN] Seq=0 Win=26883 Len=0 MSS=8961 SACK_PERM=1 TSval=2472240929 TSecr=0 WS=64
   12 9.386667126 0a:77:1f:64:1d:4e → Broadcast    ARP 42 Who has 172.31.4.175? Tell 172.31.130.20
   13 10.400015983 0a:77:1f:64:1d:4e → Broadcast    ARP 42 Who has 172.31.4.175? Tell 172.31.130.20
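
For the record, the trace above came from running tshark directly on IPFire, with roughly the following command (the capture filter is a reconstruction rather than the exact invocation I used):

    # capture on the GREEN interface, limited to ARP plus traffic to/from JB (172.31.4.175)
    tshark -i green0 -f "arp or host 172.31.4.175"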

I thought that ARP in AWS VPCs was actually faked in the hypervisor, so it’s not clear to me how it can fail to come back with an answer.
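
For anyone checking the same thing on their own box: the kernel only ARPs for an address directly when it believes that address is on-link, and standard iproute2 will show you whether that is the case (the IPs below are just the ones from my trace):

    # "dev green0" with no "via" in the output means the destination is treated as on-link,
    # so the kernel resolves it with ARP instead of sending it to a gateway
    ip route get 172.31.4.175
    # the unanswered ARP requests leave a FAILED/INCOMPLETE neighbour entry behind
    ip neigh show 172.31.4.175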

If anyone here can shed any light on whatever stupid thing it is that I have done to cause this problem, I would be very grateful indeed. If you’ve made it this far, thanks for the attention.

I wanted to follow up on this, for people with a similar problem arriving via Google.

This particular issue was because I had GREEN set to the whole VPC (172.31.0.0/16), but IPFire only knew how to route to the “DMZ B” subnet (172.31.130.16/28) where its ENI was (see picture above). I contacted IPFire support (worth every single penny, btw, thank you!), and the answer was to put in a static route to 172.31.0.0/16 via the router in the 172.31.248.16/28 subnet. The router is at 172.31.248.17, as per the AWS docs (AWS reserves the subnet’s base address plus one for the VPC router).
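
For completeness, the one-off version of that route from a root shell is roughly the line below (IPFire also has a static routes page in the web UI for making this persistent; whether you need an explicit dev depends on which interface actually sits in 172.31.248.16/28):

    # route the rest of the VPC via the AWS VPC router for that subnet
    # (use "ip route replace" instead if an existing 172.31.0.0/16 route is in the way)
    ip route add 172.31.0.0/16 via 172.31.248.17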

Again, very many thanks to Support, I could have been banging my head against this problem for quite a bit longer, I think.

Best not to use such a large network unless it’s really needed, and in that case you need to tune things to allow for the increased ARP usage.
If you only have a small number of hosts in the ARP cache it’s normally not an issue, but it’s still best practice to use as small a network as possible; a /16 is pretty large. Routing isn’t the problem (supernetting large networks is fine), but local ARP caches can fill up without some tuning on hosts and firewalls if you really do have that many hosts. With a small number of hosts it’s usually fine, but any scanning on the network (e.g. pinging every IP) can fill the cache. I have seen many networks go down because of this, so where larger networks are in use I increase the local ARP cache settings on firewalls, at least beyond most defaults.
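
For what it’s worth, on a Linux-based firewall the relevant knobs are the kernel’s neighbour table thresholds. A rough sketch, with purely illustrative values (size them for the number of hosts you actually expect, and put them in /etc/sysctl.conf or a sysctl.d drop-in to persist):

    # gc_thresh1: below this many entries the garbage collector doesn't run at all
    # gc_thresh2: soft limit, gc_thresh3: hard limit on neighbour (ARP) cache entries
    sysctl -w net.ipv4.neigh.default.gc_thresh1=2048
    sysctl -w net.ipv4.neigh.default.gc_thresh2=4096
    sysctl -w net.ipv4.neigh.default.gc_thresh3=8192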