The next question to answer after writing How do Kubernetes and Docker create IP Addresses?! is “How does Docker handle publishing ports?”
In the previous post, we created our own network namespaces and virtual interfaces, and assigned IP addresses to these virtual interfaces. Now we’ll learn how to make a request to 127.0.0.1:8080 or 192.168.0.100:8080 (a local IP address) and forward the request to an HTTP server running in our network namespace. This will help us understand what Docker does when the docker run command is instructed to publish ports.
Docker (version 19.03.6) uses a couple of techniques, including a proxy that binds a port in the host network namespace and makes requests to a container’s network namespace.
This post focuses on the other technique Docker uses, iptables, which can also be used to forward requests from a port in the host network namespace to an IP address and port residing in another network namespace.
Note: This post only works on Linux. I’m using Ubuntu 19.10, but this should work on other Linux distributions.
This post will assume you’re familiar with network namespaces and virtual interfaces. If not, I recommend going back and reading my previous post, How do Kubernetes and Docker create IP Addresses?!
create network namespaces and virtual devices
First, we’ll need to catch up to where How do Kubernetes and Docker create IP Addresses?! left off.
Let’s make sure IP forwarding is enabled.
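A minimal sketch of how that might be done, assuming the sysctl utility is available. The helper name enable_ip_forwarding is made up for illustration, and wrapping the command in a function means nothing changes until we call it.

```shell
# Hypothetical helper: turn on IPv4 forwarding for the running kernel.
enable_ip_forwarding() {
  sudo sysctl --write net.ipv4.ip_forward=1
}
```

Calling enable_ip_forwarding applies the setting to the running kernel only; it is not persisted across reboots.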
Then recreate the setup from that post with a series of bash commands.
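As a rough sketch of what those commands could look like: the namespace names, bridge name, IP addresses, and DNS server come from this post, while the bridge address (10.0.0.1/24), the veth interface names, and the helper function names are assumptions made for illustration.

```shell
BRIDGE=bridge_home
BRIDGE_ADDR=10.0.0.1/24   # assumption: the bridge takes the first address in 10.0.0.0/24

# Hypothetical helper: create the virtual bridge and bring it up.
create_bridge() {
  sudo ip link add dev "$BRIDGE" type bridge
  sudo ip address add "$BRIDGE_ADDR" dev "$BRIDGE"
  sudo ip link set dev "$BRIDGE" up
}

# Hypothetical helper: connect_namespace NAME ADDRESS
# Creates netns_NAME, wires it to the bridge with a veth pair, and assigns ADDRESS.
connect_namespace() {
  ns_name=$1
  ns_address=$2
  sudo ip netns add "netns_$ns_name"
  # ip netns exec uses /etc/netns/<namespace>/resolv.conf for DNS inside the namespace.
  sudo mkdir -p "/etc/netns/netns_$ns_name"
  sudo sh -c "echo 'nameserver 8.8.8.8' > /etc/netns/netns_$ns_name/resolv.conf"
  sudo ip netns exec "netns_$ns_name" ip link set dev lo up
  # One end of the veth pair stays on the host and joins the bridge.
  sudo ip link add dev "veth_$ns_name" type veth peer name "veth_ns_$ns_name"
  sudo ip link set dev "veth_$ns_name" master "$BRIDGE"
  sudo ip link set dev "veth_$ns_name" up
  # The other end moves into the namespace and receives the IP address.
  sudo ip link set dev "veth_ns_$ns_name" netns "netns_$ns_name"
  sudo ip netns exec "netns_$ns_name" ip link set dev "veth_ns_$ns_name" up
  sudo ip netns exec "netns_$ns_name" ip address add "$ns_address/24" dev "veth_ns_$ns_name"
  # Send everything else through the bridge address.
  sudo ip netns exec "netns_$ns_name" ip route add default via "${BRIDGE_ADDR%/*}"
}
```

Running create_bridge, then connect_namespace dustin 10.0.0.11 and connect_namespace leah 10.0.0.21, would build the layout used in the rest of this post.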
This will create two network namespaces (netns_dustin and netns_leah) connected to a virtual bridge named bridge_home via veth pairs. 10.0.0.11 and 10.0.0.21 will be assigned to the virtual ethernet interfaces in the netns_dustin and netns_leah network namespaces, respectively. Both network namespaces will be able to communicate with the public internet by using 8.8.8.8 for DNS.
For a visual aid, our setup now looks like this:
improve iptables commands from the previous post
In the previous post we added a set of iptables rules (don’t run those again now!). Those rules enabled traffic to travel across the bridge_home virtual bridge and enabled our network namespaces to communicate with the public internet.
These rules can be simplified and improved (the new versions are the ones to run now!).
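A sketch of what the simplified rules might look like, assuming the accept rules live in the FORWARD chain of the filter table and the masquerade rule in the POSTROUTING chain of the nat table. The helper name allow_bridge_traffic is made up for illustration.

```shell
# Hypothetical helper: accept traffic through bridge_home and masquerade namespace traffic.
allow_bridge_traffic() {
  # Accept forwarded packets that enter through the bridge.
  sudo iptables --table filter --append FORWARD --in-interface bridge_home --jump ACCEPT
  # Accept forwarded packets that leave through the bridge.
  sudo iptables --table filter --append FORWARD --out-interface bridge_home --jump ACCEPT
  # Rewrite the source of packets coming from the namespace address range.
  sudo iptables --table nat --append POSTROUTING --source 10.0.0.0/24 --jump MASQUERADE
}
```

Notice that none of the three rules names a physical interface.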
Now, what we’re saying is that any traffic entering or leaving the bridge_home interface may be accepted. And finally, we perform a masquerade on our traffic coming from the 10.0.0.0/24 range, which covers our veth pairs connected to bridge_home.
Note: we’ll cover what MASQUERADE means in a little bit.
This has the additional benefit that we don’t need to know the name of the physical interface. It also enables other physical interfaces to communicate with our network namespaces.
To validate the setup is working as intended, check that the host is able to communicate with the network namespaces and that the two network namespaces may communicate with each other as well.
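One way to run that check is with ping, sketched below. The helper name validate_setup and the choice of a single ping per target are assumptions.

```shell
# Hypothetical helper: stops at the first ping that fails.
validate_setup() {
  # Host to each network namespace.
  ping -c 1 10.0.0.11 &&
  ping -c 1 10.0.0.21 &&
  # Each network namespace to the other.
  sudo ip netns exec netns_dustin ping -c 1 10.0.0.21 &&
  sudo ip netns exec netns_leah ping -c 1 10.0.0.11
}
```

If validate_setup exits with a non-zero status, one of the four paths is not working yet.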
iptables tips
Throughout this post we’re going to be modifying iptables. A helpful command to view all iptables rules is sudo iptables-save, which prints the rules as-is. Great for seeing what’s going on! I recommend running it whenever we’re modifying iptables.
Note: iptables-save will show short argument names instead of long ones, so -A instead of --append.
In fact, this is a great way to make a backup of iptables. If we were to run sudo iptables-save > iptables.bak, we’d create a file named iptables.bak that we could restore our iptables from. To restore, run sudo iptables-restore < iptables.bak.
We can also restart the computer to get our old iptables back.
use iptables to perform a port forward
Now, on to the fun stuff. Let’s spin up a Python HTTP server in the netns_dustin network namespace.
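A sketch of one way to do that, assuming python3 is installed and using its built-in http.server module. The helper name start_http_server is made up for illustration.

```shell
# Hypothetical helper: start_http_server NAMESPACE PORT
# Serves the current directory over HTTP from inside the given network namespace.
start_http_server() {
  sudo ip netns exec "$1" python3 -m http.server "$2"
}
```

For this section that would be start_http_server netns_dustin 8080, left running in its own terminal.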
This will start an HTTP server running on port 8080.
Open another terminal and find your local IP address (ip address list). I’ll use 192.168.0.100 as the local IP address in this post, but please remember to replace it with your own.
If we then attempt to run curl 192.168.0.100:8080, we’ll get an expected error stating:
curl: (7) Failed to connect to 192.168.0.100 port 8080: Connection refused
We need to instruct iptables that when a request is made to port 8080 it should direct that traffic to 10.0.0.11:8080 instead. We can do that by adding a single iptables rule.
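Based on how the rule is described in this section (nat table, OUTPUT chain, TCP, destination port 8080, DNAT), it would look roughly like the sketch below. The helper name add_port_forward is made up for illustration.

```shell
# Hypothetical helper: forward locally created TCP traffic for port 8080 to 10.0.0.11:8080.
add_port_forward() {
  sudo iptables --table nat --append OUTPUT \
    --protocol tcp --match tcp --dport 8080 \
    --jump DNAT --to-destination 10.0.0.11:8080
}
```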
This adds a rule to the nat table. The nat table is used for Network Address Translation. In this case, we’re configuring a DNAT rule, which stands for Destination Network Address Translation. This rule will match any packet using TCP destined for port 8080. At that point, it’ll set the destination IP and port on the packet to 10.0.0.11:8080.
Note: Why --match tcp and --protocol tcp? --protocol tcp specifies the rule is for TCP, while --match tcp loads more specification options related to TCP like --dport, which we use in the above rule.
Note: There’s also SNAT (Source Network Address Translation), which is similar to DNAT but instead modifies the source of a packet to an explicit IP address/port. This is where MASQUERADE comes in. MASQUERADE is like SNAT, but does not require specifying an IP address in iptables. Updating the source of a packet is automatically handled when using MASQUERADE. I think of MASQUERADE as a dynamic SNAT.
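Enough with the notes. Now we can run curl 192.168.0.100:8080 again and get a response from our HTTP server.
Perfect! But now for some bad news. Try running curl 192.168.0.101:8080, and then try running curl www.google.com:8080.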
Both of these will “work” and get a response from our own HTTP server! This is undesirable, to say the least. Fortunately, iptables supports many options for rules. In the iptables rule above, we specified it needs to match on --dport (destination port) and TCP protocol. We can also specify a match on the destination IP.
First delete the previously created iptables rule, and then add a more specific rule that also matches on the destination IP.
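A sketch of that swap, assuming the earlier rule was added to the OUTPUT chain with the same options described above. The helper name restrict_to_local_ip is made up for illustration.

```shell
# Hypothetical helper: restrict_to_local_ip LOCAL_IP
restrict_to_local_ip() {
  local_ip=$1
  # Remove the rule that matched port 8080 on every destination.
  sudo iptables --table nat --delete OUTPUT \
    --protocol tcp --match tcp --dport 8080 \
    --jump DNAT --to-destination 10.0.0.11:8080
  # Add it back, now limited to a single destination address.
  sudo iptables --table nat --append OUTPUT \
    --destination "$local_ip" \
    --protocol tcp --match tcp --dport 8080 \
    --jump DNAT --to-destination 10.0.0.11:8080
}
```

With the address used in this post, the call would be restrict_to_local_ip 192.168.0.100.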
Now making a request to 192.168.0.101:8080 and www.google.com:8080 both fail, while a request to 192.168.0.100:8080 still works.
We can improve this rule even more. Unless 192.168.0.100 is a static IP, there’s a chance our local IP address could change in the future. That would mean we’d have to constantly update our iptables rule. Bleh!
iptables has another helpful feature we can use. Once again, delete the previously created rule and add an improved rule in its place.
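Sketched under the same assumptions as before (OUTPUT chain, nat table), the replacement could look like this. The helper name match_any_local_address is made up for illustration.

```shell
# Hypothetical helper: match_any_local_address PREVIOUS_IP
match_any_local_address() {
  previous_ip=$1
  # Remove the rule tied to one specific destination address.
  sudo iptables --table nat --delete OUTPUT \
    --destination "$previous_ip" \
    --protocol tcp --match tcp --dport 8080 \
    --jump DNAT --to-destination 10.0.0.11:8080
  # Match any destination address that belongs to this host instead.
  sudo iptables --table nat --append OUTPUT \
    --match addrtype --dst-type LOCAL \
    --protocol tcp --match tcp --dport 8080 \
    --jump DNAT --to-destination 10.0.0.11:8080
}
```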
Instead of specifying a destination IP address, the improved rule uses --match addrtype --dst-type LOCAL to match any local destination. If you’d like, you may verify the request still works by running curl 192.168.0.100:8080 again.
enable external traffic to be port forwarded
On a second device such as a phone or another computer on the same network, navigate to 192.168.0.100:8080 in a browser. This will fail.
The rule we created was for the OUTPUT chain in the nat table. The OUTPUT chain is used for locally created packets. The nat table has another chain for packets entering the system from an external source. This chain is named PREROUTING. Brent Salisbury has created a helpful GitHub Gist, Linux NetFilter, IP Tables and Conntrack Diagrams, as an overview of these different tables and chains.
To correctly route external traffic, we can define the exact same rule as above, but append it to the PREROUTING chain instead of the OUTPUT chain.
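In sketch form, only the chain name changes. The helper name forward_external_traffic is made up for illustration.

```shell
# Hypothetical helper: apply the same DNAT rule to packets arriving from other machines.
forward_external_traffic() {
  sudo iptables --table nat --append PREROUTING \
    --match addrtype --dst-type LOCAL \
    --protocol tcp --match tcp --dport 8080 \
    --jump DNAT --to-destination 10.0.0.11:8080
}
```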
Now on the second device, reload 192.168.0.100:8080 and we’ll get a response from our HTTP server!
create a custom iptables chain
We now have two identical rules on separate chains (PREROUTING and OUTPUT) in the nat table. We can create a custom chain to DRY up these rules.
To do so, we’ll create a new chain named DUSTIN in the nat table and then add our duplicated rule from above to the DUSTIN chain.
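A sketch of those two steps follows. It assumes the rule inside DUSTIN leaves off the addrtype match, which is consistent with the chain listing shown later in this post. The helper name create_dustin_chain is made up for illustration.

```shell
# Hypothetical helper: create the custom chain and move the port forward into it.
create_dustin_chain() {
  # Create an empty chain named DUSTIN in the nat table.
  sudo iptables --table nat --new-chain DUSTIN
  # The port forward now lives in one place.
  sudo iptables --table nat --append DUSTIN \
    --protocol tcp --match tcp --dport 8080 \
    --jump DNAT --to-destination 10.0.0.11:8080
}
```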
We’ll add another rule in the DUSTIN chain that will tell iptables to return to the rest of the chain calling this chain if none of the other rules in DUSTIN are a match.
sudo iptables \
--table nat \
--append DUSTIN \
--jump RETURN
Note: Technically our RETURN rule is not needed. If iptables reaches the end of a custom chain, it will proceed with the rules that follow the jump to the custom chain. Netfilter describes this in its Target Specifications documentation.
Let’s delete the old duplicated rules in the PREROUTING and OUTPUT chains. Then we can add two rules to instruct the PREROUTING and OUTPUT chains to use the DUSTIN chain whenever a LOCAL address type is used for the destination.
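Both steps apply to each of the two chains, so a loop keeps the sketch short. It assumes the old rules were created with the options shown earlier; the helper name use_dustin_chain is made up for illustration.

```shell
# Hypothetical helper: replace the duplicated rules with jumps to DUSTIN.
use_dustin_chain() {
  for chain in PREROUTING OUTPUT; do
    # Remove the duplicated port forward from the built-in chain.
    sudo iptables --table nat --delete "$chain" \
      --match addrtype --dst-type LOCAL \
      --protocol tcp --match tcp --dport 8080 \
      --jump DNAT --to-destination 10.0.0.11:8080
    # Hand packets with a local destination over to the DUSTIN chain.
    sudo iptables --table nat --append "$chain" \
      --match addrtype --dst-type LOCAL \
      --jump DUSTIN
  done
}
```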
We could have put the --match addrtype --dst-type LOCAL on the rule in the DUSTIN chain and omitted it from these two rules. However, in a little bit, we’ll add another rule to the DUSTIN chain, and since the --match addrtype --dst-type LOCAL is at the calling chain (PREROUTING or OUTPUT), we don’t have to remember to add it to every rule in the DUSTIN chain. This is what Docker does as well, so we’ll match them.
add another port forward in the custom chain
Let’s start another HTTP server in the netns_leah network namespace, this time listening on port 8081.
We’ll need to add another rule to the DUSTIN chain: the same kind of DNAT rule as before, appended with --dport 8081 and forwarding to 10.0.0.21:8081.
If we then make a request to port 8081 by running curl 192.168.0.100:8081, it’ll fail. Wait, fail?!
So rules in iptables are performed in order. If we list the existing rules in the DUSTIN chain via sudo iptables --table nat --list DUSTIN, we’ll see the following output:
Chain DUSTIN (2 references)
target prot opt source destination
DNAT tcp -- anywhere anywhere tcp dpt:http-alt to:10.0.0.11:8080
RETURN all -- anywhere anywhere
DNAT tcp -- anywhere anywhere tcp dpt:tproxy to:10.0.0.21:8081
So append is, well, appending to the end of the chain. We need to make sure the new rule is before the jump to RETURN. We can use --insert instead of --append to specify where to place the rule. First, delete the rule we just made, and then insert it again after our other port forwarding rule, but before the RETURN rule.
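As a sketch, assuming the port 8081 rule was appended with the same options as the port 8080 rule. The rule number 2 places it second in the chain, and the helper name move_rule_before_return is made up for illustration.

```shell
# Hypothetical helper: move the port 8081 rule ahead of the RETURN rule.
move_rule_before_return() {
  # Remove the rule that ended up after RETURN.
  sudo iptables --table nat --delete DUSTIN \
    --protocol tcp --match tcp --dport 8081 \
    --jump DNAT --to-destination 10.0.0.21:8081
  # Insert it at position 2: after the port 8080 rule, before RETURN.
  sudo iptables --table nat --insert DUSTIN 2 \
    --protocol tcp --match tcp --dport 8081 \
    --jump DNAT --to-destination 10.0.0.21:8081
}
```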
Now a request to 192.168.0.100:8081 will work.
docker-proxy and local network
So far we’ve been using the 192.168.0.100 IP address, but what happens if we try 127.0.0.1, 0.0.0.0, or localhost? Requests to port 8080 on any of those will fail.
By default, Linux will not apply our port forwarding to the local network, because loopback addresses are treated as martian while routing. As mentioned in the intro, Docker gets around this by running another process, docker-proxy, when running a container exposing ports.
This proxy has a few benefits:
- works for forwarding ports on the local network as this will not use iptables
- works when Docker is configured to not modify iptables
- binds the port on the host network to verify no other process is or could use the same port
We can do something similar using socat. On Ubuntu, socat may be installed via sudo apt install socat. Once socat is installed we can start it as a proxy.
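A sketch of one possible invocation. The fork and reuseaddr options are assumptions added so the proxy can serve more than one connection and be restarted quickly, and the helper name proxy_port is made up for illustration.

```shell
# Hypothetical helper: proxy_port LISTEN_PORT DESTINATION
# Listens on LISTEN_PORT in the host network namespace and relays each connection.
proxy_port() {
  socat "TCP-LISTEN:$1,fork,reuseaddr" "TCP:$2"
}
```

For our HTTP server that would be proxy_port 8080 10.0.0.11:8080, left running in its own terminal.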
This instructs socat to listen on port 8080 and forward requests to 10.0.0.11:8080.
Next, we’ll need to make sure a MASQUERADE is being performed on requests coming from the local network into our bridge_home device.
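One way to express that, and this is an assumption rather than the exact rule, is to masquerade packets that leave through bridge_home with a source address belonging to the host. The helper name masquerade_local_sources is made up for illustration.

```shell
# Hypothetical helper: rewrite host-local source addresses on packets entering the bridge.
masquerade_local_sources() {
  sudo iptables --table nat --append POSTROUTING \
    --out-interface bridge_home \
    --match addrtype --src-type LOCAL \
    --jump MASQUERADE
}
```

The --src-type LOCAL match mirrors the --dst-type LOCAL match used earlier, only applied to the source address.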
At this point the requests to 127.0.0.1, 0.0.0.0, and localhost on port 8080 will work.
forward ports on the local network using iptables
I’m stubborn and don’t like admitting defeat to computers, so I want to understand how to use iptables to forward a port on the local network. Be sure to terminate the socat process (CTRL+c should do the trick).
We’ll need to enable the route_localnet option so that we can instruct Linux to properly direct our port forwarding on the local network via iptables. With it enabled, the kernel stops treating loopback addresses as martian while routing.
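A sketch, assuming we enable the option for all interfaces rather than a single one. The helper name enable_route_localnet is made up for illustration.

```shell
# Hypothetical helper: allow 127.0.0.0/8 addresses to be routed (running kernel only).
enable_route_localnet() {
  sudo sysctl --write net.ipv4.conf.all.route_localnet=1
}
```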
After enabling routing on the local network and adding the MASQUERADE rule in the previous section, those same requests to 127.0.0.1, 0.0.0.0, and localhost will work, and without a proxy!
some future research topics
Learning all of this stuff typically ends with me having more questions such as:
- Why does Docker not enable local network routing?
- Update (August 1st, 2020): Tabitha Sable wrote about a Kubernetes CVE where enabling routing on the local network is the culprit. So yeah, bad idea.
- How to better debug iptables? I’m familiar with using TRACE and LOG, but please let me know if you have tools you like to use instead!
- Not a question, but I want to learn more about conntrack. Docker uses this. conntrack seems to “remember” the iptables rule result for a given source, so that not every packet from the same source has to traverse all of the iptables rules.
Did I explain something poorly or, worse, wrong? Know the answer to one of the above questions? Then please feel free to reach out and let me know on Twitter, LinkedIn, or GitHub.