Linux kernel TCP smoothed-RTT estimation

Posted: February 18th, 2018 | Filed under: Linux, networking, tcp

Recently I decided to look under the hood to see how exactly srtt is calculated in Linux. The actual srtt calculation (an Exponentially Weighted Moving Average) is fairly straightforward, but what goes in as input to that calculation under various scenarios is interesting and very important for getting a correct RTT estimate.

It is also useful to note the difference between Linux and FreeBSD in this regard: whenever it can, Linux avoids trusting the value provided by the TCP Timestamps option, as middle-boxes can meddle with it.

Basic algorithm is:
For non-retransmitted packets, use saved packet send timestamp and ack arrival time.
For retransmitted packets, use the timestamp option; if that’s not enabled, no RTT is calculated for such packets.
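The two rules above can be sketched in plain C. This is a simplified userspace illustration, not kernel code: the struct `ack_info`, its fields and the function `pick_rtt_sample_us` are hypothetical names, and a return value of -1 stands for "no sample taken".

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of what is known when an ACK arrives. */
struct ack_info {
    bool     retransmitted;   /* was the acked packet ever retransmitted? */
    uint64_t sent_us;         /* saved send timestamp of the packet (usec) */
    uint64_t ack_arrival_us;  /* arrival time of the ACK (usec) */
    bool     has_ts_option;   /* is the TCP Timestamps option in use? */
    int64_t  ts_rtt_us;       /* RTT derived from the echoed timestamp (usec) */
};

/* Returns an RTT sample in microseconds, or -1 if no sample can be taken. */
int64_t pick_rtt_sample_us(const struct ack_info *a)
{
    if (!a->retransmitted)
        return (int64_t)(a->ack_arrival_us - a->sent_us);
    if (a->has_ts_option)
        return a->ts_rtt_us;
    return -1; /* retransmitted and no timestamps: skip this sample */
}
```

The real code path is more involved (SACK adds a third source of samples), as the walk through the kernel functions shows.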

Let’s look at the code. I am using net-next.
When a TCP sender sends packets, it has to wait for acks for those packets before throwing them away. It stores them in a queue called ‘retransmission queue’.
When sent packets get acked, tcp_clean_rtx_queue() gets called to clear those packets from the retransmission queue.

A few useful variables in that function are:
seq_rtt_us – uses the first packet from the acked range
ca_rtt_us – uses the last packet from the acked range (mainly used for congestion control)
sack_rtt_us – uses a SACKed (selectively acknowledged) packet
tcp_mstamp is a tcp_sock member which represents timestamp of most recent packet received/sent. It gets updated by tcp_mstamp_refresh().

For a clean ack (not sack), seq_rtt_us = ca_rtt_us (as there is no range)

If such a clean ack is also for a non-retransmitted packet,
[sourcecode language="c"]seq_rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, first_ackt);[/sourcecode]

and for a sack which is again for a non-retransmitted packet,
[sourcecode language="c"]sack_rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, sack->first_sackt);[/sourcecode]

The code that updates sack->first_sackt is in tcp_sacktag_one(), where it gets populated when the sack is for a non-retransmitted packet.

tcp_stamp_us_delta() returns the difference between two of the timestamps that the stack maintains, in microseconds.
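A userspace sketch of such a helper might look like the following. The name `stamp_us_delta` is hypothetical, and clamping a negative difference to zero is my reading of the kernel helper, so check it against the tree you are using.

```c
#include <stdint.h>

/* Microsecond difference t1 - t0; a negative difference is clamped to 0. */
uint32_t stamp_us_delta(uint64_t t1_us, uint64_t t0_us)
{
    int64_t delta = (int64_t)(t1_us - t0_us);
    return delta > 0 ? (uint32_t)delta : 0;
}
```

With the arrival time in `t1_us` and the saved send time in `t0_us`, the result is the RTT sample for that packet.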

Now tcp_ack_update_rtt() gets called which starts out with:
[sourcecode language="c"]
/* Prefer RTT measured from ACK's timing to TS-ECR. This is because
 * broken middle-boxes or peers may corrupt TS-ECR fields. But
 * Karn's algorithm forbids taking RTT if some retransmitted data
 * is acked (RFC6298).
 */
if (seq_rtt_us < 0)
    seq_rtt_us = sack_rtt_us;
[/sourcecode]

For acks that acknowledge retransmitted packets, seq_rtt_us would be negative.
But if there is a SACK timestamp from a non-retransmitted packet, the code uses that instead, as it carries a valid and useful timestamp.

Then it falls back to the timestamps provided by the TS option, but only if seq_rtt_us is still negative.
[sourcecode language="c"]
if (seq_rtt_us < 0 && tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
    flag & FLAG_ACKED) {
    u32 delta = tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;
    u32 delta_us = delta * (USEC_PER_SEC / TCP_TS_HZ);

    seq_rtt_us = ca_rtt_us = delta_us;
}
[/sourcecode]
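The tick arithmetic in that snippet can be tried out in userspace. The sketch below assumes 1 ms timestamp ticks (the `SKETCH_TS_HZ` value is an assumption for illustration) and uses the hypothetical name `tsecr_rtt_us`. Because both values are unsigned 32-bit, the subtraction still gives the right answer when the tick counter wraps around.

```c
#include <stdint.h>

#define SKETCH_USEC_PER_SEC 1000000u
#define SKETCH_TS_HZ        1000u   /* assumption: 1 ms timestamp ticks */

/* RTT in microseconds from the current tick count and the echoed tick count. */
uint32_t tsecr_rtt_us(uint32_t now_ticks, uint32_t echoed_ticks)
{
    uint32_t delta = now_ticks - echoed_ticks;  /* modulo 2^32, wrap-safe */
    return delta * (SKETCH_USEC_PER_SEC / SKETCH_TS_HZ);
}
```

For example, an echoed value of 1000 with the clock at 1050 gives a 50 ms sample, and a clock that has wrapped from 0xFFFFFFFB to 5 gives 10 ticks rather than a huge number.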

By this point, there is a seq_rtt_us value that can be fed into tcp_rtt_estimator(), which generates the smoothed RTT (more or less based on Van Jacobson’s SIGCOMM ’88 paper).
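For reference, here is the textbook form of that estimator as given in RFC 6298, with gains of 1/8 for the smoothed RTT and 1/4 for the variance. This is a floating-point sketch with hypothetical names (`rtt_state`, `rtt_update`). It is not the kernel's tcp_rtt_estimator(), which uses scaled integer arithmetic and differs in detail.

```c
#include <stdbool.h>

struct rtt_state {
    bool   has_sample;  /* false until the first measurement arrives */
    double srtt_us;     /* smoothed RTT */
    double rttvar_us;   /* RTT variation */
};

/* Feed one RTT sample (in microseconds) into the estimator. */
void rtt_update(struct rtt_state *s, double sample_us)
{
    if (!s->has_sample) {
        /* First measurement: SRTT = R, RTTVAR = R / 2. */
        s->srtt_us = sample_us;
        s->rttvar_us = sample_us / 2.0;
        s->has_sample = true;
        return;
    }
    double err = sample_us - s->srtt_us;
    double abs_err = err < 0 ? -err : err;
    /* RTTVAR is updated first, using the old SRTT. */
    s->rttvar_us += (abs_err - s->rttvar_us) / 4.0;  /* beta = 1/4 */
    s->srtt_us   += err / 8.0;                       /* alpha = 1/8 */
}
```

Starting from a first sample of 100 ms, a second sample of 180 ms moves srtt only to 110 ms, which shows how strongly the average damps a single outlier.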


Bhyve setup for tcp testing

Posted: January 9th, 2017 | Filed under: FreeBSD, tcp, virtualization

Here is how I test simple FreeBSD tcp changes with dummynet on bhyve. I’ve already written down how I do dummynet, so I’ll focus on the bhyve part.

Caution: The Handbook entry on bhyve is the true source. Please refer to it for exact information. This post is super quick and may contain not-entirely-correct things. Also, I am lazy and all this config is simply what I am using; you may need to tweak a bit here and there.

Setup:
I’ll create 3 bhyve guests: client, router and server:

client            router            server
192.168.1.227     192.168.1.228     192.168.1.229
10.10.10.10       10.10.10.11
                  10.10.11.11       10.10.11.10

Here, the 192.* addresses are for ssh and the 10.* addresses are for the guests to communicate among themselves.

First, create tap interfaces needed for all bhyve guests:

client has tap0 (ssh), tap1
router has tap2 (ssh), tap3, tap4
server has tap5 (ssh), tap6

ifconfig tap0 create
ifconfig tap1 create
ifconfig tap2 create
ifconfig tap3 create
ifconfig tap4 create
ifconfig tap5 create
ifconfig tap6 create

Now create bridge interfaces for the communication.

bridge0 contains re0, tap0, tap2, tap5
bridge1 contains tap1, tap3
bridge2 contains tap4, tap6

re0 is the host interface here.

ifconfig bridge0 create
ifconfig bridge0 addm re0 addm tap0 addm tap2 addm tap5
ifconfig bridge0 up
ifconfig bridge1 create
ifconfig bridge1 addm tap1 addm tap3
ifconfig bridge1 up
ifconfig bridge2 create
ifconfig bridge2 addm tap4 addm tap6
ifconfig bridge2 up

bridge0 connects the mgmt interfaces of all guests to re0 (the host interface), so the guests can reach the outside network and we can ssh into them.

bridge1 connects client to router and bridge2 connects router to server.

Now, let’s create VMs.

truncate -s 10G client.img
truncate -s 10G router.img
truncate -s 10G server.img

Setup/install VMs:

sh /usr/share/examples/bhyve/vmrun.sh -c 2 -m 2048M -t tap0 -t tap1 -d client.img -i -I iso client
sh /usr/share/examples/bhyve/vmrun.sh -c 2 -m 2048M -t tap2 -t tap3 -t tap4 -d router.img -i -I iso router
sh /usr/share/examples/bhyve/vmrun.sh -c 2 -m 2048M -t tap5 -t tap6 -d server.img -i -I iso server

Here, ‘iso’ is the path to the ISO image that you want to install from, and the last arguments (client, router, server) are the VM names.

Start the VMs:

sh /usr/share/examples/bhyve/vmrun.sh -c 2 -m 2048M -t tap0 -t tap1 -d client.img client
sh /usr/share/examples/bhyve/vmrun.sh -c 2 -m 2048M -t tap2 -t tap3 -t tap4 -d router.img router
sh /usr/share/examples/bhyve/vmrun.sh -c 2 -m 2048M -t tap5 -t tap6 -d server.img server

Stop a VM:

bhyvectl --force-poweroff --vm=

To set up networking, you’d need the following in the rc.conf files:

client:
ifconfig_vtnet0="inet 192.168.1.227 netmask 255.255.255.0"
defaultrouter="192.168.1.1"
ifconfig_vtnet1="inet 10.10.10.10 netmask 255.255.255.0"
static_routes="inet1"
route_inet1="-host 10.10.11.10 10.10.10.11"

router:
ifconfig_vtnet0="inet 192.168.1.228 netmask 255.255.255.0"
defaultrouter="192.168.1.1"
ifconfig_vtnet1="inet 10.10.10.11 netmask 255.255.255.0"
ifconfig_vtnet2="inet 10.10.11.11 netmask 255.255.255.0"

server:
ifconfig_vtnet0="inet 192.168.1.229 netmask 255.255.255.0"
defaultrouter="192.168.1.1"
ifconfig_vtnet1="inet 10.10.11.10 netmask 255.255.255.0"
static_routes="inet1"
route_inet1="-host 10.10.10.10 10.10.11.11"

The static route entries make sure routes are set up correctly for the client and server to communicate with each other.
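To see why those entries are needed, it helps to check which addresses are on-link for the client. The small C sketch below (the function name `same_subnet` is hypothetical) compares two IPv4 addresses under a netmask: 10.10.10.11 is on the client's own subnet, while 10.10.11.10 is not, so traffic to the server has to be pointed at the router explicitly.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdbool.h>
#include <sys/socket.h>

/*
 * Returns true if dst lies in the same IPv4 subnet as addr/netmask.
 * Returns false if any of the strings fails to parse.
 */
bool same_subnet(const char *addr, const char *netmask, const char *dst)
{
    struct in_addr a, m, d;

    if (inet_pton(AF_INET, addr, &a) != 1 ||
        inet_pton(AF_INET, netmask, &m) != 1 ||
        inet_pton(AF_INET, dst, &d) != 1)
        return false;
    /* All three values are in network byte order, so masking is consistent. */
    return (a.s_addr & m.s_addr) == (d.s_addr & m.s_addr);
}
```

Calling it with the client's address and netmask returns true for the router (10.10.10.11) and false for the server (10.10.11.10).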

The router would also need the following in /etc/sysctl.conf to be able to pass traffic between client and server.

net.inet.ip.forwarding=1

Try pinging client from server or the other way around to make sure networking is working:

root@server:~ # ping 10.10.10.10
PING 10.10.10.10 (10.10.10.10): 56 data bytes
64 bytes from 10.10.10.10: icmp_seq=0 ttl=63 time=0.718 ms
64 bytes from 10.10.10.10: icmp_seq=1 ttl=63 time=0.999 ms
64 bytes from 10.10.10.10: icmp_seq=2 ttl=63 time=0.553 ms
64 bytes from 10.10.10.10: icmp_seq=3 ttl=63 time=0.553 ms
^C
--- 10.10.10.10 ping statistics ---
4 packets transmitted, 4 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 0.553/0.706/0.999/0.182 ms

Working networking setup on the guest looks something like this:

root@server:~ # ifconfig
vtnet0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80028<VLAN_MTU,JUMBO_MTU,LINKSTATE>
        ether xx:xx:xx:xx:xx:xx
        inet 192.168.1.229 netmask 0xffffff00 broadcast 192.168.1.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet 10Gbase-T 
        status: active
vtnet1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80028<VLAN_MTU,JUMBO_MTU,LINKSTATE>
        ether xx:xx:xx:xx:xx:xx
        inet 10.10.11.10 netmask 0xffffff00 broadcast 10.10.11.255
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet 10Gbase-T 
        status: active
lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> metric 0 mtu 16384
        options=600003<RXCSUM,TXCSUM,RXCSUM_IPV6,TXCSUM_IPV6>
        inet6 ::1 prefixlen 128
        inet6 fe80::1%lo0 prefixlen 64 scopeid 0x3
        inet 127.0.0.1 netmask 0xff000000
        nd6 options=21<PERFORMNUD,AUTO_LINKLOCAL>
        groups: lo

Working networking setup on the host looks something like this:

tap0: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80000
        ether xx:xx:xx:xx:xx:xx
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: tap
        Opened by PID 26035
tap1: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80000
        ether xx:xx:xx:xx:xx:xx
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: tap
        Opened by PID 26035
tap2: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80000
        ether xx:xx:xx:xx:xx:xx
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: tap
        Opened by PID 26093
tap3: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80000
        ether xx:xx:xx:xx:xx:xx
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: tap
        Opened by PID 26093
tap4: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80000
        ether xx:xx:xx:xx:xx:xx
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: tap
        Opened by PID 26093
tap5: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80000
        ether xx:xx:xx:xx:xx:xx
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: tap
        Opened by PID 25977
tap6: flags=8943<UP,BROADCAST,RUNNING,PROMISC,SIMPLEX,MULTICAST> metric 0 mtu 1500
        options=80000
        ether xx:xx:xx:xx:xx:xx
        nd6 options=29<PERFORMNUD,IFDISABLED,AUTO_LINKLOCAL>
        media: Ethernet autoselect
        status: active
        groups: tap
        Opened by PID 25977
bridge0: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether xx:xx:xx:xx:xx:xx
        inet 192.168.1.224 netmask 0xffffff00 broadcast 192.168.1.255
        nd6 options=1
        groups: bridge
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: tap5 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 13 priority 128 path cost 2000000
        member: tap2 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 10 priority 128 path cost 2000000
        member: tap0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 8 priority 128 path cost 2000000
        member: re0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 5 priority 128 path cost 20000
bridge1: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether xx:xx:xx:xx:xx:xx
        nd6 options=1
        groups: bridge
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: tap3 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 11 priority 128 path cost 2000000
        member: tap1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 9 priority 128 path cost 2000000
bridge2: flags=8843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST> metric 0 mtu 1500
        ether xx:xx:xx:xx:xx:xx
        nd6 options=1
        groups: bridge
        id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
        maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
        root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
        member: tap6 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 14 priority 128 path cost 2000000
        member: tap4 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
                ifmaxaddr 0 port 12 priority 128 path cost 2000000

Drop a packet

Posted: October 14th, 2015 | Filed under: FreeBSD, networking, tcp

A few months back when I started looking into improving FreeBSD TCP’s response to packet loss, I looked around for traffic simulators which can do deterministic packet drop for me.

I had used dummynet(4) before so I thought of using it, but the problem is that it only provides probabilistic drops. You can specify dropping 10% of the total packets, for example. I came across the dpd work from CAIA, Swinburne University, but it was written for FreeBSD7 and I couldn’t port it forward to FreeBSD11 with reasonable time and effort, as ipfw/dummynet has changed quite a bit.

So I decided to hack dummynet to provide me deterministic drops. Here is the patch: drop.patch
(Yes, it’s a hack and it needs polishing.)

Here is how I use it:
Setup:

client              dummynet          server
10.10.10.10  <--->  10.10.10.11
                    10.10.11.11 <---> 10.10.11.12

Both client and server need their routing tables set up correctly so that they can reach each other.

Dummynet node is the traffic shaping node here. We need to enable forwarding between interfaces:

sysctl net.inet.ip.forwarding=1

We need to set up links (called ‘pipes’) and their parameters on the dummynet node like this:

# ipfw add pipe 100 ip from 10.10.11.12 to 10.10.10.10 out 
# ipfw add pipe 101 ip from 10.10.10.10 to 10.10.11.12 out
# ipfw pipe 100 config mask proto TCP src-ip 10.10.11.12 dst-ip 10.10.10.10 pls 3,4,5 plsr 7
# ipfw pipe 101 config mask proto TCP src-ip 10.10.10.10 dst-ip 10.10.11.12

‘pls 3,4,5 plsr 7’ – is the new configuration that the patch provides here.
pls : packet loss sequence
plsr : repeat frequency for the loss pattern

In the example above, it configures the pipe 100 to drop 3rd, 4th and 5th packet and repeat this pattern at every 7 packets going from server to client. So it’d also drop 10th, 11th and 12th packets and so on and so forth.
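The repeat rule can be modelled in a few lines of C. This is only a sketch of the behaviour described above, not code from the patch: `should_drop` and its parameters are hypothetical names, packets are counted from 1, and returning false for a `plsr` of 0 is an arbitrary choice made here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Decide whether packet number pkt_no (1-based) is dropped.
 * pls lists the positions to drop inside one period, plsr is the period length.
 */
bool should_drop(uint64_t pkt_no, const unsigned *pls, size_t pls_len,
                 unsigned plsr)
{
    if (pkt_no == 0 || plsr == 0)
        return false;

    /* Position of this packet within the current period, in 1..plsr. */
    unsigned pos = (unsigned)((pkt_no - 1) % plsr) + 1;

    for (size_t i = 0; i < pls_len; i++)
        if (pls[i] == pos)
            return true;
    return false;
}
```

With `pls = {3, 4, 5}` and `plsr = 7`, packets 3, 4, 5, 10, 11 and 12 are dropped while 1, 2, 6, 7, 8 and 9 pass, matching the example.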

Side note: delay, bw and queue depth are other very useful parameters that you can set to make the link behave however you want. For example, ‘delay 5ms bw 40Mbps queue 50Kbytes’ would create a link with 10ms RTT (5ms in each direction, when applied to both pipes), 40Mbps bandwidth and 50Kbytes of queue capacity. Queue depth is usually chosen based on the BDP (bandwidth-delay product) of the link. Dummynet drops packets once the queue limit is reached.
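The 50Kbytes figure in that example is simply the BDP of the link. A quick way to work it out is shown below, as a sketch with a hypothetical helper name (`bdp_bytes`) and assuming decimal units (1 Mbps = 1,000,000 bits/s, 1 Kbyte = 1000 bytes).

```c
#include <stdint.h>

/* Bandwidth-delay product in bytes for a bandwidth in bits/s and an RTT in usec. */
uint64_t bdp_bytes(uint64_t bw_bits_per_sec, uint64_t rtt_us)
{
    /* bits/s * usec = bits * 1e6; divide by 8 for bytes, then by 1e6. */
    return bw_bits_per_sec * rtt_us / 8 / 1000000;
}
```

For 40Mbps and a 10ms RTT, `bdp_bytes(40000000, 10000)` gives 50000 bytes.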

For simulations, I run a lighttpd web-server on the server which serves different sized objects and I request them via curl or wget from the client. I have tcpdump running on any/all of four interfaces involved to observe traffic and I can see specified packets getting dropped by dummynet.
sysctl net.inet.ip.dummynet.io_pkt_drop is incremented with each packet that dummynet drops.

Future work:
* Work on getting this patch committed into FreeBSD-head.
* sysctl net.inet.ip.dummynet.io_pkt_drop increments on any type of loss (which includes queue overflow and any other random error) so I am planning to add a more specific counter to show explicitly dropped packets only.
* I’ve (unsuccessfully) tried adding deterministic delay to dummynet so that we can delay specific packet(s), which can be useful in simulating link delays and also in debugging delay-based congestion control algorithms. Turns out it’s trickier than I thought. I’d like to resume working on it as time permits.


Improving FreeBSD’s transport layer

Posted: October 9th, 2015 | Filed under: FreeBSD, networking, tcp

FreeBSD’s network stack is quite stable but lacks some of the improvements/features available in other OSes.

A bunch of us have started an effort to try and identify current problems to improve transport layer (TCP, UDP, SCTP and others) for FreeBSD:

Transport Protocols wiki

Traditionally, freebsd-net has been the mailing list where networking problems get discussed, but some have complained that it is too spammy and too focused on NIC driver issues. So a new mailing list has been created to specifically talk about transport-level protocols: transport@

We’ve also started creating a list of TCP-related RFCs and their support status in FreeBSD, to have a single point of reference.

The plan is to have a coordinated effort to improve TCP, UDP, etc., so if you are interested in any of those protocols, please join the mailing list and help FreeBSD. :)