Linux kernel TCP smoothed-RTT estimation

Posted: February 18th, 2018 | Author: | Filed under: Linux, networking, tcp | Tags: , , , , | Comments Off on Linux kernel TCP smoothed-RTT estimation

Recently I decided to look under the hood to see how exactly srtt is calculated in Linux. Actual (Exponentially Weighted Moving Average) srtt calculation is a rather straight-forward part but what goes in as input to that calculation under various scenarios is interesting and very important in getting correct rtt estimate.

Also useful to note the difference between Linux and FreeBSD in this regard. Linux doesn’t trust tcp packet Timestamps option provided value whenever possible as middle-boxes can meddle with it.

Basic algorithm is:
For non-retransmitted packets, use saved packet send timestamp and ack arrival time.
For retransmitted packets, use timestamp option and if that’s not enabled, rtt is not calculated for such packets.

Let’s look at the code. I am using net-next.
When a TCP sender sends packets, it has to wait for acks for those packets before throwing them away. It stores them in a queue called ‘retransmission queue’.
When sent packets get acked, tcp_clean_rtx_queue() gets called to clear those packets from the retransmission queue.

A few useful variables in that function are:
seq_rtt_us – uses first packet from ackd range
ca_rtt_us – uses last packet from ackd range (mainly used for congestion control)
sack_rtt_us – uses sacked ack
tcp_mstamp is a tcp_sock member which represents timestamp of most recent packet received/sent. It gets updated by tcp_mstamp_refresh().

For a clean ack (not sack), seq_rtt_us = ca_rtt_us (as there is no range)

If such a clean is also for a non-retransmitted packet,

seq_rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, first_ackt);

and for a sack which is again for a non-retransmitted packet,

sack_rtt_us = tcp_stamp_us_delta(tp->tcp_mstamp, sack->first_sackt);

Code that updates sack→first_sackt is in tcp_sacktag_one() where it gets populated when the sack is for a non-retransmitted packet.

tcp_stamp_us_delta() gets the difference with timestamp that the stack maintains.

Now tcp_ack_update_rtt() gets called which starts out with:

/* Prefer RTT measured from ACK's timing to TS-ECR. This is because
 * broken middle-boxes or peers may corrupt TS-ECR fields. But
 * Karn's algorithm forbids taking RTT if some retransmitted data
 * is acked (RFC6298).
if (seq_rtt_us < 0)
        seq_rtt_us = sack_rtt_us;

For acks acking retransmitted packets, seq_rtt_us would be -ve.
But if there is a SACK timestamp from a non-retransmitted packet, it would use that as it carries valid and useful timestamps.

Then it takes TS-opt provided timestamps only if seq_rtt_us is -ve.

if (seq_rtt_us < 0 && tp->rx_opt.saw_tstamp && tp->rx_opt.rcv_tsecr &&
    flag & FLAG_ACKED) {
        u32 delta = tcp_time_stamp(tp) - tp->rx_opt.rcv_tsecr;
        u32 delta_us = delta * (USEC_PER_SEC / TCP_TS_HZ);
        seq_rtt_us = ca_rtt_us = delta_us;

By this point, there is seq_rtt_us that can be fed into tcp_rtt_estimator() that’d generate smoothed-RTT (which is more or less based on SIGCOMM 88 paper by Van Jacobson).

Drop a packet

Posted: October 14th, 2015 | Author: | Filed under: FreeBSD, networking, tcp | Tags: , , | Comments Off on Drop a packet

A few months back when I started looking into improving FreeBSD TCP’s response to packet loss, I looked around for traffic simulators which can do deterministic packet drop for me.

I had used dummynet(4) before so I thought of using it but the problem is that it only provided probabilistic drops. You can specify dropping 10% of the total packets, for example. I came across dpd work from CAIA, Swinburne University but it was written for FreeBSD7 and I couldn’t port it forward to FreeBSD11 with reasonable time/efforts as ipfw/dummynet has changed quite a bit.

So I decided to hack dummynet to provide me deterministic drops. Here is the patch: drop.patch
(Yes, it’s a hack and it needs polishing.)

Here is how I use it:

client              dummynet          server  <--->

Both client and server need their routing tables setup correctly so that they can reach each other.

Dummynet node is the traffic shaping node here. We need to enable forwarding between interfaces:

sysctl net.inet.ip.forwarding=1

We need to setup links (called ‘pipes’) and their parameters on dummynet node like this:

# ipfw add pipe 100 ip from to out 
# ipfw add pipe 101 ip from to out
# ipfw pipe 100 config mask proto TCP src-ip dst-ip pls 3,4,5 plsr 7
# ipfw pipe 101 config mask proto TCP src-ip dst-ip

‘pls 3,4,5 plsr 7’ – is the new configuration that the patch provides here.
pls : packet loss sequence
plsr : repeat frequency for the loss pattern

In the example above, it configures the pipe 100 to drop 3rd, 4th and 5th packet and repeat this pattern at every 7 packets going from server to client. So it’d also drop 10th, 11th and 12th packets and so on and so forth.

Side note: delay, bw and queue depth are other very useful parameters that you can set for the link to simulate however you want the link to behave. For example: ‘delay 5ms bw 40Mbps queue 50Kbytes’ would create a link with 10ms RTT, 40Mbps bandwidth with 50Kbytes worth of queue depth/capacity. Queue depth is usually decided based on BDP (bandwidth delay product) of the link. Dummynet drops packets once the limit is reached.

For simulations, I run a lighttpd web-server on the server which serves different sized objects and I request them via curl or wget from the client. I have tcpdump running on any/all of four interfaces involved to observe traffic and I can see specified packets getting dropped by dummynet.
sysctl net.inet.ip.dummynet.io_pkt_drop is incremented with each packet that dummynet drops.

Future work:
* Work on getting this patch committed into FreeBSD-head.
* sysctl net.inet.ip.dummynet.io_pkt_drop increments on any type of loss (which includes queue overflow and any other random error) so I am planning to add a more specific counter to show explicitly dropped packets only.
* I’ve (unsuccessfully) tried adding deterministic delay to dummynet so that we can delay specific packet(s) which can be useful in simulating link delays and also in debugging any delay-based congestion control algorithms. Turns out it’s trickier that I thought. I’d like to resume working on it as time permits.

Improving FreeBSD’s transport layer

Posted: October 9th, 2015 | Author: | Filed under: FreeBSD, networking, tcp | Tags: , , | Comments Off on Improving FreeBSD’s transport layer

FreeBSD network stack is quite stable but lacks some of the improvements/features available in other OSes.

A bunch of us have started an effort to try and identify current problems to improve transport layer (TCP, UDP, SCTP and others) for FreeBSD:

Transport Protocols wiki

Traditionally, freebsd-net has been the mailing list where networking problems get discussed but some have complained it to be too spammy and too focused on NIC drivers related issues. So a new mailing list has been created to specifically talk about transport level protocols: transport@

We’ve also started creating a list of TCP related RFCs and their support for FreeBSD to have a single point of reference.

Plan is to have a coordinated effort to improve TCP, UDP, etc.. so if you are interested in any of those protocols, please join the mailing list and help FreeBSD. :-)