[e2e] Are Packet Trains / Packet Bursts a Problem in TCP?

Tue Sep 26 12:46:27 PDT 2006

On Sep 26, 2006, at 10:51 AM, Detlef Bosau wrote:

> But what I´m curious about in all (sender) pacing approaches is:  
> Where does the pacing rate stem from? Typically, it´s obtained from  
> one of the numerous TCP formulae (Padhye, Mathis...), but these are  
> first (admittedly rough) estimates and second need some  
> measurements to have an initial value set.
>
> Moreover, widely deployed pacing would practically replace the  
> currently used AIMD mechanism for achieving a fair share, because  
> the currently used AIMD probing leads to "microbursts", which means  
> that the TCP sender _exceeds_ its fair share of rate for short  
> periods of time. This doesn´t matter as TCP is not rate  
> controlled.  And the final rate is corrected by ACK clocking.
> How does this work with sender pacing, i.e. putting a leaky bucket   
> or flow shaper or something like that in the TCP sender?

The big problem in pacing is determining the rate available at the  
bottleneck.

Let's assume a truly perfect case - the bottleneck in the network has  
a total capacity of N bps, and currently there are M sessions using  
it. The M sessions, by magic, are operating in perfect TDM fashion -  
as the last bit of one message departs, the first bit of the next  
message becomes ready to send, and no queue ever forms.

Is there room for one more bit on the wire? Well, if this was a TDM  
network, no; something would get dropped. But it's not. So we have a  
new sender that decides to start a TCP session and sends a SYN. What  
happens is that his packet collides with some other packet, and a  
queue one packet deep results. The new session gets very little  
information about rate out of that - there is clearly capacity for at  
least one SYN per RTT available, but he gets no idea what he is  
competing with. But the other M sessions get a teensy bit of  
information - their Acks get delayed a few nanoseconds, and as a  
result they send their next data a few nanoseconds later. Their  
average RTT becomes some function of sizeof(SYN)/M longer, and as a  
result they each slow down if ever so slightly.

Having no information about all this, the new session probes further,  
sending one or more data segments. Let's assume that it sends several  
(if it doesn't do so in the next RTT, it will do so in some  
subsequent one), and that the receiver replies with two or more Acks.  
The sender can infer that the capacity at the bottleneck is no  
smaller than

            Bits acknowledged only by second Ack
      -------------------------------------------------
      Interval between arrivals of first and second Ack

If its K segments were transmitted by the bottleneck back to back,  
this would be a measure of the rate of the bottleneck (Keshev). If  
packets from some other segments were interleaved, this would give  
the rate at the bottleneck less the capacity used by those  
interleaving segments. It won't EXCEED the rate at the bottleneck,  
but in this example it seriously overestimates the capacity unused by  
other sessions and therefore available to the new session at the  
bottleneck.

Let's assume that it picked some rate and paced traffic to that rate.  
It could do that calculation on every Ack it received, and feed those  
into the moving average. There are a couple of possible scenarios,  
the most benign of which include "the other sessions all slow down so  
that this traffic all goes through at that rate" and "all other  
sessions cease, so that the bottleneck is ONLY carrying this  
session's traffic". In both of these relatively benign cases, the  
above calculation says that the rate available is the same number.  
 From the sender's perspective, is that because that is the rate  
available at the bottleneck? Or is it because that is the rate at  
which it sent its probing traffic? Should it probe at a higher rate?  
If so, at what rate?

So the new session has to decide to what extent it believes this new  
pice of data. Maybe it simply jumps to that new rate. I'll leave the  
analysis to the reader. Or may it includes that new rate estimate in  
some form of moving average. If it does things just right, it will  
ramp up its speed at the rate that the sum  of the competing sessions  
slow down, and all will be cool. Anyone care to lay odds that it  
would be "just right"?

Now, change the scenario. The bottleneck is completely unused. The  
new session comes up, and is trying to infer what capacity is  
available, following the procedures it followed to get things "just  
right". Guess what? It asymptotically approaches fully using the  
link, but takes one heck of a long time to get there.

The big problem in pacing is determining the rate available at the  
bottleneck. And by the way, bottlenecks would have to have heap big  
magic to operate in anything resembling a TDM fashion.

Operational experience in the ATM world gives us another side of  
this. If you give an ATM VC a certain rate and engineer that rate  
through the network, ATM works a lot like TDM works - it uses the  
rate authorized, and doesn't over-use it. If you screw up the  
engineering, ATM acts like it is screwed up.

Hence, I think that what the research, logic, and operational  
experience tell us is that while pacing can be useful to break down  
big bursts, in the end pacing doesn't achieve a what the Internet  
community has been trying to achieve for a long time, which is full  
use of the bottleneck without overburdening it. We can achieve  
engineered use, like ATM, or we can from time to time over-burden a  
bottleneck, but we can't avoid both.

So I will argue that using pacing for events, like spreading out the  
effect of a sequence of lost Acks, can be helpful, but in the end   
the procedures we use in New Reno etc are superior in the general  
case. What we can do to improve those procedures might be related to  
RCP or to some of the work that is being done on other-than-loss- 
triggered procedures; we'll see. But pacing is not a silver bullet.