[e2e] Are Packet Trains / Packet Bursts a Problem in TCP?
Fred Baker
fred at cisco.com
Tue Sep 26 12:46:27 PDT 2006
On Sep 26, 2006, at 10:51 AM, Detlef Bosau wrote:
> But what I´m curious about in all (sender) pacing approaches is:
> Where does the pacing rate stem from? Typically, it´s obtained from
> one of the numerous TCP formulae (Padhye, Mathis...), but these are
> first (admittedly rough) estimates and second need some
> measurements to have an initial value set.
>
> Moreover, widely deployed pacing would practically replace the
> currently used AIMD mechanism for achieving a fair share, because
> the currently used AIMD probing leads to "microbursts", which means
> that the TCP sender _exceeds_ its fair share of rate for short
> periods of time. This doesn´t matter as TCP is not rate
> controlled. And the final rate is corrected by ACK clocking.
> How does this work with sender pacing, i.e. putting a leaky bucket
> or flow shaper or something like that in the TCP sender?
The big problem in pacing is determining the rate available at the
bottleneck.
Let's assume a truly perfect case - the bottleneck in the network has
a total capacity of N bps, and currently there are M sessions using
it. The M sessions, by magic, are operating in perfect TDM fashion -
as the last bit of one message departs, the first bit of the next
message becomes ready to send, and no queue ever forms.
Is there room for one more bit on the wire? Well, if this was a TDM
network, no; something would get dropped. But it's not. So we have a
new sender that decides to start a TCP session and sends a SYN. What
happens is that his packet collides with some other packet, and a
queue one packet deep results. The new session gets very little
information about rate out of that - there is clearly capacity for at
least one SYN per RTT available, but he gets no idea what he is
competing with. But the other M sessions get a teensy bit of
information - their Acks get delayed a few nanoseconds, and as a
result they send their next data a few nanoseconds later. Their
average RTT becomes some function of sizeof(SYN)/M longer, and as a
result they each slow down if ever so slightly.
Having no information about all this, the new session probes further,
sending one or more data segments. Let's assume that it sends several
(if it doesn't do so in the next RTT, it will do so in some
subsequent one), and that the receiver replies with two or more Acks.
The sender can infer that the capacity at the bottleneck is no
smaller than
Bits acknowledged only by second Ack
-------------------------------------------------
Interval between arrivals of first and second Ack
If its K segments were transmitted by the bottleneck back to back,
this would be a measure of the rate of the bottleneck (Keshev). If
packets from some other segments were interleaved, this would give
the rate at the bottleneck less the capacity used by those
interleaving segments. It won't EXCEED the rate at the bottleneck,
but in this example it seriously overestimates the capacity unused by
other sessions and therefore available to the new session at the
bottleneck.
Let's assume that it picked some rate and paced traffic to that rate.
It could do that calculation on every Ack it received, and feed those
into the moving average. There are a couple of possible scenarios,
the most benign of which include "the other sessions all slow down so
that this traffic all goes through at that rate" and "all other
sessions cease, so that the bottleneck is ONLY carrying this
session's traffic". In both of these relatively benign cases, the
above calculation says that the rate available is the same number.
From the sender's perspective, is that because that is the rate
available at the bottleneck? Or is it because that is the rate at
which it sent its probing traffic? Should it probe at a higher rate?
If so, at what rate?
So the new session has to decide to what extent it believes this new
pice of data. Maybe it simply jumps to that new rate. I'll leave the
analysis to the reader. Or may it includes that new rate estimate in
some form of moving average. If it does things just right, it will
ramp up its speed at the rate that the sum of the competing sessions
slow down, and all will be cool. Anyone care to lay odds that it
would be "just right"?
Now, change the scenario. The bottleneck is completely unused. The
new session comes up, and is trying to infer what capacity is
available, following the procedures it followed to get things "just
right". Guess what? It asymptotically approaches fully using the
link, but takes one heck of a long time to get there.
The big problem in pacing is determining the rate available at the
bottleneck. And by the way, bottlenecks would have to have heap big
magic to operate in anything resembling a TDM fashion.
Operational experience in the ATM world gives us another side of
this. If you give an ATM VC a certain rate and engineer that rate
through the network, ATM works a lot like TDM works - it uses the
rate authorized, and doesn't over-use it. If you screw up the
engineering, ATM acts like it is screwed up.
Hence, I think that what the research, logic, and operational
experience tell us is that while pacing can be useful to break down
big bursts, in the end pacing doesn't achieve a what the Internet
community has been trying to achieve for a long time, which is full
use of the bottleneck without overburdening it. We can achieve
engineered use, like ATM, or we can from time to time over-burden a
bottleneck, but we can't avoid both.
So I will argue that using pacing for events, like spreading out the
effect of a sequence of lost Acks, can be helpful, but in the end
the procedures we use in New Reno etc are superior in the general
case. What we can do to improve those procedures might be related to
RCP or to some of the work that is being done on other-than-loss-
triggered procedures; we'll see. But pacing is not a silver bullet.
More information about the end2end-interest
mailing list