[ppml] Longer prefixes burden the FIBs of DFZ routers

Robin Whittle rw at firstpr.com.au
Sun Aug 19 23:12:40 EDT 2007

In "Technical reason why /52,/56,/60,/64 are bad", Christopher
Morrow wrote:

> just a quick clarifying note (not to michael so much as eventually
> robin)
>> These high end routers use trie-based algorithms
>> involving lookups into gigabytes of slow, shared, DRAM. This
>> can only be done some number of bits at a time, where that
>> number may by 5 to probably 20 or maybe 24 at the absolute
>> maximum.
> Actually the normal DFZ sorts of devices as listed below will have
> to do lookups to a full 32 bits on many occasions... I suspect the
> trie-based lookups (and tcam lookups) are able today to do at
> least 32 bit lookups.

TCAM can be as wide as 72 bits, and if the router has enough TCAM
space in its FIB, it doesn't matter how many bits are looked at,
provided they fit within the 72 or 144 bit width of the TCAM.
(However TCAM is expensive, power-hungry, must be soldered to the
main board - can't be upgraded - and can be slow to update when the
classification rules need to be changed.)

There's no such thing as a 32 bit lookup unless what you need to
find is a byte or less and if you have 4 gigabytes of RAM to hold
the array, which no router's FIB has.

The FEC (Forwarding Equivalence Class) information to specify how
the packet should be switched within the router - to which interface
and which input queue of that interfaces - is apparently 16 or 32
bits or more in larger routers.

The high end routers CRS-1, M120 and MX960 do not use TCAM.
According to what I could find out:


they use:

   CRS-1   ~1 gigabyte DDR DRAM (Slow ~55ns latency?) shared by 188
           32 bit CPUs on a humongous ASIC.  This is for a 40Gbps

   M120    64 megabytes RL-DRAM (Reduced latency ~15nsec) controlled
           by an I-chip ASIC.  This is for a 10Gbps FIB.

   MX960   256 megabytes of SRAM - maybe DDR-II SRAM or QDR-II SRAM
           (the latter is the fastest RAM available, with latency
           and cycle times as low as 3.3nsec).  This is for a 40Gbps
           FIB, but maybe it has four I-chips - in which case each
           has 64 megabytes.

> In some cases the lookups actually are longer still as most of the
> DFZ-type routers implement ACL's via the route-table. So some TCAM
> based routers will use 'spare bits' in the tcam entry to cover
> port/protocol, some TCAM based routes are able to do 64+ bits of
> lookup actually since they can do in a single pass src and dest
> lookups (to do both acls and uRPF functionality).

I can imagine this being practical if the TCAM is wide enough, but
this is highly unlikely to scale to IPv6's long addresses.

> I suspect that the DRAM lookup type routers also have 64+ bits of
> lookup since they can do src/dest ip and port/protocol for acl
> functionality and uRPF.

There's no such thing as a single memory cycle 64 bit lookup!

I don't know enough about how these routers work, but I am wary that
people who know even less about of these expensive devices, which we
all pay for, assume routers can easily do things which are either
impossible or extremely costly in terms of memory cycles, power
dissipation, memory utilisation etc.

>>> This sounds really inefficient, compared to IPv4 in which DFZ
>>> routers in practice need to look at 24 bits, at most, of the
>>> destination address of IPv4 packets.  Since only about 1.23%
>>> of the advertised space is for prefixes of 21 bits or more:
> In practice they actually need more than 24 bits, see above and
> consult your local vendor. The DFZ is not made up of just /24 and
> larger routes :(

IPv4 routers need to be capable of analysing all 32 bits of each
packet's destination address, but with a trie-based lookup scheme,
depending on the "stride length" - such as 8 bits, then 6 then 6 or
whatever - most of the packets can be classified in two or three
memory lookups.  The Tree Bitmap algorithm, as described in:

   Tree bitmap: hardware/software IP lookups with incremental
   Source ACM SIGCOMM Computer Communication Review archive
   Volume 34 , Issue 2 (April 2004) Pages: 97 - 122
   Will Eatherton (Cisco Systems), George Varghese (UCSD), Zubin
   Dittia (Jibe Networks) portal.acm.org/citation.cfm?id=997160

(I have a copy of this paper . . .)

involves doing the first 8 bits in RAM on the ASIC with subsequent
reads to slow external shared DRAM, in strides such as 6 bits.
Longer strides are obviously faster at finding the solution, but
require more and more RAM.  The Tree-Bitmap algorithm anticipates
using a bunch of slow DRAM in four banks with the same data, so one
bank can be reading while the rest are recovering from their last
read operation.   So these routers don't necessarily have more than
256 megs of RAM - and the Juniper ones only seem to have 64 megs.

While the problems in routing and addressing are generally thought
to be primarily in the BGP control plane:


and while I don't know as much as I would like about router FIBs, I
really think people who are devising address policy should consider
the burden longer PI prefixes place on FIBs of DFZ routers.

For instance, each VoIP call (50 packets a second in each direction)
between hosts in two IPv6 PI /48 prefixes with 16 DFZ routers
en-route involves those routers collectively working through this
many bits of destination address per second:

  50 * 2 * 48 * 16 = 76,800!

This work is really hard, to be done within a microsecond or so,
with the fastest and most expensive RAM and CPUs.

Cisco had the largest ASIC in history fabricated by IBM to make
their FIB for the CRS-1 - 188 32 bit CPUs and heaps of other stuff
on one chip.  Juniper no-doubt went to similarly Herculean efforts
to create their I-chip and to provide fast ( = expensive, low
density and power-hungry) RAM so the I-chip could perform well.

The router manufacturers say they can do this, and more.  No router
manufacturer would want to be seen saying that they couldn't cope
with higher and higher demands.  But they can't cope with growth in
the number of advertised prefixes, or in the length of those
prefixes, without extremely complex, expensive, brute-force
techniques with all the power dissipation and bulk this entails.

All Internet users have to pay for all this stuff, and the power the
routers dissipate, because we all pay our ISPs to pay other ISPs and
transit providers - and they all have to pay big $$$ for routers
which can handle these longer and more numerous prefixes.

VoIP is supposed to be cheap and lightweight, but to me, with these
/48 PI prefixes, it looks like a huge burden on the DFZ routers.

So in general, to keep the DFZ routers simple, there should be as
few as possible advertised prefixes, with those prefixes being as
short as possible (meaning larger blocks of addresses).

No-one knows how many DFZ routers there are.  The best estimate is
from the iPlane project:


     Singlehomed routers:            87k
     Routers in Default Free Zone:  123k

     Total BGP routers:             210k

These are lower bounds:

   "These numbers are at most a lower bound on the numbers you are
    looking for, and we have no way to tell how tight this bound is.
    iPlane data is incomplete because of several reasons, such as
    limited number of vantage points, blocking of ICMP/UDP by
    routers, MPLS tunnels,....

    And getting realistic numbers for those is an unfeasible task as
    of now, at least for the entire Internet."

It seems the only way of reducing this load on all DFZ routers,
while allowing for millions of multihomed end-user networks (and
ideally for supporting portable address space for end-user networks
which need it) is to surround the DFZ with a ring of Ingress Tunnel
Routers (ITRs) and Egress Tunnel Routers, as is proposed with
LISP-NERD, LISP-CONS, eFIT-APT or Ivip.  Those ITRs and their
control system get a hammering, but make life easy for the DFZ
routers.  This page has links to all these proposals:


   - Robin

More information about the ARIN-PPML mailing list