[ppml] Longer prefixes burden the FIBs of DFZ routers

Tue Aug 21 01:05:09 EDT 2007

Hi Brian,

You wrote:

>> TCAM can be as wide as 72 bits, and if the router has enough TCAM
>> space in its FIB, it doesn't matter how many bits are looked at,
>> provided they fit within the 72 or 144 bit width of the TCAM.
> 
> Not quite - the number of entries must fit in the total amount of TCAM
> memory.

Yes - for brevity I didn't mention this.

> The specific entries themselves aren't relegated to only being
> the full bitwise representation of a prefix, even if that is the simplest
> scheme for storage and lookup.
> 
> E.g., alternatives that use some kind of symbol-mapping scheme, or
> hash scheme, or other way of reducing the maximum number of bits
> required on a lookup, are one way of reducing both total TCAM memory
> used, and number of bits per entry.

OK - I don't understand these hash approaches sufficiently.  My
impression is that they are a messy additional layer of complexity -
with some results taking much longer than most - in a field where
simplicity and speed, down to individual clock cycles, are crucial.

> But, even at 144 bits, i.e. two "slots" per v6 prefix compared to one
> per v4 prefix, if the number of slots used is not unreasonable, TCAM
> can do the job (and do it in one cycle).

TCAM is not used for the FEC (Forwarding Equivalence Class)
classification of packets in the recent high-end routers - CRS-1,
M120 or MX960.  It has some very messy update problems in addition
to the high cost, high power consumption, low physical density etc.
TCAM needs to drive a fast static RAM too.
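
For anyone following along who has not worked with TCAM, here is a
rough software model of the arrangement discussed above: rows of
value/mask pairs, with the index of the first matching row used to
read the result from a separate RAM.  It is only a sketch under my
own assumptions.  The names (TcamSketch, capacity, sram) are made up
for illustration, it handles IPv4 only, and the loop stands in for
what real TCAM does on all rows in parallel in one cycle.

```python
from ipaddress import ip_address, ip_network


class TcamSketch:
    """Software model of a ternary CAM: (value, mask) rows, first match wins."""

    def __init__(self, capacity):
        self.capacity = capacity  # hypothetical limit on the number of rows
        self.rows = []            # (value, mask) pairs, longest prefix first
        self.sram = []            # result for each row, read by match index

    def load(self, routes):
        """Load a dict of IPv4 prefix string -> result, longest prefix first."""
        nets = sorted(
            ((ip_network(prefix), result) for prefix, result in routes.items()),
            key=lambda item: item[0].prefixlen,
            reverse=True,
        )
        if len(nets) > self.capacity:
            # The number of entries must fit in the total amount of TCAM.
            raise MemoryError("more prefixes than TCAM rows")
        self.rows = [(int(n.network_address), int(n.netmask)) for n, _ in nets]
        self.sram = [result for _, result in nets]

    def lookup(self, addr):
        """Return the result for the first (longest) matching row, or None."""
        key = int(ip_address(addr))
        for index, (value, mask) in enumerate(self.rows):
            # Mask bits set to 0 are "don't care" bits in this row.
            if key & mask == value:
                return self.sram[index]
        return None


tcam = TcamSketch(capacity=4)
tcam.load({"0.0.0.0/0": "default", "10.0.0.0/8": "if0", "10.1.0.0/16": "if1"})
print(tcam.lookup("10.1.2.3"))
```

Because the rows must stay in longest-prefix-first order, this sketch
simply rebuilds the whole table on every load.  Hardware cannot do
that cheaply, which is one way to picture the update problem.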

>> (However TCAM is expensive, power-hungry, must be soldered to the
>> main board - can't be upgraded - and can be slow to update when the
>> classification rules need to be changed.)
> 
> Expensive, yes; power hungry, yes. It is *not* the case that they *must*
> be soldered to the main board - this per Cisco rep at the last NANOG.

OK - they must be using special DIMMs or similar.  The TCAM chips I
am familiar with all have a large number (hundreds) of ball grid
array "pins" and they need substantial heatsinks, which greatly
limits their physical density:

http://documentation.renesas.com/eng/products/others/rej03h0001_r8a20211bg.pdf

361 balls, 27mm x 27mm.  266 comparisons per microsecond, for
instance 72 bit input data with 256k rules.  Generally chips come
with detailed power consumption data, but TCAM chip specs often have
no such details.

They are massive comparator farms with input data ("address") lines
and their inverted versions running vertically down the chip and
with comparison lines running horizontally.  On every cycle, on
average half the data lines change state and all (or almost all) the
comparison lines change state.  There is large capacitance on all
these lines and the whole chip is thrashing away, dissipating a lot
of power.
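
To put rough numbers on "thrashing away", here is some arithmetic
using the figures quoted above for that chip.  It is a sketch only.
I am assuming that "266 comparisons per microsecond" means 266
million searches per second and that "256k" means 256 x 1024 rules.
The function name is made up, and the result counts comparisons - it
says nothing directly about watts.

```python
LOOKUPS_PER_SECOND = 266 * 10**6  # assumed reading of "266 per microsecond"
RULES = 256 * 1024                # assumed meaning of "256k rules"
WIDTH_BITS = 72                   # width of the input data


def tcam_activity(lookups_per_second, rules, width_bits):
    """Return (row compares/s, cell compares/s, storage bits) for a TCAM."""
    # Every search is compared against every rule row.
    row_compares = lookups_per_second * rules
    # Every row compares each bit of the input data.
    cell_compares = row_compares * width_bits
    storage_bits = rules * width_bits
    return row_compares, cell_compares, storage_bits


rows, cells, bits = tcam_activity(LOOKUPS_PER_SECOND, RULES, WIDTH_BITS)
print(f"{rows:.2e} row comparisons per second")
print(f"{cells:.2e} bit comparisons per second")
print(f"{bits // 2**20} Mibit of ternary storage")
```

Under those assumptions the chip does about 7 x 10^13 row
comparisons per second across 18 Mibit of storage.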

> And ditto the upgradability. They haven't been made FRUs 

Field Replaceable Units.

> in the past, but
> there's nothing intrinsic to them that forces hard-wiring, other than
> design cost on the board itself.
> 
> The TCAM standard has advanced, so that the next several generations will
> have completely compatible pinouts, specifically so that they *can* become
> FRUs. The main idea would be, upgrade TCAMs to higher density units, and
> stack more of them in serial on the main board. More total TCAM space, in
> the same number of "slots".

It is all very well for router manufacturers to crank up their
products with more grunt - which we all pay for.  Adding more and
more TCAM is not going to solve the overall problems in routing and
addressing.  Unless something is done, every new multihomed end-user
adds at least one prefix, which means one comparison line in the
TCAM of every FIB in every DFZ router - and most such routers have a
separate FIB for each interface or small group of interfaces.  There
are at least 123k routers in the DFZ:

  http://psg.com/lists/rrg/2007/msg00253.html
  http://psg.com/lists/rrg/2007/msg00262.html

>> There's no such thing as a 32 bit lookup unless what you need to
>> find is a byte or less and if you have 4 gigabytes of RAM to hold
>> the array, which no router's FIB has.
> 
> Wrong. Your idea of "what a router is", is just a little limited, which
> is why you believe this to be the case.

I was discussing the Cisco-Juniper-et-al. style routers - hardware
based, with a specialised FIB, dual redundant power supplies, etc. -
which ISPs prefer for their DFZ routers, due to their physical and
software robustness, I guess.

> Any reasonably big iron server, with multiple PCI-express buses,
> fast and numerous CPUS, and serious enough chip set, can do the job of
> "high-end router". 1 x 10GbE per PCI-E bus, nominally 4 or more,
> and upwards of 512GB of memory, can be put into a (big) box that runs
> routing software (such as quagga).
> 
> Been there, done that, as the saying goes, and yes, it can do DFZ level
> routing and forwarding, with tons of capacity for long prefix lookups.

This is a field I am interested in, for instance for implementing
the ITR (Ingress Tunnel Router) function of my Ivip proposal:

  http://www.firstpr.com.au/ip/ivip/

If you have some URLs of pages describing such systems, please let
me know them.

I don't suggest that such server-based systems will be the norm for
DFZ routers.

> Besides which, there's every reason to believe that one byte can hold
> enough information for the result of a route lookup. Think index into an
> array of objects, each of which includes an interface index and MAC address.
> One byte means 256 such objects, which is likely to be sufficient for the
> majority of devices holding default-free routing tables.

I would have thought that 8 bits would be fine too.  I understand
that in larger routers they typically use 16 or 32 bits, or perhaps
more, because they are also specifying specific output queues in
particular output interfaces.  Still, for Internet traffic packets,
as far as I know, there isn't any special queuing - but I may be
wrong.
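
To make the "32 bit lookup" arithmetic concrete: one byte per IPv4
address is 2^32 bytes, which is 4 gigabytes.  Below is a scaled-down
sketch of the idea, indexing on only the top 16 bits so that it runs
anywhere.  The class and method names are my own invention, the MAC
addresses are placeholders, and routes must be added shortest prefix
first so that longer prefixes overwrite shorter ones.  I don't claim
any real router does exactly this.

```python
from ipaddress import ip_address, ip_network

FULL_TABLE_BYTES = 2**32  # one result byte per IPv4 address: 4 gigabytes


class DirectIndexFib:
    """Array indexed by the top bits of the address, one result byte per slot."""

    def __init__(self, index_bits=16):
        self.index_bits = index_bits
        self.table = bytearray(1 << index_bits)  # 0 means "no route"
        self.adjacencies = [None]                # index 0 is reserved

    def add_adjacency(self, interface, mac):
        """Store an (interface, MAC) object and return its one-byte index."""
        if len(self.adjacencies) >= 256:
            raise OverflowError("one byte can index at most 256 objects")
        self.adjacencies.append((interface, mac))
        return len(self.adjacencies) - 1

    def add_route(self, prefix, adj_index):
        """Fill every slot covered by the prefix (add shortest prefixes first)."""
        net = ip_network(prefix)
        if net.prefixlen > self.index_bits:
            raise ValueError("prefix is longer than the index width")
        first = int(net.network_address) >> (32 - self.index_bits)
        count = 1 << (self.index_bits - net.prefixlen)
        self.table[first:first + count] = bytes([adj_index]) * count

    def lookup(self, addr):
        """One array read, then one read of the adjacency object."""
        slot = int(ip_address(addr)) >> (32 - self.index_bits)
        return self.adjacencies[self.table[slot]]


fib = DirectIndexFib()
a = fib.add_adjacency("eth0", "00:00:5e:00:53:01")
b = fib.add_adjacency("eth1", "00:00:5e:00:53:02")
fib.add_route("10.0.0.0/8", a)
fib.add_route("10.1.0.0/16", b)
print(fib.lookup("10.1.2.3"))
```

The lookup cost is the same whatever the prefix length, but the
memory grows as 2 to the power of the index width, which is the
trade-off being argued about above.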

  - Robin