[arin-ppml] A challenge to the assumption that a big DFZ is a problem

Mon Dec 14 23:08:08 EST 2009

In a message written on Mon, Dec 14, 2009 at 11:07:10AM -0800, Ted Mittelstaedt wrote:
> Today I can walk into the store and purchase a PC that has a CPU
> in it that runs at a clock speed of at least 10 times of
> most routers, and has at least 10 times the amount of ram, for
> a quarter of the cost of the annual service contract for most
> DFZ routers  (let alone the hardware cost)

That you're asking this question tells me you don't know how larger
routers (GSR, CRS-1, T640, T1600 etc) are architected at all.  Please
don't take that as an insult either, I suspect only a small fraction
of the folks on the list own such routers, and only a much smaller
fraction of those understand how they work internally.

I'll provide the 10,000 foot view, but beware, that's all it is,
there are a LOT of details at work.

Let's look at a Juniper T1600.  It is an 8 slot box, with each slot
capable of 100Gbits/sec, bidirectional.  Hint, 8 * 100 * 2 = 1600.
:) So if you're provisioning 10Gbps ethernet, a fairly fast technology
today, you can put 160 10GE ports in the router.

You don't route 1.6Terabits/sec on a CPU.  Or on several CPUs.
The "open source router" community (see www.vyatta.com, as an
example) suggests you can software route ~3-4Gbps on a very well
tuned Nehalem CPU.  To route 160 10GE ports would take 480 CPUs
at that rate!  Even at $500 per CPU, that's $240,000 worth of CPU
alone.  And that's not counting all the bus interconnections, DRAM, etc.
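
For anyone who wants to play with the numbers, here is a rough sketch of
that arithmetic.  The port count, port speed and CPU price are the ones
from the paragraph above; the function name is just mine, and the two
per-CPU rates are the ends of the ~3-4Gbps range (the 480 figure sits in
between).

```python
import math

PORTS = 160          # 10GE ports in a full box, per the text
PORT_GBPS = 10
CPU_PRICE_USD = 500  # per-CPU price used in the text

def cpus_needed(total_gbps, per_cpu_gbps):
    """CPUs required if each one software-routes per_cpu_gbps."""
    return math.ceil(total_gbps / per_cpu_gbps)

total_gbps = PORTS * PORT_GBPS
for per_cpu in (3, 4):
    n = cpus_needed(total_gbps, per_cpu)
    print(f"{per_cpu} Gbps/CPU: {n} CPUs, ${n * CPU_PRICE_USD:,} in CPUs alone")
```

Either way you land at several hundred CPUs per box before you have
bought a single bus or DIMM.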

No, these boxes don't work like that at all.  Rather there is a routing
engine (Juniper's term), or route processor (Cisco's term) which runs a
CPU and does BGP with your neighbors.  This is the "old, slow CPU" that
you're referring to in those high end boxes.  Truth is though, even the
"old, slow CPUs" they use could handle several million routes.  All
they do is run BGP, and create from that a single master copy of the
routing table, generally called the FIB, or Forwarding Information Base.
It is the distilled version of the routing table, similar to what
"show route" displays.

The CPU then pushes this table to the linecards, into special memory
called TCAM.  The TCAM holds entries like:

10.0.0.0/8 Linecard3Port2

As packets come in, special hardware looks up the TCAM entry, and then
sends the packet out over the switch fabric to the other cards.
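
To make the lookup rule concrete, here is a toy sketch of longest-prefix
matching in software.  Only the 10.0.0.0/8 entry comes from the example
above; the other entries and the function name are made up for
illustration.  Real TCAM compares the destination against all entries in
parallel in hardware, so the loop below only shows which entry wins, not
how the hardware gets there.

```python
import ipaddress

# Toy FIB: prefix -> outgoing interface.  Only the 10.0.0.0/8 entry is
# from the text; the other two are invented for illustration.
FIB = {
    ipaddress.ip_network("10.0.0.0/8"): "Linecard3Port2",
    ipaddress.ip_network("10.1.0.0/16"): "Linecard1Port4",
    ipaddress.ip_network("0.0.0.0/0"): "Linecard2Port1",
}

def lookup(dst):
    """Return the interface of the longest prefix that covers dst."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in FIB if addr in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return FIB[best]

print(lookup("10.1.2.3"))    # covered by /16, /8 and the default
print(lookup("10.200.0.1"))  # covered by /8 and the default
print(lookup("192.0.2.1"))   # covered by the default only
```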

TCAM is expensive.  Why?  Well, consider a linecard in your T1600,
dealing with a 100G (bidirectional) flow.

That's:

kilo   mega   giga   speed  bidirectional
1000 * 1000 * 1000 * 100  * 2

Or 200000000000 bits/sec.  Or divide by 8, 25000000000 bytes/sec.
Now, let's say they are all 64 byte packets.

64 / 25000000000 = .00000000256 SECONDS PER PACKET.

Let me stack that with 1 nanosecond:

.00000000100
.00000000256

It's a lookup every 2.5 nanoseconds or so, on every linecard, all day
long.  This takes arrays of crazy fast TCAM.
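
If you want to redo the per-packet budget with the units spelled out,
something like the sketch below works.  It assumes 100Gbits/sec each way
per slot and all 64 byte packets, as above, and ignores framing overhead;
the variable names are mine.

```python
SLOT_GBPS = 100     # per direction, per the text
DIRECTIONS = 2
PACKET_BYTES = 64   # worst case small packets, framing overhead ignored

bits_per_sec = SLOT_GBPS * 10**9 * DIRECTIONS
bytes_per_sec = bits_per_sec // 8
packets_per_sec = bytes_per_sec / PACKET_BYTES
ns_per_packet = 1e9 / packets_per_sec

print(f"{bits_per_sec:,} bits/sec")
print(f"{packets_per_sec:,.0f} packets/sec")
print(f"{ns_per_packet:.2f} ns per packet")
```

Change PACKET_BYTES to a more typical average packet size and the budget
loosens, but the hardware has to be sized for the small packet case.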

So long story short, the vendors guess.  They guess that 1,000,000
routes on the internet distill into an 800,000 route FIB, and size the
TCAM for that on each linecard.  Note that generally TCAM is not
socketed and not field upgradable.  Given the speeds it is actually
difficult to socket, and it's very static sensitive for field upgrades.
So it's soldered to the board.

When the guess, by the vendor or the ISP, turns out to be wrong the
upgrade cost is not the "old, slow CPU"; indeed that is often working
just fine if only taking 5 minutes to bring up a full table rather
than the 2 minutes people would like.  Rather it's throw out every
linecard and buy new ones.  The penalty for guessing wrong is severe,
it's instant, total junking of all the linecards on your network.

Care to guess what a 10 port 10GE linecard costs for one of these boxes?
I'll assume you get some discount from your vendor, so maybe $400,000.
So your 8 slot box costs $3.2 million to upgrade.  Oh, but the new cards
will be more expensive, more TCAM.  Got a network with 200 core routers
(I can think of some ISP's with more, for sure) and you're "only"
talking a 640 million dollar upgrade, for one ISP, just to handle a
larger table.
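
The back of the envelope version, so you can plug in your own network.
The linecard price, slot count and router count are the guesses from the
paragraph above, not real quotes, and the function name is mine.

```python
LINECARD_USD = 400_000   # discounted price guessed in the text
SLOTS_PER_ROUTER = 8
CORE_ROUTERS = 200

def upgrade_cost(routers, slots=SLOTS_PER_ROUTER, card_usd=LINECARD_USD):
    """Cost of replacing every linecard in `routers` routers."""
    return routers * slots * card_usd

print(f"one router: ${upgrade_cost(1):,}")
print(f"whole network: ${upgrade_cost(CORE_ROUTERS):,}")
```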

Before I go any further, I'm going to tell people up front I'm not going
to engage in nit picking over any of the above.  If you want to design
core routers go work for Junper or Cisco, if you can do it for half the
cost of current designs I'm sure they will pay you a nice sum.  I'm also
sure it can be done both cheaper and more expensively, depending on
circumstance.   I've picked a run of the mill example, almost every ISP
is a special case in something.

So anyway, from the big ISP perspective the situation is this: currently
deployed hardware is what it is.  Unless a multi-hundred million dollar
check falls from the sky, it will be what it is until the next, already
planned equipment refresh.  At that point it will be what the vendor
has already decided the next gen platform will be (you know it takes
3-5 years to develop a next gen platform, on a quick ramp, right?).
Also keep in
mind some of the TCAM goes to things like MPLS VPN's, which are growing
on their own.

If these boxes end up exhausting TCAM there will be some upgrades, but
the vast majority of ISP's will turn to filtering to solve the problem.
Remove enough routes so it fits again; at least until the next refresh
cycle.
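
As a rough illustration of what "filter until it fits" means, here is a
toy sketch that drops the most specific prefixes first.  The policy, the
function name and the sample table are all hypothetical; real filters
are hand-built per network, and anything dropped stays reachable only if
a covering route or a default is still in the table.

```python
import ipaddress

def filter_to_fit(prefixes, capacity):
    """Keep at most `capacity` routes, dropping the longest prefixes first."""
    nets = sorted((ipaddress.ip_network(p) for p in prefixes),
                  key=lambda n: n.prefixlen)
    return [str(n) for n in nets[:capacity]]

# Made-up table of four routes squeezed into three slots.
table = ["10.0.0.0/8", "192.0.2.0/24", "198.51.100.0/25", "172.16.0.0/12"]
print(filter_to_fit(table, 3))
```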

Lastly, I promise you this, the folks at the top 10 ISP's are all
meeting with Cisco and Juniper several times a year, with real
engineers, not sales folks, and trying to rationalize the cost of the
parts with the needs of the network.  They provide lots of engineering
input on the next generation of parts.  However, everyone involved is
having to commit now to how big those TCAM's will be in 3 years on the
next gen cards, which will be in most of the network in 5-7 years.

Hence my statement on the matter.  On some level it doesn't matter if
the RIR's give away blocks like crazy, or are as stingy as possible.
What matters is that the rate at which blocks are given out roughly
matches the rate that was expected.  We can bend the curve, up or down,
but SLOWLY, as equipment is refreshed.

There is a wall.  It is a 200 foot thick concrete wall.  No matter how
hard or soft you hit it, the wall will not move; you will be splattered.
Fill most core router TCAMs and we're all in for a very bad few years.

-- 
       Leo Bicknell - bicknell at ufp.org - CCIE 3440
        PGP keys at http://www.ufp.org/~bicknell/