[arin-ppml] Routing Research Group is about to decide its scalable routing recommendation

Sat Dec 19 00:47:04 EST 2009

Short version:   Responding to Leo and giving my views on the three
                 types of solution to the routing scaling problem:

                   Soup up BGP              No-one suggests this.

                   Core-edge separation     LISP, APT, Ivip & TRRP.
                                            Backwards compatible,
                                            works for IPv4/6 and
                                            requires no host changes.
                                            New functions added to
                                            ISP and end-user
                                            networks.

                   Core-edge elimination    Generally involves host
                                            changes - stack and apps.
                                            Not backwards compatible.
                                            Generally does not add
                                            new functions into
                                            any part of the network.

                Discussion of the number of end-user networks which
                need to be supported, and why I think 10M is the max
                for non-mobile networks.

                Agreeing that LISP involves some delayed or dropped
                initial packets, but pointing out that this problem
                does not exist for APT or Ivip.

Hi Leo,

Thanks for your appreciative reply.  You wrote:

> What I am hearing is that the research community believes that BGP4
> cannot scale to the point where everyone has a provider independent
> prefix.  That if we want to have everyone have a provider
> independent prefix, we need a new technology to be in place first.
>
> I think that's a very important message for the ARIN community to
> understand.

I think this is absolutely true.  No-one has seriously suggested a
way of solving the routing scaling problem by souping up BGP and the
DFZ in its current form.

I think that pretty much everyone agrees the most important problem
is in the RIB and BGP exchanges, and the stability and response times
of the whole BGP DFZ control plane - rather than a problem with the
number of routes the FIB handles.  However, FIB limitations surely
rule out trying to make the DFZ handle tens of millions or more
separate end-user prefixes, even if there were no RIB or BGP control
plane objections.

On problems with BGP:

>> Can you mention those which you think are not well documented?
> 
> I think the effects of how ISP's configure BGP on the performance
> have been relatively poorly studied.  What I see mostly are lab
> tests, there are very few studies tracking CPU usage or propagation
> times in the real Internet and relating that back to protocol or
> configuration weakness.

OK.  While there may be ways of marginally improving the operation of
the DFZ, including by improvements to the BGP protocol, the goal of
scalable routing is far beyond the modest improvements these might bring.

The aim is not just to enable end-user networks which advertise PI
space now to adopt a scalable alternative.  The aim (of most people,
I think) is to extend this so it is attractive to many smaller
end-user networks so they can get portable space, suitable for
multihoming and inbound TE.

I think it should extend to SOHO "networks" even though they might
have most of their hosts behind NAT.  For instance, if I am running a
business and am concerned about the reliability of my DSL line, I
should be able to get a 3G or WiMAX service as a backup, and use my
address space on either.  That is a cheap backup arrangement - since
there's no need for new fibre or wires.  My address space may only be
a single IPv4 address, but if I need it for mail, HTTPS e-commerce
transactions, VoIP etc. I would want it to keep working without a
hitch if the DSL line or its ISP was not working.

Since we are doing this, I think it should extend to mobility too -
with each device getting its own IPv4 address or IPv6 /64.

Mobility support is not formally part of what the RRG is supposed to
be doing, but I think it would be unwise to choose a solution without
at the same time making sure it was a good basis for mobility.

Many people think that a core-edge separation scheme with mobility
(such as the LISP-MN internet draft) must involve the MN being its
own ETR.  That won't work for IPv4 and there are many objections to
this for IPv6.  See my critique:

  http://www.ietf.org/mail-archive/web/lisp/current/msg00749.html
  http://www.ietf.org/mail-archive/web/lisp/current/msg00772.html

The TTR approach to mobility involves the MN using one or more nearby
"Translating Tunnel Routers" as its ETR(s) and also as outgoing
routers, with ITR functions.

  http://www.firstpr.com.au/ip/ivip/TTR-Mobility.pdf

There's no need for new protocols - the mobile hosts, via their TTRs,
communicate normally and with generally optimal path lengths with
ordinary IPv4 or IPv6 hosts.

My rough estimate of the maximum number of non-mobile end-user
networks we would ever need to support is about 10 million.  This is
on the basis of 10 billion people, and one organisation or business
per thousand people being concerned enough about multihoming to want
to get a second connection to a second ISP, and get their own
Scalable PI (SPI - my term) space.

If this is all we need to handle, there's no strong argument for the
LISP-CONS, LISP-ALT or TRRP approach - which is to build a globally
distributed query server network for mapping, because the mapping
itself is too big to store in any one location.  10M mappings will
fit in DRAM of a big server now, and a small server by the time we
get 10M multihomed non-mobile networks.

So in my view, anyone who says we need to cope with more than about
10M EID prefixes (LISP terminology) must be assuming the system
supports mobility.

With mobility, I think a worst case is 10 billion - one for a
cell-phone for everyone on a rather crowded planet.  Clearly this
means IPv6, though the TTR approach will work for both IPv4 and IPv6.

I think this is still feasible to do with a local query server
architecture such as APT or Ivip.  Ivip's IPv6 mapping is the start
and end of the range of SPI space, in units of /64, and a single ETR
address.  This is 32 bytes, so the total mapping database would be
320Gbytes.  That fits on a consumer hard drive today, and should be
fine in server DRAM by the time such adoption is achieved.
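
To make the arithmetic behind these estimates easy to check, here is a
minimal Python sketch.  The record layout (two 64-bit indexes counting
/64s, plus a 128-bit ETR address) is only my reading of the description
above, not a format taken from any Ivip draft, and the names are
hypothetical.

```python
import struct

# Assumed layout: start and end of the SPI range as 64-bit counts of
# /64 prefixes, followed by one 128-bit IPv6 ETR address.
MAPPING_RECORD = struct.Struct("!QQ16s")

def mapping_db_bytes(num_mappings: int) -> int:
    """Raw size of a flat table of records, ignoring indexes and overhead."""
    return num_mappings * MAPPING_RECORD.size

PEOPLE = 10_000_000_000
NON_MOBILE_NETWORKS = PEOPLE // 1_000   # one multihomed network per 1000 people
MOBILE_DEVICES = PEOPLE                 # worst case: one device per person

print(MAPPING_RECORD.size, "bytes per mapping")
print(NON_MOBILE_NETWORKS, "non-mobile end-user networks")
print(mapping_db_bytes(MOBILE_DEVICES) / 1e9, "Gbytes for the mobile worst case")
```

This counts only the raw records.  A real query server would need
indexes and working space on top of that.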

>> Core-edge separation schemes (LISP, APT, Ivip and TRRP) don't alter
>> the functions of hosts or most internal routers.  They just provide a
>> new set of end-user address prefixes which are portable to any ISP
>> with an ETR, and which are not advertised directly in the DFZ.  A
>> covering prefix, including many such longer prefixes, is advertised
>> by special ITRs ("Open ITRs in the DFZ" is the Ivip term, "Proxy
>> Tunnel Routers" is the LISP term) so these routers collect packets
>> sent by hosts in networks without ITRs and tunnel those packets to
>> the correct address. APT and TRRP have functionally similar
>> arrangements for packets sent from networks without ITRs.
> 
> I've done some work with the LISP folks, but I'm far from an expert
> on that particular example.  At least in the LISP case, but I suspect
> in all of these, it feels a lot like squeezing a balloon.  That is,
> they do improve the area they are looking to improve, but at the
> expense of reducing the performance in some other way.

I think LISP is not a suitable solution.

LISP-NERD involved all the mapping going to every ITR. Everyone
agreed that this was not going to scale as well as some alternative.
 NERD is no longer being developed.

To me, the best alternative is local (in the ISP or end-user network)
full-database query servers.  This is what APT and Ivip use.

Instead the LISP crew went to the other extreme - a distributed
global query server network, in which no one device has the total
mapping database.  In principle this should scale without limit.
However, these two LISP approaches - CONS and now ALT - are nowhere
near being able to scale to tens or hundreds of millions of EIDs.

CONS was pursued for a few months in 2007.  ALT was launched in
November 2007 and since then I and others have been critical of its
"long-path" problem.  There is no obvious way to shorten the total
path length through the "highly aggregated" ALT network, while
ensuring the network is robust against router or link failure.   See
the recent discussion on the LISP WG list:

  ALT structure, robustness and the long-path problem
  http://www.ietf.org/mail-archive/web/lisp/current/msg01801.html

Two years after ALT's inception, there are no answers.  Some key
people on the list regard it as something worth building to
experiment with - without at the same time thinking that it will be
the best final architecture.

Even if these problems were solved, LISP-ALT would frequently drop
the initial packets in a new communication flow, since it could take
a fraction of a second to several seconds for the ITR to get the
mapping it needs from the global ALT query server system.  (Or longer
if the request or response packet is lost.)

Since the delay is too long to hold on to the packet for, the ITR
drops the initial packet and sends a mapping request.  When the
response arrives, it has to wait for the sending host to send another
packet (assuming the host keeps trying - it might try an alternative
IP address it got from round-robin DNS).  So it is not just the delay
inherent in LISP-ALT's global query server system - the delay
involves waiting for the host to resend, and dropping all resent
packets which the ITR gets before the map reply message arrives.

LISP-ALT's initial packet delays could affect A doing a DNS lookup
for B's address, A sending a packet to B and B sending a packet back
to A.

APT and my Ivip proposal don't have a crowd of people working on
them, but neither proposal has this major problem of LISP-ALT.
 In both proposals, the ITR gets mapping quickly and reliably from a
local full database query server (the Default Mapper in APT).

So APT and Ivip would not significantly alter performance, in terms
of the time it takes for traffic packets to reach their destination -
while LISP-ALT would significantly delay many initial packets.
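
The difference can be illustrated with a toy timing model in Python.
This is a sketch under stated assumptions, not a description of any
implementation: the resend schedule, the delays and the max_hold
parameter (how long an ITR is willing to buffer a packet while it
waits for mapping) are all made-up values.

```python
def first_delivery_time(send_times, map_reply_delay, max_hold=0.0):
    """Time at which the first packet of a new flow leaves the ITR.

    send_times      -- times (s) at which the host sends, then resends
    map_reply_delay -- time (s) from the ITR's map request to the reply
    max_hold        -- longest time (s) the ITR will buffer a packet

    The ITR sends its map request when the first packet arrives.  If the
    reply comes back within max_hold, the buffered packet is forwarded.
    Otherwise the packet is dropped, and so is every resent packet that
    arrives before the reply.  Returns None if the host stops resending
    before the mapping arrives.
    """
    if not send_times:
        return None
    mapping_ready = send_times[0] + map_reply_delay
    if map_reply_delay <= max_hold:
        return mapping_ready
    for t in send_times[1:]:
        if t >= mapping_ready:
            return t
    return None

# Hypothetical resend schedule for a host opening a connection.
resends = [0.0, 1.0, 3.0, 7.0]

global_query = first_delivery_time(resends, map_reply_delay=0.5)
slow_global = first_delivery_time(resends, map_reply_delay=2.5)
local_query = first_delivery_time(resends, map_reply_delay=0.02, max_hold=0.05)
print(global_query, slow_global, local_query)   # 1.0 3.0 0.02
```

In this model a half-second map reply costs a full second, because
the ITR has nothing to forward until the host resends.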

> As a result, as an operator, I find it hard to consider these
> solutions "better".  Indeed, without the item being optimized for
> being a very scarce resource it seems unlikely folks are going to
> want to transition to a new technology solely for the same size
> balloon.

A core-edge separation scheme should provide profound benefits, as
long as there is no significant loss of performance.

With Ivip, the ETR could be a device in the ISP, shared by many
end-user networks, or it could be a device in the end-user network -
and that network runs it on one or maybe four IP addresses it gets
from its ISP's address space.  In the latter case, the end-user
network gets a tiny piece of PA space from one, two or more ISPs and
then runs its own ETRs, through which it can use as much SPI space as
it likes.

With Ivip on IPv4, this means an end-user network can (effectively
permanently) rent some SPI space from some entity anywhere in the
world - probably not any of its ISPs - and then use that space as it
wishes, on any PA address at any ISP (with its own ETRs) or using
shared ETRs at ISPs which have installed them.

The end-user network might get 16 IPv4 addresses and split them up so
head office gets four and the branch offices get one each.  If each
office multihomes, then these are stable addresses which can be
relied upon, including for VoIP and VPNs.  The offices can be
anywhere in the world, and it is little fuss to choose different ISPs
at each location.

ISPs don't need to do anything in order to support the second mode -
the end-user network running its own ETR.  Indeed, the ISP would be
unable to prevent this usage.  ISPs which installed their own ETRs
would be more attractive to Ivip-using end-user networks which wanted
to do it this way.

Ideally ISPs and end-user networks would install their own ITRs - no
matter whether or not they used SPI space.  This means they
encapsulate their own packets which are addressed to SPI addresses,
rather than letting the packet go out to the DFZ where it would be
forwarded to the nearest Open ITR in the DFZ (OITRD in Ivip, Proxy
Tunnel Router in LISP - and APT has similar functionality) which
would tunnel it to the correct ETR.
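
The forwarding decision an ITR makes can be sketched in a few lines of
Python.  This is only an illustration: the static MAPPING table, the
documentation-range addresses and the function name are hypothetical,
a real ITR would get its mapping from a query server, and the SPI
prefixes are assumed not to overlap.

```python
import ipaddress

# Hypothetical mapping of SPI prefixes to ETR addresses.
MAPPING = {
    ipaddress.ip_network("192.0.2.0/30"): ipaddress.ip_address("198.51.100.1"),
    ipaddress.ip_network("192.0.2.4/32"): ipaddress.ip_address("203.0.113.7"),
}

def itr_forward(dst):
    """Return ("tunnel", etr) for SPI destinations, else ("forward", dst)."""
    dst = ipaddress.ip_address(dst)
    for prefix, etr in MAPPING.items():
        if dst in prefix:
            # Destination is in SPI space: encapsulate and send to its ETR.
            return ("tunnel", etr)
    # Not an SPI address: forward as an ordinary packet.
    return ("forward", dst)

print(itr_forward("192.0.2.1"))
print(itr_forward("192.0.2.4"))
print(itr_forward("198.51.100.200"))
```

A network without its own ITR skips this step, so its packets to SPI
addresses travel to an Open ITR in the DFZ, which makes the same
decision there.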

One reason for an ISP installing its own ITRs would be to stop
traffic going out to the DFZ when it really needs to be encapsulated
and sent to an ETR in its own network.  (This is for Ivip, I am not
sure how LISP would work in this respect.)

So there would be little or no investment for an ISP to be able to
support some of its customers using SPI space.

>> The first difficulty is that a new routing protocol can never be
>> introduced to replace BGP4 unless it is fully backwards compatible -
>> and no-one has devised such a thing AFAIK.
> 
> I strongly disagree with this statement.  While it would be vastly
> easier if it were fully backwards compatible that is by no means a
> requirement.  Indeed, many of the schemes proposed are what I would
> call less than backwards compatible.  I know folks may look at
> map-encap as keeping the host stack the same; but the reality is
> to the backbone operator it is a forklift upgrade to new technology
> and stands a really good chance of requiring new hardware (in some
> cases) at the edge.  Rolling out a new routing protocol, even if
> it must just replace BGP, is no harder.

Can you point to any proposal to replace BGP which could be
introduced in a way which provided significant immediate benefits to
the early adopters (not just dependent on how many others adopt it,
which initially is very low) while also working with the existing system?

>> The second is that it is pretty tricky to come up with a protocol for
>> the Internet's interdomain routing system which could cope with the
>> growth in the number of separately advertised end-user networks.
>>
>> There could be millions or billions of separate prefixes which
>> end-user networks, including mobile devices, need to keep no matter
>> where they physically connect to the Net.
> 
> Here is where I feel like there is a major disconnect.  Operators
> today are concerned with growing the current system.  Perhaps looking
> at a 10 year figure of 1 million IPv4 routes and 300k IPv6 routes,
> using more or less the existing schemes.
> 
> What the IETF (researchers, vendors, etc?) seem to be looking at
> is how can we give every man woman and child a provider independent
> prefix and route all of them.  Your "billions of prefixes" case.

Yes, in the case of mobility.  If you ignore mobility, I don't think
many people would argue that we need to support more than a few
million or perhaps my 10 million non-mobile end-user networks.

> That's worthy work, I'm all for seeing if there is a way to do that
> and then assessing if it is a path we want to go down as an industry.

Some of the proposals to the RRG are for IPv6-only and/or involve
host changes - so there's no way they can be adopted widely enough on
a voluntary basis to solve the only routing scaling problem we have
at present: in IPv4's DFZ.  These tend to be core-edge elimination
schemes - which avoid adding much into the network and put a bunch of
new routing and addressing responsibilities on all hosts.  HIP is the
most prominent example of such an architecture.

LISP, APT, Ivip - and TRRP if Bill Herrin is still working on it -
could all, in principle, solve the IPv4 scaling problem, to very
large numbers of end-user networks.  These are all core edge
separation schemes.

LISP-ALT and TRRP rely on a global query server network, so they are
subject to the critique about dropped and delayed initial packets.

There definitely is a desire to provide a new architecture - either
to replace the current architecture (as with the core-edge
elimination schemes) or to add it to the current architecture
(core-edge separation schemes) which will do more than just cope with
the current growth rate in the DFZ routing table.

This is because that growth in numbers still only represents a
fraction of the numbers of end-user networks which want or need
portability, multihoming and/or inbound TE.

> However I feel that it is coming at the expense of the much less
> sexy problem of "how do we keep the status quo going for 10, 20,
> or 30 more years".  It's also a harder problem, which means it will
> almost certainly take more money and effort to solve.

The core-edge separation schemes can keep the current system going,
without any changes to host stacks or apps.  The core-edge
elimination schemes can't - they require new stacks and probably
apps.  I am 100% certain this will not happen within decades.  They
may never happen.  There are fundamental objections to requiring
every host to manage more routing and addressing stuff:

  http://www.firstpr.com.au/ip/ivip/RRG-2009/host-responsibilities/

>> Many of the proposals now being made for the RRG process seem to
>> involve host changes - specifically making the host responsible for
>> more routing and addressing things than in the past.  This is for
>> *every* host, not just for mobile hosts.
> 
> There is an old belief that the Internet succeeded over some other
> technologies in part due to the fact that "routers are dumb" and
> all of the smarts are in the host.  TCP congestion control is often
> cited as an example, let the hosts deal rather than having X.25 or
> Frame Relay style notification in the middle boxes.
> 
> While I think many of the examples given are poor, I think the
> premise is right.  Having to scale the technology in a PC is vastly
> cheaper than in core routers.  If there is a choice of putting
> complexity in the host or in the core router, the host wins hands
> down.

I agree to some extent.  APT and LISP do not allow the ITR function
to be in the sending host.  Ivip has an option for placing the ITR in
the sending host.   This is a good idea except when either of these
is true:

  1 - The host is on a high latency link, such as a wireless mobile
      device, or any non-mobile network relying on a satellite link.

      This will delay the ability of the ITR to get mapping from
      its local query server, and so delay the forwarding of
      initial packets.

  2 - Where it is vital that the host be as simple and low-power
      as possible.  This includes many mobile devices.

A good example of where to do it would be in a web-server farm - so
each server does the ITR work and there is no need to have a separate
ITR to encapsulate the large volume of outgoing packets.

However, this optional placement of the ITR function is very
different from what I think you are suggesting, which is the pattern
followed in the core-edge elimination schemes: have no additional
routing and addressing stuff in the network, and make the hosts do
all the new work.

This is the HIP approach - and it burdens *every* host with a bunch
of new responsibilities.   That might be fine for many hosts, with
plenty of CPU power, RAM etc. AND which are on fast links.  It is a
bad idea for low-power devices on flaky long-latency 3G wireless links.

In the future, if mobility is developed as it should be, there will
be billions of devices, typically connected by a flaky long-latency
3G link.  So if the whole Internet must follow your principle:

     If there is a choice of putting complexity in the host or in
     the core router, the host wins hands down.

as with HIP, then this is going to make mobile hosts have higher
costs and worse performance than they already do.

> The down side of course is that there are many more devices to upgrade, 
> so adoption is a much harder thing.

The core-edge elimination architectures are not backwards compatible.
 So you have to alter all the hosts, effectively creating a new set
of Internet protocols and addressing - a new Internet which can't
directly work with the current Internet.  A decade and a half after
IPv6 was developed, there still would not be a single end-user who
could do the normal, wide range of Internet activities using only an
IPv6 address.

The core-edge separation architectures don't require host changes.
The "optional ITR function in sending host" part of Ivip is just
there to reduce costs when it makes sense.

>> Even if these objections could be overcome, I would still object to
>> it because it significantly slows down the ability to send the first
>> packet of user data.  This slowness depends on the RTT between the
>> hosts, and is greatly exacerbated by a lost packet in the initial
>> management exchange which must precede (AFAIK) the packet which
>> actually contains the user traffic packet.
> 
> Having lived through some technologies with this property in the
> past (e.g. some of the IP over ATM "solutions") and seeing how it
> works in proposals like LISP I have to say this is an Achilles
> heel in many of the proposals.

I think this slowness of delivering the initial packet rules out all
the core-edge elimination schemes - as far as I know, they all
involve hosts doing fancy things to authenticate the other host's
identity before they will send the traffic packet.

Likewise LISP-ALT, TRRP or any other core-edge separation scheme
which relies on a global query server network.

That leaves APT and Ivip as the two proposals which do not delay
initial packets.  APT's Internet Draft is 2 years old, and it has not
yet been submitted as a proposal to the final RRG process.  I assume
it is still being developed and will be part of the RRG's final
deliberations.

> Not just due to first packet slowness,
> but due to the caching of this information that must occur.
> 
> Cache based router designs went from being the most common to
> non-existent a decade ago for a wide range of performance reasons.
> It seems to me most of the proposals bring back the badness in these
> designs, but rather than having it be all in one box, it's distributed
> across multiple routers and hosts across the Internet.
> 
> I can't wait for the day that routing instability causes worldwide
> cache invalidations in some map-encap scheme. :(

Ivip's full database query servers get the latest mapping, as chosen
by the end-user networks themselves (typically from a specialised
monitoring company contracted by the end-user network to choose the
best ETR to use) within a few seconds.  They retain a record of which
ITRs requested mapping within the caching time, and send secure (with
a nonce from the ITR's query) mapping update messages to the ITR. So
the ITR gets the changed mapping within tens of milliseconds of the
local full database query server getting it.  (There are optional
caching query servers between the ITRs and the full database query
servers, but the same thing happens.)

So caching is not necessarily evil.

LISP involves caching - and I haven't really followed the work on how
ETRs can securely update ITR mapping.  I don't support this approach.

In Ivip, ETRs simply decapsulate packets and sometimes communicate
with ITRs for the purpose of Path MTU Discovery management.  ETRs are
not at all involved in Ivip's mapping system.

LISP-ALT's ETRs are the authoritative source of mapping - they are
mapping query servers as well as traffic decapsulators.

  - Robin