[arin-ppml] Modified Header Forwarding for scalable routing

Sun Dec 20 06:21:40 EST 2009

Short version:    Owen DeLong suggested a scalable routing solution
                  would be to have DFZ routers modified to handle
                  packets with modified header structure.  As I
                  understand it, the new packet format would contain
                  the ASN of the destination ISP and the DFZ routers
                  forward it on this basis, without looking at the
                  destination address, which would remain unaltered.

                  I discuss this idea and two other proposals for
                  Modified Header Forwarding:

                     ETR Address Forwarding (EAF) - for IPv4
                     Prefix Label Forwarding (PLF) - for IPv6

                  These are extensions to my Ivip core-edge
                  separation proposal.  These would at least be
                  used in the long-term future - once all DFZ
                  routers had the new capabilities.  In the long-
                  term this would eliminate encapsulation overhead
                  (which is monstrous for IPv6 VoIP packets) and
                  Path MTU Discovery problems.

                  Ideally, they would be used for Ivip's initial
                  introduction - then we wouldn't need to worry about
                  encapsulation overhead or Path MTU Discovery
                  problems - so ITRs and ETRs could be very simple
                  indeed.  But that means having most DFZ routers
                  upgraded before Ivip could be used.

                  All such proposals require upgrades to essentially
                  all DFZ routers.  However, my two only involve
                  FIB changes, and for PLF a minor RIB change.
                  Neither requires a new FIB namespace or any change
                  to BGP operations.

Hi Owen,

In "Re: [arin-ppml] Routing Research Group is about to decide its
scalable routing recommendation" you wrote, in part:

> BGP is not the primary issue.  The primary issue we are faced with is the
> scalability of the way we make forwarding decisions in the IDR (Inter-Domain
> Routing) realm.

I understand you are suggesting that if there was a different
algorithm for forwarding packets, including extra information in the
packet headers, then BGP (I guess a suitably modified version of BGP)
would not be such a limiting factor as it is today in the provision
of multihoming, portability and inbound TE to a much larger number of
end-user networks.

> Currently, we make those decisions based on the IP prefix, thus overloading
> the IP addresses task as an end-system identifier with additional topological
> locator semantics.

Another way of saying this is that the IDR (Inter Domain Routing)
system can only use the same bits, or a subset of the same bits, for
forwarding each packet as we use for identifying hosts for
host-to-host communications.

> That is the issue which prevents scaling the routing system to PI for everyone,
> much more than any of the deficiencies in BGP4.  In fact, BGP4 contains a
> superset of that which would be necessary to build a FIB that could forward
> based on destination ASN.  What is needed is a field in the packet header
> for the destination ASN to be inserted at some point close to the origin,
> and a way to do so without affecting host protocol stacks, at least initially.

I think that both the IPv4 and IPv6 headers could be modified to
contain an ASN.

I understand from this that you are suggesting that packets would be
emitted normally by sending hosts, but that they would all be
processed by something resembling an ITR (Ingress Tunnel Router) of
the core-edge separation schemes (LISP, APT, Ivip and TRRP).  This
ITR function would be in the sending host or be in a router in the
ISP or end-user network where the sending host is located.  The ITR
function could also be out in the DFZ: a router advertising prefixes
which match the unmodified packets' destination address.  (This is
Ivip's Open ITR in the DFZ - OITRD - or LISP's "Proxy Tunnel Router".)

In all cases, the ITR would alter the format of the packet header in
a way that suitably modified routers in the DFZ (and also in ISP and
end-user networks, if the ITR function was within those networks, or
in the sending hosts) would forward the packet according to this new
information.

At some point, the modification would have to be reversed, to
reconstitute the packet ready for the destination host.  In core-edge
separation schemes, decapsulation is done by an ETR (Egress Tunnel
Router).  The ETR needs to be at some point in the destination ISP or
end-user network where the packet in its original form will be
forwarded to its proper destination - not to an ITR which would
modify it again.

Maybe it would be good enough to use the ASN directly, since a large
network, or a network with multiple topologically diverse border
routers, would use the same ASN for all those border routers.  This
assumes the network is happy to accept incoming traffic for any of
its internal destinations on any of its BRs.

However, it would be possible to extend your proposal to use a
different number than the ASN to control the forwarding of the packet.

What you are suggesting is something like a core-edge separation
scheme - but maybe it is something different, since within the core,
you are forwarding packets according to an ASN number which is added
to the packet somewhere.  ASN is a different namespace from IP
address, whereas core-edge separation schemes keep the one namespace
(IP addresses) and use one subset for scalable end-user network space
("edge" addresses) and retain the rest as "core" addresses, for ISPs
and end-user networks with conventional PI space.

Your proposal is not exactly a core-edge elimination scheme - since
those schemes tend to change the hosts so the applications address
hosts by one kind of address ("edge") and the hosts are physically
accessed by another kind ("core"), with the routing system operating
much as it does today, forwarding according to destination address,
where this is the physical address of the host ("core" address AKA
"locator") and is unrelated to its logical application level address
("edge" AKA "identifier").

Anyway, as I understand it, your proposal needs some things
resembling ITRs, which accept traffic packets with certain ranges of
destination addresses (that of the "edge" subset of all IP addresses)
and do something to the packet to get them to an appropriate ETR (or
whatever it is which converts the packet to its original form) near
the destination end-user network.  This involves a "mapping lookup"
to produce something which will enable the packet to be forwarded to
the ETR.

With LISP, APT, Ivip and TRRP, the mapping lookup produces an ETR
address, and the packet is encapsulated so it is tunneled to the ETR.

In your proposal, the mapping produces an ASN of the ISP which the
destination network is connected to.  The DFZ forwards the packet to
the correct ASN - to any border router of that ASN, typically the
"closest" (in terms of how the DFZ routers choose their paths).
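A minimal sketch of how a DFZ router might derive such an ASN-keyed
forwarding table from routes it already holds.  It assumes every BR
advertises at least one conventional prefix and that "closest" simply
means the shortest AS path; real BGP best-path selection uses more
criteria.  The Route type, build_asn_fib() and the sample data are
hypothetical names for illustration, not part of any proposal.

```python
from typing import NamedTuple

class Route(NamedTuple):
    prefix: str      # conventionally advertised prefix
    as_path: tuple   # AS path as seen by this router, origin ASN last
    next_hop: str    # neighbour this route was learned from

def build_asn_fib(rib):
    """Return {origin ASN: next hop}, keeping the shortest AS path per ASN.

    Assumption: "nearest BR of an ASN" is approximated by the shortest
    AS path among all routes originated by that ASN.
    """
    best = {}
    for route in rib:
        origin = route.as_path[-1]
        current = best.get(origin)
        if current is None or len(route.as_path) < len(current.as_path):
            best[origin] = route
    return {asn: route.next_hop for asn, route in best.items()}

# Sample RIB using documentation prefixes and reserved ASNs.
rib = [
    Route("192.0.2.0/24", (64500, 64510), "peer-a"),
    Route("198.51.100.0/24", (64501, 64502, 64510), "peer-b"),
    Route("203.0.113.0/24", (64501, 64520), "peer-b"),
]
asn_fib = build_asn_fib(rib)
print(asn_fib)
```

The resulting table has one entry per ASN rather than one per prefix,
which is where the FIB size reduction would come from.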

As I understand it, the effect of your proposal would be:

  1 - If an ASN advertises a prefix on any one of its BRs, it
      must be prepared to accept packets matching this prefix
      on all of its BRs.

      I suspect this is not what any ISP wants - since it gives
      it no control over where incoming traffic arrives.  But
      for the purpose of discussion, I will assume your scheme
      is desirable.

  2 - The DFZ routers wouldn't need a large FIB - at least for
      these modified packets, since the FIB wouldn't be dealing
      with the destination prefixes of these packets which are
      addressed to hosts in end-user networks with the new kind
      of scalable PI space.  This new section of the FIB would
      only match ASNs, and for each ASN, the router would have one
      best path.

      So your proposal reduces FIB size, but makes it more complex.

  3 - The DFZ routers may well be running BGP - but for these
      end-user network prefixes your scheme works for, it is not
      clear why.  Maybe that's the benefit - these end-user
      network prefixes should not be advertised in the DFZ at
      all.  In that case, you radically reduce the RIB size and
      so really do achieve routing scalability, since this is the
      main goal of current scalable routing proposals.

      However, see point 6.

      The DFZ routers would still have RIB and FIB entries for
      prefixes not of the scalable end-user PI type covered by
      your scheme.  But this is scalable.

  4 - Each DFZ router needs to decide a best path to a BR of each
      ASN - ideally the "nearest" one, though any one of them would
      do.  This is a different task from what BGP is intended to
      achieve.  I
      suppose if every BR advertised at least one conventional
      prefix, then you could extract from the RIB all the information
      you need and choose the apparently nearest path to a BR of
      each ASN, to write this to this new ASN section of the FIB.

  5 - The BRs of all these ASNs would need to recognise the modified
      header and reconstruct the packet so it will be forwarded to
      and recognised by the destination host.

  6 - But what of the "ITR" function - the router somewhere which
      modifies the packet header to install the ASN?

      In the current DFZ, this could be done by a suitably modified
      DFZ router.  All DFZ routers know about all advertised prefixes
      and in the RIB you can easily see the ASN of the BR which
      advertised the prefix.

      But I think a major goal of your proposal is to avoid all these
      scalable PI prefixes from being advertised in the DFZ (3
      above).

      In that case, the "ITR" function can't be a DFZ router of the
      new kind. It has to look up a mapping database - either within
      itself or at some local or distant query server system.

      The mapping function accepts the packet's destination address
      and returns an ASN number.  Before this, the "ITR" function
      needs to recognise whether the destination address matches
      a conventionally advertised prefix.  So I guess if the
      packet's destination address matches a prefix in the
      conventional IP address FIB, forward it that way.  If not,
      look up the mapping function.  If the mapping is to an ASN,
      modify the packet.  If there is no such mapping, drop the
      packet.

      So these "ITR" devices need to have a complete copy of the
      mapping database, or to be able to query something which
      has it and to cache the result in its "FIB", or whatever
      it uses to handle incoming packets.

      So if your aim is to remove these prefixes from the DFZ, then
      your ITR-like devices do resemble an ITR of LISP, APT etc.
      in that there needs to be a global mapping database, which
      tells the ITRs how the prefixes of scalable end-user address
      space are moved from one ISP to another.

      Then, you get into questions of multihoming service
      restoration.  How does your system respond to an end-user
      network no longer being able to use the link from its ISP
      with ASN AAA, and needing its packets to be forwarded to
      another ISP with ASN BBB?

      If this changed requirement could be transferred to
      all your "ITR" devices in real-time, that would solve
      the problem pretty well.  Ivip does it this way.  The
      other core-edge separation schemes assume this is either
      impossible or undesirable.  So the ITR gets mapping in the
      form of "send these packets to AAA, unless it is unreachable
      - in which case send them to BBB".  Then the ITR has to
      do reachability testing and it gets very complex.  Ivip
      ITRs are simpler.  They do not test for reachability to ETRs.
      They only have one ETR address and they tunnel packets to that
      address.

      If your ITR was like a current DFZ router, then it could
      find out about AAA being unreachable and BBB being the new
      ASN to forward the packets to, by allowing the current
      DFZ to propagate best paths to it.  However, I think the
      main benefit of your scheme would be to avoid advertising
      all these end-user prefixes in the DFZ - so this won't
      work.

      I think that to be really useful for scaling, your
      scheme would need a separate (not BGP-based) mapping
      arrangement by which the ITR functions could quickly and
      reliably decide which ASN to forward the packets to.  I think
      this requires some separate mapping system.
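The "ITR" decision described in point 6 can be sketched as follows.
This is only an illustration under stated assumptions: both tables are
plain dictionaries keyed by prefix, the mapping is held locally rather
than queried, and itr_decide(), longest_match() and the ASNs standing
in for AAA and BBB are hypothetical.

```python
import ipaddress

def longest_match(table, dst):
    """Return the value of the longest prefix in `table` containing dst."""
    addr = ipaddress.ip_address(dst)
    best = None
    for prefix, value in table.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, value)
    return None if best is None else best[1]

def itr_decide(dst, conventional_fib, mapping):
    """Apply the three-way decision: forward, rewrite with an ASN, or drop."""
    next_hop = longest_match(conventional_fib, dst)
    if next_hop is not None:
        return ("forward", next_hop)   # conventionally advertised prefix
    asn = longest_match(mapping, dst)
    if asn is not None:
        return ("rewrite", asn)        # install this ASN in the header
    return ("drop", None)              # no route and no mapping

AAA, BBB = 64500, 64501                # stand-ins for the two ISPs' ASNs
conventional_fib = {"198.51.100.0/24": "peer-a"}
mapping = {"203.0.113.0/28": AAA}      # scalable PI prefix, mapped to AAA

before = itr_decide("203.0.113.5", conventional_fib, mapping)

# Multihoming service restoration: the mapping system pushes a new ASN.
mapping["203.0.113.0/28"] = BBB
after = itr_decide("203.0.113.5", conventional_fib, mapping)
print(before, after)
```

How quickly the mapping change reaches every such device is the open
question: the sketch simply assumes the update arrives.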

In the mapping systems of core-edge separation schemes, a subset of
the IP address range returns a positive response from the mapping
system.  These are what I call the "Scalable PI" (SPI) addresses -
that subset of the address range which is used by end-user networks
for scalably getting address space which is portable and gives them
multihoming and inbound TE.  (This may also be thought of as the
"edge" subset, or the "identifier" subset of the whole IP address
range.)  This subset of the address space would be many separate
prefixes scattered through the range.

The remainder are "core" (AKA "locator" or "RLOC" in LISP), since
these core addresses continue to function as they do today, with all
routers, including DFZ routers, using these addresses to forward
packets all the way to their destination.  Core addresses can still
be used for end-user network PI space, but this is not scalable.

The usual arrangement for a core-edge separation scheme is to
encapsulate the packet and tunnel it to an ETR.  ETRs are always on
"core" addresses.

The total packet, now longer due to the one or more headers used to
encapsulate it, is handled in the same way by DFZ routers as any
other packet, so the DFZ routers don't need any upgraded functionality.

With your suggested scheme, there is no encapsulation and the packet
is not made any longer.  The DFZ routers forward the packet to the
"nearest" BR of the currently mapped ASN for the packet's destination
prefix.  So there's no encapsulation overhead.  If the packet is too
long for one of these DFZ routers, it will presumably convert the
packet back to its original form and send a conventional ICMP Packet
Too Big message, so the sending host will get this message and retry
with a shorter packet.  So there are no thorny PMTUD problems, as
there are with encapsulation.

With what you are suggesting, the DFZ routers need to be upgraded to
recognise the new packet format.  This raises major problems for
widespread voluntary adoption - but I think it is well worth
exploring.  In the long term, it will cost little to progressively
upgrade all DFZ routers with the new function.  So in the long-term,
a core-edge separation scheme could be migrated to this modified
header arrangement without significant cost.   Its just a matter of
making sure all new routers perform the function.  When all DFZ
routers are of this type, the system can be turned on.

I think that using modified headers only makes sense if the core-edge
separation scheme does not need to send any extra information along
with the traffic packet.  LISP does send extra info with each traffic
packet - in a LISP header which itself is behind a UDP header.  Ivip
does not send any extra information and uses simple IP-in-IP
encapsulation.  (I am not sure whether APT or TRRP send extra data.)

If the core-edge separation scheme needs to send extra data with each
traffic packet, then most of the benefits of modifying the header are
lost, since there would still need to be some additional header,
which lengthens the total packet and causes a lot of trouble with
Path MTU Discovery.  Avoiding the encapsulation overhead and PMTUD
complexities are extremely worthwhile goals.

Sorry this was rather long, but I wanted to say that I think modified
header forwarding is worth exploring - and I tried to imagine what
your scheme would involve.

Since you are prepared to consider upgrading all DFZ routers and
modifying the IP header, here are my two proposals which do this.

They are both extensions to Ivip - one for IPv4 and one for IPv6.

ETR Address Forwarding (EAF) - for IPv4

  http://tools.ietf.org/html/draft-whittle-ivip4-etr-addr-forw

  Do away with fragmented packets in the DFZ.  If any host
  sends a fragmentable packet above a certain length - somewhere
  between 1000 and 1500 bytes - it will be dropped.  This should
  not cause much trouble since well written applications have
  been able to use RFC 1191 PMTUD since 1990.  (Google servers
  regularly put out long fragmentable packets, but that could
  easily be changed.)

  Do away with header checksums.  IPv6 gets along OK without them.

  Now we can set the Evil Bit 48 to 1, indicating the header has
  the new EAF format.

  We now have 30 bits to play with:

      1 bit  "More Fragments"
     13 bits "Fragment Offset"
     16 bits "Checksum"

  These 30 bits are now used to specify the IP address of the ETR.

  The DFZ routers recognise such packets and use this 30 bit
  ETR address for forwarding, with their conventional IP address
  FIB, rather than the destination address.

  The DFZ routers do ordinary BGP and run their FIBs as usual
  - with just this potential conversion step added.  The ETR
  addresses are all "core" addresses - in prefixes of ISPs.
  So there's no need to advertise each individual prefix of
  millions of end-user networks using scalable PI space.

  ETRs need to be on addresses ending in binary 00.  They
  reconstruct the original state of the header and forward
  the packet to the destination end-user network.
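A rough sketch of the bit packing described above.  The exact
placement of the 30 bits is an assumption made here for illustration
(the upper 14 bits go into the More Fragments and Fragment Offset
positions, the lower 16 bits into the checksum field); the draft may
lay them out differently.  eaf_encode() and eaf_etr_address() are
hypothetical names.

```python
import ipaddress
import struct

EAF_FLAG = 0x8000   # reserved flag: bit 48 of the IPv4 header
DF_FLAG = 0x4000    # Don't Fragment, left untouched

def eaf_encode(header: bytes, etr: str) -> bytes:
    """Rewrite a 20-byte IPv4 header so it carries a 30-bit ETR address."""
    addr = int(ipaddress.IPv4Address(etr))
    if addr & 0b11:
        raise ValueError("ETR address must end in binary 00")
    bits30 = addr >> 2
    (flags_frag,) = struct.unpack_from("!H", header, 6)
    if flags_frag & 0x3FFF:
        raise ValueError("fragments cannot be converted")
    out = bytearray(header)
    # Assumed layout: upper 14 bits replace More Fragments + Fragment Offset.
    struct.pack_into("!H", out, 6, EAF_FLAG | (flags_frag & DF_FLAG) | (bits30 >> 16))
    # Assumed layout: lower 16 bits replace the header checksum.
    struct.pack_into("!H", out, 10, bits30 & 0xFFFF)
    return bytes(out)

def eaf_etr_address(header: bytes):
    """Return the ETR address a DFZ router would forward on, or None."""
    (flags_frag,) = struct.unpack_from("!H", header, 6)
    if not flags_frag & EAF_FLAG:
        return None                      # conventional header
    (low16,) = struct.unpack_from("!H", header, 10)
    bits30 = ((flags_frag & 0x3FFF) << 16) | low16
    return str(ipaddress.IPv4Address(bits30 << 2))

# A sample header: DF set, checksum left as zero for brevity.
header = struct.pack(
    "!BBHHHBBH4s4s", 0x45, 0, 40, 0, DF_FLAG, 64, 6, 0,
    ipaddress.IPv4Address("198.51.100.1").packed,
    ipaddress.IPv4Address("203.0.113.5").packed,
)
converted = eaf_encode(header, "192.0.2.8")
print(eaf_etr_address(converted))
```

The destination address bytes are not touched, so the ETR only has to
clear the flag and the borrowed fields and recompute the checksum.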

Prefix Label Forwarding (PLF) - for IPv6

  http://www.firstpr.com.au/ip/ivip/ivip6/

  Grab the 20 bit Flow Label and put it to good use.  A non-zero
  Flow Label indicates this packet has the modified IPv6 header.

  One way of using this would be to use all 2^20 code points to
  refer to ETRs which are reachable via up to 2^20 separately
  advertised prefixes.  These are prefixes advertised by ISPs, each
  one for either a single ETR, or multiple ETRs via some further
  mechanisms I won't mention here.

  2^20 is an upper limit.  Hopefully the system would work fine
  with a few hundred thousand such prefixes.

  In this first approach we have a block of IPv6 address space which
  is divided into 2^20 equal-sized prefixes, and assign these to ISPs
  by some means.  Then, the 20 bits in the modified header are used to
  alter the FIB algorithm from using the destination address (which
  remains unchanged, being the scalable PI address of the destination
  host) to using the address extended from these 20 bits.

  The rest of the FIB operation remains the same.
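A small sketch of this first approach.  The block used here
(2001:db8::/32, the documentation prefix) is purely hypothetical, as
are the function names; the only points taken from the text are that a
non-zero 20-bit label selects one of 2^20 equal-sized prefixes and that
the FIB then uses an address in that prefix instead of the destination
address.

```python
import ipaddress

PLF_BLOCK = ipaddress.IPv6Network("2001:db8::/32")  # hypothetical block
LABEL_BITS = 20
PLF_PREFIX_LEN = PLF_BLOCK.prefixlen + LABEL_BITS   # /52 for a /32 block

def plf_forwarding_prefix(flow_label: int):
    """Map a 20-bit label to its prefix within PLF_BLOCK (None if label is 0)."""
    if not 0 <= flow_label < 2**LABEL_BITS:
        raise ValueError("flow label must fit in 20 bits")
    if flow_label == 0:
        return None                     # zero label: conventional header
    base = int(PLF_BLOCK.network_address)
    addr = base | (flow_label << (128 - PLF_PREFIX_LEN))
    return ipaddress.IPv6Network((addr, PLF_PREFIX_LEN))

def fib_lookup_key(dst: str, flow_label: int):
    """Address the FIB should match on: label-derived, else the destination."""
    prefix = plf_forwarding_prefix(flow_label)
    if prefix is None:
        return ipaddress.IPv6Address(dst)
    return prefix.network_address

print(plf_forwarding_prefix(1))
print(fib_lookup_key("2001:db8:ffff::1", 0))
```

Because the label is only expanded into an ordinary address, the
existing longest-prefix lookup can be reused unchanged.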

  There are other ways of using the bits.  The way I am currently
  thinking of is to reserve 2^19 code points for use in the
  way just described - to forward the packet across the DFZ to
  a BR which advertises a particular prefix.  Then reserve the
  other 2^19 code points for private use within each ASN, for
  whatever purposes the network chooses.

  It would also be possible to use all 2^20 code points as
  described above in the DFZ and let ASNs use them for whatever they
  like inside their own networks, but for robustness I think it is
  best to keep the namespaces separate.
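The split into two halves could be sketched like this.  Which half is
used for which purpose is not stated above, so the choice of the lower
half for DFZ forwarding is an assumption, and classify_flow_label() is
a hypothetical name.

```python
DFZ_CODE_POINTS = 2**19   # assumed: labels 1 .. 2^19 - 1 are for the DFZ

def classify_flow_label(label: int) -> str:
    """Classify a 20-bit Flow Label under the assumed half-and-half split."""
    if not 0 <= label < 2**20:
        raise ValueError("flow label must fit in 20 bits")
    if label == 0:
        return "conventional"   # zero label means an unmodified header
    if label < DFZ_CODE_POINTS:
        return "dfz"            # forward across the DFZ to a BR's prefix
    return "private"            # meaning defined inside each ASN

print(classify_flow_label(5), classify_flow_label(2**19))
```

Since a zero label marks an unmodified header, one code point in
whichever half contains zero is not available for forwarding.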

Both these schemes keep the FIB of DFZ routers substantially the same
- whereas I think yours involves significant additions, by way of a
new kind of lookup in the ASN namespace, instead of the IP namespace.

Both of them keep BGP going as it does today - except that there is
no need to advertise the prefixes of scalable PI end-user address
space, just as is the case with Ivip etc. using encapsulation.

> A solution which can be incrementally deployed would be vastly superior
> to a solution which requires a flag day since a flag day is pretty much
> impossible in the current internet.

"Incremental deployment" turns out to mean different things to
different people.  I thought it meant "substantial benefits to early
adopters, irrespective of how few adopters there were, while being
backwards compatible with existing systems and not disrupting
anything".  But other people had a much less restrictive
understanding of the term, so I don't use it any more.

The trouble with your scheme and my two is that they only produce
benefits when all, or essentially all, DFZ routers are upgraded.

An ITR can only convert a packet to the new format if it can be sure
that all routers (DFZ and any internal routers at its end, or the
other) between itself and the ETR will handle the new format.

If there are networks which don't have ETRs, then it doesn't matter
if their BRs and any DFZ routers serving them don't have the
upgrades.  However, every other DFZ router needs to be upgraded.

So I think these schemes are very hard to introduce with a new system.

Ideally, we could introduce Ivip with PLF on IPv6.  It will be a long
time before we really need a scalable routing solution on IPv6, so it
would be great to do it this way from the start.  This would mean we
don't have to worry about PMTUD problems at all, or about
encapsulation overhead.  ITRs and ETRs could be very simple.
Encapsulation overhead with IPv6 is much worse than in IPv4,
especially for short packets such as VoIP.  At least Ivip uses simple
IP-in-IP encapsulation.  LISP uses UDP and its own LISP header too,
so the encapsulation can easily be longer than the VoIP payload.
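To put rough numbers on this, here is a short calculation.  The header
sizes are the standard fixed sizes (IPv4 without options); the 8-byte
LISP header and the 20-byte VoIP payload (G.729 with 20 ms per packet)
are assumptions chosen for illustration.

```python
IPV4_HEADER = 20   # bytes, no options
IPV6_HEADER = 40   # bytes, fixed header
UDP_HEADER = 8
LISP_HEADER = 8    # assumed size of the LISP header
RTP_HEADER = 12
VOIP_PAYLOAD = 20  # assumed: G.729 voice, 20 ms per packet

# Extra bytes added in front of the original packet by each scheme.
overheads = {
    "Ivip IP-in-IP, IPv4": IPV4_HEADER,
    "Ivip IP-in-IP, IPv6": IPV6_HEADER,
    "LISP, IPv4": IPV4_HEADER + UDP_HEADER + LISP_HEADER,
    "LISP, IPv6": IPV6_HEADER + UDP_HEADER + LISP_HEADER,
}

for name, extra in overheads.items():
    ratio = extra / VOIP_PAYLOAD
    print(f"{name}: {extra} bytes, {ratio:.1f}x the voice payload")
```

Under these assumptions every IPv6 case adds more bytes than the voice
payload itself, and Modified Header Forwarding would add none.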

Ideally we could upgrade all IPv4 DFZ routers, with a firmware
upgrade, before introducing Ivip.  The change is purely in the FIB.
The longer this scalable routing debate goes on, the more DFZ routers
should be fully firmware based and the more it will be practical to
upgrade the lot of them by the time a scalable routing solution is
actually introduced.

If this can't be done - if Ivip is introduced with encapsulation for
either IPv4 or IPv6 or both, then the systems should be designed for
long-term transition to Modified Header Forwarding, at some time in
the future when all the DFZ routers have the upgraded functionality.
This need not cost much - AFAIK, it is purely a firmware matter for
modern routers, and is not particularly complex.

> I am glad to see IETF working on this problem, but, I believe that all of the
> current solutions are vastly more complicated than they need to be, and,
> encompass changing a lot more than needs to be changed.

I think that if you worked on your proposal in detail, with the goal
of removing all these end-user prefixes from the DFZ, then you would
have some thorny problems to solve with the mapping system and how
the ITRs would forward packets to BBB instead of AAA for multihoming
service restoration.   You could do it with something like Ivip's
real-time mapping system, or you could do it with the slow approach,
and more complex mapping and more complex ITRs of LISP, APT and TRRP.

Then, I don't think your system would be significantly simpler than
these proposals.

I think it would be better to use one of my approaches, rather than
just the ASN, because I am pretty sure that ASNs want to be able to
specify which of their BRs gets packets addressed to a given prefix.
Ivip and the other core-edge separation schemes achieve this, but
yours, as far as I know, will send packets to any BR of the ASN
which advertises the prefix at even one of its BRs.  Also, you would
have to figure out what to do if the BRs of two ASNs advertise the
same end-user prefix.

  - Robin          http://www.firstpr.com.au/ip/ivip/