[arin-ppml] Modified Header Forwarding for scalable routing

Wed Dec 23 07:15:39 EST 2009

Owen replied to me offlist, but with his permission I am responding
on-list.  I am trying to understand and critique his proposal - and
contrast it with my two proposals.

Hi Owen,

Thanks for your reply, which is here in full, but with parts of my
message deleted.

>>> Currently, we make those decisions based on the IP prefix, thus overloading
>>> the IP address's task as an end-system identifier with additional topological
>>> locator semantics.
>>
>> Another way of saying this is that the IDR (Inter Domain Routing)
>> system can only use the same bits, or a subset of the same bits, for
>> forwarding each packet as we use for identifying hosts for
>> host-to-host communications.
>
> No, that says something different.

OK.

> My point is that there are two uses of IP addresses: end system identifier, which
> wants to be long duration in nature and possibly transitory with regard to
> topology, such as when a business changes one or more ISPs, and topological
> locator, which works best when there is a single topological locator for each
> given unique zone of network administrative control, e.g. autonomous system.

However we say it, I think we agree that expecting all DFZ routers to
use the IP address, as currently used to identify hosts and sessions,
is not scalable when large numbers of end-user networks want their
own space which is portable and suitable for multihoming.

>> I think that both the IPv4 and IPv6 headers could be modified to
>> contain an ASN.
>
> Sure, but, I don't think it's necessarily worth doing for IPv4.

OK.  I am working on both - and I think there's a real scaling
problem in IPv4 today, with it being difficult to get portable,
multihomable space, and with the only approaches to multihoming both
burdening the entire DFZ and only working for 256 or more IP
addresses.  I think this /24 chunk approach to multihoming and
portability is probably driving address depletion - assuming many
networks which want these things could also get by fine with less
than 256 IPv4 addresses.

>> I understand from this that you are suggesting that packets would be
>> emitted normally by sending hosts, but that they would all be
>> processed by something resembling an ITR (Ingress Tunnel Router) of
>> the core-edge separation schemes (LISP, APT, Ivip and TRRP).  This
>> ITR function would be in the sending host or be in a router in the
>> ISP or end-user network where the sending host is located.  The ITR
>> function could also be out in the DFZ: a router advertising prefixes
>> which match the unmodified packets' destination address.  (This is
>> Ivip's Open ITR in the DFZ - OITRD - or LISP's "Proxy Tunnel Router".)
>
> Correct.
> 
>> In all cases, the ITR would alter the format of the packet header in
>> a way that suitably modified routers in the DFZ (and also in ISP and
>> end-user networks, if the ITR function was within those networks, or
>> in the sending hosts) would forward the packet according to this new
>> information.
>
> Correct.
> 
>> At some point, the modification would have to be reversed, to
>> reconstitute the packet ready for the destination host.  In core-edge
>> separation schemes, decapsulation is done by an ETR (Egress Tunnel
>> Router).  The ETR needs to be at some point in the destination ISP or
>> end-user network where the packet in its original form will be
>> forwarded to its proper destination - not to an ITR which would
>> modify it again.
>
> Yes, when it reaches the destination ASN or the first router that needs
> to forward it to a router which has not passed this capability as a
> negotiated feature in its neighborship in BGP was my thinking
> on the decapsulation point.
> 
>> Maybe it would be good enough to use the ASN directly, since a large
>> network, or a network with multiple topologically diverse border
>> routers would use the same ASN for all those border routers and
>> assuming the ASN is happy to accept incoming traffic for any of its
>> internal destinations on any of its BRs.
>
> Correct... That is my thinking.
> 
>> However, it would be possible to extend your proposal to use a
>> different number than the ASN to control the forwarding of the packet.
>
> Sure, it's possible, but, that would require information that is not currently
> available in BGP4.  

Yes, but my aim is to do it at an ITR, which is not necessarily
connected with the BGP DFZ, and while also allowing a reduction in
the number of routes in the DFZ.

> OTOH, the AS-PATH/next-hop data already present
> in BGP4 would allow an ASN-Based FIB to be constructed from the
> existing RIB without modification to the wire protocol for BGP4.

Yes - I think the router would look at the path of the prefix and use
only the last ASN.
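
As a rough sketch of what I understand by this (not Owen's actual
design), an ASN-based FIB could be derived from an existing RIB by
taking the last ASN of each AS path as the key.  The RibEntry type,
the build_asn_fib() name, the ASNs and the addresses are all made up
for illustration, and real BGP best-path selection considers much
more than path length:

```python
from typing import Dict, List, NamedTuple


class RibEntry(NamedTuple):
    """Hypothetical, much-simplified RIB entry."""
    prefix: str          # advertised prefix
    as_path: List[int]   # AS path as received, origin ASN last
    next_hop: str        # neighbour the route was learned from


def build_asn_fib(rib: List[RibEntry]) -> Dict[int, str]:
    """Map each origin ASN (last ASN in the path) to one next hop.

    Assumption: the shortest AS path wins; ties keep the first seen.
    """
    best: Dict[int, RibEntry] = {}
    for entry in rib:
        origin = entry.as_path[-1]
        current = best.get(origin)
        if current is None or len(entry.as_path) < len(current.as_path):
            best[origin] = entry
    return {asn: entry.next_hop for asn, entry in best.items()}


rib = [
    RibEntry("192.0.2.0/24", [64500, 64510, 64520], "10.0.0.1"),
    RibEntry("198.51.100.0/24", [64501, 64520], "10.0.0.2"),
    RibEntry("203.0.113.0/24", [64500, 64530], "10.0.0.1"),
]
asn_fib = build_asn_fib(rib)
```

Three prefix routes collapse into two ASN entries here, which is the
kind of reduction the ASN-based FIB is meant to provide.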

> Routers participating in this type of forwarding and understanding
> these new format packets would pass this capability as a negotiated
> feature much like 32-bit ASN capability today.

I find this interesting: converting subsets of the DFZ to
handle a modified header scalable routing system, which means not
necessarily requiring all DFZ routers to be upgraded before the
system can be used.

>> What you are suggesting is something like a core-edge separation
>> scheme - but maybe it is something different, since within the core,
>> you are forwarding packets according to an ASN number which is added
>> to the packet somewhere.  ASN is a different namespace from IP
>> address, whereas core-edge separation schemes keep the one namespace
>> (IP addresses) and use one subset for scalable end-user network space
>> ("edge" addresses) and retain the rest as "core" addresses, for ISPs
>> and end-user networks with conventional PI space.
>
> Correct.  The problem I see with the other core-edge separation schemes
> is:
> 	1.	The encapsulation is much higher overhead since you are adding
> 		a lot more data to each packet, and, it's essentially a full tunneling
> 		implementation.  Mine is much lighter weight, along the lines
> 		of ATM encapsulation without the limitation of the small cell
> 		size.

It's lighter-weight than ATM, MPLS or any kind of IP encapsulation -
since the packet doesn't get any longer.  It is zero weight!

> 	2.	The routing semantics for people attempting to debug routing
> 		problems become much more complex in the other schemes.
> 		With my "stuff the ASN in the packet and forward on that where
> 		you can" scheme, you have the following advantages:
> 
> 		1.	Can be deployed incrementally
> 		2.	The routing semantics remain largely the same and the
> 			FIB is computed from the same existing RIB.
> 		3.	The same commands we use for debugging today still
> 			give you the same information you need. (mostly)

OK - I will concentrate on trying to understand and critique your
scheme, but I agree debugging is an important issue.

>> Your proposal is not exactly a core-edge elimination scheme - since
>> they tend to change the hosts so the applications address hosts by
>> one kind of address ("edge") and the hosts are physically accessed by
>> another kind "core", with the routing system operating much as it
>> does today, forwarding according to destination address, where this
>> is of the physical address of the host ("core" address AKA "locator")
>> and is unrelated to its logical application level address ("edge" AKA
>> "identifier").
>
> Correct.
> 
>> Anyway, as I understand it, your proposal needs some things
>> resembling ITRs, which accept traffic packets with certain ranges of
>> destination addresses (that of the "edge" subset of all IP addresses)
>> and do something to the packet to get them to an appropriate ETR (or
>> whatever it is which converts the packet to its original form) near
>> the destination end-user network.  This involves a "mapping lookup"
>> to produce something which will enable the packet to be forwarded to
>> the ETR.
>
> Correct, although there is no reason the "certain range" could not be ::/0.

OK - here is a big difference between a core-edge separation scheme
and your scheme:

  In a core-edge separation scheme, one subset of the address range
  is given the new title "edge" - I call it "Scalable PI" (SPI).
  This is for end-user networks only and is portable, multihomable
  and usable for inbound TE on a scalable basis.  Therefore it works
  very differently from the only current approach to this:
  advertising each end-user network prefix in the DFZ.

  In practice, with Ivip at least, this subset would include many
  prefixes scattered through the unicast address space.  Within each
  such prefix, there would be the SPI address space used by many
  end-user networks.  With Ivip, the space can be divided down into
  ranges of any number of IPv4 addresses or IPv6 /64s, so each
  separately mapped range is not necessarily a binary boundary
  prefix.

  The rest of the address range remains as it is, and is known as
  "core" address space.

  The ITRs only encapsulate packets whose destination address matches
  one of the prefixes in the SPI subset.  They tunnel these packets
  to ETRs, which always have "core" addresses.

  In your scheme, parts of the DFZ and ultimately the whole DFZ
  change in the way they handle all packets.  There is no concept of
  "core" or "edge".  I assumed there was this distinction because I
  assumed you were trying to reduce the number of routes in the RIB
  of DFZ routers.  But as you write below, this is not the case -
  except perhaps after complete adoption.

  So your scheme is an overall change to the DFZ which alters the
  current functionality of routers forwarding packets to specific
  BRs - the ones which advertise the prefix.  Instead, the packets
  are forwarded to the BR of the ASN which the one or more advertising
  BRs are a part of.  As you write below, "ASN" does not mean "ISP" -
  an ASN could be a subset of the ISP's network.

>> With LISP, APT, Ivip and TRRP, the mapping lookup produces an ETR
>> address, and the packet is encapsulated so it is tunneled to the ETR.
>>
>> In your proposal, the mapping produces an ASN of the ISP which the
>> destination network is connected to.  The DFZ forwards the packet to
>> the correct ASN - to any border router of that ASN, typically the
>> "closest" (in terms of how the DFZ routers choose their paths).
>
> The mapping lookup would, in my thinking, produce one or more
> destination ASNs, possibly with preferences, ala MX records.

OK - so you anticipate the ITR (or whatever you call the device which
alters the format to the new one, putting an ASN in the header) would
somehow obtain mapping which gave it either one ASN, or multiple ASNs
with priorities.

Because you anticipate separate parts of the DFZ being converted to
your system, these "ITRs" would frequently be DFZ routers.  Perhaps
every "ITR" would be a DFZ router.

I guess the "mapping" information would be derived directly from what
the "ITR" device already has at hand via BGP.  For instance, if the
router received two paths for a given prefix, each ending in a
different ASN, then the mapping would contain the two ASNs, with
there being some algorithm for deciding the priorities of each.

How would the ITR function decide which ASN to use when there were
more than one in the mapping?

In LISP, APT and TRRP - and any other core-edge separation schemes I
know of apart from Ivip - the ITR gets a prioritized list of ETR
addresses and tests reachability to the highest priority one.  If
that is not reachable, it tunnels packets to the next highest
priority one which it is able to reach.  (I think this is very
messy - Ivip ITRs get a single ETR address and always tunnel to that.)
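
To make the contrast concrete, here is a minimal sketch of the
prioritized selection described above.  The choose_etr() name, the
shape of the mapping and the is_reachable callback are my own
illustration, not taken from any of the LISP, APT or TRRP documents,
and I assume a lower number means a higher priority:

```python
from typing import Callable, List, Optional, Tuple


def choose_etr(mapping: List[Tuple[int, str]],
               is_reachable: Callable[[str], bool]) -> Optional[str]:
    """Return the best-priority ETR address that is reachable, else None.

    Assumption: lower priority number = more preferred.
    """
    for _priority, etr in sorted(mapping):
        if is_reachable(etr):
            return etr
    return None


# Hypothetical mapping for one end-user prefix: (priority, ETR address).
mapping = [(20, "203.0.113.7"), (10, "192.0.2.1")]
unreachable = {"192.0.2.1"}
chosen = choose_etr(mapping, lambda etr: etr not in unreachable)
```

The hard part is hidden inside is_reachable(): the ITR has to probe
each ETR somehow, which is where the mess comes from.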

But your "ITR" function doesn't get from the mapping information an
IP address to tunnel packets to - it gets one or more ASNs.  The BGP
information in the DFZ router may have multiple paths for a prefix,
with various ASNs and with different BRs within the one ASN.  But
that doesn't help your system at all, I think, since you are not
deciding which BR to send the packets to - just which ASN.  I can't
think of a way your "ITR" function could predict exactly which BR of
the ASN the packet would be forwarded to, so I can't think of how you
could test reachability to any one BR, or to the ASN in general.

> The rest is generally correct, with the caveat that along the way
> the packet may be reverted to prefix forwarding and then
> re-encapsulated again as it traverses areas of capable and
> non-capable router peerings.

OK - I hadn't anticipated this when I replied to you.

Your plan involves some DFZ routers accepting an ordinary packet and
performing what I call an "ITR" function on it.  However that term is
for tunnelling to a specific end-point.  In your system this router
changes the header format to contain an ASN, and then forwards it to
a neighbour which has previously been determined as being the best
path towards any BR of this ASN.

The modified format packet may then be forwarded in the same way by
the next DFZ router and it may be forwarded this way to the BR, which
does the "ETR" function and turns the header back to a normal state.

However, the packet may reach a DFZ router, which has been upgraded
to handle your system, and that router decides one of:

  1 - The next hop for the packet (based solely on the ASN in its
      header) is not a BR of that ASN.

  2 - It is a BR of the ASN, but that this BR hasn't been upgraded to
      handle the modified packets.  This can be determined by the
      BR's BGP messages to this DFZ router.

  3 - The next hop router is not the BR of the ASN and has not been
      upgraded to handle the modified packets.

In all three cases, the router now performs an "ETR" function: it
restores the header to the normal state and then forwards the packet
according to its conventional destination address FIB algorithm.

So a packet could pass through one "island" of modified header
forwarding, to a DFZ router which causes it to be forwarded normally,
and then to one or more other islands before arriving at the BR.
When it gets to the BR, it may be in ordinary form or modified header
form.
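
Here is a rough sketch of the per-hop decision as I understand it
from Owen's rule earlier in this message: keep the modified header
only while the next hop has negotiated the capability, and otherwise
restore the header and forward on the destination prefix.  The
forward() function, its arguments and the neighbour names are all
hypothetical, and the prefix lookup is reduced to a dictionary keyed
by an already-matched prefix:

```python
from typing import Dict, Set, Tuple


def forward(dest_asn: int, dest_prefix: str, local_asn: int,
            asn_fib: Dict[int, str], prefix_fib: Dict[str, str],
            capable_neighbours: Set[str]) -> Tuple[str, str]:
    """Return (header_form, next_hop) for one packet at an upgraded router."""
    if dest_asn == local_asn:
        # We are in the destination ASN: restore the header ("ETR" function).
        return ("normal", prefix_fib[dest_prefix])
    next_hop = asn_fib.get(dest_asn)
    if next_hop in capable_neighbours:
        # Still inside an island: keep forwarding on the ASN.
        return ("modified", next_hop)
    # Leaving the island: restore the header and use the prefix FIB.
    return ("normal", prefix_fib[dest_prefix])


asn_fib = {64520: "nbr-1", 64530: "nbr-2"}
prefix_fib = {"2001:db8:1::/48": "nbr-3", "2001:db8:2::/48": "nbr-2"}
capable = {"nbr-1"}

inside = forward(64520, "2001:db8:1::/48", 64500, asn_fib, prefix_fib, capable)
leaving = forward(64530, "2001:db8:2::/48", 64500, asn_fib, prefix_fib, capable)
arrived = forward(64500, "2001:db8:1::/48", 64500, asn_fib, prefix_fib, capable)
```

Note that the router needs both FIBs for this to work, which matters
for the scaling discussion further on.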

>> As I understand it, the effect of your proposal would be:
>>
>>  1 - If an ASN advertises a prefix on any one of its BRs, it
>>      must be prepared to accept packets matching this prefix
>>      on all of its BRs.
>>
>>      I suspect this is not what any ISP wants - since it gives
>>      it no control over where incoming traffic arrives.  But
>>      for the purpose of discussion, I will assume your scheme
>>      is desirable.
>
> Perhaps an ISP that wants that kind of control needs more than
> one ASN.  Many ISPs have multiple ASNs, so, let's talk about how
> we define the term Autonomous System.  You are using the term
> above as synonymous with ISP.  Let us instead state that an AS
> is a collection of prefixes with a common routing policy.
> 
> Thus, if you want different ingress rules for prefix A than for
> prefix B, prefixes A and B should be in different Autonomous
> systems.

OK.

This would probably lead to many ISPs wanting a significant number of
ASNs.  Maybe one router can advertise some routes with one ASN and
some with another - I am not sure if BGP allows this.  If not, then
the ISP would face some tricky questions deciding how to partition
all its BRs into one ASN or another.

I can't see how either approach would be an improvement for ISPs over
the current arrangement where a BR advertises a prefix and the DFZ
forwards packets to that BR - or to as many BRs as advertise the same
prefix.  It seems your proposal would greatly reduce the ISP's
control of which BR packets arrive at.

> The gain here is that you have N FIB destinations where N is
> the number of ingress policies rather than N*X FIB destinations
> where N is the number of ingress policies and X is the average
> number of prefixes represented in each policy.

OK  - in the modified routers IF you could do away with the
conventional IP-address based FIB, then the ASN-based FIB would be
simpler because there are a smaller number of ASNs than there are
prefixes.  (At least there is supposed to be - what if an ISP decided
it wanted to retain its former level of control and therefore
registered a separate ASN for each prefix, including each prefix it
advertises of an end-user network?)

> So, yes, this will require some TE users today to modify their
> TE tactics, but, they should be able to preserve their strategy
> with only minor modifications to the network for new tactics
> at least in most cases I know of.

I still don't see how your system would improve the inbound TE
control when compared to the current arrangements.

>>  2 - The DFZ routers wouldn't need a large FIB - at least for
>>      these modified packets, since the FIB wouldn't be dealing
>>      with the destination prefixes of these packets which are
>>      addressed to hosts in end-user networks with the new kind
>>      of scalable PI space.  This new section of the FIB would
>>      only match ASNs, and for each ASN, the router would have one
>>      best path.
>>
>>      So your proposal reduces FIB size, but makes it more complex.
>
> Actually, I think it simplifies the FIB as well since the FIB really only
> needs to contain the following now:
> 
> ASN	Next-Hop-IP	Next-Hop-Interface	(policy)

> I'm curious where you see additional complexity.

Yes - but that is only when the DFZ router doesn't need to forward
packets according to their destination address.  Until then, it needs
its full IP-address based FIB plus the new ASN based one.

I (and many other people) think that scalable routing proposals need
to provide continual scaling benefits, in proportion to how much they
are adopted - not just provide improvements once they are 100% adopted.

In a fully adopted scheme, you may be able to get all DFZ routers
using only an ASN FIB, but the "ITR-like" devices which modify the
header need to do this at "wire-speed": in their FIBs.  So you still
need an FIB in the "ITR-like" devices which works from the
destination IP address to determine whether to modify the header, and
if so which ASN it should contain - according to some algorithm which
has previously chosen which of potentially multiple prioritized ASNs
to use.

>>  3 - The DFZ routers may well be running BGP - but for these
>>      end-user network prefixes your scheme works for, it is not
>>      clear why.  Maybe that's the benefit - these end-user
>>      network prefixes should not be advertised in the DFZ at
>>      all.  In that case, you radically reduce the RIB size and
>>      so really do achieve routing scalability, since this is the
>>      main goal of current scalable routing proposals.
>>
>>      However, see point 6.
>>
>>      The DFZ routers would still have RIB and FIB entries for
>>      prefixes not of the scalable end-user PI type covered by
>>      your scheme.  But this is scalable.
>
> You'd probably still have the full RIB, at least for quite some time.
> Obviously, once this became ubiquitous, the EBGP RIB could
> be reduced to AS Path information, but, that would be a long-term
> savings, not something immediate and would require significant
> changes to BGP and more coordination, but, could eventually
> be achieved.

But if the RIB only contains the ASN of the BR advertising the
prefix, how could the router decide which path was shortest, if there
were two or more paths from two or more neighbours for the same
prefix, with the same ASN?  Or if they were for different ASNs?  This
comes back to my question about how the "mapping" data could be made
to include priorities.

> BGP is preserved for two reasons.  It's a convenient method for
> distributing the AS PATH information (the new scalable PI
> addresses for ESIs is the existing IPv6 address in my scheme,
> btw). 

I don't understand the section in brackets - can you elaborate?  Is
this the ATM End System Address mentioned in RFC 2492?

> Second, by preserving BGP as is, it should be possible
> to allow this to be implemented as islands which eventually
> grow until there is more land and the oceans become lakes,
> then puddles, then non-existent as the code is upgraded to
> support this new capability.

OK - I can imagine how your islands would work.  However, I wonder
about potential routing loops.  Maybe in an island the packet gets
forwarded according to one algorithm, which causes it to leave the
island, where another algorithm (the one currently used in the DFZ)
decides its next path.  Maybe that will put it back in the same
island and the process will repeat.

Also, I remain to be convinced that any ISP actually wants to degrade
(in my view) the specificity of its incoming traffic control by
having the DFZ forward packets to any BR in an ASN just because one
BR in the ASN advertises the prefix.

> Further, my understanding of the scaling limits is that the issue
> is with TCAM (FIB only) much moreso than RAM (RIB+FIB), so,
> I'm less worried about shrinking the RIB.

This is what I thought three years ago . . .

Reading the RAWS material:

  http://tools.ietf.org/html/rfc4984
  http://www.iab.org/about/workshops/routingandaddressing/

I thought that the FIB was the big problem, since it had to deal with
every traffic packet, ideally in a fraction of a microsecond.  I
figured BGP and the RIB could cope with many more prefixes, since
this was just non-traffic-related parleys between router CPUs, at a
relaxed pace.

I made a magnificent plan for IPv4 router FIBs to be made with
high-speed static RAM to a resolution of /24 - and wrote it up as an
Internet Draft.  On the RAM list, which followed RAWS and preceded
the current incarnation of the RRG, no-one was interested.  I soon
learned that the BGP control plane and the RIB functions of routers
are the real problem in scalable routing.

I believe that both the FIB and the RIB of all DFZ routers need to
be limited to the prefixes of ISPs and some or all currently
advertising end-user networks, but not lots more.

So if your scheme doesn't alter the RIB situation until there is a
complete conversion in the DFZ, I don't think it would solve the
scaling problem as well as other proposals.

>>  4 - Each DFZ router needs to decide a best path to the "nearest"
>>      ideally (actually any one of them) BR of each ASN.  This is
>>      a different task from what BGP is intended to achieve.  I
>>      suppose if every BR advertised at least one conventional
>>      prefix, then you could extract from the RIB all the information
>>      you need and choose the apparently nearest path to a BR of
>>      each ASN, to write this to this new ASN section of the FIB.
>
> Let's look at this a little differently, because I believe it maps to today's
> forwarding fairly well...
> 
> Let's say that 2620:0:930::200:2/48 maps to ASN1734
> Now, let's say that your AS Paths for 1734 are:
> 	2958 59392 3921 591 392 701 9323 6939 1734
> 	2394 9080 29348 270 892 3749 3948 238472 10565 1734
> 	2348 29837 234 2342 83 283 234283 29384 23423 8121 1734
> 
> You decide, first, that the shortest AS PATH is the one starting 2958.
> Next, you look up your immediate next hop for 2958, and, poof,
> off it goes in that direction.
> 
> Make sense?

OK - that gives your FIB a single mapping result: a particular next
hop neighbour router.
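
Owen's example can be written out directly.  The AS paths are the
ones he gives above.  The neighbour names and the next_hop_for()
function are placeholders I made up, and choosing purely on AS path
length is his simplification, not full BGP best-path selection:

```python
from typing import Dict, List, Tuple

# AS paths towards ASN 1734, copied from the example above.
paths: List[List[int]] = [
    [2958, 59392, 3921, 591, 392, 701, 9323, 6939, 1734],
    [2394, 9080, 29348, 270, 892, 3749, 3948, 238472, 10565, 1734],
    [2348, 29837, 234, 2342, 83, 283, 234283, 29384, 23423, 8121, 1734],
]

# Hypothetical table: first ASN in a path -> directly connected neighbour.
neighbours: Dict[int, str] = {
    2958: "neighbour-a",
    2394: "neighbour-b",
    2348: "neighbour-c",
}


def next_hop_for(paths: List[List[int]],
                 neighbours: Dict[int, str]) -> Tuple[int, str]:
    """Pick the shortest AS path and return (first ASN, neighbour)."""
    best = min(paths, key=len)
    return best[0], neighbours[best[0]]


first_asn, hop = next_hop_for(paths, neighbours)
```

The result is a single next hop for ASN 1734, with no notion of a
second choice if that path stops delivering packets.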

What about priorities?

The whole idea of an end-user network multihoming is to use two or
more physical links and two or more separate ISPs.  Then, the
scalable routing system needs to respond quickly when ISP AAA
disappears, or the link from AAA to the end-user site goes down (two
totally different things, and the latter one may be hard to detect
from an "ITR") so the packets are tunneled to ISP BBB instead.

This needs to happen without changing any DFZ advertisements - and to
achieve scalable routing in terms of RIB and the DFZ control plane,
the end-user prefix can't be advertised in the DFZ anyway.

I don't see how your scheme would respond in the short term to
achieve multihoming failure restoration.  You discuss this more below.

>>  5 - The BRs of all these ASNs would need to recognise the modified
>>      header and reconstruct the packet so it will be forwarded to
>>      and recognised by the destination host.
>
> See above where I described advertising this capability in BGP
> as a negotiated feature, like 32-bit ASNs.

OK.

>>  6 - But what of the "ITR" function - the router somewhere which
>>      modifies the packet header to install the ASN?
>>
>>      In the current DFZ, this could be done by a suitably modified
>>      DFZ router.  All DFZ routers know about all advertised prefixes
>>      and in the RIB you can easily see the ASN of the BR which
>>      advertised the prefix.
>>
> Maybe, but, not ideal in my opinion.
> 
>>      But I think a major goal of your proposal is to avoid all these
>>      scalable PI prefixes from being advertised in the DFZ (3
>>      above).
>
> Correct.

OK - then how does any DFZ router know about the end-user prefix in
order to determine the ASN of the one or more BRs which advertise it?

>>      In that case, the "ITR" function can't be a DFZ router of the
>>      new kind. It has to look up a mapping database - either within
>>      itself or at some local or distant query server system.
>
> It can be a function of any DFZ router, but, we need to provide an
> alternate means of mapping.

Yes - because the prefixes won't be advertised in the DFZ.

>>      The mapping function accepts the packet's destination address
>>      and returns an ASN number.  Before this, the "ITR" function
>>      needs to recognise whether the destination address matches
>>      a conventionally advertised prefix.  So I guess if the
>>      packet's destination address matches a prefix in the
>>      conventional IP address FIB, forward it that way.  If not,
>>      look up the mapping function.  If the mapping is to an ASN,
>>      modify the packet.  If there is no such mapping, drop the
>>      packet.
>
> Sounds reasonable.

OK - but now we are talking about a core-edge separation scheme,
where the "ITR" or whatever has a conventional IP-address-based FIB,
and when it finds a packet's destination address doesn't match any
prefix in that FIB (which contains no SPI "edge" prefixes) then it
uses the mapping function to see if the packet matches an SPI prefix
- and if so what are the one or more ASNs with priorities, to
consider forwarding it towards.
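
The three-way decision we have just agreed on can be sketched as
follows.  The classify() and longest_match() functions and the table
contents are assumptions for illustration only.  I reuse Owen's
2620:0:930::/48 to ASN 1734 example as the one mapped "edge" prefix,
and the lookup is a slow linear scan rather than anything a real FIB
would do:

```python
import ipaddress
from typing import Dict, Optional, Tuple, Union

Value = Union[str, int]


def longest_match(addr: str, table: Dict[str, Value]) -> Optional[Value]:
    """Longest-prefix match of addr against a {prefix: value} table."""
    ip = ipaddress.ip_address(addr)
    best: Optional[Tuple[int, Value]] = None
    for prefix, value in table.items():
        net = ipaddress.ip_network(prefix)
        if net.version == ip.version and ip in net:
            if best is None or net.prefixlen > best[0]:
                best = (net.prefixlen, value)
    return None if best is None else best[1]


def classify(dst: str, prefix_fib: Dict[str, Value],
             mapping: Dict[str, Value]) -> Tuple[str, Optional[Value]]:
    """Forward conventionally, else modify the header, else drop."""
    hop = longest_match(dst, prefix_fib)
    if hop is not None:
        return ("forward", hop)
    asn = longest_match(dst, mapping)
    if asn is not None:
        return ("modify", asn)
    return ("drop", None)


prefix_fib = {"2001:db8:1::/48": "neighbour-a"}   # conventional routes
mapping = {"2620:0:930::/48": 1734}               # mapped prefix -> ASN

r1 = classify("2001:db8:1::5", prefix_fib, mapping)
r2 = classify("2620:0:930::200:2", prefix_fib, mapping)
r3 = classify("2001:db8:2::1", prefix_fib, mapping)
```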

>>      So these "ITR" devices need to have a complete copy of the
>>      mapping database, or to be able to query something which
>>      has it and to cache the result in its "FIB", or whatever
>>      it uses to handle incoming packets.
>
> I was thinking the latter.

OK - only LISP-NERD went so far as to put a full copy of the mapping
database in each ITR.  NERD is no longer being developed.

APT and Ivip have local full-database query servers - local to the
ISP or end-user network which contains the ITRs.  LISP-ALT (and all
the other LISP mapping approaches: CONS, DNS, DHT . . .) use a fully
distributed, non-centralised, and therefore global system of mapping
resolution.  That leads to serious problems with the time it takes to
get the mapping - so initial packets may be dropped or be so delayed
they are more trouble than they are worth.

LISP-ALT routers drop packets for which they have no mapping - and
then send a map request.  Once the map reply arrives, they put that
in their cache and wait for the sending host to send another packet -
which is then encapsulated immediately and tunneled to the ETR.
(This doesn't count whatever testing the ITR does to find which of
multiple ETRs are reachable now . . .)
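
A toy model of that drop-then-cache behaviour, just to show where the
first packets are lost.  The CachingItr class and its method names
are invented for this message and are not from the LISP drafts:

```python
from typing import Dict, Optional, Set, Tuple


class CachingItr:
    """Hypothetical ITR that drops packets until a mapping is cached."""

    def __init__(self) -> None:
        self.cache: Dict[str, str] = {}   # destination prefix -> ETR address
        self.pending: Set[str] = set()    # prefixes with a map request out

    def handle_packet(self, dst_prefix: str) -> Tuple[str, Optional[str]]:
        etr = self.cache.get(dst_prefix)
        if etr is not None:
            return ("tunnel", etr)
        # Cache miss: a map request would be sent here; the packet is lost.
        self.pending.add(dst_prefix)
        return ("drop", None)

    def handle_map_reply(self, dst_prefix: str, etr: str) -> None:
        self.pending.discard(dst_prefix)
        self.cache[dst_prefix] = etr


itr = CachingItr()
first = itr.handle_packet("2001:db8:ffff::/48")
itr.handle_map_reply("2001:db8:ffff::/48", "192.0.2.1")
second = itr.handle_packet("2001:db8:ffff::/48")
```

Every packet that arrives between the map request and the map reply
takes the "drop" branch.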

>>      So if your aim is to remove these prefixes from the DFZ, then
>>      your ITR-like devices do resemble an ITR of LISP, APT etc.
>>      in that there needs to be a global mapping database, which
>>      tells the ITRs how the prefixes of scalable end-user address
>>      space are moved from one ISP to another.
>
> Correct.

OK - so then you don't need to rely on BGP at all.  Yet you
previously wrote that your scheme had the benefit that the DFZ
routers already had the required information to decide which ASN to
use, based on their BGP communications.

Perhaps you mean use BGP during deployment and switch to a separate
method when it is fully deployed.

>>      Then, you get into questions of multihoming service
>>      restoration.  How does your system respond to an end-user
>>      network no longer being able to use the link from its ISP
>>      with ASN AAA, and needing its packets to be forwarded to
>>      another ISP with ASN BBB?
>
> Two possibilities:
> 
> 	1.	Extension header field "Beenthere" where the AAA
> 		ASN is appended and a lookup is performed to find
> 		an additional candidate ASN, replace AAA in header,
> 		forward towards BBB.

I think this means the packet gets longer, adding to traffic loads
and leading to Path MTU Discovery problems which are really tricky to
solve.  Extension headers are only for IPv6 anyway.

> 	2.	Depend on updating the map quickly and dynamically.

Welcome to the RRG debates!  Most people assume it is not possible to
do real-time (a few seconds) updates of mapping to ITRs.  I think it
is possible and more than desirable - I think it is the only proper
way to do a core-edge separation scheme.  So Ivip is currently the
only proposal involving real-time mapping updates to local query
servers, which push the changes (secured with a nonce from the map
query) to the ITRs which requested the mapping and which would still
be caching the previously received mapping information.

So Ivip's mapping consists of a single ETR address - and the ITR
never has to test reachability to the ETR.

> I think option 1 is more likely to work more often, but, they could
> be combined.

My Modified Header Forwarding approaches avoid using extension
headers and, within the context of Ivip, the ITRs get changed mapping
within seconds of whatever mechanism changes the mapping in response
to TE needs, multihoming failure restoration or portability.

The best place to discuss Ivip or my two Modified Header Forwarding
approaches is the RRG, since Ivip is one of the proposals being
considered there.

>>      If this changed requirement could be transferred to
>>      all your "ITR" devices in real-time, that would solve
>>      the problem pretty well.  Ivip does it this way.  The
>>      other core-edge separation schemes assume this is either
>>      impossible or undesirable.  So the ITR gets mapping in the
>>      form of "send these packets to AAA, unless it is unreachable
>>      - in which case send them to BBB".  Then the ITR has to
>>      do reachability testing and it gets very complex.  Ivip
>>      ITRs are simpler.  They do not test for reachability to ETRs.
>>      They only have one ETR address and they tunnel packets to that
>>      address.
>
> My favorite is somewhat of a hybrid... The map looks a lot like
> an MX lookup receiving a set of candidates with optional preferences.
> One possibility is to include an "Encapsulating Router" IP in the
> additional packet header fields so that if a re-mapping occurs,
> the encapsulating router can be notified.  

I don't clearly understand this, but I think you may be considering a
global distributed query server system such as LISP-ALT (or now
several more RRG proposals) with some kind of secure update to the
ITRs which requested the mapping.  Global systems don't scale well,
due to delay problems and considering how many ITRs might be
requesting the mapping and needing to be informed in the event of a
TE, multihoming or portability change.

> By using the beenthere field I suggested above, we preserve loop
> detection.

But I think this means adding headers.
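
For what it is worth, this is how I read the "Beenthere" idea, as a
sketch only.  The reroute() function, the list-of-ASNs representation
of the field and the (preference, ASN) candidate list are my guesses,
not Owen's specification:

```python
from typing import List, Optional, Tuple


def reroute(header_asn: int, beenthere: List[int],
            candidates: List[Tuple[int, int]]
            ) -> Tuple[Optional[int], List[int]]:
    """Record the failed ASN and pick the next untried candidate.

    candidates holds (preference, asn) pairs, lower preference first
    (an assumption by analogy with MX records).  Returns the new ASN
    for the header (None if every candidate has been tried) and the
    updated Beenthere list.
    """
    tried = beenthere + [header_asn]
    for _preference, asn in sorted(candidates):
        if asn not in tried:
            return asn, tried
    return None, tried


candidates = [(10, 64501), (20, 64502)]   # made-up ASNs "AAA" and "BBB"
new_asn, trail = reroute(64501, [], candidates)
last_asn, last_trail = reroute(new_asn, trail, candidates)
```

Each retry makes the list, and so the packet, one ASN longer.  The
loop detection comes from refusing any ASN already in the list.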

>>      If your ITR was like a current DFZ router, then it could
>>      find out about AAA being unreachable and BBB being the new
>>      ASN to forward the packets to, by allowing the current
>>      DFZ to propagate best paths to it.  However, I think the
>>      main benefit of your scheme would be to avoid advertising
>>      all these end-user prefixes in the DFZ - so this won't
>>      work.
>
> Correct.

In which case BGP couldn't be used and you would need either a
mapping system which worked to all ITRs in real-time (like Ivip) or
a slower mapping distribution system, with multiple options of ASNs
or whatever to forward the packet to, which would then require the
ITR to choose between the options, such as by some
reachability testing.  That reachability testing takes time - and I
think all the core-edge separation schemes apart from Ivip suffer
from problems with ITRs finding it slow and expensive to test
reachability - when the ITR is really supposed to forward all packets
without delay.  Without first testing, how can the ITR know which ETR
address to use?
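
To make the contrast concrete, here is a minimal Python sketch of the
two styles of ITR decision.  It is only an illustration under my own
assumptions, not code from any proposal: the names Candidate,
select_single and select_with_testing are hypothetical, and
is_reachable stands in for whatever probing a real ITR would do.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    etr: str         # ETR address (or ASN) the packet could be sent to
    preference: int  # lower value is tried first, as with MX records


def select_single(mapping: str) -> str:
    """Single-address style: the mapping is one ETR address, so the
    ITR uses it as-is and does no reachability testing."""
    return mapping


def select_with_testing(
    candidates: Sequence[Candidate],
    is_reachable: Callable[[str], bool],
) -> Optional[str]:
    """Multi-option style: walk the candidates in preference order and
    return the first one the ITR believes is reachable.

    is_reachable is a placeholder for the probing step, which is the
    slow and expensive part in a real ITR.
    """
    for cand in sorted(candidates, key=lambda c: c.preference):
        if is_reachable(cand.etr):
            return cand.etr
    return None  # nothing reachable: the packet cannot be forwarded
```

In the first style all the decision-making sits with whatever system
updates the mapping.  In the second, the ITR cannot answer until
is_reachable has a result for at least one candidate.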

>>      I think that to be really useful for scaling, your
>>      scheme would need a separate (not BGP-based) mapping
>>      arrangement by which the ITR functions could quickly and
>>      reliably decide which ASN to forward the packets to.  I think
>>      this requires some separate mapping system.
>
> Correct.

OK.

>> In the mapping systems of core-edge separation schemes, a subset of
>> the IP address range returns a positive response from the mapping
>> system.  These are what I call the "Scalable PI" (SPI) addresses -
>> that subset of the address range which is used by end-user networks
>> for scalably getting address space which is portable and gives them
>> multihoming and inbound TE.  (This may also be thought of as the
>> "edge" subset, or the "identifier" subset of the whole IP address
>> range.)  This subset of the address space would be many separate
>> prefixes scattered through the range.
>
> Sure, although in my scheme, there is no reason that subset could
> not become the entire IP space.

OK - so then it is not a core-edge separation scheme.  Then, there's
no need to advertise any prefixes in the DFZ, except perhaps just one
prefix from each BR of an ASN, so the DFZ routers can determine paths
to ASNs.  This "skeleton set" of prefixes is the bare minimum which
needs to be advertised in order that the DFZ routers can know paths
to at least one BR of each ASN.

Then, before the packets get to the DFZ, or at the first DFZ router,
there is a mapping lookup and the packet's header is modified to
contain an ASN.  The rest of the DFZ routers forward it towards any
BR of that ASN.
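
As a rough sketch of those two steps, assuming a hypothetical
prefix-to-ASN mapping table and a hypothetical per-router table of
next hops towards each ASN (neither is defined by Owen's proposal or
mine), the logic might look like this in Python:

```python
import ipaddress
from typing import Optional

# Hypothetical mapping: destination prefix -> ASN currently serving it.
# Documentation prefixes and private-use ASNs are used as placeholders.
MAPPING = {
    ipaddress.ip_network("192.0.2.0/24"): 64500,
    ipaddress.ip_network("192.0.2.128/25"): 64501,
}

# Hypothetical DFZ router state: ASN -> next hop towards the nearest
# BR of that ASN.  No end-user prefixes appear in this table.
ASN_NEXT_HOP = {64500: "peer-a", 64501: "peer-b"}


def lookup_asn(dst: str) -> Optional[int]:
    """Mapping lookup: longest matching prefix wins."""
    addr = ipaddress.ip_address(dst)
    matches = [net for net in MAPPING if addr in net]
    if not matches:
        return None
    return MAPPING[max(matches, key=lambda net: net.prefixlen)]


def tag_packet(packet: dict) -> dict:
    """Done once, before or at the first DFZ router: write the ASN
    into the (modelled) header."""
    return {**packet, "asn": lookup_asn(packet["dst"])}


def forward(packet: dict) -> Optional[str]:
    """Done at every other DFZ router: forward on the ASN alone."""
    return ASN_NEXT_HOP.get(packet.get("asn"))
```

Only tag_packet consults the mapping.  forward never looks at the
destination address, which is the point of the scheme.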

>> The remainder are "core" (AKA "locator" or "RLOC" in LISP), since
>> these core addresses continue to function as they do today, with all
>> routers, including DFZ routers, using these addresses to forward
>> packets all the way to their destination.  Core addresses can still
>> be used for end-user network PI space, but this is not scalable.
>
> In my scheme, the core locator space is a completely separate namespace,
> so, duplicates (of sorts) are possible. (e.g. while AS1734 is numerically
> comparable to 0::6c6, there is not really any overlap between them).

OK - so your scheme is ultimately (once fully deployed) a core-edge
elimination scheme:  There is a whole new namespace for identifying
where packets need to be forwarded to in the DFZ - AKA the "locator"
address.  This namespace is ASNs.

Whereas other core-edge elimination schemes do this work at the host,
and all routers then operate on the "locator" address, yours has the
hosts behaving as they do now, sending and receiving packets with IP
addresses ("edge" addresses).  Some or all of the edge networks'
routers also handle packets with IP addresses.  Only when the packets
traverse the core do you switch to the new core namespace for
determining forwarding.

The other core-edge elimination schemes do their work in the hosts
for at least two reasons:

  1 - To avoid new or changed functionality in the network - though
      they still need some kind of mapping system to securely find
      out the current locator address for a host with a given edge
      ("identifier") address.

  2 - To enable the hosts to respond quickly when one or the other
      or both change their physical location - likewise without
      having to alter the operation of any routers in the network
      itself.

On this basis, I think your scheme, once fully deployed, would be a
network-based core-edge elimination scheme, while all the others I
know of are host-based.  (However there has been a flurry of
proposals to the RRG in recent days and I haven't read them all yet.)

>> The usual arrangement for a core-edge separation scheme is to
>> encapsulate the packet and tunnel it to an ETR.  ETRs are always on
>> "core" addresses.
>
> Sure.  This is a little bit different in implementation, but, not completely
> different in concept.

OK.

>> The total packet, now longer due to the one or more headers used to
>> encapsulate it, is handled in the same way by DFZ routers as any
>> other packet, so the DFZ routers don't need any upgraded functionality.
>
> And the fact that the packet is longer is an additional reason to include
> the encapsulating router ID (so PMTU-D responses can be sent back
> correctly and translated by the encapsulating router back to the origin)

This is how LISP and APT work.  Ivip uses the sending host's IP
address in the outer header, which enables ETRs to easily enforce ISP
BR filtering which rejects incoming packets with source addresses
matching a prefix of the ISP's network.  The ETR does this by
dropping packets whose inner address is different from the outer
address.  If LISP and APT are to enforce this existing filtering,
they need to do it directly on the inner address.  Such filtering,
for more than a handful of prefixes, is extremely expensive - and
routers need to use TCAM to do it acceptably quickly.
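
A small Python sketch of how these two checks combine, assuming a
hypothetical list of the ISP's own prefixes and modelling a packet as
just its outer and inner source addresses (the function names are
mine, for illustration only):

```python
import ipaddress
from typing import Iterable


def br_accepts(outer_src: str, isp_prefixes: Iterable[str]) -> bool:
    """Existing BR filter: reject a packet arriving from outside whose
    source address matches one of the ISP's own prefixes."""
    addr = ipaddress.ip_address(outer_src)
    return not any(addr in ipaddress.ip_network(p) for p in isp_prefixes)


def etr_accepts(outer_src: str, inner_src: str) -> bool:
    """ETR check: drop the packet unless the inner source address is
    the same as the outer one.  This is a single comparison, with no
    per-prefix lookup."""
    return ipaddress.ip_address(outer_src) == ipaddress.ip_address(inner_src)


def delivered(outer_src: str, inner_src: str,
              isp_prefixes: Iterable[str]) -> bool:
    """A tunneled packet reaches the destination network only if it
    passes the BR filter and then the ETR check."""
    return br_accepts(outer_src, isp_prefixes) and etr_accepts(outer_src, inner_src)
```

Because the ETR insists the two addresses match, the BR's filter on
the outer address is effectively applied to the inner address too,
without the ETR holding any prefix list.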

Ivip takes a different approach to PMTUD management: it does not
require the ITR to recognise and correctly handle Packet Too Big messages.
Requiring ITRs to recognise PTBs is very tricky - since to do it
securely requires a nonce in the traffic packet's LISP/APT header
(which must be cached in the ITR for a few seconds) and also relies
on the router to send back enough of the too-long packet to include
this nonce.  I think most routers do this, but RFC 1191 only requires
that it send back the IP header and the next 8 bytes.  In LISP and, I
think, APT, those 8 bytes are the UDP header.

>> With your suggested scheme, there is no encapsulation and the packet
>> is not made any longer.  The DFZ routers forward the packet to the
>> "nearest" BR of the currently mapped ASN for the packet's destination
>> prefix.  So there's no encapsulation overhead.  If the packet is too
>> long for one of these DFZ routers, it will presumably convert the
>> packet back to its original form and send a conventional ICMP Packet
>> Too Big message, so the sending host will get this message and retry
>> with a shorter packet.  So there are no thorny PMTUD problems, as
>> there are with encapsulation.
>
> There is additional data in the packet (the ASN), but, it's less encapsulation
> overhead than the other schemes.  The rest is correct, mostly, with
> the exception that PMTUD does get slightly thorny.

My two Modified Header Forwarding proposals, when used with Ivip, do
not lengthen the packet at all.  So at the router where the next hop
MTU is not big enough, some CPU firmware will need to turn the packet
back into its original form, like an ETR, and then perform the usual
PTB operation on it.  The result goes back to the sending host (NOT
the ITR) and the PTB contains the correct length to which the sending
host should keep its packets.  So RFC 1191 PMTUD works fine without
any ITR involvement.

Part of the PMTUD problem with encapsulation is that the PTB
generated by the router, even if it went to the sending host, has the
wrong number in it.  The sending host needs to get a PTB with that
number minus the length of the encapsulation headers.  Only the ITR
can generate this, but in Ivip, I have another method which doesn't
rely on ITRs getting PTBs.  The ITR probes the PMTU all the way to
the ETR when it gets a packet longer than it had previously sent.  It
is complex, but I think it can be done.  This doesn't need to be done
with Modified Header Forwarding.
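
The arithmetic is simple, and a short Python sketch shows why the
number in the router's PTB is wrong for the sending host under
encapsulation but right under Modified Header Forwarding.  The
function and constant names are hypothetical.  The overheads assume
plain IP-in-IP with an outer header that has no options or extension
headers.

```python
# Bytes added in front of the original packet by the ITR.
OVERHEAD_IPV4_IN_IPV4 = 20  # outer IPv4 header, no options
OVERHEAD_IPV6_IN_IPV6 = 40  # outer IPv6 header, no extension headers
OVERHEAD_MODIFIED_HEADER = 0  # header rewritten in place, nothing added


def mtu_for_sending_host(reported_mtu: int, encap_overhead: int) -> int:
    """Convert the next-hop MTU reported in a router's PTB into the
    packet size the sending host should actually stay within."""
    if encap_overhead < 0:
        raise ValueError("encapsulation overhead cannot be negative")
    return reported_mtu - encap_overhead


# A router in the tunnel path reports a next-hop MTU of 1500 bytes.
print(mtu_for_sending_host(1500, OVERHEAD_IPV4_IN_IPV4))     # 1480
print(mtu_for_sending_host(1500, OVERHEAD_IPV6_IN_IPV6))     # 1460
print(mtu_for_sending_host(1500, OVERHEAD_MODIFIED_HEADER))  # 1500
```

With zero overhead the router's own figure is already the right one,
so nothing between the router and the sending host needs to rewrite
the PTB.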

>> With what you are suggesting, the DFZ routers need to be upgraded to
>> recognise the new packet format.  This raises major problems for
>> widespread voluntary adoption - but I think it is well worth
>> exploring.  In the long term, it will cost little to progressively
>> upgrade all DFZ routers with the new function.  So in the long-term,
>> a core-edge separation scheme could be migrated to this modified
>> header arrangement without significant cost.   It's just a matter of
>> making sure all new routers perform the function.  When all DFZ
>> routers are of this type, the system can be turned on.
>
> Correct, the DFZ routers would eventually all need new software to
> accommodate this, but, the existing hardware should suffice.  

OK - I am glad to find support for this.

> The nice
> thing is that by having it as a negotiated feature in BGP, like 32-bit
> ASNs, it could be adopted in islands without requiring widespread
> adoption day 1.  That seems to me like it might overcome some of
> the voluntary adoption hurdles.

Yes.  I don't think my approaches are amenable to this "island" model
of adoption, but I will keep it in mind.

>> I think that using modified headers only makes sense if the core-edge
>> separation scheme does not need to send any extra information along
>> with the traffic packet.  LISP does send extra info with each traffic
>> packet - in a LISP header which itself is behind a UDP header.  Ivip
>> does not send any extra information and uses simple IP-in-IP
>> encapsulation.  (I am not sure whether APT or TRRP send extra data.)
>
> This would send extra data, but, less extra data than LISP.  I don't know
> for sure about APT or TRRP, either, but, I believe they send at least an
> extra 256 bits.

OK.

>> If the core-edge separation scheme needs to send extra data with each
>> traffic packet, then most of the benefits of modifying the header are
>> lost, since there would still need to be some additional header,
>> which lengthens the total packet and causes a lot of trouble with
>> Path MTU Discovery.  Avoiding the encapsulation overhead and PMTUD
>> complexities are extremely worthwhile goals.
>
> Sure, but, I think they're inevitable.  

Ivip is the only proposal so far which uses pure IP-in-IP
encapsulation (no extra data) or pure Modified Header Forwarding (no
extra packet length at all).  If you have time to critique Ivip and
let me know privately, or on the RRG list, what you think, I would
very much appreciate it.

> However, I will say that if you extend
> the header only by 4 octets (ASN) + 16 octets (IPv6 address of encapsulating
> router), that 20 octet add is less likely to trip PMTU issues than many of the
> 36+ octet adds in things like LISP, etc.  I also think that having the PMTU
> issue reduced to "send the Packet Too Big to the encapsulating router with
> original source+dest in payload" which would then result in the encapsulating
> router taking the suggested MTU-20 and forwarding that back to the origin
> with an ICMP message that looked just like one that would have come
> from the reporting router if there hadn't been an encapsulation (other than
> reducing the recommended MTU by 20) is pretty straightforward.

You could do this, but as I mentioned above, doing it securely is
very expensive.

>> Sorry this was rather long, but I wanted to say that I think modified
>> header forwarding is worth exploring - and I tried to imagine what
>> your scheme would involve.
>
> Hopefully I've clarified enough for your imagination to take it further.

Indeed you have.

>> Since you are prepared to consider upgrading all DFZ routers and
>> modifying the IP header, here are my two proposals which do this.
>
> Sort of... I think it needs to be possible to implement in islands rather
> than requiring a flag day.

I haven't figured out a way of doing it yet - but there would be no
need to upgrade DFZ routers if the only ISP networks behind them did
not have any SPI-using end-user networks.

> I'll try to look at your suggestions a little later today.  I need to get moving
> towards work right now.
> 
> Owen

OK - Thanks!

  - Robin