[arin-tech-discuss] Preparation guide for RPKI 'surprise' outages (Was: Notice of upcoming maintenance to ARIN’s RPKI infrastructure)
Job Snijders
job at fastly.com
Thu Jun 3 12:03:23 EDT 2021
Dear all,
ARIN announced an upcoming 'surprise' maintenance in July 2021. Full
details have not yet been disclosed - to make it a real surprise! :-)
I think this RPKI experiment is useful, as it can help ARIN better
understand its role and responsibilities in the ecosystem, which will
help it make more informed decisions.
I'd like to share some notes on how to assess your operational model
(aka 'risk') and how to prepare. Preparing for this event will also help
with unannounced surprise maintenances. The checklist below is probably
worth confirming every few months in most operations.
A) [ ] Have RPKI ROAs been created for my IP prefixes?
Check whether anyone in your organization created RPKI ROAs in ARIN's
online portal. You can check this either by logging into the portal,
or by checking an external tool such as http://irrexplorer.nlnog.net/
(check for resources where the RIR column shows ARIN, and the RPKI
column is non-empty).
--> If your prefixes are not covered by RPKI ROAs, you can stop
reading, the upcoming maintenance will not affect your routes. <--
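If you'd rather check coverage locally, here is a minimal sketch in
Python. It assumes you have a list of Validated ROA Payloads (VRPs) from
your validator, with the field names "prefix", "maxLength" and "asn".
Those field names and the sample data are my assumptions for
illustration; adjust them to whatever your validator actually exports.

```python
import ipaddress

def covering_vrps(prefix, vrps):
    """Return the VRPs whose prefix covers `prefix` (same address family)."""
    net = ipaddress.ip_network(prefix)
    covering = []
    for vrp in vrps:
        roa_net = ipaddress.ip_network(vrp["prefix"])
        # subnet_of() raises TypeError across address families, so compare
        # the IP version first.
        if roa_net.version == net.version and net.subnet_of(roa_net):
            covering.append(vrp)
    return covering

# Hypothetical sample data (documentation prefixes and ASNs only).
SAMPLE_VRPS = [
    {"prefix": "192.0.2.0/24", "maxLength": 24, "asn": 64496},
    {"prefix": "2001:db8::/32", "maxLength": 48, "asn": 64497},
]

for p in ["192.0.2.0/24", "198.51.100.0/24", "2001:db8:1::/48"]:
    status = "covered" if covering_vrps(p, SAMPLE_VRPS) else "not covered"
    print(p, status)
```

A prefix reported as "not covered" has no ROA that could affect its
validation state.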
B) [ ] Are my validators up to date?
Ask the engineering team whether the latest recommended version of
the chosen validator has been qualified, tested, and burned-in.
As we don't know /what/ exactly ARIN will change, you'll want to
use a validator that is known to have a strong cryptographic posture
derived from well-regarded industry-standard crypto libraries.
The RIPE NCC Validator is not supported beyond July 1st, 2021, so any
software defects uncovered by the ARIN experiment will not be fixed
by RIPE NCC.
I personally recommend using the latest version of NIC.MX's FORT, or
OpenBSD's rpki-client.
C) [ ] Is my organization monitoring my RPKI ROAs?
In order for us to be in a position to even complain about an RPKI
service outage, the RPKI needs to be monitored of course! :-)
NTT's BGPalerter can be used to monitor both BGP routes _and_ RPKI
ROAs. This free tool can alert you when RPKI ROAs unexpectedly
disappear, or appear, and also alert about BGP route visibility.
Which alerts you get out of the tool will depend on the exact type
of failure mode the surprise maintenance triggers.
https://github.com/nttgin/BGPalerter
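To illustrate the idea of ROA monitoring (this is not how BGPalerter is
implemented, just a hand-rolled sketch), one can compare two snapshots
of validator output and report what disappeared or appeared. The field
names and sample data below are assumptions for illustration.

```python
def vrp_key(vrp):
    """Identity of a VRP: prefix, maximum length, and origin ASN."""
    return (vrp["prefix"], vrp["maxLength"], vrp["asn"])

def diff_vrps(previous, current):
    """Return (disappeared, appeared) VRP keys between two snapshots."""
    prev = {vrp_key(v) for v in previous}
    curr = {vrp_key(v) for v in current}
    return sorted(prev - curr), sorted(curr - prev)

# Hypothetical snapshots taken before and during a maintenance window.
BEFORE = [
    {"prefix": "192.0.2.0/24", "maxLength": 24, "asn": 64496},
    {"prefix": "198.51.100.0/24", "maxLength": 24, "asn": 64497},
]
DURING = [
    {"prefix": "198.51.100.0/24", "maxLength": 24, "asn": 64497},
]

disappeared, appeared = diff_vrps(BEFORE, DURING)
for key in disappeared:
    print("ALERT: ROA disappeared:", key)
for key in appeared:
    print("ALERT: ROA appeared:", key)
```

Restricting the comparison to your own prefixes keeps the alerts
meaningful; a diff over the full global VRP set changes constantly.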
D) [ ] Have I correctly configured my BGP Routing Policies?
It is of paramount importance that operators use Validated RPKI ROA
data only to reject RPKI-invalid BGP routes. A common mistake is to
configure your EBGP routers to associate a BGP Community (or other
BGP Path Attribute) with a route dependent on the RPKI validation
state.
The problem with associating BGP Communities with the Validation
State, is that any change in the Validation State will trigger BGP
MESSAGES to be sent in all kinds of directions with a new
('not-found') BGP Community associated. Use of RFC 8097 Communities
is also not recommended for the same reasons.
Many operators attach BGP Communities based on RPKI State to 'see how
many routes would be affected', but this increased observability is
itself a cause of routing instability.
The correct and robust way to configure RPKI ROV in routing policy is
outlined in this guide (for multiple vendors):
https://bgpfilterguide.nlnog.net/guides/reject_invalids/
--> Policies that mark 'valid' or 'not-found' BGP routes with a
BGP Community, will see/trigger BGP routing churn for
~ 35,000 BGP routes in the Default-Free Zone. <--
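To make the reasoning concrete, here is a small Python model of origin
validation along the lines of RFC 6811, plus a 'reject invalids only'
decision. It is a sketch, not router configuration; the function names,
field names and sample data are mine. The point to observe: when ROAs
vanish, a route moves from 'valid' to 'not-found', but the accept
decision stays the same and no attribute changes, so nothing needs to
be re-announced.

```python
import ipaddress

def validation_state(route_prefix, origin_asn, vrps):
    """Classify a route as 'valid', 'invalid' or 'not-found'."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        roa_net = ipaddress.ip_network(vrp["prefix"])
        if roa_net.version != route.version or not route.subnet_of(roa_net):
            continue  # this VRP does not cover the route
        covered = True
        if route.prefixlen <= vrp["maxLength"] and origin_asn == vrp["asn"]:
            return "valid"
    return "invalid" if covered else "not-found"

def accept_route(route_prefix, origin_asn, vrps):
    """Reject only 'invalid'; 'valid' and 'not-found' are treated alike."""
    return validation_state(route_prefix, origin_asn, vrps) != "invalid"

# Hypothetical VRP set (documentation prefix and ASNs only).
VRPS = [{"prefix": "192.0.2.0/24", "maxLength": 24, "asn": 64496}]

# Normal operation, then the same route while the ROAs are unavailable.
print(validation_state("192.0.2.0/24", 64496, VRPS), accept_route("192.0.2.0/24", 64496, VRPS))
print(validation_state("192.0.2.0/24", 64496, []), accept_route("192.0.2.0/24", 64496, []))
# A wrong-origin announcement is rejected only while the ROA is visible.
print(validation_state("192.0.2.0/24", 64511, VRPS), accept_route("192.0.2.0/24", 64511, VRPS))
```

A policy that tagged routes with the state would have to update the
first route when its state changed; the accept/reject decision alone
does not change for it.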
If you are unsure what your routing policy does, feel free to send me
a copy and I'll read it to confirm with you. Your RPKI-related
routing policies really should be as simple and short as the guide
outlines.
E) [ ] Are any of my customer onboarding processes dependent on RPKI?
Some cloud providers have embraced the practice of requiring their
Bring-Your-Own-IP (BYOIP) customers to issue RPKI Route Origin
Authorizations with the cloud provider's ASN.
Sometimes this is a one-off check. If such a one-off check happens
during ARIN's surprise maintenance, the check might fail, and thus
needs to be repeated at a later moment to progress the onboarding
process.
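If the onboarding check is automated, repeating it can be as simple as
a retry loop with growing delays. A minimal Python sketch follows; the
`check` callable, attempt count and delays are hypothetical
placeholders, and in practice the delays would more likely be hours
than seconds.

```python
import time

def retry_check(check, attempts=5, initial_delay=1.0, sleep=time.sleep):
    """Call check() until it returns True; return the attempt number or None."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)   # wait before trying again
            delay *= 2     # back off: double the wait each time
    return None

# Hypothetical check that fails twice (e.g. ROA not visible), then passes.
results = iter([False, False, True])
delays = []
print(retry_check(lambda: next(results), sleep=delays.append))
print(delays)
```

Passing `sleep` as a parameter keeps the example testable without
actually waiting.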
----
I look forward to more data and information as it becomes available. I
appreciate ARIN initiating the coordination of some lightweight RPKI
fire drills. :-)
I'm available for questions on-list and off-list.
Kind regards,
Job