[arin-tech-discuss] Issue for Delegated Users within ARIN's RPKI Repository - Outage Report - Corrected dates

Mark Kosters markk at arin.net
Tue Nov 24 09:32:30 EST 2020


Ah, now I see the problem. Here is the outage report with corrected dates.

Summary

On Nov 20 at 2:30PM EST (UTC-5), ARIN updated the software that generates the RPKI repository. On Nov 21 at 9:48PM EST (UTC-5), we were notified by a 3rd party that validators were no longer fetching ROAs from organizations that had selected the delegated option. Upon review, ARIN Engineering discovered that a certificate was not included in the manifest for each delegated organization. The fix, which was to include that certificate in the manifest for each delegated organization, was deployed at 1:20AM EST (UTC-5) on Nov 22. From that time, ROAs from the affected delegated repositories could again be fetched and validated.

ARIN's hosted RPKI customers were not affected by this outage in any way. 

Root Cause

The root cause of this failure was a software bug introduced in the update to the RPKI repository generator.

Scope of Issue

This bug meant that validators would not fetch information from the delegated repositories during the affected period. The bug affected ARIN's nine delegated organizations and approximately 180 ROAs, which may have disappeared from the global RPKI system for approximately 35 hours and 40 minutes starting on Nov 20 at 2:30PM EST (UTC-5). Depending on how validation is set up by the ISPs who use RPKI, the route origins associated with these 180 ROAs may have remained in the secure state or become unsecure during this period.
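
For anyone lining these times up against validator logs kept in UTC, here is a minimal sketch that converts the timestamps from the summary and computes the intervals between them. The variable and function names are illustrative only; the timestamps are the ones stated in this report. Computed directly from the update and fix times, the window comes to 34 hours 50 minutes, so the duration quoted above should be read as approximate.

```python
from datetime import datetime, timedelta, timezone

# EST is treated as a fixed UTC-5 offset, as stated in the report.
EST = timezone(timedelta(hours=-5), "EST")

# Timestamps taken from the summary in this report.
software_update = datetime(2020, 11, 20, 14, 30, tzinfo=EST)
notification = datetime(2020, 11, 21, 21, 48, tzinfo=EST)
fix_deployed = datetime(2020, 11, 22, 1, 20, tzinfo=EST)


def to_utc(ts):
    """Convert a timezone-aware timestamp to UTC."""
    return ts.astimezone(timezone.utc)


def hours_minutes(delta):
    """Split a timedelta into whole hours and remaining minutes."""
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


if __name__ == "__main__":
    for label, ts in [("update", software_update),
                      ("notified", notification),
                      ("fix", fix_deployed)]:
        print(f"{label:9s} {to_utc(ts):%Y-%m-%d %H:%M} UTC")

    h, m = hours_minutes(fix_deployed - software_update)
    print(f"update to fix: {h}h {m}m")

    h, m = hours_minutes(fix_deployed - notification)
    print(f"notified to fix: {h}h {m}m")
```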

After Action Items

ARIN will add delegated repository tests to prevent this type of operational issue from happening again. Additionally, as planned, ARIN will further improve its external monitoring, which uses various validators to ensure that the repository is working as intended.
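
As an illustration of the kind of test described above, here is a minimal sketch of a manifest completeness check. It is not ARIN's test suite: the function name and all file names are hypothetical, and it assumes the manifest's file list has already been extracted by other tooling (real manifests are signed objects and need to be parsed and validated first).

```python
def missing_from_manifest(published_files, manifest_entries):
    """Return published object names that the manifest does not list.

    published_files:  file names present at the publication point
    manifest_entries: file names listed in the manifest's file list
    """
    # A manifest does not list itself, so it is excluded from the check.
    expected = {name for name in published_files if not name.endswith(".mft")}
    return sorted(expected - set(manifest_entries))


if __name__ == "__main__":
    # Hypothetical publication point where the certificate for a
    # delegated organization was left out of the manifest.
    published = ["ca.mft", "ca.crl", "hosted-org.roa", "delegated-org.cer"]
    listed = ["ca.crl", "hosted-org.roa"]

    missing = missing_from_manifest(published, listed)
    if missing:
        print("Not listed in manifest:", ", ".join(missing))
```

A check along these lines, run against the generated repository before publication, would flag an object that validators would otherwise ignore.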

Regards,
Mark



On 11/24/20, 8:57 AM, "Mark Kosters" <markk at arin.net> wrote:

    
    
    On 11/24/20, 6:30 AM, "Job Snijders" <job at ntt.net> wrote:
    
        Dear Mark,
        
        On Mon, Nov 23, 2020 at 09:32:53PM +0000, Mark Kosters wrote:
        > On Nov 19 at 2:30PM EST (UTC-5), ARIN updated the software that generates the RPKI repository.
        > On Nov 20 at 9:48PM EST (UTC-5), we were notified by a 3rd party that validators no longer were fetching ROAs from organizations that had selected the delegated option.
        
        Can you elaborate on why it appears there was a delay between the
        software update having taken place, and the problem becoming visible?
        
        From my measurements the problem became visible at 19:22 UTC on November
        20th. The RPKI stack from an end-to-end perspective is an interesting
        waterfall of timers, the above question is for my own edification on how
        this all works.
    
    If I got my timing right, looks like you must have received the updated repository as we were pushing out the software updates.
        
        > Upon review, ARIN Engineering discovered that a certificate was not included in the manifest for each delegated organization.
        > The fix was to include that certificate in the manifest for each delegated organization was deployed at 1:20AM EST (UTC-5) on Nov 21.
        
        A fix was deployed on November ***22nd***, right?
    
    Good catch.
        
    Thanks,
    Mark 
    
    


