[arin-tech-discuss] FW: [arin-ppml] Just so it is recorded here (DNSSEC.. ) outages today..

Wed Mar 9 15:50:14 EST 2016

Hi Chris

Answers in-line:

On 3/9/16, 1:20 PM, "arin-tech-discuss-bounces at arin.net on behalf of Nate
Davis" <arin-tech-discuss-bounces at arin.net on behalf of ndavis at arin.net>
wrote:

>On 3/9/16, 11:34 AM, "Christopher Morrow" <christopher.morrow at gmail.com>
>wrote:
>
>>Thanks!
>>(I have a few questions, which may not be answerable here, I suppose..
>>if they can be answered that'd be cool though)
>>
>>On Tue, Mar 8, 2016 at 12:59 PM, Nate Davis <ndavis at arin.net> wrote:
>>>
>>> ARIN's DNS process moves DNS data from the internal database to a
>>>Secure64
>>> DNSSEC appliance to a hidden distribution master. From the hidden
>>> distribution
>>> master, zones are fetched to name server constellations from ARIN,
>>> VeriSign, and PCH.
>>>
>>> About two weeks ago a script was run that reset the serial on a zone in
>>> the database. This script was run to accommodate an inter-RIR network
>>
>>This script sounds like something that should/would happen
>>periodically? (whenever there's an xfer I guess?) is that correct?

Not even that frequently. It only needs to be run when we initially set up
a /8 for out-of-region transfers. This marks the /8 in our system so that
we can start doing things like retrieving, validating, and aggregating the
RIR snippets to put into our published zone file, and eventually do the
right things to work with RPKI and so on. Of course, after this weekend's
deploy, we will no longer need to run this script as the system will
automatically detect this and mark the zone.

>>> This incident exposed a gap in our monitoring that we are fixing. Our
>>
>>is/was the gap: "Make sure serial is monotonically increasing"
>>or is/was it: "If you are going to backup the serial, be sure to force
>>a reload on all masters via process X"
>>
>>(ie: If I make a serial change, what other things should I look for?
>>what monitoring gap do I also have?)

No. The gap was that our SOA checking only covered the path from the
distribution master out to the anycast cloud. We have had incidents in the
past where various nodes were not fetching the latest zone within a
reasonable interval, so we added checks to make sure the SOA would update
within a "reasonable interval". If a node did not update within that
interval, on-call people got notified to escalate.

Unfortunately, we did not have the same monitoring in place internally,
within our provisioning flow, so our internal nodes were not monitored
appropriately. That has now been fixed.
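
As a rough sketch of that kind of check applied to every hop rather than
just the anycast nodes: compare each hop's SOA serial against the source
once a grace interval has passed, using RFC 1982 serial arithmetic to tell
"behind" from "ahead". The hop names, the 900-second interval, and the
function names are made up for illustration and are not our actual
tooling; fetching the serials is left out.

```python
SERIAL_MOD = 2 ** 32
HALF_RANGE = 2 ** 31


def serial_newer(a, b):
    """RFC 1982 comparison: True if SOA serial a is newer than serial b."""
    return 0 < (a - b) % SERIAL_MOD < HALF_RANGE


def check_hops(serials, source_changed_at, now, max_lag=900):
    """Report hops whose serial disagrees with the source after max_lag.

    serials: ordered list of (hop_name, soa_serial), source first.
    source_changed_at / now: timestamps in seconds.
    Returns a list of (hop_name, "behind" | "ahead").
    """
    _, source_serial = serials[0]
    if now - source_changed_at < max_lag:
        return []  # still inside the grace window, nothing to report yet
    problems = []
    for name, serial in serials[1:]:
        if serial == source_serial:
            continue
        if serial_newer(source_serial, serial):
            problems.append((name, "behind"))  # hop has not picked up the change
        else:
            problems.append((name, "ahead"))   # source serial went backwards
    return problems


# Hypothetical example: the source serial was reset to a low value,
# so the downstream hops look "ahead" and will never fetch the new zone.
hops = [
    ("database", 5),
    ("signer", 2016030801),
    ("distribution-master", 2016030801),
]
for hop, problem in check_hops(hops, source_changed_at=0, now=3600):
    print(hop, problem)
```

The "ahead" case is the interesting one here: a plain "has the node caught
up yet" check never fires when the source serial has moved backwards.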

>>For dnssec I suppose you'd be doing the above but pulling rrsig for
>>the SOA and making sure they are all the same.

What we want to do is to catch it before the sig expires. Do you have any
ideas?
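
The simplest thing I can picture, purely as a sketch: take the expiration
field from the SOA RRSIG seen at each hop and alert when the remaining
lifetime drops below some threshold. The server names and the 3-day
threshold below are assumptions for illustration; the parsing follows the
RFC 4034 presentation format (YYYYMMDDHHmmSS, or seconds since the epoch),
and how the RRSIG is fetched is left out.

```python
from datetime import datetime, timedelta, timezone


def parse_rrsig_time(field):
    """Parse an RRSIG expiration/inception field into an aware UTC datetime."""
    if len(field) == 14:
        # YYYYMMDDHHmmSS presentation form
        return datetime.strptime(field, "%Y%m%d%H%M%S").replace(tzinfo=timezone.utc)
    # otherwise treat it as seconds since 1970-01-01 00:00:00 UTC
    return datetime.fromtimestamp(int(field), tz=timezone.utc)


def expiring_soon(expirations, now, min_remaining=timedelta(days=3)):
    """Return sorted names whose signature expires within min_remaining.

    expirations: dict of server name -> RRSIG expiration field (string).
    """
    return sorted(
        name
        for name, field in expirations.items()
        if parse_rrsig_time(field) - now < min_remaining
    )


# Hypothetical values: one hop is still serving an old signature.
now = datetime(2016, 3, 9, tzinfo=timezone.utc)
seen = {
    "distribution-master": "20160401000000",
    "anycast-node-1": "20160310000000",
}
print(expiring_soon(seen, now))
```

That only tells us a signature is close to expiring, not why, so it would
sit alongside the serial checks rather than replace them.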

Thanks,
Mark