Another Patch Night, Another Long “It Broke” Night

It seems like every couple of weeks, when I schedule some downtime to do updates and go over things, almost half the time I get stuck on something where a patch or update breaks something, often in some crazy way. With virtualization there’s an easy fix: snapshots. But what happens when you test something, it looks good, and you delete the snapshots? Oops.

In this case it’s a tale of two VoIP SBCs (session border controllers). One is brand-agnostic and directs things for Teams Direct Routing; the other is linked to an on-prem PBX (still) and is basically redundant. Normally they happily talk to each other: the PBX can make outbound calls through both SBCs, and Teams users can call the PBX going the other direction.

So after getting both SBCs up to the latest versions (which is usually fairly safe), I did an outbound test call from the PBX. All was well, perfect audio, so I figured, “OK, the drive is a little low on space, I’d better delete the snapshots lest I forget later.” Yeah… I came to regret that decision later when I noticed a little error on one of the SBCs. It turns out that while outbound calls worked fine, calls (or anything else) coming inbound didn’t work at all; even OPTIONS packets weren’t getting through. Rats!

So I spent hours carefully fiddling with settings, knowing full well this exact setup was working perfectly before, so it couldn’t be anything “drastically” different. I was perplexed, and after 2am I had to give up for the night. The next day, with a fresh cup of coffee, I was back at it trying to find the answer. The strange thing was, SBC2 was receiving the OPTIONS packets (seen via packet capture) but not replying to them. Weird! I even thought the SIP trunk had to be re-created, and tried a bunch of other things, all to no avail.
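(If you ever need to do the same kind of check, you don’t need a full Wireshark session just to see whether OPTIONS requests are arriving and being answered. The rough sketch below uses Python with scapy; the interface name is a placeholder, not my actual setup.)

    # Rough sketch: watch SIP traffic on UDP 5060 and print the first line
    # of each message, to see whether OPTIONS requests arrive and whether
    # any "SIP/2.0 ..." responses ever go back the other way.
    # Requires scapy and capture privileges; "eth0" is a placeholder.
    from scapy.all import IP, UDP, Raw, sniff

    def show_sip(pkt):
        if pkt.haslayer(UDP) and pkt.haslayer(Raw):
            first_line = pkt[Raw].load.split(b"\r\n", 1)[0].decode(errors="replace")
            print(f"{pkt[IP].src} -> {pkt[IP].dst}: {first_line}")

    sniff(iface="eth0", filter="udp port 5060", prn=show_sip, store=False)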

Thankfully, with virtualization, I had a backup of SBC2 from a few months ago. I could shut down the current one, restore that copy, boot it up, and at least eliminate it from the mix. The restore worked well; however, the problem persisted, which meant SBC1 was the culprit.

And then the AH-HA moment. Checking the packet capture again, I saw SBC2 actually was replying to the OPTIONS packet from SBC1, with an error saying it “couldn’t resolve the Via header”. Sure enough, I hadn’t noticed that SBC1 was using a hostname in the Via header, and since everything is NATted, DNS will not work!
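(In hindsight the failure mode is easy to demonstrate. A SIP OPTIONS “ping” only works if the far end can resolve and route back to whatever is in the Via header. The sketch below, plain Python with placeholder IPs rather than my real addressing, sends a minimal OPTIONS request with an IP in the Via; put an unresolvable internal hostname there instead and you get exactly the dead trunk I was chasing.)

    # Minimal SIP OPTIONS "ping" sketch. The IPs are placeholders.
    # The key detail is the Via header: the far end sends its response back
    # to whatever is listed there, so it must be something it can resolve
    # and route to. An IP works; a NATted internal hostname does not.
    import socket

    LOCAL_IP = "192.0.2.10"    # placeholder for the sending SBC
    REMOTE_IP = "192.0.2.20"   # placeholder for the receiving SBC
    SIP_PORT = 5060

    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((LOCAL_IP, 0))              # ephemeral local port
    local_port = sock.getsockname()[1]
    sock.settimeout(5)

    options = (
        f"OPTIONS sip:{REMOTE_IP}:{SIP_PORT} SIP/2.0\r\n"
        # An IP the far end can reach; a hostname here is what broke things.
        f"Via: SIP/2.0/UDP {LOCAL_IP}:{local_port};branch=z9hG4bK-test1\r\n"
        f"From: <sip:ping@{LOCAL_IP}>;tag=test1\r\n"
        f"To: <sip:ping@{REMOTE_IP}>\r\n"
        "Call-ID: options-ping-1\r\n"
        "CSeq: 1 OPTIONS\r\n"
        "Max-Forwards: 70\r\n"
        "Content-Length: 0\r\n\r\n"
    )

    sock.sendto(options.encode(), (REMOTE_IP, SIP_PORT))
    try:
        reply, _ = sock.recvfrom(4096)
        print("Got:", reply.decode(errors="replace").splitlines()[0])  # e.g. "SIP/2.0 200 OK"
    except socket.timeout:
        print("No reply - the same symptom the broken trunk showed.")
    finally:
        sock.close()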

Eventually I found a setting under the NIC configuration of SBC1 that may be new: “use reverse DNS”. That caused SBC1 to reverse-DNS itself and use its own hostname instead of its IP. With that turned off, it sent only the IP, which made the OPTIONS packets and calling work again. WHEW.

HOURS of work due to two tiny, subtle changes: one SBC didn’t surface an error on a bad OPTIONS packet (so there was no clue what was wrong), and the other was a tiny network setting hidden in an advanced menu.

Moral of the story: DON’T DELETE THE DAMN SNAPSHOTS UNTIL A DAY LATER!