[aprssig] first.aprs.net

Wed Oct 4 21:26:29 EDT 2006

> In fact, my experience is that I've had more failures that affected  
> end-users with multiple site systems than with non-redundant single- 
> site systems.  Things failing over when they shouldn't, not failing  
> over when they should, etc.

And that's exactly what I mean by requiring a lot of effort.  You really
have to have the design and test procedures right to be sure it's going to
work.  It's too easy to miss some subtle DNS problem or failover logic error.
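
To make the failover-logic point concrete, here is a minimal sketch in
Python.  It is not taken from any real system: the class name, the method
names, and the threshold of three consecutive failed health checks are all
assumptions chosen for illustration.  It shows the two failure modes from
the quote as one tuning problem.  Set the threshold too low and a single
dropped check fails you over when it shouldn't.  Set it too high and you
sit on a dead server when you should have failed over.

```python
class FailoverMonitor:
    """Decide when to fail over, using a consecutive-failure threshold.

    Hypothetical illustration only: names and the default threshold are
    assumptions, not part of any real cluster manager.
    """

    def __init__(self, threshold=3):
        self.threshold = threshold   # consecutive failures required to trigger
        self.failures = 0            # current run of failed health checks
        self.failed_over = False     # latched once failover has been triggered

    def record(self, healthy):
        """Feed one health-check result; return True only when failover triggers."""
        if self.failed_over:
            return False             # already on the standby, nothing more to do
        if healthy:
            self.failures = 0        # a good check resets the run of failures
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            self.failed_over = True
            return True
        return False


monitor = FailoverMonitor(threshold=3)
checks = [False, False, True, False, False, False]
decisions = [monitor.record(c) for c in checks]
```

With these inputs the two early failures are forgiven by the good check
that follows them, and failover triggers only on the third failure in a
row.  Even something this small has edge cases that need testing, which is
the effort I'm talking about.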

When I left my last job, I had to hand over a high-reliability clustered
system I'd maintained for years, originally migrated from VMS.  There was no
way to get someone else up to speed on the system fast enough, so it was
restructured to use a single active server booting from a SAN, with an
identical machine as a cold standby.  In case of a server failure, it'd just
be a matter of switching the SAN connections over.  Not as slick as the
original setup, but much simpler to understand and maintain.  I think
they've finally got it back in its proper configuration.

The moral of the story is that you have to balance the complexity of your
high-reliability system against your tolerance for downtime and the
resources you have available.

Scott
N1VG