Monday, October 19, 2009 8:52 PM
We apologise to our Customers for the recent period of unscheduled downtime.
At 21:30 on the 17th of October 2009, we lost our primary firewall. Our engineers attempted to troubleshoot the issue remotely before dispatching an engineer to site with a spare unit just after 22:00.
After a heroic battle with the roadworks on the M4, engineers arrived on site to find an unresponsive firewall (though seemingly alive, judging by the blinking lights). A reset was attempted but failed, so the unit was removed from the rack and replaced with the spare. Service was resumed at approx 02:00 on the 18th of October.
At approx 09:00 on the 18th of October, the backup firewall also failed. It took us a while to source a replacement unit, and partial service was resumed at approx 14:00hrs after a lengthy session in the datacentre. Full service was restored by 17:00hrs.
At approx 07:00 on the 19th of October, we lost connectivity again. This time, however, the cause was a fibre break somewhere in Redbus Sovereign House in London, on our upstream provider's network. They were onto the issue immediately, and service was restored by 09:00hrs.
From examination of the two firewall units, our original firewall appears to have suffered a PSU malfunction which damaged the main logic board, causing the unit to appear to power up but not do anything. After being left off for 12 hours, the PSU would no longer turn on, and powering the logic board from a normal ATX PSU failed to bring it back to life. The logic board is a hybrid PC design, combining some standard PC components (a VIA CPU, for example, and standard PC133 memory) with custom security ASICs and a CompactFlash OS card. Sadly, the board cannot be replaced without buying a whole new unit.
The backup firewall appears to have faulty network interfaces, and died with fatal exceptions output to the console port. We were able to wipe its OS and reinstall, but the same error reappeared.
Our post-incident analysis came to the conclusion that, while having a single firewall was clearly a vulnerability, we could not have predicted that both the primary and backup firewalls would fail within hours of one another. When we initially purchased the primary firewall, high availability was an expensive option on what was already a fairly expensive firewall, so the decision was taken to purchase one new unit and keep a second-hand unit stored back at base. It's possible that storing the backup for 4 years in a non-temperature-controlled environment was a contributory factor to its failure, due to components expanding and contracting as the temperature changed.
We will shortly be purchasing a high-availability pair of firewalls, as well as a third identical unit to keep at base.
Colocation/Dedicated Server Customers: If you had a custom firewall configuration or a VPN into your private LAN, then this will require reconfiguring. Unfortunately, there appears to be no hard copy of the rules we had set up prior to this failure, and as the unit appears to be beyond economical repair (BER), we have no way of retrieving them. If you need a rule created, please contact us and we will endeavour to set it up as soon as possible.
Regards,
Alex Threlfall
Network Operations Manager
Cyberprog New Media