Outage Report, Swan – 6/12/2014
On Thursday, June 12, 2014, at approximately 1:30 pm, clients located on the Swan virtual server became unresponsive. After a brief investigation, the cause appeared to be failing hardware on the host server (identified as TX113). When this issue was discovered, technicians at the TX data center began to move the virtual server off the failing host server and onto a new host server (identified as TX106). Services became fully operational at approximately 2:20 pm.
At approximately 6:00 pm, the server appeared to go down a second time. After running a quick test, it was determined that traffic to Swan was being routed back to the old server (TX113) and not to the new server hosting the sites (TX106). The issue was believed to be fixed at the TX data center at approximately 6:20 pm, with the problem being an incomplete migration of the virtual server.
Unfortunately, the same problem resurfaced at approximately 7:25 pm, roughly an hour after the server came back online. A senior-level technician investigated and properly identified the issue as a routing problem with switching gear at the data center. The problem was corrected at approximately 8:05 pm, and the senior tech at the TX data center spent additional time verifying that the programming would hold. At 8:30 pm, the senior tech cleared the incident and confirmed that the routing programming should not revert to the original server.
We sincerely apologize for this outage. While we understand that problems will arise from time to time, we make every effort to prevent downtime and to keep it minimal when it does occur. Once the outage occurred and the issue was identified, the virtual server was moved to new hardware and, aside from a programming issue with the switching gear, the issue was resolved. The switching gear issue caused additional downtime, and we feel that was preventable. We will work with the TX data center to ensure steps are taken to minimize the chance of it recurring.