Well, I understand that people are probably very upset with us by now, considering the length of the most recent downtime, and I completely understand why. Allow me to explain the events of the past few days, though, so that you can understand what happened and what measures we have put in place to prevent it from happening again:
On Wednesday, May 5th, we noticed that several of our Seagate hard drives were failing in some of our development and testing servers. After further investigation, we found that several client servers also had hard drives that were about to fail. Once we compiled the numbers, it became clear that all of the failing drives were the same Seagate model, which we used in over 100 of our servers. The drives were not at a critical stage yet, so we scheduled a maintenance window for Thursday, May 6th, and alerted all clients with 24 hours' notice. We then ordered a full set of replacement hard drives from Western Digital.
When we arrived at the data center Thursday afternoon and started transferring data to the new drives, every single one of the old drives failed in the process and became completely corrupted. Luckily, because of the way we have set up this system, all of the actual data and operating system configurations are stored on NAS units, which have partitioned sections for each server. The drive in the server simply serves as a connection between the internet and the data.
The only problem was that the partitions on the NAS are not labeled; they are identified by an encrypted key rather than a name. So in order to get systems back online, we had to pick a partition at random, mount it to a server, and boot it up to see where it belonged (the process also included loading the software that has to be on the actual server drive for the setup to work correctly).
So basically, from Thursday evening to this morning (Saturday), all of our technicians worked non-stop mounting partitions to servers to get clients back online. One by one, clients came back online as they were connected to the correct server. To help get everything up ASAP, we also contracted a 3rd-party server administration team to work through the night and most of the day Friday (we use them from time to time and they are fully competent and trusted). This cost us over $3,000.
Our sincere apologies for the inconvenience we caused; we know it was harmful and frustrating, and we truly are sorry. I do want to make the point, though, that this issue had nothing to do with us. It was entirely the fault of Seagate sending us over 100 bad hard drives (the model was apparently defective).
We will do everything in our power to make things easier for ProjectAvalon, including improving our uptime. We had everything running perfectly, we really did, until this happened.
So please, bear with us here. This could have happened to anyone, and it was simply TPTB that caused it to happen to us.
Thank you for taking the time to read this post, and please enjoy your weekend.
--
Karl