Well, I understand that people are probably very upset with us by now, considering the length of the most recent downtime, and I completely understand why. Allow me to explain the events of the past few days, though, so that you can understand what happened and what measures we have put in place to prevent it from happening again:
On Wednesday, May 5th, we noticed that several of our Seagate hard drives were failing in some of our development and testing servers. After further investigation, we found that several client servers also had hard drives that were about to fail. Once we compiled the numbers, it became clear that all of the failing drives were the same Seagate model, which we used in over 100 of our servers. The drives were not at a critical stage yet, so we scheduled a maintenance window for Thursday, May 6th, and alerted all clients with 24 hours' notice. We then ordered a full set of replacement hard drives from Western Digital.
When we arrived at the data center Thursday afternoon and started transferring data to the new drives, every single one of the old drives failed in the process and became completely corrupted. Luckily, because of the way we have set up this system, all of the actual data and operating system configurations are stored on NAS units, which have partitioned sections for each server. The drive in the server simply serves as a connection between the internet and the data.
The only problem was that the partitions on the NAS are not labeled; they are identified by an encrypted key rather than a name. So in order to get systems back online, we had to pick a partition at random, mount it to a server, and boot it up to see where it belonged (the process also included loading the software that has to be on the actual server drive for the setup to work correctly).
So basically, from Thursday evening to this morning (Saturday), all of our technicians worked non-stop mounting partitions to servers to get clients back online. One by one, clients came back online as they were connected to the correct server. To help get everything up ASAP, we also contracted a 3rd-party server administration team to work through the night and most of the day Friday (we use them from time to time and they are fully competent and trusted). This cost us over $3,000.
Our sincere apologies for the inconvenience we caused; we know it was harmful and frustrating, and we truly are sorry. I do want to make the point, though, that this issue had nothing to do with us. It was entirely the fault of Seagate sending us over 100 bad hard drives (the model was apparently defective).
We will do everything in our power to make things easier for ProjectAvalon, including improving our uptime. We had everything running perfectly, we really did, until this happened.
So please, bear with us here. This could have happened to anyone, and it was simply TPTB that caused it to happen to us.
Thank you for taking the time to read this post, and please enjoy your weekend.
--
Karl