
Recent Downtime



hostingtechs
8th May 2010, 19:05
Well, I understand that people are probably very upset with us by now, considering the length of the most recent downtime, and I completely understand why. Allow me to explain the events of the past few days, though, so that you can understand what happened and what measures were put in place to prevent it from happening again:

On Wednesday May 5th we noticed that several of our Seagate hard drives were failing in some of our development and testing servers. After further investigation, we found that several client servers also had hard drives that were about to fail. Once we laid out the statistics, we realized that all of the failing drives were of the same Seagate model. We used these drives in over 100 of our servers. The drives were not at a critical stage yet, so we scheduled a maintenance window for Thursday May 6th and alerted all clients, giving them 24 hours' notice. We then proceeded to order a full set of replacement hard drives from Western Digital.

When we arrived at the data center Thursday afternoon and started to transfer data to new drives, every single one of the drives failed in the process and became completely corrupted. Luckily, because of the way we have set up this system, all of the actual data and operating system configurations are stored on NAS units which have partitioned sections for each server. The drive in the server simply serves as a connection between the internet and the data.

The only problem was that the partitions on the NAS are not labeled because they are identified with an encrypted key and not a name, so in order to get systems back online we had to pick a random partition, mount it to a server, and boot it up to see where it belonged (the process also included loading the software we have to load on the actual server drives to get the setup working correctly).

So basically from Thursday evening to this morning (Saturday), all of our technicians did nothing except work non-stop mounting partitions to servers to get clients back online. One by one, clients came back online as they were connected to the correct server. To help get everything up ASAP, we also had to contract a third-party server administration team to work through the night and most of the day Friday (we use them from time to time and they are fully competent and trusted). This cost us over $3,000.

Our sincere apologies for the inconvenience we caused. We know it was harmful and annoying, and we truly are sorry. I do want to make the point, though, that this issue had nothing to do with us and was all the fault of Seagate sending us over 100 bad hard drives (the model was apparently defective).

We will do everything in our power to make things easier for ProjectAvalon, including improving the uptime. We had everything perfect, we really did, until this had to happen.

So please, bear with us here, this could have happened to anyone and it was simply TPTB that caused it to happen to us.

Thank you for taking the time to read this post, and please enjoy your weekend :)

--
Karl

Swami
8th May 2010, 19:18
A big kiss from the Swami for all the hard work you put into AV2............

:kiss3::whistle:

haibane
8th May 2010, 19:22
Hi Karl and all the hosting technicians. I originally intended to write something, erm, sarcastic, but then decided not to as I don't know **** about what's going on, and there's no point anyway and wouldn't help to improve anything, so let me just say this: Thanks for bringing PA back, and good luck in teh future (^__^ )

Wood
8th May 2010, 19:50
This sounds entirely reasonable to me:

On Wednesday May 5th we noticed that several of our Seagate hard drives were failing in some of our development and testing servers. After further investigation, we found that several client servers also had hard drives that were about to fail. Once we laid out the statistics, we realized that all of the failing drives were of the same Seagate model. We used these drives in over 100 of our servers. The drives were not at a critical stage yet, so we scheduled a maintenance window for Thursday May 6th and alerted all clients, giving them 24 hours' notice. We then proceeded to order a full set of replacement hard drives from Western Digital.

However, this is very weird:

When we arrived at the data center Thursday afternoon and started to transfer data to new drives, every single one of the drives failed in the process and became completely corrupted.



The only problem was that the partitions on the NAS are not labeled because they are identified with an encrypted key and not a name, so in order to get systems back online we had to pick a random partition, mount it to a server, and boot it up to see where it belonged (the process also included loading the software we have to load on the actual server drives to get the setup working correctly).
Given that, what is your procedure to replace a single server that has failed then? To check all the NAS partitions? To find an unmounted one? I believe you should have been aware for a long time of the potential problems with your procedure (that is, that in case of many servers going down you would have trouble identifying the partitions). Apparently the solution is to keep a list of which partition belongs to which server. Such a list can be hand-made or created with scripts. I do not understand why you did not have it ready.


So please, bear with us here, this could have happened to anyone and it was simply TPTB that caused it to happen to us.
I do not see other popular forums going down so often, not at all. It also looks as if both PA and PC have been fixed around the same time, towards the end of the time frame you have mentioned. So it seems you have brought up all your other servers first.

hostingtechs
8th May 2010, 20:20
However, this is very weird:
Yes, getting shipped 100 bad drives is very weird. I am wondering if UPS dropped the boxes or something in transit because I can't find any complaints about the model.





Given that, what is your procedure to replace a single server that has failed then? To check all the NAS partitions? To find an unmounted one? I believe you should have been aware for a long time of the potential problems with your procedure (that is, that in case of many servers going down you would have trouble identifying the partitions). Apparently the solution is to keep a list of which partition belongs to which server. Such a list can be hand-made or created with scripts. I do not understand why you did not have it ready.
If a single server fails, then one partition becomes unmounted and we go and re-mount it as it will be the only one un-mounted. There is no way of keeping track of them because the key changes. We use a modified version of Citrix's XenServer to accomplish this setup, and it is the way that Xen is set up that causes this.




I do not see other popular forums going down so often, not at all. It also looks as if both PA and PC have been fixed around the same time, towards the end of the time frame you have mentioned. So it seems you have brought up all your other servers first.
What happened this week could have happened to any site. In regards to past downtime, we've resolved all of those issues.

And let me correct you. ProjectCamelot was brought up about 10 hours before ProjectAvalon, I don't call that "fixed around the same time." The order that clients were brought up in was random, and we had many clients left to bring up after both PA and PC were restored, so again I need to correct you there. We did not bring up all of our other servers before we brought up PA and PC.




Hi Karl and all the hosting technicians. I originally intended to write something, erm, sarcastic, but then decided not to as I don't know **** about what's going on, and there's no point anyway and wouldn't help to improve anything, so let me just say this: Thanks for bringing PA back, and good luck in teh future (^__^ )

Thank you, I appreciate that. And yes, arguing gets us nowhere and wastes time that I could be using to increase the performance of the PA servers :)

--
Karl

Wood
8th May 2010, 20:50
Yes, getting shipped 100 bad drives is very weird. I am wondering if UPS dropped the boxes or something in transit because I can't find any complaints about the model.
The weird part, to me, is that all of them failed at the exact same time. It might be related to the stress of the backup, but I still think it is weird. Even if they were dropped or something I think not all of them should fail at the same time since they had not had the same amount of use (different machines) and it is unlikely they had the exact amount of damage. I understand they could fail at about the same time (same month) but not in the same hour.

Regarding the partitions, I am not a sys admin and I believe your explanation. I think it is a bad design though (not yours but of the software you use), and the issue with multiple servers failing at the same time is still there. If the keys change, why not run a script on every machine after each key change (or periodically) to store in a db what NAS partition belongs to it (or the key or a hash of the key)? It should be a few hours' task (at most) and it would have saved you a lot of money and ill will this time and maybe in the future. And I still find it hard to believe there is no other way to identify a partition. After all, it is stored on one or more hard disks, so the physical partition id should not change.
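
To make the idea concrete, here is a minimal sketch of such a script. It is only an illustration under stated assumptions, not anything XenServer or the host provides: it assumes each server can read the current key of the partition it has mounted, and the table layout and the names `open_inventory`, `record_mapping` and `server_for_key` are all hypothetical. Only a hash of the key is stored, as suggested above.

```python
import hashlib
import sqlite3


def open_inventory(path=":memory:"):
    """Open (or create) the inventory database. The schema is hypothetical."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS partition_map (
               key_hash   TEXT PRIMARY KEY,
               server     TEXT NOT NULL,
               updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )
    return conn


def key_hash(partition_key):
    """Store a SHA-256 hash rather than the raw partition key."""
    return hashlib.sha256(partition_key.encode("utf-8")).hexdigest()


def record_mapping(conn, server, partition_key):
    """Run on each server after a key change (or periodically)."""
    with conn:
        # Assumption: one partition per server, so drop its stale entry first.
        conn.execute("DELETE FROM partition_map WHERE server = ?", (server,))
        conn.execute(
            "INSERT OR REPLACE INTO partition_map (key_hash, server) VALUES (?, ?)",
            (key_hash(partition_key), server),
        )


def server_for_key(conn, partition_key):
    """Return the server that last reported this partition key, or None."""
    row = conn.execute(
        "SELECT server FROM partition_map WHERE key_hash = ?",
        (key_hash(partition_key),),
    ).fetchone()
    return row[0] if row else None
```

If the key can also be read from the NAS side (another assumption), then after a mass failure a call like `server_for_key(conn, key)` would name the server a partition belongs to, instead of mounting partitions by trial and error.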


And let me correct you. ProjectCamelot was brought up about 10 hours before ProjectAvalon, I don't call that "fixed around the same time." The order that clients were brought up in was random, and we had many clients left to bring up after both PA and PC were restored, so again I need to correct you there. We did not bring up all of our other servers before we brought up PA and PC.
Ok, I do not check PC as often as PA. I was under the impression they were up at about the same time but I guess I was wrong and let my previous remarks influence me.

Thanks for your work and let's hope this is the last failure for several months. The uptime since the new forum was launched has not been impressive.

hostingtechs
8th May 2010, 21:01
The weird part, to me, is that all of them failed at the exact same time. It might be related to the stress of the backup, but I still think it is weird.
Yes, very weird to me too, but who knows. And yes, that is something we thought of, but our NAS units back up at different times and the servers use different NAS units so we don't think this is the case.




Regarding the partitions, I am not a sys admin and I believe your explanation. I think it is a bad design though (not yours but of the software you use), and the issue with multiple servers failing at the same time is still there. If the keys change, why not run a script on every machine after each key change (or periodically) to store in a db what NAS partition belongs to it (or the key or a hash of the key)? It should be a few hours' task (at most) and it would have saved you a lot of money and ill will this time and maybe in the future. And I still find it hard to believe there is no other way to identify a partition. After all, it is stored on one or more hard disks, so the physical partition id should not change.
That's a good idea (writing the program to find the changed key), but putting it into action may be difficult since Citrix Xen is not open source. We'll definitely look into it though. The general idea of it is pretty weird, but we wouldn't have picked Citrix Xen unless the benefits outweighed the disadvantages, which they do.





Thanks for your work and let's hope this is the last failure for several months. The uptime since the new forum was launched has not been impressive.
Yes, I understand and we do take responsibility. You can rest assured we are working hard to keep everything up and running well and are looking forward to providing rock solid hosting for you guys.

norman
8th May 2010, 21:03
Well, I'm very glad it's all up and running ok again. Thank you (very much) for the hard and stressful effort it must have been.

A bad batch of hard drives is a likely explanation, but until I hear that this same model of H/D has done the same thing elsewhere I'll tend to see it as a location-specific incident, with all that implies.

Let's hope all is really well.

hostingtechs
8th May 2010, 22:22
Well, I'm very glad it's all up and running ok again. Thank you (very much) for the hard and stressful effort it must have been.

A bad batch of hard drives is a likely explanation, but until I hear that this same model of H/D has done the same thing elsewhere I'll tend to see it as a location-specific incident, with all that implies.

Let's hope all is really well.

No problem, it's what we do :)

Thanks for the comments.

Vidya Moksha
9th May 2010, 03:01
The forum is incredibly slow at the moment, and some posts don't even make it before they are timed out. Is this related to the recent problems or a separate issue? If it's a separate issue, any ideas what's causing the slowness?

hostingtechs
9th May 2010, 03:03
The forum is incredibly slow at the moment, and some posts don't even make it before they are timed out. Is this related to the recent problems or a separate issue? If it's a separate issue, any ideas what's causing the slowness?

Probably a separate issue. The forum is fine for me; can other people please post their thoughts on the speed?

hostingtechs
9th May 2010, 03:06
If people can post their thoughts on the speed and overall usability that would be great.

Thanks.

Vidya Moksha
9th May 2010, 03:15
it took me 2 mins to open this post, i will see how long it takes to post this reply

=========

2.20 seconds to post the reply

and 3 minutes to edit this post.. painfully slow :(

(will be another 2 mins now, this post along with updates is over 10 minutes)

(my broadband is fine and other sites are ok - just this one - im in NZ)

xbusymom
9th May 2010, 05:47
thanks for all your hard work and expertise... I know it can be extremely frustrating dealing with a myriad of details that seem to go wrong all at the same time...

(*I would like to see all the complainers jump into your shoes and do better ... or at least be patient while the people with the know-how sort through all the possibilities)

keep up the great work- I appreciate you and this board!

Anchor
9th May 2010, 09:43
The only time I have seen large numbers of disks fail simultaneously like that was power related (bad power surge owing to faulty PSU).

Swanny
9th May 2010, 09:44
Rubbish

................

Sunny d
9th May 2010, 10:06
slow slower slowest....!

bluestflame
9th May 2010, 11:41
thank you for maintaining this site, i'm sure people understand these things happen and there's nothing to be gained from venting their angst at the providers and maintainers of the site, it's up again now and surely will improve in performance, I know it is an exercise that didn't cost me $3000, i hope things run smoothly for you mob in future ~☼~ ...

zebowho
9th May 2010, 12:37
It's unbearably slow for me...took approx 30 min to get 4 pages to post (main page 1st (plus login), then pg 1 of this thread, 2nd page & "advanced" to post). It's been like this since your outage and I've often given up on logging in. Thanks for the hard work and updates (LoL now that I've been able to read one!!)

By the way my inet connection is good, bandwidth and throughput high and everything else pops quick with pa2 being the exception as well as my system is clean...using two malware scanners and anti-virus!!
-z

P.S. Hope this posts...:D

blue777
9th May 2010, 12:48
I would not say this was slow...but I wrote this yesterday.......

Gita
9th May 2010, 13:14
http://www.asiteforme.com/tutor/a-woman-showing-frustration-at-her-computer-~-pgi0318.jpg
http://change-production.s3.amazonaws.com/photos/wordpress_copies/animalrights/2009/11/frustration.jpg
http://www.istockphoto.com/file_thumbview_approve/6540051/2/istockphoto_6540051-young-nerdy-lady-screaming-out-of-frustration.jpg
http://api.ning.com/files/*sf-w3DBVv1UeHKCYykyzJAS2fTe2d4usfsbz2u*vd12fGO1YZTMlIOZNsnTUz4t75Oox4c5XgNfCX10G7hK8tSrKFuJuwQL/frustration.jpg

hostingtechs
9th May 2010, 13:14
(my broadband is fine and other sites are ok - just this one - im in NZ)
Keep in mind that because of how the internet works, with traffic passing through many networks along the way, just because other sites seem fine doesn't mean that your ISP doesn't have a problem. If all the sites you are trying to connect to happen to be in Texas, your ISP might still have a problem connecting to Virginia, causing trouble with PA for you. I'm not saying this is the case, especially since many people are having trouble; I'm just pointing this out in general.




thanks for all your hard work and expertise... I know it can be extremely frustrating dealing with a myriad of details that seem to go wrong all at the same time...

(*I would like to see all the complainers jump into your shoes and do better ... or at least be patient while the people with the know-how sort through all the possibilities)

keep up the great work- I appreciate you and this board!
I would like to see that too, I really would. And thank you for your comments; they are appreciated.




The only time I have seen large numbers of disks fail simultaneously like that was power related (bad power surge owing to faulty PSU).
Same here, but the power is fine (we've been testing that with power monitoring equipment since we moved into this facility). Also, if there was a power problem, we would have had other drives besides just that one Seagate model go bad. I'm really leaning towards the conclusion that they simply sent us a defective batch of drives, and since drives are all made by robots these days, once they make one mistake they're going to repeat it over and over and over.




slow slower slowest....!
Can you please be more descriptive so we can get a better idea on actual timing? Thanks!




thank you for maintaining this site, i'm sure people understand these things happen and there's nothing to be gained from venting their angst at the providers and maintainers of the site, it's up again now and surely will improve in performance, I know it is an exercise that didn't cost me $3000, i hope things run smoothly for you mob in future ~☼~ ...
Thanks, we appreciate that.




I would not say this was slow...but I wrote this yesterday.......
I'm a little confused about what you mean, can you please elaborate? Thanks!

hostingtechs
9th May 2010, 13:24
Well, I'm probably going to get yelled at for saying this, but I really think the possibility of a vBulletin malfunction should be looked into, as it is 100% possible.

If there were other PHP/MySQL scripts on the site I'd have people check those, but there aren't so we can't test that. How about the ProjectAvalon home page which is static html? Has anyone had problems with loading that or the interview files?

Thanks.

--
Karl

hostingtechs
9th May 2010, 19:13
Just finished changing some back end settings in hopes that it increases vBulletin's speed.

Please let us know your thoughts.

Thanks.

--
Karl

Sunny d
9th May 2010, 19:35
logging in : 1,5 min
redirecting to page 95 sec
trying to post 6 times. I dont have problems on other forums or sites, only here :(

hostingtechs
9th May 2010, 19:44
logging in : 1,5 min
redirecting to page 95 sec
trying to post 6 times. I dont have problems on other forums or sites, only here :(

Okay, we will continue to look into it.

--
Karl

Wood
9th May 2010, 20:11
Please check these two reports of load times:
http://host-tracker.com/check_res_ajx/4850338-0/share/
http://host-tracker.com/check_res_ajx/4850355-0/share/

hostingtechs
9th May 2010, 20:47
Added another instance into the cluster, as well as increased RAM for all the cloud upstream processors.

http://host-tracker.com/check_res_ajx/4850480-0/share/

Wood
9th May 2010, 21:04
Without the slash at the end I think it only gets an http redirect (see the page size, 185 bytes).
This is a report for http://projectavalon.net/forum4/ : http://host-tracker.com/check_res_ajx/4850508-0/share/
And this is for http://projectavalon.net/forum4/forum.php : http://host-tracker.com/check_res_ajx/4850525-0/share/
I don't know what the difference between those two is, but the page sizes are slightly different.
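
For anyone who wants to check the redirect theory by hand, here is a small sketch of the test I have in mind. It is an illustration only: the function name `is_trailing_slash_redirect` is made up, and the status code and `Location` header in the usage line are assumptions about how the server answers, since all the report shows is the 185-byte page.

```python
from urllib.parse import urlsplit


def is_trailing_slash_redirect(requested_url, status, location):
    """True if the response only redirects to the same URL plus a final '/'."""
    if status not in (301, 302, 303, 307, 308) or not location:
        return False
    req = urlsplit(requested_url)
    loc = urlsplit(location)
    # A relative Location header inherits the host of the original request.
    same_host = (loc.netloc or req.netloc).lower() == req.netloc.lower()
    return same_host and not req.path.endswith("/") and loc.path == req.path + "/"
```

If the server did answer the request for http://projectavalon.net/forum4 with a 301 pointing at http://projectavalon.net/forum4/, the function would return True, which would mean the report was timing the redirect rather than the forum page itself.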

Could it be a database bandwidth problem?

Wood
9th May 2010, 21:33
I've just run some more tests from a machine in a datacentre in France.
"time wget http://projectavalon.net/forum4": 0m1.046s 0m1.150s 0m12.115s 0m1.571s 0m1.578s
Similar results for "time wget http://projectavalon.net/forum4/"
From the page size it seems wget resolves the redirect.
Apparently after this last, brief, shutdown things are better.
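
A rough way to repeat and summarise this kind of test, in case it helps. This is a sketch, not the exact commands I ran: `time_fetch` and `summarize` are made-up helper names, `time_fetch` uses Python's urllib instead of wget, and the sample list is just the five wget timings quoted above.

```python
import statistics
import time
import urllib.request


def time_fetch(url, timeout=30):
    """Fetch url once (urllib follows redirects) and return elapsed seconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def summarize(samples):
    """Return the median, worst case and mean of a list of timings."""
    return {
        "median": statistics.median(samples),
        "max": max(samples),
        "mean": statistics.mean(samples),
    }


# The five wget timings quoted above, in seconds.
wget_times = [1.046, 1.150, 12.115, 1.571, 1.578]
summary = summarize(wget_times)
```

For those five samples the median is 1.571 s while the mean is about 3.49 s, so the single 12 s request dominates the average; the median gives a better idea of a typical page load.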

hostingtechs
9th May 2010, 21:33
Without the slash at the end I think it only gets an http redirect (see the page size, 185 bytes).
This is a report for http://projectavalon.net/forum4/ : http://host-tracker.com/check_res_ajx/4850508-0/share/
And this is for http://projectavalon.net/forum4/forum.php : http://host-tracker.com/check_res_ajx/4850525-0/share/
I don't know what the difference between those two is, but the page sizes are slightly different.

Could it be a database bandwidth problem?

Just realized that...I've been running those reports for a few minutes now, and they're getting better each time. It will take the load balancer a little bit of time since we added another instance and it needs to wait for people to leave and new ones to connect.

Already checked into that, but we don't think that is the case.

hostingtechs
9th May 2010, 21:39
http://host-tracker.com/check_res_ajx/4850642-0/share/ :)

By the way thanks very much Wood for the host-tracker idea. It's very handy and works better than the other site we were using!

--
Karl

Wood
9th May 2010, 21:52
By the way thanks very much Wood for the host-tracker idea. It's very handy and works better than the other site we were using!

You're welcome :)

The 'new posts' link is still not impressive though: http://host-tracker.com/check_res_ajx/4850700-0/share/
I use it a lot and I've found it is usually (not just today) much slower than checking the main forum page.

hostingtechs
9th May 2010, 21:55
The 'new posts' link is still not impressive though: http://host-tracker.com/check_res_ajx/4850700-0/share/
I use it a lot and I've found it is usually (not just today) much slower than checking the main forum page.
Hmm, yeah it's slower than the rest of the pages for me (7.5 seconds to fully load). It's doing a pretty intense db query, so it's naturally going to take longer, but I'll see what we can dig up.

--
Karl

Decibellistics
10th May 2010, 00:27
If people have become extremely upset with you guys for the faultiness of technology it is a sad sad sad day in the world of this project. Kudos to you techs

TECHNICIANS SHOULD RULE THE WORLD :) lololol

The biggest problem the news would have to report on in the world is how Bob Smob fudged up his soldering job HA!

hostingtechs
10th May 2010, 01:19
Kudos to you techs

Thank you.

--
Karl

Swami
10th May 2010, 06:54
Good work dudes.... :thumb:

Browsing like "the speed of light" now..!

Sunny d
10th May 2010, 08:58
Its getting better :peace: !!! thx

Lucrum
10th May 2010, 11:19
aaahh...defective hardware, so annoying when a full batch of them go wrong.

Cheers for giving info on the matter though, you have my complete understanding as I used to be a tech myself. ;)

hostingtechs
10th May 2010, 11:30
Yes! This is great to hear :)

Swanny
10th May 2010, 21:34
How long until it goes wrong again??

hostingtechs
10th May 2010, 21:57
How long until it goes wrong again??

Let's hope for the best :)

--
Karl

Swanny
10th May 2010, 21:58
Quote Originally Posted by blue777
I would not say this was slow...but I wrote this yesterday.......



I'm a little confused about what you mean, can you please elaborate? Thanks!

It's called humour ;)

hostingtechs
10th May 2010, 21:59
Quote Originally Posted by blue777
I would not say this was slow...but I wrote this yesterday.......



It's called humour ;)

Hehe ok :)

hostingtechs
14th May 2010, 21:59
Hi Everyone,

I just wanted to check up and see how things were going. Can people please post some feedback as to site speed?

Thanks!

--
Karl

Etherios
14th May 2010, 22:09
atm is doing fine.. but it was down for a few hours a while ago...

hostingtechs
14th May 2010, 22:11
atm is doing fine.. but it was down for a few hours a while ago...

What time was that? Also, how long was it down, and was it totally down, or just slow? If it was totally down, was it giving any errors or just nothing would load?

Thanks.

lilac
15th May 2010, 00:51
I have been experiencing slow loading for the last 2-3 days.

Etherios
15th May 2010, 01:02
What time was that? Also, how long was it down, and was it totally down, or just slow? If it was totally down, was it giving any errors or just nothing would load?

Thanks.

it was saying that it couldnt find the page... and type a new url (the generic msg etc)

i dont remember times... it lasted for at least 2 hours today... ill take note if it happens again

Baelsfire
15th May 2010, 01:13
As soon as i saw "seagate" everything made sense ahaaaaaaaaaaaaha! (Ive had problems with their drives in the past, as have friends)

bluestflame
15th May 2010, 01:22
hostingtechs thanks for the work you do

hostingtechs
15th May 2010, 04:42
hostingtechs thanks for the work you do

No problem :)




it was saying that it couldnt find the page... and type a new url (the generic msg etc)

i dont remember times... it lasted for at least 2 hours today... ill take note if it happens again

Okay, thanks.




As soon as i saw "seagate" everything made sense ahaaaaaaaaaaaaha! (Ive had problems with their drives in the past, as have friends)

Ahh, well we won't be using Seagate again.