Thread: Recent Downtime

  1. Link to Post #1
    Unsubscribed
    Join Date
    5th April 2010
    Location
    Northern Virginia
    Posts
    159
    Thanks
    0
    Thanked 0 times in 0 posts

    Default Recent Downtime

    Well, I understand that people are probably very upset with us by now, considering the length of the most recent downtime, and I completely understand why. Allow me to explain the events of the past few days, though, so that you can understand what happened and what measures were put in place to prevent it from happening again:

    On Wednesday May 5th we noticed that several of our Seagate hard drives were failing in some of our development and testing servers. After further investigation, several client servers also had hard drives that were about to fail. Once we laid out the statistics, we became aware of the fact that all of the drives which were failing were of the same Seagate model. We used these drives in over 100 of our servers. The drives were not in critical stage yet, so we scheduled a maintenance window for Thursday May 6th and alerted all clients giving them 24 hour notice. We then proceeded to order an entire replacement of hard drives from Western Digital.

    When we arrived at the data center Thursday afternoon and started to transfer data to new drives, every single one of the drives failed in the process and became completely corrupted. Luckily, because of the way we have set up this system, all of the actual data and operating system configurations are stored on NAS units which have partitioned sections for each server. The drive in the server simply serves as a connection between the internet and the data.

    The only problem was that the partitions on the NAS are not labeled because they are identified with an encrypted key and not a name, so in order to get systems back online we had to pick a random partition, mount it to a server, and boot it up to see where it belonged (the process also included loading the software we have to load on the actual server drives to get the setup working correctly).

    So basically from Thursday evening to this morning (Saturday), all of our technicians did nothing except work non-stop mounting partitions to servers to get clients back online. One by one, clients came back online as they were connected to the correct server. To help get everything up ASAP, we also had to contract out with a 3rd party server administration team to work through the night and most of the day Friday (we use them from time to time and they are fully confident and trusted). This cost us over $3,000.

    Our sincere apologies for the inconvenience we caused; we know it was harmful and annoying, and we truly are sorry. I do want to make the point, though, that this issue had nothing to do with us, and was all the fault of Seagate sending us over 100 bad hard drives (the model was apparently defective).

    We will do everything in our power to make things easier for ProjectAvalon, including improving the uptime. We had everything perfect, we really did, until this had to happen.

    So please, bear with us here, this could have happened to anyone and it was simply TPTB that caused it to happen to us.

    Thank you for taking the time to read this post, and please enjoy your weekend

    --
    Karl

  2. Link to Post #2
    Netherlands Avalon Retired Member
    Join Date
    18th March 2010
    Location
    Under sealevel
    Age
    57
    Posts
    1,741
    Thanks
    208
    Thanked 1,172 times in 408 posts

    Default Re: Recent Downtime

    A big kiss from the Swami for all the hard work you put into AV2............


  3. Link to Post #3
    Czech Republic Avalon Member haibane's Avatar
    Join Date
    18th March 2010
    Location
    Right behind you XD
    Posts
    214
    Thanks
    900
    Thanked 372 times in 128 posts

    Default Re: Recent Downtime

    Hi Karl and all the hosting technicians. I originally intended to write something, erm, sarcastic, but then decided not to as I don't know **** about what's going on, and there's no point anyway and wouldn't help to improve anything, so let me just say this: Thanks for bringing PA back, and good luck in teh future (^__^ )

  4. Link to Post #4
    Spain Avalon Retired Member
    Join Date
    16th March 2010
    Age
    48
    Posts
    773
    Thanks
    268
    Thanked 623 times in 203 posts

    Default Re: Recent Downtime

    This sounds entirely reasonable to me:
    Quote Posted by hostingtechs (here)
    On Wednesday May 5th we noticed that several of our Seagate hard drives were failing in some of our development and testing servers. After further investigation, several client servers also had hard drives that were about to fail. Once we laid out the statistics, we became aware of the fact that all of the drives which were failing were of the same Seagate model. We used these drives in over 100 of our servers. The drives were not in critical stage yet, so we scheduled a maintenance window for Thursday May 6th and alerted all clients giving them 24 hour notice. We then proceeded to order an entire replacement of hard drives from Western Digital.
    However, this is very weird:
    Quote Posted by hostingtechs (here)
    When we arrived at the data center Thursday afternoon and started to transfer data to new drives, every single one of the drives failed in the process and became completely corrupted.

    Quote Posted by hostingtechs (here)
    The only problem was that the partitions on the NAS are not labeled because they are identified with an encrypted key and not a name, so in order to get systems back online we had to pick a random partition, mount it to a server, and boot it up to see where it belonged (the process also included loading the software we have to load on the actual server drives to get the setup working correctly).
    Given that, what is your procedure to replace a single server that has failed then? To check all the NAS partitions? To find an unmounted one? I believe you should have been aware for a long time of the potential problems with your procedure (that is, that in case of many servers going down you would have trouble identifying the partitions). Apparently the solution is to keep a list of which partition belongs to which server. Such a list can be hand-made or created with scripts. I do not understand why you did not have it ready.

    Quote Posted by hostingtechs (here)
    So please, bear with us here, this could have happened to anyone and it was simply TPTB that caused it to happen to us.
    I do not see other popular forums going down so often, not at all. It also looks as if both PA and PC have been fixed around the same time, towards the end of the time frame you have mentioned. So it seems you have brought up all your other servers first.

  5. Link to Post #5
    Unsubscribed
    Join Date
    5th April 2010
    Location
    Northern Virginia
    Posts
    159
    Thanks
    0
    Thanked 0 times in 0 posts

    Default Re: Recent Downtime

    Quote Posted by Wood (here)

    However, this is very weird:
    Yes, getting shipped 100 bad drives is very weird. I am wondering if UPS dropped the boxes or something in transit because I can't find any complaints about the model.



    Quote Posted by Wood (here)
    Given that, what is your procedure to replace a single server that has failed then? To check all the NAS partitions? To find an unmounted one? I believe you should have been aware for a long time of the potential problems with your procedure (that is, that in case of many servers going down you would have trouble identifying the partitions). Apparently the solution is to keep a list of which partition belongs to which server. Such a list can be hand-made or created with scripts. I do not understand why you did not have it ready.
    If a single server fails, then one partition becomes unmounted and we go and re-mount it as it will be the only one un-mounted. There is no way of keeping track of them because the key changes. We use a modified version of Citrix's XenServer to accomplish this setup, and it is the way that Xen is set up that causes this.



    Quote Posted by Wood (here)
    I do not see other popular forums going down so often, not at all. It also looks as if both PA and PC have been fixed around the same time, towards the end of the time frame you have mentioned. So it seems you have brought up all your other servers first.
    What happened this week could have happened to any site. In regards to past downtime, we've resolved all of those issues.

    And let me correct you. ProjectCamelot was brought up about 10 hours before ProjectAvalon, I don't call that "fixed around the same time." The order that clients were brought up in was random, and we had many clients left to bring up after both PA and PC were restored, so again I need to correct you there. We did not bring up all of our other servers before we brought up PA and PC.



    Quote Posted by haibane (here)
    Hi Karl and all the hosting technicians. I originally intended to write something, erm, sarcastic, but then decided not to as I don't know **** about what's going on, and there's no point anyway and wouldn't help to improve anything, so let me just say this: Thanks for bringing PA back, and good luck in teh future (^__^ )
    Thank you, I appreciate that. And yes, arguing gets us nowhere and wastes my time that I could be using to increase the performance of the PA servers.

    --
    Karl

  6. Link to Post #6
    Spain Avalon Retired Member
    Join Date
    16th March 2010
    Age
    48
    Posts
    773
    Thanks
    268
    Thanked 623 times in 203 posts

    Default Re: Recent Downtime

    Quote Posted by hostingtechs (here)
    Yes, getting shipped 100 bad drives is very weird. I am wondering if UPS dropped the boxes or something in transit because I can't find any complaints about the model.
    The weird part, to me, is that all of them failed at the exact same time. It might be related to the stress of the backup, but I still think it is weird. Even if they were dropped or something I think not all of them should fail at the same time since they had not had the same amount of use (different machines) and it is unlikely they had the exact amount of damage. I understand they could fail at about the same time (same month) but not in the same hour.

    Regarding the partitions, I am not a sys admin and I believe your explanation. I think it is a bad design though (not yours but of the software you use), and the issue with multiple servers failing at the same time is still there. If the keys change, why not run a script on every machine after each key change (or periodically) to store in a db what NAS partition belongs to it (or the key or a hash of the key)? It should be a few hours' task (at most) and it would have saved you a lot of money and ill will this time and maybe in the future. And I still find it hard to believe there is no other way to identify a partition. After all, it is stored on one or more hard disks, so the physical partition id should not change.
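
    A minimal sketch of the kind of script I mean, in Python. It assumes each server can somehow read its current partition key; I do not know how XenServer exposes it, so the keys below are placeholder strings, and every name here (`open_inventory`, `record_mapping`, `find_server`, the `partition_map` table) is hypothetical. It only stores a hash of the key next to the server name so that a partition can be matched back to its server later.

```python
import hashlib
import sqlite3
import time


def open_inventory(path=":memory:"):
    """Open (or create) the inventory database. The schema is an assumption."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS partition_map ("
        " server TEXT PRIMARY KEY,"
        " key_hash TEXT NOT NULL,"
        " recorded_at REAL NOT NULL)"
    )
    return conn


def key_hash(partition_key):
    """Store a SHA-256 hash rather than the key itself."""
    return hashlib.sha256(partition_key.encode("utf-8")).hexdigest()


def record_mapping(conn, server, partition_key):
    """Run after each key change (or periodically) on every server.

    A newer record for the same server replaces the older one.
    """
    conn.execute(
        "INSERT OR REPLACE INTO partition_map (server, key_hash, recorded_at)"
        " VALUES (?, ?, ?)",
        (server, key_hash(partition_key), time.time()),
    )
    conn.commit()


def find_server(conn, partition_key):
    """Return the server last recorded for this key, or None if unknown."""
    row = conn.execute(
        "SELECT server FROM partition_map WHERE key_hash = ?",
        (key_hash(partition_key),),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = open_inventory()
    # Placeholder server names and keys, for illustration only.
    record_mapping(conn, "web01", "key-A")
    record_mapping(conn, "web02", "key-B")
    print(find_server(conn, "key-B"))
```

    With something like this in place, identifying a partition after a mass failure would be a lookup instead of trial-and-error mounting, as long as the last recorded key is still the current one.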

    Quote Posted by hostingtechs (here)
    And let me correct you. ProjectCamelot was brought up about 10 hours before ProjectAvalon, I don't call that "fixed around the same time." The order that clients were brought up in was random, and we had many clients left to bring up after both PA and PC were restored, so again I need to correct you there. We did not bring up all of our other servers before we brought up PA and PC.
    Ok, I do not check PC as often as PA. I was under the impression they were up at about the same time but I guess I was wrong and let my previous remarks influence me.

    Thanks for your work and let's hope this is the last failure in several months. The uptime since the new forum was launched has not been impressive.
    Last edited by Wood; 8th May 2010 at 20:55.

  7. Link to Post #7
    Unsubscribed
    Join Date
    5th April 2010
    Location
    Northern Virginia
    Posts
    159
    Thanks
    0
    Thanked 0 times in 0 posts

    Default Re: Recent Downtime

    Quote Posted by Wood (here)
    The weird part, to me, is that all of them failed at the exact same time. It might be related to the stress of the backup, but I still think it is weird.
    Yes, very weird to me too, but who knows. And yes, that is something we thought of, but our NAS units back up at different times and the servers use different NAS units so we don't think this is the case.



    Quote Posted by Wood (here)
    Regarding the partitions, I am not a sys admin and I believe your explanation. I think it is a bad design though (not yours but of the software you use), and the issue with multiple servers failing at the same time is still there. If the keys change, why not run a script on every machine after each key change (or periodically) to store in a db what NAS partition belongs to it (or the key or a hash of the key)? It should be a few hours' task (at most) and it would have saved you a lot of money and ill will this time and maybe in the future. And I still find it hard to believe there is no other way to identify a partition. After all, it is stored on one or more hard disks, so the physical partition id should not change.
    That's a good idea (writing the program to find the changed key), but putting it into action may be difficult since Citrix Xen is not open source. We'll definitely look into it though. The general idea of it is pretty weird, but we wouldn't have picked Citrix Xen unless the benefits outweighed the disadvantages, which they do.



    Quote Posted by Wood (here)
    Thanks for your work and let's hope this is the last failure in several months. The uptime since the new forum was launched has not been impressive.
    Yes, I understand and we do take responsibility. You can rest assured we are working hard to keep everything up and running well and are looking forward to providing rock solid hosting for you guys.

  8. Link to Post #8
    Avalon Member norman's Avatar
    Join Date
    25th March 2010
    Location
    too close to the hot air exhaust
    Age
    68
    Posts
    9,111
    Thanks
    10,029
    Thanked 56,721 times in 8,381 posts

    Default Re: Recent Downtime

    Well, I'm very glad it's all up and running ok again. Thank you (very much) for the hard and stressful effort it must have been.

    A bad batch of hard drives is a likely explanation, but until I hear that this same model of H/D has done the same thing elsewhere I'll tend to see it as a location-specific incident, with all that implies.

    Let's hope all is really well.
    ..................................................my first language is TYPO..............................................

  9. Link to Post #9
    Unsubscribed
    Join Date
    5th April 2010
    Location
    Northern Virginia
    Posts
    159
    Thanks
    0
    Thanked 0 times in 0 posts

    Default Re: Recent Downtime

    Quote Posted by norman (here)
    Well, I'm very glad it's all up and running ok again. Thank you (very much) for the hard and stressful effort it must have been.

    A bad batch of hard drives is a likely explanation, but until I hear that this same model of H/D has done the same thing elsewhere I'll tend to see it as a location-specific incident, with all that implies.

    Let's hope all is really well.
    No problem, it's what we do

    Thanks for the comments.

  10. Link to Post #10
    Avalon Retired Member Vidya Moksha's Avatar
    Join Date
    17th March 2010
    Location
    virtually? here
    Age
    57
    Posts
    238
    Thanks
    0
    Thanked 44 times in 24 posts

    Default Re: Recent Downtime

    The forum is incredibly slow at the moment, and some posts don't even make it before they are timed out. Is this related to the recent problems or a separate issue? If it's a separate issue, any ideas what's causing the slowness?

  11. Link to Post #11
    Unsubscribed
    Join Date
    5th April 2010
    Location
    Northern Virginia
    Posts
    159
    Thanks
    0
    Thanked 0 times in 0 posts

    Default Re: Recent Downtime

    Quote Posted by Vidya Moksha (here)
    The forum is incredibly slow at the moment, and some posts don't even make it before they are timed out. Is this related to the recent problems or a separate issue? If it's a separate issue, any ideas what's causing the slowness?
    Probably a separate issue. The forum is fine for me; can other people please post their thoughts on the speed?

  12. Link to Post #12
    Unsubscribed
    Join Date
    5th April 2010
    Location
    Northern Virginia
    Posts
    159
    Thanks
    0
    Thanked 0 times in 0 posts

    Default Re: Recent Downtime

    If people can post their thoughts on the speed and overall usability that would be great.

    Thanks.

  13. Link to Post #13
    Avalon Retired Member Vidya Moksha's Avatar
    Join Date
    17th March 2010
    Location
    virtually? here
    Age
    57
    Posts
    238
    Thanks
    0
    Thanked 44 times in 24 posts

    Default Re: Recent Downtime

    it took me 2 mins to open this post, i will see how long it takes to post this reply

    =========

    2.20 seconds to post the reply

    and 3 minutes to edit this post.. painfully slow

    (will be another 2 mins now, so this post along with updates is over 10 minutes)

    (my broadband is fine and other sites are ok - just this one - I'm in NZ)
    Last edited by Vidya Moksha; 9th May 2010 at 03:22.

  14. Link to Post #14
    United States Avalon Member xbusymom's Avatar
    Join Date
    16th March 2010
    Location
    Kansas City MO
    Posts
    874
    Thanks
    359
    Thanked 1,188 times in 396 posts

    Default Re: Recent Downtime

    thanks for all your hard work and expertise... I know it can be extremely frustrating dealing with a myriad of details that seem to go wrong all at the same time...

    (*I would like to see all the complainers jump into your shoes and do better ... or at least be patient while the people with the know-how sort through all the possibilities)

    keep up the great work- I appreciate you and this board!

  15. Link to Post #15
    Australia Avalon Member Anchor's Avatar
    Join Date
    10th February 2010
    Location
    NSW, Australia
    Language
    English
    Age
    60
    Posts
    4,601
    Thanks
    11,212
    Thanked 25,835 times in 3,731 posts

    Default Re: Recent Downtime

    The only time I have seen large numbers of disks fail simultaneously like that was power related (bad power surge owing to faulty PSU).
    -- Let the truth be known by all, let the truth be known by all, let the truth be known by all --

  16. Link to Post #16
    Avalon Member Swanny's Avatar
    Join Date
    17th March 2010
    Location
    GFL are not real
    Age
    47
    Posts
    561
    Thanks
    0
    Thanked 37 times in 23 posts

    Default Re: Recent Downtime

    Rubbish

    ................
    .
    Swanny is the waskly wabbit.
    Temet Nosce
    Don't worry.....Be happy 44

  17. Link to Post #17
    Avalon Member Sunny d's Avatar
    Join Date
    17th March 2010
    Age
    59
    Posts
    71
    Thanks
    12
    Thanked 21 times in 3 posts

    Default Re: Recent Downtime

    slow slower slowest....!

  18. Link to Post #18
    Australia Avalon Member bluestflame's Avatar
    Join Date
    21st April 2010
    Location
    a spark
    Age
    52
    Posts
    2,819
    Thanks
    16,584
    Thanked 8,500 times in 1,808 posts

    Default Re: Recent Downtime

    thank you for maintaining this site, I'm sure people understand these things happen and there's nothing to be gained from venting their angst at the providers and maintainers of the site. It's up again now and surely will improve in performance. I know it is an exercise that didn't cost me $3,000; I hope things run smoothly for you mob in future ~☼~ ...

  19. Link to Post #19
    Avalon Member zebowho's Avatar
    Join Date
    22nd April 2010
    Location
    Here and now :)
    Posts
    184
    Thanks
    2,971
    Thanked 764 times in 147 posts

    Default Re: Recent Downtime

    It's unbearably slow for me... it took approx 30 min to get 4 pages to load (main page 1st (plus login), then pg 1 of this thread, the 2nd page, and "advanced" to post). It's been like this since your outage and I've often given up on logging in. Thanks for the hard work and updates (LoL now that I've been able to read one!!)

    By the way, my inet connection is good, bandwidth and throughput are high, and everything else pops up quick, with pa2 being the exception. My system is also clean... I'm using two malware scanners and anti-virus!!
    -z

    P.S. Hope this posts...:D
    A single thought is a seed....Imagination is the water!

  20. Link to Post #20
    Unsubscribed
    Join Date
    17th March 2010
    Posts
    670
    Thanks
    0
    Thanked 0 times in 0 posts

    Default Re: Recent Downtime

    I would not say this was slow...but I wrote this yesterday.......
