VPSLink - Unresponsive Support for 17+ Hours

Update (1/26/08 @ 11:50pm): Wow - it’s been 17 hours since I last heard from VPSLink regarding my issue with their VPS. Joel Spolsky recently experienced something similar. If I had been going for six nines, my six sigma would be blown for probably the next 100 years =)

Luckily my server isn’t that important. But really? 17 hours? Wow. Even DreamHost will get back to you within 1-2 hours.

VPSLink

The thing I love about VPSLink is that they can partition a new VPS for you within hours.

Normally I have not had any problems with VPSLink, but I kept bumping into memory allocation issues with only 1GB RAM on my VPS, so yesterday I bit the bullet and upgraded to their most powerful option @ $129.95 per month.

That’s when things started to go downhill.

After submitting a ticket, I got this response:

Your VE should have been migrated to a Link-6 node, but for whatever
reason, that never happened. I am doing so now. This should fix the problem.

But 11 hours later when I had a chance to take a look at the issue again, the server was still down.

When I restarted via the control panel, the CPU load was spiking for no apparent reason.

I run about 4-5 rails apps on there with 2-3 mongrels each. Collectively they drive less than 1k visitors / day.

So it’s a bit surprising when my CPU load spikes to 15, 17+!

I’m no expert on this subject, but I believe that means the server is busting through 1,500%+ of its available CPU cycles!

Pretty much the server is unusable when the CPU load is this high.

I’ll keep this post updated to let you guys know how VPSLink responds…

Update…

In an attempt to investigate further, I turned off all possible applications that were running on the box like: nginx, mongrels, mysqld, sendmail.

This is the list when I do a ps aux:

[root@videolockr init.d]# ps aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   1936   672 ?        Ss   17:24   0:00 init [3]
root     24101  0.0  0.0   1584   564 ?        Ss   17:24   0:00 syslogd -m 0
dbus     24130  0.0  0.0   2916   704 ?        Ss   17:24   0:00 dbus-daemon --system
root     28168  0.0  0.0   4924  1112 ?        Ss   17:24   0:00 /usr/sbin/sshd
root     28220  0.0  0.0   2156   804 ?        Ss   17:24   0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
root     29869  0.0  0.1   7780  2344 ?        Ss   17:25   0:00 sshd: deploy [priv]
deploy   29925  0.0  0.0   7928  1712 ?        R    17:25   0:00 sshd: deploy@pts/0
deploy   29926  0.0  0.0   2436  1360 pts/0    Ss   17:25   0:00 -bash
root     30428  0.0  0.0   2656  1080 pts/0    S    17:25   0:00 su
root     30552  0.0  0.0   2308  1364 pts/0    S    17:26   0:00 bash
root     30701  0.0  0.1   7780  2344 ?        Ss   17:26   0:00 sshd: deploy [priv]
deploy   31785  0.0  0.0   7780  1700 ?        S    17:26   0:00 sshd: deploy@pts/1
deploy   31786  0.0  0.0   2440  1384 pts/1    Ss   17:26   0:00 -bash
root     32059  0.0  0.0   3152  1112 ?        Ss   17:27   0:00 crond
xfs      32288  0.0  0.0   3088  1152 ?        Ss   17:27   0:00 xfs -droppriv -daemon
deploy    3810  0.0  0.0   2060   992 pts/1    S+   17:47   0:00 top
root      5935  0.0  0.0   2032   820 pts/0    R+   17:49   0:00 ps aux
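One rough way to back up the claim that nothing in a listing like this is eating CPU is to total its %CPU column. Here is a minimal Python sketch, assuming the output has been saved as text in the usual `ps aux` column order. The function name `total_cpu_percent` and the shortened sample are made up for illustration. Keep in mind that `ps` reports %CPU averaged over each process's lifetime, so this is a sanity check rather than a live reading.

```python
def total_cpu_percent(ps_output: str) -> float:
    """Sum the %CPU column of `ps aux` text, skipping the header line."""
    total = 0.0
    for line in ps_output.strip().splitlines()[1:]:
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) >= 3:
            total += float(fields[2])
    return total


# A few rows copied from the listing above (shortened for the example).
sample = """USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0   1936   672 ?        Ss   17:24   0:00 init [3]
root     28168  0.0  0.0   4924  1112 ?        Ss   17:24   0:00 /usr/sbin/sshd
deploy    3810  0.0  0.0   2060   992 pts/1    S+   17:47   0:00 top
"""

print(total_cpu_percent(sample))
```

If the total stays near zero while the load average is high, the load is probably not coming from anything visible inside the VPS.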

More on the top command and load averages can be found here, which says:

The current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.
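The same three numbers can be read from a script. The sketch below uses Python's `os.getloadavg()` and divides by the core count, since a load average counts tasks that are running or waiting rather than a percentage of CPU. The helper name `load_per_core` is my own, and the core count reported inside a VPS may not match what the host actually allots, so treat the ratio as a rough guide.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Load average divided by core count; well above 1.0 suggests a backlog."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


try:
    # The same 1, 5 and 15 minute figures that top and uptime show (Unix only).
    one, five, fifteen = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"1m={one:.2f} 5m={five:.2f} 15m={fifteen:.2f} cores={cores}")
    print(f"1-minute load per core: {load_per_core(one, cores):.2f}")
except (OSError, AttributeError):
    print("load average is not available on this platform")
```

By this measure a load of 15 on a single core means roughly 15 tasks competing for it, not 1,500% of anything.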

So, after stopping all of these processes, eventually the server load went down a bit as reported by “top”. You can see the results below (this is after all of those processes have been stopped).


5 Responses to “VPSLink - Unresponsive Support for 17+ Hours”

  1. Charles

    Dude, funny you should post about this today…I’m just in the process of wrangling a new VPS myself. I’ve had a dedicated box for the last couple of years but can’t justify the cost anymore so am migrating to a VPS and finding the whole thing pretty painful :(

  2. David Rasch

    Load of 15 just means 15 processes are waiting for the CPU. This could be a symptom of a completely healthy machine. Our webservers hover around 15 during a normal operating day (down from 25-30 before we got APC up and running). That’s not to say things aren’t going wonky on your server, but Load isn’t going to be the best metric for identifying this :)

    -David

  3. Shanti A. Braford

    David - I’m actually really curious to learn more about this.

    How do I tell then if the CPU really is overutilized?

  4. David Rasch

    In order to tell if the CPU is overutilized, you’ll need to have some control against which to measure it. Either:
    1. do some benchmarking
    2. monitor some simple benchmark over time

    One thing you could do is monitor and trend the page-load time for pages on your site over time. This would allow you to see when there were instances of relatively low utilization but relatively poor performance.

    The real problem with load is that on servers like the web servers I described, and on mail servers, lots of processes run to handle individual requests, and this can drive the load up under semi-normal operations. One of the primary things you should see is relatively slow changes in load. Load shouldn’t change from 1 to 15 over 10 minutes unless your traffic has also spiked substantially (not too likely unless you discover some event, like a blog post or a new link, and that should show up in your analytics).
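The page-load trending idea above could be sketched roughly as follows in Python. Everything here is illustrative: `time_call` and `rolling_mean` are made-up helper names, and the fetch is passed in as a callable so the timing logic does not depend on any particular HTTP client or URL.

```python
import time
from typing import Callable, List


def time_call(fetch: Callable[[], object]) -> float:
    """Return how many seconds a single call to `fetch` takes."""
    start = time.perf_counter()
    fetch()
    return time.perf_counter() - start


def rolling_mean(samples: List[float], window: int) -> List[float]:
    """Trailing average over up to `window` samples, one value per sample."""
    means = []
    for i in range(len(samples)):
        chunk = samples[max(0, i - window + 1): i + 1]
        means.append(sum(chunk) / len(chunk))
    return means


# Made-up timings in seconds; real ones would come from repeated time_call runs.
timings = [0.20, 0.22, 0.21, 0.90, 1.10]
print(rolling_mean(timings, window=3))
```

In practice `fetch` might be something like `lambda: urllib.request.urlopen(url).read()` run from cron every few minutes, with the timings appended to a log so that a slow drift or a sudden jump stands out against the trailing average.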

  5. Gravatar Icon 5 Shanti A. Braford 

    David,

    I gotcha. So the only way to tell if something is up is by diff’ing the current load against historical averages?

    In this particular case, the server was basically unusable, so I knew something was up. “top” load usually hovered around 1 historically but had jumped to 15.

    Was basically looking for a way to “prove” that the server was acting foo bar.

    i.e. that it wasn’t any of my scripts that started acting rogue.
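Comparing the current load against historical readings, as discussed above, could look something like this minimal Python sketch. It flags a reading that sits far outside the historical spread using a simple z-score. The function name `looks_abnormal`, the threshold of 3, and the sample history are assumptions for illustration; only the "around 1" and "15" figures come from this thread.

```python
from statistics import mean, stdev
from typing import Sequence


def looks_abnormal(history: Sequence[float], current: float,
                   z_threshold: float = 3.0) -> bool:
    """True if `current` is more than `z_threshold` deviations from the mean."""
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all counts as abnormal.
        return current != baseline
    return abs(current - baseline) / spread > z_threshold


# Made-up history hovering around 1, as the load on this server usually did.
history = [0.9, 1.1, 1.0, 1.2, 0.8]
print(looks_abnormal(history, 15.0))
print(looks_abnormal(history, 1.1))
```

This does not say what caused the jump, but paired with a process listing that shows nothing busy it gives a support ticket something concrete to point at.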


Shanti A. Braford blogs here.
