Update (1/26/08 @ 11:50pm): Wow - it’s been 17 hours since I last heard from VPSLink regarding my issue with their VPS. Joel Spolsky recently experienced something similar. If I had been going for six nines, my six sigma would be blown for probably the next 100 years =)
Luckily my server isn’t that important. But really? 17 hours? Wow. Even DreamHost will get back to you within 1-2 hours.
The thing I love about VPSLink is that they can partition a new VPS for you within hours.
Normally I have not had any problems with VPSLink, but I kept bumping into memory allocation issues with only 1GB RAM on my VPS, so yesterday I bit the bullet and upgraded to their most powerful option @ $129.95 per month.
That’s when things started to go downhill.
After submitting a ticket, I got this response:
Your VE should have been migrated to a Link-6 node, but for whatever reason, that never happened. I am doing so now. This should fix the problem.
But 11 hours later when I had a chance to take a look at the issue again, the server was still down.
When I restarted via the control panel, the CPU load was spiking for no apparent reason.
I run about 4-5 rails apps on there with 2-3 mongrels each. Collectively they drive less than 1k visitors / day.
So it’s a bit surprising when my CPU load spikes to 15, 17+!
I’m no expert on this subject, but I believe that means the server is busting through 1,500%+ of its available CPU cycles!
Pretty much the server is unusable when the CPU load is this high.
I’ll keep this post updated to let you guys know how VPSLink responds…
Update…
In an attempt to investigate further, I turned off all possible applications that were running on the box like: nginx, mongrels, mysqld, sendmail.
This is the list when I do a ps aux:
[root@videolockr init.d]# ps aux
USER       PID %CPU %MEM   VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  1936   672 ?        Ss   17:24   0:00 init [3]
root     24101  0.0  0.0  1584   564 ?        Ss   17:24   0:00 syslogd -m 0
dbus     24130  0.0  0.0  2916   704 ?        Ss   17:24   0:00 dbus-daemon --system
root     28168  0.0  0.0  4924  1112 ?        Ss   17:24   0:00 /usr/sbin/sshd
root     28220  0.0  0.0  2156   804 ?        Ss   17:24   0:00 xinetd -stayalive -pidfile /var/run/xinetd.pid
root     29869  0.0  0.1  7780  2344 ?        Ss   17:25   0:00 sshd: deploy [priv]
deploy   29925  0.0  0.0  7928  1712 ?        R    17:25   0:00 sshd: deploy@pts/0
deploy   29926  0.0  0.0  2436  1360 pts/0    Ss   17:25   0:00 -bash
root     30428  0.0  0.0  2656  1080 pts/0    S    17:25   0:00 su
root     30552  0.0  0.0  2308  1364 pts/0    S    17:26   0:00 bash
root     30701  0.0  0.1  7780  2344 ?        Ss   17:26   0:00 sshd: deploy [priv]
deploy   31785  0.0  0.0  7780  1700 ?        S    17:26   0:00 sshd: deploy@pts/1
deploy   31786  0.0  0.0  2440  1384 pts/1    Ss   17:26   0:00 -bash
root     32059  0.0  0.0  3152  1112 ?        Ss   17:27   0:00 crond
xfs      32288  0.0  0.0  3088  1152 ?        Ss   17:27   0:00 xfs -droppriv -daemon
deploy    3810  0.0  0.0  2060   992 pts/1    S+   17:47   0:00 top
root      5935  0.0  0.0  2032   820 pts/0    R+   17:49   0:00 ps aux
More on the top command and load averages can be found here, which says:
The current time, how long the system has been running, how many users are currently logged on, and the system load averages for the past 1, 5, and 15 minutes.
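Those three load averages are easier to judge once they are divided by the number of CPU cores. Here is a minimal sketch in Python, assuming a Unix-like box where os.getloadavg() is available. The helper names load_per_core and describe_load are made up for illustration, and the core count a VPS reports may not match the CPU share it is actually allotted, so treat the result as a rough guide only.

```python
import os


def load_per_core(load_avg, cores):
    """Return a load average divided by the number of CPU cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_avg / cores


def describe_load(cores=None):
    """Read the 1, 5 and 15 minute load averages and normalize each per core.

    os.getloadavg() only exists on Unix-like systems.
    """
    cores = cores or os.cpu_count() or 1
    one, five, fifteen = os.getloadavg()
    return {
        "1min": load_per_core(one, cores),
        "5min": load_per_core(five, cores),
        "15min": load_per_core(fifteen, cores),
    }
```

A per-core value that stays well above 1 suggests more work is queued than the cores can run at once. It is a count of waiting work, not a percentage of CPU cycles.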
So, after stopping all of these processes, eventually the server load went down a bit as reported by “top”. You can see the results below (this is after all of those processes have been stopped)
Shanti A. Braford blogs here.




Dude, funny you should post about this today…I’m just in the process of wrangling a new VPS myself. I’ve had a dedicated box for the last couple of years but can’t justify the cost anymore, so I’m migrating to a VPS and finding the whole thing pretty painful.
Load of 15 just means 15 processes are waiting for the CPU. This could be a symptom of a completely healthy machine. Our webservers hover around 15 during a normal operating day (down from 25-30 before we got APC up and running). That’s not to say things aren’t going wonky on your server, but Load isn’t going to be the best metric for identifying this.
-David
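To see how many processes are actually in the run queue at a given moment, Linux exposes the numbers in /proc/loadavg. One caveat: on Linux the load average also counts tasks stuck in uninterruptible (usually disk) wait, so a high load with an idle CPU can point at I/O rather than CPU. Below is a rough Python sketch for parsing that file. The names LoadAvg, parse_loadavg and read_loadavg are invented for this example, and it assumes a Linux box.

```python
from typing import NamedTuple


class LoadAvg(NamedTuple):
    one: float       # 1 minute load average
    five: float      # 5 minute load average
    fifteen: float   # 15 minute load average
    runnable: int    # scheduling entities currently runnable
    total: int       # scheduling entities that exist on the system
    last_pid: int    # PID of the most recently created process


def parse_loadavg(text):
    """Parse the single line found in /proc/loadavg."""
    fields = text.split()
    runnable, total = fields[3].split("/")
    return LoadAvg(
        float(fields[0]),
        float(fields[1]),
        float(fields[2]),
        int(runnable),
        int(total),
        int(fields[4]),
    )


def read_loadavg(path="/proc/loadavg"):
    """Read and parse the live values (Linux only)."""
    with open(path) as handle:
        return parse_loadavg(handle.read())
```

If the load average is high but the runnable count stays low, the waiting is probably not for CPU time.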
David - I’m actually really curious to learn more about this.
How do I tell then if the CPU really is overutilized?
In order to tell if the CPU is overutilized, you’ll need to have some control against which to measure it. Either:
1. do some benchmarking
2. monitor some simple benchmark over time
One thing you could do is monitor and trend the page-load time for pages on your site over time. This would allow you to spot periods of relatively low utilization but relatively poor performance.
The real problem with Load is that on servers like the Web servers I described, and on Mail servers, lots of processes run to handle individual requests, and this can drive the load up under semi-normal operations. One of the primary things you should see is relatively slow changes in load. Load shouldn’t change from 1 to 15 over 10 minutes unless your traffic has also spiked substantially (not too likely unless you discover some event, like a blog post or a new link, and that should show up in your analytics).
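A rough sketch of that kind of page-load monitoring in Python, using only the standard library. The function names time_fetch and rolling_mean are made up for this example, and the URL you pass in, the sampling interval and the window size are all up to you. In practice you would call time_fetch from cron every few minutes and log the result.

```python
import time
import urllib.request
from statistics import mean


def time_fetch(url, timeout=10.0):
    """Return the seconds taken to download one page, body included."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()
    return time.perf_counter() - start


def rolling_mean(samples, window):
    """Average each consecutive run of `window` samples to smooth the trend."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        mean(samples[i:i + window])
        for i in range(len(samples) - window + 1)
    ]
```

Plotting the smoothed values next to the load average makes it easier to see whether slow pages line up with high load or show up on their own.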
David,
I gotcha. So the only way to tell if something is up is by diff’ing the current load against historical averages?
In this particular case, the server was basically unusable, so I knew something was up. “top” load usually hovered around 1 historically but had jumped to 15.
Was basically looking for a way to “prove” that the server was acting foo bar.
i.e. that it wasn’t any of my scripts that started acting rogue.
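For what it’s worth, comparing the current load against a historical baseline can be sketched in a few lines of Python. This assumes you have been logging load samples somewhere already. The names load_ratio and is_spike, the factor of 3 and the 0.1 floor are arbitrary choices for illustration, not a standard.

```python
from statistics import mean


def load_ratio(history, current, floor=0.1):
    """Return how many times higher `current` is than the historical mean.

    `floor` keeps a near-idle baseline from producing a huge ratio.
    """
    if not history:
        raise ValueError("history must contain at least one sample")
    baseline = max(mean(history), floor)
    return current / baseline


def is_spike(history, current, factor=3.0):
    """Flag the current load when it exceeds `factor` times the baseline."""
    return load_ratio(history, current) > factor
```

With a history hovering around 1 and a current load of 15, the ratio comes out at 15, well past the threshold. That doesn’t prove whose fault it is, but it does give you a number to put in the support ticket.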