We've been noticing some 'blips' during which Maui fights bravely but ultimately fails to schedule jobs. This is generally considered rather sub-optimal.
The root of it was that Maui was failing with an error:
ERROR: cannot get node info: Premature end of message
That error results in Maui taking a break for 15 minutes before trying to schedule anything again. Which is fair enough, in the face of communication errors. Only ... Maui doesn't speak to anything except the Torque server, which is running on the same host.
So what's actually happening here is that Torque can't talk to some node or other and is reporting that to Maui, which then breaks. It didn't seem right that a single communication failure to a single node should stop jobs from starting elsewhere, which prompted some deeper investigation.
Looking for obvious correlations, we noticed that the scheduling blips happened right when we were running lots of analysis jobs - exactly when we don't want scheduler blips! It wasn't a clean correlation, though: sometimes running 1000 jobs at once was fine, while other times 400 was enough to gum things up.
More worrisome than sub-optimal scheduling was that, during the same time period, we got occasional errors from the CEs, of the form:
BLAH error: submission command failed (exit code = 1)
(stderr:pbs_iff: cannot read reply from
cannot connect to server svr016.gla.scotgrid.ac.uk
(errno=15007) Unauthorized Request
Dissecting that, the BLAH part is CREAM saying it can't submit the job, so we're looking at the pbs_iff part. The purpose of pbs_iff is to authenticate the current user to the Torque server, so that the job runs with the correct user id (and can be checked against the ACLs on the server, if appropriate). The next part is just qsub reporting that it's not able to talk to the server.
The root problem is that pbs_iff is unable to communicate with the server, after which the rest of the qsub fails for lack of authentication. This is a problem, because these are jobs that have already been accepted by the CREAM CE, and shouldn't be failed here. (If a site can't cope with the jobs, the CE should be disabled so that it never accepts them - that's the signal to the submitter/WMS to try elsewhere.)
How does all this link back to the network issues? Well, our cluster is split into two rooms - linked by a couple of fibres.
During analysis, we can see 2 GB per second (yes, that's in bytes) of traffic leaving the disk servers. Roughly half the disk and about half of the CPUs [see later!] are in each room, which implies that, given a random distribution of jobs and data, half of that traffic has to pass through the fibre link.
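To put numbers on that, a quick back-of-the-envelope in Python (the 2 GB/s is the measured figure above; the 50/50 split is the assumption that jobs and data are placed randomly):

# Back-of-the-envelope for the observed analysis traffic.
disk_traffic_GBps = 2.0    # measured: bytes/s leaving the disk servers
cross_room_fraction = 0.5  # random placement: half the reads are remote

cross_link_Gbps = disk_traffic_GBps * cross_room_fraction * 8  # bytes -> bits
link_capacity_Gbps = 10.0

print(f"Analysis traffic over the fibres: {cross_link_Gbps:.0f} Gb/s "
      f"of a {link_capacity_Gbps:.0f} Gb/s link")
# -> 8 Gb/s of a 10 Gb/s link, before Torque, the CEs, or anything
#    else tries to get a word in across the same fibres.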
And, yep, that's the problem right there. The Torque server is unable to shout loud enough to reach the nodes when the link is full, or to be heard by some of the CEs. Digging into the stats shows that the link has been running at 83% average utilisation over the past month. So when analysis hits, it wipes out any other traffic.
For the moment, then, as mitigation I've put a cap on the number of concurrently running analysis jobs until we can resolve this properly. And sent Mark off to find some more fibre and ports on the switches!
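For reference, in Maui that sort of throttle is a one-line class limit in maui.cfg. A minimal sketch, assuming the analysis jobs arrive on a class/queue called analysis - the class name and the limit of 400 here are illustrative placeholders, not our exact settings:

# maui.cfg: cap the number of simultaneously running jobs in the
# 'analysis' class (name and value are placeholders)
CLASSCFG[analysis] MAXJOB=400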
Some interesting sums: it turns out we have nearer 1/3 of the CPU upstairs and 2/3 (1200 job slots) downstairs, with disk close to 1/2 in each room. Matching this up with the planning figure of 5 MB per second of 'disk spindle to analysis CPU' bandwidth, the dominant direction is the 1200 downstairs slots pulling half their data from upstairs disk: that needs 3 GB per second, or 24 Gb/s, between the rooms to run at full capacity. Compared to 10 Gb/s at the moment.
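Spelling those sums out (the 5 MB/s per slot is the planning figure above; the assumption is that, with disk split evenly, half of each room's reads come from the other room):

# Full-capacity bandwidth estimate between the two rooms.
downstairs_slots = 1200      # ~2/3 of the CPU
per_slot_MBps = 5.0          # planning figure: disk spindle -> analysis CPU
remote_disk_fraction = 0.5   # disk split ~1/2 per room

# Dominant direction: upstairs disk feeding downstairs CPUs.
needed_GBps = downstairs_slots * per_slot_MBps * remote_disk_fraction / 1000.0
needed_Gbps = needed_GBps * 8

print(f"Need {needed_GBps:.0f} GB/s = {needed_Gbps:.0f} Gb/s between rooms; "
      f"the link is 10 Gb/s")
# -> Need 3 GB/s = 24 Gb/s between rooms; the link is 10 Gb/s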
Hrm. No wonder we were having difficulty! On the other hand, it's probably this link that's been the limiting factor in our analysis throughput, so we should be able to roughly double our peak analysis throughput once it's upgraded.
That, and not have the scheduler taking a wee nap during peak times.