Kaboom

Big boom at work this morning. And I mean BIG.

Got a lovely wakeup call today. I wish it had been Mickey Mouse waking me up for a day at the DWorld; instead it was the bosslady telling me that something was wrong at work and that I needed to get there as soon as possible. This transpired at about 12:45AM. I was a little groggy to say the least.

At work, I found a whole bunch of water on the floor, and a few other people there working on a plan to get it cleaned up. I jumped into action to help, not yet knowing what had happened.

After 45 minutes of cleanup efforts, including multiple shopvacs and lots of squeegees, I investigated. Turns out one of our engines running on test blew up. And when I say blew up, I mean BLEW up!

This engine is running a very abusive test. See, normally your vehicle engine has some coolant inside of it to keep things from getting too toasty. As it runs through the engine, this coolant itself gets hot, and is cooled off via the radiator. A nice symbiotic relationship.

For this test, we run the engine as hard as it can run and basically stop cooling the coolant, letting everything get superhot. 265 degrees F, as a matter of fact. The astute of you will notice that 265 F is 53 degrees above boiling - the system is pressurized to allow such a thing. So our engine is now one gigantic pressure cooker.

Normally after running the engine this way for a minute or so, we then flush it with super cold coolant. This in itself is very very abusive, which is what we are trying to do - accelerate wear and tear.

However, last night, about 6 seconds after the engine went into SUPERHEAT mode - the computer program controlling the whole thing locked up. This mysteriously happened at “11:59:59”. Without the computer to monitor the engine, it was left to its own protection schemes to take care of things. Only, in this case, all of those protection schemes were turned off, because we have to turn them off to be this abusive to the engine.

So the engine continued to run at 265 degrees for what I estimate to be a little over 5 minutes.

The engine didn’t like running at that condition. So, it blew up. Most likely, one of the pistons seized, which cascaded into a big kaboom. An oil line blew off, started a big fire, and set off the sprinkler system in the room. About 0.01 seconds after the sprinkler went off, the fire stopped (it’s a BADASS sprinkler). The engine is toast.

However, no notification of any of this occurred to the person on call, because the computer system didn’t know any of it was going on. It was merrily locked up waiting for the day to change. In fact, the only reason anyone found out about it was because the water buildup in the test cell got to be so big that it caused one of the doors to open and that action set off a motion detector for the alarm system.

Of course this sucks because it’s an obscure bug which I have not been able to recreate. This particular version of the software has been running for a substantial period of time; the same version was running in the test cell next door when the “event” happened, and it continued running perfectly fine while its neighbor was dying.

For those CS dork types who want the specifics: the application running is a multithreaded app, with the main thread being a GUI thread. Somehow, one of the subthreads locked a mutex and never gave it back, so the GUI thread never came out of a wait condition. So, as a first line of defense, I’ve implemented a handler for SIGALRM and set a 6-second alarm inside the application - that way, if a thread ever locks up again, in 6 seconds it runs a local script which can energize the proper outputs to shut the system down and notify someone that it had to do so. Oddly enough, I had started this implementation the week before Christmas, but never finished it because I was put on other projects.
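For anyone who wants to see the shape of that alarm idea, here is a minimal sketch in Python. It is not the actual test cell code, which is not shown in this post. The class name `AlarmWatchdog`, the `kick()` method, and the `SHUTDOWN_CMD` stand-in for the local shutdown script are all made up for illustration, and the sketch assumes a POSIX system where `SIGALRM` exists.

```python
import signal
import subprocess
import sys

WATCHDOG_SECONDS = 6  # the alarm interval mentioned in the post

# Hypothetical stand-in for the local shutdown script; the real script is not shown.
SHUTDOWN_CMD = [sys.executable, "-c", "print('shutting down and notifying on-call')"]


def run_shutdown_script():
    """Launch the (hypothetical) script that shuts the cell down and notifies someone."""
    subprocess.Popen(SHUTDOWN_CMD)


class AlarmWatchdog:
    """Calls on_expire if kick() is not called again within timeout_s seconds."""

    def __init__(self, timeout_s=WATCHDOG_SECONDS, on_expire=run_shutdown_script):
        self.timeout_s = timeout_s
        self.on_expire = on_expire
        self.fired = False

    def start(self):
        # Python only allows signal handlers to be installed from the main thread.
        signal.signal(signal.SIGALRM, self._handle_alarm)
        self.kick()

    def kick(self):
        # Re-arming replaces any pending timer, so a healthy loop never lets it expire.
        signal.setitimer(signal.ITIMER_REAL, self.timeout_s)

    def stop(self):
        # A zero value cancels the pending timer.
        signal.setitimer(signal.ITIMER_REAL, 0)

    def _handle_alarm(self, signum, frame):
        self.fired = True
        self.on_expire()
```

The intended use would be to call `start()` once and then `kick()` on every pass of the control loop; if the loop stalls for longer than the timeout, the handler runs. One caveat of doing this in Python is that handlers only run in the main thread, so this sketch is closer to an illustration of the pattern than a drop-in for a multithreaded native app.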

What did we learn from this?

  1. A redundant watchdog system is probably a good idea to have. Put that as priority one on my checklist. I never implemented one because: I would have to maintain it (adding to a list of maintainables that’s too long already), we never had a problem like this before, and since it would add some extra level of complexity to the system, there would be complaints any time there were false positives.
  2. Taking your only engine test cell programmer (that would be me) and putting him on other projects so he can’t keep up with making the proper changes to the test cell code is a bad idea.
  3. It would be good to have some sensors which track if water is building up on the floor. This is the Nth time (where N > 20) where we’ve had a large amount of water end up on the floor. I ordered 5 sensors today, at 8 bucks apiece. I’d say that’s a good investment.
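As a rough illustration of the redundant watchdog in item 1, here is one common way to do it: the monitored program touches a heartbeat file on every loop pass, and a separate process checks how old that file is. This is only a sketch under assumptions. The file-based heartbeat, the function names, and the 6-second threshold used in the example are hypothetical and not taken from the actual test cell setup.

```python
import os
import time


def touch_heartbeat(path):
    """Called by the monitored application on every control-loop pass."""
    with open(path, "a"):
        os.utime(path, None)  # set the modification time to now


def heartbeat_age(path, now=None):
    """Seconds since the heartbeat file was last touched; infinity if it is missing."""
    try:
        mtime = os.path.getmtime(path)
    except OSError:
        return float("inf")
    if now is None:
        now = time.time()
    return now - mtime


def is_stalled(path, timeout_s, now=None):
    """True if the application has not checked in within timeout_s seconds."""
    return heartbeat_age(path, now) > timeout_s
```

A separate watchdog process could poll `is_stalled(path, 6)` once a second and trigger the shutdown and notification when it returns true. Because the check lives outside the monitored program, it still works when that program is completely locked up.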

The good thing that came out of this is that I’m back working on the stuff I should be working on. Also, since the bossman has taken over for me on what I was working on, he’s made a bunch of statements like “I can’t believe you’re doing xxx this way, it’s so difficult. We should add yyy to the system to make it much easier”. To which I smirk, noting that I requested that he add feature yyy to the system a long time ago, but he never did because he wasn’t the one going through the rigamarole. Score.

Side note: I have video of the engine blowing up. I will post it here as soon as I can get it off of the computer it’s currently on.

One Response to “Kaboom”

  1. m3 Says:

    when’s that video coming?
    i want to see