Technical ramblings: January 2006

Wednesday, January 18, 2006

If only good people wrote code.

It happens here that the really good developers find themselves stuck fixing bugs that were created by poor programmers. You'd think the ideal would be the other way around: good programmers writing great code, and the newbies spending their time diagnosing issues in that code on the rare occasion they surface.

However, scheduling pressures cause the good programmers (who have better diagnostic skills than the poor programmers) to be yanked off development tasks to diagnose emergency problems out in the field, while the underutilized poor programmers, being heads in management's head count, are put on the task of writing next-generation software.

Thus the cycle of crappy software and discouraged senior programmers propagates...

1:46 PM

Tuesday, January 17, 2006

Engineering Big Systems

So far I can think of several components of the project I'm working on, where I've complained about how the code was over-engineered and fragile--where the original author of the bad code got promoted and moved on, and I got stuck with the crap...

It's really rather sad.

So here are some quick notes on how to build a large system.

First, a big system is not a bigger "small" system. Instead, a big system is simply a lot of small systems strung together.

Because a big system is simply a bunch of small systems strung together, it's important that each small system be engineered to be as simple to understand as possible. This means each small system should be easy to follow and deterministic.

For example, today's problem I'm dealing with is failover: if one box goes down, the software I'm working on is supposed to connect to a second box to work. Now failover is conceptually easy: an array of boxes to connect to, and a small engine which attempts to connect to each until it makes a successful connection. You also need a background thread that determines if a connection can be re-established to the primary, and some way to force the currently open connections to fail, so they re-run the connection loop of trying each machine until you get a successful connection. So ideally you're talking about (1) an array of machines, (2) a connection factory that walks the array to find a machine that's up, (3) a wrapper around the connection object that permits you to forcefully break an established connection (for failback), (4) some way to read the list of machines that are available to connect to, and (5) some means to detect whether a connection exception thrown by your connection wrapper is a failover event ("transaction failed: machine is down") or a non-failover event ("transaction failed: illegal transaction").

Conceptually this is five moving parts: five simple components strung together into a small system. It's deterministic: part (2) walks array (1) in order, so when something fails, you can watch it go "connect to A? No. Connect to B? No. Connect to C? Yes" repeatedly.
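To make the point concrete, here is a rough sketch of what those parts might look like in Java. This is not our actual code. Every name in it (Connection, Connector, BreakableConnection, FailoverConnectionFactory, FailoverPolicy) is made up for illustration, and it assumes that "machine is down" shows up as an IOException while anything else is a non-failover error. The host list (part 4) is simply passed in, and the background failback thread is left out; it would only need to call forceBreak() once the primary answers again.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical connection type; stands in for whatever the real system uses. */
interface Connection {
    String send(String request) throws IOException;
}

/** Hypothetical hook that opens a connection to one host, or throws if it is down. */
interface Connector {
    Connection open(String host) throws IOException;
}

/** Part (3): wrapper that lets someone else forcibly break the connection for failback. */
final class BreakableConnection implements Connection {
    private final Connection delegate;
    private volatile boolean broken;

    BreakableConnection(Connection delegate) {
        this.delegate = delegate;
    }

    void forceBreak() {
        broken = true;
    }

    @Override
    public String send(String request) throws IOException {
        if (broken) {
            throw new IOException("connection forcibly broken for failback");
        }
        return delegate.send(request);
    }
}

/** Parts (1) and (2): an ordered list of hosts and a factory that walks it. */
final class FailoverConnectionFactory {
    private final List<String> hosts;
    private final Connector connector;

    FailoverConnectionFactory(List<String> hosts, Connector connector) {
        this.hosts = new ArrayList<>(hosts); // defensive copy; order is the failover order
        this.connector = connector;
    }

    /** Tries each host in order and returns the first connection that opens. */
    BreakableConnection connect() throws IOException {
        IOException last = null;
        for (String host : hosts) {
            try {
                return new BreakableConnection(connector.open(host));
            } catch (IOException e) {
                last = e; // this host is down; move on to the next one
            }
        }
        throw new IOException("no host reachable", last);
    }
}

/** Part (5): assumes IOException means "machine is down"; anything else is not failover. */
final class FailoverPolicy {
    static boolean isFailoverEvent(Exception e) {
        return e instanceof IOException;
    }
}
```

A caller that catches an exception from send() would ask FailoverPolicy.isFailoverEvent() and, if it answers true, simply call connect() again. That is the whole loop, and you can step through it in a debugger in one sitting.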

So, is our failover code that simple? Oh, good Lord no; instead of being about 5 or 6 Java classes to implement, our current failover implementation runs almost 50 classes, consisting of at least a half-dozen different design models which make no sense in the context of the software. It's not deterministic: multiple threads run asynchronously doing housekeeping in odd ways which make no sense. And worse, rather than providing a simple wrapper to our underlying connection object, the connection object itself was engineered with another dozen or so classes which implement parts of the failover logic--leftover logic that came from several redesigns that were never cleaned up.

Rather than an array, a loop and a few conditional statements, several thousand lines of code sit in its place--several thousand lines of code which turn out to be extremely fragile.

Second, as a big system is a bunch of small systems strung together, each small building block needs to be defensively programmed, so other building blocks which are not as well coded don't bring the system down.

In the system I'm working on, once failover fails, we get a cascading series of failures which result in the system dropping into an inoperative state somewhat randomly. The system is supposed to fail gracefully into a less-functional (but still running) state. Instead, the system degrades until three days later, the OS runs out of resources and the system collapses. Now if failover had been simply engineered, a leak in resources would be easy to find: there is just one array, one loop, some logic to debug. But no; because the thing is an over-engineered piece of crap, it's impossible to figure out what failed.

Worse, because other elements of the system are not defensively built, other components of the system (event logging, alerting, etc.) all start failing at different rates: they either expect a successful connection, or an explicit failure.
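Here is a small sketch of what "defensive" could mean for something like event logging. Again, this is illustrative only: EventSink and DefensiveEventLogger are invented names, not classes from our system. The idea is that the logger treats any exception from the remote side as a failure (not just the one kind it was expecting), never lets that failure escape to its caller, and keeps only a bounded number of events locally so it cannot slowly eat the box. It does not deal with a remote call that hangs forever; that would need a timeout, which is left out here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sink for events; a remote implementation may throw anything. */
interface EventSink {
    void write(String event) throws Exception;
}

/** Logger that never lets a sink failure reach the caller and never grows without bound. */
final class DefensiveEventLogger {
    private final EventSink remote;
    private final int maxBuffered;
    private final Deque<String> buffer = new ArrayDeque<>();

    DefensiveEventLogger(EventSink remote, int maxBuffered) {
        if (maxBuffered < 1) {
            throw new IllegalArgumentException("maxBuffered must be at least 1");
        }
        this.remote = remote;
        this.maxBuffered = maxBuffered;
    }

    /** Returns true if the event was delivered, false if it was held locally instead. */
    synchronized boolean log(String event) {
        try {
            remote.write(event);
            return true;
        } catch (Exception e) {
            // Any failure counts, not only the "expected" kind.
            if (buffer.size() == maxBuffered) {
                buffer.removeFirst(); // drop the oldest event rather than leak memory
            }
            buffer.addLast(event);
            return false;
        }
    }

    /** Number of events currently held locally. */
    synchronized int buffered() {
        return buffer.size();
    }
}
```

The design choice worth noticing is the cap on the buffer: losing the oldest log lines is a less-functional but still running state, which is exactly what the system is supposed to fall back to.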

Third, and this is where the art of development comes into play and replaces the "engineering" aspects of development, the small components need to be designed around the functional components necessary to get the big tasks done. In our system, for example, our core architecture sits on top of Tomcat. Each component that communicates with the outside world runs in its own servlet, but on top of a common servlet base which provides common services beyond those provided by Tomcat. (Things like system metrics, for example.) On the blackboard, one needs to be able to draw the software, draw the blocks that go into the software, then break out each block and break those into smaller blocks. It's the "fractal" way of designing software, and because so much of that is art, often you wind up having to do several designs before you conceptualize the software correctly.
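As a rough illustration of the "common base" idea, here is a stripped-down sketch. It deliberately does not use the servlet API so that it stands alone; in a real Tomcat setup the base would extend HttpServlet instead. ComponentBase and EchoComponent are hypothetical names, and the request counters are just one guess at what "system metrics" might include.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a common servlet base that adds shared services. */
abstract class ComponentBase {
    private final AtomicLong requests = new AtomicLong();
    private final AtomicLong failures = new AtomicLong();

    /** Single entry point: records metrics, then hands off to the component. */
    final String handle(String request) {
        requests.incrementAndGet();
        try {
            return process(request);
        } catch (RuntimeException e) {
            failures.incrementAndGet();
            throw e; // metrics are recorded, but the error is not hidden
        }
    }

    /** The one block each component has to fill in. */
    protected abstract String process(String request);

    long requestCount() {
        return requests.get();
    }

    long failureCount() {
        return failures.get();
    }
}

/** Example component: one box on the block diagram, one class in the code. */
final class EchoComponent extends ComponentBase {
    @Override
    protected String process(String request) {
        if (request.isEmpty()) {
            throw new IllegalArgumentException("empty request");
        }
        return "echo:" + request;
    }
}
```

Each component only writes its own process() method, and the shared bookkeeping lives in exactly one place, so the code keeps the same shape as the drawing on the blackboard.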

(That's why sometimes it's best to design a system by writing it once, throwing the result away, then writing it again. The first time was practice to gain experience with the components of the system you will eventually need.)

Oddly enough our software is easy to draw into block diagrams. But for some reason, our software is not written into modules that follow the block diagrams. Which means the entire thing is a tangled mess.

It's so simple to engineer a top-notch system: keep the components simple. I don't understand why people insist on creating crap instead--though oddly enough, many of my co-workers, even when they agree with the principle of keeping it simple, go off and engage in bad programming practices anyway.

1:07 PM
