Technical ramblings
Thursday, August 25, 2005
  Observation
A large company is a small company with extra people and bureaucracy to protect the small company core.
 
Monday, August 22, 2005
  Miscellaneous Design Gripe
You can tell your software design is fucked up when you have an endless series of interface objects in Java or C++, each backed by only a single implementation.

Interfaces exist to support (amongst other things) an abstraction process where the same interface may have multiple implementations; this permits the implementation to be swapped out. For this to work, interfaces *should* be kept as small as possible, so a new implementation of the interface has relatively little work to perform.

However, if you have only one back-end implementation of an interface, why bother? It's needless complexity for the sake of complexity--and frankly, if you later decide you need an interface, it's easier to add one (or to add new complexity) to simple code than to refactor code where the interfaces don't sit quite at the level of abstraction where you needed them.
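
A hypothetical illustration of the smell (the names here are mine, not from any real codebase):

    // The smell: an interface with exactly one implementation, now and forever.
    interface WidgetStore {
        String find(String key);
    }

    class WidgetStoreImpl implements WidgetStore {
        private final java.util.Map map = new java.util.HashMap();
        public String find(String key) { return (String) map.get(key); }
    }

    // The simpler alternative: just write the class. If a second
    // implementation ever shows up, extracting an interface from a
    // concrete class is a quick, mechanical refactoring.
    class SimpleWidgetStore {
        private final java.util.Map map = new java.util.HashMap();
        public String find(String key) { return (String) map.get(key); }
    }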
 
Tuesday, August 09, 2005
  Synchronization
So the product I'm working on has about a million synchronization issues, where either two threads may occasionally screw up (deadlock, get wrong results, run inefficiently), or two processes may occasionally screw up. Now I understand that synchronization issues are sometimes the hardest to design for and sometimes the hardest to develop for, and with increasing numbers of software projects using multi-threaded/multi-process solutions, most programmers out there are going to run into synchronization issues.

What irks me is that most of our synchronization issues could have been avoided by sticking to a handful of useful design patterns. It irks me because it's not all that difficult to realize "oh, I've got a case of design pattern C", then use it--rather than come up with all sorts of ad-hoc things which sort of (but not always) work.

Before we get to the design patterns, some preliminaries.

Why synchronize?

The easy answer is that it's the only way to guarantee that a resource or bit of information is in a "stable state" rather than an "illegal state" before it is read or used. So, for example, suppose you have a simple file which contains some settings. One thread attempts to read those settings while another attempts to write them. If the code that reads and writes the file isn't synchronized--that is, if the code isn't set up so that the writer is allowed to finish before the reader is allowed to start--then we could find ourselves in the odd position of only being able to read half our settings in, or getting a "file not found" error while the file is in fact being written.

This sort of behavior even extends to what seems to be an "atomic" operation, such as writing an integer. Suppose we have a two-byte integer, and it has two legal values: -1 and 0. If we do not have proper synchronization, it is possible for one thread to be interrupted when it has written only half of the integer, just as the next thread reads it. Suddenly, instead of the integer having only two legal values, -1 and 0, we may find ourselves reading 255 (0x00FF) or -256 (0xFF00)--both illegal values.

The Danger of Improper Synchronization

Improper synchronization, of course, is just as dangerous as no synchronization at all. The biggest danger comes from deadlocks. Suppose we have two threads, thread 1 and thread 2, and two synchronization locks--locks which prohibit us from doing something until we hold the lock--call them A and B.

Thread 1 hits some code: it locks A, then locks B, does something, then releases the locks.
Thread 2 hits some code: it locks B, then locks A, does something, then releases the locks.

Now suppose thread 1 runs: it locks A, and is interrupted. Thread 2 runs, locks B, then tries to lock A--but it cannot, because thread 1 holds A. Thread 2 blocks; control bounces back to thread 1, which attempts to lock B--but cannot, because thread 2 holds it.

Deadlock. Neither thread can proceed.

There are a huge number of algorithms in database design which detect such an event and roll back the executing threads--but they are mostly designed for database transactions, where we (a) can reliably throw threads 1 and 2 back to the "pre-lock A" state, and (b) can afford the additional expense of deadlock detection.

In our case, we don't have that luxury. Fortunately, there are a few simple solutions--mostly a matter of picking the right design model--which allow us to avoid deadlock altogether.

(Quick hint: the easiest way to avoid deadlock is to have a strict ordering on our locks in our system, and only allow locking in that order. Thus, we could say that "A locks before B"--then we would never write the code in thread 2 because it violates our strict ordering on locks.)
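
As a minimal sketch of that discipline in Java (the lock names and the work done under them are hypothetical):

    class LockOrdering {
        // One global rule: lockA is always acquired before lockB.
        static final Object lockA = new Object();
        static final Object lockB = new Object();

        // Thread 1's code respects the ordering:
        static void threadOneWork() {
            synchronized (lockA) {
                synchronized (lockB) {
                    // ... do something with both resources ...
                }
            }
        }

        // Under the ordering rule, thread 2's code must also be written
        // A-then-B; locking B-then-A (as in the deadlock example above)
        // is simply forbidden.
        static void threadTwoWork() {
            synchronized (lockA) {
                synchronized (lockB) {
                    // ... do something with both resources ...
                }
            }
        }
    }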


The second problem with synchronization is that relying on it too heavily can make our code inefficient--that is, it can effectively make our code single-threaded again. This can happen when we attempt to avoid deadlock by using just one synchronization object, or when we lock code ineffectively. For example, in our file read/write routines above, if we had placed the synchronization block a couple of levels higher in the code--say, at the user interface point where we are updating the values in our (thread-local) in-memory copy--then we would unnecessarily block other threads that don't need to be blocked. In the worst case, performance degrades to the same level it would be if we just used one thread in our code.

Careful writes and careful design

A topic related to synchronization is the notion of careful writes (and what I would call "careful design"), where you are careful about how information is written and how your code is developed.

Careful writes come from the field of database design. In short, it's the practice of making sure that data is written in a particular order, so that the last "write" operation leaves all the other write operations either completely valid or completely invalid. This means your code can fail halfway through the writes and, when it comes back up in recovery mode, successfully roll back to the state things were in prior to the write operation.

One example of this "careful write" method is to rename your existing file prior to writing the new contents. If your file contains a check-sum, then it is either completely and correctly written (because the check-sum passes), or it is not--and your code can recover by reading the back-up file instead. Your careful write is structured in such a way that you cannot lose data--so long as the integrity of the hard disk is guaranteed, of course. (Higher level design patterns allow us to even deal with hard disk failure, but that's beyond the scope of most software development most of us do for desktop boxes.)
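
Here's a minimal sketch of that pattern in Java; the CRC32 checksum and the file layout are my own invention for illustration, not a standard format:

    import java.io.*;
    import java.util.zip.CRC32;

    class CarefulWriter {
        static void writeSettings(File file, byte[] data) throws IOException {
            // Keep the old copy around until the new one is proven good.
            File backup = new File(file.getPath() + ".bak");
            if (file.exists()) {
                backup.delete();
                file.renameTo(backup);
            }

            CRC32 crc = new CRC32();
            crc.update(data);

            DataOutputStream out = new DataOutputStream(new FileOutputStream(file));
            try {
                out.writeInt(data.length);
                out.write(data);
                out.writeLong(crc.getValue());  // the checksum is the last thing written
            } finally {
                out.close();
            }
            // On read: if the checksum fails, fall back to the ".bak" file.
        }
    }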


Related to this is what I would call "careful design" or defensive programming. This is where we consider every possible input suspect and design mechanisms by which the software can recover gracefully in the event something unexpected occurs. For example, if you have a client program and a server program, the client program may be designed so that it gracefully deals with unknown or illegal returns from the server program--perhaps by logging the error, then retrying the connection after a few seconds or minutes.

Sometimes we need to use a combination of both, such as a careful transaction policy where a locally cached copy of some data being sent to the server is preserved until the server acknowledges the transaction. If a unique identifier (such as an incrementing integer or a timestamp) is used to identify each of the transactions being sent by the client to the server, it even allows us to recover if the server processed the transaction but failed to acknowledge it to the client: the transaction, when resent, could be ignored by the server but acknowledged to the client.
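
A sketch of the server's side of that policy (the names are illustrative): remember the last transaction id seen from each client, and acknowledge--but do not re-process--a resent transaction.

    import java.util.HashMap;
    import java.util.Map;

    class TransactionServer {
        private final Map lastSeen = new HashMap();  // client id -> last txn id

        // Returns true to acknowledge the transaction to the client.
        synchronized boolean process(String clientId, long txnId, Runnable work) {
            Long prev = (Long) lastSeen.get(clientId);
            if (prev != null && txnId <= prev.longValue()) {
                return true;    // duplicate: acknowledge, but skip the work
            }
            work.run();         // apply the transaction
            lastSeen.put(clientId, new Long(txnId));
            return true;
        }
    }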

Some Design Models

Unfortunately all of this complexity can be hard for most people to manage. And so they grope around in the dark, adding a lock here and a lock there, hoping that their code performs well and doesn't deadlock. The big problem with deadlock bugs, of course, is that they never happen when you are looking for them, but only when you have running code in the field and are impressing a very important client.

Fortunately, there are a few design models which can be employed to solve most multi-threading issues, and which allow us to get the synchronization correct.

1. Work Queues

Here's your basic problem. Suppose you have a common resource, such as a TCP/IP connection or an IPC pipe, and you have multiple threads which need to be able to write to that connection. (This assumes your protocol is designed so that, for example, each thread writes a single blob of data, and your protocol slams an identifier and a length on it and sends it over the wire.) Or you have code which is writing to the hard disk--whatever. The point is that this common resource is not "thread-safe"--you cannot do two writes at the same time.

The simple solution, of course, would be to put a synchronization block around the write operation. That way, only one thread can write to the socket or the disk at a time. However, this poses a problem: your write winds up inside a huge synchronization block. (Disk writes are notoriously slow, which is why disk caching is so popular.)

One solution is the Work Queue.

A work queue consists of two parts. The first part is the queue itself, and the second part is a background thread which "owns" the resource, and which is the only component allowed access to that resource. The idea is simple: when another thread wants to perform a write, it places the object it's writing on the queue. The queue is simply a list of requests that are pending, and they are pulled off by the background thread which actually performs the operation itself. The thread making the request can then either examine the state of the work request object, or just "insert and forget."

The queue becomes the trickiest part, as it is the thing that actually owns the synchronization. However, because all the work queue does is handle objects being inserted and removed, it's actually quite easy to write--and get correct. The queue consists of two main methods. The "insert" method locks the queue, inserts the work unit into the linked list, then releases the lock. The "remove" method locks the queue, tests to see if there is an item in the queue, and if not, waits. As soon as an object is in the queue, the remove method returns the work object.

Your background thread can then pull items off the queue one at a time, in strict order from first to last, and process them. Because only one thread owns your resource (the disk, the TCP/IP port, etc.), access to that resource is intrinsically single-threaded.

Starting up this model is simple: initialize the queue, then start your background thread. Any foreground thread that wishes to write simply inserts into the queue. If your background thread wants to provide status information, it can simply update the status on the work unit passed to it. But in practice, it's been my experience that most writer threads can be easily written as "insert and forget"--so you don't need a more complicated mechanism for dealing with memory when two objects may hold a reference to the same work item in the queue.
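
Here's a minimal sketch of such a queue in Java, using wait/notify (the class and method names are mine):

    import java.util.LinkedList;

    class WorkQueue {
        private final LinkedList queue = new LinkedList();

        // Called by any foreground thread: lock, insert, notify, unlock.
        public synchronized void insert(Object workUnit) {
            queue.addLast(workUnit);
            notify();   // wake the background thread if it is waiting
        }

        // Called only by the background thread that owns the resource:
        // wait until something is in the queue, then pull it off.
        public synchronized Object remove() throws InterruptedException {
            while (queue.isEmpty()) {
                wait();
            }
            return queue.removeFirst();
        }
    }

The background thread's main loop is then just: call remove(), perform the operation against the owned resource, repeat.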


The work queue model is an extremely powerful and flexible model, and can be extended with relatively little cost. For example, a disk writer queue could know a little bit about the arrangement of the sectors on a disk, and know how to sort the write requests as they are inserted to minimize disk head traversal.

Further, you can even use a single work queue to "fan out" information to multiple threads which are waiting for work units specifically earmarked for them. For example, suppose we have a TCP/IP listener listening for requests that carry a "channel" as part of their header. We could add intelligence to our queue to allow listening threads to register that they are listening for a particular channel--then the 'remove' method could scan the queue for work objects specific to the listening thread. Requests that are received but do not have a corresponding channel could cause the work thread to start up a new listener thread, or could be dropped on the floor, or sent back with an error response--using the writer model above, of course.

Work Queues can even be used when the underlying resource is not single-threaded, but your writer threads need really good performance. As the underlying 'insert' method simply inserts a queued object into a linked list (a very fast operation), the number of CPU instructions behind a synchronization lock is very small--which means multiple threads can operate at nearly full speed. So long as your thread can do an "insert for later processing and forget" operation--such as a database request that can be handled by one of five or ten different database processing threads--your worker threads gain quite a bit of performance over, say, each explicitly opening its own database connection.

You can also set "queue throttling" by setting the maximum number of entries you will allow in your queue before the "insert" call blocks. This allows you to slow down your worker threads in the event that the background thread or background resource is significantly slower than your foreground worker threads.
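
Extending the sketch above, throttling is just a bounded insert (maxEntries would be a new field on the queue; notifyAll is needed so both blocked producers and the consumer get woken):

    public synchronized void insert(Object workUnit) throws InterruptedException {
        while (queue.size() >= maxEntries) {
            wait();                 // queue full: make the producer wait
        }
        queue.addLast(workUnit);
        notifyAll();
    }

    public synchronized Object remove() throws InterruptedException {
        while (queue.isEmpty()) {
            wait();
        }
        Object unit = queue.removeFirst();
        notifyAll();                // a slot opened up: wake blocked producers
        return unit;
    }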

Relatives to the Work Queue

There are a number of Work Queue relatives besides the variations listed above. The one that immediately comes to my mind is:

Thread Pool. The Thread Pool is essentially a variation of the Work Queue, where requests for work go into a queue, and are pulled out by one or more threads that are part of our thread pool. Normally the "unit of work" here is the actual method to invoke in our worker thread--a "Runnable" object in Java parlance. The Thread Pool can also create new threads (up to a maximum limit) as new threads are needed and current worker threads are occupied--or it can simply create the maximum number of threads at the start. Each worker thread then waits until something is inserted into the queue for processing.
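
A sketch of the simple create-all-threads-up-front variant, built on the WorkQueue sketch above (Java 5's java.util.concurrent package ships a full-featured version of this as ThreadPoolExecutor):

    class ThreadPool {
        private final WorkQueue queue = new WorkQueue();

        ThreadPool(int numThreads) {
            for (int i = 0; i < numThreads; i++) {
                Thread worker = new Thread(new Runnable() {
                    public void run() {
                        try {
                            for (;;) {
                                // Block until a unit of work arrives, then run it.
                                Runnable job = (Runnable) queue.remove();
                                job.run();
                            }
                        } catch (InterruptedException e) {
                            // interrupted: treat as shutdown, fall out of the loop
                        }
                    }
                });
                worker.setDaemon(true);
                worker.start();
            }
        }

        void execute(Runnable job) {
            queue.insert(job);
        }
    }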

2. Complex Synchronization Models

Typically most languages or libraries provide only very basic primitives to handle synchronization. For example, in the pthread library, we get a mutex object and a condition object; the two work together to allow one thread to block while another thread executes, then, when the second thread is done, it can signal the first thread to retry its lock. Java provides similar objects: the 'synchronized' keyword allows a thread to block other threads trying to synchronize on the same object, and the 'wait' and 'notify' methods allow one thread to block until notified by another thread.

These synchronization primitives are often inadequate to our needs. We often need more complex synchronization objects to handle our locking needs.

Read multiple/write once

One very simple synchronization object is the 'read multiple/write once' lock. A very common model is a resource which multiple threads may simultaneously read--but which only one thread at a time may write. Of course we could simply use a single synchronization object to allow only one thread at a time to read the object as well--quite adequate if what we are reading is small, such as a short integer.

But if our read process is quite complex or involved, it would be a shame to block four or five reader threads simply because we're too lazy to write the correct synchronization object.

A 'read multiple/write once' synchronization object works by maintaining two bits of information: the number of threads holding a read lock, and a flag indicating whether a write lock is held. A thread requesting a read lock tests whether a write lock is held, waits until it's free, then increments the read count. A thread requesting a write lock checks that the read count is zero and that no write lock is held before proceeding.

This has the advantage that a read request could go through very quickly, then the expensive read process could be done. Multiple readers could then read the same bit of memory at the same time without getting serialized.

Now in the event where writers are occasional but readers are persistent, you can modify this model by adding an additional flag: has a write lock been requested but not yet fulfilled? If that flag is set, threads requesting a new read lock wait. This solves the problem where read requests effectively drown out write requests, preventing the write request from ever going through because the read requests overlap each other. The downside, of course, is that a write request effectively stalls all incoming read requests--but then, if write requests are occasional, that's a reasonable price to pay to guarantee the writer actually gets a chance to write.
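
A minimal sketch of this object in Java, including the writer-priority flag just described (names are mine):

    class ReadWriteLock {
        private int readers = 0;            // threads currently holding a read lock
        private boolean writing = false;    // is a write lock held?
        private int writersWaiting = 0;     // write requests made but not yet fulfilled

        public synchronized void lockRead() throws InterruptedException {
            // Block new readers while a write is in progress *or* pending,
            // so readers can't drown out the writer.
            while (writing || writersWaiting > 0) {
                wait();
            }
            readers++;
        }

        public synchronized void unlockRead() {
            readers--;
            if (readers == 0) notifyAll();  // last reader out: wake a waiting writer
        }

        public synchronized void lockWrite() throws InterruptedException {
            writersWaiting++;
            while (writing || readers > 0) {
                wait();
            }
            writersWaiting--;
            writing = true;
        }

        public synchronized void unlockWrite() {
            writing = false;
            notifyAll();                    // wake waiting readers and writers
        }
    }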

Semaphores

A semaphore is a much simpler object than our read/write object, and consists of a simple counter. The counter is incremented on a set condition and decremented on a clear condition, and a thread can wait until the count reaches zero.

This makes a great mechanism for doing a multiple thread "join" operation, where one thread waits until a bunch of other threads have completed. The threads are started up, then the main thread waits on the semaphore object.

Of course, in this case you have to be mindful of a race condition: some of the threads may not have started before our main thread reaches the semaphore wait. That can be solved by having the main thread set the semaphore count to the number of threads to be started, rather than allowing the subthreads to increment the count themselves.
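
A sketch of that count-down-to-zero object in Java (Java 5's java.util.concurrent.CountDownLatch provides essentially this):

    class CountDownSemaphore {
        private int count;

        // The main thread sets the count up front, avoiding the race
        // where it waits before the subthreads have started.
        CountDownSemaphore(int count) {
            this.count = count;
        }

        // Called by each subthread as it finishes.
        public synchronized void decrement() {
            count--;
            if (count == 0) notifyAll();
        }

        // Called by the main thread to "join" all the subthreads.
        public synchronized void waitForZero() throws InterruptedException {
            while (count > 0) wait();
        }
    }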

Dealing with and debugging deadlocks

Fundamentally, the reason for using all these queues and correct (and judiciously placed) locking models is to avoid deadlock while improving throughput by minimizing the cross-section of our locked code. The whole point of writing a multi-threaded program is to increase performance--yet deadlocks are a pain in the ass to find.

One method I've used to find deadlocks is to create a synchronization timeout exception or diagnostic error. The idea is that when you need to synchronize against something (such as the read/write object or semaphore object above), you also add that synchronization object to a list of synchronization objects, so your debug code can quickly dump the current state of all waiting objects.

Then, instead of waiting with an infinite wait period, you wait a finite amount of time--and throw an exception in your debug build if your thread has waited for some inordinate amount of time, like, say, five minutes. If, after five minutes, your code is still locked, you can dump the current state of all waiting objects and a stack trace for each, so you can diagnose what has locked up.
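
A minimal sketch of such a diagnostic lock (the names and the five-minute limit are mine; a real version would also capture stack traces):

    import java.util.ArrayList;
    import java.util.List;

    class DiagnosticLock {
        private static final List allLocks = new ArrayList();
        private static final long TIMEOUT_MS = 5 * 60 * 1000;  // five minutes

        private final String name;
        private boolean held = false;

        DiagnosticLock(String name) {
            this.name = name;
            synchronized (allLocks) { allLocks.add(this); }  // register for dumps
        }

        public synchronized void lock() throws InterruptedException {
            long deadline = System.currentTimeMillis() + TIMEOUT_MS;
            while (held) {
                long remaining = deadline - System.currentTimeMillis();
                if (remaining <= 0) {
                    dumpAllLocks();   // probable deadlock: dump everything we know
                    throw new IllegalStateException("timed out waiting for " + name);
                }
                wait(remaining);
            }
            held = true;
        }

        public synchronized void unlock() {
            held = false;
            notifyAll();
        }

        static void dumpAllLocks() {
            synchronized (allLocks) {
                for (int i = 0; i < allLocks.size(); i++) {
                    DiagnosticLock l = (DiagnosticLock) allLocks.get(i);
                    System.err.println("lock " + l.name + " held=" + l.held);
                }
            }
        }
    }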


Of course, the easiest way to avoid deadlocks entirely is to assign a strict ordering to your synchronization objects and always lock them in that order. That is, you can determine, for example, that between two synchronization objects, B will never be locked before A: either B is locked by itself, or A is locked prior to B being locked. That way, you will never get into a situation where thread 1 locks A and tries to lock B, while thread 2 locks B and tries to lock A.

You could conceivably carry this one step further, by creating a synchronization object which knows about the ordering above. For example, you could create a synchronization object which tracks which threads are holding a lock on it, and which is also told about the implicit ordering of all the locks. That way, the synchronization object can immediately fail if it finds a condition where someone attempts to lock A when B is already locked.

For a language such as C, where synchronization objects must be explicitly allocated, this should be a relatively easy thing to build. For a language such as Java, where synchronization is implicit in the 'synchronized' keyword, this could be a little more difficult and cumbersome to build.
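
A deliberately simplified sketch of such an order-checking lock in Java (a real version would keep a per-thread stack of held ranks and wrap an actual lock; this shows only the fail-fast check):

    class OrderedLock {
        private final int rank;   // global rule: locks must be taken in increasing rank
        private static final ThreadLocal highestHeldRank = new ThreadLocal();

        OrderedLock(int rank) {
            this.rank = rank;
        }

        public void lock() {
            Integer held = (Integer) highestHeldRank.get();
            if (held != null && held.intValue() >= rank) {
                // Ordering violation: e.g. trying to lock A (rank 1) while
                // already holding B (rank 2). Fail immediately and loudly,
                // rather than deadlocking in the field.
                throw new IllegalStateException("lock ordering violation on rank " + rank);
            }
            // ... acquire the real underlying lock here ...
            highestHeldRank.set(new Integer(rank));
        }

        public void unlock() {
            // ... release the real underlying lock here ...
            highestHeldRank.set(null);  // a real version would restore the previous rank
        }
    }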

Conclusions

Of course there are undoubtedly more complex synchronization issues that need to be dealt with. For example, one could conceive of a paged file object where each page can hold a read/write lock, but data can span multiple pages. In that case, the synchronization object to maintain the multiple read/write page file and the locks on that file, while preventing deadlock, can be quite complex.

But nine times out of ten, the multi-threading issues I've encountered in the programs I've worked on can be solved with one of the three or four synchronization design patterns above, or a relative of one of them. While I'm not a big believer in the whole "I have a hammer, and now the world looks like a nail" design philosophy, for multi-threaded applications it's pretty much true that a very small number of design models can take us very far with only minimal synchronization and deadlock issues.

And it's why so many operating systems have a primitive "synchronized queue" object or API, as well as some sort of semaphore object. And while read/write lock objects are less common, all too often they wind up being the first thing I build when building a multi-threaded program.
 
Tuesday, August 02, 2005
  Why Java Serialization Is Bad
So you've written a very cool application in Java. Say you've built the next great thing, a program that edits 3D objects or is the next Microsoft Word killer or just keeps track of your recipes. And now it's time to tackle the whole "how do we save our in-memory representation out to disk" problem.

Well, the answer is pretty simple, no? Just pop open a java.io.ObjectOutputStream and dump your KillerDocument object to the file. Done, no problems, right?
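
That is, something like this (KillerDocument is the hypothetical document class from above; it would have to implement java.io.Serializable):

    import java.io.*;

    class NaiveSave {
        static void save(Object killerDocument) throws IOException {
            // The seductive one-liner: dump the entire object graph to disk.
            ObjectOutputStream out =
                new ObjectOutputStream(new FileOutputStream("killer.dat"));
            out.writeObject(killerDocument);
            out.close();
        }
    }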


Over the past few days here at Symantec we had a problem. We wanted to reduce the footprint of a background task which had originally been created in Java. For enterprise reliability, Java is great: its built-in memory management reduces the number of problems that need to be solved in order to create a reliable process--and its built-in class library means you can pretty much drop your jar files anywhere and just have the thing run.

But Java is a pig: its memory footprint is just unacceptable for smaller end-point boxes, such as older desktop models.

So we have to rewrite this key component in C.

Problem: the person who originally wrote the encrypted keystore mechanism by which we secure some information used Java serialization to dump the keystore, stored as a Java HashMap. And I had to write code to load and save the same HashMap data in C. What had been perhaps a dozen lines of code in the original Java implementation turned into a couple of thousand lines of C code to do the same thing.


Why? Why was it such a pain in the ass to read a serialized Java HashMap in C? It's not that the contents of the serialized file are exactly a secret: Sun has published the Java Object Serialization Specification on-line. But the actual format of the serialized data makes a bunch of fundamental assumptions about the implementation of your code.


First, and most devastating for my own purposes, the Java serialization API assumes that the file will be read back in by Java. The underlying file format is not quite block-oriented: sure, you get 'end of block' markers in the file, but you have no way of knowing a priori the size of a unit of data within the file that you don't understand. (For example, if you had to parse a file with a HashMap embedded in it, but you wanted to skip the HashMap block because it contains information your application doesn't need, you would have no way to know how big the HashMap object is as stored in the file. The HashMap simply provides a flag that says "yes, this object contains custom data", but no way to know the size of that custom blob--short of parsing through the data as if you were a hash map.)

Worse, because the Java serialization API stores a bunch of internal state about the objects being written, it becomes necessary to understand some of the internal fields and what their values should be for a given set of keys. For example, the HashMap stores four bits of information: the "loadFactor", the "threshold", the "capacity", and the number of entries. The number of entries is obvious: it's the count of key/value pairs. But what about the threshold? The capacity? The load factor?

Examining the source code for the HashMap class in Java's source kit, we find out that the reader code bypasses important checks on the contents before loading the file. This means that we bloody well better get the capacity and the threshold correct, or else we run a risk of crashing the HashMap implementation during file load.

Now it turns out that the loadFactor is the loading factor used before growing the internal array in the HashMap--which defaults to 0.75. The threshold--that is, the number of objects the HashMap will store before growing its internal array--is ((floor)(16 * loadFactor)) * (2 ** N), for some value of N which gives a threshold greater than or equal to the number of objects in our map. And the capacity--that is, the total number of slots in the internal array used by our HashMap--is ((floor)(threshold / loadFactor)); or, as 16 * loadFactor happens to be 12 even, it's 16 * (2 ** N) for the value of N found above.
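
A worked example of those formulas, assuming the default loadFactor of 0.75 (so floor(16 * loadFactor) = 12): say our keystore holds 100 key/value pairs.

    12 * (2 ** 3) =  96   (too small: 96 < 100)
    12 * (2 ** 4) = 192   (first value >= 100, so N = 4)

    threshold = 192
    capacity  = floor(192 / 0.75) = 256 = 16 * (2 ** 4)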

And without the Java source kit, there is no way I could have known this so I could write out a well-formed HashMap.


Second, the serialization process restricts the implementation of your code. Currently our key store is a hash map, but if in the future we wanted to replace it with (say) a vector array or with a new-and-improved object--we're SOL. The file says we're getting a HashMap, and by God we're getting a HashMap.

Third, if the file needs to be communicated to another party, and the data is not serialized with well-known classes, then the third party is out of luck without your custom class implementation.


What makes this particular example especially egregious is that it would have taken just a few extra lines of code for the original author to have instead opened up a DataOutputStream and explicitly written the key/value pairs to the file--perhaps with a little additional decoration, such as a header and version number. Instead of taking a dozen lines of code it would have taken two dozen.
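
A sketch of what that might have looked like (the magic number and file layout here are my own invention, assuming String keys and values):

    import java.io.*;
    import java.util.Iterator;
    import java.util.Map;

    class KeyStoreWriter {
        static final int MAGIC = 0x4B455953;   // arbitrary "KEYS" tag
        static final short VERSION = 1;

        static void save(Map keyStore, OutputStream dest) throws IOException {
            DataOutputStream out = new DataOutputStream(dest);
            out.writeInt(MAGIC);               // header: identifies the file...
            out.writeShort(VERSION);           // ...and its format version
            out.writeInt(keyStore.size());     // number of pairs to follow
            for (Iterator i = keyStore.entrySet().iterator(); i.hasNext(); ) {
                Map.Entry e = (Map.Entry) i.next();
                out.writeUTF((String) e.getKey());
                out.writeUTF((String) e.getValue());
            }
            out.flush();
        }
    }

Reading this back in C is a trivial loop over length-prefixed strings--no knowledge of HashMap internals required.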

The disadvantage of the DataOutputStream is that it only knows how to write certain primitives (shorts, longs, Java strings)--which means that the implementation would have been slightly more complex.

But then, in the end, the file would be (a) data-structure independent (we could use any data structure we want to store our data internally), (b) language independent (we're not married to Java--we could read it in Perl or Basic), and (c) implementation independent (we're not storing implementation-dependent data that is tangential to our actual content--think 'threshold' in the example above)--which would have reduced the multi-thousand-line C code for reading in our HashMap down to perhaps less than a couple of dozen lines.
 