Why Java Serialization Is Bad
So you've written a very cool application in Java. Say you've built the next great thing, a program that edits 3D objects or is the next Microsoft Word killer or just keeps track of your recipes. And now it's time to tackle the whole "how do we save our in-memory representation out to disk" problem.
Well, the answer is pretty simple, no? Just pop open a java.io.ObjectOutputStream and dump your KillerDocument object to the file. Done, no problems, right?
Over the past few days here at Symantec we had a problem. We wanted to reduce the footprint of a background task which had originally been written in Java. For enterprise reliability, Java is great: its built-in memory management reduces the number of problems that need to be solved in order to create a reliable process--and its built-in class library means you can pretty much drop your jar files anywhere and just have the thing run.
But Java is a pig: its memory footprint is just unacceptable for smaller end-point boxes, such as older desktop models.
So we had to rewrite this key component in C.
Problem: the person who originally wrote our encrypted keystore mechanism, which we use to secure some information, used Java serialization to dump the keystore, which is held in memory as a Java HashMap. And I had to write some code to load and save the same HashMap data in C. What had been perhaps a dozen lines of code in the original Java implementation turned into a couple of thousand lines of C code to do the same thing.
Why? Why was it such a pain in the ass to read in a serialized Java HashMap in C? It's not that the contents of the serialized file are exactly a secret: Sun has published the Java Object Serialization Specification online. But the actual format of the serialized data makes a bunch of fundamental assumptions about the implementation of your code.
First, and most devastating to my own purposes, the Java serialization API assumes that the file will be read in by Java. The underlying file format is not quite block-oriented: sure, you get 'end of block' markers in the file, but you have no way to know a priori the size of a unit of data within the file that you don't understand. (For example, if you had to parse a file with a HashMap embedded in it, but you wanted to skip the HashMap block because it contains information your application doesn't need, you would have no way to know how big the HashMap object was as stored in the file. The HashMap simply provides a flag that says "yes, this object contains custom code", but no way to know the size of that custom blob of data, without parsing through the data as if you were a hash map.)
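To get a feel for what a non-Java reader is up against, here is a minimal sketch that serializes a small HashMap and peeks at the first few bytes of the stream. The class and method names (StreamPeek, serialize, u16, topLevelClassName) are made up for this illustration; the constants come from java.io.ObjectStreamConstants, and the fixed offsets assume the top-level object is an ordinary class instance written with writeObject.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;

class StreamPeek {
    // Serializes any object into a byte array using standard Java serialization.
    static byte[] serialize(Object value) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        }
        return bytes.toByteArray();
    }

    // Reads a big-endian unsigned 16-bit value at the given offset.
    static int u16(byte[] data, int offset) {
        return ((data[offset] & 0xFF) << 8) | (data[offset + 1] & 0xFF);
    }

    // Assumes the layout: 4-byte stream header, TC_OBJECT tag, TC_CLASSDESC tag,
    // then the class name as a 2-byte length followed by the name bytes.
    static String topLevelClassName(byte[] data) {
        int length = u16(data, 6);
        return new String(data, 8, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        HashMap<String, String> map = new HashMap<>();
        map.put("key", "value");
        byte[] data = serialize(map);

        System.out.printf("magic=0x%04X version=%d%n", u16(data, 0), u16(data, 2));
        System.out.printf("tags=0x%02X 0x%02X class=%s%n",
                data[4], data[5], topLevelClassName(data));
        System.out.println("total bytes: " + data.length);
    }
}
```

The thing to notice is what is missing: after the header you get a type tag and a class description, but no byte count for the object that follows, so a reader cannot skip it without parsing it.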
Worse, because the Java serialization API stores a bunch of internal state about the objects being written, it becomes necessary to understand some of the internal fields and what their values should be for given sets of keys. For example, the HashMap stores four pieces of information: the "loadFactor", the "threshold", the "capacity", and the "number". The number is obvious: this is the number of key/value pairs. But what about the threshold? The capacity? The load factor?
Examining the source code for the HashMap class in Java's source kit, we find out that the reader code bypasses important checks on the contents before loading the file. This means that we had bloody well better get the capacity and the threshold correct, or else we run the risk of crashing the HashMap implementation during file load.
Now it turns out that the loadFactor is the loading factor used to decide when to grow the internal array in the HashMap--which can be set to 0.75, the default. The threshold--that is, the number of objects that the HashMap will store before growing its internal array--is floor(16 * loadFactor) * (2 ** N) for some value of N which gives a threshold that is greater than or equal to the number of objects in our map. And the capacity--that is, the total number of slots in the internal array used by our HashMap--is floor(threshold / loadFactor), or, as 16 * loadFactor happens to be exactly 12, it's 16 * (2 ** N) for the value of N found above.
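The arithmetic above can be sketched in a few lines. This is only an illustration of the formulas as I've described them, with the load factor fixed at 0.75; the class and method names are hypothetical, and picking the smallest N that works is my own assumption. HashMap internals may differ between JDK releases, so don't treat this as a description of any particular one.

```java
class HashMapFields {
    static final float LOAD_FACTOR = 0.75f; // load factor assumed throughout this sketch
    static final int BASE_CAPACITY = 16;    // the "16" in the formulas above

    // Returns {capacity, threshold} for a map holding entryCount key/value pairs.
    // Intended for modest sizes; very large counts would overflow the shifts.
    static int[] capacityAndThreshold(int entryCount) {
        int baseThreshold = (int) Math.floor(BASE_CAPACITY * LOAD_FACTOR); // 12
        int n = 0;
        // Assumption: use the smallest N whose threshold covers the entries.
        while ((baseThreshold << n) < entryCount) {
            n++;
        }
        int threshold = baseThreshold << n; // 12 * 2^N
        int capacity = BASE_CAPACITY << n;  // 16 * 2^N
        return new int[] { capacity, threshold };
    }

    public static void main(String[] args) {
        for (int count : new int[] { 0, 12, 13, 100 }) {
            int[] fields = capacityAndThreshold(count);
            System.out.println(count + " entries: capacity=" + fields[0]
                    + " threshold=" + fields[1]);
        }
    }
}
```

With these assumptions, 13 entries no longer fit under the threshold of 12, so N becomes 1 and you get a capacity of 32 with a threshold of 24.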
And without the Java source kit, there is no way I could have known this so I could write out a well-formed HashMap.
Second, the serialization process restricts the implementation of your code. Currently our key store is a hash map, but if in the future we wanted to replace it with (say) a vector array or with a new-and-improved object--we're SOL. The file says we're getting a HashMap, and by God we're getting a HashMap.
Third, if the file needs to be communicated to another party, and the data is not serialized with well-known classes, then the third party is out of luck without your custom class implementation.
What makes this particular example especially egregious is that it would have taken just a few extra lines of code for the original author above to have instead opened up a DataOutputStream and explicitly written the key/value pairs to that file--perhaps with a little additional decoration, such as a header and version number. Instead of taking a dozen lines of code it would have taken two dozen.
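Something along these lines would have done it. This is a rough sketch, not the code we ended up with: it assumes the keys and values are plain strings, and the class name, the magic number and the version value are all invented for the example.

```java
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.LinkedHashMap;
import java.util.Map;

class KeyStoreFile {
    static final int MAGIC = 0x4B535452; // hypothetical file tag, ASCII "KSTR"
    static final int VERSION = 1;        // hypothetical format version

    // Layout: magic, version, entry count, then each key and value via writeUTF.
    static void write(Map<String, String> entries, OutputStream raw) throws IOException {
        DataOutputStream out = new DataOutputStream(raw);
        out.writeInt(MAGIC);
        out.writeInt(VERSION);
        out.writeInt(entries.size());
        for (Map.Entry<String, String> entry : entries.entrySet()) {
            out.writeUTF(entry.getKey());
            out.writeUTF(entry.getValue());
        }
        out.flush();
    }

    // Reads the same layout back, rejecting files with the wrong header.
    static Map<String, String> read(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        if (in.readInt() != MAGIC) {
            throw new IOException("not a key store file");
        }
        int version = in.readInt();
        if (version != VERSION) {
            throw new IOException("unsupported version " + version);
        }
        int count = in.readInt();
        if (count < 0) {
            throw new IOException("corrupt entry count " + count);
        }
        Map<String, String> entries = new LinkedHashMap<>();
        for (int i = 0; i < count; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            entries.put(key, value);
        }
        return entries;
    }
}
```

Nothing in that layout mentions a HashMap: the reader is free to load the pairs into whatever structure it likes. One caveat for a reader in another language: writeUTF emits a two-byte length followed by Java's modified UTF-8, which matches ordinary UTF-8 for plain ASCII strings but not for every possible string.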
The disadvantage of the DataOutputStream is that it only knows how to write certain primitives (shorts, longs, Java strings)--which means that the implementation would have been slightly more complex.
But then, in the end, the file would be (a) data independent (we could use any data structure we want to store our data internally), (b) language independent (we're not married to Java--we could read it in Perl or Basic), and (c) implementation independent (we're not storing implementation-dependent data that is tangential to our actual content--think 'threshold' in the example above)--which would have reduced the multi-thousand line C code for reading in our HashMap down to perhaps fewer than a couple of dozen lines.