Published Tuesday, May 13, 2008 12:09 AM by martin
In my consulting days, I had quite a few engagements where you might call me a performance troubleshooter. That is, the client had already built an app, in some cases it was even in the hands of users, but performance was unacceptably bad. They turned to their platform vendor for help, and they got me.
Performance is a feature, but it's an unusual one in that it can be present one day and so easily obliterated the next, sometimes by changing a single line of code. One technology stands out in my memory as being misused more than any other, and creating performance problems along the way. That technology is XML. This post from Jeff Atwood prompted me to think about this stuff again.
First, we should draw a distinction between the XML Infoset and the normal, textual encoding of XML that Jeff talks so eloquently about. Sure, we all think of angle brackets at the mention of XML, but data structured using the XML Infoset can be encoded in much more efficient ways. Note that WCF (in .NET v3.0 at least) represents all its messages as XML, but can encode them in several ways, including a binary one, so that performance can be just as good as Microsoft's older, binary, messaging formats. So it's really the textual encoding of XML that causes performance problems, but until recently I think that's all we had. Certainly I was never aware of tools on the Microsoft platform that did anything different.
Problems usually arose when XML was used as a mechanism for carrying lumps of data between modules, subsystems, call them what you will. Actually, let's think about that. Quite apart from any performance implications, is XML an appropriate "bag" to shift data around - often between components that share an address space? Well, I'd never say "never", but it seems like an odd choice to me. I think in many cases development teams grew tired of typed interfaces between components and thought how much easier life would be if they could pass an XML string. You can put anything in there and it will still compile. That's right: turn all your compile-time errors into runtime ones. Smart move. Going way back I remember ADO Recordsets being used for similar reasons. Now, XML isn't necessarily untyped of course; we do have schemas, but validating against a schema simply compounds the performance issues I'm mainly thinking about.
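To make the "it will still compile" point concrete, here is a minimal sketch in Python. Python has no compile step, so this is only an analogy: the typed version fails at the producer's call site (and a static checker would flag it earlier still), while the XML version fails only when the consumer goes looking for the data. The `Order` record, its field names and the helper functions are all hypothetical, invented for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Order:
    # Hypothetical typed record; the field names are illustrative only.
    customer: str
    quantity: int


def quantity_from_xml(payload: str) -> int:
    # The consumer has to trust that the producer used the agreed element names.
    root = ET.fromstring(payload)
    node = root.find("quantity")
    if node is None:
        raise ValueError("payload has no <quantity> element")
    return int(node.text)


def typed_typo_detected() -> bool:
    # A misspelled field is rejected where the record is built.
    try:
        Order(customer="acme", quantitty=3)  # deliberate typo
    except TypeError:
        return True
    return False


# The XML "bag" accepts the same typo without complaint...
BAD_PAYLOAD = "<order><customer>acme</customer><quantitty>3</quantitty></order>"


def xml_typo_detected_by_consumer() -> bool:
    # ...and the mistake only surfaces later, inside the consuming component.
    try:
        quantity_from_xml(BAD_PAYLOAD)
    except ValueError:
        return True
    return False
```

Both mistakes are caught eventually, but in the XML case the failure shows up in a different component from the one that made the error, and only when that code path actually runs.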
I saw two uses of XML that gave performance problems. One was where data was formatted as XML, with a textual encoding, into a string, then passed across some interface or other. This process incurs CPU overheads to format, encode, decode, parse, etc. and it's also pretty likely that there are memory overheads associated with transferring data in this way. I daresay though that plenty of systems out there work like this and manage to hit their performance goals.
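The format, encode, decode and parse steps are easy to see in a small standalone experiment. The sketch below, in Python rather than the .NET and COM code the engagements involved, passes the same value to a consumer either directly as an object or via a textual XML string. The `Reading` type and the function names are assumptions made up for the example, and the timings it prints will vary by machine, so treat it as a way to measure the cost rather than a statement of what the cost is.

```python
import timeit
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Reading:
    # Hypothetical data item passed between two components.
    sensor: str
    value: float


def to_xml(reading: Reading) -> str:
    # Producer side: format the object as textual XML.
    root = ET.Element("reading")
    ET.SubElement(root, "sensor").text = reading.sensor
    ET.SubElement(root, "value").text = repr(reading.value)
    return ET.tostring(root, encoding="unicode")


def from_xml(payload: str) -> Reading:
    # Consumer side: parse the text and convert the fields back to typed values.
    root = ET.fromstring(payload)
    return Reading(sensor=root.findtext("sensor"), value=float(root.findtext("value")))


def consume(reading: Reading) -> float:
    return reading.value


def via_object(reading: Reading) -> float:
    # Same address space, typed interface: just pass the reference.
    return consume(reading)


def via_xml(reading: Reading) -> float:
    # Same address space, XML string interface: format, then parse again.
    return consume(from_xml(to_xml(reading)))


if __name__ == "__main__":
    sample = Reading("boiler-1", 71.5)
    for fn in (via_object, via_xml):
        seconds = timeit.timeit(lambda: fn(sample), number=20_000)
        print(f"{fn.__name__}: {seconds:.3f}s for 20,000 calls")
```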
The other, more interesting, scenario was where XML document objects were used rather than strings. I saw this done with MSXML DOM objects and System.Xml.XmlDocument. The performance in these cases was usually significantly worse than for simple strings, although that does presume the strings were parsed using SAX or the XmlTextReader or similar, rather than loaded into a document object. The XML document objects are quite big, costly things. They maintain indexes across the data they hold, and they have to store the data in a way that allows them to easily read/write the textual encoding. Creating nodes in a document can typically only be done by making a call on a document object, so in some cases I saw people instantiating document objects simply to create nodes that could then be copied into other document objects. Once I saw around 80 document object instantiations to formulate a single data item that would more naturally have been a graph of typed objects. That's a lot of memory and a lot of CPU cycles.
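The difference between loading a document object and reading the text as a stream can be sketched with Python's standard library, using `xml.dom.minidom` as a stand-in for MSXML DOM or XmlDocument and `xml.sax` as a stand-in for SAX or XmlTextReader. The sample data and function names are hypothetical. Both functions compute the same total; the DOM version builds a node object for every element and text run first, while the streaming version keeps only a running sum.

```python
import xml.sax
from xml.dom import minidom


def build_sample(count: int) -> str:
    # Hypothetical payload: <items><item><price>0</price></item>...</items>
    items = "".join(f"<item><price>{i}</price></item>" for i in range(count))
    return f"<items>{items}</items>"


def total_with_dom(payload: str) -> int:
    # Loads the whole document into a tree of node objects before reading anything.
    document = minidom.parseString(payload)
    try:
        return sum(int(node.firstChild.data)
                   for node in document.getElementsByTagName("price"))
    finally:
        document.unlink()  # break internal references so the tree can be freed


class PriceSummer(xml.sax.ContentHandler):
    # Streaming handler: sees one event at a time and never builds a tree.
    def __init__(self):
        super().__init__()
        self.total = 0
        self._in_price = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "price":
            self._in_price = True
            self._chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect it until the end tag.
        if self._in_price:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "price":
            self.total += int("".join(self._chunks))
            self._in_price = False


def total_with_stream(payload: str) -> int:
    handler = PriceSummer()
    xml.sax.parseString(payload.encode("utf-8"), handler)
    return handler.total
```

Timing the two functions on a large sample, or watching memory while they run, is one way to find out what the document object costs on a given platform before committing to it.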
That leads me to a simple statement that became something of a golden rule for me. If you're going to use a technology, understand it first. In other words, know what it costs.
It sounds so simple, but I'd say almost all the performance issues I ever saw could be traced back to a failure to follow this simple rule. The people who built those XML-centric systems weren't fools: if they had only known the cost of what they were doing, they'd have done things differently.
Today, I hope never to use code if I don't understand the tradeoffs involved. Does that mean I have to have source code for everything I use? No, although I use Reflector quite a bit. It also means that I experiment with alternatives. I build simple standalone apps to try out different algorithms and understand where they're strong and where they're weak. I've written before about "Big O" notation, and that's another big input for me when choosing data structures and algorithms.
For a long time, at least one of my fellow consultants held the view that XML is evil and should be banned. Of course, XML is a tool, and there are some jobs for which XML is the right tool. I think XML is unusual because, for a while there, a lot of people thought XML was the right tool for every job.