How does one view an XML document? Notepad, or the XML editor in Visual Studio? XML is text, after all, so theoretically any text editor will do the job. Right?
It’s not really WYSIWYG
Simply put, XML is a binary format. Consider this simple XML document composed in the Visual Studio XML editor:
<?xml version="1.0" encoding="utf-8"?>
If this file is saved to disk, the encoding is truly UTF-8, including the UTF-8 byte order mark (BOM). However, if I compose the same document in a generic text editor and save it to disk, the encoding will most likely be Windows-1252, not UTF-8. Let’s say I highlight the XML in the Visual Studio XML editor and copy it to the clipboard. When I paste the XML into another editor and save it to disk, the resulting encoding will not be UTF-8, because the text that was copied and pasted ends up as Windows-1252. My point is that the very first line of the document, <?xml version="1.0" encoding="utf-8"?>, doesn’t necessarily indicate that what you see is what you get.
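Since the declaration is only a claim, one quick hint about what is really on disk is the first few bytes of the file. The following is a minimal sketch, assuming a hypothetical helper class named BomCheck. It only detects the UTF-8 byte order mark (0xEF 0xBB 0xBF). Because the BOM is optional, its absence does not prove the file is not UTF-8.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper for illustration only.
final class BomCheck {

    // Returns true if the bytes start with the UTF-8 byte order mark.
    static boolean hasUtf8Bom(byte[] bytes) {
        return bytes.length >= 3
                && (bytes[0] & 0xFF) == 0xEF
                && (bytes[1] & 0xFF) == 0xBB
                && (bytes[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java BomCheck <file>");
            return;
        }
        // The path comes from the caller; nothing is assumed about it.
        byte[] rawBytes = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(hasUtf8Bom(rawBytes) ? "UTF-8 BOM present" : "no UTF-8 BOM");
    }
}
```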
For the most part, UTF-8 and Windows-1252 are byte-for-byte identical as long as every character is in the 7-bit ASCII range (0x00 to 0x7F), where both encodings use a single byte. However, the degree symbol, °, is different. In Windows-1252 (and many other encodings), the degree symbol is the single byte 0xB0. In UTF-8, the degree symbol is the two-byte sequence 0xC2 0xB0, and a lone 0xB0 is not valid. If the document is composed without regard to the underlying encoding, it’s very easy to end up with a UTF-8 document containing an illegal character. The problem is that just viewing the document in a text editor isn’t going to tell you that. To be absolutely sure, you need a good binary editor that has the ability to convert between encodings according to the specifications behind the encodings.
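The difference is easy to see by encoding the degree symbol both ways. This is a small sketch; the class name DegreeBytes and the hex helper are made up for illustration, and it assumes the JDK in use ships the windows-1252 charset, as mainstream JDKs do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class DegreeBytes {

    // Formats a byte array as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Written as an escape so the encoding of this source file does not matter.
        String degree = "\u00B0";
        System.out.println(hex(degree.getBytes(Charset.forName("windows-1252")))); // B0
        System.out.println(hex(degree.getBytes(StandardCharsets.UTF_8)));          // C2 B0
    }
}
```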
When encoding meets code
Recently, a customer using the JNBridge JMS Adapter for BizTalk Server ran into some unexpected behavior. The JMS adapter was replacing the File Adapter as the customer moved to a messaging solution centered on JMS. Occasionally, a UTF-8 XML document would contain an illegal character. When the File Adapter was used, the XML pipeline would throw an exception during disassembly when it encountered the illegal UTF-8 character. The failed message was routed to a directory where the illegal characters were presumably fixed and the message resubmitted. When the customer moved to the JMS adapter, there were no failed messages, even for documents containing an illegal character, in this case the one-byte degree symbol. The final XML document, wherever it was routed to, now contained ‘ï¿½’ instead. The byte 0xB0 had been replaced by three bytes: 0xEF, 0xBF and 0xBD. The customer, understandably, was confused.
The problem got its start when the message was published to the JMS queue. At some point, Java 7 code similar to the following was executed:
byte[] rawBytes = Files.readAllBytes(Paths.get(somePath));
String jmsMessageBody = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(rawBytes)).toString();
A UTF-8 XML file is read as raw bytes, then explicitly decoded from UTF-8 into a java.lang.String. The string, jmsMessageBody, is used to create a JMS Text Message that will be published to a queue. Though not entirely obvious, the second line of code has just performed a conversion. A Java string uses UTF-16 encoding. During the conversion, any bytes that are not valid UTF-8, like a lone 0xB0, are converted to the Unicode replacement character, U+FFFD (0xFFFD in UTF-16). This mechanism is part of the Unicode specification.
When the JMS BizTalk adapter receives the JMS Text Message, it must convert the contained text to the expected encoding, UTF-8, before submitting the message to the BizTalk DB. As per the Unicode specification, the replacement character, U+FFFD, is encoded in UTF-8 as three bytes: 0xEF, 0xBF and 0xBD. When the File Adapter was used, the pipeline threw an exception because the byte 0xB0 was illegal. With the JMS adapter, the three bytes of the replacement character are perfectly legal UTF-8, hence the pipeline disassembled the document without complaint.
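The whole round trip can be reproduced in a few lines. This is a sketch, not the adapter’s actual code: the class ReplacementRoundTrip and the three-byte sample are invented for illustration, and the decode call mirrors the one in the two-line snippet earlier.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ReplacementRoundTrip {

    // Lenient decode: malformed input is silently replaced with U+FFFD.
    static String decodeLeniently(byte[] utf8Bytes) {
        return StandardCharsets.UTF_8.decode(ByteBuffer.wrap(utf8Bytes)).toString();
    }

    public static void main(String[] args) {
        // "25" followed by a lone 0xB0, which is not valid UTF-8 on its own.
        byte[] original = {'2', '5', (byte) 0xB0};

        String text = decodeLeniently(original);
        System.out.println(text.charAt(2) == '\uFFFD'); // true

        // Re-encoding yields 32 35 EF BF BD: five bytes instead of three.
        byte[] reencoded = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(reencoded.length); // 5

        // Viewed as Windows-1252, the last three bytes are the garbage the
        // customer saw. What the console shows depends on its own encoding.
        System.out.println(new String(reencoded, Charset.forName("windows-1252")));
    }
}
```

Nothing in this sequence throws, which is why no message ever reached the failed-message directory.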
The JMS specification says this about JMS Text Messages:
The inclusion of this message type is based on our presumption that String messages will be used extensively. One reason for this is that XML will likely become a popular mechanism for representing the content of JMS messages.
Of course, this was written in 1999, when XML was still used to mark up content rather than define it. Should one use Text Messages to carry XML documents as payloads? I believe the answer is ‘no’. If the customer had used JMS Bytes Messages instead of Text Messages, the behavior after the switch to JMS would have been the same as with the File Adapter. Illegal characters would have been caught by the pipeline and subsequently fixed by the process that was already in place. Instead, the document ends up with the garbage characters ‘ï¿½’ in place of the intended degree symbol.
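Where a Text Message cannot be avoided, the publishing side could at least decode strictly, so that a malformed document fails before it is published instead of being silently altered. The following is a minimal sketch, not the customer’s code or anything the adapter does. The class StrictUtf8 and its method names are hypothetical. It relies on the standard CharsetDecoder with CodingErrorAction.REPORT.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

final class StrictUtf8 {

    // Decodes UTF-8 and throws on malformed input instead of substituting U+FFFD.
    static String decodeStrictly(byte[] utf8Bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(utf8Bytes)).toString();
    }

    // Convenience wrapper: true only if every byte sequence is valid UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrictly(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValidUtf8(new byte[] {'2', '5', (byte) 0xC2, (byte) 0xB0})); // true
        System.out.println(isValidUtf8(new byte[] {'2', '5', (byte) 0xB0}));              // false
    }
}
```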
Using Bytes Messages for XML is also more efficient. Not only would there be no unnecessary conversions from UTF-8 to UTF-16 and back again, but there would be no message bloat. Converting UTF-8 to UTF-16 can double the size of the XML document, because every character that took one byte in UTF-8 takes two in UTF-16. Why incur the cost to build, publish and deliver JMS messages that are twice the size of the original document?
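The doubling is easy to check for text that is mostly ASCII, as XML markup usually is. This sketch only compares encoded lengths. The class EncodedSize and the sample document are made up for illustration, and how a given JMS provider actually serializes a Text Message on the wire is provider-specific and not covered here.

```java
import java.nio.charset.StandardCharsets;

final class EncodedSize {

    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF_16LE is used so that no byte order mark is added to the count.
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?><temp>25</temp>";
        // For ASCII-only text the UTF-16 size is exactly twice the UTF-8 size.
        System.out.println("UTF-8 bytes:  " + utf8Size(xml));
        System.out.println("UTF-16 bytes: " + utf16Size(xml));
    }
}
```

Characters outside the ASCII range narrow the gap: the degree symbol, for example, takes two bytes in either encoding.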