Archive for January, 2009

Preventing XML-Restricted Characters

In the last post, I told you how I wanted to send around a custom Message object among services in our system using Amazon SQS messaging.  I soon ran into another problem.

As a reminder, here is the set of allowed XML characters again.

#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]

I soon got another Amazon SQS exception.  I inspected the hex dump and found a 0x19 byte in the text.

The 0x19 character is the END OF MEDIUM character, a control character that I imagine could get copied and pasted from a Windows machine into a web form.  In Firefox, 0x19 looks like this (right after the phrase “Basket Makers”).

[Screenshot: EOM (0x19) in Firefox]

In IE7, it looks like this.

[Screenshot: EOM (0x19) in IE7]

One thing you might try is to strip out invalid UTF-8 characters using iconv.

$string = iconv("UTF-8", "UTF-8//IGNORE", $string);

But this won’t work: 0x19 is a valid UTF-8 character, just not a valid XML character.  There is no choice but to explicitly filter out the invalid XML values.  Applying this filter (assuming the text is already in UTF-8) to instance values before serializing the object seems to work.

$str = preg_replace('/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $str);
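For reuse, the filter can be wrapped in a small helper. This is only a minimal sketch, assuming the input is already valid UTF-8; the function name stripInvalidXmlChars is hypothetical. It relies on PCRE's \x{...} code point escapes together with the u modifier, which is what lets a pattern address code points above 0xFF.

```php
<?php
// Hypothetical helper: removes every character outside the XML 1.0 range
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF].
function stripInvalidXmlChars($str)
{
    // The "u" modifier makes PCRE treat pattern and subject as UTF-8,
    // so \x{D7FF} means the code point U+D7FF rather than a single byte.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
        '',
        $str
    );

    // With "u", preg_replace() returns null when the subject is not valid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}

// 0x19 passes a UTF-8 validity check (an empty pattern with "u" matches it),
// yet it falls outside the XML range and is removed by the helper.
$dirty = "Basket Makers\x19 and more";
var_dump(preg_match('//u', $dirty) === 1);
echo stripInvalidXmlChars($dirty), "\n";
```

Throwing on invalid UTF-8 is a design choice for the sketch; a caller could just as well normalize the encoding first and then filter.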

In our service-oriented system, I wanted to send around a custom Message object.  We currently use Amazon SQS for messaging, which requires that all message characters fall within the valid XML character range (according to the W3C XML 1.0 spec).  This range is:

#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]

Here’s some code to serialize the object and enqueue it to SQS.

class Message
{
    private $_msg;

    public function setMessage($msg)
    {
        $this->_msg = $msg;
    }
}

$obj = new Message;
$obj->setMessage('Hello world!');
$msg = serialize($obj);

Unfortunately, this code produced this exception:

Amazon_SQS_Exception: An invalid binary character was found in the message body, the set of allowed characters is #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

What was going on?

I echoed out the serialized $msg var and looked at it carefully.  There was a lot of funny punctuation, but nothing outside of the XML character range.

O:7:"Message":1:{s:13:" Message _msg";s:12:"Hello world!";}

When in doubt, look at the binary.  I wrote $msg out to a file, and looked at the file’s hex dump.

0000000: 4f3a 373a 224d 6573 7361 6765 223a 313a  O:7:"Message":1:
0000010: 7b73 3a31 333a 2200 4d65 7373 6167 6500  {s:13:".Message.
0000020: 5f6d 7367 223b 733a 3132 3a22 4865 6c6c  _msg";s:12:"Hell
0000030: 6f20 776f 726c 6421 223b 7d              o world!";}

The culprits are in line 2.  The NUL (0x00) character, which shows up twice around the class name, is most definitely not in the valid character range.  Some googling confirmed my suspicions in the PHP manual itself.

If you are serializing an object with private variables, beware. The serialize() function returns a string with null (x00) characters embedded within it, which you have to escape.
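A hex dump is not strictly necessary to spot the NUL bytes. As a rough sketch, the same check can be done in PHP itself. The class below mirrors the Message class from the example, and using addcslashes() to make control bytes visible is just one illustrative option, not something from the original debugging session.

```php
<?php
// Same shape as the Message class above: one private property.
class Message
{
    private $_msg;

    public function setMessage($msg)
    {
        $this->_msg = $msg;
    }
}

$obj = new Message;
$obj->setMessage('Hello world!');
$msg = serialize($obj);

// Private properties are serialized as "\0ClassName\0propertyName",
// so two NUL bytes are expected in the output.
$nulCount = substr_count($msg, "\0");
$firstNul = strpos($msg, "\0");

// addcslashes() with the range "\0..\37" rewrites control bytes as visible
// octal escapes, which makes them easy to see when echoed.
$visible = addcslashes($msg, "\0..\37");

echo $nulCount, " NUL bytes, first at offset ", $firstNul, "\n";
echo $visible, "\n";
```

The first offset reported here should line up with the position of the first 00 byte in the hex dump.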

Other comments on that manual page also gave me my solution: escaping with addslashes().  So I replaced the serialize() line in the code above like so, and I was then able to send objects over SQS with no problem.

$msg = addslashes(serialize($obj));
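On the receiving side, the escaping has to be undone before unserialize() is called. The following is a minimal sketch of the round trip, assuming the consumer gets back exactly the string that was enqueued. The class name QueuedMessage and the getMessage() accessor are hypothetical, and the SQS send and receive calls are left out.

```php
<?php
// Hypothetical stand-in for the Message class, with a getter added for the demo.
class QueuedMessage
{
    private $_msg;

    public function setMessage($msg)
    {
        $this->_msg = $msg;
    }

    public function getMessage()
    {
        return $this->_msg;
    }
}

$obj = new QueuedMessage;
$obj->setMessage('Hello world!');

// Producer: addslashes() turns each NUL byte into a backslash followed by "0"
// (and also escapes quotes and backslashes).
$wire = addslashes(serialize($obj));
var_dump(strpos($wire, "\0") === false); // checks that no raw NUL bytes remain

// Consumer: stripslashes() reverses the escaping before unserializing.
$copy = unserialize(stripslashes($wire));
echo $copy->getMessage(), "\n";
```

Note that addslashes() only touches NUL bytes, quotes, and backslashes, so any other XML-restricted characters inside the property values would still need to be filtered separately.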