In the last post, I described how I wanted to pass a custom
Message object between services in our system using Amazon SQS messaging. I soon ran into another problem.
As a reminder, here is the set of characters that XML 1.0 allows.
#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]
I soon got another Amazon SQS exception. I inspected the hex dump and found a 0x19 byte in the text.
The 0x19 character is the END OF MEDIUM character, a control character that I imagine someone might copy and paste from a Windows machine into a web form. In Firefox, 0x19 looks like this (right after the phrase “Basket Makers”).
In IE7, it looks like this.
One thing you might try is to strip out invalid UTF-8 characters using iconv.
$string = iconv("UTF-8", "UTF-8//IGNORE", $string);
But this won’t work: 0x19 is a valid UTF-8 character but an invalid XML character, so iconv leaves it in place. There is no choice but to filter out the invalid XML characters explicitly. Applying this filter (assuming the text is already in UTF-8) to instance values before serializing the object seems to work.
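// The /u modifier makes PCRE treat pattern and subject as UTF-8; code points above 0xFF need the \x{...} form.
$str = preg_replace('/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $str);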
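For reuse, the filter can be wrapped in a small function. This is only a minimal sketch, assuming PHP 7 or later and input that is already UTF-8. The name `strip_invalid_xml_chars` is made up for illustration and is not part of any library. One detail worth handling: with the `/u` modifier, `preg_replace()` returns `null` when the subject is not valid UTF-8, so the sketch turns that case into an exception instead of silently returning nothing.

```php
<?php
// Hypothetical helper (the name is an assumption, not an existing API).
// Removes every character outside the XML 1.0 allowed set:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function strip_invalid_xml_chars(string $str): string
{
    // Negated character class: match anything NOT in the allowed ranges.
    $pattern = '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u';

    $clean = preg_replace($pattern, '', $str);

    // With /u, preg_replace() returns null if $str is not valid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}

// Usage: the 0x19 byte is dropped, tabs/newlines and non-ASCII text survive.
$dirty = "Basket Makers\x19 caf\u{E9}\n";
$clean = strip_invalid_xml_chars($dirty);
```

Throwing on malformed input is a design choice for this sketch; a caller that prefers to salvage the text could first clean up the encoding and then apply the filter.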