In the last post, I told you how I wanted to send around a custom Message
object among services in our system using Amazon SQS messaging. I soon ran into another problem.
As a reminder, here is the set of allowed XML characters again.
#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]
I soon got another Amazon SQS exception. I inspected the hex dump and found a 0x19 byte in the text.
The 0x19 character is the END OF MEDIUM character, a control character that I imagine could end up in a web form when someone copies and pastes text from a Windows machine. In Firefox, 0x19 looks like this (right after the phrase “Basket Makers”).
In IE7, it looks like this.
One thing you might try is to strip out invalid UTF-8 characters using iconv.
$string = iconv("UTF-8", "UTF-8//IGNORE", $string);
But this won’t work: 0x19 is a valid UTF-8 character, just not a valid XML character. The only option is to filter out the invalid XML characters explicitly. Applying this filter (assuming the text is already in UTF-8) to instance values before serializing the object seems to work.
$str = preg_replace('/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $str);
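For anyone who wants to reuse the filter, here is a minimal sketch of it wrapped in a function. The name strip_invalid_xml_chars is made up for this example, and it assumes the input is already UTF-8. Two details matter in the pattern: PCRE needs the \x{...} form for code points above 0xFF, and the /u modifier so that both pattern and subject are treated as UTF-8. With /u, preg_replace returns null when the subject is not valid UTF-8, so the sketch checks for that.

```php
<?php
// Hypothetical helper (name is an assumption, not from any library).
// Removes every character outside the XML 1.0 allowed set:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function strip_invalid_xml_chars($str)
{
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
        '',
        $str
    );

    // With the /u modifier, preg_replace returns null if $str is not valid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}

// Usage: the 0x19 byte is dropped, everything else is kept.
$text = strip_invalid_xml_chars("Basket Makers\x19!");
```

Tabs, newlines, and carriage returns survive because they are in the allowed set; other control characters such as 0x00 or 0x19 are removed. If you would rather reject bad input than silently drop characters, you could compare the result with the original string instead.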
3 Comments to 'Preventing XML-Restricted Characters'
January 7, 2009
Have you tried using XMLReader as you process the stream? That should take care of things like that.
January 8, 2009
Hmmm, I have never used XMLReader. Will look into that next time. Of course, we weren’t processing large amounts of XML. We were sending and receiving our own XML messages.
January 8, 2009
I am talking about PHP XMLReader/Writer. It is just a SAX parser. It is quick, and it should be able to escape those characters correctly. You can also use it to validate your XML. Do not use SimpleXML because it does not escape.