Most of our live production code was written (by me) without any attention paid to character encodings. Fortunately, nearly every link in the LAMP chain seems to default to ISO-8859-1, so for the most part things have simply worked as ISO-8859-1. Every now and then a UTF-8 character will pop up, and we’ll either change the character in the database, or someone will try random combinations of htmlentities() and mb_convert_encoding() in some random file until it looks right in that particular case. It’s one of those cases of building up a smidgen of technical debt. Doing it the right way and switching all of our code, databases, and data from ISO-8859-1 to UTF-8 at this point makes me shudder.
For our newer systems coming online, I really wanted to get this character encoding problem right. Since we started from scratch, all the necessary endpoints were written to support UTF-8 encoded text. We also made sure that all incoming data was UTF-8 encoded; if it was not, we converted it, basically using this single line.
$string = mb_convert_encoding($string, 'UTF-8');
But something was wrong. When I tried to convert a single smart quote (’) generated on my Windows machine and view it in my browser, it simply disappeared. Trawling the PHP manual for a solution (as usual), I came upon it on the manual page for utf8_encode().
Note that you should only use utf8_encode() on ISO-8859-1 data, and not on data using the Windows-1252 codepage. Microsoft’s Windows-1252 codepage contains ISO-8859-1, but it includes several characters in the range 0x80-0x9F whose codepoints in Unicode do not match the byte’s value (in Unicode, codepoints U+80 – U+9F are unassigned).
utf8_encode() simply assumes the byte’s integer value is the codepoint number in Unicode.
What this means is that, for example, a single smart quote (’) sent to PHP as the Windows-1252 byte 0x92, but converted to UTF-8 with utf8_encode() as if it were ISO-8859-1, will not become the proper multi-byte character (U+2019). It ends up as code point U+0092 instead, and thus will either appear as garbage in the browser or not at all (in my case it’s not at all, since those code points have no printable glyph).
Since no third argument is given, mb_convert_encoding() will use the default internal encoding for that platform. Unfortunately, PHP uses ISO-8859-1 on Windows instead of the so-similar-yet-different-it’s-annoying-that-it-must-be-a-Microsoft-product Windows-1252 encoding, which mostly overlaps with ISO-8859-1 but has different values for certain non-control, non-ASCII punctuation characters.
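To make the mismatch concrete, here is a minimal sketch that converts the same byte twice, once as ISO-8859-1 and once as Windows-1252. It passes the source encoding explicitly instead of relying on the platform default, and it assumes your mbstring build recognizes the 'Windows-1252' encoding name. The variable names are mine, not from our codebase.

```php
<?php
// Byte 0x92 is the right single quotation mark (’) in Windows-1252.
$byte = "\x92";

// Treated as ISO-8859-1, the byte value is used as the code point,
// giving U+0092, an invisible C1 control character.
$asLatin1 = mb_convert_encoding($byte, 'UTF-8', 'ISO-8859-1');

// Treated as Windows-1252, the byte maps to U+2019, the intended smart quote.
$asCp1252 = mb_convert_encoding($byte, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c292
echo bin2hex($asCp1252), "\n"; // e28099
```

The first result is the two-byte sequence that shows up as nothing in the browser; the second is the three-byte sequence for the quote I actually typed.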
The solution, fortunately, was on the same manual page: a function with a hard-coded mapping that replaces all the incorrectly converted Windows-1252 characters with their correct UTF-8 values.
So I modified the above line of code to look like the following, and I could see my smart quotes once again.
$string = strtr(mb_convert_encoding($string, 'UTF-8'), self::$_cp1252_map);
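For readers who have not seen that manual note, here is a rough sketch of the idea. The map below is a hypothetical, partial stand-in for self::$_cp1252_map (the real one covers the whole 0x80-0x9F range), and the function name cp1252_to_utf8() is made up for illustration. Unlike the line above, it passes 'ISO-8859-1' explicitly as the source encoding so it does not depend on the configured default.

```php
<?php
// Hypothetical, partial map. Keys are the UTF-8 bytes you get when a
// Windows-1252 byte is wrongly converted as ISO-8859-1; values are the
// UTF-8 bytes of the character that was actually meant.
$cp1252Map = [
    "\xc2\x80" => "\xe2\x82\xac", // 0x80 -> U+20AC euro sign
    "\xc2\x85" => "\xe2\x80\xa6", // 0x85 -> U+2026 horizontal ellipsis
    "\xc2\x91" => "\xe2\x80\x98", // 0x91 -> U+2018 left single quote
    "\xc2\x92" => "\xe2\x80\x99", // 0x92 -> U+2019 right single quote
    "\xc2\x93" => "\xe2\x80\x9c", // 0x93 -> U+201C left double quote
    "\xc2\x94" => "\xe2\x80\x9d", // 0x94 -> U+201D right double quote
    "\xc2\x96" => "\xe2\x80\x93", // 0x96 -> U+2013 en dash
    "\xc2\x97" => "\xe2\x80\x94", // 0x97 -> U+2014 em dash
    "\xc2\x99" => "\xe2\x84\xa2", // 0x99 -> U+2122 trade mark sign
];

function cp1252_to_utf8(string $string, array $map): string
{
    // Step 1: convert as ISO-8859-1 (source encoding given explicitly).
    $utf8 = mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1');

    // Step 2: patch up the sequences that came from Windows-1252 bytes.
    return strtr($utf8, $map);
}

// "It’s" as typed on a Windows machine: the quote arrives as byte 0x92.
$fixed = cp1252_to_utf8("It\x92s", $cp1252Map);
echo bin2hex($fixed), "\n";
```

Plain ASCII and ordinary ISO-8859-1 characters pass through step 2 untouched, because only the listed two-byte sequences are replaced.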
In the last post, I told you how I wanted to send around a custom Message object among services in our system using Amazon SQS messaging. I soon ran into another problem.
As a reminder, here is the set of allowed XML characters again.
#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]
I soon got another Amazon SQS exception. I inspected the hex dump and found a 0x19 byte in the text.
The 0x19 character is the END OF MEDIUM character, a control character I imagine you might copy-and-paste from a Windows machine into a web form. In Firefox, 0x19 looks like this (right after the phrase “Basket Makers”).
In IE7, it looks like this.
One thing you might try is to strip out invalid UTF-8 characters using iconv.
$string = iconv("UTF-8", "UTF-8//IGNORE", $string);
But this won’t work. 0x19 is a valid UTF-8 character, but an invalid XML character. No choice but to explicitly filter out invalid XML values. Using this filter (assuming text is in UTF-8 already) on instance values before serializing the object seems to work.
$str = preg_replace('/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $str);
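Here is a slightly fuller sketch of that filter, wrapped in a function so the failure case is visible. The function name strip_invalid_xml_chars() is hypothetical. It relies on the /u modifier so the pattern matches code points rather than bytes, and on the fact that preg_replace() returns null when the subject is not valid UTF-8, which is why the text has to be converted first.

```php
<?php
function strip_invalid_xml_chars(string $str): string
{
    // Remove every code point outside the set of allowed XML characters.
    // \x{...} escapes together with /u make PCRE work on code points.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
        '',
        $str
    );

    // With /u, preg_replace() returns null if $str is not valid UTF-8.
    if ($clean === null) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }

    return $clean;
}

$input = "Basket Makers\x19 and friends";

// 0x19 passes a UTF-8 validity check, so UTF-8 cleanup alone will not remove it.
var_dump(mb_check_encoding($input, 'UTF-8'));

// The XML filter does remove it.
$output = strip_invalid_xml_chars($input);
echo $output, "\n";
```

Tabs, newlines, and carriage returns survive the filter, since they are the three control characters XML explicitly allows.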