Tag: unicode

PCRE Fail

Here’s a good reason to keep all your development and production environments the same.  The task was simple enough.  I wanted to strip a UTF-8 encoded string of all punctuation.  Here’s some example code that does it, using PHP’s PCRE library.

<?php
 
// remove everything but letters, numbers, and spaces
// the 'u' modifier enables UTF-8
 
$string = "TAGholy! “moley”.    & bát's were _killed ^%by ; dogs, for £50 ümlauts";
$string = preg_replace('/[^\p{L}\p{N}\p{Zs}]+/u', '', $string);
echo "$string\n";

On PHP 5.2.4:

% php -i | grep PCRE
PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 6.6 06-Feb-2006
% php pcre_test.php
ss

On PHP 5.2.6:

% php -i | grep PCRE
PCRE (Perl Compatible Regular Expressions) Support => enabled
PCRE Library Version => 7.6 2008-01-28
% php pcre_test.php
TAGholy moley     báts were killed by  dogs for 50 ümlauts

Same script, same input; the only visible difference is the PCRE library version (6.6 versus 7.6).  Only took me a day to figure out.
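One way to catch this kind of mismatch early is to probe the PCRE build at runtime rather than trusting the PHP version number.  The following is a minimal sketch, not code from the original project: pcre_unicode_works() is a hypothetical helper name and the probe strings are illustrative.  It prints the PCRE_VERSION constant and checks that \p{L} in /u mode accepts a UTF-8 letter and rejects ASCII punctuation.

```php
<?php
// Hypothetical helper: returns true when the linked PCRE library handles
// UTF-8 mode and Unicode property classes the way the regex above expects.
function pcre_unicode_works()
{
    // "\xC3\xA9" is the UTF-8 encoding of e-acute (one letter, two bytes).
    // Warnings are suppressed because an old PCRE build may fail to compile
    // the pattern, in which case preg_match() returns false.
    $letter = @preg_match('/^\p{L}$/u', "\xC3\xA9");
    $punct  = @preg_match('/^\p{L}$/u', '!');

    return $letter === 1 && $punct === 0;
}

echo 'PCRE version: ', PCRE_VERSION, "\n";
echo pcre_unicode_works()
    ? "Unicode properties look OK\n"
    : "Unicode properties look broken\n";
```

Running a check like this on every environment would likely have surfaced the difference in minutes instead of a day.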

Converting ’s Correctly

Most of our live production code was written (by me) without any attention paid to character encodings.  Fortunately, nearly every link in the LAMP chain seems to default to ISO-8859-1, so for the most part things have just worked.  Every now and then a UTF-8 character will pop up, and we’ll either change the character in the database, or someone will try random combinations of htmlentities() and mb_convert_encoding() in some file until it looks right in that particular case.  It’s a classic case of building up a smidgen of technical debt.  The thought of doing it the right way and switching all of our code, databases, and data from ISO-8859-1 to UTF-8 at this point makes me shudder.

For our newer systems coming online, I really wanted to get this character encoding problem right.  Since we started from scratch, all the necessary endpoints were written to support UTF-8 encoded text.  We also made sure that all incoming data was UTF-8 encoded.  Anything that was not, we converted using essentially this single line.

$string = mb_convert_encoding($string, 'UTF-8');

But something was wrong.  When I tried to convert a single smart quote (’) generated on my Windows machine and view it in my browser, it simply disappeared.  Trawling the PHP manual for a solution (as usual), I came upon it on the manual page for utf8_encode().

Note that you should only use utf8_encode() on ISO-8859-1 data, and not on data using the Windows-1252 codepage. Microsoft’s Windows-1252 codepage contains ISO-8859-1, but it includes several characters in the range 0x80-0x9F whose codepoints in Unicode do not match the byte’s value (in Unicode, codepoints U+80 – U+9F are unassigned).

utf8_encode() simply assumes the byte’s integer value is the codepoint number in Unicode.

What this means is that, for example, a single smart quote (’), sent to PHP as Windows-1252 byte 0x92 but treated as ISO-8859-1, and converted to UTF-8 using utf8_encode(), will not convert to the proper multi-byte character.  It becomes code point U+0092 instead of U+2019, and thus will either appear as garbage in the browser or not at all (in my case not at all, since that code point has no printable character).
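The difference is easy to see at the byte level.  This is a small sketch, assuming the mbstring extension is loaded and that the build recognizes the 'Windows-1252' encoding name; it uses mb_convert_encoding() with an explicit source encoding in place of utf8_encode(), and the variable names are illustrative.

```php
<?php
// Byte 0x92 is the right single quotation mark in Windows-1252.
$quote = "\x92";

// Treated as ISO-8859-1, the byte value is taken as the code point: U+0092.
$asLatin1 = mb_convert_encoding($quote, 'UTF-8', 'ISO-8859-1');

// Treated as Windows-1252, the byte maps to U+2019, the intended character.
$asCp1252 = mb_convert_encoding($quote, 'UTF-8', 'Windows-1252');

echo bin2hex($asLatin1), "\n"; // c292
echo bin2hex($asCp1252), "\n"; // e28099
```

The first result is a valid UTF-8 sequence, which is why nothing errors out; it just encodes the wrong character.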

Since no third argument (the source encoding) is given, mb_convert_encoding() will use the default internal encoding for that platform.  Unfortunately, PHP uses ISO-8859-1 on Windows instead of the so-similar-yet-different-it’s-annoying-that-it-must-be-a-Microsoft-product Windows-1252 encoding, which mostly overlaps with ISO-8859-1 but puts printable punctuation such as smart quotes in the 0x80-0x9F range, where ISO-8859-1 has only control characters.

Fortunately, the solution was on the same manual page: a function with a hard-coded mapping that replaces each incorrectly converted Windows-1252 character with its correct UTF-8 value.

So I modified the above line of code to look like the following, and I could see my smart quotes once again.

$string = strtr(mb_convert_encoding($string, 'UTF-8'), self::$_cp1252_map);
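The map itself is not shown above, so here is a partial, illustrative sketch of what such a table can look like.  It is not the original $_cp1252_map: the entries cover only a handful of common punctuation bytes, cp1252_to_utf8() is a hypothetical wrapper name, and the source encoding is passed explicitly as 'ISO-8859-1' because the default internal encoding may differ between PHP versions and configurations.

```php
<?php
// Keys: the UTF-8 bytes produced when a Windows-1252 byte is wrongly
// treated as ISO-8859-1.  Values: the UTF-8 bytes of the intended character.
$cp1252_map = array(
    "\xc2\x80" => "\xe2\x82\xac", // 0x80 euro sign
    "\xc2\x85" => "\xe2\x80\xa6", // 0x85 horizontal ellipsis
    "\xc2\x91" => "\xe2\x80\x98", // 0x91 left single quotation mark
    "\xc2\x92" => "\xe2\x80\x99", // 0x92 right single quotation mark
    "\xc2\x93" => "\xe2\x80\x9c", // 0x93 left double quotation mark
    "\xc2\x94" => "\xe2\x80\x9d", // 0x94 right double quotation mark
    "\xc2\x96" => "\xe2\x80\x93", // 0x96 en dash
    "\xc2\x97" => "\xe2\x80\x94", // 0x97 em dash
);

// Hypothetical wrapper: convert as ISO-8859-1 first, then patch up the
// code points that Windows-1252 assigns differently.
function cp1252_to_utf8($string, array $map)
{
    return strtr(mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1'), $map);
}

echo bin2hex(cp1252_to_utf8("\x92", $cp1252_map)), "\n"; // e28099
```

Bytes outside the map, such as 0xE9 (é), pass through the ordinary ISO-8859-1 conversion untouched, so the two encodings only need special handling where they disagree.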