<?xml version="1.0" encoding="UTF-8"?> <rss
version="2.0"
xmlns:content="http://purl.org/rss/1.0/modules/content/"
xmlns:wfw="http://wellformedweb.org/CommentAPI/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:atom="http://www.w3.org/2005/Atom"
xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
> <channel><title>Straylight Run &#187; utf-8</title> <atom:link href="http://blog.straylightrun.net/tag/utf-8/feed/" rel="self" type="application/rss+xml" /><link>http://blog.straylightrun.net</link> <description>Software, Technology, PHP</description> <lastBuildDate>Mon, 07 Nov 2011 19:26:59 +0000</lastBuildDate> <language>en</language> <sy:updatePeriod>hourly</sy:updatePeriod> <sy:updateFrequency>1</sy:updateFrequency> <generator>http://wordpress.org/?v=3.3.1</generator> <item><title>Converting &#8217;s Correctly</title><link>http://blog.straylightrun.net/2009/01/09/converting-s-correctly/</link> <comments>http://blog.straylightrun.net/2009/01/09/converting-s-correctly/#comments</comments> <pubDate>Fri, 09 Jan 2009 18:55:05 +0000</pubDate> <dc:creator>gerard</dc:creator> <category><![CDATA[Coding]]></category> <category><![CDATA[encoding]]></category> <category><![CDATA[unicode]]></category> <category><![CDATA[utf-8]]></category> <guid
isPermaLink="false">http://blog.straylightrun.net/?p=91</guid> <description><![CDATA[Most of our live production code was written (by me) without any attention paid to character encodings.  Fortunately, nearly every link in the LAMP chain seems to default to ISO-8859-1 nicely, so things have worked out for the most part.  Every now and then a UTF-8 character will pop up, and we&#8217;ll either [...]]]></description> <content:encoded><![CDATA[<p>Most of our live production code was written (by me) without any attention paid to character encodings.  Fortunately, nearly every link in the LAMP chain seems to default to ISO-8859-1 nicely, so things have worked out for the most part.  Every now and then a UTF-8 character will pop up, and we&#8217;ll either change the character in the database, or someone will use random combinations of <a
href="http://php.net/htmlentities">htmlentities()</a> and <a
href="http://php.net/mb_convert_encoding">mb_convert_encoding()</a> in some random file until it looks right in that particular case.  It&#8217;s one of those cases of building up a smidgen of <a
href="http://blog.straylightrun.net/2008/12/19/technical-debt-eventually-the-man-will-collect/">technical debt</a>.  Doing it the right way and switching all of our code, databases, and data from ISO-8859-1 to UTF-8 at this point makes me shudder.</p><p>For our newer systems coming online, I really wanted to get this character encoding problem right.  Since we started from scratch, all the necessary endpoints were written to support UTF-8 encoded text.  We also made sure that all incoming data was UTF-8 encoded.  If it was not, we converted it using essentially this single line.</p><div
class="wp_syntax"><div
class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">mb_convert_encoding</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'UTF-8'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div><p>But something was wrong.  When I tried to convert a single smart quote (’) generated on my Windows machine and view it in my browser, it simply disappeared.  Trawling the PHP manual for a solution (as usual), I came upon it on the <a
href="http://us2.php.net/manual/en/function.utf8-encode.php#44843">manual page for utf8_encode()</a>.</p><blockquote><p>Note that you should only use utf8_encode() on ISO-8859-1 data, and not on data using the Windows-1252 codepage. Microsoft&#8217;s Windows-1252 codepage contains ISO-8859-1, but it includes several characters in the range 0x80-0x9F whose codepoints in Unicode do not match the byte&#8217;s value (in Unicode, codepoints U+80 &#8211; U+9F are unassigned).</p><p>utf8_encode() simply assumes the bytes integer value is the codepoint number in Unicode.</p></blockquote><p>What this means is that, for example, a single smart quote (’) sent to PHP as a Windows-1252 byte, but treated as ISO-8859-1 and converted to UTF-8 using utf8_encode(), will not convert to the proper multi-byte character, and thus will either appear as garbage in the browser or not at all (in fact it&#8217;s not at all, since code points in that range are not printable characters).</p><p>Since no third argument is given, mb_convert_encoding() will use the default internal encoding for that platform as the source encoding, so it runs into the same problem.  Unfortunately, PHP uses ISO-8859-1 on Windows instead of the <em>so-similar-yet-different-it&#8217;s-annoying-that-it-must-be-a-Microsoft-product</em> Windows-1252 encoding, which mostly overlaps with ISO-8859-1 but has different values for certain non-control, non-ASCII punctuation characters.</p><p>Fortunately, the solution was on the same manual page: a <a
href="http://us2.php.net/manual/en/function.utf8-encode.php#45226">function with a hard-coded mapping</a> that replaces all the incorrectly converted Windows-1252 characters with their correct UTF-8 values.</p><p>So I modified the above line of code to look like the following, and I could see my smart quotes once again.</p><div
class="wp_syntax"><div
class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">strtr</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">mb_convert_encoding</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'UTF-8'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">self</span><span style="color: #339933;">::</span><span style="color: #000088;">$_cp1252_map</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div> ]]></content:encoded> <wfw:commentRss>http://blog.straylightrun.net/2009/01/09/converting-s-correctly/feed/</wfw:commentRss> <slash:comments>0</slash:comments> </item> <item><title>Preventing XML-Restricted Characters</title><link>http://blog.straylightrun.net/2009/01/07/preventing-xml-restricted-characters/</link> <comments>http://blog.straylightrun.net/2009/01/07/preventing-xml-restricted-characters/#comments</comments> <pubDate>Wed, 07 Jan 2009 19:03:26 +0000</pubDate> <dc:creator>gerard</dc:creator> <category><![CDATA[Coding]]></category> <category><![CDATA[amazon]]></category> <category><![CDATA[encoding]]></category> <category><![CDATA[sqs]]></category> <category><![CDATA[utf-8]]></category> <category><![CDATA[xml]]></category> <guid
isPermaLink="false">http://blog.straylightrun.net/?p=87</guid> <description><![CDATA[In the last post, I told you how I wanted to send around a custom Message object among services in our system using Amazon SQS messaging.  I soon ran into another problem. As a reminder, here is the set of allowed XML characters again. #x9 &#124; #xA &#124; #xD &#124; [#x20 to #xD7FF] &#124; [#xE000 [...]]]></description> <content:encoded><![CDATA[<p>In the <a
href="http://blog.straylightrun.net/2009/01/05/sending-a-serialized-object-to-amazon-sqs/">last post</a>, I told you how I wanted to send around a custom <code>Message</code> object among services in our system using Amazon SQS messaging.  I soon ran into another problem.</p><p>As a reminder, here is the set of allowed XML characters again.</p><blockquote><p>#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]</p></blockquote><p>I got another Amazon SQS exception.  I inspected the hex dump and found a 0x19 byte in the text.</p><p>The <a
href="http://www.fileformat.info/info/unicode/char/0019/index.htm">0x19 character is the END OF MEDIUM character</a>, a control character I imagine you might copy-and-paste from a Windows machine into a web form.  In Firefox, 0x19 looks like this (right after the phrase &#8220;Basket Makers&#8221;).</p><p
style="text-align: center;"><a
href="http://blog.straylightrun.net/wp-content/uploads/2009/01/ff-xml-char.jpg"><img
style="text-align: center; border-width: 0px;" src="http://blog.straylightrun.net/wp-content/uploads/2009/01/ff-xml-char-thumb.jpg" border="0" alt="EOM (0x19) in Firefox" width="354" height="127" /></a></p><p>In IE7, it looks like this.</p><p
style="text-align: center;"><a
href="http://blog.straylightrun.net/wp-content/uploads/2009/01/ie-xml-char.jpg"><img
style="text-align: center; border-top-width: 0px; border-left-width: 0px; border-bottom-width: 0px; border-right-width: 0px" src="http://blog.straylightrun.net/wp-content/uploads/2009/01/ie-xml-char-thumb.jpg" border="0" alt="EOM (0x19) in IE7" width="340" height="130" /></a></p><p>One thing you might try is to strip out invalid UTF-8 characters using <a
href="http://us.php.net/manual/en/function.iconv.php">iconv</a>.</p><div
class="wp_syntax"><div
class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">iconv</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;UTF-8&quot;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">&quot;UTF-8//IGNORE&quot;</span><span style="color: #339933;">,</span> <span style="color: #000088;">$string</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div><p>But this won&#8217;t work.  0x19 is a <em>valid</em> UTF-8 character, but an <em>invalid</em> XML character.  There is no choice but to explicitly filter out invalid XML values.  Applying this filter (assuming the text is already UTF-8) to instance values before serializing the object seems to work.</p><div
class="wp_syntax"><div
class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000088;">$str</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">,</span> <span style="color: #000088;">$str</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div> ]]></content:encoded> <wfw:commentRss>http://blog.straylightrun.net/2009/01/07/preventing-xml-restricted-characters/feed/</wfw:commentRss> <slash:comments>3</slash:comments> </item> </channel> </rss>
