<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Straylight Run &#187; utf-8</title>
	<atom:link href="http://blog.straylightrun.net/tag/utf-8/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.straylightrun.net</link>
	<description>Software, Technology, PHP</description>
	<lastBuildDate>Tue, 11 May 2010 03:53:07 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Converting &#8217;s Correctly</title>
		<link>http://blog.straylightrun.net/2009/01/09/converting-s-correctly/</link>
		<comments>http://blog.straylightrun.net/2009/01/09/converting-s-correctly/#comments</comments>
		<pubDate>Fri, 09 Jan 2009 18:55:05 +0000</pubDate>
		<dc:creator>gerard</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[encoding]]></category>
		<category><![CDATA[unicode]]></category>
		<category><![CDATA[utf-8]]></category>

		<guid isPermaLink="false">http://blog.straylightrun.net/?p=91</guid>
		<description><![CDATA[Most of our live production code was written (by me) without any attention paid to character encodings.  Fortunately, nearly every link in the LAMP chain seems to default to ISO-8859-1 nicely, so things have worked out for the most part as that.  Every now and then a UTF-8 character will pop up, and we&#8217;ll either [...]]]></description>
			<content:encoded><![CDATA[<p>Most of our live production code was written (by me) without any attention paid to character encodings.  Fortunately, nearly every link in the LAMP chain seems to default to ISO-8859-1, so for the most part things have just worked.  Every now and then a UTF-8 character will pop up, and we&#8217;ll either change the character in the database, or someone will try random combinations of <a href="http://php.net/htmlentities">htmlentities()</a> and <a href="http://php.net/mb_convert_encoding">mb_convert_encoding()</a> in whichever file is affected until it looks right in that particular case.  It&#8217;s one of those cases of building up a smidgen of <a href="http://blog.straylightrun.net/2008/12/19/technical-debt-eventually-the-man-will-collect/">technical debt</a>.  The thought of doing it the right way and switching all of our code, databases, and data from ISO-8859-1 to UTF-8 at this point makes me shudder.</p>
<p>For our newer systems coming online, I really wanted to get character encoding right.  Since we started from scratch, all the necessary endpoints were written to support UTF-8 encoded text, and we made sure that all incoming data was UTF-8 encoded.  If it was not, we converted it using essentially this single line.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">mb_convert_encoding</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'UTF-8'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>But something was wrong.  When I tried to convert a single smart quote (’) generated on my Windows machine and view it in my browser, it simply disappeared.  Trawling the PHP manual for a solution (as usual), I came upon it on the <a href="http://us2.php.net/manual/en/function.utf8-encode.php#44843">manual page for utf8_encode()</a>.</p>
<blockquote><p>Note that you should only use utf8_encode() on ISO-8859-1 data, and not on data using the Windows-1252 codepage. Microsoft&#8217;s Windows-1252 codepage contains ISO-8859-1, but it includes several characters in the range 0x80-0x9F whose codepoints in Unicode do not match the byte&#8217;s value (in Unicode, codepoints U+80 &#8211; U+9F are unassigned).</p>
<p>utf8_encode() simply assumes the bytes integer value is the codepoint number in Unicode.</p></blockquote>
<p>What this means is that, for example, a single smart quote (’) typed on Windows arrives as the Windows-1252 byte 0x92.  If PHP treats that byte as ISO-8859-1 and converts it to UTF-8 using utf8_encode(), the result is the code point U+0092 rather than the proper multi-byte character for U+2019, and thus it will either appear as garbage in the browser or not at all (in my case not at all, since U+0092 is a non-printing control character).</p>
<p>Since no third argument is given, mb_convert_encoding() will use the default internal encoding for that platform.  Unfortunately, PHP uses ISO-8859-1 on Windows instead of the <em>so-similar-yet-different-it&#8217;s-annoying-that-it-must-be-a-Microsoft-product</em> Windows-1252 encoding, which mostly overlaps with ISO-8859-1 but has different values for certain non-control, non-ASCII punctuation characters.</p>
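<p>To make the failure concrete, here is a minimal sketch (an illustration only, not code from our system) that converts the single byte 0x92, the Windows-1252 smart quote, while naming ISO-8859-1 explicitly as the source encoding, which is what utf8_encode() effectively assumes.  The variable names are hypothetical.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// 0x92 is the right single quotation mark in Windows-1252.
$quote = "\x92";

// Treating the byte as ISO-8859-1 maps it straight to code point U+0092.
$wrong = mb_convert_encoding($quote, 'UTF-8', 'ISO-8859-1');

echo bin2hex($wrong), "\n"; // c292, the UTF-8 encoding of U+0092</pre></div></div>

<p>The correct UTF-8 encoding of the smart quote (U+2019) is the three bytes e2 80 99, so the output above is the wrong character, not merely a differently encoded one.</p>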
<p>Fortunately, the solution was on the same manual page: a <a href="http://us2.php.net/manual/en/function.utf8-encode.php#45226">function with a hard-coded mapping</a> that replaces all the incorrectly converted Windows-1252 characters with their correct UTF-8 values.</p>
<p>So I modified the original conversion line to look like the following, and I could see my smart quotes once again.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">strtr</span><span style="color: #009900;">&#40;</span><span style="color: #990000;">mb_convert_encoding</span><span style="color: #009900;">&#40;</span><span style="color: #000088;">$string</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">'UTF-8'</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">,</span> <span style="color: #000000; font-weight: bold;">self</span><span style="color: #339933;">::</span><span style="color: #000088;">$_cp1252_map</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>
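<p>The map itself is not shown here.  As a rough sketch of what such a map might contain, assuming only five entries (the function on the manual page covers the rest of the 0x80-0x9F range), something like the following would do.  The names <code>$cp1252_map</code> and <code>to_utf8()</code> are hypothetical, and this version names ISO-8859-1 explicitly instead of relying on the default.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">// Partial, illustrative map: keys are the UTF-8 bytes produced by the
// ISO-8859-1 conversion, values are the UTF-8 bytes of the intended character.
$cp1252_map = array(
    "\xc2\x80" => "\xe2\x82\xac", // 0x80 euro sign
    "\xc2\x91" => "\xe2\x80\x98", // 0x91 left single quote
    "\xc2\x92" => "\xe2\x80\x99", // 0x92 right single quote
    "\xc2\x93" => "\xe2\x80\x9c", // 0x93 left double quote
    "\xc2\x94" => "\xe2\x80\x9d", // 0x94 right double quote
);

function to_utf8($string, array $map)
{
    // The source encoding is named explicitly rather than left to the default.
    return strtr(mb_convert_encoding($string, 'UTF-8', 'ISO-8859-1'), $map);
}

echo bin2hex(to_utf8("\x92", $cp1252_map)), "\n"; // e28099</pre></div></div>

<p>If the input is known to be Windows-1252, passing 'Windows-1252' as the third argument to mb_convert_encoding() should avoid the need for a map at all, though how it treats the few bytes that Windows-1252 leaves undefined may vary.</p>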

]]></content:encoded>
			<wfw:commentRss>http://blog.straylightrun.net/2009/01/09/converting-s-correctly/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Preventing XML-Restricted Characters</title>
		<link>http://blog.straylightrun.net/2009/01/07/preventing-xml-restricted-characters/</link>
		<comments>http://blog.straylightrun.net/2009/01/07/preventing-xml-restricted-characters/#comments</comments>
		<pubDate>Wed, 07 Jan 2009 19:03:26 +0000</pubDate>
		<dc:creator>gerard</dc:creator>
				<category><![CDATA[Coding]]></category>
		<category><![CDATA[amazon]]></category>
		<category><![CDATA[encoding]]></category>
		<category><![CDATA[sqs]]></category>
		<category><![CDATA[utf-8]]></category>
		<category><![CDATA[xml]]></category>

		<guid isPermaLink="false">http://blog.straylightrun.net/?p=87</guid>
		<description><![CDATA[In the last post, I told you how I wanted to send around a custom Message object among services in our system using Amazon SQS messaging.  I soon ran into another problem.
As a reminder, here is the set of allowed XML characters again.
#x9 &#124; #xA &#124; #xD &#124; [#x20 to #xD7FF] &#124; [#xE000 to #xFFFD] [...]]]></description>
			<content:encoded><![CDATA[<p>In the <a href="http://blog.straylightrun.net/2009/01/05/sending-a-serialized-object-to-amazon-sqs/">last post</a>, I told you how I wanted to send around a custom <code>Message</code> object among services in our system using Amazon SQS messaging.  I soon ran into another problem.</p>
<p>As a reminder, here is the set of allowed XML characters again.</p>
<blockquote><p>#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]</p></blockquote>
<p>I soon got another Amazon SQS exception.  I inspected the hex dump and found a 0x19 byte in the text.</p>
<p>The <a href="http://www.fileformat.info/info/unicode/char/0019/index.htm">0x19 character is the END OF MEDIUM character</a>, a control character that I imagine gets copied and pasted from a Windows machine into a web form.  In Firefox, 0x19 looks like this (right after the phrase &#8220;Basket Makers&#8221;).</p>
<p style="text-align: center;"><a href="http://blog.straylightrun.net/wp-content/uploads/2009/01/ff-xml-char.jpg"><img style="text-align: center; border-width: 0px;" src="http://blog.straylightrun.net/wp-content/uploads/2009/01/ff-xml-char-thumb.jpg" border="0" alt="EOM (0x19) in Firefox" width="354" height="127" /></a></p>
<p>In IE7, it looks like this.</p>
<p style="text-align: center;"><a href="http://blog.straylightrun.net/wp-content/uploads/2009/01/ie-xml-char.jpg"><img style="text-align: center; border-top-width: 0px; border-left-width: 0px; border-bottom-width: 0px; border-right-width: 0px" src="http://blog.straylightrun.net/wp-content/uploads/2009/01/ie-xml-char-thumb.jpg" border="0" alt="EOM (0x19) in IE7" width="340" height="130" /></a></p>
<p>One thing you might try is to strip out invalid UTF-8 characters using <a href="http://us.php.net/manual/en/function.iconv.php">iconv</a>.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #000088;">$string</span> <span style="color: #339933;">=</span> <span style="color: #990000;">iconv</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">&quot;UTF-8&quot;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">&quot;UTF-8//IGNORE&quot;</span><span style="color: #339933;">,</span> <span style="color: #000088;">$string</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>But this won&#8217;t work.  0x19 is a <em>valid</em> UTF-8 character, but an <em>invalid</em> XML character.  There is no choice but to explicitly filter out the characters XML does not allow.  Applying this filter (assuming the text is already UTF-8) to instance values before serializing the object seems to work.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;"><span style="color: #666666; font-style: italic;">// \x{...} with the /u modifier matches whole code points; a character class takes no parentheses or pipes.</span>
<span style="color: #000088;">$str</span> <span style="color: #339933;">=</span> <span style="color: #990000;">preg_replace</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">'/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u'</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">''</span><span style="color: #339933;">,</span> <span style="color: #000088;">$str</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>
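<p>Wrapped up as a function, the filter might look like the sketch below.  <code>strip_invalid_xml_chars()</code> is a hypothetical name and the sample text is made up.  The pattern is written with \x{...} escapes and the /u modifier so that PCRE matches whole code points, which assumes the text is already valid UTF-8.  With /u, preg_replace() returns NULL on malformed input, so the sketch falls back to an empty string in that case.</p>

<div class="wp_syntax"><div class="code"><pre class="php" style="font-family:monospace;">function strip_invalid_xml_chars($str)
{
    // Remove every code point outside the allowed XML character set.
    $clean = preg_replace(
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
        '',
        $str
    );

    // preg_replace() returns null when $str is not valid UTF-8.
    return $clean === null ? '' : $clean;
}

// Made-up sample text containing the 0x19 control character.
$text = "Basket Makers\x19 of the Southwest";

// iconv() leaves the byte alone because 0x19 is valid UTF-8.
var_dump(iconv('UTF-8', 'UTF-8//IGNORE', $text) === $text); // bool(true)

echo strip_invalid_xml_chars($text), "\n"; // Basket Makers of the Southwest</pre></div></div>

<p>Tabs, line feeds, and carriage returns survive the filter, since they are the three control characters the XML character set allows.</p>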

]]></content:encoded>
			<wfw:commentRss>http://blog.straylightrun.net/2009/01/07/preventing-xml-restricted-characters/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>