Editor’s note: This post formed the basis of the Front-End Optimization talk I’ve given in the past.
You’ve programmed websites for years and know the ins and outs of PHP and MySQL, so why are JavaScript and CSS files such a big deal? You put them in a directory and link to them from your pages. Done. Right?
Not if you want maximum performance.
According to the Yahoo Exceptional Performance team:
…Only 10% of the time is spent here for the browser to request the HTML page, and for apache to stitch together the HTML and return the response back to the browser. The other 90% of the time is spent fetching other components in the page including images, scripts and stylesheets.
So static content is very important. The same Yahoo people provide us with a comprehensive list of Best (Front-end) Practices for Speeding Up Your Website. IMO, some of the rules are more important than others, and some are more easily achieved. Leaving aside hardware solutions (static server, CDN, etc.) for now, let’s look at six of the rules:
mod_deflate in Apache. Rule 3 is a matter of configuring Apache. How to achieve the other five?
As I see it, there are three broad ways to achieve them.
<link rel="stylesheet" type="text/css" href="custom_handler.php?file1.css,file2.css" />
or something like that). It can also mean using mod_rewrite to direct incoming requests for CSS and JavaScript to go to a PHP script. Either way, there is processing on every page load. Caching the end-product helps. Still, there must be a better way. Don’t have a build server? That’s a whole other topic.
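The on-the-fly handler approach described above can be sketched roughly as follows. This is only an illustration under assumptions: the function name combineStaticFiles(), the directory layout, and the cache-key scheme are hypothetical and not taken from any existing handler.

```php
<?php
// Hypothetical helper: combine several static files into one cached bundle.
function combineStaticFiles(array $names, string $sourceDir, string $cacheDir): string
{
    $paths = [];
    foreach ($names as $name) {
        // basename() drops any directory part, so "../secret" cannot escape $sourceDir.
        $path = $sourceDir . '/' . basename($name);
        if (!is_file($path)) {
            throw new InvalidArgumentException("Unknown file: $name");
        }
        $paths[] = $path;
    }

    // The cache key changes whenever the file list or any modification time changes.
    $mtimes = array_map('filemtime', $paths);
    $key = md5(implode('|', $paths) . '|' . implode('|', $mtimes));
    $cacheFile = $cacheDir . '/' . $key . '.cache';

    if (!is_file($cacheFile)) {
        // Cache miss: concatenate the sources once and store the result.
        $combined = implode("\n", array_map('file_get_contents', $paths));
        file_put_contents($cacheFile, $combined, LOCK_EX);
    }

    return file_get_contents($cacheFile);
}
```

Even with the cache, PHP still runs on every request for the bundle, which is the cost the paragraph above complains about.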
Most of our live production code was written (by me) without any attention paid to character encodings. Fortunately, nearly every link in the LAMP chain seems to default to ISO-8859-1 nicely, so things have mostly worked out that way. Every now and then a UTF-8 character will pop up, and we’ll either change the character in the database, or someone will use random combinations of htmlentities() and mb_convert_encoding() in some random file until it looks right in that particular case. It’s one of those cases of building up a smidgen of technical debt. Doing it the right way and switching all of our code, databases, and data from ISO-8859-1 to UTF-8 at this point makes me shudder.
For our newer systems coming online, I really wanted to get this character encoding problem right. Since we started from scratch, all the necessary endpoints were written to support UTF-8 encoded text. And we made sure that all incoming data is UTF-8 encoded. If it was not, we converted it basically using this single line.
$string = mb_convert_encoding($string, 'UTF-8');
But something was wrong. When I tried to convert a single smart quote (’) generated on my Windows machine and view it in my browser, it simply disappeared. Trawling the PHP manual for a solution (as usual), I came upon it on the manual page for utf8_encode().
Note that you should only use utf8_encode() on ISO-8859-1 data, and not on data using the Windows-1252 codepage. Microsoft’s Windows-1252 codepage contains ISO-8859-1, but it includes several characters in the range 0x80-0x9F whose codepoints in Unicode do not match the byte’s value (in Unicode, codepoints U+80 – U+9F are unassigned).
utf8_encode() simply assumes the bytes integer value is the codepoint number in Unicode.
What this means is that, for example, a single smart quote (’) typed on Windows arrives as the Windows-1252 byte 0x92. If PHP treats that byte as ISO-8859-1 and converts it to UTF-8 using utf8_encode(), it becomes code point U+0092, a C1 control character rather than the intended U+2019, and thus will either appear as garbage in the browser or not at all (in my case, not at all).
Since no third argument is given, mb_convert_encoding() will use the default internal encoding for that platform. Unfortunately, PHP uses ISO-8859-1 on Windows instead of the so-similar-yet-different-it’s-annoying-that-it-must-be-a-Microsoft-product Windows-1252 encoding, which mostly overlaps with ISO-8859-1 but has different values for certain non-control, non-ASCII punctuation characters.
Fortunately, the solution was on the same manual page: a function with a hard-coded mapping that replaces each incorrectly converted Windows-1252 character with its correct UTF-8 value.
So I modified the above line of code to look like the following, and I could see my smart quotes once again.
$string = strtr(mb_convert_encoding($string, 'UTF-8'), self::$_cp1252_map);
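For reference, here is a minimal sketch of what such a fix-up might look like. The map is a partial, hypothetical stand-in for self::$_cp1252_map (only a few punctuation entries are shown), and toUtf8() is an illustrative name, not the original code. It passes 'ISO-8859-1' explicitly as the source encoding so the result does not depend on the platform default.

```php
<?php
// Partial map (assumption: the real one covers the whole 0x80-0x9F range).
// Keys:   UTF-8 bytes of the C1 code point produced by an ISO-8859-1 conversion.
// Values: UTF-8 bytes of the character Windows-1252 actually meant.
$cp1252Map = [
    "\xC2\x80" => "\xE2\x82\xAC", // U+20AC euro sign
    "\xC2\x85" => "\xE2\x80\xA6", // U+2026 horizontal ellipsis
    "\xC2\x91" => "\xE2\x80\x98", // U+2018 left single quote
    "\xC2\x92" => "\xE2\x80\x99", // U+2019 right single quote
    "\xC2\x93" => "\xE2\x80\x9C", // U+201C left double quote
    "\xC2\x94" => "\xE2\x80\x9D", // U+201D right double quote
    "\xC2\x96" => "\xE2\x80\x93", // U+2013 en dash
    "\xC2\x97" => "\xE2\x80\x94", // U+2014 em dash
];

function toUtf8(string $bytes, array $map): string
{
    // Step 1: treat every byte as an ISO-8859-1 code point.
    $utf8 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');
    // Step 2: repair the characters that were really Windows-1252.
    return strtr($utf8, $map);
}

// "It\x92s" is "It’s" as typed on a Windows machine.
$converted = toUtf8("It\x92s", $cp1252Map);
```

If the mbstring build in use recognizes the Windows-1252 encoding name, passing 'Windows-1252' as the third argument to mb_convert_encoding() should give the same result without a map.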
In the last post, I told you how I wanted to send around a custom Message object among services in our system using Amazon SQS messaging. I soon ran into another problem.
As a reminder, here is the set of allowed XML characters again.
#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]
I soon got another Amazon SQS exception. I inspected the hex dump and found a 0x19 byte in the text.
The 0x19 character is the END OF MEDIUM (EM) character, a control character that I imagine could be copied and pasted from a Windows machine into a web form. In Firefox, 0x19 looks like this (right after the phrase “Basket Makers”).
In IE7, it looks like this.
One thing you might try is to strip out invalid UTF-8 characters using iconv.
$string = iconv("UTF-8", "UTF-8//IGNORE", $string);
But this won’t work. 0x19 is a valid UTF-8 character, but an invalid XML character. No choice but to explicitly filter out invalid XML values. Using this filter (assuming text is in UTF-8 already) on instance values before serializing the object seems to work.
$str = preg_replace('/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u', '', $str);
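As a worked example, the filter can be wrapped in a small function. The name stripInvalidXmlChars() is hypothetical, and the sketch assumes the input is already valid UTF-8. With the /u modifier, preg_replace() returns null on malformed UTF-8, so that case is handled explicitly.

```php
<?php
// Hypothetical helper: remove every character outside the XML 1.0 Char range.
function stripInvalidXmlChars(string $text): string
{
    $clean = preg_replace(
        // \x{...} with /u matches Unicode code points, not raw bytes.
        '/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+/u',
        '',
        $text
    );
    if ($clean === null) {
        // preg_replace() signals malformed UTF-8 (or another PCRE error) with null.
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return $clean;
}

// The 0x19 byte from the hex dump, right after "Basket Makers".
$dirty = "Basket Makers\x19 and more";
$clean = stripInvalidXmlChars($dirty);
```

Tabs, newlines, and multi-byte characters such as smart quotes pass through untouched; only the disallowed control characters are dropped.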
In our service-oriented system, I wanted to send around a custom Message object. We currently use Amazon SQS for messaging, which requires that all message characters fall within the valid XML character range (according to W3C XML 1.0 spec). This range is:
#x9 | #xA | #xD | [#x20 to #xD7FF] | [#xE000 to #xFFFD] | [#x10000 to #x10FFFF]
Here’s some code to serialize the object and enqueue it to SQS.
class Message
{
    private $_msg;

    public function setMessage($msg)
    {
        $this->_msg = $msg;
    }
}

$obj = new Message;
$obj->setMessage('Hello world!');
$msg = serialize($obj);
enqueueToSQS($msg);
Unfortunately, this code produced this exception:
Amazon_SQS_Exception: An invalid binary character was found in the message body, the set of allowed characters is #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
What was going on?
I echoed out the serialized $msg var and looked at it carefully. There was a lot of funny punctuation, but nothing outside of the XML character range.
O:7:"Message":1:{s:13:" Message _msg";s:12:"Hello world!";}
When in doubt, look at the binary. I wrote $msg out to a file, and looked at the file’s hex dump.
0000000: 4f3a 373a 224d 6573 7361 6765 223a 313a O:7:"Message":1:
0000010: 7b73 3a31 333a 2200 4d65 7373 6167 6500 {s:13:".Message.
0000020: 5f6d 7367 223b 733a 3132 3a22 4865 6c6c _msg";s:12:"Hell
0000030: 6f20 776f 726c 6421 223b 7d o world!";}
The culprit is on line 2 of the dump: the NUL (0x00) bytes that serialize() places around the class name of a private property. NUL is most definitely not in the valid character range. Some googling led me to the PHP manual, which confirmed my suspicions.
If you are serializing an object with private variables, beware. The serialize() function returns a string with null (x00) characters embedded within it, which you have to escape.
Other comments on that manual page also gave me my solution: escaping with addslashes(). So I replaced the serialize() line in the code above like so, and I was then able to send objects over SQS with no problem. (The receiving side has to call stripslashes() before unserialize().)
$msg = addslashes(serialize($obj));
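To make the round trip concrete, here is a small sketch. The getMessage() accessor is a hypothetical addition so the result can be checked, and the SQS calls are left out entirely.

```php
<?php
class Message
{
    private $_msg;

    public function setMessage($msg)
    {
        $this->_msg = $msg;
    }

    // Hypothetical accessor, added only so the round trip can be verified.
    public function getMessage()
    {
        return $this->_msg;
    }
}

$obj = new Message;
$obj->setMessage('Hello world!');

// Producer side: serialize() embeds NUL bytes around the private property's class name.
$raw  = serialize($obj);
// addslashes() turns each NUL byte into the two printable characters \0.
$wire = addslashes($raw);

// Consumer side: undo the escaping before rebuilding the object.
$back = unserialize(stripslashes($wire));
```

Note that addslashes() also escapes quotes and backslashes, so the consumer must always apply stripslashes(); passing the escaped string straight to unserialize() would fail.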
I first heard the term technical debt when I was learning about Scrum. It immediately struck a chord because it made so much sense! Technical debt described those times I coded the quick and dirty way (incurring debt) and not the way I wanted. Technical debt described all those times I wanted to refactor an ugly system (and pay down debt), but couldn’t due to deadlines and the fact that it’s so hard to demonstrate the value of better code when the business output is the same.
As Steve McConnell says about this attitude towards debt:
I’ve found that business staff generally seems to have a higher tolerance for technical debt than technical staff does. Business executives tend to want to understand the tradeoffs involved, whereas some technical staff seem to believe that the only correct amount of technical debt is zero.
Like financial debt, a little technical debt is okay. After all, if the viability of the business depends on releasing a product, saving the business is more important than feeling good about your architecture. But you have to service the debt at some point, and fight the common business notion: if the software works, then it’s good enough.
Because the tax man will come to collect. You will know he has arrived when you attempt to change someone else’s (or perhaps your own) old code and find that you have to change 40 files to add a tiny feature, because the whole system is a big ball of mud, and you grumble something about doing it right the first time.
I thought about all this when I read Jay Pipes’ advice to MySQL. Years between releases… Bug fixes that cause bugs… It sounds like the tax man has come to collect at MySQL. He argues for taking a year break to pay off the technical debt in MySQL. Ouch. That’s quite a bill.
Now, I consider Jay to be one of the smartest people I’ve ever met, but I have to disagree on this one. The thought of stopping new work to radically alter a huge working system and ultimately release it a year later in a Big Bang terrifies me. If you were really, really good, you could be successful at this and maintain quality and measure output and do tons of integration testing, regression testing, etc. But one thing that I believe is oft overlooked is that developers like to release code. And the more time that goes by without any released code, the more it feels like days are just wasting away. That’s been my experience anyway. But I digress.
Much like our credit markets these days, after accumulating so much debt, you find that you cannot move. You feel stuck and trapped and spend all your energy trying to stop moving backwards, instead of moving forwards.