Archive for January, 2009

ETags

This one is filed under “that’s pretty picky, but I guess it couldn’t hurt.”

The Entity Tag (ETag) HTTP header is a string that uniquely identifies a specific version of a resource. When the browser first downloads a resource, it stores the ETag. When it requests that resource again, it sends the stored ETag back to the server in an If-None-Match header. If the ETag still matches the current version, the server responds with a 304 Not Modified, saving the download.
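
As a rough illustration (not from the original post, and Apache does this automatically for static files), the conditional-request logic looks something like this hand-rolled PHP sketch, where $path is a hypothetical file being served and the tag is just an MD5 of its contents rather than Apache’s inode/size/timestamp scheme:

$path = 'css/style.css';               // hypothetical file to serve
$etag = '"' . md5_file($path) . '"';   // a content hash here, not Apache's default format
header('ETag: ' . $etag);
// If the browser sent the same tag back, skip the body entirely.
if (isset($_SERVER['HTTP_IF_NONE_MATCH']) && trim($_SERVER['HTTP_IF_NONE_MATCH']) === $etag) {
    header('HTTP/1.1 304 Not Modified');
    exit;
}
header('Content-Type: text/css');
readfile($path);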

The problem is that the default format for the ETag (in Apache) is inode-size-timestamp, and the inode will be different from server to server. So when a site is served by more than one machine, a request can land on a server whose ETag differs from the one the browser stored, even though it is in fact an identical file.

According to Yahoo:

The end result is ETags generated by Apache and IIS for the exact same component won’t match from one server to another. If the ETags don’t match, the user doesn’t receive the small, fast 304 response that ETags were designed for; instead, they’ll get a normal 200 response along with all the data for the component. If you host your web site on just one server, this isn’t a problem. But if you have multiple servers hosting your web site, and you’re using Apache or IIS with the default ETag configuration, your users are getting slower pages, your servers have a higher load, you’re consuming greater bandwidth, and proxies aren’t caching your content efficiently.

There is another scenario where it isn’t a problem: if you are using sticky sessions in your load balancer.

In any case, as stated above, it couldn’t hurt to rectify this. So I configured the ETag format in Apache to exclude the inode, and use only size and timestamp.

FileETag MTime Size

So files across servers have the same ETag.

Serving Javascript and CSS

Editor’s note: This post formed the basis of the Front-End Optimization talk I’ve given in the past.

You’ve programmed websites for years and know the ins and outs of PHP and MySQL, so why are Javascript and CSS files such a big deal? You put them in a directory and link to them from your pages. Done. Right?

Not if you want maximum performance.

According to the Yahoo! Exceptional Performance team:

…Only 10% of the time is spent here for the browser to request the HTML page, and for apache to stitch together the HTML and return the response back to the browser. The other 90% of the time is spent fetching other components in the page including images, scripts and stylesheets.

So static content is very important. The same Yahoo people provide us with a comprehensive list of Best (Front-end) Practices for Speeding Up Your Website.  IMO, some of the rules are more important than others, and some are more easily achieved.  Leaving aside hardware solutions (static server, CDN, etc.) for now, let’s look at six of the rules:

  1. Rule 1: Make Fewer HTTP Requests, or combine files. The fewer downloads the better. Simple file concatenation will do. Our goal is at most one Javascript file and one CSS file per page.
  2. Rule 3: Add an Expires Header, meaning every static file must carry a time-stamp so we can take advantage of the HTTP Expires: header. A time-stamp in the GET parameters might work, but some say that certain CDNs and browser/version/platform combinations will not request a new file when only the query string changes. A better solution is to put the time-stamp in the filename somewhere (an Apache config sketch for this rule and the gzip rule follows this list).
  3. Rule 4: Gzip Components. This is easily achieved by enabling mod_deflate in Apache.
  4. Rule 9: Reduce DNS Lookups. Okay, the real value in this rule is introducing parallel downloads by using at least two but no more than four host names (a small helper sketch follows this list). This is better explained here.
  5. Rule 10: Minify JavaScript, or at the least strip out all whitespace and comments. There are more sophisticated compressors out there that replace your actual variable names with shorter symbols, but the chance of introducing bugs is higher.
  6. Rule 12: Remove Duplicate Scripts, which as they say is more common than you think.
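
To illustrate rule 9 concretely (a hypothetical helper, not from the original post): pick one of a small, fixed set of static host names per file, deterministically, so a given asset always comes from the same host (keeping it cacheable) while different assets download in parallel.

function static_url($path) {
    // Hypothetical host names; two to four hosts is the sweet spot.
    $hosts = array('static1.example.com', 'static2.example.com');
    // Hash the path so the same file always maps to the same host.
    $host = $hosts[abs(crc32($path)) % count($hosts)];
    return 'http://' . $host . '/' . $path;
}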

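For rules 3 and 4, the Apache side might look something like the following sketch, assuming mod_expires and mod_deflate are loaded; the content types and the one-year lifetime are illustrative, not from the original post.

# Rule 3: far-future Expires headers for static content
ExpiresActive On
ExpiresByType text/css "access plus 1 year"
ExpiresByType application/x-javascript "access plus 1 year"

# Rule 4: gzip text responses on the way out
AddOutputFilterByType DEFLATE text/html text/css application/x-javascript
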
Rules 3 and 4, as sketched above, are a matter of configuring Apache. How do we achieve the other four?

As I see it, there are three broad ways to achieve them.

  1. Handle every request in real-time.  This means using a PHP file to serve the files (e.g. <link rel="stylesheet" type="text/css" href="custom_handler.php?file1.css,file2.css" /> or something like that).  It can also mean using mod_rewrite to route incoming requests for CSS and Javascript to a PHP script. Either way, there is processing on every page load. Caching the end product helps. Still, there must be a better way.
  2. Use a template or view plugin.  If you are using a templating system to dynamically generate your HTML, you can use some sort of plugin or function to read in a list of static files, check their last-modified times, and if changed build a combined, minified, time-stamped output file to serve up.  This is better than method #1 because by the time the page is built, there is a static file that is simply served to the browser.  Still, there must be a better way.
  3. The best way is to do it offline.  This means a job that checks the static files to see if they’ve been modified.  If so, it processes them and builds the output file that is served directly to the browser (a rough sketch follows below).  This job could be run from cron, or run manually by developers, but the best way is to make it part of the build server.

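That offline job might look something like this sketch; the file names and output directory are illustrative assumptions, and a real build step would run a proper minifier (the YUI Compressor, for example) before writing the combined file.

// Hypothetical list of source files to combine for a page.
$files = array('js/core.js', 'js/nav.js', 'js/forms.js');
$newest = 0;
$combined = '';
foreach ($files as $file) {
    $newest    = max($newest, filemtime($file));
    $combined .= file_get_contents($file) . "\n";
}
// Put the time-stamp in the filename (rule 3) so the Expires header can be far in the future.
$output = "build/all.$newest.js";
if (!file_exists($output)) {
    // A real build would minify $combined here (rule 10) before writing it out.
    file_put_contents($output, $combined);
    echo "Built $output\n";
}
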
Don’t have a build server?  That’s a whole other topic.

I’ve been around offshoring for quite a while now, and I’ve heard people tout the benefit of the 24-hour development cycle many times.  The idea is that when your developers in the Western Hemisphere are going to sleep, your developers in the Eastern Hemisphere are waking up.  When those developers go to sleep, the cycle begins again.  Voilà!  24×5 coding effort.

But it never made sense to me.  It was like two people building a house and claiming that, with one working during the day and the other at night, the house would go up faster than if both worked during the day.  At best, it would be equal effort.  It may even be a little worse, because two people working together can share tools and such, and they each have someone to keep them company.

Looking for some literature on using agile methods with offshore teams a while ago, I finally found that someone had put the same thought in writing.

Another benefit of offshore that’s coming up is the use of 24 hour development to reduce time to market. The benefit that’s touted is that by putting hands on the code base at all hours of the day, functionality gets written faster. Frankly I think this is a totally bogus argument, since I don’t see what adding people does in India that it wouldn’t do by adding them to the onshore team. If I need to add people, it’s more efficient to do it while minimizing the communication difficulties.

Martin Fowler, "Using an Agile Software Process with Offshore Development"

I have been thinking lately that the longer I am in this business, the more I am amazed that any software anywhere runs successfully at all.  Every scenario has to be accounted for.  Every detail has to be precise.  There are so many opportunities for error, from translating requirements, to miscommunications, to unanticipated inputs, to simple flaws in logic and typos.  Not to mention the equally complex task of maintaining the system, storage, and networking environment the program has to run within.

In Code Complete, Steve McConnell talks about this foolhardy profession.

Nobody is really smart enough to program computers.  Fully understanding an average program requires an almost limitless capacity to absorb details and an equal capacity to comprehend them all at the same time. The way you focus your intelligence is more important than how much intelligence you have.

At the 1972 Turing Award lecture, Edsger Dijkstra delivered a paper titled "The Humble Programmer." He argued that most of programming is an attempt to compensate for the strictly limited size of our skulls. The people who are best at programming are the people who realize how small their brains are. They are humble. The people who are the worst at programming are the people who refuse to accept the fact that their brains aren’t equal to the task.

The purpose of many good programming practices is to reduce the load on your gray cells. You might think that the high road would be to develop better mental abilities so you wouldn’t need these programming crutches. You might think that a programmer who uses mental crutches is taking the low road. Empirically, however, it’s been shown that humble programmers who compensate for their fallibilities write code that’s easier for themselves and others to understand and that has fewer errors.

Speaking of Dijkstra, he already knew this in 1968.  In "The Structure of the 'THE'-Multiprogramming System," wherein he describes the design of one of the first multitasking systems, he gives props to his team:

"The other remark is that the members of the group have previously enjoyed as good students a university training of five to eight years and are of Master’s or Ph.D. level. I mention this explicitly because at least in my country the intellectual level needed for system design is in general grossly underestimated. I am convinced more than ever that this type of work is very difficult, and that every effort to do it with other than the best people is doomed to either failure or moderate success at enormous expense."

Converting ’s Correctly

Most of our live production code was written (by me) without any attention paid to character encodings.  Fortunately, nearly every link in the LAMP chain seems to default nicely to ISO-8859-1, so for the most part things have worked out.  Every now and then a UTF-8 character will pop up, and we’ll either change the character in the database, or someone will use random combinations of htmlentities() and mb_convert_encoding() in some random file until it looks right in that particular case.  It’s one of those cases of building up a smidgen of technical debt.  Doing it the right way and switching all of our code, databases, and data from ISO-8859-1 to UTF-8 at this point makes me shudder.

For our newer systems coming online, I really wanted to get this character encoding problem right.  Since we started from scratch, all the necessary endpoints were written to support UTF-8 encoded text.  And we made sure that all incoming data was UTF-8 encoded; if it was not, we converted it, basically using this single line:

$string = mb_convert_encoding($string, 'UTF-8');

But something was wrong.  When I tried to convert a single smart quote (’) generated on my Windows machine and view it in my browser, it simply disappeared.  Trawling the PHP manual for a solution (as usual), I came upon the explanation on the manual page for utf8_encode().

Note that you should only use utf8_encode() on ISO-8859-1 data, and not on data using the Windows-1252 codepage. Microsoft’s Windows-1252 codepage contains ISO-8859-1, but it includes several characters in the range 0x80-0x9F whose codepoints in Unicode do not match the byte’s value (in Unicode, codepoints U+80 – U+9F are unassigned).

utf8_encode() simply assumes the byte’s integer value is the codepoint number in Unicode.

What this means is that, for example, a single smart quote (’) typed on Windows (so really a Windows-1252 byte), treated as ISO-8859-1 and converted to UTF-8 with utf8_encode(), will not become the proper multi-byte character, and thus will appear in the browser as garbage or not at all (in this case not at all, since those codepoints are unassigned). Concretely, the Windows-1252 smart-quote byte 0x92 comes out as the encoding of U+0092 instead of U+2019.

Since no third argument (the source encoding) is given, mb_convert_encoding() falls back to PHP’s internal encoding, which is ISO-8859-1, not the so-similar-yet-different-it’s-annoying-that-it-must-be-a-Microsoft-product Windows-1252 encoding the data actually arrived in. Windows-1252 mostly overlaps with ISO-8859-1, but it assigns printable punctuation (smart quotes, dashes, the euro sign) to the 0x80-0x9F bytes that ISO-8859-1 treats as control codes.

Fortunately, the solution was also on that same manual page: a function with a hard-coded mapping that replaces all of the incorrectly converted Windows-1252 characters with their correct UTF-8 values.

So I modified the above line of code to look like the following, and I could see my smart quotes once again.

$string = strtr(mb_convert_encoding($string, 'UTF-8'), self::$_cp1252_map);
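
Since the fix line above references self::$_cp1252_map, the map is a static class property. A partial sketch of what it might contain (these entries are mine, not copied from the post; the full table on the manual page covers the whole 0x80-0x9F range): each key is the byte sequence the naive conversion produced, and each value is the UTF-8 encoding of the character Windows-1252 actually meant.

private static $_cp1252_map = array(
    "\xc2\x80" => "\xe2\x82\xac", // U+20AC euro sign
    "\xc2\x85" => "\xe2\x80\xa6", // U+2026 horizontal ellipsis
    "\xc2\x91" => "\xe2\x80\x98", // U+2018 left single quotation mark
    "\xc2\x92" => "\xe2\x80\x99", // U+2019 right single quotation mark (the smart quote)
    "\xc2\x93" => "\xe2\x80\x9c", // U+201C left double quotation mark
    "\xc2\x94" => "\xe2\x80\x9d", // U+201D right double quotation mark
    "\xc2\x96" => "\xe2\x80\x93", // U+2013 en dash
    "\xc2\x97" => "\xe2\x80\x94", // U+2014 em dash
    // ...and so on for the rest of the 0x80-0x9F range
);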