Archive for 'Coding'

Choosing A Language

This post, “14 descriptive reasons why Ruby may be better than PHP,” popped up on PHPDeveloper.org yesterday.  These types of lists appear regularly in the blogosphere. I usually don’t feel compelled to respond, but for some reason this time I do.  Perhaps it is the cool graphics.  Perhaps it is because I just wanted to blog.

Speaking as someone who has done a lot of PHP and very little Ruby or Rails, I agree with the article overall.  It talks about many of the reasons I wish I could start using Ruby more.  However, I would like to rebut a few particular items in the list.  I’ll cover each one with my input.

1. Human-readable code.

Yukihiro Matsumoto wanted to create a language that was designed for programmer functionality. By keeping naming consistent and keywords sensible, he was able to do so.

Agree.  I think this is more of a result of the object-oriented features of Ruby than anything else, but the names and keywords are indeed nice.

2. Principle of least surprise

It is designed in such a way that the language minimizes confusion for experienced users. Methods for different objects, if they do the same task, are generally named and capitalized the same way. This makes it less necessary to constantly refer to the documentation, which leads to faster coding.

Agree.  The inconsistencies in PHP names are well-documented. 

3. Community maturity

The community is better able to give concise, coherent answers since the community is a little more learned. As it is generally not a first language the people in the community already have experience coding and helping others.

Disagree.  I can’t tell if he is in fact boasting that the Ruby community is in general smarter, but the PHP community includes some of the brightest and most experienced software developers in the world.  It is also one of the most diverse, in terms of skillset, and thus is also one of the most welcoming and open technology communities around.  Maybe he means that diversity is a disadvantage, which I would also disagree with.  Anecdotally, the Ruby community is often perceived as elitist and prone to groupthink.  I believe that these are qualities of a very immature community, one that has not healthily fragmented or penetrated the corporate world.  The PHP community is older and has experienced this fragmentation, which seems to be exactly what the author is criticizing. Just wait, Ruby. Just wait.

4. Rails

Creating a web application complete with database is as easy as typing in a few simple commands. The web app framework community for PHP, on the other hand, is fragmented and far less organized.

Basically agree.  It is true that Ruby as a whole has embraced a single framework and enjoys the advantages that result.  Would PHP have enjoyed the same benefits had it settled on a single framework early on?  Impossible to know.  I predict that over time, the Ruby web frameworks will become fragmented and less organized as well.

5. Object-Oriented Programming

Ruby was initially designed to be object-oriented. Robust, organized programs are easier to create with an object oriented language.

Basically agree.  It is of course possible to write very good object-oriented code in PHP, but Ruby provides less rope to hang yourself with.

6. ActiveRecord design

Using Rails, it is possible to map database records/tables to objects, which makes creating and handling persistent data extremely easy.

Basically agree.  Like point #4 above, there are plenty of good ActiveRecord libraries for PHP, but having a single, agreed-upon ORM to go with your single, agreed-upon MVC framework lends many advantages.

7. Easier to install

Gem makes it extremely easy to download and install different frameworks and applications to use in your code. PHP, however, makes you find, download, install, and configure all extensions yourself.

Disagree.  This should not even be on this list.  I see no real advantages of gem over pear.  Furthermore, I have seen some really hairy gem problems involving paths and multiple installations.

8. MVC

The model view controller architecture is a common system that all software engineers use. Rails is designed around this concept, so it gives future maintainers of your project an advantage when trying to understand your code.

Basically agree.  See points #4 and #6 above.  As I type this, points #4, #6, and #8 could be combined into a single item.

9. JVM support

Whenever there is a JVM installed, you can use JRuby to run your programs. You can also integrate your project into a Java application if you would like. This is especially useful if you are already writing a large application in Java, and are just looking for a scripting language to augment some functionality.

Agree.  I have no experience with either JRuby or Quercus, but JRuby appears to have achieved more maturity and popularity faster than Quercus.

10. Formal namespaces

Organizing code is much easier when you can group classes into namespaces.

Disagree.  PHP 5.3 fully supports namespaces.  Personally, I have always questioned the value of namespacing.  It always seemed more confusing to have identically named entities in different namespaces than to just prefix them and call them different things – which is what you have to do in PHP 5.2.

11. Pre-installed on certain operating systems

It ships with OSX 10.5 (Leopard), so you can get started coding immediately.

Disagree.  It sounds like he is referring to local installations, and not deployed installations, but in either case I would disagree.  OSX comes with PHP as well.  And PHP comes installed on every Linux distro currently made, so I really don’t understand this one at all.   If he is referring to deployed installations, PHP is still installed on more hosting environments than any other language, and it is easier to get running than any other language.  Passenger is slowly mitigating this situation, but PHP still is in front.

12. Interactive shell

It comes default with a robust interactive shell that makes it easy to experiment with code. Instead of typing out some code, saving it to your server, opening up your internet browser, and navigating to your page, you can just open up the console and type.

Agree.  PHP has `php -a`, but it is nowhere near as usable as IRB.

13. Maintainability

Less code (assuming the complexity is the same) means there is less to get confused about. Forcing MVC on you (through Rails) also helps developers maintain your code. Unit testing has also been tightly integrated from the beginning of the language, so it is more convenient to create/use unit tests for regression tests.

Basically agree.  I am now combining points #4, #6, #8, and #13.

14. Everything is an object

Not having primitives makes code easier to handle. Instead of making sure something is an object and then executing its methods, one can just execute the method.

Agree.  If you start treating everything as an object, your business model starts becoming cleaner.  I’d much rather write "hello world".downcase than strtolower("hello world").

:::

Right about now is when I might conclude with: “every situation is different and the proper solution depends on your particular requirements, blah, blah, blah.” I basically agree with that statement, but in my experience, it is missing the point. 

One can pretty much do anything with one platform that one can do with another.  One (web) technology may be faster or easier, but in the end, it’s all just web.  What differs is the people: most are not equally strong in more than one platform.  Most are stronger in one language, even if only because it is the one they used most recently.

So back to the “every situation is different…” statement.  The particular requirements are usually, how can I make the business money?  Then, the proper solution becomes: by using the platform I know the best, and that is already running, formal namespaces be damned.

Real Software Engineering

So this talk given at JRubyConf 2010 has been setting user groups and Twitter afire.  If you are a developer, I highly recommend you watch it.


Real Software Engineering by @glv is amazing. Truly a foundational talk of our field. One of the best I’ve ever seen? #jrubyconf

Real Software Engineering talk presented at Lone Star Ruby Conference 2010.

84% Of Writes Were For Statistics

View counts, click counts, hit counts, traffic statistics… Analytics and reporting are must-haves for web products.  The easiest way to collect them is to simply increment a database value on each event.  The problem comes when those counts arrive hundreds of times per second.  Writes are the most expensive queries:

  1. Writes usually trigger updating an index.
  2. If you’re using a MyISAM storage engine, the table-level locking can get out of hand. 
  3. Writes are not query-cacheable.

After observing subpar write behavior, I wanted to know just how many of our total writes were for updating statistics.

First, I ran mysqltuner.

% mysqltuner
...
[**] Reads / Writes: 93% / 7%
...
%

So 7% of all queries were writes.  That wasn’t bad.  Then, I took the binary log of all DML statements for yesterday, starting at midnight.  I figured 24 hours was a good sample.

% mysqlbinlog --start-datetime='2010-06-06 00:00:00' binary-log.000152 > cow

I grepped out DML lines, to get rid of the binary log stuff.

% grep -i '^insert' cow > cow2
% grep -i '^update' cow >> cow2

I counted up lines that wrote to our stat tables.

% wc -l cow2
24898 cow2
% grep -i -c 'stat_' cow2
20880

Doing the math: 20880 / 24898 ≈ 0.84. About 84% of all writes to our database were for statistics.  That wasn’t too surprising.  Most web sites must store and log a lot of data to know where to improve and how users are using the site.
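The same counting can be done in one pass. Here is a minimal sketch in Python (not what I ran; the grep commands above are), assuming the decoded binary log is available as an iterable of text lines. The function name and the sample statements are made up for illustration.

```python
def stat_write_share(lines, marker="stat_"):
    """Return (total_writes, stat_writes, share) for an iterable of SQL lines."""
    total = 0
    stat = 0
    for line in lines:
        lowered = line.lower()
        # Same filter as the two greps: lines that start with INSERT or UPDATE.
        if lowered.startswith(("insert", "update")):
            total += 1
            # Same test as grep -i 'stat_': the marker anywhere in the statement.
            if marker in lowered:
                stat += 1
    share = stat / total if total else 0.0
    return total, stat, share


# Hypothetical sample lines standing in for the decoded binary log.
sample = [
    "INSERT INTO stat_views (content_id, views) VALUES (1, 1)",
    "UPDATE stat_clicks SET clicks = clicks + 1 WHERE id = 7",
    "UPDATE users SET name = 'x' WHERE id = 3",
    "# at 4242",
    "insert into comments (body) values ('hi')",
]
total, stat, share = stat_write_share(sample)
```

With a real log, passing the open file object as `lines` would avoid the intermediate files.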

So what do we do?

That’s the subject of another post, but the short answer is that these writes can be batched somehow.  Whether the queries are batched with some sort of write-through cache, or job queues, the database won’t suffer from constant write queries.
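To make "batched somehow" a little more concrete, here is a rough sketch of one option: buffer increments in memory and flush them as a single aggregated batch. It is written in Python for illustration only; `CounterBuffer`, `max_pending`, and the flush callback are hypothetical names, and a real flush function would issue one UPDATE per key (or hand the batch to a job queue) instead of appending to a list.

```python
from collections import Counter


class CounterBuffer:
    """Collect counter increments in memory and write them out in batches."""

    def __init__(self, flush_fn, max_pending=100):
        self._pending = Counter()
        self._events = 0
        self._flush_fn = flush_fn
        self._max_pending = max_pending

    def increment(self, key, amount=1):
        self._pending[key] += amount
        self._events += 1
        # Flush once enough events have accumulated.
        if self._events >= self._max_pending:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        batch = dict(self._pending)
        self._pending.clear()
        self._events = 0
        # One call per batch instead of one write per event.
        self._flush_fn(batch)


# Stand-in for the database: each flushed batch is appended to a list.
written = []
view_counts = CounterBuffer(written.append, max_pending=3)
for content_id in (1, 1, 2):
    view_counts.increment(content_id)
view_counts.increment(1)
view_counts.flush()
```

The trade-off is that counts still sitting in the buffer are lost if the process dies before a flush, which is usually acceptable for view counts.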

MySQL Slow Query Log Is Your Friend

The MySQL Slow Query Log is a required tool in the database administrator’s toolbox.  It’s great for troubleshooting specific issues, but it’s also great for some rainy day application tuning.

My slow query log is in /var/lib/mysqld/db-001-slow.log and records any queries that take longer than 10 seconds (the default value for long_query_time). I can get information out of this log using mysqldumpslow.

Running `mysqldumpslow db-001-slow.log` prints out slow queries sorted by descending execution time. But that’s not useful to me, because any query can get delayed by a blip in the system.

I like running `mysqldumpslow -s c db-001-slow.log` which prints out the slow queries sorted by descending count of times that query occurred. Optimizing a query that takes 10 seconds to execute but occurs a dozen times every minute will be more beneficial than optimizing the query that takes 140 seconds to execute but rarely occurs.
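The arithmetic behind that claim is simple enough to sketch. The numbers below are hypothetical (an hour of a 10-second query running a dozen times a minute, versus a 140-second query that ran twice), and the dictionary layout is my own, not the output format of mysqldumpslow.

```python
# Hypothetical one-hour summary of two slow queries.
slow_queries = [
    {"label": "rare", "count": 2, "avg_seconds": 140},
    {"label": "frequent", "count": 12 * 60, "avg_seconds": 10},
]


def total_seconds(entry):
    """Total time spent on a query: occurrences times average duration."""
    return entry["count"] * entry["avg_seconds"]


# Rank by total time consumed, largest first.
ranked = sorted(slow_queries, key=total_seconds, reverse=True)
totals = {entry["label"]: total_seconds(entry) for entry in slow_queries}
```

Under these assumptions the frequent query costs 7200 seconds per hour against 280 for the rare one, which is why sorting by count is a better starting point.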

The first time I tried this exercise, I uncovered the following 3 types of slow queries (can’t remember the exact order now):

  1. Queries with lots of logic and joins returning infrequently-changing data.
  2. Queries using the curdate() function, which are not query cacheable.
  3. Queries to insert/update a stats table for content view counts.

For #1, I used an in-memory cache to cache the query results.   For #2, I replaced the curdate() function with the PHP date() function everywhere I could find it.  For #3, I noticed an extraneous index on the stats table, and indexes slow down inserts and updates, so I removed it.  For more on handling these types of queries, see my next post.
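As an illustration of the fix for #1, here is a minimal sketch of a time-limited in-memory cache wrapped around an expensive query. It is Python rather than the PHP I actually used, and `TTLCache`, `get_or_compute`, and `expensive_report` are hypothetical names; the injected clock exists only so the expiry can be demonstrated without waiting.

```python
import time


class TTLCache:
    """Cache computed values in memory for a fixed number of seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]  # still fresh: skip the query entirely
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []


def expensive_report():
    """Stand-in for a slow query full of joins."""
    calls.append(1)
    return ["row-1", "row-2"]


fake_now = [0.0]
cache = TTLCache(ttl_seconds=60, clock=lambda: fake_now[0])
first = cache.get_or_compute("report", expensive_report)
second = cache.get_or_compute("report", expensive_report)  # served from cache
fake_now[0] = 61.0
third = cache.get_or_compute("report", expensive_report)  # expired, recomputed
```

This only suits data that changes infrequently, since readers may see results up to one TTL old.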

Validation Of Object Model

I was reading up on memcached, when I came across some validation of the Pox Framework object model.  In the FAQ, a general design approach to storing lists of data is described.

Storing lists of data into memcached can mean either storing a single item with a serialized array, or trying to manipulate a huge “collection” of data by adding, removing items without operating on the whole set. Both should be possible.

One thing to keep in mind is memcached’s 1 megabyte limit on item size, so storing the whole collection (ids, data) into memcached might not be the best idea.

Steven Grimm explains a better approach on the mailing list: http://lists.danga.com/pipermail/memcached/2007-July/004578.html

Following the link gives this quote:

A better way to deal with this kind of thing is with a two-phase fetch. So instead of directly caching an array of event data, instead cache an array of event IDs. Query that list, then use it to construct a list of the keys of individual event objects you want to fetch, then multi-get that list of keys.

…Another advantage of a scheme like this is that you can update an item’s data without having to read then write every list that contains that item. Just update it by ID (like you’d do in your database queries) and all the lists that contain it will magically get the correct information.
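The two-phase fetch described in the quote can be sketched roughly as follows. This is Python with a plain dictionary standing in for memcached; `FakeCache`, its `get`/`set`/`get_multi` methods, and the `event:<id>` key format are assumptions made for the example, not the API of any particular client.

```python
class FakeCache:
    """Dictionary-backed stand-in for a memcached client."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def get_multi(self, keys):
        # Return only the keys that are present, like a multi-get would.
        return {key: self._data[key] for key in keys if key in self._data}


def fetch_events(cache, list_key):
    """Phase 1: read the list of IDs. Phase 2: multi-get the items by key."""
    event_ids = cache.get(list_key) or []
    keys = [f"event:{event_id}" for event_id in event_ids]
    found = cache.get_multi(keys)
    # Preserve list order and skip items that have fallen out of the cache.
    return [found[key] for key in keys if key in found]


cache = FakeCache()
cache.set("event:1", {"title": "Launch"})
cache.set("event:2", {"title": "Meetup"})
cache.set("events:upcoming", [1, 2])
cache.set("events:mine", [2])

# Updating one item by ID is visible through every list that contains it.
cache.set("event:2", {"title": "Meetup (moved)"})
upcoming = fetch_events(cache, "events:upcoming")
mine = fetch_events(cache, "events:mine")
```

Each list stays small because it holds only IDs, which also keeps it well under the item size limit mentioned above.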

That always feels nice.

We’ve never had long-lived sessions.  It was never a requirement.  I think we had a “Remember me” checkbox that didn’t work at one point, but we soon removed it.  But suddenly, customer requests started coming in.  They asked, “why do I have to log in every time I use the site?  Why can’t I stay logged in forever, like Facebook or Twitter?”  That was a good question.

Basic User Login

Like most sites, we used the PHP session to maintain a logged in user for our site.   We started a session, kept track of some data indicating if the user is logged in or not, and that was about it.

I had never looked at sessions and cookies in depth before.  I knew generally how sessions worked.  PHP sets a cookie in the client’s browser.  The cookie contains a session ID.  When a request comes in, PHP reads the session ID, looks for a file corresponding to the ID on disk (or in a database, memcached, etc.), reads in the file containing the session data, and loads the session into the request.  When the request finishes, the session data is saved to the file again.

Implementing The “Remember Me” Checkbox

First, naively, I thought all I had to do was find the right php.ini directive to make sessions last forever.  Browsing the PHP manual and googling, I came across the session.cookie_lifetime directive, configured either in php.ini or via session_set_cookie_params().

session.cookie_lifetime specifies the lifetime of the cookie in seconds which is sent to the browser. The value 0 means “until the browser is closed.” Defaults to 0.

I set this to 24 hours.  Well, that was easy, I thought.

Except it didn’t work.  Users reported logging in, going out to lunch, coming back, and getting logged out on the first link clicked.  I dug deeper and found another directive.

session.gc_maxlifetime specifies the number of seconds after which data will be seen as ‘garbage’ and cleaned up. Garbage collection occurs during session start.

It defaults to 1440 seconds, or 24 mins.

It’s important to know that session.cookie_lifetime starts when the cookie is set, regardless of last user activity, so it is an absolute expiration time.  session.gc_maxlifetime starts from when the user was last active (clicked), so it’s more like a maximum idle time.

Starting To Understand

Now I could see that both of these directives must cooperate to get the desired effect. Specifically, the shorter of these two values determines my session duration.

For example, let’s say I have session.cookie_lifetime set to its default of 0, and session.gc_maxlifetime is set to its default of 24 mins.  A user who logs in can stay logged in forever, provided he never closes his browser, and he never stops clicking for more than 24 mins.

Now, let’s say the same user takes a 30 min. lunch break, and leaves his browser open.  When he gets back, he’ll most likely have been logged out because his session data was garbage collected on the server, even though his browser cookie was still there.

Now, let’s change session.cookie_lifetime to 1 hour.  A user who logs in can stay logged in for up to an hour if he keeps clicking the whole time.  This is regardless of whether or not he closes and reopens his browser.  If he takes his 30 min. lunch break after working for 15 mins., he will most likely be logged out when he returns, even though his browser cookie had 15 more mins. of life.

Now, keeping session.cookie_lifetime at 1 hour, let’s set session.gc_maxlifetime to 2 hours.  A user who logs in can stay logged in for up to an hour, period.  He does not have to click at all in that time, but he’ll be logged out after an hour.
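The scenarios above can be condensed into a small model. This is a simplified sketch in Python, not PHP's actual implementation: it treats garbage collection as if it ran exactly when the idle limit passes, whereas the real collector only runs on some requests (hence "most likely" above). The function and parameter names are mine.

```python
MIN = 60  # seconds in a minute


def is_logged_in(now, cookie_set_time, last_activity,
                 cookie_lifetime, gc_maxlifetime, browser_closed=False):
    """Model whether a session survives, with all times in seconds."""
    if cookie_lifetime == 0:
        # A lifetime of 0 means the cookie lasts until the browser is closed.
        cookie_alive = not browser_closed
    else:
        # Absolute expiry, counted from when the cookie was set.
        cookie_alive = (now - cookie_set_time) < cookie_lifetime
    # Idle expiry, counted from the last request.
    data_alive = (now - last_activity) <= gc_maxlifetime
    return cookie_alive and data_alive


# Defaults, browser left open over a 30 min. lunch: logged out.
lunch_defaults = is_logged_in(30 * MIN, 0, 0, 0, 24 * MIN)
# 1 hour cookie, 15 mins. of work, then a 30 min. lunch: logged out.
lunch_hour_cookie = is_logged_in(45 * MIN, 0, 15 * MIN, 60 * MIN, 24 * MIN)
# 1 hour cookie, 2 hour idle limit, no clicks at all.
idle_59 = is_logged_in(59 * MIN, 0, 0, 60 * MIN, 120 * MIN)
idle_61 = is_logged_in(61 * MIN, 0, 0, 60 * MIN, 120 * MIN)
```

Both conditions must hold, so whichever limit is hit first ends the session.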

The Real “Remember Me” Solution

Back to my problem.  At this point, I could’ve just set both directives to something like 1 year.  But since session.gc_maxlifetime controls garbage collection of session data, I’d have session data up to a year old left on the server!  I did a quick check on the PHP session directory.  There were already several thousand sessions, and that was only for a 24-minute lifetime!

Clearly, this was not how Twitter did it.  A little more digging, and I realized that sites like those do not keep your specific session around for long periods of time.  What they do is set a long-lasting cookie that contains some sort of security token.  From that token, they can authenticate you, and re-create your session, even if your session data has already been removed from the server.  (The cookie name for Twitter is auth_token and looks to have a lifetime of 20 years.)

With the session recreation method, I could control when and how to log out users, if at all.  So this enabled us to give users indefinite sessions, while keeping all session directives at their default values.
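Here is a minimal sketch of the session re-creation idea, in Python rather than PHP. It is not how Twitter or any other site actually does it, and it is not our production code: `RememberMeStore`, `restore_session`, and the `remember_me` cookie name are hypothetical, the in-memory dictionary stands in for a database table, and storing only a hash of the token is a precaution I am assuming for the example.

```python
import hashlib
import secrets


class RememberMeStore:
    """Map long-lived random tokens to user IDs (in memory for this sketch)."""

    def __init__(self):
        self._user_by_token_hash = {}

    @staticmethod
    def _hash(token):
        # Keep only a hash server-side so a leaked table does not leak tokens.
        return hashlib.sha256(token.encode("utf-8")).hexdigest()

    def issue(self, user_id):
        token = secrets.token_urlsafe(32)
        self._user_by_token_hash[self._hash(token)] = user_id
        return token  # this value goes into the long-lasting cookie

    def authenticate(self, token):
        return self._user_by_token_hash.get(self._hash(token))

    def revoke(self, token):
        self._user_by_token_hash.pop(self._hash(token), None)


def restore_session(session, cookies, store, cookie_name="remember_me"):
    """Re-create the login from the token if the session data is gone."""
    if "user_id" in session:
        return session
    token = cookies.get(cookie_name)
    user_id = store.authenticate(token) if token else None
    if user_id is not None:
        session["user_id"] = user_id
    return session


store = RememberMeStore()
token = store.issue(42)
restored = restore_session({}, {"remember_me": token}, store)
forged = restore_session({}, {"remember_me": "not-a-real-token"}, store)
store.revoke(token)
after_logout = restore_session({}, {"remember_me": token}, store)
```

Because the token is checked on the server, logging a user out everywhere is just a matter of revoking it, independent of any session directive.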

Beyond Session Cookies

This only scratches the surface of authentication topics, of course.  We didn’t talk about the security implications of the session re-creation method, though I will say that the best security practice against session-based attacks seems to be to prompt for a password if the user attempts to change or view sensitive account information.  LinkedIn is the first example that comes to mind.

Shortly after implementing this, a request came down from high above to centralize the authentication for our multiple products.  I began to investigate single sign-on (like Google accounts) and federated identity (like OpenID), but those are topics of another post.

Here are a couple blogs that got me on my way to the final solution. Be sure to read the comments: