Archive for 'Coding'

Real Software Engineering

So this talk given at JRubyConf 2010 has been setting user groups and Twitter afire.  If you are a developer, I highly recommend you watch it.

Real Software Engineering by @glv is amazing. Truly a foundational talk of our field. One of the best I’ve ever seen? #jrubyconf

Real Software Engineering talk presented at Lone Star Ruby Conference 2010.

86% Of Writes Were For Statistics

View counts, click counts, hit counts, traffic statistics… analytics and reporting are a must-have for any web product.  The easiest way to track them is to simply increment a database value on every hit (sketched just after the list below).  The problem comes when those increments arrive hundreds of times per second.  Writes are the most expensive queries:

  1. Writes usually trigger updating an index.
  2. If you’re using a MyISAM storage engine, the table-level locking can get out of hand. 
  3. Writes are not query-cacheable.
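To make that naive approach concrete, here is roughly what the per-hit increment looks like.  This is just an illustration; the stat_views table and its columns are made up.

<?php
// Naive per-hit counting: one UPDATE against the database for every page view.
$db = new PDO('mysql:host=localhost;dbname=app', 'user', 'password');

$contentId = 42;  // the page or item being viewed
$db->prepare('UPDATE stat_views SET views = views + 1 WHERE content_id = ?')
   ->execute([$contentId]);
// At hundreds of hits per second, the database spends much of its write
// capacity on this bookkeeping alone.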

After observing subpar write behavior, I wanted to know: just how many of our total writes were for updating statistics?

First, I ran mysqltuner.

% mysqltuner
...
[**] Reads / Writes: 93% / 7%
...
%

So 7% of all queries were writes.  That wasn’t bad.  Then, I took the binary log of all DML statements for yesterday, starting at midnight.  I figured 24 hours was a good sample.

% mysqlbinlog  --start-date='2010-06-06 0' binary-log.000152 > cow

I grepped out the DML lines to get rid of the binary log metadata.

% grep -i '^insert' cow > cow2
% grep -i '^update' cow >> cow2

Then I counted the total number of write statements, and how many of them hit our stat tables.

% wc -l cow2
24898 cow2
% grep -i -c 'stat_' cow2
20880

Doing the math: 20880 / 24898 = 0.86. About 86% of all writes to our database were for statistics.  Which wasn’t too surprising.  Most web sites must store and log a lot of data to know where to improve and how users are using the site.

So what do we do?

That’s the subject of another post, but the short answer is to batch these writes.  Whether the counts are accumulated in a write-through cache or pushed onto a job queue, the database no longer has to absorb a constant stream of single-row write queries.
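To sketch the job-queue flavor (an illustration only, not our actual code; the log path, the stat_views table, and its columns are invented): each page request appends a line to a local log, and a cron job periodically folds the log into a single batched statement.

<?php
// In the page request: record the hit without touching MySQL at all.
function record_view(int $contentId): void
{
    // One appended line per hit; LOCK_EX keeps concurrent writers from interleaving.
    file_put_contents('/var/spool/stats/views.log', $contentId . "\n", FILE_APPEND | LOCK_EX);
}

// In a cron job (say, once a minute): fold the log into one batched write.
function flush_views(PDO $db): void
{
    $log = '/var/spool/stats/views.log';
    if (!is_file($log)) {
        return;
    }

    // Rotate first so new hits keep accumulating while we work.
    $work = $log . '.work';
    rename($log, $work);

    // Count hits per content id in memory.
    $ids    = array_filter(array_map('intval', file($work, FILE_IGNORE_NEW_LINES)));
    $counts = array_count_values($ids);

    // One multi-row statement instead of thousands of single-row writes.
    $rows = [];
    $args = [];
    foreach ($counts as $id => $n) {
        $rows[] = '(?, ?)';
        $args[] = $id;
        $args[] = $n;
    }
    if ($rows) {
        $sql = 'INSERT INTO stat_views (content_id, views) VALUES '
             . implode(',', $rows)
             . ' ON DUPLICATE KEY UPDATE views = views + VALUES(views)';
        $db->prepare($sql)->execute($args);
    }

    unlink($work);
}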

MySQL Slow Query Log Is Your Friend

The MySQL Slow Query Log is a required tool in the database administrator’s toolbox.  It’s great for troubleshooting specific issues, and it’s also handy for some rainy-day application tuning.

My slow query log is in /var/lib/mysqld/db-001-slow.log and records any queries that take longer than 10 seconds (the default value for long_query_time). I can get information out of this log using mysqldumpslow.

Running `mysqldumpslow db-001-slow.log` prints out slow queries sorted by descending execution time. But that’s not useful to me, because any query can get delayed by a blip in the system.

I like running `mysqldumpslow -s c db-001-slow.log` which prints out the slow queries sorted by descending count of times that query occurred. Optimizing a query that takes 10 seconds to execute but occurs a dozen times every minute will be more beneficial than optimizing the query that takes 140 seconds to execute but rarely occurs.

The first time I tried this exercise, it revealed the following three types of slow queries (I can’t remember the exact order now):

  1. Queries with lots of logic and joins returning infrequently-changing data.
  2. Queries using the curdate() function, which are not query cacheable.
  3. Queries to insert/update a stats table for content view counts.

For #1, I cached the query results in an in-memory cache.  For #2, I replaced the curdate() function with the PHP date() function everywhere I could find it.  For #3, I noticed an extraneous index on the stats table; since indexes slow down inserts and updates, I removed it.  For more on handling these types of queries, see my next post.
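For #2, the fix amounts to building the query with a literal date instead of a server-side function.  A hypothetical before-and-after (the stat_views table and view_date column are made up):

<?php
// CURDATE() makes a statement non-deterministic, so the MySQL query cache
// refuses to cache it.  Passing today's date from PHP yields identical query
// text all day long, which the query cache can store and serve.

$db = new mysqli('localhost', 'user', 'password', 'app');

// Before: never served from the query cache.
//   SELECT COUNT(*) FROM stat_views WHERE view_date = CURDATE()

// After: date('Y-m-d') is a fixed string for the whole day.
$today  = date('Y-m-d');  // e.g. '2010-06-07'; no user input, safe to interpolate
$result = $db->query("SELECT COUNT(*) FROM stat_views WHERE view_date = '$today'");
$row    = $result->fetch_row();
$count  = (int) $row[0];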

Validation Of Object Model

I was reading up on memcached when I came across some validation of the Pox Framework object model.  The FAQ describes a general design approach to storing lists of data:

Storing lists of data into memcached can mean either storing a single item with a serialized array, or trying to manipulate a huge “collection” of data by adding, removing items without operating on the whole set. Both should be possible.

One thing to keep in mind is memcached’s 1 megabyte limit on item size, so storing the whole collection (ids, data) into memcached might not be the best idea.

Steven Grimm explains a better approach on the mailing list: http://lists.danga.com/pipermail/memcached/2007-July/004578.html

Following the link gives this quote:

A better way to deal with this kind of thing is with a two-phase fetch. So instead of directly caching an array of event data, instead cache an array of event IDs. Query that list, then use it construct a list of the keys of individual event objects you want to fetch, then multi-get that list of keys.

…Another advantage of a scheme like this is that you can update an item’s data without having to read then write every list that contains that item. Just update it by ID (like you’d do in your database queries) and all the lists that contain it will magically get the correct information.

That always feels nice.
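To make the two-phase fetch concrete, here is a small sketch using the PECL Memcached extension.  The key names and the two database helpers (fetchUpcomingEventIdsFromDb and loadEventsFromDb) are hypothetical.

<?php
$cache = new Memcached();
$cache->addServer('127.0.0.1', 11211);

// Phase 1: fetch only the list of IDs (small, and cheap to rebuild or invalidate).
$eventIds = $cache->get('upcoming_event_ids');
if ($eventIds === false) {
    $eventIds = fetchUpcomingEventIdsFromDb();            // hypothetical DB helper
    $cache->set('upcoming_event_ids', $eventIds, 300);
}

// Phase 2: multi-get the individual event objects by ID.
$keys  = array_map(fn ($id) => "event:$id", $eventIds);
$found = $cache->getMulti($keys) ?: [];

// Load and back-fill whatever memcached didn't have.
$missingIds = [];
foreach ($eventIds as $id) {
    if (!isset($found["event:$id"])) {
        $missingIds[] = $id;
    }
}
foreach (loadEventsFromDb($missingIds) as $id => $event) {  // hypothetical DB helper
    $found["event:$id"] = $event;
    $cache->set("event:$id", $event, 300);
}

// Updating one event now means rewriting a single "event:<id>" key; every
// cached list that references that ID picks up the change automatically.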

We’ve never had long-lived sessions.  It was never a requirement.  I think we had a “Remember me” checkbox that didn’t work at one point, but we soon removed it.  Then suddenly, customer requests started coming in.  They asked, “Why do I have to log in every time I use the site?  Why can’t I stay logged in forever, like Facebook or Twitter?”  That was a good question.

Basic User Login

Like most sites, we used the PHP session to maintain a logged-in user.  We started a session, kept track of some data indicating whether the user was logged in, and that was about it.

I had never looked at sessions and cookies in depth before, though I knew generally how they worked.  PHP sets a cookie in the client’s browser containing a session ID.  When a request comes in, PHP reads the session ID, looks up the corresponding session data on disk (or in a database, memcached, etc.), and loads that data into the request.  When the request finishes, the session data is saved back again.
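As a rough sketch of that basic login (the user_id session key is arbitrary):

<?php
// Minimal session-based login, roughly as described above.
session_start();

function logIn(int $userId): void
{
    session_regenerate_id(true);      // new session ID on login, guards against fixation
    $_SESSION['user_id'] = $userId;   // the "data indicating the user is logged in"
}

function isLoggedIn(): bool
{
    return isset($_SESSION['user_id']);
}

function logOut(): void
{
    $_SESSION = [];
    session_destroy();
}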

Implementing The “Remember Me” Checkbox

First, naively, I thought all I had to do was find the right php.ini directive to make sessions last forever.  Browsing the PHP manual and googling, I came across the session.cookie_lifetime directive, which can be configured either in php.ini or with session_set_cookie_params().

session.cookie_lifetime specifies the lifetime of the cookie in seconds which is sent to the browser. The value 0 means “until the browser is closed.” Defaults to 0.

I set this to 24 hours.  Well, that was easy, I thought.
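In code, that first attempt amounted to something like this, set before session_start() on every request (the value could equally live in php.ini):

<?php
session_set_cookie_params(86400);   // session.cookie_lifetime = 86400 seconds (24 hours)
session_start();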

Except it didn’t work.  Users reported logging in, going out to lunch, coming back, and getting logged out on the first link they clicked.  I dug deeper and found another directive.

session.gc_maxlifetime specifies the number of seconds after which data will be seen as ‘garbage’ and cleaned up. Garbage collection occurs during session start.

It defaults to 1440 seconds, or 24 mins.

It’s important to know that session.cookie_lifetime starts when the cookie is set, regardless of last user activity, so it is an absolute expiration time.  session.gc_maxlifetime starts from when the user was last active (clicked), so it’s more like a maximum idle time.

Starting To Understand

Now I could see that both of these directives must cooperate to get the desired effect. Specifically, the shorter of these two values determines my session duration.

For example, let’s say I have session.cookie_lifetime set to its default of 0, and session.gc_maxlifetime is set to its default of 24 mins.  A user who logs in can stay logged in forever, provided he never closes his browser, and he never stops clicking for more than 24 mins.

Now, let’s say the same user takes a 30 min. lunch break and leaves his browser open.  When he gets back, he’ll most likely have been logged out because his session data was garbage collected on the server, even though his browser cookie was still there.

Now, let’s change session.cookie_lifetime to 1 hour.  A user who logs in can stay logged in for up to an hour if he keeps clicking the whole time, regardless of whether he closes and reopens his browser.  If he takes his 30 min. lunch break after working for 15 mins., he will most likely be logged out when he returns, even though his browser cookie had 15 more mins. of life.

Now, keeping session.cookie_lifetime at 1 hour, let’s set session.gc_maxlifetime to 2 hours.  A user who logs in can stay logged in for up to an hour, period.  He does not have to click at all in that time, but he’ll be logged out after an hour.
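That last scenario, expressed as configuration (shown with ini_set() for illustration; in practice these would live in php.ini):

<?php
ini_set('session.cookie_lifetime', '3600');  // cookie dies 1 hour after login (absolute)
ini_set('session.gc_maxlifetime', '7200');   // server-side data survives up to 2 hours of idleness
session_start();
// Net effect: the user stays logged in for at most 1 hour, clicking or not.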

The Real “Remember Me” Solution

Back to my problem.  At this point, I could’ve just set both directives to something like 1 year.  But since session.gc_maxlifetime controls garbage collection of session data, I’d have session data up to a year old left on the server!  I did a quick check on the PHP session directory: there were already several thousand sessions, and that was with only a 24-minute lifetime!

Clearly, this was not how Twitter did it.  A little more digging, and I realized that sites like those do not keep your specific session around for long periods of time.  What they do is set a long-lasting cookie that contains some sort of security token.  From that token, they can authenticate you, and re-create your session, even if your session data has already been removed from the server.  (The cookie name for Twitter is auth_token and looks to have a lifetime of 20 years.)

With the session-recreation method, I could control when and how to log users out, if at all.  This enabled us to give users indefinite sessions while keeping all of the session directives at their default values.
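Here is a sketch of that token approach in present-day PHP.  It is not our original code; the remember_tokens table, the cookie name, and the column names are all invented for illustration.

<?php
const REMEMBER_COOKIE = 'remember_token';

// On login with "Remember me" checked: hand the browser a long-lived random
// token and store only its hash on our side.
function issueRememberToken(PDO $db, int $userId): void
{
    $token = bin2hex(random_bytes(32));                  // random and unguessable
    $db->prepare('INSERT INTO remember_tokens (user_id, token_hash, created_at) VALUES (?, ?, NOW())')
       ->execute([$userId, hash('sha256', $token)]);

    setcookie(REMEMBER_COOKIE, $token, [
        'expires'  => time() + 365 * 24 * 3600,          // a year; Twitter-style longevity
        'path'     => '/',
        'secure'   => true,
        'httponly' => true,                              // scripts can't read the token
        'samesite' => 'Lax',
    ]);
}

// On every request: if there is no live session but the token cookie checks
// out, re-create the session even though the old session data is long gone.
function recreateSessionFromToken(PDO $db): void
{
    session_start();
    if (isset($_SESSION['user_id']) || empty($_COOKIE[REMEMBER_COOKIE])) {
        return;
    }

    $stmt = $db->prepare('SELECT user_id FROM remember_tokens WHERE token_hash = ?');
    $stmt->execute([hash('sha256', $_COOKIE[REMEMBER_COOKIE])]);
    $userId = $stmt->fetchColumn();

    if ($userId !== false) {
        session_regenerate_id(true);                     // fresh session ID
        $_SESSION['user_id'] = (int) $userId;
    }
}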

Beyond Session Cookies

This only scratches the surface of authentication topics, of course.  We didn’t talk about the security implications of the session re-creation method, though I will say that the best practice against session-based attacks seems to be prompting for the password again when the user attempts to change or view sensitive account information.  LinkedIn is the first example that comes to mind.

Shortly after implementing this, a request came down from high above to centralize the authentication for our multiple products.  I began to investigate single sign-on (like Google accounts) and federated identity (like OpenID), but those are topics of another post.

Here are a couple of blogs that got me on my way to the final solution.  Be sure to read the comments: