View counts, click counts, hit counts, traffic statistics… analytics and reporting are a must-have for web products. The easiest way to collect them is to simply increment a database value each time. The problem is when those increments come in hundreds of times per second, because writes are the most expensive queries.
After observing subpar write behavior, I wanted to know just how many of our total writes were for updating statistics.
First, I ran
% mysqltuner
...
[**] Reads / Writes: 93% / 7%
...
So 7% of all queries were writes. That wasn’t bad. Then, I took the binary log of all DML statements for yesterday, starting at midnight. I figured 24 hours was a good sample.
% mysqlbinlog --start-date='2010-06-06 0' binary-log.000152 > cow
I grepped out the DML lines to get rid of the binary log metadata.
% grep -i '^insert' cow > cow2
% grep -i '^update' cow >> cow2
I counted up lines that wrote to our stat tables.
% wc -l cow2
24898 cow2
% grep -i -c 'stat_' cow2
20880
Doing the math:
20880 / 24898 ≈ 0.84. About 84% of all writes to our database were for statistics. That wasn’t too surprising: most web sites must store and log a lot of data to know where to improve and how users are using the site.
So what do we do?
That’s the subject of another post, but the short answer is that these writes can be batched. Whether the batching happens through a write-through cache or a job queue, the database no longer suffers from a constant stream of write queries.
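As a sketch of the batching idea (in Python with SQLite as a stand-in for the PHP/MySQL stack; the table and column names are hypothetical), increments can accumulate in memory and flush to the database in a single transaction:

```python
import sqlite3
from collections import Counter

# In-memory tally of pending view-count increments (the write buffer).
pending = Counter()

def record_view(event_id):
    """Cheap in-memory increment; no database write yet."""
    pending[event_id] += 1

def flush(conn):
    """Apply all buffered increments in one batched transaction."""
    with conn:  # commits on success
        conn.executemany(
            "UPDATE event_stats SET views = views + ? WHERE event_id = ?",
            [(count, event_id) for event_id, count in pending.items()],
        )
    pending.clear()

# Demo with an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_stats (event_id INTEGER PRIMARY KEY, views INTEGER)")
conn.execute("INSERT INTO event_stats VALUES (8, 0)")

for _ in range(500):   # 500 page views...
    record_view(8)
flush(conn)            # ...but only one database write.

print(conn.execute("SELECT views FROM event_stats WHERE event_id = 8").fetchone()[0])
```

In a real deployment the flush would run on a timer or at a buffer-size threshold, trading a bounded window of unflushed counts for far fewer write queries.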
The MySQL Slow Query Log is a required tool in the database administrator’s toolbox. It’s great for troubleshooting specific issues, but it’s also great for some rainy day application tuning.
My slow query log is in
/var/lib/mysqld/db-001-slow.log and records any queries that take longer than 10 seconds (the default value for
long_query_time). I can get information out of this log with mysqldumpslow. Running
`mysqldumpslow db-001-slow.log` prints the slow queries sorted by descending execution time. But that’s not useful to me, because any query can get delayed by a blip in the system.
I like running
`mysqldumpslow -s c db-001-slow.log`, which prints the slow queries sorted by descending count of how many times each query occurred. Optimizing a query that takes 10 seconds to execute but occurs a dozen times every minute is more beneficial than optimizing a query that takes 140 seconds to execute but rarely occurs.
The first time I tried this exercise, it revealed the following three types of slow queries (I can’t remember the exact order now):

1. Frequently run queries whose results rarely changed
2. Queries using the curdate() function, which are not query cacheable
3. Inserts and updates on the stats table
For #1, I used an in-memory cache to store the query results. For #2, I replaced the
curdate() function with the PHP
date() function everywhere I could find it. For #3, I noticed an extraneous index on the stats table; since indexes slow down inserts and updates, I removed it. For more on handling these types of queries, see my next post.
Since I knew that the MySQL Query Cache uses the literal query text as the key, it made sense that MySQL does not cache queries containing certain SQL functions, such as this one:
$sql = "select event_id from events where event_dt >= curdate()";
MySQL knows that this query run today is not the same query when it is run tomorrow. Other SQL functions, such as
unix_timestamp(), will also bypass the query cache; the MySQL documentation lists them.
So I avoid these functions when possible by calculating the value in PHP. For example, I’d rewrite the above query as:
$date = date('Y-m-d');
$sql = "select event_id from events where event_dt >= '$date'";
The MySQL Query Cache is not very hard to understand. At its most basic, it is a giant hash where the literal query strings are the keys and the arrays of result records are the values. So this query:
SELECT event_name FROM events WHERE event_id = 8;
is different from this query:
SELECT event_name FROM events WHERE event_id = 10;
Important note! This means that even though two parameterized queries may look identical before their parameters are bound, to the query cache they are different queries!
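To make the giant-hash analogy concrete, here is a toy model in Python (not MySQL’s actual implementation) showing that the literal query text is the key, so every distinct parameter value, and even a whitespace or case difference, produces a distinct cache entry:

```python
# Toy model of the query cache: literal query text -> result rows.
query_cache = {}

def cache(sql, rows):
    """Store a query's result set under its literal text."""
    query_cache[sql] = rows
    return rows

cache("SELECT event_name FROM events WHERE event_id = 8", [("Concert",)])
cache("SELECT event_name FROM events WHERE event_id = 10", [("Picnic",)])

# Two entries: the parameter value is part of the key.
print(len(query_cache))  # 2

# A lowercase rewrite of the first query misses the cache entirely.
print("select event_name from events where event_id = 8" in query_cache)  # False
```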
As with all caches, the query cache is concerned with the freshness of its data. It takes perhaps the simplest approach possible: it keeps track of the tables involved in each cached query, and if any of those tables changes, it invalidates the query and removes it from the cache. This means that if your query returns frequently changing data, the query cache will invalidate it frequently, leading to thrashing. For example, suppose you had a query that returned the view count of an event:
SELECT event_name, views FROM events WHERE event_id = 8;
Every time that event is viewed, the cached query will be invalidated. What’s the solution?
In general, write queries so that their result sets do not change often. Specifically, mixing static attributes with frequently updated fields in a single table leads to thrashing, so separate out things like view counts and analytics into their own tables. The frequently updated data can be read with a separate query, or cached in your application in a data structure that periodically flushes to the DB.
This vertical partitioning of a single table’s columns into multiple tables helps immensely with the query cache. What’s more, the table with the unchanging data can be further optimized for READS, and the frequently updated table can be optimized for UPDATES.
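As a sketch of the split (again in Python with SQLite as a stand-in, and a hypothetical schema), the static attributes and the volatile counters live in separate tables and are read with separate queries, so updates to the counters never touch the table behind the cacheable read:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Static attributes: rarely change, so cached reads stay valid.
    CREATE TABLE events (event_id INTEGER PRIMARY KEY, event_name TEXT);
    -- Volatile counters: frequent updates only invalidate this table.
    CREATE TABLE event_stats (event_id INTEGER PRIMARY KEY, views INTEGER);
    INSERT INTO events VALUES (8, 'Launch Party');
    INSERT INTO event_stats VALUES (8, 0);
""")

# This read touches only the stable table; a query cache keyed on it
# would not be invalidated by view-count updates.
name = conn.execute(
    "SELECT event_name FROM events WHERE event_id = 8"
).fetchone()[0]

# The hot counter is updated (and read) through the separate stats table.
conn.execute("UPDATE event_stats SET views = views + 1 WHERE event_id = 8")
views = conn.execute(
    "SELECT views FROM event_stats WHERE event_id = 8"
).fetchone()[0]

print(name, views)
```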