View counts, click counts, hit counts, traffic statistics… Analytics and reporting are a must-have for web products. The easiest way to collect them is to increment a database value on every event. The problem comes when those events arrive hundreds of times per second, because writes are the most expensive queries.
After observing subpar write behavior, I wanted to know just how many of our total writes were for updating statistics.
First, I ran
% mysqltuner
...
[**] Reads / Writes: 93% / 7%
...
%
So 7% of all queries were writes. That wasn’t bad. Then, I took the binary log of all DML statements for yesterday, starting at midnight. I figured 24 hours was a good sample.
% mysqlbinlog --start-date='2010-06-06 0' binary-log.000152 > cow
I grepped out DML lines, to get rid of the binary log stuff.
% grep -i '^insert' cow > cow2
% grep -i '^update' cow >> cow2
I counted up lines that wrote to our stat tables.
% wc -l cow2
24898 cow2
% grep -i -c 'stat_' cow2
20880
Doing the math:
20880 / 24898 ≈ 0.84. About 84% of all writes to our database were for statistics, which wasn’t too surprising. Most web sites must store and log a lot of data to know where to improve and how users are using the site.
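The same count can be done in one pass instead of with intermediate files. The following is a minimal Python sketch, not the method used above: it assumes the text output of mysqlbinlog is available as an iterable of lines, and the function name `stat_write_share` and the sample statements are made up for illustration.

```python
import re

# Matches lines that begin with INSERT or UPDATE, case-insensitively,
# mirroring the two grep commands above.
DML_PATTERN = re.compile(r"^(insert|update)", re.IGNORECASE)


def stat_write_share(lines, table_marker="stat_"):
    """Return (total_writes, stat_writes, share) for an iterable of log lines."""
    total = 0
    stat = 0
    for line in lines:
        if not DML_PATTERN.match(line):
            continue  # skip binary log bookkeeping and other statements
        total += 1
        if table_marker in line.lower():
            stat += 1
    share = stat / total if total else 0.0
    return total, stat, share


if __name__ == "__main__":
    # Hypothetical sample lines; a real run would iterate over the dump file.
    sample = [
        "# at 4",
        "INSERT INTO stat_views (page_id) VALUES (1)",
        "update stat_clicks set n = n + 1 where id = 7",
        "UPDATE users SET name = 'x' WHERE id = 2",
        "SET TIMESTAMP=1275782400/*!*/;",
    ]
    total, stat, share = stat_write_share(sample)
    print(total, stat, round(share, 2))
```

For a real log, passing an open file object (for example `open("cow")`) as `lines` should work, since the function only iterates over it.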
So what do we do?
That’s the subject of another post, but the short answer is that these writes can be batched. Whether the batching is done with some sort of write-back cache or with job queues, the database no longer suffers from a constant stream of write queries.
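To make the batching idea concrete, here is a rough Python sketch that accumulates hits in memory and writes them out in one transaction. It is an illustration under assumptions, not the approach from the follow-up post: sqlite3 stands in for MySQL so the example is self-contained, and the `stat_views` table, the `BatchedCounter` class, and the `flush_every` threshold are all hypothetical.

```python
import sqlite3
from collections import Counter


class BatchedCounter:
    """Accumulate increments in memory and write them out in one batch."""

    def __init__(self, conn, flush_every=100):
        self.conn = conn
        self.flush_every = flush_every  # hypothetical threshold
        self.pending = Counter()
        self.buffered = 0

    def hit(self, page_id):
        self.pending[page_id] += 1
        self.buffered += 1
        if self.buffered >= self.flush_every:
            self.flush()

    def flush(self):
        if not self.pending:
            return
        with self.conn:  # one transaction for the whole batch
            for page_id, n in self.pending.items():
                cur = self.conn.execute(
                    "UPDATE stat_views SET views = views + ? WHERE page_id = ?",
                    (n, page_id),
                )
                if cur.rowcount == 0:  # no row yet for this page
                    self.conn.execute(
                        "INSERT INTO stat_views (page_id, views) VALUES (?, ?)",
                        (page_id, n),
                    )
        self.pending.clear()
        self.buffered = 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stat_views (page_id INTEGER PRIMARY KEY, views INTEGER)")
counter = BatchedCounter(conn, flush_every=1000)
for _ in range(250):
    counter.hit(1)
for _ in range(50):
    counter.hit(2)
counter.flush()  # 300 hits reach the database as one small batch
print(conn.execute("SELECT page_id, views FROM stat_views ORDER BY page_id").fetchall())
```

The trade-off is that any counts still held in memory are lost if the process dies before a flush, which is usually acceptable for statistics but not for data that must be exact.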
The MySQL Slow Query Log is a required tool in the database administrator’s toolbox. It’s great for troubleshooting specific issues, but it’s also great for some rainy day application tuning.
My slow query log is in `/var/lib/mysqld/db-001-slow.log` and records any queries that take longer than 10 seconds (the default value for `long_query_time`). I can get information out of this log using mysqldumpslow.
`mysqldumpslow db-001-slow.log` prints out slow queries sorted by descending execution time. But that’s not useful to me, because any query can get delayed by a blip in the system.
I like running `mysqldumpslow -s c db-001-slow.log`, which prints out the slow queries sorted by descending count of times that query occurred. Optimizing a query that takes 10 seconds to execute but occurs a dozen times every minute will be more beneficial than optimizing the query that takes 140 seconds to execute but rarely occurs.
The first time I tried this exercise, I found the following 3 types of slow queries (can’t remember the exact order now):
Queries using the `curdate()` function, which are not query cacheable.
For #1, I used an in-memory cache to cache the query results. For #2, I replaced the `curdate()` function with the PHP `date()` function everywhere I could find it. For #3, I noticed an extraneous index on the stats table, and indexes slow down inserts and updates, so I removed it. For more on handling these types of queries, see my next post.
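The fix for #2 works because the MySQL query cache matches on the exact query text and refuses to cache statements that call functions like `curdate()`. Computing the date in the application yields a query whose text is identical all day. The post did this with PHP's `date()`; the sketch below shows the same idea in Python, and the `users` table, the `signup_date` column, and the function name are hypothetical.

```python
from datetime import date


def signups_today_query(today=None):
    """Build a query whose text stays identical for the whole day."""
    today = today or date.today()
    # date.isoformat() yields YYYY-MM-DD, which contains no quote characters,
    # so inlining it here is safe; never inline user-supplied values this way.
    return (
        "SELECT COUNT(*) FROM users "
        f"WHERE signup_date = '{today.isoformat()}'"
    )


if __name__ == "__main__":
    print(signups_today_query())
```

Every request on the same day produces the same string, so the cached result can be reused until the underlying table changes.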
Storing lists of data in memcached can mean either storing a single item holding a serialized array, or trying to manipulate a huge “collection” of data by adding and removing items without operating on the whole set. Both should be possible.
One thing to keep in mind is memcached’s 1 megabyte limit on item size, so storing the whole collection (ids, data) into memcached might not be the best idea.
Steven Grimm explains a better approach on the mailing list: http://lists.danga.com/pipermail/memcached/2007-July/004578.html
Following the link gives this quote:
A better way to deal with this kind of thing is with a two-phase fetch. So instead of directly caching an array of event data, instead cache an array of event IDs. Query that list, then use it construct a list of the keys of individual event objects you want to fetch, then multi-get that list of keys.
…Another advantage of a scheme like this is that you can update an item’s data without having to read then write every list that contains that item. Just update it by ID (like you’d do in your database queries) and all the lists that contain it will magically get the correct information.
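The two-phase fetch described in the quote can be sketched as follows. This is an illustration only: `FakeCache` is a dict-backed stand-in written for this example rather than a real memcached client, and the key names (`event:<id>`, `events:upcoming`, `events:featured`) are assumptions.

```python
class FakeCache:
    """Dict-backed stand-in for a memcached client (hypothetical, for illustration)."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def get_multi(self, keys):
        # Like a memcached multi-get: only keys that are present come back.
        return {k: self._data[k] for k in keys if k in self._data}


def fetch_events(cache, list_key):
    """Phase 1: fetch the ID list. Phase 2: multi-get the individual items."""
    event_ids = cache.get(list_key) or []
    keys = [f"event:{event_id}" for event_id in event_ids]
    found = cache.get_multi(keys)
    # Preserve list order; a real caller would load misses from the database.
    return [found[k] for k in keys if k in found]


cache = FakeCache()
cache.set("event:1", {"id": 1, "title": "Launch"})
cache.set("event:2", {"id": 2, "title": "Meetup"})
cache.set("events:upcoming", [1, 2])
cache.set("events:featured", [2])

# Updating one item by ID is reflected in every list that references it.
cache.set("event:2", {"id": 2, "title": "Meetup (moved)"})
print(fetch_events(cache, "events:upcoming"))
print(fetch_events(cache, "events:featured"))
```

Because the lists hold only IDs, they stay small relative to the item size limit, and neither list had to be rewritten when event 2 changed.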
That always feels nice.
Sometimes, I want to pop onto a database server, check the status of something, and then log out. So, for example, if I want to check on the number of query cache free blocks, I run this long command:
% mysqladmin -u admin -p extended | grep -i qcache
Then I type in the password. Well, I grew tired of typing in the extra options, plus the password. Turns out, MySQL will look for the configuration file `.my.cnf` in your home directory after it looks in /etc/my.cnf (it looks in a few other places as well). So I put this in my `.my.cnf`:
[client]
user=admin
password=secret
And now I can simply run:
% mysqladmin extended | grep -i qcache
and it works right away. Note that the password is stored in the clear.
Like most people, I did not know much about HTTP Keep-Alive headers other than that they could be very bad if used incorrectly. So I’ve kept them off, which is the default. But I ran across this blog post which explains the HTTP Keep-Alive, including its benefits and potential pitfalls pretty clearly.
It’s all pretty simple really. There is an overhead to opening and closing TCP connections. To alleviate this, Apache can agree to provide persistent connections by sending HTTP Keep-Alive headers. Then the browser can open a single connection to download multiple resources. But Apache won’t know when the browser is done downloading, so it simply keeps the connection open according to a Keep-Alive timeout, which is set to 15 seconds by default. The problem is that the machine can only keep so many simultaneous connections open due to physical limitations (RAM, CPU, etc.), and 15 seconds is a long time.
To allow browsers to gain some parallelism on downloading files, without keeping persistent connections open too long, the Keep-Alive timeout value should be set to something very low, e.g. 2 seconds.
I’ve done this for static content only. Why only static content? It doesn’t really make much sense for the main page source itself since that’s the page the user wants to view.
I’ve mentioned before that by serving all static content on dedicated subdomains, we indirectly get the benefit of being able to optimize just those subdomains. So far, this meant:
Now we can add to the list: enabling HTTP Keep-Alive headers. The `VirtualHost` block might look like this now: