It’s been an interesting year for us webmonkeys.
In the last several months a few big banks, insurance companies, and/or financial companies were deeper into sub-prime mortgages than they should have been. The US government stepped in to either loan money (AIG) or take over (Fannie Mae, Freddie Mac) said organizations. The aforementioned steps and those that followed have since blown open the door for much more intervention.
If you’re very familiar with Dave Ramsey, you probably know that we don’t like debt very much around here. To try and stop the one of the first multi-billion dollar bailouts from becoming law, Dave went on 32 different radio and tv shows within 36 hours talking about The Common Sense Fix as an alternative to buying hundreds of billions of bad debts.
This all came together very quickly. Needless to say, it left us in the exciting position of being able to serve more people than our website was comfortably equipped to handle at the time.
At that point in time, many of the pages on our website are generated dynamically every time someone visits. We quickly took static “snapshots” of the high-traffic dynamic pages and put them out there to speed load times. In addition, we moved a lot of our images, .css, .js, and other static files out to be served by Akamai instead of by our local servers. It’s a bit more difficult to update things when Akamai hosts them, but they have a ginormous amount of capacity in their network. To use another example from the industry, Yahoo uses Akamai for most of their static content as well. We were pleasantly surprised by how dramatic of a speed increase came from using them. Perhaps even more exciting, Akamai’s bandwidth price is less expensive than our more traditional hosting arrangements.
What else did we do to prepare? In addition to upgrading some hardware, we also looked at our code to see how we could reduce database usage. We wrote this query to analyze the top 50 queries performing IO on our database:
select top 50
qs.total_worker_time / execution_count as avg_worker_time,
substring(st.text, (qs.statement_start_offset/2)+1, ((
when -1 then datalength(st.text)
end – qs.statement_start_offset)/2) + 1) as statement_text,*
FROM sys.dm_exec_query_stats as qs with (nolock)
cross apply sys.dm_exec_sql_text(qs.sql_handle) as st
order BY avg_worker_time desc
As a quick disclaimer, the query above takes a cross section of the I/O reads executed over the last 10 minutes. We did notice significant differences in the pie charts generated at different times of the day. If you’re going to be investing a lot of time or money into making optimizations, you need to run several of these reports at different times and do some experimentation of your own to see what makes the most sense for you.
We took the results of the above query, we grouped the queries by application, and we made a quick excel pie chart like this one:
Database Usage by Application
With this data, we were then able to dig down into the code and analyze which activities were causing the heaviest database load. In a pinch we were able to quickly shut off a scheduled task and optimize another heavy piece of code to reduce our database IO by 50%.
After the first round of optimizations, we ran the report again and created reality-based plans for further optimization should it be needed.
Getting actual concrete data on the queries actually causing I/O reads was empowering. At the outset of this we brainstormed ways to increase capacity. We proposed some fairly involved projects to optimize a few parts of the site that we knew executed queries. Thankfully we waited for real data before acting, because none of the ideas we initially proposed would have fixed any of the 50 slowest queries.
We made it through this event without serious disruption to our web visitors. However, the experience is really making us think twice about what would happen if we were ever to be Dugg or Slashdotted in a big way.
What an INCREDIBLE problem to have!