Performance Updates

I wanted to take a moment to share what has been going on with the site in terms of downtime issues. As longtime users know, the site prior to Legion was handling everything just fine.

Warcraft Logs on the back end consists of six pieces:
(1) A set of load balanced scalable PHP servers that field all incoming requests from users.
(2) A PHP job server that computes All Stars points and statistics and that processes rankings as new logs come in.
(3) A set of load balanced scalable Java servers that field incoming requests from the PHP servers and that handle doing all the work when you’re viewing a report itself (e.g., showing you damage done, healing, buffs, deaths, etc.)
(4) A memcache that sits between the PHP and Java servers to reduce load on the Tomcat servers, and that also sits between the PHP servers and the database.
(5) A redis cache that sits between the Java servers and the raw log data on Amazon S3.
(6) A MySQL database that contains rankings, guilds, users, etc.

For the last couple of weeks the site has been getting overwhelmed, and it took me a while to figure out why. I was seeing very heavy database load, and this was different from WoD. The issue was also masked for a while because each zone has its rankings in a new table, so it took a while for the Emerald Nightmare rankings table to get large enough to start exposing this issue.

What I believe was happening is that the memcache layer between the PHP servers and the database was failing. Last week I figured this was simply because the memcache layer needed to be scaled up, so I created a new memcache node with 2x the memory. I also changed rankings to be cached for 5 minutes instead of 2 minutes. I figured that was the issue.

This week, however, everything just got worse, so at that point I knew it couldn’t be the memcache getting overwhelmed. I believe now that I was running into memcached’s limit on the size of values you can put into the cache: by default, any value larger than 1 MB is silently rejected.
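The insidious part of that limit is that an oversized `set` fails silently, so the cache looks healthy while the hottest keys never get stored. Here is a minimal Python sketch of the failure mode (the wrapper class and key names are invented for illustration, not the site's actual code):

```python
import pickle

MAX_ITEM_SIZE = 1024 * 1024  # memcached's default item size limit: 1 MB

class SizeAwareCache:
    """Dict-backed stand-in that mimics memcached's silent rejection of
    oversized values, but records the rejection instead of hiding it."""

    def __init__(self):
        self._store = {}
        self.rejected = []  # keys we refused to cache

    def set(self, key, value):
        payload = pickle.dumps(value)
        if len(payload) > MAX_ITEM_SIZE:
            # memcached would fail silently here; we make it visible
            self.rejected.append(key)
            return False
        self._store[key] = payload
        return True

    def get(self, key):
        payload = self._store.get(key)
        return pickle.loads(payload) if payload is not None else None

cache = SizeAwareCache()
small = {"rank": 1, "name": "Foo"}
huge = "x" * (2 * 1024 * 1024)   # 2 MB payload, like a rankings page with gear/talents

cache.set("small_ranking", small)
cache.set("huge_ranking", huge)
```

Every `get` on the oversized key misses, which sends every request straight to the database, exactly the load pattern described above.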

The only thing different about rankings from WoD to Legion was the addition of a huge amount of information per rank in the form of gear and talents. Each row now has gear ids, names and quality information, and also has a bunch of talent information.

So now I knew what the problem was: I needed to switch the cache layer between PHP and the database over to redis instead, since redis does not have this limitation.

However, redis does not come pre-installed on Amazon’s PHP machines, so I had to set up rules to install it. This was the cause of a fair amount of the downtime yesterday… just getting that configuration correct. The Amazon machines had a component the redis extension needed (igbinary) installed without its corresponding include headers, so compiling the redis extension on the machines was failing. I had checked that igbinary was installed before deploying, but I was expecting the include headers to be present too, so that was a surprise relative to my local test environment.

After I finally got everything working, I turned rankings back on yesterday evening. The site promptly died again and got overwhelmed. I checked redis statistics and saw that there were tons of cache hits for redis, so at this point I was very confused.

Following my evening raid, I went back and looked at my code more closely, and it turned out my utility function for fetching from redis was never returning the value it fetched. This meant the cache might as well have not been turned on. Oops!
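The bug was the classic missing `return` in a cache-read helper: the value was fetched and decoded, then thrown away, so monitoring showed cache hits while every caller still fell through to the database. A Python sketch of the broken and fixed shapes (function and key names are hypothetical):

```python
import json

def cached_fetch_broken(cache, key):
    raw = cache.get(key)
    if raw is not None:
        json.loads(raw)          # bug: value is decoded but never returned
    return None                  # so the caller sees a "miss" every time

def cached_fetch_fixed(cache, key):
    raw = cache.get(key)
    if raw is not None:
        return json.loads(raw)   # fix: actually hand the cached value back
    return None

# Any object with .get() works for illustration; a plain dict will do
cache = {"rankings:1": json.dumps({"rank": 1})}
```

Note that cache-hit counters are incremented by the cache server on `GET`, which is why the Redis statistics looked perfectly healthy while the database was dying.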

At any rate, now redis appears to be fully functional as the cache layer, so - fingers crossed - I think we should be more or less back to where we were prior to the Legion launch.

I do have other steps I can take if performance remains a problem, but they involve unpleasant UI changes like limiting pagination to next/prev links and/or removing the talents/trinkets column from rankings. Hopefully it doesn’t come to that (I don’t think it will).

Anyway, ramble over. Now you know what’s been going on. :slight_smile:


So, I’m sure you’ve had tons of advice from tons of people, but the 1 MB limit in memcached can be raised with the -I flag, e.g. -I 2m ups the limit to 2 megabytes.

This of course is not advised by memcached, but I’ve been using it in production at 4 MB for more than a year now with no side effects.

That said, Redis is definitely preferable and if I could get a redis stack up I’d use it over memcached in a heartbeat.

Good luck with this and other issues. Thanks for all the hard work you put in.

So the good news is the site held up fine on Sunday and Tuesday. Today however, the database got overwhelmed again. I had a profiler running the entire time, and I’m implementing a few optimizations that should help.

(1) I’m cutting the number of displayed ranks from 200 down to 100. This means the site will have to do half the work to display those pages, which people load a ton. I mean, a ton.

(2) I have changed the pagination controls for paged rankings to no longer count the total number of pages. Counting rows is an operation InnoDB is terrible at, and even caching the counts for 5 minutes didn’t help, because there are just so many ways to filter the rankings data. People were also loading old ranks like Highmaul and Hellfire Citadel during peak traffic periods, and doing a count on those rankings was horrific.

(3) I’ve taken more steps to cache some of the queries used in building the sidebars. This should reduce some of the stress on the DB.
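Dropping the total page count works because next/prev links only need to know whether one more row exists: fetch `page_size + 1` rows and check for the overflow row, instead of running a `COUNT(*)` that InnoDB must answer with an index scan for every filter combination. A sketch of the technique, using SQLite purely for illustration (table and column names are made up):

```python
import sqlite3

PAGE_SIZE = 100

def fetch_page(conn, page):
    """Return (rows, has_next) without ever counting the whole table."""
    offset = (page - 1) * PAGE_SIZE
    rows = conn.execute(
        "SELECT id, score FROM rankings ORDER BY score DESC "
        "LIMIT ? OFFSET ?",
        (PAGE_SIZE + 1, offset),   # ask for one extra row as a probe
    ).fetchall()
    has_next = len(rows) > PAGE_SIZE
    return rows[:PAGE_SIZE], has_next

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rankings (id INTEGER PRIMARY KEY, score REAL)")
conn.executemany(
    "INSERT INTO rankings (score) VALUES (?)",
    [(float(i),) for i in range(250)],   # 250 ranks -> 2 full pages + 1 partial
)
```

The extra probe row costs one additional index read, versus a count that touches every matching row.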

All other components of the site are holding up fine. The DB is just a tricky piece since I can’t really scale it up (I already have a reserved instance for the DB on Amazon), so I have to find ways to reduce the load on it enough to handle all these extra people.

The good news is I think we’re pretty close. The DB did fine all the way until 4pm CST, so if I can reduce the load on it by about 20-30%, I think it will hold up fine. The pagination changes alone should accomplish that.


I have transitioned from PHP 5.5 to PHP 7. This is supposed to improve performance, so hopefully it will help with site speed as well.

Site held up very well today with no lag or downtime. The database did reach 88% CPU at the peak, so it’s going to be close next Wednesday unless I make a few more changes. I identified two major CPU hits during the peak time today, so I’m going to be disabling some features during high traffic periods.

These include:
(1) Viewing your guild’s rankings page. This page is very expensive to compute and is virtually never cached because of how infrequently a guild’s rankings page is accessed. During peak periods, I plan to only allow subscribers to view guild rankings pages.

(2) Viewing “old” rankings. People searching on Google end up landing on the Hellfire Citadel rankings page, and this is happening a lot. I am going to disable viewing of old rankings completely during peak periods to ensure that the Hellfire Citadel indexes are not jostling with the Emerald Nightmare indexes for buffer pool space.

These will both be temporary measures and should only be necessary during EU peak hours on Wed/Thurs/Sun. I expect that site traffic will decline as it has in past expansions, and then I can remove these restrictions.
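The gating itself can be as simple as a clock check plus a subscriber test. A Python sketch of the idea (the peak windows, days, and function names here are invented for illustration; the post doesn't specify the actual schedule or mechanism):

```python
from datetime import datetime, time

# Hypothetical EU peak windows (server time) on Wed/Thu/Sun
PEAK_DAYS = {2, 3, 6}                 # Mon=0 ... Sun=6
PEAK_START, PEAK_END = time(18, 0), time(23, 0)

def is_peak(now: datetime) -> bool:
    return now.weekday() in PEAK_DAYS and PEAK_START <= now.time() < PEAK_END

def can_view_guild_rankings(now: datetime, is_subscriber: bool) -> bool:
    """During peak windows, only subscribers may load the expensive pages;
    off-peak, everyone can."""
    return is_subscriber or not is_peak(now)
```

Off-peak the check is free, and during peak it sheds exactly the expensive, rarely-cached page loads.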

Just a note that I have implemented restrictions on viewing guild/server rankings and viewing of old rankings. This will go into effect tomorrow during peak times. Hopefully these restrictions will be enough to keep the site from falling over. :slight_smile:

I wanted to provide an update, since as you can see there are still some issues at peak times. Last week, I upgraded the database to double its CPU and memory in preparation for Nighthold. (Many thanks to all the Patrons whose donations made such an upgrade possible.)

I was surprised yesterday to see that the PHP front end started having issues even though it wasn’t stressed CPU-wise and the database and back end also weren’t stressed. After debugging for a while, I realized that the database had an alarmingly high number of connections open simultaneously.

What’s happening is that PHP requests are waiting for a database connection, and the load balancer’s surge queue is filling up to the point where it has to just deny requests.

I believe the excessive database connections are being caused by ability and item images. When you load pages like the damage done view in a report, I now put both talent and gear icons into tooltips. The URL scheme I chose for this was /reports/ability_icon/ and /reports/item_icon/. These then look up the actual WoW/FF/Wildstar icon to use for that ability or item in a memcache, and on a miss fall back to the database.

The problem is that looking these up in the database (as images expire from the memcache) can cause a huge surge of database connections just to look up images. This is what’s killing the site now during peak traffic.

I am in the process of reworking my architecture so that the back end Java servers hold and cache the icon names of ability and item images and actually send that to the front end when JSON requests are made. The front end can then just use those names directly without talking to the database.
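In other words, instead of the front end resolving every icon URL with its own cache/DB round trip per tooltip, the back end resolves icon names once from data it already holds and ships them inline in the JSON. A Python sketch of the shape of that change (the field names and icon table are illustrative, not the actual payload format):

```python
# Hypothetical in-process icon table the back end already has loaded
ICON_NAMES = {101: "spell_fire_fireball", 202: "inv_sword_04"}

def build_entry_old(ability_id, amount):
    # Old shape: the front end must resolve the icon itself, hitting
    # memcache and, on a miss, the database -- once per tooltip.
    return {"abilityID": ability_id, "total": amount}

def build_entry_new(ability_id, amount):
    # New shape: icon name resolved server-side and sent inline,
    # so the front end never opens a DB connection for images.
    return {
        "abilityID": ability_id,
        "total": amount,
        "icon": ICON_NAMES.get(ability_id, "inv_misc_questionmark"),
    }
```

The payload grows by a few bytes per entry, but an entire class of front-end database connections disappears.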

Fingers crossed that this is the last hurdle to overcome performance-wise. I’ll post another update once this conversion is complete, since I have to go page by page to fix this issue.

Another update. I now believe the bottleneck causing the site to fail is the Redis cache. It is used both by the PHP servers and the Java servers in order to cache all sorts of information. If this cache lags or refuses connections it would explain why both server clusters (PHP and Java) could become slow even if I added a huge # of machines to each cluster.

This is similar to the database problem that led me to upgrade the database a week ago, only now that the database is fine, it seems the cache in front of it may not be fine.

I’m going to throw money at the problem and do a 4x upgrade on the Redis cache. We’ll see if that addresses the issue. If not, I can look into not caching as many things in it to reduce the # of clients connecting to it.

My latest suspicion is that the issue under high load was that the PHP servers use “stickiness” to route requests for data from the same report to the same back end server. This stickiness was implemented using Redis. The back end Java server also uses Redis to fetch raw report data (falling back to Amazon S3 on a Redis miss).

This created a “loop” in terms of Redis dependencies, in that the picture looked like this:

User -> PHP Server -> Redis (for session info) -> Java Server -> Redis (for report info)

So once the back end gets overloaded, Redis connections pile up on the back end, and then all the front end requests start to stall as well, each holding its own Redis connection open for session info.

I have made the Redis cache far more powerful, and I’ve also reduced the TTL on many of the objects that sit in the cache to ensure that many more evictions take place under high load.

In addition, to break the “loop”, I’ve switched the session info over to Memcached, so issues with Redis will no longer affect the front end as much.
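Breaking the loop amounts to giving session/stickiness data and report data fully independent backends, so a stall in one can no longer wedge the other. A Python sketch of the separation (these classes are stand-ins; in production the session store would be Memcached and the report cache Redis with S3 as the fallback):

```python
class SessionStore:
    """Sticky-routing info: which back end server handles which report.
    Backed by Memcached in production, so Redis trouble can't stall it."""
    def __init__(self):
        self._data = {}
    def route_for(self, report_id):
        return self._data.get(report_id)
    def set_route(self, report_id, server):
        self._data[report_id] = server

class ReportCache:
    """Raw report data, backed by Redis with S3 as the miss fallback."""
    def __init__(self, fallback):
        self._data = {}
        self._fallback = fallback
    def get(self, report_id):
        if report_id not in self._data:
            # cache miss -> fetch from S3 and populate the cache
            self._data[report_id] = self._fallback(report_id)
        return self._data[report_id]

sessions = SessionStore()   # independent of the report cache
reports = ReportCache(fallback=lambda rid: f"report-body-{rid}")
sessions.set_route("abc123", "java-7")
```

With two separate stores there is no longer a single shared dependency on both sides of the request path.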

Good news! Everything ran smoothly today. It really took a combination of all the above optimizations to get to this point. At the EU peak today, the DB hit 54% CPU (it would have been overloaded on the weaker DB). The back end scaled up to 40 servers (my previous cap was 32, so raising that limit was necessary), and the Redis cache even filled completely.

Thanks to everyone for bearing with me. The traffic on the site just continues to rise, and as long as it keeps climbing, new scalability challenges emerge that I have to overcome.

Anyway, it seems like we’ll be good for a while with all this upgraded hardware, so happy raiding. :slight_smile:
