Performance Updates

I wanted to take a moment to share what has been going on with the site in terms of downtime issues. As longtime users know, the site prior to Legion was handling everything just fine.

Warcraft Logs on the back end consists of six pieces:
(1) A set of load balanced scalable PHP servers that field all incoming requests from users.
(2) A PHP job server that computes All Stars and statistics and that processes rankings as new logs come in.
(3) A set of load balanced scalable Java servers that field incoming requests from the PHP servers and that handle all the work when you’re viewing a report itself (e.g., showing you damage done, healing, buffs, deaths, etc.).
(4) A memcache layer that sits between the PHP and Java servers to reduce load on the Tomcat servers, and that also sits between the PHP servers and the database.
(5) A Redis cache that sits between the Java servers and the raw log data on Amazon S3.
(6) A MySQL database that contains rankings, guilds, users, etc.
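
To make the role of those cache layers concrete, here is a minimal cache-aside sketch (illustrative function and key names, assuming the usual php-memcached and PDO extensions, not the site's actual code): the PHP tier checks memcache first and only falls through to MySQL on a miss.

```php
<?php
// Cache-aside sketch: try memcache, fall back to MySQL on a miss, then
// populate the cache for the next request. Names here are illustrative.
function getCachedQuery(Memcached $mc, PDO $pdo, string $key, string $sql, array $params): array
{
    $cached = $mc->get($key);
    if ($cached !== false) {
        return $cached;                        // served straight from memcache
    }

    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    $rows = $stmt->fetchAll(PDO::FETCH_ASSOC);

    $mc->set($key, $rows, 300);                // cache for 5 minutes
    return $rows;
}
```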

For the last couple of weeks the site has been getting overwhelmed, and it took me a while to figure out why. I was seeing very heavy database load, and this was different from WoD. The issue was also masked for a while because each zone has its rankings in a new table, so it took a while for the Emerald Nightmare rankings table to get large enough to start exposing this issue.

What I believe was happening is that the memcache layer between the PHP servers and the database was failing. Last week I figured this was simply because the memcache layer needed to be scaled up, so I created a new memcache node with 2x the memory. I also changed rankings to be cached for 5 minutes instead of 2 minutes. I figured that was the issue.

This week, however, everything just got worse, so at that point I knew it couldn’t be the memcache getting overwhelmed. I now believe I was running into memcached’s limit on the size of the values you can put into the cache: by default, if a value is larger than 1 MB, memcached simply will not store it.
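
What makes this limit nasty is that the failure is silent. A simplified illustration (assumed code, not what the site actually runs) of how it shows up with the php-memcached extension:

```php
<?php
// Memcached::set() just returns false for an oversized item, so code that
// ignores the return value keeps "caching" while every read misses and
// falls through to the database.
$mc = new Memcached();
$mc->addServer('cache.internal', 11211);       // placeholder host

$rankingsPage = ['ranks' => [/* gear ids, names, quality, talents, ... */]];

if (!$mc->set('rankings:en:page1', $rankingsPage, 300)) {
    // With Legion's gear/talent data a serialized rankings page can exceed
    // 1 MB; getResultMessage() will usually report the item was too large.
    error_log('memcache set failed: ' . $mc->getResultMessage());
}
```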

The only thing different about rankings from WoD to Legion was the addition of a huge amount of information per rank in the form of gear and talents. Each row now has gear ids, names and quality information, and also has a bunch of talent information.

So now I knew what the problem was: I needed to switch the cache layer between PHP and the database over to Redis instead, since Redis does not have this limitation.

However, Redis support does not come pre-installed on Amazon’s PHP machines, so I had to set up rules to install it. This was the cause of a fair amount of the downtime yesterday… just getting that configuration correct. The Amazon machines had a component that the Redis extension needs (igbinary) installed, but without its corresponding include headers, so compiling the Redis extension on those machines was failing. I had checked that igbinary was installed before deploying, but I was expecting the include headers to be present as well, so that was a surprise relative to my local test environment.

After I finally got everything working, I turned rankings back on yesterday evening. The site promptly died again and got overwhelmed. I checked the Redis statistics and saw tons of cache hits, so at this point I was very confused.

Following my evening raid, I went back and looked at my code more closely, and it turned out my utility function for fetching from Redis was never returning the value it fetched. This meant the cache might as well have not been turned on. Oops!
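
In other words, the bug was roughly of this shape (a simplified reconstruction with made-up names, not the actual helper):

```php
<?php
// The broken helper: it fetched the value from Redis but never returned it,
// so every caller saw null and went to the database anyway.
function fetchFromCache(Redis $redis, string $key)
{
    $value = $redis->get($key);
    if ($value !== false) {
        $value;                 // BUG: fetched but never returned
    }
    return null;
}

// The fix is simply to return what was fetched.
function fetchFromCacheFixed(Redis $redis, string $key)
{
    $value = $redis->get($key);
    return $value !== false ? $value : null;
}
```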

At any rate, Redis now appears to be fully functional as the cache layer, so - fingers crossed - I think we should be more or less back to where we were prior to the Legion launch.

I do have other steps I can take if performance remains a problem, but they involve unpleasant UI changes like limiting pagination to next/prev links and/or removing the talents/trinkets column from rankings. Hopefully it doesn’t come to that (I don’t think it will).

Anyway, ramble over. Now you know what’s been going on. :slight_smile:


So, I’m sure you’ve had tons of advice from tons of people, but the 1 MB limit in memcached can be raised using the -I flag, e.g. -I 2m will raise the max item size to 2 megabytes.

This of course is not advised by memcached, but I’ve been using it in production at 4 MB for more than a year now with no side effects.

That said, Redis is definitely preferable, and if I could get a Redis stack up I’d use it over memcached in a heartbeat.

Good luck with this and other issues. Thanks for all the hard work you put in.

So the good news is the site held up fine on Sunday and Tuesday. Today, however, the database got overwhelmed again. I had a profiler running the entire time, and I’m implementing a few optimizations that should help.

(1) I’m cutting the number of displayed ranks from 200 down to 100. This means the site will have to do half the work to display those pages, which people load a ton. I mean, a ton.

(2) I have changed the pagination controls for paged rankings to no longer count the total # of pages (a rough sketch of the approach follows this list). Counting matching rows is just something InnoDB is terrible at, and even caching the count for 5 minutes didn’t help, since there are just so many ways to filter the rankings data. People were also loading old ranks like Highmaul and Hellfire Citadel during peak traffic periods, and doing a count on those rankings was horrific.

(3) I’ve taken more steps to cache some of the queries used in building the sidebars. This should reduce some of the stress on the DB.
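
For those curious about (2), the usual trick is to fetch one row more than the page size and let that extra row tell you whether a “next” link is needed, instead of running a COUNT(*) over every matching rank. A rough sketch with placeholder table and column names:

```php
<?php
// Next/prev pagination without COUNT(*): ask for pageSize + 1 rows; if the
// extra row comes back, a "next" page exists. Schema names are placeholders.
$pdo = new PDO('mysql:host=db.internal;dbname=rankings', 'user', 'pass');

$pageSize = 100;
$page     = max(1, (int) ($_GET['page'] ?? 1));
$offset   = ($page - 1) * $pageSize;

$stmt = $pdo->prepare(
    'SELECT * FROM emerald_nightmare_ranks
      WHERE encounter_id = :enc AND spec_id = :spec
      ORDER BY amount DESC
      LIMIT :lim OFFSET :off'
);
$stmt->bindValue(':enc',  (int) ($_GET['encounter'] ?? 0), PDO::PARAM_INT);
$stmt->bindValue(':spec', (int) ($_GET['spec'] ?? 0),      PDO::PARAM_INT);
$stmt->bindValue(':lim',  $pageSize + 1,                   PDO::PARAM_INT);
$stmt->bindValue(':off',  $offset,                         PDO::PARAM_INT);
$stmt->execute();

$rows    = $stmt->fetchAll(PDO::FETCH_ASSOC);
$hasNext = count($rows) > $pageSize;           // extra row means a next page exists
$hasPrev = $page > 1;
$rows    = array_slice($rows, 0, $pageSize);   // only display the first 100
```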

All other components of the site are holding up fine. The DB is just a tricky piece since I can’t really scale it up (I have a reserved instance for the DB already on Amazon), so I have to find ways to reduce the load on it enough to handle all these extra people.

The good news is I think we’re pretty close. The DB did fine all the way until 4pm CST, so if I can reduce the load on it by about 20-30%, I think it will hold up fine. The pagination changes alone should accomplish that.


I have transitioned from PHP 5.5 to PHP 7. Supposedly this improves performance, so hopefully it will help with site speed as well.

Site held up very well today with no lag or downtime. The database did reach 88% CPU at the peak, so it’s going to be close next Wednesday unless I make a few more changes. I identified two major CPU hits during the peak time today, so I’m going to be disabling some features during high traffic periods.

These include:
(1) Viewing your guild’s rankings page. This page is very expensive to compute, and the cached copy is virtually never reusable because any given guild’s rankings page is accessed so infrequently. During peak periods, I plan to only allow subscribers to view guild rankings pages (a rough sketch of this gating is below).

(2) Viewing “old” rankings. People are searching for rankings on Google and stuff and they end up landing on the Hellfire Citadel ranking page. This is happening a lot. I am going to disable viewing of old rankings completely during peak periods so that the Hellfire Citadel indexes aren’t jostling with the Emerald Nightmare indexes for space in memory.

These will both be temporary measures and should only be necessary during EU peak hours on Wed/Thurs/Sun. I expect that site traffic will decline as it has in past expansions, and then I can remove these restrictions.
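
For the curious, the gating will look roughly like this; the exact peak window and the subscriber check shown here are simplified placeholders, not the real implementation:

```php
<?php
// Rough sketch of a peak-period gate for expensive pages (guild rankings,
// old-tier rankings). The window below is an assumption for illustration.
function isEuPeakPeriod(DateTimeImmutable $now): bool
{
    $utc  = $now->setTimezone(new DateTimeZone('UTC'));
    $day  = $utc->format('D');                 // Wed / Thu / Sun raid nights
    $hour = (int) $utc->format('G');
    return in_array($day, ['Wed', 'Thu', 'Sun'], true) && $hour >= 17 && $hour < 23;
}

function canViewExpensivePage(bool $isSubscriber): bool
{
    return $isSubscriber || !isEuPeakPeriod(new DateTimeImmutable());
}

// Example usage: a non-subscriber hitting a guild rankings page mid-peak.
$isSubscriber = false;                         // would come from the session in practice
if (!canViewExpensivePage($isSubscriber)) {
    http_response_code(503);
    exit('Guild and old-tier rankings are temporarily disabled during peak hours.');
}
```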

Just a note that I have implemented restrictions on viewing guild/server rankings and viewing of old rankings. This will go into effect tomorrow during peak times. Hopefully these restrictions will be enough to keep the site from falling over. :slight_smile:

I wanted to provide an update, since as you can see there are still some issues at peak times. Last week, I upgraded the database to double its CPU and memory in preparation for Nighthold. (Many thanks to all the Patrons whose donations made such an upgrade possible.)

I was surprised yesterday to see that the PHP front end started having issues even though it wasn’t stressed CPU-wise and the database and back end also weren’t stressed. After debugging for a while, I realized that the database had an alarmingly high number of connections open simultaneously.

What’s happening is that PHP requests are waiting for a database connection, and the load balancer’s surge queue is filling up to the point where it has to just deny requests.

I believe the excessive database connections are being caused by ability and item images. When you load pages like the damage done view in a report, I now put both talent and gear icons into tooltips. The URL scheme I chose for this was /reports/ability_icon/ and /reports/item_icon/. These then look up the actual WoW/FF/Wildstar icon to use for that ability or item in a memcache, and on a miss fall back to the database.

The problem is that as icons expire from the memcache, these lookups can open a huge surge of database connections just to resolve image names. This is what’s killing the site during peak traffic.

I am in the process of reworking my architecture so that the back end Java servers hold and cache the icon names of ability and item images and send them to the front end when JSON requests are made. The front end can then just use those names directly without talking to the database.
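
On the front end, the change amounts to something like this (the JSON field names and asset host are placeholders for illustration):

```php
<?php
// The Java back end now resolves the icon file name once and ships it in the
// JSON payload, so the PHP front end builds the image URL directly instead of
// doing a memcache/database lookup per icon.
$backendJson = '{"abilities":[{"name":"Fireball","icon":"spell_fire_flamebolt.jpg"}]}';
$payload     = json_decode($backendJson, true);

foreach ($payload['abilities'] as $ability) {
    $icon = htmlspecialchars($ability['icon'], ENT_QUOTES);
    $name = htmlspecialchars($ability['name'], ENT_QUOTES);
    echo '<img class="ability-icon" alt="' . $name . '" '
       . 'src="https://assets.example.com/icons/' . $icon . '">' . "\n";
}
```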

Fingers crossed that this is the last hurdle to overcome performance-wise. I’ll post another update once this conversion is complete, since I have to go page by page to fix this issue.

Another update. I now believe the bottleneck causing the site to fail is the Redis cache. It is used both by the PHP servers and the Java servers to cache all sorts of information. If this cache lags or refuses connections, that would explain why both server clusters (PHP and Java) could become slow even if I added a huge # of machines to each cluster.

This is similar to the database problem that led me to upgrade the database a week ago, only now that the database is fine, it seems the cache in front of it may not be fine.

I’m going to throw money at the problem and do a 4x upgrade on the Redis cache. We’ll see if that addresses the issue. If not, I can look into not caching as many things in it to reduce the # of clients connecting to it.

My latest suspicion is that the issue under high load was that the PHP servers use “stickiness” to route requests for data from the same report to the same back end server. This stickiness was implemented using Redis. The back end Java server also uses Redis to fetch raw report data (falling back to Amazon S3 on a Redis miss).

This created a “loop” in terms of Redis dependencies, in that you had a picture that looked like this:

User -> PHP Server -> Redis (for session info) -> Java Server -> Redis (for report info)

So once the back end gets overloaded, it holds Redis connections open, and then all of the front end requests start to stall as well, each with its own Redis connection open for the session info.

I have made the Redis cache far more powerful, and I’ve also reduced the TTL of many objects that sit in the cache to ensure a lot more evictions take place under high load.

In addition, to break the “loop”, I’ve switched the session info over to Memcached, so issues with Redis will no longer affect the front end as much.
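
The session change itself is small; conceptually it is just pointing PHP’s session handler at memcached instead of Redis (placeholder host, and assuming the php-memcached session handler is installed):

```php
<?php
// Store PHP sessions in memcached so a slow or saturated Redis can no longer
// stall the front end while it waits for session data.
ini_set('session.save_handler', 'memcached');
ini_set('session.save_path', 'sessions.cache.internal:11211');   // placeholder host
session_start();

// Illustrative only: the report "stickiness" info now lives outside Redis.
$_SESSION['sticky_backend'] = $_SESSION['sticky_backend'] ?? 'java-backend-03';
```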

Good news! Everything ran smoothly today. It really took a combination of all the above optimizations to get to this point. At EU peak today, the DB hit 54% CPU (it would have been overloaded on the weaker DB). The back end scaled out to 40 servers (my previous cap was 32, so raising that limit was necessary), and the Redis cache even filled completely.

Thanks to everyone for bearing with me. The traffic on the site just continues to rise, and as long as it keeps climbing, new scalability challenges emerge that I have to overcome.

Anyway, it seems like we’ll be good for a while with all this upgraded hardware, so happy raiding. :slight_smile:
