First, housekeeping: there will be free beer at the end of this discussion. Now that I have your attention, let’s get started.
For those of you following our exploits, our Principal Systems Engineer Resident Cthulhu Cultist Ryan Creasey mentioned in a previous post – “Arming a Bee with a Machine Gun” – that we’ve been actively working on logging all of the data that runs through our systems, and we track a lot of data (8 billion requests per month). The goal here: provide real-time analytics to our customers. Simple, right?
Let me step back and give some context – since the launch of GameSpy Open we’ve been on a path to continually improve developer experience with our products and services. We’ve been actively working on improving platform integration, simplifying the process for integrating our SDKs, and of course, bringing more awesome to our Developer Dashboard which is the impetus for this blog post.
The beauty of data visualization (xkcd.com)
Now, back to the data. The long and short is that we outgrew the analytics system that we initially chose for handling all the data we collect for the games that use our services – and we outgrew it MUCH faster than we’d ever anticipated. Shortly after the launch of Open, it became abundantly clear that our analytics data would be too big for its britches.
With this goal in mind, we set out to create a superior analytics system that could keep up with the heavy load over the long haul.
We Need More Powah…
Our first pass at building this new analytics system leveraged an existing webservice infrastructure written in C# using .NET 4.0/WCF. Hardware: two Windows 2K3 quad-core boxes with 2GB RAM to get things started. Our initial concern was the data bottleneck, the issue behind the aforementioned growing pains. Of course, NoSQL has already addressed this problem – TL;DR (as this will be discussed in a future post), we settled on MongoDB for our data store.
We then wrangled up some really pissed-off bees, which proceeded to annihilate our systems, Max-Payne-style. What we found is that our data store was no longer the bottleneck – Mongo was kicking ass and taking names.
Instead, we began hitting limits on our web tier. Our initial tests never even matched the identified goal of handling 3K reqs/sec (roughly what we see from production load across all of our current titles) – the two boxes we spun up for the PoC hit their cap at ~2,400 reqs/sec. Worse, these requests took 3-5 seconds to complete, on average; insufficient, to say the least, and unacceptable for a real-time system.
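For the curious, here is a quick back-of-the-envelope check of where that 3K figure comes from, using the 8 billion requests per month mentioned earlier. It assumes a 30-day month and perfectly even traffic, which real traffic never is, so treat it as an average rather than a peak.

```ruby
# Average request rate implied by the monthly volume quoted in this post.
REQUESTS_PER_MONTH = 8_000_000_000
SECONDS_PER_MONTH  = 30 * 24 * 60 * 60 # assumption: a 30-day month

# Requests per second, as a float.
def average_rps(requests, seconds)
  requests.fdiv(seconds)
end

avg = average_rps(REQUESTS_PER_MONTH, SECONDS_PER_MONTH)
poc_cap = 2_400 # where the two-box PoC topped out

puts format("average load: %.0f reqs/sec", avg)
puts format("PoC cap covers %.0f%% of that", 100 * poc_cap / avg)
```

In other words, the PoC could not even keep up with the average load, never mind a spike.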
We knew that we needed something different; better, faster, stronger – if you will. Our first thought was to evaluate something from the Java world – which would be an easy paradigm-shift from C# and could leverage frameworks such as Play/Netty.
Diving a bit further, we stumbled upon an interesting article around Goliath/EventMachine that piqued our interest. This framework leverages Ruby 1.9 Fibers to schedule cooperative, lightweight, thread-like structures that can pause and resume on a whim to create a highly efficient web server framework for handling concurrent load.
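If Fibers are new to you, here is a toy sketch of the pause-and-resume idea using nothing but Ruby's built-in Fiber class. This is not Goliath's code: the handler names and the hand-rolled "scheduler" are made up for illustration, and in Goliath it is the EventMachine reactor that resumes a fiber when its I/O completes.

```ruby
# Each "request" is a Fiber that pauses at the point where it would
# otherwise block waiting on I/O.
log = []

handlers = %w[a b].map do |name|
  Fiber.new do
    log << "#{name}: start"
    Fiber.yield # pretend we are waiting on the database
    log << "#{name}: finish"
    :done
  end
end

# A toy scheduler: the first pass runs each fiber up to its yield,
# the second pass resumes each one so it can finish.
handlers.each(&:resume)
results = handlers.map(&:resume)

puts log
```

Both handlers start before either finishes, all on a single thread, which is the whole trick: nobody sits idle while someone else waits.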
I find your lack of faith (in other languages) disturbing
You might ask yourself: Ruby… really? Believe me, we were skeptical as well. On the one hand, Java is tried and true and has reliable (albeit bloated) libraries to choose from. Ruby, on the other hand, provided the agility we needed to get a smaller, modular piece of our infrastructure done quickly. And, with the increasing number of Ruby projects available on GitHub, it also provides a great deal of extensibility. With this, we unleashed Goliath.
Drum Roll Please…
Now on the Linux front, we used two equivalent quad-core, 2GB RAM XenServer VMs running CentOS 5 to keep our tests as even as possible. Since Goliath creates individual reactors to handle requests, we put HAProxy in front to balance the load. Pro Tip: ten reactors per machine gave us optimal results, but experiment to see what works best for you.
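To make the reactor fan-out concrete, here is a small sketch that renders an HAProxy stanza for N local reactors. The section names, ports, and bind address are assumptions for illustration, not our production config, and a real haproxy.cfg would also need its global and defaults sections (timeouts and so on).

```ruby
# Render an HAProxy fragment that round-robins across N local reactors.
# Ports, names, and the bind address are illustrative assumptions.
def haproxy_config(reactors, base_port = 9000)
  lines = [
    "frontend analytics_in",
    "    mode http",
    "    bind *:80",
    "    default_backend goliath_reactors",
    "",
    "backend goliath_reactors",
    "    mode http",
    "    balance roundrobin",
  ]
  # One server line per reactor, each listening on its own local port.
  reactors.times do |i|
    lines << "    server goliath#{i} 127.0.0.1:#{base_port + i} check"
  end
  lines.join("\n") + "\n"
end

config = haproxy_config(10)
puts config
```

Generating the stanza makes it cheap to try five, ten, or twenty reactors per box and rerun the load test.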
Speaking of which, those results were astounding (especially for some of our harder-core Ruby skeptics). Our load tests showed that our new Goliath endpoint was handling 4,500 reqs/sec with a 200ms avg. response time. Holy crap! This knocked the socks off our previous go ’round by an order of magnitude – even running on equivalent hardware. Now granted… 200 ms is still not ideal, but it’s a huge step in the right direction.
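To put numbers on that comparison, here is the arithmetic using only the figures quoted in this post. The order of magnitude shows up in response time; the throughput gain is a little under 2x.

```ruby
# Figures quoted in the post; latencies are in milliseconds.
wcf     = { rps: 2_400, latency_ms: 3_000..5_000 }
goliath = { rps: 4_500, latency_ms: 200 }

throughput_gain   = goliath[:rps].fdiv(wcf[:rps])
latency_gain_low  = wcf[:latency_ms].min / goliath[:latency_ms]
latency_gain_high = wcf[:latency_ms].max / goliath[:latency_ms]

puts "throughput: #{throughput_gain}x"
puts "response time: #{latency_gain_low}x to #{latency_gain_high}x faster"
```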
Papa Loves Mongo
But we weren’t out of the woods yet. If you haven’t played with MongoDB yet, forewarning: Mongo is not like other NoSQL offerings. If configured improperly, it will bite you. Hard. We’ve had our fair share of S&M, but this was not the time nor the place, as we were stone cold sober and had a deadline to meet. After poring over the Mongo wiki, Google group, and some online validation we came up with an approach that worked for us and could scale horizontally while maintaining a redundant backup.
We settled on a configuration of six nodes for the main cluster: three replica sets with two nodes apiece, where each replica set serves as one shard of the sharted (sharded) cluster; e.g. (1-2-3) | (4-5-6). A final seventh node acts as the hidden backup for each replica set: it runs three mongod instances and a configsvr to back up all data in case of zombie apocalypse (only a matter of time). The important part to note is that this third member gives each replica set an odd number of nodes, so an arbiter is unnecessary. Furthermore, the configsvr can live on this backup node, which protects it from any failures in the primary cluster serving traffic. Check out our MongoDB Chef Cookbook if you want to replicate this setup.
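Here is a rough sketch of that topology as replica set config documents, built as plain Ruby hashes. The hostnames, ports, set names, and the exact pairing of nodes into sets are all hypothetical; the member fields (`_id`, `host`, `priority`, `hidden`) are the standard replica set configuration fields, and a hidden member still votes by default, which is what gives each set its odd count.

```ruby
# Hypothetical hostnames; the pairing of nodes into sets is an assumption.
DATA_NODES = {
  "shard1" => ["node1:27017", "node4:27017"],
  "shard2" => ["node2:27017", "node5:27017"],
  "shard3" => ["node3:27017", "node6:27017"],
}
BACKUP_HOST = "node7"       # the seventh node runs one mongod per set
BACKUP_BASE_PORT = 27_101   # arbitrary ports for those three instances

# Two data-bearing members plus one hidden, never-primary backup member.
def replica_set_config(name, data_hosts, backup_host)
  members = data_hosts.each_with_index.map { |host, i| { _id: i, host: host } }
  members << { _id: members.size, host: backup_host, priority: 0, hidden: true }
  { _id: name, members: members }
end

# Members vote unless their config explicitly sets votes to 0.
def voting_members(config)
  config[:members].count { |m| m.fetch(:votes, 1) > 0 }
end

configs = DATA_NODES.each_with_index.map do |(name, hosts), i|
  replica_set_config(name, hosts, "#{BACKUP_HOST}:#{BACKUP_BASE_PORT + i}")
end

configs.each do |cfg|
  puts "#{cfg[:_id]}: #{voting_members(cfg)} voting members"
end
```

Three voters per set means a majority is always possible after losing one member, so no arbiter is needed.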
You Talk Too Much
Thanks! The results of our efforts are now an updated Developer Dashboard (currently in the last stages of QA), with working metrics that are “super speedy”! Yes, it’s a technical term. We also have a lightweight, modular, highly-optimized and horizontally-scalable analytics webservice and a distributed load-test harness to boot. Not too shabby. We’re not done of course! There’s still more to come. Join us next time for a deeper dive into the infrastructure behind this system and our plans for the future. Plus, we’ll have more beer.