Dev Tools for a Server Cluster

09/24/2007    |    Devlog    |    brendanw    |    Discuss

Monitoring tools for the server cluster

One of the initiatives being pushed by the engineering department (DevCo) these days is improving how we monitor the beta and deal with critical beta issues like crashes, character data loss, and exploits. Keeping on top of these issues is immensely important, and in the past we (the devs) have been a little slow to respond to them. A major part of our slow reaction time has been poor visibility into any given problem. I wanted to share with you the tools we’ve been working on to expose beta issues more quickly, and also give you a picture of some of the inner gristle of the beta server cluster.

Anatomy of a Server Cluster

Before I start talking about the tools DevCo has been working on, I should explain how our beta cluster is set up. Currently our beta cluster consists of 4 super beefy machines sitting at our co-location facility in downtown Seattle. Each machine runs a number of server apps that control different aspects of our MMO. This diagram shows the current configuration of the four machines. All of the dynamic game data needed by various cluster apps (character info, landmark state, etc.) is stored on our database server, 10k-sql01. Each cluster machine runs one instance of a special app called “BigBrother” whose only job is to start up other server apps upon request. One of the BigBrothers elects itself the master BigBrother and tells the others what to do.

There are two main classes of server apps. The first class handles cluster-wide services like missions and chat. These typically run on the first machine, but BigBrother can choose to run a server app on any machine depending on server load. These special services are:

  • MissionServer – Tracks all the missions each player has
  • LoginServer – Handles character creation and player login
  • ChatServer – Handles Chat, Mail, and Group/Society management
  • ConnectionServer – Gatekeeper from the Internet to the server cluster, through which your game client connects
  • DispatchServer – Handles all delayed message sending (e.g. messages to offline characters), like auction listings being sold and in-game e-mail messages
  • CacheServer – Gatekeeper to the database. Makes game data lookups from the database faster.

The other class of server apps is the ZoneServer. There are hundreds of these running on a given server cluster. Each zone server controls an instance that players and AIs currently reside in. Towns, mission encounters, Ad-Hoc battles and the Open Sea are all zone servers. Whenever you go from a town to the Open Sea your character is actually being saved to the database, removed from the town zone server, and then reloaded in the Open Sea nav zone. Those of you who were unfortunate enough to get the character data loss bug we had a while ago (i.e. went to one zone, gained XP, left the zone, and the new XP was missing) were hitting a bug in this zoning process, which unfortunately is VERY complicated. If you want to know more about our server architecture, check out the chapter Joe wrote, MMP Server Cluster Architecture, in Massively Multiplayer Game Development 2.
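The save, remove, reload hand-off described above can be sketched in a few lines of Python. This is only an illustration of the ordering, not the actual zoning code (which this post does not show): `Character`, `Database`, `ZoneServer`, and `transfer_character` are all made-up names, and the "database" is just an in-memory dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class Character:
    name: str
    xp: int = 0


class Database:
    """Hypothetical stand-in for the game database; keeps one saved copy per character."""

    def __init__(self):
        self._rows = {}

    def save(self, character):
        # Store a copy so later in-memory changes do not leak into the saved row.
        self._rows[character.name] = Character(character.name, character.xp)

    def load(self, name):
        row = self._rows[name]
        return Character(row.name, row.xp)


@dataclass
class ZoneServer:
    name: str
    characters: dict = field(default_factory=dict)


def transfer_character(db, name, source, destination):
    """Move a character between zones: save, remove from the old zone, reload in the new one."""
    character = source.characters[name]
    db.save(character)                            # 1. persist current state, including new XP
    del source.characters[name]                   # 2. remove from the old zone server
    destination.characters[name] = db.load(name)  # 3. reload from the database in the new zone
    return destination.characters[name]
```

The point of the sketch is the ordering: if step 1 is skipped or fails and the character is still removed, whatever XP was gained in the old zone never reaches the database, which matches the symptom players saw.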

A View From Afar

So remember how I said that the server cluster isn’t located here at FlyingLab HQ but rather at a co-location facility downtown? The next question you might have is, “How do you keep an eye on what the cluster is doing?” Well, I’m glad you asked! For some time we’ve been using Remote Desktop Connection (or RDC for short) to do everything with the remote cluster machines. This was a serious drag because RDC over our DSL line is a bit on the slow side. It quickly became apparent that we needed some web-based tools to manage the cluster from afar, and thus Web OpsViewer was born. NOTE: All links you see here are pictures, not links to the actual tools. This first picture is the cluster overview page, where we can take the cluster up or down, see how many server apps are running, or access more specific cluster info. Clicking on the Process List link gives a list of all the core server apps running, which machine they are running on, when they started, what their load is, etc. Clicking on the Zone List link gives a list of all of the zones currently running. The image here is just a snippet of the total listings, as there are usually hundreds of these entries. We also keep track of all the zones that have shut down, which is particularly useful if a server goes down abnormally and we need to find out why (more on that in a bit).

Memoirs of a ZoneServer

You might have noticed the Log link on the above process and zone lists. Each server app writes a log file to disk, keeping a record of what errors or debugging events transpired for that app. We can turn different kinds of logging on and off depending on what we are concerned with at any given moment. However, getting to these log files used to be a tremendous pain, as we would have to Remote Desktop into the server machines (slow) and search through a log folder by hand looking for the log file we wanted (slower). Misha was not a fan of this process, to say the least. To fix this, I created a web-based log tool that lets you peek at the log files for any process in the cluster. I’m currently working on some better searching and filtering functionality, as some log files (like the Open Sea) can be hundreds of pages long.

Another tool I wrote runs every two hours, scanning all log files for server apps that have finished running and saving the logs to a database, which makes queries and searches on the logs faster and also cuts down the number of log files we have sitting around. Additionally, this log scanner keeps an eye out for specially marked error logs (like broken mission doors and character data corruption) and sends an e-mail with the error messages attached. Since I started running the tool it’s been amazing to see how much log data gets generated. In the last month we’ve generated about 1.3 GB of log data, which can be quite unwieldy to deal with if you don’t have the right tools in place to handle it (still more work to do in this area).
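To make the scanner's job concrete, here is a minimal Python sketch of its two duties: archiving a finished app's log into a database and picking out the specially marked lines for an alert. Everything specific here is an assumption made for illustration: the `[CRITICAL]` marker, the `log_lines` table, and the function names are hypothetical, SQLite stands in for whatever database the real tool uses, and the e-mail step is reduced to building the message body.

```python
import sqlite3

CRITICAL_MARKER = "[CRITICAL]"  # hypothetical marker; the real tag format is not given


def open_log_db(path=":memory:"):
    """Open (or create) the log archive. The schema is an assumption for this sketch."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS log_lines ("
        "app TEXT NOT NULL, line_no INTEGER NOT NULL, text TEXT NOT NULL)"
    )
    return conn


def archive_log(conn, app, lines):
    """Store every line of a finished app's log and return the specially marked ones."""
    rows = [(app, i, line.rstrip("\n")) for i, line in enumerate(lines, start=1)]
    conn.executemany("INSERT INTO log_lines VALUES (?, ?, ?)", rows)
    conn.commit()
    return [text for _, _, text in rows if CRITICAL_MARKER in text]


def build_alert(app, critical_lines):
    """Build the body of an alert e-mail, or None if nothing needs attention."""
    if not critical_lines:
        return None
    header = f"{len(critical_lines)} critical error(s) in {app}:"
    return header + "\n" + "\n".join(critical_lines)
```

Once the lines are in a table, searching a month of logs becomes a query rather than a hunt through a folder, and the original log file can be deleted after it has been archived.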

When Good Clusters Go Bad

The beta community used to be quite small. One of the consequences of having a small beta community is that many hidden bugs don’t get exposed. Once the server population began ramping up we started seeing crashes more frequently. This is totally expected in the development of an MMO, as there are issues that just don’t crop up until lots of people are banging on the systems. Unfortunately, our mechanism for dealing with server crashes was terrible. For example, the mission server would crash and a Dr. Watson dialog would appear on one of the cluster machines, but no one here at FlyingLab HQ knew about it. People in the beta would report that none of the NPCs were working (they have to talk to the mission server). Misha would tell a dev about it, and then we would Remote Desktop to the cluster and find the crash dialog waiting patiently for our attention. Debugging a crashed server app remotely was painful and slow. All told, the whole process could take hours and was stressful. When the beta was small this didn’t happen often, but once the beta started ramping up it became readily apparent that we would get overwhelmed if things didn’t change.

To fix this, I set up some tools to let DevCo know about crashes when they happen and provide the data we need to debug the issue. When a program crashes, you can save a file called a crash dump. This crash dump file gives a bunch of information about the state of the program when it died (what’s in memory, what the program was doing, etc.). Now when something on the cluster crashes, a tool runs that saves a dump file for the crashed app, gets the log file for the app, and sends an e-mail to the devs with a link to the crash dump and log file. This tool has been immensely helpful in dealing with and prioritizing crashes as they happen. In fact, the day we did the auction house stress test event everyone in the office had bets placed on how many crash e-mails we would get :). The other interesting thing you start to notice is that even though you may get 20 crashes in a day, there are usually only one or two actual crash bugs that people keep hitting in the same way (which is good), rather than 20 unique issues (not so good).
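The "20 crashes, one or two bugs" observation suggests grouping crash reports by where they died. The Python sketch below shows one common way to do that; it is not a description of the tool above. It assumes each report is an (app name, list of stack frames) pair pulled from a crash dump, and it uses the top three frames as the signature, which is just a heuristic chosen for this example.

```python
from collections import Counter


def crash_signature(app, stack_frames, depth=3):
    """Identify a crash by its app and top few stack frames (an assumed heuristic)."""
    return (app, tuple(stack_frames[:depth]))


def group_crashes(reports):
    """Count crash reports per signature, most frequent first.

    reports is an iterable of (app, stack_frames) pairs.
    """
    counts = Counter(crash_signature(app, frames) for app, frames in reports)
    return counts.most_common()
```

Sorting by count puts the bug that everyone keeps hitting at the top of the list, which is usually the one worth fixing first.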

Flogging Those Who ‘Sploit

Besides having systems in place for dealing with crashes and critical errors, we’ve been working on tracking other critical cluster events. A big part of this initiative is trying to track down whether people are exploiting a mission for experience or money, or whether someone has found an item duplication bug. Large artificial injections of cash can really disrupt the balance of an MMO economy. A while ago Joe started working on a Formal Logging system, or “Flogs” for short. A flog is a log of a game event that’s written to a database. These events include every time you gain or lose money, get killed, gain XP, take a mission, etc. Joe then made a simple Flog Viewer that lets you view flogs for a given character or search for flogs of a given type (like who got a given rare loot item).
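As a rough picture of what a flog table and the two Flog Viewer lookups might look like, here is a small Python sketch using SQLite. The real schema is not described in this post, so the `flogs` table, its columns, the event type strings, and the function names are all assumptions made for illustration.

```python
import sqlite3


def open_flog_db(path=":memory:"):
    """Create the (hypothetical) flog table: one row per game event."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS flogs ("
        "id INTEGER PRIMARY KEY, "
        "character_name TEXT NOT NULL, "
        "event_type TEXT NOT NULL, "
        "detail TEXT NOT NULL DEFAULT '', "
        "amount INTEGER NOT NULL DEFAULT 0)"
    )
    return conn


def record_flog(conn, character_name, event_type, detail="", amount=0):
    """Write one game event (money gained, loot received, mission taken, ...)."""
    conn.execute(
        "INSERT INTO flogs (character_name, event_type, detail, amount) "
        "VALUES (?, ?, ?, ?)",
        (character_name, event_type, detail, amount),
    )


def flogs_for_character(conn, character_name):
    """Viewer query 1: every flog for one character, in the order recorded."""
    return conn.execute(
        "SELECT event_type, detail, amount FROM flogs "
        "WHERE character_name = ? ORDER BY id",
        (character_name,),
    ).fetchall()


def characters_with_event(conn, event_type, detail):
    """Viewer query 2: who has a flog of this type, e.g. who got a given rare item."""
    rows = conn.execute(
        "SELECT DISTINCT character_name FROM flogs "
        "WHERE event_type = ? AND detail = ? ORDER BY character_name",
        (event_type, detail),
    ).fetchall()
    return [row[0] for row in rows]
```

For example, after recording a "loot" flog with detail "Rare Cutlass" for two characters, `characters_with_event(conn, "loot", "Rare Cutlass")` returns both names.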

The Flog Viewer is a very general tool that can spit out overwhelming amounts of data, so we started creating simpler tools for asking more specific, common questions. The first such tool was the money distribution tool, which gives a breakdown of where all the money is earned or spent in the economy. We didn’t really learn anything new from that tool, but the results were interesting.

The most interesting recent discovery we made with the flogger came about when we did a query to see which missions were getting canceled most often, the intent being to figure out if some missions we didn’t know about were buggy or boring. However, we discovered that two missions were being canceled two orders of magnitude more often than any of the other missions. Further, a small number of players were responsible for these repeated cancelations. It became apparent after looking into the two missions in question that we had an XP exploit due to bad AI spawn tables, and players were just farming the encounters and then canceling and retaking the missions. As a result, we decided to add XP and Money Velocity Flog Queries to see who in the game world is earning money and XP at an abnormal rate. As the beta continues we’ll be mining the flog data further to uncover exploits like this, as well as using it to tune various systems in the game. For situations where we find a really bad exploitable mission (or a really broken mission that causes crashes), I just wrote a command that allows a GM to lock a mission so that NPCs will no longer offer it. Players who have the locked mission can still complete anything except encounters associated with the mission (when clicking on the mission door you get an error message in your chat window saying that a GM locked the mission).
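The cancelation query can be sketched in Python as a count per mission plus a check against what is typical. This is an illustration under assumptions, not the actual query: the input format ((mission_id, character_name) pairs), the use of the median as "typical", and the default factor of 100 (standing in for "two orders of magnitude") are all choices made for this example.

```python
from collections import Counter
from statistics import median


def cancel_outliers(cancel_events, factor=100):
    """Return missions canceled at least `factor` times the median cancel count.

    cancel_events is a list of (mission_id, character_name) pairs.
    """
    per_mission = Counter(mission for mission, _ in cancel_events)
    if not per_mission:
        return {}
    typical = median(per_mission.values())
    return {m: n for m, n in per_mission.items() if n >= factor * typical}


def top_cancelers(cancel_events, mission_id, limit=5):
    """For one mission, list the players who canceled it most often."""
    per_player = Counter(who for mission, who in cancel_events if mission == mission_id)
    return per_player.most_common(limit)
```

The second function covers the follow-up question from the paragraph above: once a mission stands out, checking whether a handful of players account for most of its cancelations helps separate an exploit from a mission that is merely boring.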

Tools, Tools, and More Tools

DevCo has really only scratched the surface of the cluster management tools that we can make. One area I would really like to see development in is exposing cluster data on external web pages for the community. In the forums people have asked about a website that exposes port resources and PvP state for landmarks. In a previous dev log, I mentioned work I was doing recording AI movement in the Open Sea for playback in a Flash viewer. I actually got that working a while ago and have wanted to adapt it to make recordings of Landmark and AdHoc battles in the Open Sea so that players can review them online later. Another one of our devs, Whather, has also made a server app that crawls our game database and pulls character data out into a separate database that a website could read. The original intent was to make a tool for our Ops Team, but it also allows us to have a website similar to the WoW Armory. I personally find all of these web tool possibilities very exciting, as they allow you to be part of the game without having to be signed on. I suspect that in coming years other MMOs will continue this trend as well. In the meantime, if you have ideas for things you would like to see exposed on the web, I invite you to discuss them in the forums. I look forward to hearing your thoughts and questions.
