Architecting For Dedicated Game Servers With Unreal, Part 1


In the age of cloud infrastructure, dedicated servers are being chosen over peer-to-peer with increasing frequency for multiplayer games. The pros and cons of each model have been covered at length elsewhere. What gets discussed less often are the many traps developers can fall into once they've committed to integrating dedicated servers into their game's online platform.



I've fallen into many of these traps over the years, and I'm sharing Part 1 here to save others from the same fate. Many years ago, while working on The Maestros, an indie game built in Unreal Engine, my team faced the task of incorporating dedicated servers into our back-end services ecosystem. The problems we encountered are ones I would run into again and again across games, whether they used Unreal Engine, Unity, or any other engine.



Later, in Part 2, I'll discuss the major choices you'll face when running dedicated servers for your own game: datacenter vs cloud, bare metal vs VMs vs containers, etc.



Getting into the Game - The Flow



To set the stage, let me briefly walk through how a player gets into a game of The Maestros.



1 - Create a lobby



2 - Players choose characters & join the game



3 - Wait for a game server to start



4 - Join the game server



Phase 1 - Make It Work



We knew what we wanted, so we set out to build it with our tech stack: Node.js, Windows (required for Unreal at the time), and Microsoft Azure cloud VMs. First, the maestros.exe program on a player's computer made HTTP calls to a Node.js service called "Lobbies." These calls would create/join a lobby and select characters. When all the players were connected and ready, the Lobbies service made an HTTP call to another Node.js service called the "Game Allocator," which would then start another process on the same VM for the Game Server. In Unreal, a game server is just another maestros.exe process launched with some special parameters, like so: "maestros.exe /Game/Maps/MyMap -server"
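

To make that concrete, here is a minimal sketch of what launching a Game Server from a Node.js Game Allocator can look like, using Node's built-in child_process module. The executable path, map name, and port flag are illustrative assumptions, not our production code.

    // Sketch: launching an Unreal dedicated server process from a Node.js allocator.
    // The executable path, map, and port flag are illustrative placeholders.
    const { spawn } = require('child_process');

    function startGameServer(port) {
      const server = spawn('C:\\Maestros\\maestros.exe', [
        '/Game/Maps/MyMap',   // map to load
        '-server',            // run as a dedicated server
        `-port=${port}`       // assumed flag for choosing the listen port
      ]);

      server.on('exit', (code) => {
        console.log(`Game server on port ${port} exited with code ${code}`);
      });

      return server;
    }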



Our Game Allocator then watched for the Game Server to complete startup by searching the Game Server's logs for a string like "Initializing Game Engine Completed." When it saw the startup string, the Game Allocator would send a message back to the Lobbies service, which would then pass along the IP & port to players. Players, in turn, would connect to the Game Server, emulating the "open 192.168.1.99" you might type in the Unreal console.
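

For illustration, the log-watching half of that handshake looked conceptually like the sketch below. The log path, poll interval, and Lobbies endpoint are assumptions on my part; the point is simply that the allocator infers readiness from text in a log file.

    // Sketch: polling a Game Server's log file for the startup string,
    // then notifying Lobbies. Paths, hostnames, and routes are placeholders.
    const fs = require('fs');
    const http = require('http');

    function watchForStartup(logPath, onReady) {
      const timer = setInterval(() => {
        const log = fs.readFileSync(logPath, 'utf8');
        if (log.includes('Initializing Game Engine Completed')) {
          clearInterval(timer);
          onReady();
        }
      }, 1000); // poll once per second
    }

    watchForStartup('C:\\Maestros\\Saved\\Logs\\server_7777.log', () => {
      // Tell Lobbies the server is up so it can hand the IP & port to players.
      const req = http.request({
        host: 'lobbies.internal',  // hypothetical internal hostname
        path: '/games/ready',
        method: 'POST',
        headers: { 'Content-Type': 'application/json' }
      });
      req.end(JSON.stringify({ ip: '192.168.1.99', port: 7777 }));
    });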



Phase 2 - Scaling Up



We could now play a game end to end! With a couple more lines of JavaScript, our Game Allocator was also able to manage multiple Game Server processes simultaneously on its VM. Eventually, though, we would need to run more game server processes than one VM could handle, and we'd want the redundancy of multiple game server VMs as well. To achieve that, we ran multiple VMs, each with a Game Allocator that periodically reported its own status to the Lobbies service. The Lobbies code would then choose the best Game Allocator on which to start a new game.
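

A stripped-down version of that heartbeat, plus the selection logic on the Lobbies side, might look like the sketch below. The endpoint, report interval, and "fewest running games" heuristic are simplifying assumptions for illustration.

    // Sketch: each Game Allocator reports its load to Lobbies on a timer,
    // and Lobbies picks the least-loaded allocator for the next game.
    // Hostnames, routes, and the selection heuristic are assumptions.
    const http = require('http');

    const runningGames = []; // game server processes this allocator has started

    // Allocator side: heartbeat every 5 seconds.
    setInterval(() => {
      const body = JSON.stringify({
        allocatorId: 'allocator-vm-01',
        runningGames: runningGames.length
      });
      const req = http.request({
        host: 'lobbies.internal',
        path: '/allocators/status',
        method: 'POST',
        headers: { 'Content-Type': 'application/json' }
      });
      req.end(body);
    }, 5000);

    // Lobbies side: choose the allocator with the fewest running games.
    function pickAllocator(latestStatuses) {
      return Object.values(latestStatuses)
        .sort((a, b) => a.runningGames - b.runningGames)[0];
    }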



Phase 3 - Bug Fixing the Software



This architecture worked well and was very useful throughout development. It's also similar to how many developers implement game server allocation on their first try. And it's plagued with problems. We had to deal with many issues manually for The Maestros. Despite our cleverest code, we saw Game Server processes that never exited, game server VMs getting overloaded, and games being assigned to VMs that were in a bad state or even shut down (Azure performs regular rolling maintenance). Our engineers would have to manually kill game instances, restart the Game Allocator process, or even restart the whole VM.



These headaches show up in many different games, so let's examine the root causes. The first problem is that starting new processes is messy. Unreal servers are slow to start and load a lot from disk. In addition, any process can fail for a variety of reasons (e.g. insufficient RAM, CPU, or disk). We can't do much structurally to fix that except test extensively and write the cleanest code we can.



Second, we kept trying to observe these processes from far away. To tell when a Game Server process had completed startup, we read its logs (yuck). To check on the health of Game Server processes, our Node.js code parsed the output of Windows wmic commands. Even more problematic, Lobbies made decisions about which game server VMs could handle a new game from a separate process, on a separate VM, over calls that take several milliseconds to complete in the most ideal case. If your heart rate isn't spiking to dangerous levels by now, you haven't yet experienced networked race conditions.



Even if the Game Allocator parsed the OS's information about a Game Server process correctly, the Game Server's state could change before the Game Allocator acted on it. What's more, even if the Game Server's state didn't change before the Game Allocator reported it to Lobbies, the game server VM could get shut down by Azure while Lobbies was trying to assign it a game. Scaling out the Lobbies service for redundancy made the problem even worse, because multiple Lobbies instances could assign games to a single Game Allocator before noticing each other's assignments, thereby overloading the machine.



For a couple of months we tried fixes, but ultimately we didn't resolve the race conditions until we changed how we thought about the problem. The breakthrough came when we put decision-making power in the hands of the process with the most information. When it came to game startup, the Game Server process had the best information about when it was done initializing. So we let the Game Server tell the Game Allocator when it was ready (via a local HTTP call) instead of snooping through its logs.
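

On the allocator side, receiving that "I'm ready" call only takes a tiny localhost HTTP listener, sketched below. The port, route, and payload shape are assumptions; the Unreal-side change was simply an HTTP POST issued from the game code once initialization finished.

    // Sketch: Game Allocator listening on localhost for a readiness callback
    // from the Game Server process it spawned. Port, route, and payload are assumptions.
    const http = require('http');

    function notifyLobbies(port) {
      // POST the IP & port to Lobbies, as before.
    }

    const readyListener = http.createServer((req, res) => {
      if (req.method === 'POST' && req.url === '/server-ready') {
        let body = '';
        req.on('data', (chunk) => { body += chunk; });
        req.on('end', () => {
          const { port } = JSON.parse(body); // the game server reports its own port
          notifyLobbies(port);
          res.end('ok');
        });
      } else {
        res.statusCode = 404;
        res.end();
      }
    });

    readyListener.listen(8080, '127.0.0.1'); // only reachable from the same VM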



When it came to determining whether a game server VM was ready to accept new games, the Game Allocator process had the best information. So instead of being told what to do by another process working from outdated information, each Game Allocator pulled game-start tasks from a message queue (RabbitMQ) that Lobbies placed them in, and only when it was ready for more work. Race conditions all but disappeared, we were free to add multiple Lobbies instances, and manual intervention on game servers dropped from weekly to a couple of times a year.
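

The pull model is simple to express with a RabbitMQ client like amqplib, sketched below. The queue name, broker URL, and helper function are assumptions; the important parts are prefetch(1) and acking only after the game is actually running, so an allocator never has more work pushed onto it than it asked for.

    // Sketch: Game Allocator pulling game-start tasks from RabbitMQ via amqplib.
    // Queue name and broker URL are illustrative assumptions.
    const amqp = require('amqplib');

    async function consumeGameTasks() {
      const conn = await amqp.connect('amqp://rabbitmq.internal');
      const channel = await conn.createChannel();
      await channel.assertQueue('game-start-tasks', { durable: true });

      // Take at most one unacknowledged task at a time: a busy allocator
      // simply stops pulling instead of getting overloaded.
      channel.prefetch(1);

      channel.consume('game-start-tasks', async (msg) => {
        const task = JSON.parse(msg.content.toString());
        // Hypothetical helper combining the earlier spawn sketch with the
        // localhost readiness callback.
        await startGameServerAndWaitUntilReady(task);
        channel.ack(msg); // ack only once the game is actually running
      });
    }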



Phase 4 - Bug Fixing the Hardware



The next problem was a very serious one. During our regular Monday night playtests, we saw great performance from our game servers: units were responsive, and hitching was uncommon. However, hitching and latency were unacceptable when we ran alpha playtests on weekends.



Our investigations found that packets weren't making it to our clients - even those with strong connections. Although The Maestros uses a lot of bandwidth, the specifications of our Azure VMs indicated they should be able to keep up in both bandwidth and CPU. Even so, optimizing where we could did not solve the problem; it was back in our next weekend playtest. The only thing that seemed to resolve the issue was moving to huge VMs that promised 10x our bandwidth, which were far less cost-efficient per game than a handful of small/medium instances.



Over time, we grew suspicious. The thing that differed between our regular playtests and our external playtests was not location or hardware (devs participated in both), but timing. Our dev playtests happened during off-peak hours, while our alpha tests were always scheduled around peak times to attract testers. More poking and prodding seemed to confirm the correlation.



Our hypothesis was that our VMs' networking was not performing as advertised when traffic in the datacenter got heavy, likely because other tenants were saturating the shared network. This is commonly known as the "noisy neighbors" problem, and it is frequently discussed. Many argue it doesn't matter, because providers like Microsoft Azure overprovision and you can dynamically spin up more servers to absorb the load. Unfortunately, that strategy doesn't work for our Unreal game servers, which are single processes with latency-sensitive network traffic that cannot be distributed across machines, and certainly cannot be interrupted mid-game.



We had plenty of circumstantial evidence but no way to confirm the hypothesis directly, so we ran a test. We purchased unmanaged, bare-metal servers from another provider and ran them alongside our Azure VMs during peak-time playtests. The bare-metal servers had roughly double the baseline latency (80ms vs. 40ms on our Azure VMs), but games on them ran smoothly, whereas our Azure VMs suffered from near-incomprehensible lag.



Although the switchover seemed inevitable, there were pros and cons. It took about a day to get new servers from our bare-metal provider, so if we went all-in on bare metal we would lose the ability to scale quickly to meet demand. On the other hand, we were already saving about 50% on a per-game basis by using bare metal. We settled on bare-metal servers to handle our daily load, falling back to the more expensive Azure VMs when we needed to expand capacity quickly.



Conclusion and Future Topics



I hope our story helps other developers looking to use dedicated servers for their games. In Part 2, I'll discuss the trade-offs in cost, maintenance, and complexity among the major choices around dedicated game server architectures.

This includes datacenters vs cloud, containers vs VMs, and existing solutions like Google's new container-based dedicated server solution, Agones.