Diablo 2: Resurrected launched, and it’s authentic as all hell, but then the D2 servers took an instant trip to the Seventh Circle. For the last week, players have faced constant login issues and outages. And by the sounds of things, the poor server engineers must be absolutely hating life.
First up: any time a developer posts a blog that surpasses 2,000 words, you know the shit has really hit the fan. It’s a massive explainer on all the issues facing Diablo 2: Resurrected players lately, and it’s so extensive because the problems stem not from a single issue but from a mix of them: an inability to cope with the game’s popularity, its architecture, and even the fact that players are just way more efficient at smashing Diablo into the dust in 2021.
The first major problem outlined by the team is how players’ characters and data are stored. If you’ve played any Activision or Blizzard multiplayer game over the last few decades, you’ll know that you generally log in to a set of servers as close to your location as humanly possible. It’s not an individual server per se, but a cluster of servers that service an entire region.
Anyway, these servers all have their own regional databases that store the data of the characters that play on them. This is needed because there are too many people playing Diablo 2 to just continually upload everyone’s data to a single, central point.
“Most of your in-game actions are performed against this regional database because it’s faster, and your character is ‘locked’ there to maintain the individual character record integrity. The global database also has a back-up in case the main fails,” Blizzard wrote.
These regional databases periodically send information back to the central database, so that way Blizzard has a singular record (with backups) of your thicc Level 88 Barbarians, Necromancers and so on. Which sounds all well and good—until that central database gets overloaded and the whole system, much like the engineers working on it, needs a nap.
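To picture the arrangement Blizzard describes, here is a minimal Python sketch of a regional store that locks a character, handles in-game changes locally, and periodically writes them back to a global record. The class and method names (`GlobalDatabase`, `RegionalDatabase`, `check_out`, `gain_xp`, `flush`) are invented for illustration, as is the starting `xp` field; Blizzard’s post does not describe its actual schema or code.

```python
class GlobalDatabase:
    """Hypothetical stand-in for the central record of every character."""

    def __init__(self):
        self.records = {}

    def write(self, name, data):
        # Store a copy so later regional changes do not leak in unflushed.
        self.records[name] = dict(data)


class RegionalDatabase:
    """Hypothetical regional store that locks a character while it is in play."""

    def __init__(self, global_db):
        self.global_db = global_db
        self.records = {}    # character name -> latest local state
        self.locked = set()  # characters currently locked to this region
        self.dirty = set()   # characters with changes not yet written back

    def check_out(self, name):
        # Lock the character here so the same record is not checked out twice.
        if name in self.locked:
            raise RuntimeError(f"{name} is already locked in this region")
        self.locked.add(name)
        self.records[name] = dict(self.global_db.records.get(name, {"xp": 0}))

    def gain_xp(self, name, amount):
        # Fast path: in-game actions only touch the regional copy.
        self.records[name]["xp"] += amount
        self.dirty.add(name)

    def flush(self):
        # Periodic write-back: push changed records to the global database.
        for name in sorted(self.dirty):
            self.global_db.write(name, self.records[name])
        self.dirty.clear()
```

In this sketch, anything gained between calls to `flush` exists only in the regional copy, which is one way to picture why trouble at the central database can leave recent progress stranded.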
“On Saturday morning Pacific time, we suffered a global outage due to a sudden, significant surge in traffic. This was a new threshold that our servers had not experienced at all, not even at launch,” Blizzard explained.
“This was exacerbated by an update we had rolled out the previous day intended to enhance performance around game creation–these two factors combined overloaded our global database, causing it to time out. We decided to roll back that Friday update we’d previously deployed, hoping that would ease the load on the servers leading into Sunday while also giving us the space to investigate deeper into the root cause.
“On Sunday, though, it became clear what we’d done on Saturday wasn’t enough–we saw an even higher increase in traffic, causing us to hit another outage. Our game servers were observing the disconnect from the database and immediately attempted to reconnect, repeatedly, which meant the database never had time to catch up on the work we had completed because it was too busy handling a continuous stream of connection attempts by game servers. During this time, we also saw we could make configuration improvements to our database event logging, which is necessary to restore a healthy state in case of database failure, so we completed those, and undertook further root cause analysis.”
Not exactly the recipe for a fun weekend, that. It also explains why players were having so many issues with progress. You’d pick your character, start a game and play for a while, but the regional server couldn’t communicate with the central database after an outage. So it couldn’t tell Diablo 2’s source of “ground truth” about the new gear and XP you’d gained, resulting in frustrated players losing some of the progress they’d made.
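The reconnect storm in Blizzard’s account, where game servers retried immediately and repeatedly, is the kind of problem commonly tackled with exponential backoff plus random jitter. Blizzard’s post doesn’t say what its fix actually was, so treat the Python sketch below as a general illustration only; the function name `backoff_delays` and the `base` and `cap` values are assumptions.

```python
import random


def backoff_delays(attempts, base=0.5, cap=30.0, rng=None):
    """Return one randomized wait, in seconds, per reconnect attempt.

    The ceiling doubles on each attempt (up to `cap`), and the actual wait is
    drawn at random below that ceiling so clients do not retry in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays
```

A game server would sleep for each delay in turn before its next connection attempt, rather than reconnecting the instant it notices the database is gone.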
The problems only got worse from there. The Diablo 2 servers came back online, but they did so during a period when most players were online—so even though the servers rebounded quickly, they crashed almost straight away as soon as hundreds of thousands of Diablo 2 instances fired up.
And if the weekend was bad, what followed on Monday and Tuesday wasn’t any better:
“This leads us into Monday, October 11, when we made the switch between the global databases. This led to another outage, when our backup database was erroneously continuing to run its backup process, meaning that it spent most of its time trying to copy from the other database when it should’ve been servicing requests from servers. During this time, we discovered further issues, and we made further improvements–we found a since-deprecated-but-taxing query we could eliminate entirely from the database, we optimised eligibility checks for players when they join a game, further alleviating the load, and we have further performance improvements in testing as we speak. We also believe we fixed the database-reconnect storms we were seeing, because we didn’t see it occur on Tuesday.”
This is the point where I keep hearing my brother’s advice in my head: “Never get into networking.”
Somehow, Diablo 2 hadn’t had enough. The game enjoyed its best-ever highs for concurrent players on the Wednesday Australian time—after almost a week of constant login issues and crashes. Blizzard says there were “a few hundreds of thousands of players in one region alone”—which could either be a lot or relatively standard, depending on how Blizzard’s servers define regions. (A few hundred thousand would be hugely impressive for, say, Australia. For a “region” like the United States, not so much, but if that region was a small part of the United States, then maybe it would be. The blog post doesn’t specify here.)
According to the devs, one of the biggest problems causing all of this is how the original Diablo 2 handles core pieces of player behavior. While Vicarious Visions updated the original D2 code where they could, a large part of the project was keeping what code worked.
Which was fine, right up until the point where it stopped scaling.
Diablo 2 has a particular way in which it pulls data from the central database to make sure players can do the things they want to do. Joining a game? That’s calling back to the central database. Want to switch characters? That’s another check to central command to make sure you get the character you asked for, in the spot where you left it, with all the gear you’d worked for.
Diablo 2, according to the team, was designed to be centralized. The downside is that only a single instance of this particular service can be run at any one time, so they can’t offload some of the weight to regional servers.
“Importantly, this service is a singleton, which means we can only run one instance of it in order to ensure all players are seeing the most up-to-date and correct game list at all times,” the devs wrote. “We did optimize this service in many ways to conform to more modern technology, but as we previously mentioned, a lot of our issues stem from game creation.”
For now, there’s a range of short-term solutions and roadmaps to rewrite Diablo 2’s architecture so it can better scale for modern demand. The service that just provides a list of games to players, for instance, is being broken out into a service of its own.
The devs will also be introducing a login queue, à la World of Warcraft, to prevent situations where the servers get overloaded when hundreds of thousands of game instances are launched all at once:
“To address this, we have people working on a login queue, much like you may have experienced in World of Warcraft. This will keep the population at the safe level we have at the time, so we can monitor where the system is straining and address it before it brings the game down completely. Each time we fix a strain, we’ll be able to increase the population caps. This login queue has already been partially implemented on the backend (right now, it looks like a failed authentication in the client) and should be fully deployed in the coming days on PC, with console to follow after.”
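As a rough picture of how a capped login queue behaves, here is a small Python sketch: players are admitted until a population cap is hit, everyone else waits in arrival order, and raising the cap lets more of the queue in. The `LoginQueue` class and its methods are hypothetical, and nothing here reflects how Blizzard’s queue is actually built.

```python
from collections import deque


class LoginQueue:
    """Hypothetical first-come, first-served queue behind a population cap."""

    def __init__(self, cap):
        self.cap = cap
        self.online = set()     # players currently admitted
        self.waiting = deque()  # players queued in arrival order

    def request_login(self, player):
        # Admit immediately if there is room, otherwise join the queue.
        if len(self.online) < self.cap:
            self.online.add(player)
            return True
        self.waiting.append(player)
        return False

    def logout(self, player):
        # A departing player frees a slot for whoever has waited longest.
        self.online.discard(player)
        self._admit()

    def raise_cap(self, new_cap):
        # Once a strain is fixed, the cap can go up and the queue drains.
        self.cap = max(self.cap, new_cap)
        self._admit()

    def _admit(self):
        while self.waiting and len(self.online) < self.cap:
            self.online.add(self.waiting.popleft())
```

The appeal of this design is that the load on the backend is bounded by a number the operators choose, rather than by however many people hit the login button at once.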
Players will also be rate limited, but only in instances where games are being created, closed and recreated in short spaces of time, which is mostly instances where players are farming areas like Shenk & Eldritch or Pindleskin. “When this occurs, the error message will say there is an issue communicating with game servers: this is not an indicator that game servers are down in this particular instance, it just means you have been rate limited to reduce load temporarily on the database, in the interest of keeping the game running,” Blizzard advised.
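Rate limiting of this kind is often done with a sliding window: count how many games a player has created recently and refuse new ones past a threshold. The Python sketch below shows that general technique; the `GameCreationLimiter` name and the specific limits are made up for illustration, since Blizzard has not published the thresholds or method it uses.

```python
from collections import deque


class GameCreationLimiter:
    """Hypothetical per-player sliding-window limit on game creation."""

    def __init__(self, max_games, window_seconds):
        self.max_games = max_games
        self.window = window_seconds
        self.history = {}  # player -> deque of recent creation timestamps

    def allow(self, player, now):
        recent = self.history.setdefault(player, deque())
        # Forget creations that have aged out of the window.
        while recent and now - recent[0] >= self.window:
            recent.popleft()
        if len(recent) >= self.max_games:
            return False  # too many games too quickly: rate limited
        recent.append(now)
        return True
```

For example, `GameCreationLimiter(3, 60)` would let a player spin up three games inside a minute and turn away the fourth until the oldest one falls outside the window, while leaving other players unaffected.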
It all sounds like an absolute nightmare, to be honest, and I feel for the engineers who have what looks like months of retroactive fixes in front of them. There’s a school of internet thought that says, well, Blizzard should have seen this coming and planned for it. But that’s also fundamentally part of the risk you take with remasters. These games were written back in an age where information and multiplayer services didn’t have the popularity or ease of access that we have today, and it’s difficult to know whether a lot of that old infrastructure scales the way we think it might. Sometimes it does — right up until the point where it all collapses in a flaming heap.
This article originally appeared on Kotaku Australia.