A series of problems lasting 16 years

Not so long ago, at the dawn of this millennium, on a cold November day in 2004, I sat down to write a server emulator for an online game. The writing came easily, in eye-pleasing C# on .NET Framework 1.1. I set myself no special goals, and I had relatively little experience. For some reason the community appreciated this handiwork (perhaps because it appeared before the official launch of the main game?), and within a few months I was facing explosive growth in the number of players online and, with it, serious performance problems. The project lived for 6+ years, reached notable heights (2,500 players online at its peak, about 20,000 MAU), and then passed away peacefully. And now, a decade and a half later, I decided to make my own MMO game on the same “time-tested” code base and ran into similar problems, even though I had already solved them once.



P.S. No intellectual property was harmed in the writing of this article. The original project, although steeped in the spirit of piracy (a free server for a paid game!), violated no rights: none of the copyright holder's code was used, and the server was based entirely on studying a legitimately purchased game client and on the developer's common sense. This opus covers only the challenges the author faced and the original ways he solved them, in both the old project and the modern one. I apologize in advance for telling this as a story rather than simply listing the facts.



Introduction



You can argue as much as you like that .NET is not for servers, but it seemed to me then (and still does) a very sensible idea: you can write logic as scripts, then compile and load them on the fly, without thinking too much about memory allocation, garbage collection, pointers, and so on. In effect, this lets you delegate the scripting of business logic to less experienced developers and limit your own involvement to code review. But for that to work, the core itself has to run without failures, and it began to fail at 10-15 players online, both in 2004 and in 2020.

In 2004, everything ran on Windows Server 2003, .NET 1.1, and MSSQL 2000. The server and hosting were provided by the ISP Wnet, and later a new server was built with donations from players. The project was not purely commercial; the modest income from banners and premium accounts went into upgrades.
The modern server runs on Mono under Debian in .NET 4.7 compatibility mode, with MariaDB for data, hosted in the Hetzner cloud. The starry-eyed idealist who believed that games should be free, and that donations and the sale of in-game items kill all the fun, is long gone. That character has since gone quite gray, traded enthusiasm for experience, and is now convinced that a startup should bring both pleasure and income.



But this tale is not about that; it is about homegrown servers and their problems.



Chapter 1. Pestilence





. , , , . , , . , , . Visual Studio, - , . EventLog .



โ€” , Console.Out Console.Error. UnhandledExceptionHandler, . AutoFlush = true, , .



cmd โ€” , . , , , - โ€” , . - โ€” .Net >> log.txt.



UnhandledExceptionHandler : OutOfMemoryException ( ), StackOverflowException Unmanaged . , โ€” Access Violation - OOM.

Access Violation โ€” ZLib ( ICSharpCode.SharpZipLib), OpenSSL ( SRP-6), MySQL ( System.Data MSSQL ).



, Socket.BeginReceive . .Net Thread Pool ( , IO Threads) , UnhandledExceptionHandler. , BeginReceive->EndReceive->BeginReceive , BeginReceive .

All this improved the picture significantly, and the server began to crash far less often, mostly only when memory ran out.
In 2020, the server was a console application from the start, running in its own screen session on Linux. Launching it under Visual Studio was no longer an option, but the logger had become very advanced over the years, unhandled exceptions were caught like rabbits in a net, and there was no native code at all. None of that, however, saved it from crashing with OOM and StackOverflowException. The stack depth in the StackOverflowException case had grown tenfold, filling hundreds of kilobytes of log with identical messages while refusing to produce a proper stack trace. Still, redirecting output with >> log.txt quickly made it clear who was to blame and where. A Telegram bot helped separately by signaling that the server process had died.
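The combination described above (a last-chance exception handler plus console output redirected to a file) can be sketched roughly as follows. This is a minimal illustration, not the author's logger: the `CrashLog` class and its method names are hypothetical, and only `AppDomain.CurrentDomain.UnhandledException` and the `TextWriter` calls are standard .NET APIs.

```csharp
using System;
using System.Globalization;
using System.IO;

// Hypothetical helper: names are illustrative, not taken from the original project.
public static class CrashLog
{
    // Formats an exception as one log record with a UTC timestamp.
    public static string Format(Exception ex, DateTime utcNow)
    {
        return string.Format(CultureInfo.InvariantCulture,
            "[{0:yyyy-MM-dd HH:mm:ss}Z] FATAL {1}: {2}{3}{4}",
            utcNow, ex.GetType().FullName, ex.Message,
            Environment.NewLine, ex.StackTrace);
    }

    // Registers a last-chance handler that writes to the sink and flushes at once,
    // so the record survives even if the process dies right afterwards.
    public static void Install(TextWriter sink)
    {
        AppDomain.CurrentDomain.UnhandledException += (sender, args) =>
        {
            var ex = args.ExceptionObject as Exception;
            sink.WriteLine(ex != null
                ? Format(ex, DateTime.UtcNow)
                : "FATAL non-CLR exception: " + args.ExceptionObject);
            sink.Flush();
        };
    }
}
```

Calling `CrashLog.Install(Console.Error)` at startup and launching the process with `>> log.txt 2>&1` would put these records into the same file as the regular output. A handler like this does not help with every failure: on Microsoft's runtimes a StackOverflowException terminates the process without running managed handlers, which is one reason the raw console redirect remains useful.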



After that it was a matter of technique. Studying the logs showed that the stack overflow occurred not in the core but in the business logic: a rocket collided with another rocket or a mine, they detonated, that triggered the detonation of the first rocket, and so on in a circle. All in all, an ordinary working moment, but that was when I felt a strange déjà vu, fighting long-forgotten demons of the past. And then a new (or long-forgotten old) cause of the pestilence appeared: a lack of resources.
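The mutual detonation loop described above is a classic re-entrancy bug, and one common way to break it is a "already detonated" flag that is set before neighbours are notified. The sketch below only illustrates that idea; the `Explosive` class and its members are hypothetical and are not taken from the game's code.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical game object: a rocket or a mine that can set off its neighbours.
public class Explosive
{
    private bool _detonated;   // set once, before any chain reaction starts

    public List<Explosive> Nearby { get; } = new List<Explosive>();
    public int DetonationCount { get; private set; }

    public void Detonate()
    {
        if (_detonated) return;    // already exploding: break the cycle
        _detonated = true;         // mark BEFORE notifying neighbours
        DetonationCount++;
        foreach (var other in Nearby)
            other.Detonate();      // a neighbour calling back into us returns immediately
    }
}
```

With the flag in place, two objects that list each other in `Nearby` each detonate exactly once. A very long chain of distinct objects would still recurse once per object, so a work queue instead of direct recursion may be a safer design when chains can get long.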



Chapter 2. Famine





โ€” 256 , ! - , , , , โ€” , OOM - . , โ€” Visual Studio ( , ), WinDbg (), - dotTrace (). , . โ€” , 1.7, . . 100%. , , , โ€” ~100 . Maoni Stephens Rico Mariani GC, LOH (Large Object Heap) .Net. , (pin) , Gen 2, โ€” LOH, . โ€” , , , (, .Net 1.1 Generics!). โ€” , - , . Marshal.AllocHGlobal ( - , ). , , . , , , 100% CPU - . Interop WSASend/WSAReceive ( Windows , .Net) . - , .Net : BeginSend/BeginReceive , , 100% CPU.



, , , , , . , - 100% , !



, 2005 Workstation GC Server GC .Net 2.0 Preview. โ€” , GC , 5-10% CPU.



, , Thread Pool Net 1.1 Workstation GC , ( !) ( 100% ).

BeginSend/BeginReceive Windows IOCP . , , , OOM 100% .
A modern server with less than 4 GB of memory raises a smile, and with a cloud solution you can add an extra 8-16 gigabytes in a couple of clicks and one restart. Nevertheless, when memory began to leak and CPU load jumped to 100-150% (out of 800% for 8 cores), I once again felt like a 20-year-old student, burning gigabytes and gigaflops in the firebox of a voracious machine. It was strange, abnormal, and stupid. It was especially unpleasant that, just as before, the game kept running normally (albeit with lag) and nothing crashed. Until the memory ran out, of course.



Over the years, lightweight threads (aka fibers) managed to appear and disappear, and because of them we no longer have access to system threads in .NET, only to so-called managed threads; on Mono there is still no access to ProcessThread, just stubs inside. Thread diagnostics had become much harder, but I now used my own thread pool: every thread was accounted for and named, and for each one I kept accurate statistics on what it was currently executing and how long each task took. Thanks to this, I quickly established that the problems were now in my code rather than in system code, and the thread statistics showed that the gluttony came from executing business logic: some actions were simply being performed 100 times more often than they should have been. This time I was not short of resources, so I calmly wrapped every script and timer call in additional logging, measured the execution time of every event, and after a week of experiments could say with confidence what the problem was.

It turned out that one NPC was trying to attack another NPC while both were stuck in rocks, so they could not move, and their attempts to shoot at each other were interrupted instantly for lack of line of sight. Yet on every behavior cycle (15 ms) they tried to compute a path and started to shoot; because the shot could not be fired, the guns never went into reload, and on the next cycle everything repeated. Over several days of play, hundreds of such NPCs accumulated and eventually consumed all of the server's resources. The fix was to correct the behavior and reduce the number of stuck situations, and also to apply a short reload time even to unsuccessful shots.
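The kind of per-script instrumentation mentioned above (counting how often each script or timer runs and how long it takes) might look roughly like this. It is a minimal sketch under assumptions: `ScriptStats`, `Measure`, and the script name used in the usage note are hypothetical, and the real server's statistics were certainly richer.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;

// Hypothetical collector of call counts and elapsed time per named script or timer.
public sealed class ScriptStats
{
    private sealed class Entry { public long Calls; public long TotalTicks; }

    private readonly Dictionary<string, Entry> _entries = new Dictionary<string, Entry>();
    private readonly object _sync = new object();

    // Runs the action and records one call plus its elapsed Stopwatch ticks,
    // even if the action throws.
    public void Measure(string name, Action action)
    {
        long start = Stopwatch.GetTimestamp();
        try { action(); }
        finally
        {
            long elapsed = Stopwatch.GetTimestamp() - start;
            lock (_sync)
            {
                Entry e;
                if (!_entries.TryGetValue(name, out e)) _entries[name] = e = new Entry();
                e.Calls++;
                e.TotalTicks += elapsed;
            }
        }
    }

    // Number of recorded calls for a name, or 0 if it was never measured.
    public long CallsFor(string name)
    {
        lock (_sync)
        {
            Entry e;
            return _entries.TryGetValue(name, out e) ? e.Calls : 0;
        }
    }

    // Total time spent under a name, converted from Stopwatch ticks to milliseconds.
    public double TotalMillisecondsFor(string name)
    {
        lock (_sync)
        {
            Entry e;
            return _entries.TryGetValue(name, out e)
                ? e.TotalTicks * 1000.0 / Stopwatch.Frequency
                : 0.0;
        }
    }
}
```

Wrapping each script invocation as `stats.Measure("npc.attack", () => RunAttackScript(npc))` (both names made up here) and dumping the counters periodically is usually enough to spot an action that runs a hundred times more often than expected.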



And then the server started to freeze.



Chapter 3. Cold





Autumn 2005 was not easy: my job situation was uncertain and my apartment rent suddenly doubled. The game server was my only joy, with hundreds of players already online, but problems began there too: the whole world started to freeze. At best, pings still went through or some timers kept firing. And sometimes everything froze, traffic stopped, and I had to kill the server application and start it again. As before, attaching a debugger to the running server was impossible because of its heavy resource consumption and slowness; for some reason, Visual Studio simply crashed or hung when I tried.



โ€” , . , - . , - . SOS.dll. Son Of Strike WinDbg .Net , , . , .Net GC. - sos.dll 50. , , , . , โ€” deadlock!



, . โ€” . โ€” , , , , ! , . SpinLock try/finally . , , โ€” , SpinLock , , , , , . 8 , . , : , , โ€œ โ€. , . , , โ€” .



, , Xeon 5130x2 8 . 2000, 2500, . , , , , -, . .
On one cold October day in 2020, the planned arrival of live streamers was disrupted because the server suddenly froze. Authorization worked, but it was impossible to enter the world, and the Telegram bot was silent. A quick search turned up nothing in the logs, there were no memory problems, and none of the threads were starving. It had just stopped. After saying aloud several times something about a cat from the matrix and a woman of indecent behavior, I went looking for a deadlock. Since Microsoft bought Xamarin, and Miguel de Icaza along with it, the Mono documentation has been a pitiful sight: it exists, but it is out of date or leads nowhere. For example, three quarters of the information on the page about debugging Mono with gdb is not applicable and does not work. I managed to attach to the frozen server with gdb, but call mono_pmip and similar commands returned unintelligible answers, mostly about syntax errors. By some miracle I realized that gdb wanted me to cast the parameters and results of the mono_* commands to specific types, and in the end I was able to get a list of the threads frozen in the deadlock. But the numbers in that list matched neither the output of ps nor the ManagedThreadId values from the server. The extended logging I had added while hunting the CPU burn helped a lot: from it I could see which packets and timers had executed last, and I gradually began to narrow the circle of suspects. As luck would have it, the deadlock involved not two threads but three, so a more detailed picture was out of reach. Then I remembered the old rake and started going through the code that used locks. It turned out that several refactorings had gone by over the years, and SpinLock had gradually been replaced by Monitor.Enter / Monitor.Exit, and often by a plain lock.
And then Eric Gunnerson's article caught my eye. It says you can do things much more simply: use Monitor.TryEnter with a timeout everywhere, and throw an exception if the lock cannot be acquired. It is an incredibly simple and very effective method: if a TryEnter call somewhere waited more than 30 seconds and gave up (and such delays are not typical of the logic), that spot has to be investigated to find out who could have taken the lock object for so long without releasing it. Sprinkling ashes on my head, I realized that I could have cleaned everything up this way 15 years ago, and there had been no need to reinvent the wheel by calculating the “depth of the hole”. But maybe it was for the best back then.
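The TryEnter-with-timeout approach described above can be sketched as a small helper that either returns a disposable lock token or throws. This is an illustration of the idea rather than the code from the referenced article or from the server: the `TimedLock` name and the exception message are assumptions, while `Monitor.TryEnter(object, TimeSpan)` and `Monitor.Exit` are the standard APIs.

```csharp
using System;
using System.Threading;

// Hypothetical helper: acquire a monitor with a timeout instead of waiting forever.
public static class TimedLock
{
    // Takes the monitor on target, or throws if it cannot be taken within the timeout.
    public static IDisposable Enter(object target, TimeSpan timeout)
    {
        if (!Monitor.TryEnter(target, timeout))
            throw new TimeoutException(
                "Possible deadlock: lock not acquired within " + timeout);
        return new Releaser(target);
    }

    // Releases the monitor when disposed, so the helper works with a using block.
    private sealed class Releaser : IDisposable
    {
        private readonly object _target;
        public Releaser(object target) { _target = target; }
        public void Dispose() { Monitor.Exit(_target); }
    }
}
```

A plain `lock (world) { ... }` then becomes `using (TimedLock.Enter(world, TimeSpan.FromSeconds(30))) { ... }`, where `world` stands for whatever object was being locked. The stack trace of the resulting TimeoutException points at the waiting side, which narrows down where to look for the thread that is holding the lock. The sketch allocates a small object per acquisition; a struct-based token would avoid that at the cost of slightly more careful usage.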



Well, and then the fourth horseman came to the new project, just as he once had to the emulator. Only this project never had time to become popular. The presence of three critical problems right at launch knocked it down quickly, and the game turned out to be far from mainstream. But that, too, is not a topic for this article.



P.P.S. The article uses illustrations by an unknown artist, Parsakoira, signed “ChoW #227 :: VOTING :: 4 Horsemen of the Apocalypse”, presumably from the now-defunct site conceptart.com:

https://www.pinterest.com/pin/460141286926583086/

https://www.pinterest.com/pin/490681321879914768/


