BloodyIron

Big-ish server, server software looks to be the cause, please help

With the whole Pandemic thing going on and Avorion 1.0 releasing, we wanted to put together an event where we wiped our Galaxy and tried to get more gamers engaged on our server ( https://www.lanified.com/News/2020/Join-Our-Galaxy-Adventures ). We have been successful at getting more gamers to play regularly, however we have had performance issues with the Avorion dedicated server software just not scaling well at all. We cannot identify a hardware or resource bottleneck at this point in our environment, so we believe confidently (but would love to be proved wrong) that the issue lies with inefficiencies in the game server itself.

 

Before we start uploading server logs and the like, I want to describe the resources allocated and what we've tried so far.

 

Since we launched the new Galaxy we've had a steady stream of gamers, having anywhere from 4-13 gamers connected simultaneously (we have the limit currently set to 50), with people connecting from around the globe. We're in Canada.

 

Pretty much the whole time the server has been running since we started this Friday, the gamers keep seeing "Server load is over 100%" in the top left, and the in-game server frame graphs (what I assume they are) are continually yellow or red, rarely green. When we see lots of red, hit registration goes out the window and other wonkiness ensues, until it clears up for whatever reason.

 

We can't yet find a pattern around this in regards to what the players are doing. I don't believe anyone has fighters yet, and most certainly there aren't any wars going on.

 

As for the resourcing... the Avorion dedicated server runs in an Ubuntu 16.04 VM with regular updates. We may upgrade to 18.04 in the near future, but not right now as it's being used regularly.

 

The VM resides on a striped-mirror array of SSD storage that runs ZFS, which is very fast. When we look at the ZFS IO stats we do not see any concerning level of IOPS or throughput that could be bottlenecking so we are very confident this is not a storage issue.

 

We have, however, scaled up the CPU, RAM and workers for the server several steps to try to address the issue, and while it has helped, it has not properly eliminated the issue.

 

First step:

CPU: 6x cores

RAM: 4GB

 

Second step:

CPU: 12x cores

RAM: 8GB

 

Third step:

CPU: 16x cores

RAM: 24GB

 

As we upped the CPU we upped the workers (current values I'll list below), and the earlier steps did help, but we feel we have plateaued from the CPU/worker aspect at this point. The RAM upping certainly was necessary, as we are now using 12GB of RAM out of our 24GB. We can add more RAM if need be, but I see no reason to do that.

 

Here are the current worker parameters:

workerThreads=32

generatorThreads=16

scriptBackgroundThreads=16

 

We have tried 16/8/8 when we had 16x CPU cores, however we have not seen a difference between 16/8/8 and 32/16/16, so for now I am leaving it at 32/16/16 as I have to restart the server each time to change that (don't want to scare away/frustrate our gamers).

 

I have also had profiling on multiple times, and took a look at the worker map html for the 32/16/16 configuration, generating that map when a "red storm" happened (red graphs), and I did not see anything stick out as "oh we need more workers" or something is problematic, etc.

 

Additionally, when profiling is off, the server console frequently spits out "Server Frame took over 1 second" and then shows the frame chart kind of thing.
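For anyone wanting to compare configurations, one rough way to put a number on this is to count how often that warning shows up in a saved console log before and after a change. This is only a sketch: it assumes the message text appears verbatim somewhere on a log line, and `count_slow_frames` is a hypothetical helper name, not anything shipped with the server.

```python
def count_slow_frames(log_lines, needle="Server Frame took over 1 second"):
    """Count lines containing the slow-frame warning (case-insensitive).

    Assumption: the warning text quoted above appears verbatim on a line.
    """
    needle = needle.lower()
    return sum(1 for line in log_lines if needle in line.lower())


# Made-up sample lines for illustration only, not real server output.
sample = [
    "Server Frame took over 1 second",
    "some unrelated console line",
    "server frame took over 1 second",
]
print(count_slow_frames(sample))  # prints 2
```

To run it against a real file, pass the open file object, since iterating a file yields its lines.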

 

The VM runs on Proxmox VE, which is effectively Linux KVM at its core, a very fast and solid hypervisor. The host is currently a Dell R720 with 2x Intel Xeon E5-2650 (v0) CPUs. The Linux OS reports a load average of 6/6/6, so while we have lots of worker threads, I do not see any sort of thread queueing being an issue here.
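The reasoning about load average can be sketched as a quick check: divide the 1-minute load by the number of cores the VM sees. A value well under 1.0 suggests runnable threads are not queueing for CPU. This is a minimal sketch, and `load_per_core` is a hypothetical helper name; `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def load_per_core(load_1min, cores):
    """Return the 1-minute load average divided by the core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_1min / cores


# Numbers from the post: load average of 6 on a 16-core VM.
print(f"{load_per_core(6, 16):.3f}")  # prints 0.375

if __name__ == "__main__":
    # Live check on the machine running this script (Unix only).
    load_1min, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"live load per core: {load_per_core(load_1min, cores):.2f}")
```

Note that a low load per core only rules out CPU queueing overall. A single saturated thread would still look fine by this measure.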

 

As for the upload bandwidth, the connection is 15 Mbps and we're using about 4-5 Mbps for the Avorion server, and nothing else is pushing us close to our upload limit. The gateway is a very reliable and fast pfSense box, and the CPU on that is not getting pinned, so it is not a routing performance issue (I see these issues on LAN too, btw).

 

So, at this point, I am very confident this is a software efficiency issue.

 

I can provide server logs privately to the developers and other debug info where possible. I'm not comfortable posting that publicly.

 

If there are any areas that I may have overlooked, I'm all ears, because I've run out of online resources that I could dig up to help me with this problem. I read through all of the release notes for the last 2 years, and could not find anything to help me with this case.

 

We really want to be able to scale this up to 20-30+ gamers, but based on what I'm seeing, the experience is going to suffer more and more as we scale up.

 

Please help!


It’s possible no one knows enough about this yet.  Sorry. :(

 

Well I certainly can understand that. I just want to make sure the right eyes are seeing this is all, so we can help get it addressed, by doing bug reports, debug filings, etc. :P

 

So if this is the right forum, great. If it isn't, well I can post elsewhere? Hmmmm.


So, curious situation, increased "aliveSectorsPerPlayer" to 12 (was 7), decreased "workerThreads" to 16 (was 32) & "generatorThreads" to 8 (was 16), leaving "scriptBackgroundThreads" at 16. And now the server seems to be running better.

 

Right now just two players online, so we'll have to see how it holds up when more players jump back on, but the "server 100% load" alert for players is not currently happening (again, 2 players).

 

Still think this game needs actual programmatic tuning, but figured I'd share this possible improvement.


The Server load didn't seem to increase until a particular player logged on, destroyed 5-6 ships and started salvaging them. So that seems like a probable candidate for a cause of the drastic server load increase. And an area where efficiency increases probably would help lots.

 

Not 100% sure, but IMO worth looking into, devs. ;P


So, curious situation, increased "aliveSectorsPerPlayer" to 12 (was 7), decreased "workerThreads" to 16 (was 32) & "generatorThreads" to 8 (was 16), leaving "scriptBackgroundThreads" at 16. And now the server seems to be running better.

 

In my experience, 12 is a bit overkill for aliveSectorsPerPlayer on most servers. You may want to cap it closer to 8 or 10, as it seems to have a slowing effect on the worker threads the more of both you add.

 

You probably want to start by increasing the ratio between worker threads and generator/script threads as per the docs  (https://avorion.gamepedia.com/Setting_up_a_server#Examples_for_different_machines:).

 

It seems like every person I've talked to has had similar issues when they don't stick to the 4:1 to 3:1 ratio the developers suggest. Also verify that your processor supports hyperthreading if you're going to run more threads than you have cores, or you'll run into more issues.

 

As a final note on the matter of ZFS: assuming you're using Linux, it's worth noting that ZFS is not natively supported on Linux and has a rather unique read/write architecture. I'm not going to say that it's compounding the issues, but I will say that you may want to work with a more classic/predictable filesystem until you have these issues ironed out, or at the very least configure your ZFS cluster in an extremely vanilla manner. Just a thought!

 

Also here's a little example of how I might configure your server (having used much weaker boxes).

 

16x Cores

24GB Memory

saveInterval=600
sectorUpdateTimeLimit=300
emptySectorUpdateInterval=2
workerThreads=16
generatorThreads=4
scriptBackgroundThreads=3
aliveSectorsPerPlayer=8
weakUpdate=true
profiling=true
sendCrashReports=true
hangDetection=true
backups=true
backupsPath=
simulateHighLoadServer=false
sendSectorDelay=2
placeInShipOnDeathDelay=7
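To make the ratio point concrete, here is a small sketch that reads `key=value` settings like the ones above and prints the worker-to-generator and worker-to-script thread ratios. The helper names `parse_settings` and `thread_ratios` are made up for illustration, and the parsing assumes plain `key=value` lines as shown in this thread.

```python
def parse_settings(text):
    """Parse key=value lines into a dict, skipping blanks and lines without '='."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def thread_ratios(settings):
    """Return workerThreads divided by each of the other two thread settings."""
    workers = int(settings["workerThreads"])
    return {
        "worker:generator": workers / int(settings["generatorThreads"]),
        "worker:script": workers / int(settings["scriptBackgroundThreads"]),
    }


# The 32/16/16 setup from the first post works out to 2:1 for both ratios.
original = "workerThreads=32\ngeneratorThreads=16\nscriptBackgroundThreads=16"
print(thread_ratios(parse_settings(original)))
```

Running the same check on the values listed above (16/4/3) gives 4:1 for worker-to-generator threads.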


1. We were already having performance issues with sectors at 5, and as we upped it to 7, and then later to 12, the problems stayed about the same. I only upped it because our player base was requesting it, and we observed no change in how bad the server was performing as we gradually upped it.

2. ZFS is on a dedicated NAS, not local to the hypervisor itself. FreeNAS is running the ZFS zpool, so yeah.

3. I take it you didn't read the model of CPUs we're using. I guarantee it has HT ;)

4. We had a much higher worker count, but saw zero benefit from it. When trying to profile it and look at the worker thread graphics that were generated, a lot of worker threads were doing nothing, so we lowered the count in an effort to increase efficiency. Either way, nothing changed.

5. At this point, I am 100% confident this is bad code, and I'm very optimistic about the next patch helping, as the beta patch notes already talk about server performance fixes.

6. Also, I turned profiling off to try and improve performance and reduce spam to the console (so I could actually read the console). Not sure why you're recommending it to be on in a situation that involves poor performance.

7. I had already, multiple times, referenced the link you mention, as we were building the server.


 

First of all, I agree with you that the code could use some optimization. That being said, I'm not sure how complex the server codebase is as it stands.

 

1. We were already having performance issues with sectors at 5, and as we upped it to 7, and then later to 12, the problems stayed about the same. I only upped it because our player base was requesting it, and we observed no change in how bad the server was performing as we gradually upped it.

That's interesting that you didn't see any performance hits, did you notice anything else in particular? Hard limits, etc.? Just curious for myself.

 

2. ZFS is on a dedicated NAS, not local to the hypervisor itself. FreeNAS is running the ZFS zpool, so yeah.

That still doesn't sound necessarily great to me. The fact that it's on a NAS means that it's susceptible to more efficiency problems, though I have to take your word on that you're pretty confident that's not causing any issues.

 

3. I take it you didn't read the model of CPUs we're using. I guarantee it has HT ;)

Nice one, I've definitely seen harshly diminishing returns or even overloading happening with too many threads or unbalanced configs.

 

4. We had a much higher worker count, but saw zero benefit from it. When trying to profile it and look at the worker thread graphics that were generated, a lot of worker threads were doing nothing, so we lowered the count in an effort to increase efficiency. Either way, nothing changed.

5. At this point, I am 100% confident this is bad code, and I'm very optimistic about the next patch helping, as the beta patch notes already talk about server performance fixes.

I see the same thing. It's pretty frustrating not being able to parallelize it when it seems like such an easy candidate for some sort of sharding. Keep in mind that the extra workers, being threads, will also end up hogging iowait time depending on the implementation, but I think you're definitely hitting the nail on the head: at the end of the day there seems to be maybe one thread or so per user.

 

6. Also, I turned profiling off to try and improve performance and reduce spam to the console (so I could actually read the console). Not sure why you're recommending it to be on in a situation that involves poor performance.

7. I had already, multiple times, referenced the link you mention, as we were building the server.

I wasn't sure how much you had looked into your processor and your running system. Those things can be far more complex than they let on, *especially* if you're virtualizing, which I'm assuming you're not. Also, sorry about the link; I didn't notice that you had mentioned it. The messages were pretty dense, and just looking at the numbers in your config, I've seen similar issues with the same types of ratios that get exacerbated by fiddling with more settings.


Not sure what you're asking for about anything else in particular, so unsure how to answer that question.

 

In regards to the storage, I've monitored it from within the system itself (and yes, this is inside a VM) but also at the storage level. I do not see IOPS contention at the storage level, and I have not observed iowait on the Avorion side (in the VM). The amount of storage traffic that is happening is nowhere near saturation of the network links, nor the IOPS that the storage can serve.
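For reference, the iowait check inside the VM can be sketched roughly like this: take two samples of the aggregate `cpu` line from `/proc/stat` and see what share of the elapsed CPU time was spent in iowait. This is just an illustration of the idea, not the exact tooling used here; the function names are hypothetical and the sample lines below are made-up numbers.

```python
def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of ints."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in fields[1:]]


def iowait_fraction(before, after):
    """Share of CPU time spent in iowait between two /proc/stat samples."""
    b, a = parse_cpu_line(before), parse_cpu_line(after)
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Only the first eight are summed, since guest time is already counted in user/nice.
    deltas = [y - x for x, y in zip(b, a)][:8]
    total = sum(deltas)
    return deltas[4] / total if total > 0 else 0.0


# Made-up samples: 100 of 1050 elapsed ticks were iowait.
before = "cpu 100 0 50 800 50 0 0 0 0 0"
after = "cpu 200 0 100 1600 150 0 0 0 0 0"
print(f"{iowait_fraction(before, after):.3f}")  # prints 0.095
```

On a live system the two strings would come from reading the first line of `/proc/stat` a few seconds apart.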

 

I may not have mentioned that I used that server config link, was more pointing out that I exhausted multiple resources that I dug up online (including that one) before I posted here, as I would have preferred to solve this issue by myself, however the devs do need to get involved here.


The common experience of most hosters is that you get the best performance on a dedicated Intel i7/i9. Running Avorion in a VM with NAS storage and Intel Xeon CPUs may not be the best solution. Please upload some /status printouts from when the performance drops, plus a full server log, to pastebin.com.
