First of all, I agree with you that the code could use some optimization. That said, I'm not sure how complex the server codebase is as it stands.
It's interesting that you didn't see any performance hits. Did you notice anything else in particular? Hard limits, etc.? Just curious for myself.
That still doesn't necessarily sound great to me. Being on a NAS makes it more susceptible to efficiency problems, though I'll take your word for it that you're pretty confident that's not causing any issues.
Nice one, I've definitely seen harshly diminishing returns or even overloading happening with too many threads or unbalanced configs.
I see the same thing. It's pretty frustrating not being able to parallelize it when it seems like such an easy candidate for some sort of sharding. Keep in mind that the extra workers, because they're threaded, will also end up hogging iowait time depending on the implementation, but I think you're hitting the nail on the head: at the end of the day there seems to be roughly one thread per user.
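If you want to put a number on the iowait part, here's a rough sketch of how I'd sample it. This assumes a Linux box (it reads the aggregate `cpu` line from `/proc/stat`, where iowait is the fifth counter), and the function names are just ones I made up for illustration, nothing from the server itself.

```python
import time


def parse_cpu_line(line):
    """Return (iowait, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in fields[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # guest and guest_nice are already counted in user and nice, so sum only the first 8.
    iowait = values[4]
    total = sum(values[:8])
    return iowait, total


def iowait_fraction(before, after):
    """Share of CPU time spent in iowait between two (iowait, total) samples."""
    d_io = after[0] - before[0]
    d_total = after[1] - before[1]
    return d_io / d_total if d_total > 0 else 0.0


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Take two readings `interval` seconds apart and return the iowait fraction."""
    def read():
        with open(path) as f:
            return parse_cpu_line(f.readline())

    before = read()
    time.sleep(interval)
    return iowait_fraction(before, read())
```

Running `sample_iowait()` once with the default worker count and once with the extra workers should at least hint at whether the added threads are mostly sitting in I/O wait. Treat it as a ballpark figure: it's system-wide, not per process.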
I wasn't sure how much you had looked into your proc and your running system. Those things can be far more complex than they let on, *especially* if you're virtualizing, which I'm assuming you're not. Also, sorry about the link, I didn't notice it; the messages were pretty dense. Just looking at the numbers in your config, I've seen similar issues with the same kinds of ratios, and they get exacerbated by fiddling with more settings.