As CPUs have more and more cores available, big performance gains can be achieved by having your application take advantage of running tasks in parallel. An increased level of parallelism however introduces more complexity. An intinsic problem when trying to run things in parallel is deciding what to run in parallel. The amount of tasks than can run completely in parallel are seldom and some type of synchronisation will be required most of the time, already decreasing the possible gain you can make.
Another problem you face is to determine how much you want to run in parallel. Do you want to start a thread for every single small task you can think of? Or do you want to have a couple of bigger threads? Since you know that creating threads comes with a cost, but can you avoid this cost by using a thread pool instead? There are so many questions to be answered when starting to work with parallelism, but today I will be focussed on determining how much threads you should have running given a certain capacity of the CPU.
The amount of work you can (and want to) parallalise should depend on the CPU of the system itself. The more cores your CPU has, the more you can gain from running tasks in parallel, but this is not the only thing you should consider. The type of tasks you are running will determine your potential gains just as much. We all know that CPUs nowadays are much faster then reading or writing to file, and doing this type of IO in a blocked fashion will result in underusage of your CPU.
I created some experiments to see how different scenario’s affect the time it takes to run a certain task spread out over multiple threads. In these experiments I distuingish between two type of operations:
- Computational: which I do by just increasing a number.
- IO: which I simulate by having a fixed sleep.
All experiments were done on my laptop (with relatively weak hardware) consisting of a dual core (with hyperthreading) CPU. Results will vary on different systems but can provide a guideline.
For the first experiment I increase a number 100 milion times, and took the average run duration over 100 runs.
The results are a bit shocking as I expected to see the improvement to hold on for a bit longer. Instead the sweet pot seems to be two threads, which is a bit low on a machine that can run up to four threads at the same time. This may however be more related to how hyperthreading works. What is interesting to see is that even if you completely overestimate the amount of threads, the punishment is relatively low as you still improve compared to the single thread situation.
Doing the same but where the threads sleep after every 100.000 increments does show us a bit of a different story.
The sweet stop is no longer two threads, moreover there doesn’t seem to be a real sweet spot as increasing the number of threads does not decrease performance, it does however also not increase it any further beyond 4 threads. The effect that plays here is that switching between threads is only effective if the sleeping threads sleeps long enough. The actual performance gained is the difference between the sleep time and the context switch. But again we see that just taking a large number is acceptable.
None of this really matters though if your concurrent application is producing wrong results, which in this case it will! So how well does it compare to a synchronized version, one that will produce the correct result (using an atomic number instead).
This turns out to be the complete opposite of what we saw before. There is no performance gain to be achieved by parallelising this application. Instead the thread will be blocking eachother much longer than if a single thread would be allowed to do it all itself. This indicates that the JVM is smart enough not to release its access to the variable until another thread requires access to it, which means the single thread will aquire it once and keep it forever. As soon as you go up to two thread they will constantly have to pass on the access to the variable which cripples performance. Adding more threads does make the situation worse, but not by much, because the damage is already done. And it doesn’t really matter which thread asks for the access or gets it, and all other threads will be waiting anyway.
Yet again this shows that the main issue with parallelising is to decide what you want to parallalise as this will define how much gain you can expect.