Introduction
In the blog post "Why shared mutable state is the root of all evil" I tried to explain why it is important to use proper synchronization when accessing shared mutable state in your concurrent programs, to make sure they behave as expected. What I didn't get around to was explaining why this synchronization is so bad for performance and therefore the root of all evil for performance optimization.
Speedup
One performance measure is the speed-up of a program. The speed-up compares the time a program takes when executed on a single core to the time it takes to execute on multiple cores.
To calculate the speed-up (S2) on a dual-core system, simply measure the execution time on a single core (T1) and divide it by the execution time on the dual core (T2): S2 = T1 / T2.
For example, if you are running a single-threaded program it doesn't matter how many cores your host system has. The execution time will stay roughly constant, additional cores will simply remain unused and idle, and the speed-up will therefore always be one.
If you are running a perfectly parallelizable program on a dual-core system, the execution time will be exactly half of what it is on a single-processor system. In this case the program has a speed-up of 2. When the speed-up grows linearly with the number of cores you add to the system, we call it a "linear speed-up". That kind of speed-up is the ideal.
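Just to make the calculation concrete, here is a minimal sketch in Java (the language, the busy-work loop and the iteration counts are only placeholders I picked; a serious benchmark would need warm-up and a proper harness) that measures T1 and T2 for a perfectly parallelizable task and divides them:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class SpeedupDemo {

    // Placeholder for a perfectly parallelizable piece of work.
    static long busyWork(long iterations) {
        long sum = 0;
        for (long i = 0; i < iterations; i++) {
            sum += i;
        }
        return sum;
    }

    // Run the same total amount of work split across the given number of threads
    // and return the elapsed wall-clock time in nanoseconds.
    static long timeWithThreads(int threads, long totalIterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        List<Future<Long>> parts = new ArrayList<>();
        for (int i = 0; i < threads; i++) {
            parts.add(pool.submit(() -> busyWork(totalIterations / threads)));
        }
        for (Future<Long> part : parts) {
            part.get();                       // wait for every part to finish
        }
        long elapsed = System.nanoTime() - start;
        pool.shutdown();
        return elapsed;
    }

    public static void main(String[] args) throws Exception {
        long work = 1_000_000_000L;
        long t1 = timeWithThreads(1, work);   // T1: one thread / one core
        long t2 = timeWithThreads(2, work);   // T2: two threads / two cores
        System.out.printf("S2 = T1 / T2 = %.2f%n", (double) t1 / t2);
    }
}
```

On an otherwise idle dual-core machine this should print something close to 2 for this kind of embarrassingly parallel work.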
In practice these kinds of speed-ups are rare, and the reason is Amdahl's law.
Amdahl's law
What Amdahl's law basically says is that the maximal possible speed-up of a program is determined by the part of the program which cannot run in parallel. For example, if 50% of your algorithm is structured in a way that it cannot be run by multiple threads, then the maximal speed-up you will ever be able to achieve is 2. Say this algorithm takes 1 minute to execute on a single-threaded system. If you optimize the heck out of your host system and let the algorithm run with 100 threads, you would be able to reduce the time for the parallel (50%) part of the algorithm by a factor of 100. But you would still be stuck with the single-threaded (50%) part of the algorithm, which doesn't benefit from the additional threads. In the end the best you can ever achieve is a runtime of 30 seconds.

Below is a nice graph from Wikipedia showing this law for different percentages of parallel proportions of the algorithm.
You can see that the different coloured curves all flatten out after a while; adding more threads doesn't result in any further benefit. Even for an algorithm with a 95% parallel proportion, the maximal speed-up you can ever achieve is 20.
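The numbers behind those curves are easy to reproduce without the graph. Amdahl's formula for the speed-up with n threads and a parallel proportion p is S(n) = 1 / ((1 - p) + p / n), with an upper limit of 1 / (1 - p). Here is a small sketch that just evaluates it (the chosen proportions and thread counts are only examples):

```java
public class AmdahlDemo {

    // Amdahl's law: speed-up with n threads for a parallel proportion p.
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        double[] proportions = {0.50, 0.75, 0.90, 0.95};
        int[] threadCounts = {1, 2, 4, 16, 256, 65536};

        for (double p : proportions) {
            System.out.printf("p = %.2f:", p);
            for (int n : threadCounts) {
                System.out.printf("  n=%-5d -> %5.2f", n, speedup(p, n));
            }
            // The curve flattens out towards 1 / (1 - p), e.g. 20 for p = 0.95.
            System.out.printf("  (limit: %.0f)%n", 1.0 / (1.0 - p));
        }
    }
}
```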
In practice, running the algorithm with a very large number of threads is actually likely to reduce the speed-up again, due to the operating system's cost of multi-threading, such as context switching.
What does this mean for my programs?
As I tried to explain in my blog post "Free lunch for programers", the time of easy performance gains is over and only parallel programs will benefit from future advances in computer hardware. But because of Amdahl's law, even parallel programs are very limited in their potential performance gains due to their non-parallel sections.

So in a way we have two conflicting forces: safety and performance. You need to use synchronization to make your concurrent programs correct and safe, but by introducing synchronization you are installing so-called critical sections, which are essentially single-threaded and therefore harm performance.
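As an illustration (the counter is just a stand-in for any piece of shared mutable state), a synchronized block like the one below is such a critical section: no matter how many threads you start, only one of them can execute it at a time, so this part of the program stays effectively single-threaded.

```java
public class SharedCounter {

    private long count = 0;                 // shared mutable state
    private final Object lock = new Object();

    // Critical section: correct and safe, but effectively single-threaded.
    // Every thread that wants to increment has to queue up behind the lock.
    void increment() {
        synchronized (lock) {
            count++;
        }
    }

    long current() {
        synchronized (lock) {               // also needed for a safe read
            return count;
        }
    }
}
```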
By minimizing shared mutable state you can reduce the amount of synchronization in your programs and thereby enable higher performance.
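One common way to do that (shown here only as a sketch, not as the one true solution) is to give every thread its own private partial result and merge the parts once at the end, so no lock is needed in the hot loop at all:

```java
import java.util.ArrayList;
import java.util.List;

public class PartialSums {

    public static void main(String[] args) throws InterruptedException {
        final int threads = 4;
        final long perThread = 10_000_000L;
        final long[] partial = new long[threads];   // one private slot per thread

        List<Thread> workers = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
            final int id = t;
            Thread worker = new Thread(() -> {
                long local = 0;                     // thread-private, no synchronization
                for (long i = 0; i < perThread; i++) {
                    local++;
                }
                partial[id] = local;                // each thread writes only its own slot
            });
            workers.add(worker);
            worker.start();
        }

        for (Thread worker : workers) {
            worker.join();                          // join makes the writes visible here
        }

        long total = 0;                             // merged once, outside any hot path
        for (long part : partial) {
            total += part;
        }
        System.out.println("total = " + total);
    }
}
```

The only single-threaded part left is the final merge, and keeping that non-parallel part as small as possible is exactly what Amdahl's law tells you to do.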
Conclusion
Shared mutable state really is the root of all evil. Eliminate it wherever possible and your life as a developer will be a lot easier.