Friday, 21 June 2013

Amdahl's law and the root of all evil

Introduction


In the blog post "Why shared mutable state is the root of all evil." I tried to explain why it is important to use proper synchronization when accessing shared mutable state in your concurrent programs, to make sure they behave as expected. What I didn't get around to was explaining why this synchronization is so bad for performance and therefore the root of all evil for performance optimization.

Speedup


One performance measure is the speed-up of a program. The speed-up compares the time a program takes when executed on a single core to the time it takes to execute on multiple cores.



To calculate the speed-up (S2) on a dual-core system, simply measure the execution time on a single core (T1) and divide it by the execution time on the dual-core system (T2).
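Written as a simple formula (just restating this definition):

       S2 = T1 / T2

More generally, the speed-up on n cores is Sn = T1 / Tn.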

For example, if you are running a single-threaded program it doesn't matter how many cores your host system has. The execution time will stay roughly constant and any additional cores will remain unused and idle. The speed-up will therefore always be one.

If you are running a perfectly parallelizable program on a dual-core system the execution time will be exactly half of what it is on a single-core system. In this case the program has a speed-up of 2. When the speed-up grows linearly with the number of cores you add to the system we call it a "linear speed-up". This kind of speed-up is the ideal.

In practice this kind of speed-up is rare, and the reason is Amdahl's law.


Amdahl's law

What Amdahl's law basically says is that the maximal possible speed-up of a program is determined by the part of the program which cannot run in parallel. For example, if 50% of your algorithm is structured in a way that it cannot be run by multiple threads, then the maximal speed-up you will ever be able to achieve is 2. Say this algorithm takes 1 minute to execute on a single-threaded system. If you optimize the heck out of your host system and run the algorithm with 100 threads, you reduce the time for the parallel (50%) part of the algorithm by a factor of 100. You are still stuck with the single-threaded (50%) part of the algorithm, which doesn't benefit from the additional threads. In the end the best you can ever achieve is a runtime of 30 seconds.
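In formula form Amdahl's law is commonly written as follows, where P is the proportion of the program that can run in parallel and N is the number of threads:

       S(N) = 1 / ((1 - P) + P / N)

For the example above (P = 0.5, N = 100) this gives 1 / (0.5 + 0.005) ≈ 1.98, and even with an unlimited number of threads the speed-up can never exceed 1 / (1 - P) = 2.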

Below is a nice graph from Wikipedia showing this law for different parallel proportions of the algorithm.


You can see that the differently coloured graphs all flatten out after a while. At some point adding more threads doesn't result in any further benefit. Even for an algorithm with a 95% parallel proportion the maximal speed-up you can ever achieve is 20.

In practice, running the algorithm with a very large number of threads is actually likely to reduce the speed-up again. This is due to the cost of multi-threading in the operating system, such as context switching.



What does this mean for my programs?

As I tried to explain in my blog post "Free lunch for programmers", the time of easy performance gains is over and only parallel programs will benefit from future advances in computer hardware. But because of Amdahl's law even parallel programs are very limited in their potential performance gains by their non-parallel sections.

So in a way we have two conflicting forces: safety and performance. You need synchronization to make your concurrent programs correct and safe, but by introducing synchronization you are creating so-called critical sections, which are essentially single-threaded and therefore harm performance.

By minimizing shared mutable state you can reduce the amount of synchronization used in your programs and therefore enable higher performance.

Conclusion

Shared mutable state really is the root of all evil. Eliminate it wherever possible and your life as a developer will be a lot easier.


Sunday, 16 June 2013

Free lunch for programmers

In my blog post "Why shared mutable state is the root of all evil." I commented on the fact that for the last 10 years the free lunch for programmers has been over and significant performance gains have only been possible by developing concurrent programs.

Last night I came across Herb Sutter, who described this idea for the first time in early 2005 in his Dr. Dobb's Journal article titled "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software". Credit where credit is due, so I thought I'd write a blog post about this article.

Herb really hit the nail on the head and it is actually surprising to find an 8-year-old text about computer performance which is pretty much still up to date.

Here is a quote from the article:

Quick: What’s the clock speed on the CPU(s) in your current workstation? Are you running at 10GHz? On Intel chips, we reached 2GHz a long time ago (August 2001), and according to CPU trends before 2003, now in early 2005 we should have the first 10GHz Pentium-family chips. A quick look around shows that, well, actually, we don’t. What’s more, such chips are not even on the horizon—we have no good idea at all about when we might see them appear. Well, then, what about 4GHz? We’re at 3.4GHz already—surely 4GHz can’t be far away? Alas, even 4GHz seems to be remote indeed. In mid-2004, as you probably know, Intel first delayed its planned introduction of a 4GHz chip until 2005, and then in fall 2004 it officially abandoned its 4GHz plans entirely. As of this writing, Intel is planning to ramp up a little further to 3.73GHz in early 2005 (already included in Figure 1 as the upper-right-most dot), but the clock race really is over, at least for now; Intel’s and most processor vendors’ future lies elsewhere as chip companies aggressively pursue the same new multicore directions.

It is almost scary to see that even now, 8 years after the article was first published, the fastest Intel CPU is the i7-3970X, which runs at 3.5GHz with 6 cores and 15MB of cache. The CPU can increase its clock speed for a while (turbo mode), but only if the CPU temperature is not already too high. In other words, there are still no common CPUs with 4GHz or more on the market.

However the trend towards more transistors and multicore processors is continuing. Intel's forthcoming Xeon Phi uses the Many Integrated Core architecture and will have up to 61 cores, 30MB of L2 cache and only 1.25GHz clock speed per core.

Conclusion

To improve throughput, latency and utilization in a world of CPUs with an ever growing number of cores, applications need to be able to run in parallel. As a consequence, higher-level concurrent programming techniques like message passing, which don't rely on shared mutable state, will become more and more important in the future.

Monday, 3 June 2013

Why is double checked locking broken?

Introduction


Yesterday I was talking to a good friend in the park and the topic of double checked locking for lazy instantiation of singletons came up. (Why we talked about this on a nice sunny day in the park is a separate issue, which won't be addressed in this blog ;)) At the time I was sure it was broken but I couldn't explain the reasons behind this strange behaviour. So I decided to have a look at it and write a quick blog post about it.

This is what double checked locking looks like in Java:

public class DoubleCheckedLocking {

    private static DoubleCheckedLocking singleton;
    private String a;
    private String b;

    private DoubleCheckedLocking() {
        this.a = "A";
        this.b = "B";
    }

    public static DoubleCheckedLocking getSingleton() {
      if(singleton == null) {                            //1
        synchronized(DoubleCheckedLocking.class) {       //2
          if(singleton == null) {                        //3
            singleton = new DoubleCheckedLocking();      //4
          }
        }
      }
      return singleton;                                  //5
    }
}


The idea behind this pattern is that you want to lazily initialise your singleton but avoid the performance penalty that comes from declaring the whole method synchronized. In the pattern the first thread enters the method and checks on line 1 that the singleton is null. It then synchronizes on the class lock on line 2 and does the null check on the singleton again on line 3. If the null check still holds true, the singleton is finally initialised on line 4. All further calls to the getSingleton method won't see the singleton variable as null and will just return the cached instance (line 5). Unfortunately this pattern is not thread-safe and is considered broken. What could possibly go wrong? The problem sits on line 4. Assigning a reference to a variable and running the constructor is not an atomic operation. Two things need to happen:

  • Run the constructor
  • Assign newly created instance to variable (singleton)

The JVM is allowed to use out-of-order processing (see "Why shared mutable state is the root of all evil.") and may assign the reference to the variable before the constructor returns. That means there is a window of vulnerability during which singleton points to a not fully constructed object. For example, if the constructor initialises the internal fields a and b, another thread might observe the object with a initialised but b still null, and only later see both fields properly initialised ("A" and "B"). It is only guaranteed that one thread sees the actions of another thread in the proper order if you use synchronization for both reads and writes of the shared mutable state.

What can you do?

The easiest way to fix this is to not use lazy instantiation at all, or to synchronize the whole method.

public class ThreadSafeLoading {
    private static final ThreadSafeLoading singleton = new ThreadSafeLoading();

    private ThreadSafeLoading() {
    }

    public static ThreadSafeLoading getSingleton() {
        return singleton;
    }
}

This works very well. The code to initialise a static variable runs during class loading, before the class can be used by any thread. The variable is declared final, so it is ensured that all clients calling getSingleton will see the same, fully initialised instance of the class.
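The other option mentioned above, synchronizing the whole accessor, could look like the following sketch (the class name SynchronizedLazyLoading is made up for this example). It is correct but pays the locking cost on every single call:

public class SynchronizedLazyLoading {

    private static SynchronizedLazyLoading singleton;

    private SynchronizedLazyLoading() {
    }

    // The synchronized keyword makes the whole method a critical section,
    // so only one thread at a time can check and initialise the singleton.
    public static synchronized SynchronizedLazyLoading getSingleton() {
        if (singleton == null) {
            singleton = new SynchronizedLazyLoading();
        }
        return singleton;
    }
}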




What if you must use lazy loading and can't synchronize the whole method? You can declare the variable singleton as volatile. That way the JVM puts memory barriers in place which prevent the problematic reordering. Each write to a volatile variable and each read from a volatile variable acts as a memory barrier and ensures that one thread sees the actions of the other thread in the proper order.

public class DoubleCheckedLocking {

    private static volatile DoubleCheckedLocking singleton;
    private String a;
    private String b;

    private DoubleCheckedLocking() {
        this.a = "A";
        this.b = "B";
    }

    public static DoubleCheckedLocking getSingleton() {
      if(singleton == null) {                            //1
        synchronized(DoubleCheckedLocking.class) {       //2
          if(singleton == null) {                        //3
            singleton = new DoubleCheckedLocking();      //4
          }
        }
      }
      return singleton;                                  //5
    }
}

Saturday, 1 June 2013

Why shared mutable state is the root of all evil.

Introduction


A while back I went to a talk by Martin Odersky introducing Scala. Martin is the lead designer of Scala and a very smart guy. You can have a look at the slides from the talk here.

He described how the effect of Moore's law (that the number of transistors on a reasonably priced CPU doubles roughly every 2 years) has not led to higher clock speeds since the early 2000s. The number of transistors is still doubling but the extra transistors are used to add more cores per CPU and to increase L2 and L3 cache sizes. Before the 2000s programmers got a free lunch, meaning that the performance of their programs automatically improved with each new processor generation. This free lunch is now over and the performance of a single-threaded program hasn't improved much over the last 10 years. To be able to create programs with lower latency or higher throughput, parallel programming mechanisms are needed.

What I found really profound was this pseudo equation which Martin introduced:

       non-determinism = parallel processing + mutable state

This equation basically means that parallel processing and mutable state combined result in non-deterministic program behaviour. If you do parallel processing with only immutable state, everything is fine and it is easy to reason about your programs. On the other hand, if you want to do parallel processing with mutable data you need to synchronize the access to the mutable variables, which essentially renders these sections of the program single-threaded. This is not really new but I haven't seen the concept expressed so elegantly. A non-deterministic program is broken.

Why are parallel programs without proper synchronization broken?

There are a few different things which can go wrong when no proper synchronization policy is enforced: Race conditions, visibility issues and out-of-order processing.

Race condition

A race condition means that the correct outcome of an algorithm depends on lucky timing and the absence of interference from other threads. Let's have a look at the check-then-act race condition. In Java it would look something like this:

import java.util.HashMap;

public class BrokenMap<E, V> extends HashMap<E, V> {
    public V putIfAbsent(final E key, final V value) {
        if (!this.containsKey(key)) {
            return this.put(key, value);
        } else {
            return this.get(key);
        }
    }
}

In the putIfAbsent method we first check if the key is already part of the map. If it is not, we add the new value. Otherwise we just return the existing value. This code works perfectly fine in a single-threaded environment. In a multithreaded environment, however, a second thread could jump in between the two calls to containsKey and put and put its value there first. Once thread one continues execution it would overwrite the value thread two has already stored in the map.

For the method putIfAbsent to be thread-safe it needs to be guaranteed that the two calls to containsKey and put always run uninterrupted. They need to be atomic. The easiest way to ensure this is to put both calls into a synchronized block, which essentially renders the program single-threaded in this section.
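A minimal sketch of such a fix, with the whole method declared synchronized (the class name SynchronizedMap is made up for this example; note that every other access to the map would have to synchronize on the same lock, and in real code java.util.concurrent.ConcurrentHashMap already provides an atomic putIfAbsent):

import java.util.HashMap;

public class SynchronizedMap<E, V> extends HashMap<E, V> {
    public synchronized V putIfAbsent(final E key, final V value) {
        // containsKey and put now execute as one atomic unit because the
        // whole method is a critical section guarded by the map's own lock.
        if (!this.containsKey(key)) {
            return this.put(key, value);
        } else {
            return this.get(key);
        }
    }
}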

Another form of race condition is read-modify-write. In Java it would look something like this:


public class BrokenCounter {
    private long count = 0;
    public long increment() {
        return ++count;
    }
}

The class BrokenCounter has a method increment. This class is not thread-safe because the call to increment the counter, ++count, is not just one operation. Under the hood it actually consists of 3 operations:
  1. Read the old value of count.
  2. Increment the value of count.
  3. Store the new value of count.
If a second thread interrupts the first thread somewhere between step 1 and step 3, it would read the same value as thread one, increment it by one and store it back into the count variable. Once the first thread resumes it would also increment the old value and store it back into the count variable. The result would be a counter which has been incremented only once even though increment was called twice.

For the method increment to be thread-safe it needs to be guaranteed that calls to ++count are not interrupted. The easiest way to achieve this is to put the call into a synchronized block, which essentially renders this section of the program single-threaded.
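A minimal sketch of a thread-safe counter (the class name SafeCounter is made up; alternatively java.util.concurrent.atomic.AtomicLong offers the same behaviour without an explicit lock):

public class SafeCounter {
    private long count = 0;

    // Declaring the method synchronized turns the read-modify-write sequence
    // (read count, increment it, store it) into a single atomic critical section.
    public synchronized long increment() {
        return ++count;
    }
}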


Visibility

Visibility essentially means that an action in one thread is visible to a second thread. In a single-threaded program, when you write a new value to a variable it is guaranteed that all future reads of this variable will return the new value. In a multithreaded program such a guarantee doesn't exist. Let's have a look at the example program BrokenServer:

public class BrokenServer {

    private boolean stopped=false;
    
    public void run() {
        while(!stopped) {
            /* do something useful */
        }
    }
 
    public void cancel() {
        this.stopped = true;
    }
}


In this program the run method keeps processing stuff until it gets interrupted by another thread calling the cancel method. It is a bit odd to imagine, but the thread in the run method may actually never see the updated value of stopped. This is because the variable stopped may be cached in a register by the JIT compiler, or its value may be resolved from the L1/L2 cache of the core rather than from main memory. To fix this we need to tell the JVM to include a memory barrier after writing the stopped variable and another one before reading it. The memory barrier guarantees that the write to the variable happens before all subsequent reads of the same variable (like you would expect in a single-threaded program).

The easiest way to achieve this is to use a synchronized block around reading and writing the variable stopped, which once again renders this part of the program single-threaded.
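A minimal sketch of a fixed version (the class name SafeServer is made up; declaring stopped as volatile would establish the same visibility guarantee without locking):

public class SafeServer {

    private boolean stopped = false;

    public void run() {
        while (!isStopped()) {
            /* do something useful */
        }
    }

    // Reads and writes of stopped go through methods synchronizing on the same
    // lock, so the running thread is guaranteed to see the update made by the
    // thread that calls cancel.
    private synchronized boolean isStopped() {
        return stopped;
    }

    public synchronized void cancel() {
        this.stopped = true;
    }
}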


Ordering


If you made it this far you must be truly interested in Java concurrency. Let's have a look at one more case which is even weirder than the visibility issue above. The JVM and modern CPUs are allowed to execute code out of order if that helps speed up the code or improve memory access patterns.

Let's have a look at another piece of code:

public class BrokenOrder {
    int a = 0;
    int b = 0;
 
    public void assignment() {
        a = 1;
        b = 2;
    }
 
    public int getSum() {
        return a+b;
    } 
}
It is actually surprisingly difficult to reason about this very simple program. What would be the result of getSum if called in a multithreaded environment?

  • Returns 0 if getSum runs before assignment
  • Returns 3 if getSum runs after assignment
  • Returns 1 if getSum runs after the a=1 assignment but before the b=2 assignment
  • Returns 2 if getSum runs after the b=2 assignment but before the a=1 assignment

This last case can happen if the JVM or CPU decides to process the assignment method out of order. To prevent this kind of non-deterministic behaviour we once again need a memory barrier to rule out out-of-order processing. The easiest way to do this in Java is to use a synchronized block, which renders the program single-threaded in this section.

Conclusion


Shared mutable state and parallel processing don't go together at all. If you do not use proper synchronization you will create extremely hard-to-find bugs, and your programs are basically broken even if they appear to work just fine in most cases.

Sunday, 19 May 2013

NoSQL and the technology adoption lifecycle

Introduction


Last week I went to a talk titled "Considerations for using NoSQL technology on your next IT project" by Akmal Chaudhri. NoSQL is a vast field and the headline was rather vague so I was not sure what to expect. 

In the end I was really glad I went. Akmal is a great speaker with decades of experience in IT. He really helped put things in the NoSQL space into perspective by comparing the "revolution" of object-oriented data stores in the early 90s and the trend towards XML-based databases in the early 2000s with today's hype around NoSQL databases. It is interesting to compare past trends or hypes which haven't quite taken off with the current NoSQL landscape.

In the early 90s and early 2000s object-oriented and XML-based databases seemed to be the next big thing, and analysts predicted strong growth for these technologies. In reality their adoption never became widespread and their market share remained in the single digits.

Technology Adoption Lifecycle

The technology adoption life-cycle classifies the adoption of a new technology into 5 stages and can be visualized with Rogers' bell curve.


  • Innovators - Techies like Google and Amazon. They innovate based on a deep need.
  • Early adopters - Visionaries who like to try out new technologies and are generally curious. These are the opinion leaders.
  • Early majority - Pragmatists who are not really interested in new technologies per se. They are looking for the best tool to get the job done as fast as possible.
  • Late Majority - Conservatives who try to preserve the status quo and stick with known technologies because they are perceived to be safer even if it means more work.
  • Laggards - Skeptics

Crossing the chasm


What really rang through for me was the picture of crossing the chasm. Crossing the chasm is a book by Geoffrey Moore "that focuses on the specifics of marketing high tech products during the early start up period..." [wiki]

The chasm lies between the early adopters and the early majority (see picture above). It illustrates that it is quite difficult for new products and technologies to make the jump from the visionaries to the much wider adoption by the pragmatists, who are not really interested in new technologies but just want to get the job done. A lot of products which once were new and exciting lie deep down in the chasm.

Conclusion

It is not clear if NoSQL really is the next big thing. There is a lot of hype around this new technology and NoSQL has yet to prove that it can jump over the chasm. From what I can see around me, NoSQL is in a state where the pragmatists are experimenting with the new technologies and developing proofs of concept to see if NoSQL has the ability to make their life easier.



Tuesday, 14 May 2013

Continuous delivery using MongoDB

Introduction


Last weekend I organized, managed and supported our production release for the first time. Our current primary data store is SQL Server and the release follows roughly this order:
  • Shutdown application
  • Backup databases
  • Run data definition deployment scripts and data scripts
  • Deploy the application
  • Start-up application
  • Test application

From shutdown to start-up there are a good few hours of downtime for the application. That's why releases usually happen on the weekend. It would be nice, however, to be a bit more flexible, and the team would appreciate it if we could do deployments after hours during the week. To be able to do this, though, we would need to cut down the deployment time.

One way of doing this is to use a schemaless database like MongoDB. Since MongoDB doesn't enforce a schema, there are no data definition scripts to run on the database side. The schema is enforced at the application level. That means the application is responsible for writing data in a safe way and for providing read methods which can retrieve the stored data again. By deploying the new version of the application we therefore also implicitly roll out the new version of the schema.

Data migration in MongoDB

After deploying the new version of the application, existing data needs to be migrated. Let's take, for example, the common case of adding a new field to a collection. In an RDBMS we would add the column using a DDL statement and either set a default value or run a batch update to populate the column.

In MongoDB there is an incremental way of achieving the same thing. Currently I am reading the book "NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence" by Pramod J. Sadalage and Martin Fowler, which introduces an interesting pattern for incremental database migration.

  


The application needs to make sure that during the transition phase it can read both the old and the new version of a document. However, once a document gets saved the application saves only the new version. That way the entire data set gets migrated over time. To implement this the book suggests adding a field schema_version to each document. This schema version could correspond to the version of the application. Based on this field the application can decide whether a document has already been migrated. If a document of an older version is loaded, the application can run dedicated migration code to bring it to the next version and then save it.
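A minimal sketch of what this migration-on-read could look like (illustrative only: the document is represented as a plain Map instead of a concrete MongoDB driver type, and the field names schema_version, firstName, lastName and displayName are made up for this example):

import java.util.Map;

public class CustomerMigrator {

    private static final int CURRENT_SCHEMA_VERSION = 2;

    // Called right after a document has been loaded from the data store.
    public Map<String, Object> migrateIfNecessary(final Map<String, Object> document) {
        final Object storedVersion = document.get("schema_version");
        final int version = (storedVersion == null) ? 1 : (Integer) storedVersion;

        if (version < CURRENT_SCHEMA_VERSION) {
            // Hypothetical migration step: schema version 2 introduces a
            // displayName field derived from the existing name fields.
            document.put("displayName",
                    document.get("firstName") + " " + document.get("lastName"));
            document.put("schema_version", CURRENT_SCHEMA_VERSION);
        }
        // The caller then saves the document, which persists it in the new format.
        return document;
    }
}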

Once all documents are migrated the migration code in the application can safely be removed in the next release.



How are Java generics implemented under the hood?

Introduction


Today at lunch we had a chat about generics and their implementation in Java. Generics have been available since Java 5. To stay byte-code compatible between Java 1.4 and 5, generics are implemented using type erasure. That means that at compile time all generic type information is stripped out and the resulting byte code only contains raw types like:

List a = new ArrayList();
instead of
List<MyType> a = new ArrayList<MyType>();

This means that when we read elements of the list at runtime a cast needs to be added to check that both sides of the assignment have a compatible type:

MyType a = (MyType)list.get(0);
instead of
MyType a = list.get(0);

Java Class File Disassembler

To analyse how this cast is actually implemented in Java byte code one can use the Java Class File Disassembler. This command line tool ships with every Oracle JDK distribution and is called javap.

To inspect the byte code of a given class file use the command: javap -c <classfile>
For example:

javap -c org.henrikeichenhardt.javabytecode.Main

The output can get quite verbose and it makes sense to boil down the problem you want to look at to a few lines of code - just enough so you can create and compile your scenario.

Under the hood

For our scenario the Java code would look like this:

package org.henrikeichenhardt.javabytecode;

import java.util.ArrayList;
import java.util.List;

public class Main {
   public static void main(final String[] args) {
    final List<Main> list = new ArrayList<Main>();
    list.add(new Main());
    
    final Main value = list.get(0);
   }
}

We simply create a java.util.ArrayList with type parameter Main and add one element to this list. Then we retrieve this element from the list again. These few lines of code result in the following output from javap:

Compiled from "Main.java"
public class org.henrikeichenhardt.javabytecode.Main extends java.lang.Object{
public org.henrikeichenhardt.javabytecode.Main();
  Signature: ()V
  Code:
   0:   aload_0
   1:   invokespecial   #8; //Method java/lang/Object."<init>":()V
   4:   return

public static void main(java.lang.String[]);
  Signature: ([Ljava/lang/String;)V
  Code:
   0:   new     #16; //class java/util/ArrayList
   3:   dup
   4:   invokespecial   #18; //Method java/util/ArrayList."<init>":()V
   7:   astore_1
   8:   aload_1
   9:   new     #1; //class org/henrikeichenhardt/javabytecode/Main
   12:  dup
   13:  invokespecial   #19; //Method "<init>":()V
   16:  invokeinterface #20,  2; //InterfaceMethod java/util/List.add:(Ljava/lang/Object;)Z
   21:  pop
   22:  aload_1
   23:  iconst_0
   24:  invokeinterface #26,  2; //InterfaceMethod java/util/List.get:(I)Ljava/lang/Object;
   29:  checkcast       #1; //class org/henrikeichenhardt/javabytecode/Main
   32:  astore_2
   33:  return

}


The first block of information beginning with
public org.henrikeichenhardt.javabytecode.Main();
is the code to generate the (implicit) constructor of the Main object. We can ignore this for now. The second block of information is the byte code representing the code of the static main method
public static void main(java.lang.String[]);
In the first few lines an instance of ArrayList is created and a reference to it is stored in a local variable. Then an instance of the Main class is created and added to the list. Now let's focus on reading an element from the list:

final Main value = list.get(0);

The resulting Java byte code looks like this:


   24:  invokeinterface #26,  2; //InterfaceMethod java/util/List.get:(I)Ljava/lang/Object;
   29:  checkcast       #1; //class org/henrikeichenhardt/javabytecode/Main
   32:  astore_2

Without going into too much detail, what basically happens is that invokeinterface calls the get method of our list and puts the result on top of the stack. Then the checkcast instruction is executed. It takes only one argument, which is a reference to the type that needs to be checked (Main). This type is then compared with the object reference on top of the stack (where we previously put the result of list.get(0)).

How does checkcast work?


The documentation of the checkcast instruction reveals how this check is done:
  • If the stack reference is null nothing happens.
  • If the stack reference can be cast to the type provided nothing happens.
  • Otherwise a ClassCastException is thrown.
That means that checkcast is very similar to an instanceof test. If the checkcast test passes, execution continues normally and the top of the stack is stored in a local variable (astore_2).
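You can provoke a failing checkcast yourself by sneaking an element of the wrong type into the list through a raw reference (a contrived sketch reusing the Main class from above; the class name CheckcastDemo is made up and the compiler will warn about the unchecked add):

package org.henrikeichenhardt.javabytecode;

import java.util.ArrayList;
import java.util.List;

public class CheckcastDemo {
    public static void main(final String[] args) {
        final List<Main> list = new ArrayList<Main>();

        // Through the raw reference the compiler can no longer verify the element
        // type, so a String slips into the List<Main> (unchecked warning only).
        final List rawList = list;
        rawList.add("not a Main");

        // The compiler-generated checkcast to Main fails here at runtime
        // and throws a ClassCastException.
        final Main value = list.get(0);
    }
}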

So what the problem boils down to is how does checkcast know if it can cast a value to a type? The following is copied from [1]:

The following rules are used to determine whether an objectref that is not null can be cast to the resolved type: if S is the class of the object referred to by objectref and T is the resolved class, array, or interface type, checkcast determines whether objectref can be cast to type T as follows:
  • If S is an ordinary (nonarray) class, then:
    • If T is a class type, then S must be the same class as T, or S must be a subclass of T;
    • If T is an interface type, then S must implement interface T.
  • If S is an interface type, then:
    • If T is a class type, then T must be Object.
    • If T is an interface type, then T must be the same interface as S or a superinterface of S.
  • If S is a class representing the array type SC[], that is, an array of components of type SC, then:
    • If T is a class type, then T must be Object.
    • If T is an interface type, then T must be one of the interfaces implemented by arrays (JLS §4.10.3).
    • If T is an array type TC[], that is, an array of components of type TC, then one of the following must be true:
      • TC and SC are the same primitive type.
      • TC and SC are reference types, and type SC can be cast to TC by recursive application of these rules.
It is up to the JVM implementation how efficiently these checks are executed and whether any profiling or heuristics are applied. In our example S is an ordinary class and T is a class type, and both are the same class, which satisfies the first case above.

[1] Oracle checkcast