To parallelize the implementation, the computation of the model expectations should be performed in multiple threads, since a significant amount of training time is spent there.
The model expectations are frequently updated during the computation, which makes parallelization inefficient because the updates would need to be synchronized. I evaluated different synchronization strategies (lock-free updates and locking), but all of them improved the training runtime only marginally. The additional computational power is almost entirely lost to more expensive writes and waiting time.
For this reason, the following strategy turned out to work almost as well as no synchronization at all.
The model expectations are kept local to each thread, and the n copies are joined after they have been computed. This solution is almost as fast as not synchronizing the updates at all (which of course yields incorrect parameters, but serves well enough as a runtime performance baseline). Its disadvantage is that the required amount of memory grows with the number of threads, but this is not a problem in practice: the model expectations usually need only a few tens of MB per copy, while modern multi-core systems usually have many GB of memory. Additionally, this parallelization strategy makes good use of the per-core CPU caches compared to a solution that shares the model expectations between threads.
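The thread-local strategy can be sketched as follows in Go (the function name, the data layout as lists of feature indices, and the interleaved work split are hypothetical choices for illustration; the original implementation may partition work differently). Each worker updates only its private copy, so the hot inner loop needs no synchronization, and the copies are summed in a cheap join step at the end:

```go
package main

import (
	"fmt"
	"sync"
)

// accumulate computes per-thread local expectation vectors and joins
// them after all workers have finished. Each data item is modeled as
// a list of feature indices it contributes to (hypothetical layout).
func accumulate(data [][]int, numWorkers, numFeatures int) []float64 {
	locals := make([][]float64, numWorkers)
	var wg sync.WaitGroup
	for w := 0; w < numWorkers; w++ {
		locals[w] = make([]float64, numFeatures) // one private copy per thread
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			// Each worker processes an interleaved share of the data,
			// writing only to its own copy -- no synchronization needed.
			for i := w; i < len(data); i += numWorkers {
				for _, f := range data[i] {
					locals[w][f] += 1.0
				}
			}
		}(w)
	}
	wg.Wait()
	// Join step: sum the n per-thread copies into one global vector.
	global := make([]float64, numFeatures)
	for _, local := range locals {
		for f, v := range local {
			global[f] += v
		}
	}
	return global
}

func main() {
	data := [][]int{{0, 1}, {1, 2}, {0}, {2, 2}}
	fmt.Println(accumulate(data, 2, 3))
}
```

Because each local copy is written by exactly one thread, the copies tend to stay resident in that core's cache, which is the cache-friendliness advantage mentioned above.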