Affects Version/s: 2.3.0, 2.4.5
Fix Version/s: None
When using operations such as aggregate and treeAggregate, if one of the seqOp and combOp runs a level 1 BLAS operation while the other runs a level 2 BLAS operation, the job can hit a JVM-internal deadlock that is hard to detect.
Say the seqOp runs gemv, which is a level 2 BLAS operation, and the combOp runs axpy, which is a level 1 BLAS operation. When a task running the seqOp executes at the same time as another task running the combOp, the two task threads get stuck. The call stacks are like this:
Both threads are reported as RUNNABLE, but neither of them is actually making progress.
When gemv is called and no BLAS instance exists yet, it calls the getInstance method to obtain one. The first thread to enter runs the static code block of BLAS.scala, which tries to load a subclass of BLAS and instantiate it via reflection.
When axpy is called and no BLAS instance exists yet, it creates a new F2jBLAS instance directly, because axpy is a level 1 BLAS operation.
The problem is that the classes NativeSystemBLAS, NativeRefBLAS and F2jBLAS, which BLAS tries to load in its static code block, are all either subclasses of F2jBLAS or F2jBLAS itself. The class-loading sequence in the static code block of BLAS is as follows:
- tries loading class BLAS -> lock the class BLAS
- tries loading class NativeSystemBLAS in the static code block -> lock the class NativeSystemBLAS
- recursively load F2jBLAS because it's the parent class of NativeSystemBLAS -> lock the class F2jBLAS
Meanwhile, the sequence for creating a new F2jBLAS in the axpy operation is as follows:
- tries loading class F2jBLAS -> lock the class F2jBLAS
- recursively load BLAS because it's the parent class of F2jBLAS -> lock the class BLAS
The deadlock occurs when the task thread running gemv has just finished its second step above and the task thread running axpy has just finished its first step above. At that point the gemv thread needs to load class F2jBLAS, which is locked by the axpy thread, while the axpy thread needs to load class BLAS, which is locked by the gemv thread. Each thread waits for a lock the other holds, so neither can proceed.
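The lock ordering described above can be illustrated without Spark or netlib. The following is a minimal Java sketch, not the reporter's demo: `Parent` and `Child` are hypothetical stand-ins for BLAS and F2jBLAS, and the 500 ms pause is an assumption used only to make the interleaving likely. One thread triggers the parent's static initializer (as gemv does via getInstance), the other instantiates the subclass directly (as axpy does).

```java
import java.util.concurrent.CountDownLatch;

final class InitDeadlockSketch {
    // Signals that the first thread is inside Parent's static initializer.
    static final CountDownLatch parentInitStarted = new CountDownLatch(1);

    // Hypothetical stand-in for BLAS: its static initializer instantiates a subclass.
    static abstract class Parent {
        static final Parent INSTANCE;
        static {
            parentInitStarted.countDown();
            pause(500);             // assumption: enough time for the other thread to start initializing Child
            INSTANCE = new Child(); // blocks: Child is being initialized by the other thread
        }
    }

    // Hypothetical stand-in for F2jBLAS.
    static class Child extends Parent {}

    static void pause(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    static void joinQuietly(Thread t, long millis) {
        try { t.join(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    // Returns true if both threads are still stuck after the timeout.
    // Class initialization happens once per JVM, so call this only once.
    static boolean reproduce(long timeoutMillis) {
        // Like gemv: enters through the parent class first.
        Thread a = new Thread(() -> { Parent p = Parent.INSTANCE; }, "gemv-like");
        // Like axpy: instantiates the subclass directly, which then needs the parent.
        Thread b = new Thread(() -> {
            try { parentInitStarted.await(); } catch (InterruptedException e) { return; }
            new Child();
        }, "axpy-like");
        a.setDaemon(true); // daemon threads let the JVM exit despite the deadlock
        b.setDaemon(true);
        a.start();
        b.start();
        joinQuietly(a, timeoutMillis);
        joinQuietly(b, timeoutMillis);
        return a.isAlive() && b.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("deadlocked = " + reproduce(3000));
    }
}
```

On a typical JVM both threads should still be alive when the timeout expires. A thread dump taken at that point would be expected to show them as RUNNABLE, consistent with the observation above, because waiting on class initialization is not reported as a monitor wait.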
A demo which can reproduce the problem is like this:
If the BLAS operations in Spark MLlib did not use F2jBLAS directly for level 1 operations, but obtained the instance the same way as nativeBLAS does, this problem would not occur.
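To show why a shared entry point avoids the cycle, here is a minimal Java sketch under the same assumptions as before: `Parent` and `Child` are hypothetical stand-ins for BLAS and F2jBLAS, and `getInstance` here is an illustrative method, not the netlib or Spark API. Both threads go through the parent class first, so the late thread waits for the parent's initialization without holding the subclass's initialization lock.

```java
final class SharedEntryPointSketch {
    // Hypothetical stand-in for BLAS with a single entry point.
    static abstract class Parent {
        private static final Parent INSTANCE;
        static {
            pause(200);             // assumption: widens the window in which the second thread arrives
            INSTANCE = new Child(); // the initializing thread may initialize Child recursively
        }
        static Parent getInstance() { return INSTANCE; }
    }

    // Hypothetical stand-in for F2jBLAS.
    static class Child extends Parent {}

    static void pause(long millis) {
        try { Thread.sleep(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    static void joinQuietly(Thread t, long millis) {
        try { t.join(millis); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }

    // Returns true if both threads finished within the timeout.
    static boolean bothFinish(long timeoutMillis) {
        // Both the level 1 and the level 2 path use the same entry point.
        Runnable task = () -> Parent.getInstance();
        Thread a = new Thread(task, "gemv-like");
        Thread b = new Thread(task, "axpy-like");
        a.setDaemon(true);
        b.setDaemon(true);
        a.start();
        b.start();
        joinQuietly(a, timeoutMillis);
        joinQuietly(b, timeoutMillis);
        return !a.isAlive() && !b.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("both finished = " + bothFinish(5000));
    }
}
```

Whichever thread arrives second waits for the parent's initialization to complete and then proceeds, so no thread ever holds one initialization lock while waiting for the other.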