Many ML/MLlib algorithms use native BLAS (such as Intel MKL, ATLAS, or OpenBLAS) to improve performance.
How native BLAS is used matters a great deal for performance: in some cases (quite often, in fact) native BLAS actually makes performance worse.
For example, the ALS recommendForAll method before Spark 2.2 uses BLAS gemm for matrix multiplication.
If you test only the matrix multiplication itself, native BLAS gemm (Intel MKL or OpenBLAS) is about 10x faster than the netlib-java F2J BLAS gemm. But if you test the end-to-end performance of the Spark job, F2J is much faster than native BLAS, which is very interesting.
I spent a lot of time on this problem and found that we should not use a multi-threaded native BLAS (such as OpenBLAS or Intel MKL) without any configuration. By default, these native BLAS libraries enable multi-threading, which conflicts with the Spark executor's own task threads. You can still use a multi-threaded native BLAS, but it is better to disable its multi-threading first.
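As a rough illustration of what "disable multi-thread first" could look like, here is a minimal Python sketch that builds spark-submit `--conf` flags to pin native BLAS to one thread on the executors. It assumes the library honors the OPENBLAS_NUM_THREADS or MKL_NUM_THREADS environment variable (an OpenMP-based build may need OMP_NUM_THREADS instead) and relies on Spark's `spark.executorEnv.*` settings to pass them along. The helper names are hypothetical, not part of Spark.

```python
# Hypothetical helpers (not part of Spark): build spark-submit flags that
# limit native BLAS to a single thread inside each executor process.

# Environment variables read by OpenBLAS and Intel MKL respectively.
# Assumption: the installed BLAS honors one of these; other builds may differ.
BLAS_THREAD_ENV_VARS = ("OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")


def single_thread_blas_conf(num_threads=1):
    """Return Spark conf entries that export the BLAS thread limit to executors."""
    conf = {}
    for var in BLAS_THREAD_ENV_VARS:
        # spark.executorEnv.<NAME> sets environment variable NAME on executors.
        conf["spark.executorEnv." + var] = str(num_threads)
    return conf


def to_submit_args(conf):
    """Flatten a conf dict into a spark-submit argument list, sorted by key."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", key + "=" + value]
    return args


if __name__ == "__main__":
    # Print the flags so they can be pasted into a spark-submit command line.
    print(" ".join(to_submit_args(single_thread_blas_conf())))
```

These settings only reach the executors. If the driver also calls into native BLAS (for example in local mode), the same variables would presumably need to be exported in the driver's environment before the JVM starts.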
I think we should first add a note about this to docs/ml-guide.md.