Spark / SPARK-30916

Deadlock When Loading BLAS Class


    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.3.0, 2.4.5
    • Fix Version/s: None
    • Component/s: MLlib
    • Labels:
      None

      Description

      When an operation such as aggregate or treeAggregate is used, and one of seqOp and combOp calls a level 1 BLAS operation while the other calls a level 2 BLAS operation, the job can hit a JVM-internal deadlock that is hard to detect.

      Say the seqOp runs gemv, a level 2 BLAS operation, and the combOp runs axpy, a level 1 BLAS operation. When a task running seqOp executes at the same time as a task running combOp, the two task threads get stuck. The call stacks look like this (see the attached images):

      Both threads are reported as runnable, but neither of them is actually making progress.
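
      Because the reported thread state alone does not reveal the stall, one way to notice it is to run the suspect call on a separate thread with a timeout and dump that thread's stack if it has not returned. The following is only a minimal sketch of that idea; StallProbe and finishesWithin are hypothetical names introduced here, not part of Spark or netlib-java.

```java
public final class StallProbe {
    private StallProbe() {}

    /**
     * Runs the task on a daemon thread and waits up to timeoutMillis (> 0) for it.
     * Returns true if the task finished in time; otherwise prints the worker's
     * reported state and current stack frames, then returns false.
     */
    public static boolean finishesWithin(Runnable task, long timeoutMillis)
            throws InterruptedException {
        Thread worker = new Thread(task, "stall-probe");
        worker.setDaemon(true); // a stuck worker must not keep the JVM alive
        worker.start();
        worker.join(timeoutMillis);
        if (!worker.isAlive()) {
            return true;
        }
        System.err.println(worker.getName() + " still alive, state=" + worker.getState());
        for (StackTraceElement frame : worker.getStackTrace()) {
            System.err.println("    at " + frame);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // First a task that returns immediately, then one that outlives the timeout.
        System.out.println(finishesWithin(() -> { }, 1000));
        System.out.println(finishesWithin(() -> {
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, 200));
    }
}
```

      A timeout cannot distinguish a deadlock from a slow task, so the printed stack frames still have to be read by hand.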

       

      When gemv is called and no BLAS instance exists yet, it calls the getInstance method to obtain one. The first thread to enter runs the static code block of the BLAS class, which tries to load a subclass of BLAS and instantiate it via reflection.

      When axpy is called and no BLAS instance exists yet, it instantiates F2jBLAS directly with new, because axpy is a level 1 BLAS operation.

      The problem is that the classes BLAS tries to load in its static code block (NativeSystemBLAS, NativeRefBLAS and F2jBLAS) are all either subclasses of F2jBLAS or F2jBLAS itself. The class loading sequence triggered by the static code block of BLAS is:

      1. Load class BLAS -> lock the class BLAS
      2. Load class NativeSystemBLAS from the static code block -> lock the class NativeSystemBLAS
      3. Recursively load F2jBLAS, because it is the parent class of NativeSystemBLAS -> lock the class F2jBLAS
      4. ......

      Meanwhile, the sequence for instantiating F2jBLAS in the axpy operation is:

      1. Load class F2jBLAS -> lock the class F2jBLAS
      2. Recursively load BLAS, because it is the parent class of F2jBLAS -> lock the class BLAS
      3. ......

      If the task thread running gemv has just finished step 2 of its sequence while the task thread running axpy has just finished step 1 of its own, the gemv thread needs class F2jBLAS, which is locked by the axpy thread, and the axpy thread needs class BLAS, which is locked by the gemv thread. Neither thread can proceed, so they deadlock.

      The following demo can reproduce the problem:

      // Thread "native" initializes BLAS first; thread "f2j" initializes F2jBLAS first.
      // Depending on timing, the two class initializations wait on each other forever.
      class Demo {
          public static void main(String[] args) {
              Thread t1 = new Thread(new Runnable() {
                  @Override
                  public void run() {
                      // Triggers the static code block of BLAS, which loads NativeSystemBlas.
                      BLAS blas = BLAS.getInstance();
                      blas.print();
                  }
              });
              Thread t2 = new Thread(new Runnable() {
                  @Override
                  public void run() {
                      // Initializes F2jBLAS first, then its parent class BLAS.
                      BLAS blas = new F2jBLAS();
                      blas.print();
                  }
              });
              t1.setName("native");
              t2.setName("f2j");
              t1.start();
              t2.start();
          }
      }

      abstract class BLAS {
          public static BLAS instance;

          public abstract void print();

          public static BLAS getInstance() {
              return instance;
          }

          // Loads and instantiates the subclass by name via reflection.
          private static BLAS load() throws Exception {
              Class<?> klass = Class.forName("NativeSystemBlas");
              return (BLAS) klass.getDeclaredConstructor().newInstance();
          }

          // Runs while BLAS is being initialized; needs F2jBLAS to be initialized too.
          static {
              System.out.println("Entered static code block");
              try {
                  instance = load();
              } catch (Exception e) {
                  System.out.println("error: " + e);
              }
          }
      }

      class F2jBLAS extends BLAS {
          @Override
          public void print() {
              System.out.println("print F2j");
          }
      }

      class NativeSystemBlas extends F2jBLAS {
          @Override
          public void print() {
              System.out.println("print NativeBlas");
          }
      }
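
      The demo above only hangs when the two threads happen to interleave in the right way. The variant below is a sketch that tries to force that interleaving with a latch and a short pause, and uses daemon threads with a timed join so that the JVM can still exit. The class names mirror the demo but are nested in a hypothetical ForcedInterleavingDemo class, and the 500 ms pause is an assumption that the second thread gets scheduled within that window.

```java
import java.util.concurrent.CountDownLatch;

public class ForcedInterleavingDemo {
    // Lives outside the Blas hierarchy, so using it never triggers Blas initialization.
    static final CountDownLatch BLAS_INIT_STARTED = new CountDownLatch(1);

    abstract static class Blas {
        static Blas instance;

        static {
            BLAS_INIT_STARTED.countDown();     // this thread is now initializing Blas
            pause(500);                        // let the other thread start initializing F2jBlas
            instance = new NativeSystemBlas(); // needs F2jBlas to be initialized first
        }

        static Blas getInstance() {
            return instance;
        }
    }

    static class F2jBlas extends Blas { }

    static class NativeSystemBlas extends F2jBlas { }

    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Returns true if both threads are still stuck after waiting waitMillis for each. */
    static boolean deadlocks(long waitMillis) throws InterruptedException {
        Thread nativeThread = new Thread(() -> Blas.getInstance(), "native");
        Thread f2jThread = new Thread(() -> {
            try {
                BLAS_INIT_STARTED.await(); // wait until "native" is inside Blas's static block
            } catch (InterruptedException e) {
                return;
            }
            new F2jBlas(); // starts initializing F2jBlas, then needs its parent Blas
        }, "f2j");
        nativeThread.setDaemon(true); // stuck threads must not keep the JVM alive
        f2jThread.setDaemon(true);
        nativeThread.start();
        f2jThread.start();
        nativeThread.join(waitMillis);
        f2jThread.join(waitMillis);
        return nativeThread.isAlive() && f2jThread.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("deadlocked: " + deadlocks(2000));
    }
}
```

      Class initialization happens once per class loader, so this can only be exercised once per JVM run.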
      
      

      If Spark MLlib did not construct F2jBLAS directly for level 1 operations, and instead obtained the instance the same way it does for the native BLAS, this problem would not occur.
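
      As a sketch of that idea on simplified demo classes (not on Spark's actual code), both threads below obtain the instance through getInstance(), so every thread starts by initializing the parent class and the locks are always taken in the same order. SingleEntryPointDemo and its nested class names are hypothetical.

```java
public class SingleEntryPointDemo {
    abstract static class Blas {
        // Created while Blas is being initialized; any other thread that calls
        // getInstance() waits until this initialization has completed.
        private static final Blas INSTANCE = new NativeSystemBlas();

        static Blas getInstance() {
            return INSTANCE;
        }

        abstract String name();
    }

    static class F2jBlas extends Blas {
        @Override
        String name() {
            return "f2j";
        }
    }

    static class NativeSystemBlas extends F2jBlas {
        @Override
        String name() {
            return "native";
        }
    }

    /** Returns true if both threads finished within timeoutMillis each. */
    static boolean bothThreadsFinish(long timeoutMillis) throws InterruptedException {
        // Neither thread touches F2jBlas directly; both enter through Blas.
        Runnable viaGetInstance = () -> System.out.println(
                Thread.currentThread().getName() + " got " + Blas.getInstance().name());
        Thread level2 = new Thread(viaGetInstance, "gemv-like");
        Thread level1 = new Thread(viaGetInstance, "axpy-like");
        level2.setDaemon(true);
        level1.setDaemon(true);
        level2.start();
        level1.start();
        level2.join(timeoutMillis);
        level1.join(timeoutMillis);
        return !level2.isAlive() && !level1.isAlive();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("both finished: " + bothThreadsFinish(5000));
    }
}
```

      Another option along the same lines would be to trigger getInstance() once from a single thread before any task runs, so that the classes are already initialized when the tasks start.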

        Attachments

        1. image-2020-02-21-16-31-34-274.png
          29 kB
          Mingda Jia
        2. image-2020-02-21-16-31-19-880.png
          82 kB
          Mingda Jia
        3. image-2020-02-21-16-30-45-553.png
          41 kB
          Mingda Jia
        4. image-2020-02-21-16-30-35-652.png
          46 kB
          Mingda Jia

            People

            • Assignee: Unassigned
            • Reporter: Mingda Jia (martinJia)
                Time Tracking

                • Original Estimate: 2h
                • Remaining Estimate: 2h
                • Time Spent: Not Specified