SystemML / SYSTEMML-2487

Native Dnn operations crashing in over-provisioned parfor

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: SystemML 1.2
    • Component/s: None
    • Labels: None

      Description

      If parfor does not consume all of the available parallelism, we propagate the remaining parallelism down to individual operations, with slight (at most 50%) over-provisioning. For example, with 80 vcores and parfor assigned k=47, we still assign k=2 to each individual operation.
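
      The arithmetic in this example can be sketched as follows. This is only an illustration, not SystemML's actual optimizer code: the class and method names are hypothetical, and rounding the ratio to the nearest integer is an assumption chosen because it reproduces the k=2 from the example.

```java
public class ParforOverProvisioning {
    // Hypothetical helper: degree of parallelism handed to each operation
    // inside the parfor body, given the vcores and the parfor's own k.
    static int operationParallelism(int vcores, int parforK) {
        // Assumption: round to nearest, so a fractional remainder >= 0.5
        // is rounded up and the machine ends up over-provisioned.
        return Math.max(1, (int) Math.round((double) vcores / parforK));
    }

    // Total number of threads that may be active at once.
    static int totalThreads(int vcores, int parforK) {
        return parforK * operationParallelism(vcores, parforK);
    }

    public static void main(String[] args) {
        int vcores = 80;
        int parforK = 47;
        System.out.println("per-operation k = " + operationParallelism(vcores, parforK));
        System.out.println("total threads   = " + totalThreads(vcores, parforK)
            + " on " + vcores + " vcores");
    }
}
```

      Under these assumptions, 80 vcores and k=47 give 2 threads per operation, i.e. up to 94 concurrent threads on 80 vcores.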

      However, with native DNN operations this causes JVM crashes as follows:

      #
      # A fatal error has been detected by the Java Runtime Environment:
      #
      #  SIGFPE (0x8) at pc=0x00007f5de21902d6, pid=335027, tid=0x00007f5df8bcb700
      #
      # JRE version: OpenJDK Runtime Environment (8.0_161-b14) (build 1.8.0_161-b14)
      # Java VM: OpenJDK 64-Bit Server VM (25.161-b14 mixed mode linux-amd64 )
      # Problematic frame:
      # C  [libmkl_avx512.so+0x206d2d6][thread 140041622857472 also had an error]
        mkl_dnn_avx512_bkdGemmDirectConv_F64+0x276
      

      Hence, when native BLAS or DNN libraries are loaded, we should be more conservative and not over-provision at all.
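
      One way to express such a conservative policy is to round the remaining parallelism down whenever a native library is active. The sketch below is hypothetical: the `nativeLibraryLoaded` flag stands in for whatever check SystemML performs, the names are not taken from the code base, and the issue does not state that the actual fix uses this exact rounding rule.

```java
public class ConservativeParallelism {
    // Hypothetical helper: per-operation degree of parallelism inside a parfor.
    static int operationParallelism(int vcores, int parforK, boolean nativeLibraryLoaded) {
        double ratio = (double) vcores / parforK;
        // Assumption: with native BLAS/DNN libraries, round down so that
        // parforK * k never exceeds vcores (as long as parforK <= vcores);
        // otherwise keep the rounding that allows slight over-provisioning.
        int k = nativeLibraryLoaded ? (int) Math.floor(ratio) : (int) Math.round(ratio);
        return Math.max(1, k);
    }

    public static void main(String[] args) {
        int vcores = 80;
        int parforK = 47;
        System.out.println("native libs loaded: k = "
            + operationParallelism(vcores, parforK, true));
        System.out.println("pure Java:          k = "
            + operationParallelism(vcores, parforK, false));
    }
}
```

      For the example above (80 vcores, parfor k=47), this sketch assigns k=1 per operation when a native library is loaded, for 47 threads in total, and k=2 otherwise.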

            People

            • Assignee: Matthias Boehm (mboehm7)
            • Reporter: Matthias Boehm (mboehm7)