Spark / SPARK-25174

ApplicationMaster hangs when unregistering itself from the RM with an extremely large diagnostic message


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.1.1
    • Fix Version/s: 2.4.0
    • Component/s: Spark Core, YARN
    • Labels: None

    Description

      We recently ran into SPARK-18016, which was fixed in v2.3.0. This JIRA is not about SPARK-18016 itself but about a side effect it causes. When SPARK-18016 occurs, the ApplicationMaster fails to unregister itself because the exception carries an extremely large error message.

      ERROR yarn.ApplicationMaster: User class threw exception: java.lang.RuntimeException: Error while decoding: java.util.concurrent.ExecutionException: java.lang.Exception: failed to compile: org.codehaus.janino.JaninoRuntimeException: Constant pool has grown past JVM limit of 0xFFFF
      /* 001 */ public java.lang.Object generate(Object[] references) {
      ....
      
      /* 395656 */       mutableRow.update(0, value);
      /* 395657 */     }
      /* 395658 */
      /* 395659 */     return mutableRow;
      /* 395660 */   }
      /* 395661 */ }
      

      The codegen text above is included in the final message the AM sends to the RM when it unregisters. That message ends up crashing the RM's ZKRMStateStore, because YARN-6125 does not cover truncation of the unregisterApplicationMaster message. We have also created a JIRA on the YARN side: https://issues.apache.org/jira/browse/YARN-8691

      Although SPARK-18016 has already been fixed, other uncaught exceptions may trigger the same problem. I think we should limit the size of the error message sent to the RM when unregistering the AM.
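
      To illustrate the idea of capping the message before it is sent, here is a minimal sketch. The class name DiagnosticLimiter, the method limit, and the marker text are hypothetical and are not part of Spark or YARN; the limit counts Java chars, not bytes, so a real fix would need to pick a limit that suits the RM state store.

```java
// Hypothetical helper for illustration only; not Spark's or YARN's API.
final class DiagnosticLimiter {
    private DiagnosticLimiter() {}

    /**
     * Returns the message unchanged if it fits within maxChars.
     * Otherwise keeps the head of the message (where the exception
     * summary usually is) and appends a marker, so that the result
     * is never longer than maxChars.
     */
    static String limit(String message, int maxChars) {
        if (maxChars < 0) {
            throw new IllegalArgumentException("maxChars must be non-negative");
        }
        if (message == null) {
            return "";
        }
        if (message.length() <= maxChars) {
            return message;
        }
        String marker = "... [truncated, original length " + message.length() + " chars]";
        if (marker.length() >= maxChars) {
            // No room for the marker: fall back to a hard cut.
            return message.substring(0, maxChars);
        }
        return message.substring(0, maxChars - marker.length()) + marker;
    }
}
```

      With something like this, the AM would pass DiagnosticLimiter.limit(msg, someLimit) instead of the raw exception text as the diagnostics string when unregistering, so a 395661-line codegen dump could not reach the RM in full.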

            People

              Assignee: Qin Yao Kent Yao 2
              Reporter: Qin Yao Kent Yao 2