REEF-1782

REEF-on-REEF host driver closes prematurely

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Labels:
    • Environment:

      YARN 2.7.3+

      Description

      A REEF-on-REEF application runs on YARN, and the inner application completes successfully; however, the host application's driver closes prematurely, and the RM reports it with State FAILED and Final-State FAILED:

      $ yarn application -list -appStates ALL
                      Application-Id      Application-Name        Application-Type          User           Queue                   State             Final-State             Progress                        Tracking-URL
      application_1492554568254_0013     REEF-on-REEF:host                    YARN        hadoop      root.hadoop                 FAILED                  FAILED                 100% http://cisl-linux-070:8088/cluster/app/application_1492554568254_0013
      application_1492554568254_0014    REEF-on-REEF:hello                    YARN        hadoop      root.hadoop               FINISHED               SUCCEEDED                 100%                                 N/A
      

      Most likely, this happens because, on completion, the inner application closes some resources that either belong to the host app or are shared with it.

      Here's a fragment of the driver log:

      2017-04-18 19:15:52,332 INFO reef.examples.reefonreef.ReefOnReefDriver.onNext main | REEF-on-REEF inner job application_1492554568254_0014 completed: state DONE
      2017-04-18 19:15:52,332 FINER reef.runtime.common.REEFEnvironment.close main | ENTRY
      2017-04-18 19:15:52,332 FINER reef.wake.time.runtime.RuntimeClock.close main | ENTRY
      2017-04-18 19:15:52,332 FINER reef.wake.time.runtime.RuntimeClock.close main | RETURN Clock has already been closed
      2017-04-18 19:15:52,332 FINER reef.runtime.common.launch.REEFErrorHandler.close main | ENTRY
      2017-04-18 19:15:52,332 FINER reef.runtime.common.utils.RemoteManager.close main | ENTRY
      2017-04-18 19:15:52,332 FINE reef.wake.remote.impl.DefaultRemoteManagerImplementation.close main | RemoteManager: REEF_UNMANAGED_DRIVER Closing remote manager id: socket://10.200.91.65:16952
      2017-04-18 19:15:52,332 FINE reef.wake.remote.impl.DefaultRemoteManagerImplementation.close main | RemoteManager: REEF_UNMANAGED_DRIVER already closed
      2017-04-18 19:15:52,332 FINER reef.runtime.common.utils.RemoteManager.close main | RETURN
      2017-04-18 19:15:52,332 FINER reef.runtime.common.launch.REEFErrorHandler.close main | RETURN
      2017-04-18 19:15:52,332 FINER reef.runtime.common.REEFEnvironment.close main | RETURN
      2017-04-18 19:15:52,332 INFO reef.examples.reefonreef.ReefOnReefDriver.onNext main | REEF-on-REEF host job REEF-on-REEF:host completed: inner app application_1492554568254_0014 status SUBMITTED
      

      That is, some driver resources have already been closed by the time the inner app ends.

      Another good test for this behavior would be to run two inner applications sequentially in Unmanaged AM mode from the same host driver.
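
      The suspected failure mode and the proposed sequential test can be modeled in isolation. The sketch below is a toy illustration only: it does not use any REEF API, and `SharedResource`, `Environment`, and `runTwoInnerJobs` are hypothetical names standing in for a host-owned resource (such as a clock or remote manager), an inner environment that closes whatever it was handed, and the two-jobs-in-sequence test. It assumes the hypothesis above is correct, which has not been confirmed.

```java
import java.util.ArrayList;
import java.util.List;

public class SharedCloseDemo {

  /** Hypothetical stand-in for a host-owned resource (e.g. a clock or remote manager). */
  static final class SharedResource implements AutoCloseable {
    private final String name;
    private boolean closed = false;

    SharedResource(String name) { this.name = name; }

    String getName() { return name; }

    boolean isClosed() { return closed; }

    @Override
    public void close() { closed = true; }
  }

  /** Hypothetical inner environment that closes the resource it was given, owned or not. */
  static final class Environment implements AutoCloseable {
    private final SharedResource resource;

    Environment(SharedResource resource) { this.resource = resource; }

    void run(String job) {
      if (resource.isClosed()) {
        throw new IllegalStateException(job + ": " + resource.getName() + " already closed");
      }
    }

    @Override
    public void close() { resource.close(); } // assumed bug: closes a resource it does not own
  }

  /** Runs two inner jobs sequentially against one host resource and records what happened. */
  public static List<String> runTwoInnerJobs() {
    List<String> events = new ArrayList<>();
    SharedResource hostResource = new SharedResource("host-remote-manager");
    for (String job : new String[] {"inner-1", "inner-2"}) {
      try (Environment inner = new Environment(hostResource)) {
        inner.run(job);
        events.add(job + ": completed");
      } catch (IllegalStateException e) {
        events.add(job + ": failed");
      }
    }
    events.add("host resource closed: " + hostResource.isClosed());
    return events;
  }

  public static void main(String[] args) {
    runTwoInnerJobs().forEach(System.out::println);
  }
}
```

      In this model the first inner job completes, its environment closes the host's resource on exit, and the second job then fails against the closed resource. If the real driver behaves the same way, the sequential Unmanaged AM test should fail on the second inner application.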

              People

              • Assignee:
                motus Sergiy Matusevych
              • Reporter:
                motus Sergiy Matusevych
              • Votes:
                0
              • Watchers:
                1

                  Time Tracking

                  • Original Estimate: 168h
                  • Remaining Estimate: 168h
                  • Time Spent: Not Specified