Oozie / OOZIE-2326

oozie/yarn/spark: active container remains after failed job

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 4.1.0
    • Fix Version/s: None
    • Component/s: workflow
    • Labels: None
    • Environment: pseudo-distributed (single VM), CentOS 6.6, CDH 5.4.3

      Description

      The issue occurs when I launch a Spark job (local mode) that fails. (My example failed because it tried to read a non-existent file.) When this occurs, the job fails and YARN ends up in an inconsistent state: the ResourceManager (RM) shows that the launcher job has completed, but a container for that job is still live on the slave node. Because I'm running in pseudo-distributed mode, this hangs my whole cluster: there are only enough resources for a single container, and that container is held by the dead Oozie launcher, so no other jobs can run.

      If I wait long enough, YARN eventually times out, releases the container, and starts accepting new jobs. Until then, nothing else can run.

      I'm attaching screenshots that show the state right after running the failed job:
      • the RM shows no jobs running
      • the node shows one container running
      I'm also attaching log files for the Oozie job and the container.
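
      For anyone trying to confirm the same state without screenshots, here is a minimal Python sketch that compares what the RM reports as running against the containers the NodeManager (NM) still holds. It is not part of the original report. It assumes the standard YARN REST endpoints (`/ws/v1/cluster/apps` on the RM, `/ws/v1/node/containers` on the NM), the default web ports on `localhost`, and that only containers in state `RUNNING` count as live; the function names are hypothetical.

```python
import json
import re
from urllib.request import Request, urlopen

# Assumed addresses for a single-VM, pseudo-distributed setup; adjust as needed.
RM_URL = "http://localhost:8088"
NM_URL = "http://localhost:8042"

# Container IDs look like container_<clusterTs>_<appSeq>_<attempt>_<n>,
# optionally with an epoch part: container_e<epoch>_<clusterTs>_...
_CONTAINER_ID = re.compile(r"^container_(?:e\d+_)?(\d+)_(\d+)_\d+_\d+$")


def app_id_of(container_id):
    """Derive the owning application ID from a YARN container ID."""
    m = _CONTAINER_ID.match(container_id)
    if m is None:
        raise ValueError("unrecognized container ID: " + container_id)
    return "application_{}_{}".format(m.group(1), m.group(2))


def running_app_ids(rm_response):
    """IDs of applications in the RM response (empty if 'apps' is null)."""
    apps = (rm_response.get("apps") or {}).get("app") or []
    return {a["id"] for a in apps}


def live_containers(nm_response):
    """IDs of containers the NM lists in state RUNNING."""
    items = (nm_response.get("containers") or {}).get("container") or []
    return [c["id"] for c in items if c.get("state", "RUNNING") == "RUNNING"]


def orphaned_containers(rm_response, nm_response):
    """Live containers whose application the RM no longer reports as running."""
    running = running_app_ids(rm_response)
    return [c for c in live_containers(nm_response) if app_id_of(c) not in running]


def fetch_json(url):
    """GET a URL and parse the body as JSON."""
    req = Request(url, headers={"Accept": "application/json"})
    with urlopen(req) as resp:
        return json.load(resp)


def report():
    """Query the RM and NM and print any leftover containers."""
    rm = fetch_json(RM_URL + "/ws/v1/cluster/apps?states=RUNNING")
    nm = fetch_json(NM_URL + "/ws/v1/node/containers")
    for cid in orphaned_containers(rm, nm):
        print("container still live for a non-running app:", cid)
```

      Calling `report()` on the VM right after the failed job should print the launcher's container if the cluster is in the state described above. On a healthy cluster it should print nothing.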

        Attachments

        1. container-logs.txt
          45 kB
          Diana Carroll
        2. ooziejob-logs.txt
          10 kB
          Diana Carroll
        3. yarnbug1.png
          150 kB
          Diana Carroll
        4. yarnbug2.png
          58 kB
          Diana Carroll

            People

            • Assignee: Satish Saley (satishsaley)
            • Reporter: Diana Carroll (dcarroll@cloudera.com)
            • Votes: 0
            • Watchers: 2
