Mesos / MESOS-8756

Missing reasons for early task failures



      Description

Some early task failures are not propagated to the framework. Here is an example of a Marathon pod definition (Mesos containerizer) with a non-existent image:

      {
        "id": "/fail",
        "containers": [
          {
            "name": "container-1",
            "resources": {
              "cpus": 0.1,
              "mem": 128
            },
            "image": {
              "id": "non-existing-image-56789",
              "kind": "DOCKER"
            }
          }
        ],
        "scaling": {
          "instances": 1,
          "kind": "fixed"
        },
        "networks": [
          {
            "mode": "host"
          }
        ],
        "volumes": [],
        "fetch": [],
        "scheduling": {
          "placement": {
            "constraints": []
          }
        }
      }
      

Here, the status update the framework receives is a bare TASK_FAILED (Executor terminated), with no indication that the image could not be found.
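To make the gap concrete, here is a minimal sketch in plain Python (not the actual Mesos bindings; field names are modeled on the TaskStatus message in mesos.proto, and the message/reason strings for the "desired" case are illustrative, not verbatim Mesos output) contrasting the update a framework receives today with one that would be actionable:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskStatus:
    # Field names modeled on mesos.proto's TaskStatus message.
    state: str
    message: Optional[str] = None
    reason: Optional[str] = None

# What the framework receives today for the missing-image pod:
received = TaskStatus(state="TASK_FAILED", message="Executor terminated")

# What would let the framework (and its users) diagnose the failure without
# grepping agent logs (message and reason here are hypothetical):
desired = TaskStatus(
    state="TASK_FAILED",
    message="Failed to provision image 'non-existing-image-56789': not found",
    reason="REASON_CONTAINER_LAUNCH_FAILED",
)

def is_actionable(status: TaskStatus) -> bool:
    """A failed status is actionable if it carries a concrete failure reason."""
    return status.reason is not None

print(is_actionable(received))  # False
print(is_actionable(desired))   # True
```

The point is not the exact strings but that the reason field reaches the framework at all; today it stays behind on the agent.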

Here is another example, in which a non-existent artifact is fetched:

      {
        "id": "/fail2",
        "containers": [
          {
            "name": "container-1",
            "resources": {
              "cpus": 0.1,
              "mem": 128
            },
            "image": {
              "id": "nginx",
              "kind": "DOCKER",
              "forcePull": false
            },
            "artifacts": [
              {
                "uri": "http://example.com/smth-non-existing-12345.tar.gz"
              }
            ]
          }
        ],
        "scaling": {
          "instances": 1,
          "kind": "fixed"
        },
        "networks": [
          {
            "mode": "host"
          }
        ],
        "volumes": [],
        "fetch": [],
        "scheduling": {
          "placement": {
            "constraints": []
          }
        }
      }
      

This results in the same uninformative status update as above.

This is not an exhaustive list of such cases; there are likely more failures along the fork/exec chain that are not properly propagated.

Frameworks (and their users) should always receive meaningful task failure reasons, no matter where in the launch path the failure occurred. Otherwise, the only way to find out what happened is to grep the agent logs.
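For illustration, this is the kind of manual log-digging operators are forced into today. A self-contained sketch using a synthetic agent log (the log lines, path, and task ID are illustrative, not verbatim Mesos output):

```python
import re
import tempfile

# Synthetic agent log fragment (illustrative; real Mesos log lines differ).
log_text = """\
I0404 12:00:01 slave.cpp:123 Launching task fail.instance-1234
E0404 12:00:02 provisioner.cpp:456 Failed to provision image 'non-existing-image-56789': not found
I0404 12:00:03 slave.cpp:789 Executor terminated
"""

with tempfile.NamedTemporaryFile("w", suffix=".log", delete=False) as f:
    f.write(log_text)
    log_path = f.name

# The grep-equivalent: scan the agent log for error-level lines.
with open(log_path) as f:
    errors = [line.rstrip() for line in f if re.search(r"Failed|Error", line)]

for line in errors:
    print(line)
# Prints the provisioner error line, which never reached the framework.
```

The real failure reason exists on the agent at the moment of failure; propagating it in the status update would make this scavenger hunt unnecessary.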


People

• Assignee: Unassigned
• Reporter: zen-dog A. Dukhovniy
