Mesos / MESOS-8756

Missing reasons for early task failures


Description

Some early task failures are not propagated to the framework. Here is an example of a Marathon pod (Mesos containerizer) definition with a non-existent image:

      {
        "id": "/fail",
        "containers": [
          {
            "name": "container-1",
            "resources": {
              "cpus": 0.1,
              "mem": 128
            },
            "image": {
              "id": "non-existing-image-56789",
              "kind": "DOCKER"
            }
          }
        ],
        "scaling": {
          "instances": 1,
          "kind": "fixed"
        },
        "networks": [
          {
            "mode": "host"
          }
        ],
        "volumes": [],
        "fetch": [],
        "scheduling": {
          "placement": {
            "constraints": []
          }
        }
      }
      

Here the status update the framework receives is just TASK_FAILED (Executor terminated); nothing hints at the missing image.
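For illustration, here is a minimal sketch (plain Python dicts, not the actual Mesos scheduler API or TaskStatus protobuf; field names only loosely mirror mesos.proto) of what a framework's status-update handler effectively has to work with in this case:

```python
# Illustrative stand-in for the status update delivered to the framework.
# The agent knows the image pull failed, but none of that detail arrives here.
status = {
    "task_id": "fail.instance-1",        # hypothetical task ID
    "state": "TASK_FAILED",
    "message": "Executor terminated",    # the only detail the framework gets
    "reason": None,                      # no REASON_* pointing at the image pull
}

def handle_status_update(status):
    """A framework can only surface what the agent actually sent it."""
    if status["state"] == "TASK_FAILED":
        detail = status["message"] or "unknown reason"
        return f"Task {status['task_id']} failed: {detail}"

print(handle_status_update(status))
```

Whatever the framework reports to its users is bounded by this message; the real cause (the non-existent image) never leaves the agent.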

Here is another example, where a non-existing artifact is fetched:

      {
        "id": "/fail2",
        "containers": [
          {
            "name": "container-1",
            "resources": {
              "cpus": 0.1,
              "mem": 128
            },
            "image": {
              "id": "nginx",
              "kind": "DOCKER",
              "forcePull": false
            },
            "artifacts": [
              {
                "uri": "http://example.com/smth-non-existing-12345.tar.gz"
              }
            ]
          }
        ],
        "scaling": {
          "instances": 1,
          "kind": "fixed"
        },
        "networks": [
          {
            "mode": "host"
          }
        ],
        "volumes": [],
        "fetch": [],
        "scheduling": {
          "placement": {
            "constraints": []
          }
        }
      }
      

      which results in the same status update as above.

This is not an exhaustive list of such cases. I'm sure there are more failures along the fork chain that are not properly propagated.

Frameworks (and their users) should always receive meaningful task failure reasons, no matter where those failures happened. Otherwise, the only way to find out what went wrong is to grep the agent logs.
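The "grep the agent logs" workaround can be sketched as follows. This is a self-contained illustration: the log line, file name, and log format below are fabricated, and a real operator would search the actual files under the agent's log directory instead:

```python
import pathlib
import tempfile

# Fabricated example of the kind of line that appears only in agent logs,
# never in the status update delivered to the framework.
log_line = (
    "E0418 12:00:01 provisioner.cpp] "
    "Failed to pull image 'non-existing-image-56789': not found"
)

with tempfile.TemporaryDirectory() as d:
    log = pathlib.Path(d) / "mesos-agent.log"
    log.write_text(log_line + "\n")

    # Equivalent of: grep 'non-existing-image-56789' <agent log dir>/*.log
    hits = [line for line in log.read_text().splitlines()
            if "non-existing-image-56789" in line]
    print(hits[0])
```

The point of the ticket is that this manual step should not be necessary: the agent already has the reason and could attach it to the terminal status update.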

People

    Assignee: Unassigned
    Reporter: zen-dog (A. Dukhovniy)