Mesos / MESOS-8756

Missing reasons for early task failures


Description

Some early task failures are not propagated to the framework. Here is an example of a Marathon pod (Mesos containerizer) definition with a non-existent image:

      {
        "id": "/fail",
        "containers": [
          {
            "name": "container-1",
            "resources": {
              "cpus": 0.1,
              "mem": 128
            },
            "image": {
              "id": "non-existing-image-56789",
              "kind": "DOCKER"
            }
          }
        ],
        "scaling": {
          "instances": 1,
          "kind": "fixed"
        },
        "networks": [
          {
            "mode": "host"
          }
        ],
        "volumes": [],
        "fetch": [],
        "scheduling": {
          "placement": {
            "constraints": []
          }
        }
      }
      

Here the status update the framework receives is just TASK_FAILED (Executor terminated); nothing hints at the missing image.
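For illustration, here is a minimal sketch (plain Python dicts, not the actual Mesos scheduler API or TaskStatus protobuf; field names only loosely mirror mesos.proto) of what a framework's status-update handler effectively has to work with in this case:

```python
# Illustrative stand-in for the status update delivered to the framework.
# The agent knows the image pull failed, but none of that detail arrives here.
status = {
    "task_id": "fail.instance-1",        # hypothetical task ID
    "state": "TASK_FAILED",
    "message": "Executor terminated",    # the only detail the framework gets
    "reason": None,                      # no REASON_* pointing at the image pull
}

def handle_status_update(status):
    """A framework can only surface what the agent actually sent it."""
    if status["state"] == "TASK_FAILED":
        detail = status["message"] or "unknown reason"
        return f"Task {status['task_id']} failed: {detail}"

print(handle_status_update(status))
```

Whatever the framework reports to its users is bounded by this message; the real cause (the non-existent image) never leaves the agent.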

Here is another example, where a non-existing artifact is fetched:

      {
        "id": "/fail2",
        "containers": [
          {
            "name": "container-1",
            "resources": {
              "cpus": 0.1,
              "mem": 128
            },
            "image": {
              "id": "nginx",
              "kind": "DOCKER",
              "forcePull": false
            },
            "artifacts": [
              {
                "uri": "http://example.com/smth-non-existing-12345.tar.gz"
              }
            ]
          }
        ],
        "scaling": {
          "instances": 1,
          "kind": "fixed"
        },
        "networks": [
          {
            "mode": "host"
          }
        ],
        "volumes": [],
        "fetch": [],
        "scheduling": {
          "placement": {
            "constraints": []
          }
        }
      }
      

      which results in the same status update as above.

This is not an exhaustive list of such cases. I'm sure there are more failures along the fork chain that are not properly propagated.

Frameworks (and their users) should always receive meaningful task failure reasons, no matter where those failures happened. Otherwise, the only way to find out what went wrong is to grep the agent logs.
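The "grep the agent logs" workaround can be sketched as follows. This is a self-contained illustration: the log line, file name, and log format below are fabricated, and a real operator would search the actual files under the agent's log directory instead:

```python
import pathlib
import tempfile

# Fabricated example of the kind of line that appears only in agent logs,
# never in the status update delivered to the framework.
log_line = (
    "E0418 12:00:01 provisioner.cpp] "
    "Failed to pull image 'non-existing-image-56789': not found"
)

with tempfile.TemporaryDirectory() as d:
    log = pathlib.Path(d) / "mesos-agent.log"
    log.write_text(log_line + "\n")

    # Equivalent of: grep 'non-existing-image-56789' <agent log dir>/*.log
    hits = [line for line in log.read_text().splitlines()
            if "non-existing-image-56789" in line]
    print(hits[0])
```

The point of the ticket is that this manual step should not be necessary: the agent already has the reason and could attach it to the terminal status update.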

People

    Assignee: Unassigned
    Reporter: zen-dog (A. Dukhovniy)