Details
Type: Sub-task
Status: Closed
Priority: Minor
Resolution: Done
Version: 1.9.1
Description
With Flink on YARN, we sometimes run into an exception like this:
java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id container_xxxx timed out.
To find the host of the lost TaskManager so that we can log into it for more details, we have to search the earlier logs for the host information, which is a little time-consuming.
Maybe we can add more descriptive information to the ResourceID of YARN containers, e.g. "container_xxx@host_name:port_number".
Here's the demo:
class ResourceID {
    final String resourceId;
    final String details;

    public ResourceID(String resourceId) {
        this.resourceId = resourceId;
        this.details = resourceId;
    }

    public ResourceID(String resourceId, String details) {
        this.resourceId = resourceId;
        this.details = details;
    }

    public String toString() {
        return details;
    }
}

// in flink-yarn
private void startTaskExecutorInContainer(Container container) {
    final String containerIdStr = container.getId().toString();
    final String containerDetail = container.getId() + "@" + container.getNodeId();
    final ResourceID resourceId = new ResourceID(containerIdStr, containerDetail);
    ...
}
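
For illustration, here is a self-contained sketch of the same idea that runs without Flink or YARN on the classpath. The names DescriptiveResourceId and DescriptiveResourceIdDemo, the host worker-1.example.com and the port 8041 are all hypothetical. The sketch also assumes that equality should depend only on the raw container id, so that adding details would not change how the id behaves as a map key; that is a design assumption, not something settled in this ticket.

```java
import java.util.Objects;

/** Hypothetical stand-in for Flink's ResourceID; not the actual Flink class. */
final class DescriptiveResourceId {
    private final String resourceId; // raw id, e.g. the YARN container id
    private final String details;    // human-readable form used in log messages

    DescriptiveResourceId(String resourceId) {
        // Without extra details, fall back to the raw id.
        this(resourceId, resourceId);
    }

    DescriptiveResourceId(String resourceId, String details) {
        this.resourceId = Objects.requireNonNull(resourceId);
        this.details = Objects.requireNonNull(details);
    }

    String getResourceId() {
        return resourceId;
    }

    // Assumption: identity depends only on the raw id, never on the details.
    @Override
    public boolean equals(Object o) {
        return o instanceof DescriptiveResourceId
                && resourceId.equals(((DescriptiveResourceId) o).resourceId);
    }

    @Override
    public int hashCode() {
        return resourceId.hashCode();
    }

    @Override
    public String toString() {
        return details;
    }
}

class DescriptiveResourceIdDemo {
    /** Builds the "container@host:port" form proposed in the description. */
    static String describe(String containerId, String host, int port) {
        return containerId + "@" + host + ":" + port;
    }

    public static void main(String[] args) {
        // Placeholder container id, hypothetical host and port.
        String containerId = "container_xxxx";
        DescriptiveResourceId id = new DescriptiveResourceId(
                containerId, describe(containerId, "worker-1.example.com", 8041));

        // prints: The heartbeat of TaskManager with id container_xxxx@worker-1.example.com:8041 timed out.
        System.out.println("The heartbeat of TaskManager with id " + id + " timed out.");
    }
}
```

With this shape, the timeout message names the host directly, while lookups keyed by the id would still match an instance created from the bare container id.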