Flink / FLINK-15679 Improve Flink's ID system / FLINK-15448

Log host information for TaskManager failures.

Details

    Description

      With Flink on Yarn, we sometimes run into an exception like this:

      java.util.concurrent.TimeoutException: The heartbeat of TaskManager with id container_xxxx  timed out.
      

      To find the host of the lost TaskManager so that we can log into it for more details, we have to search the earlier logs for the host information, which is a little time-consuming.

      Maybe we can add more descriptive information to the ResourceID of Yarn containers, e.g. "container_xxx@host_name:port_number".

      Here's the demo:

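      class ResourceID {
        // Identifier of the resource, e.g. the Yarn container ID.
        final String resourceId;
        // Descriptive form used for logging, e.g. "container_xxx@host_name:port_number".
        final String details;
      
        public ResourceID(String resourceId) {
          // Without extra details, fall back to the plain ID.
          this(resourceId, resourceId);
        }
      
        public ResourceID(String resourceId, String details) {
          this.resourceId = resourceId;
          this.details = details;
        }
      
        @Override
        public String toString() {
          return details;
        }
      }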
      
      // in flink-yarn
      private void startTaskExecutorInContainer(Container container) {
        final String containerIdStr = container.getId().toString();
        final String containerDetail = container.getId() + "@" + container.getNodeId();  
        final ResourceID resourceId = new ResourceID(containerIdStr, containerDetail);
        ...
      }
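
      One point the demo leaves open is how equality should behave once details are attached. The following is a minimal, self-contained sketch, not Flink's actual ResourceID: the class name DescriptiveResourceId and the helper heartbeatTimeoutMessage are hypothetical, and it assumes identity should depend on resourceId only, so that lookups keyed by the plain container ID keep working while log messages pick up the host.

```java
// Hypothetical sketch for illustration; this is not Flink's ResourceID class.
final class DescriptiveResourceId {
    private final String resourceId; // identity, e.g. the Yarn container ID
    private final String details;    // descriptive form, used only for logging

    DescriptiveResourceId(String resourceId, String details) {
        if (resourceId == null || details == null) {
            throw new NullPointerException("resourceId and details must not be null");
        }
        this.resourceId = resourceId;
        this.details = details;
    }

    String getResourceId() {
        return resourceId;
    }

    // Assumption: equality ignores details, so two IDs for the same container
    // compare equal whether or not host information was attached.
    @Override
    public boolean equals(Object o) {
        if (this == o) {
            return true;
        }
        if (!(o instanceof DescriptiveResourceId)) {
            return false;
        }
        return resourceId.equals(((DescriptiveResourceId) o).resourceId);
    }

    // Must agree with equals, so it hashes resourceId only.
    @Override
    public int hashCode() {
        return resourceId.hashCode();
    }

    @Override
    public String toString() {
        return details;
    }

    // Builds a message shaped like the one in the description via toString().
    static String heartbeatTimeoutMessage(DescriptiveResourceId id) {
        return "The heartbeat of TaskManager with id " + id + " timed out.";
    }

    public static void main(String[] args) {
        DescriptiveResourceId plain =
                new DescriptiveResourceId("container_xxxx", "container_xxxx");
        DescriptiveResourceId detailed =
                new DescriptiveResourceId("container_xxxx", "container_xxxx@host_name:port_number");

        System.out.println(plain.equals(detailed)); // prints "true"
        System.out.println(heartbeatTimeoutMessage(detailed));
    }
}
```

      With this shape, maps and sets keyed by the ID are unaffected by the extra details, and only the string form changes. Whether Flink should compare details as well is a design decision this sketch does not settle.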
      

          People

            guoyangze Yangze Guo
            victor-wong jiasheng55

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 20m