Hadoop YARN / YARN-9947

Lazy init appLogAggregatorImpl during log aggregation

Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.3
    • Fix Version/s: None
    • Component/s: nodemanager
    • Labels: None

    Description

      This issue introduces a way to lazily initialize appLogAggregatorImpl so that it accesses HDFS as late as possible (usually when the app finishes). This keeps the NMs in a cluster from all accessing HDFS at the same time when they are restarted, and so reduces pressure on HDFS. Details follow.

      In the current version, the app log aggregator checks HDFS and tries to create the remote log directory when an app is initialized. This causes a problem when restarting NMs in a large cluster whose HDFS is heavily loaded. A restarting NM initializes all of its apps, and each initialization tries to connect to HDFS. If HDFS is already heavily loaded, many NMs restarting at the same time can make it unresponsive. Each NM then waits for HDFS to respond, the RM stops receiving that NM's heartbeats, and the RM treats all of its containers as timed out.

      In our production environment with 3500+ NMs, we found that restarting the NMs puts heavy pressure on HDFS and that app initialization blocks on HDFS access (stack attached below), which causes all containers to fail (the number of containers on an NM drops to zero).

      We solve this problem by introducing lazy initialization in appLogAggregatorImpl. When an app is initialized, we only create the appLogAggregatorImpl object, without calling verifyAndCreateRemoteLogDir(). We call verifyAndCreateRemoteLogDir() when the app starts aggregating logs. Because apps rarely finish or aggregate logs at the same moment, the verifyAndCreateRemoteLogDir() calls are spread out over time, so NMs that restart together no longer access HDFS together.
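
The lazy initialization described above can be sketched as follows. This is a minimal, simplified model and not the code in the attached patch: `LazyAppLogAggregator`, `RemoteLogDirVerifier`, and `aggregateLogs()` are hypothetical names, and the HDFS check is replaced by a counter so the example runs on its own.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for the HDFS-backed remote log directory check. */
interface RemoteLogDirVerifier {
    void verifyAndCreateRemoteLogDir();
}

/** Simplified, hypothetical model of an app log aggregator with lazy init. */
class LazyAppLogAggregator {
    private final RemoteLogDirVerifier verifier;
    private boolean remoteDirVerified = false;

    // App init (for example during NM restart) only builds the object;
    // no HDFS access happens here.
    LazyAppLogAggregator(RemoteLogDirVerifier verifier) {
        this.verifier = verifier;
    }

    // Called when the app starts aggregating logs. The remote directory
    // check runs on the first call only.
    synchronized void aggregateLogs() {
        if (!remoteDirVerified) {
            verifier.verifyAndCreateRemoteLogDir();
            remoteDirVerified = true;
        }
        // Uploading the container logs would happen here (omitted).
    }
}

class LazyInitDemo {
    public static void main(String[] args) {
        AtomicInteger hdfsCalls = new AtomicInteger();
        // The method reference's int result is discarded; each call counts
        // one simulated HDFS access.
        LazyAppLogAggregator aggregator =
            new LazyAppLogAggregator(hdfsCalls::incrementAndGet);

        System.out.println("after app init: " + hdfsCalls.get());        // 0
        aggregator.aggregateLogs();
        System.out.println("after first aggregation: " + hdfsCalls.get());  // 1
        aggregator.aggregateLogs();
        System.out.println("after second aggregation: " + hdfsCalls.get()); // 1
    }
}
```

In this sketch, restarting an NM with many apps would create many aggregator objects without a single remote call; the calls happen later, one app at a time, as each app begins aggregation.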

       

      YARN-8418 solved the leak of container log directories by adding a way to update the NM's credentials. With lazy initialization of appLogAggregatorImpl, YARN-8418's logic is no longer needed, because lazy initialization happens after the addCredentials logic, so the credentials are always refreshed before they are used.
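
The ordering argument can be illustrated with a toy model. Everything here is an assumption made for illustration: `TokenStore` is a hypothetical stand-in (not Hadoop's credentials class), the token name is invented, and the static `verifyAndCreateRemoteLogDir` method only simulates a remote check that succeeds when a refreshed token is present.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for a credential store; not a Hadoop class. */
class TokenStore {
    private final Set<String> tokens = new HashSet<>();

    void addCredentials(String token) {
        tokens.add(token);
    }

    boolean has(String token) {
        return tokens.contains(token);
    }
}

class CredentialOrderingDemo {
    /** Simulated remote check: succeeds only if the refreshed token is present. */
    static boolean verifyAndCreateRemoteLogDir(TokenStore store) {
        return store.has("fresh-token");
    }

    public static void main(String[] args) {
        // Eager init: the check runs at app init, before credentials are refreshed.
        TokenStore eager = new TokenStore();
        boolean eagerOk = verifyAndCreateRemoteLogDir(eager);
        eager.addCredentials("fresh-token");

        // Lazy init: credentials are refreshed first, the check runs later.
        TokenStore lazy = new TokenStore();
        lazy.addCredentials("fresh-token");
        boolean lazyOk = verifyAndCreateRemoteLogDir(lazy);

        System.out.println("eager check succeeded: " + eagerOk); // false
        System.out.println("lazy check succeeded: " + lazyOk);   // true
    }
}
```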

       

      In summary, this issue does two things:

      1. Introduces lazy initialization in appLogAggregatorImpl to avoid concentrated access to HDFS when all NMs in a cluster are restarted.
      2. Reverts YARN-8418, because the lazy initialization guarantees that the credentials are refreshed before they are used.

      Attachments

        1. YARN-9947.001.patch
          40 kB
          Hu Ziqian

          People

            Assignee: Hu Ziqian
            Reporter: Hu Ziqian