CHUKWA-593

Archive daemon: infinite loop at midnight

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: 0.5.0
    • Component/s: MR Data Processors
    • Labels:
      None
    • Environment:

      Debian 5.0, Hadoop 0.20

    • Release Note:
      Fixed infinite loop archiving at midnight. (Sourygna Luangsay via Eric Yang)

      Description

      The Chukwa archive manager daemon enters an infinite loop between midnight and 1 AM. This increases the namenode load and causes a huge growth in both the Chukwa and namenode logs.

      The problem seems to come from the start function of ChukwaArchiveManager.java (in package org/apache/hadoop/chukwa/extraction/archive). At midnight, there are two directories in /chukwa/dataSinkArchives/ (one for the previous day and one for the new day), so we enter neither the "daysInRawArchiveDir.length == 0" branch nor the "daysInRawArchiveDir.length == 1" one. The processDay function is then called, but little is done because of the "modificationDate < oneHourAgo" condition.
      As a result, we loop without having slept or deleted the previous day's directory, and this repeats for about an hour.
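
      For illustration, the control flow described above looks roughly like this (a simplified sketch based on this description, not the verbatim 0.4.0 source; field and method names are abbreviated):

      // Simplified sketch only; not the actual ChukwaArchiveManager code.
      while (running) {
        long now = System.currentTimeMillis();
        long oneHourAgo = now - ONE_HOUR;
        FileStatus[] daysInRawArchiveDir = fs.listStatus(rawArchiveDir);

        if (daysInRawArchiveDir.length == 0) {
          Thread.sleep(30 * 60 * 1000);   // nothing to archive: sleep
          continue;
        }

        if (daysInRawArchiveDir.length == 1) {
          long nextRun = lastRun + (2 * ONE_HOUR) - (1 * 60 * 1000);
          if (now < nextRun) {
            Thread.sleep(30 * 60 * 1000); // ran recently: sleep
            continue;
          }
        }

        // Just after midnight there are two day directories, so neither branch
        // above sleeps. processDay() then skips every hour directory modified
        // within the last hour (including the old day's final hours), so nothing
        // is archived or deleted and the loop starts over immediately.
        for (FileStatus day : daysInRawArchiveDir) {
          processDay(day);
        }
      }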

      Here is how I propose to change the "daysInRawArchiveDir.length == 1" condition block in the start function:
      if (daysInRawArchiveDir.length >= 1) {
        long nextRun = lastRun + (2 * ONE_HOUR) - (1 * 60 * 1000); // 2h - 1min
        if (now < nextRun) {
          log.info("lastRun < 2 hours so skip archive for now, going to sleep for 30 minutes, currentDate is:" + new java.util.Date());
          Thread.sleep(30 * 60 * 1000);
          continue;
        }
      }

      In my case, this removed the infinite loop problem. But maybe there is a reason to keep the "1 directory" case separate from the "many directories" case; I've been reading the documentation and the Subversion history but could not find one.
      If there is a reason, could someone explain it to me?

      Regards.

        Activity

        eyang Eric Yang added a comment -

        Thanks Sourygna. I just committed this.

        eyang Eric Yang added a comment -

        if (modificationDate < oneHourAgo || workingDay < currentDay)

        +1, looks like the right fix.

        sourygna Sourygna Luangsay added a comment -

        My collectors, archive manager and every Hadoop component of my cluster are NTP-synced and share the same timezone.

        I understand the reason for daysInRawArchiveDir.length == 1 instead of >= 1, so maybe the fix should go in the processDay function instead. For instance, in its for loop, we could change the condition "if (modificationDate < oneHourAgo)" to something like "if (modificationDate < oneHourAgo || workingDay < currentDay)". I haven't tried this solution, so I'm not sure it is OK; I think I will keep the one from my first post, since it works and removes the infinite loop. Such a condition would avoid the latency, no?

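        For illustration, the suggested condition inside processDay's loop might look roughly like this (the loop shape and the workingDay/currentDay variables are assumed from the comment above, not copied from the actual source):

        // Hedged sketch only: the surrounding loop and variable names are assumptions.
        for (FileStatus hourDir : fs.listStatus(dayDir)) {
          long modificationDate = hourDir.getModificationTime();
          // The original test archives only hour directories untouched for at
          // least one hour; the extra clause also picks up anything belonging to
          // an earlier day, so the day that just ended is rolled up right after
          // midnight.
          if (modificationDate < oneHourAgo || workingDay < currentDay) {
            // ... archive this hour directory ...
          }
        }
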
        eyang Eric Yang added a comment -

        The processDay function should delete the previous day's directory if it is empty. During the hour between two days, the system is designed to archive the previous day as soon as possible.

        Do you have collectors running in multiple timezones, or a server clock that is out of sync by one hour?

        The busy loop should not happen unless something continues to write to the previous day's directory. daysInRawArchiveDir.length == 1 is there to ensure the roll-up for the previous day happens as soon as possible.

        If we change it to >= 1, then the roll-up for the previous day will not occur until 1:59 AM of the current day. We should avoid this latency, if possible.
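
        As a rough worked example of that timing (assuming lastRun is set around midnight; midnightMillis below is only a placeholder):

        // Illustrative arithmetic only; midnightMillis stands for ~00:00.
        long ONE_HOUR = 60L * 60 * 1000;
        long nextRun = midnightMillis + (2 * ONE_HOUR) - (60 * 1000); // ~01:59
        // With ">= 1" the loop keeps sleeping in 30-minute steps until
        // now >= nextRun, so the previous day's roll-up waits until about 1:59 AM.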

          People

          • Assignee:
            eyang Eric Yang
          • Reporter:
            sourygna Sourygna Luangsay
          • Votes: 0
          • Watchers: 0

              Time Tracking

              Estimated: 10m
              Remaining: 10m
              Logged: Not Specified
