Chukwa / CHUKWA-533

Improve fault-tolerance of collectors.

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: Data Collection
    • Labels:
      None
    • Release Note:
      Chukwa collector is more fault-tolerant of partial HDFS outages.

      Description

      There are currently a number of ways that a collector can die, typically due to errors on a DN or a NN that's being restarted. A collector should have some combination of retry logic followed by failing back to the agent, but the collector process should not die.
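
The "retry, then fail back to the agent" behavior described above could look roughly like the sketch below. It is a minimal illustration only: `BoundedRetry`, `IoAction`, and the attempt and backoff parameters are hypothetical names that do not come from the Chukwa code base. The point is that an exhausted retry budget is reported to the caller as a return value, and the process never exits.

```java
import java.io.IOException;

/** Hypothetical helper (not a Chukwa class): bounded retry that reports failure instead of exiting. */
final class BoundedRetry {
    private BoundedRetry() {}

    /** Hypothetical callback type for an operation that may hit an HDFS error. */
    interface IoAction {
        void run() throws IOException;
    }

    /**
     * Runs op up to maxAttempts times, sleeping backoffMillis between attempts.
     * Returns true on the first success and false once every attempt has failed,
     * so the caller can signal the failure back to the sender.
     */
    static boolean tryWrite(IoAction op, int maxAttempts, long backoffMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                op.run();
                return true;
            } catch (IOException e) {
                if (attempt == maxAttempts) {
                    break; // retry budget exhausted; fall through and report failure
                }
                try {
                    Thread.sleep(backoffMillis);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt(); // preserve the interrupt and give up
                    return false;
                }
            }
        }
        return false;
    }
}
```

A caller would treat a `false` result as "tell the agent to resend later" rather than as a reason to shut down.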

      1. CHUKWA-533-2.patch
        8 kB
        Bill Graham
      2. CHUKWA-533-1.patch
        5 kB
        Bill Graham

        Activity

        Bill Graham created issue -
        Bill Graham added a comment -

        Examples from the logs when a NN gets unexpectedly rebooted:

        • From an active collector taking traffic:
          2010-10-12 04:05:13,721 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:2,numberchunks:105
          2010-10-12 04:05:15,508 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=24724 dataRate=823
          2010-10-12 04:05:45,515 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
          2010-10-12 04:05:46,894 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 0 time(s).
          2010-10-12 04:05:59,899 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 0 time(s).
          2010-10-12 04:06:03,903 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 1 time(s).
          2010-10-12 04:06:07,502 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 2 time(s).
          2010-10-12 04:06:11,506 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 3 time(s).
          2010-10-12 04:06:13,733 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
          2010-10-12 04:06:15,509 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 4 time(s).
          2010-10-12 04:06:15,521 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
          2010-10-12 04:06:19,512 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 5 time(s).
          2010-10-12 04:06:23,517 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 6 time(s).
          2010-10-12 04:06:27,521 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 7 time(s).
          2010-10-12 04:06:31,525 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 8 time(s).
          2010-10-12 04:06:35,529 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 9 time(s).
          2010-10-12 04:06:38,534 WARN LeaseChecker DFSClient - Problem renewing lease for DFSClient_-1129462781
          2010-10-12 04:06:43,545 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 0 time(s).
          2010-10-12 04:06:45,527 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
          2010-10-12 04:06:47,550 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 1 time(s).
          2010-10-12 04:06:51,553 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 2 time(s).
          2010-10-12 04:06:55,556 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 3 time(s).
          2010-10-12 04:06:59,215 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 4 time(s).
          2010-10-12 04:07:03,219 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 5 time(s).
          2010-10-12 04:07:07,222 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 6 time(s).
          2010-10-12 04:07:11,225 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 7 time(s).
          2010-10-12 04:07:13,746 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
          2010-10-12 04:07:15,230 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 8 time(s).
          2010-10-12 04:07:15,534 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
          2010-10-12 04:07:19,235 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 9 time(s).
          2010-10-12 04:07:22,237 WARN LeaseChecker DFSClient - Problem renewing lease for DFSClient_-1129462781
          2010-10-12 04:07:27,242 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 0 time(s).
          2010-10-12 04:07:31,246 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 1 time(s).
          2010-10-12 04:07:35,251 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 2 time(s).
          2010-10-12 04:07:39,254 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 3 time(s).
          2010-10-12 04:07:43,258 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 4 time(s).
          2010-10-12 04:07:45,541 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
          2010-10-12 04:07:47,261 INFO LeaseChecker Client - Retrying connect to server: hadoop-nn.site.com/10.10.10.111:9000. Already tried 5 time(s).
          
        • From an idle collector that got traffic as soon as the active collector died:
          2010-10-12 04:10:33,690 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
          2010-10-12 04:11:02,165 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
          2010-10-12 04:11:03,688 WARN Timer-196 SeqFileWriter - Got an exception in rotate
          2010-10-12 04:11:03,688 WARN LeaseChecker DFSClient - Problem renewing lease for DFSClient_23442132
          2010-10-12 04:11:03,693 FATAL Timer-196 SeqFileWriter - IO Exception in rotate. Exiting!
          2010-10-12 04:11:03,696 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
          2010-10-12 04:11:03,697 WARN Shutdown SeqFileWriter - cannot rename dataSink file:/chukwa/logs/201012035922632_c18rbhadoopwkrr10n1cnetcom_4435f4d212b9ca438d77e7e.chukwa
          
        Bill Graham added a comment -

        Here's a first pass at a patch for review. I've changed the rotate and add methods to be more fault-tolerant (i.e. to be able to survive a temporary HDFS outage). The init method still requires HDFS, so HDFS must be running for the collector to start. We can revisit this decision if people see the need.

        I changed add to return COMMIT_FAIL if the chunks couldn't be added to the sequence file and I don't update the dataSize and bytesThisRotate unless the sequence file append succeeds. The ServletCollector returns a 503 if this method returns COMMIT_FAIL.
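
As a rough sketch of the add behavior described above (not the patch itself): the writer reports COMMIT_FAIL when the append throws, and it only bumps its size counter after the append has succeeded. `CommitResult`, `ChunkSink`, `SketchWriter`, and `httpStatusFor` are hypothetical stand-ins written for this example; the real Chukwa writer and servlet classes differ.

```java
import java.io.IOException;
import java.util.List;

/** Hypothetical stand-in for the writer's commit status. */
enum CommitResult { COMMIT_OK, COMMIT_FAIL }

/** Hypothetical stand-in for the sequence-file append call. */
interface ChunkSink {
    void append(byte[] chunk) throws IOException;
}

/** Hypothetical writer sketch: counters move only when the append succeeds. */
final class SketchWriter {
    private final ChunkSink sink;
    private long dataSize; // bytes successfully appended so far

    SketchWriter(ChunkSink sink) {
        this.sink = sink;
    }

    CommitResult add(List<byte[]> chunks) {
        long written = 0;
        try {
            for (byte[] chunk : chunks) {
                sink.append(chunk);
                written += chunk.length;
            }
        } catch (IOException e) {
            // The append failed (e.g. HDFS is unavailable): leave the counters alone.
            return CommitResult.COMMIT_FAIL;
        }
        dataSize += written; // update stats only after every append succeeded
        return CommitResult.COMMIT_OK;
    }

    long dataSize() {
        return dataSize;
    }

    /** Maps a commit result to the HTTP status a servlet front end would send. */
    static int httpStatusFor(CommitResult result) {
        return result == CommitResult.COMMIT_OK ? 200 : 503;
    }
}
```

With this shape, a sender that receives a 503 knows the chunks were not accepted and can resend them, while the collector keeps running.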

        I changed rotate to basically log and swallow the error.
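
A "log and swallow" rotate might be sketched as follows. This is an assumption-laden illustration: `SketchRotator` and `RotateStep` are hypothetical names, and `java.util.logging` is used here only to keep the example self-contained. What matters is that an IOException during rotation is logged and counted, and the method returns normally so the timer thread and the process both survive.

```java
import java.io.IOException;
import java.util.logging.Level;
import java.util.logging.Logger;

/** Hypothetical sketch of a rotate step that survives I/O errors. */
final class SketchRotator {
    private static final Logger LOG = Logger.getLogger(SketchRotator.class.getName());

    /** Hypothetical callback wrapping the close/rename/reopen work of a rotation. */
    interface RotateStep {
        void run() throws IOException;
    }

    private final RotateStep step;
    private int failedRotations;
    private int successfulRotations;

    SketchRotator(RotateStep step) {
        this.step = step;
    }

    /** Attempts one rotation; on failure, logs and returns instead of exiting. */
    void rotate() {
        try {
            step.run();
            successfulRotations++;
        } catch (IOException e) {
            failedRotations++;
            // Swallow the error: the next scheduled rotation gets another chance.
            LOG.log(Level.WARNING, "Got an exception in rotate; continuing", e);
        }
    }

    int failedRotations() {
        return failedRotations;
    }

    int successfulRotations() {
        return successfulRotations;
    }
}
```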

        I changed ServletCollector to not update stats if it gets a COMMIT_FAIL response.

        The only issue that I see with this approach is that if the agent sends chunks and gets back commit pending acks for those chunks, HDFS can still go down and the file will not be rotated. This is the same as the current behavior, though, except that now the collector won't die. If guaranteed writes are desired, then the AsyncAckSender should be used.

        Bill Graham made changes -
        Field Original Value New Value
        Attachment CHUKWA-533-1.patch [ 12460093 ]
        Eric Yang added a comment -

        +1 looks good.

        Bill Graham made changes -
        Assignee Bill Graham [ billgraham ]
        Bill Graham added a comment -

        Thanks Eric.

        Here's patch #2. It contains additional logic to handle when the previous output stream can't be closed before the move during rotate. This is for the case where HDFS went down and back up, so the file handle might not always be able to be closed per se, but the file could still be moved. This patch is deployed on our system and seems to be working well.
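
The close-then-move handling described above can be sketched like this, under the assumption that the move is still worth attempting after a failed close. `SketchCloseThenMove` and `Mover` are hypothetical names for illustration and are not taken from the patch.

```java
import java.io.Closeable;
import java.io.IOException;

/** Hypothetical sketch: a failed close does not prevent the rename from being tried. */
final class SketchCloseThenMove {
    private SketchCloseThenMove() {}

    /** Hypothetical callback wrapping the file move; returns true if the move happened. */
    interface Mover {
        boolean move() throws IOException;
    }

    /**
     * Tries to close the previous output stream, then tries the move regardless
     * of whether the close succeeded. Returns true only if the move succeeded.
     */
    static boolean closeThenMove(Closeable previousStream, Mover mover) {
        try {
            previousStream.close();
        } catch (IOException e) {
            // The handle may be stale after HDFS went down and came back; keep going.
        }
        try {
            return mover.move();
        } catch (IOException e) {
            return false; // leave the file in place for a later attempt
        }
    }
}
```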

        Bill Graham made changes -
        Attachment CHUKWA-533-2.patch [ 12460211 ]
        Bill Graham made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Release Note Chukwa collector is more fault-tolerant of partial HDFS outages.
        Ari Rabkin added a comment -

        I'm told all the core datacollection unit tests pass, so I am +1 to commit this.

        Bill Graham added a comment -

        All tests pass except these, which I suspect are failing for unrelated reasons:

        build/test/TEST-org.apache.hadoop.chukwa.datacollection.TestOffsetStatsManager.txt:Tests run: 3, Failures: 1, Errors: 0, Time elapsed: 13.1 sec
        build/test/TEST-org.apache.hadoop.chukwa.rest.resource.TestClientTrace.txt:Tests run: 1, Failures: 1, Errors: 0, Time elapsed: 4.201 sec
        build/test/TEST-org.apache.hadoop.chukwa.rest.resource.TestUserResource.txt:Tests run: 2, Failures: 1, Errors: 1, Time elapsed: 2.881 sec

        TestHBaseWriter also fails, but that's because all tests are commented out.

        I'll commit this patch shortly.

        Bill Graham added a comment -

        This is committed, thanks for the reviews.

        Bill Graham made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Fix Version/s 0.5.0 [ 12315030 ]
        Resolution Fixed [ 1 ]

          People

          • Assignee:
            Bill Graham
            Reporter:
            Bill Graham
          • Votes:
            0
            Watchers:
            0
