  Chukwa / CHUKWA-487

Collector left in a bad state after temporary NN outage

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.4.0
    • Fix Version/s: None
    • Component/s: Data Collection
    • Labels:
      None

      Description

      When the name node returns errors to the collector, at some point the collector dies halfway. This behavior should be changed to either resemble the agents and keep retrying, or to shut down completely. Instead, what I'm seeing is that the collector logs that it's shutting down and the var/pidDir/Collector.pid file gets removed, but the collector process continues to run, albeit without handling new data. This log entry is then repeated ad infinitum:

      2010-05-06 17:35:06,375 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
      2010-05-06 17:36:06,379 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
      2010-05-06 17:37:06,384 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0

      Attachments

      1. CHUKWA-487.patch
        2 kB
        Ari Rabkin
      2. CHUKWA-487.threaddump.txt
        257 kB
        Bill Graham

        Activity

        billgraham Bill Graham added a comment -

        Here's what I saw in the logs when I had to restart my NN. It took a little while to exit safe mode. I had to restore from the secondary name node, so there might have been some data loss upon restore.

        2010-05-06 17:32:19,515 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=318716 dataRate=10622
        2010-05-06 17:32:49,518 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=196741 dataRate=6557
        2010-05-06 17:33:06,367 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:129,numberchunks:217
        2010-05-06 17:33:19,521 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
        2010-05-06 17:33:49,523 INFO Timer-3 SeqFileWriter - stat:datacollection.writer.hdfs dataSize=0 dataRate=0
        2010-05-06 17:34:01,142 WARN org.apache.hadoop.dfs.DFSClient$LeaseChecker@36b60b93 DFSClient - Problem renewing lease for DFSClient_-1088933168: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot renew lease for DFSClient_-1088933168.
        Name node is in safe mode.
        The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
        at org.apache.hadoop.dfs.FSNamesystem.renewLease(FSNamesystem.java:1823)
        at org.apache.hadoop.dfs.NameNode.renewLease(NameNode.java:458)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
        at org.apache.hadoop.ipc.Client.call(Client.java:716)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy0.renewLease(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.java:781)
        at java.lang.Thread.run(Thread.java:619)

        2010-05-06 17:34:01,608 WARN Timer-2094 SeqFileWriter - Got an exception in rotate
        org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.dfs.SafeModeException: Cannot complete file /chukwa/logs/201006172737418_xxxxxxxxxcom_71ea99261284ab9f0566faa.chukwa. Name node is in safe mode.
        The ratio of reported blocks 0.0000 has not reached the threshold 0.9990. Safe mode will be turned off automatically.
        at org.apache.hadoop.dfs.FSNamesystem.completeFileInternal(FSNamesystem.java:1209)
        at org.apache.hadoop.dfs.FSNamesystem.completeFile(FSNamesystem.java:1200)
        at org.apache.hadoop.dfs.NameNode.complete(NameNode.java:351)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890)
        at org.apache.hadoop.ipc.Client.call(Client.java:716)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
        at org.apache.hadoop.dfs.$Proxy0.complete(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor5.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy0.complete(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:2736)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.close(DFSClient.java:2657)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:59)
        at org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:79)
        at org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter.rotate(SeqFileWriter.java:194)
        at org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter$1.run(SeqFileWriter.java:235)
        at java.util.TimerThread.mainLoop(Timer.java:512)
        at java.util.TimerThread.run(Timer.java:462)
        2010-05-06 17:34:01,647 FATAL Timer-2094 SeqFileWriter - IO Exception in rotate. Exiting!
        2010-05-06 17:34:01,661 FATAL btpool0-6248 SeqFileWriter - IOException when trying to write a chunk, Collector is going to exit!
        java.io.IOException: Stream closed.
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.isClosed(DFSClient.java:2245)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.writeChunk(DFSClient.java:2481)
        at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:155)
        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:132)
        at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:121)
        at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:112)
        at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:86)
        at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:47)
        at java.io.DataOutputStream.write(DataOutputStream.java:90)
        at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:1016)
        at org.apache.hadoop.chukwa.datacollection.writer.SeqFileWriter.add(SeqFileWriter.java:281)
        at org.apache.hadoop.chukwa.datacollection.collector.servlet.ServletCollector.accept(ServletCollector.java:152)
        at org.apache.hadoop.chukwa.datacollection.collector.servlet.ServletCollector.doPost(ServletCollector.java:190)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:487)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:362)
        at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
        at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:729)
        at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
        at org.mortbay.jetty.Server.handle(Server.java:324)
        at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:505)
        at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:843)
        at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:647)
        at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:211)
        at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:380)
        at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:395)
        at org.mortbay.thread.BoundedThreadPool$PoolThread.run(BoundedThreadPool.java:450)
        2010-05-06 17:34:06,370 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:28,numberchunks:0
        2010-05-06 17:35:06,375 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
        2010-05-06 17:36:06,379 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
        2010-05-06 17:37:06,384 INFO Timer-1 root - stats:ServletCollector,numberHTTPConnection:0,numberchunks:0
        ...

        asrabkin Ari Rabkin added a comment -

        So there are basically two possible fixes. We can System.exit() on that error and hope the daemon respawns, or try to handle it in Chukwa. I prefer the former approach. Comments or objections?

        asrabkin Ari Rabkin added a comment -

        Ugh. We actually DO have code to do exactly that. It's even in 0.4. It calls System.exit() and all.

        Is some shutdown hook gumming up the works?

        asrabkin Ari Rabkin added a comment -

        Also, this is a big enough reliability problem that it should block any 0.5 release.

        billgraham Bill Graham added a comment -

        I think the latter approach (trying to handle it in Chukwa) would be preferable. The question then is what to do with incoming data when HDFS is down. Could we go into a mode where we stop receiving new data from the agents while we retry HDFS every N seconds, for up to M seconds, before exiting? We'd also need to think through what to do if the current file being written to disappears due to an SNN restore with data loss.
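
        As a rough illustration of the retry-before-exit idea above, a minimal sketch might look like the following. The class and method names are hypothetical (nothing here is existing Chukwa code), and retryIntervalMs/maxWaitMs stand in for the N/M placeholders:

        import java.util.concurrent.TimeUnit;

        /** Hypothetical sketch only; not part of the Chukwa codebase. */
        public class HdfsRetryGate {
          private volatile boolean acceptingData = true;   // would gate the collector servlet's accept path

          /** Pause intake and probe HDFS every retryIntervalMs, for at most maxWaitMs, before giving up. */
          public boolean waitForHdfs(long retryIntervalMs, long maxWaitMs) throws InterruptedException {
            acceptingData = false;                         // stop taking new chunks from agents
            long deadline = System.currentTimeMillis() + maxWaitMs;
            while (System.currentTimeMillis() < deadline) {
              if (probeHdfs()) {                           // a real check might call fs.exists() on the sink dir
                acceptingData = true;                      // HDFS is back; resume normal operation
                return true;
              }
              TimeUnit.MILLISECONDS.sleep(retryIntervalMs);
            }
            return false;                                  // caller would then shut the collector down cleanly
          }

          private boolean probeHdfs() {
            return false;                                  // placeholder; the actual HDFS health check goes here
          }
        }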

        asrabkin Ari Rabkin added a comment -

        Agents will retry if a collector goes down, without losing data, so it should be totally safe to let it crash and reboot. And I think restarting from an entirely-clean state is preferable to fixing it. This also sidesteps questions of file handles breaking.

        How common are SNN restores with data loss? I had assumed that after a file was closed, its data was as durable as NN metadata on disk.

        billgraham Bill Graham added a comment -

        Actually, looking closer, I can't say for sure that I had data loss. It could just be that the bounce of the NN made the file unavailable. It appears that in my case the file couldn't be closed or rotated because the NN had gone down.

        The only way you could have data loss, AFAIK, would be if the current data dir, including the edit log, got corrupted since the last SNN checkpoint. I think that's rare enough not to worry about. My concern was more about how to make sure the collector isn't left in a bad state if part of an un-closed file was lost.

        The crash-and-reboot scenario is better than what we have now, but a self-recovering solution would be ideal. That way, if the NN crashed unexpectedly (perhaps during off-business hours), the collectors wouldn't all need to be restarted. Again though, this is probably a rare occurrence.

        billgraham Bill Graham added a comment -

        Just happened again, here's a thread dump. Looks like there's some lock contention on the sequence file while trying to shut down.

        jboulon Jerome Boulon added a comment -

        Hi,
        I looked quickly at the code, and this is something I changed in Honu because of a possible "virtual dead-lock".
        The main thread (writer) tries to acquire the lock to write to the sequence file while the shutdownHook also tries to get it.
        The issue is that the writer holds the lock for about 2 minutes (NN retries), and the chances are that, statistically, the next one to get the lock will also be a queue add instead of the close.
        One quick workaround for now would be to wait no longer than xxx sec on the close, and also to configure your NN client to fail fast instead of retrying again and again in the specific case where the NN is not available.

        A longer-term fix could be to back off in the SeqFileWriter.add method.
        In Honu, I've put timeouts around all locks and addQueue to make sure that nobody gets blocked on a lock, plus I have a TRY_LATER error that is returned if the add takes more than 2 seconds, and a READY/!READY state to accept/reject incoming requests.
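
        To make the pattern concrete, here is a minimal, self-contained sketch of the bounded-wait idea described above (a timeout around the lock, a TRY_LATER result, and a READY/!READY gate). It is an illustration only, not Honu or Chukwa code, and all names are assumed:

        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.locks.ReentrantLock;

        /** Illustration of bounded lock waits on the add path; not actual Honu/Chukwa code. */
        public class BoundedWriter {
          enum Result { OK, TRY_LATER }

          private final ReentrantLock lock = new ReentrantLock();
          private volatile boolean ready = true;           // READY / !READY gate for incoming requests

          /** Flipped by whatever monitors writer health (e.g. after an HDFS failure or recovery). */
          public void setReady(boolean isReady) {
            ready = isReady;
          }

          /** Try to append a chunk, but never block the caller for more than ~2 seconds. */
          public Result add(byte[] chunk) throws InterruptedException {
            if (!ready) {
              return Result.TRY_LATER;                     // reject up front while the writer is unhealthy
            }
            if (!lock.tryLock(2, TimeUnit.SECONDS)) {
              return Result.TRY_LATER;                     // lock is held, e.g. by a writer stuck in NN retries
            }
            try {
              // the actual SequenceFile append would go here
              return Result.OK;
            } finally {
              lock.unlock();
            }
          }
        }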

        asrabkin Ari Rabkin added a comment -

        Jerome – thanks for the diagnosis!

        I should be able to knock out a patch tomorrow (Monday).

        asrabkin Ari Rabkin added a comment -

        How about this? In close(), only try to acquire the lock for 500 ms, else abort.
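
        Roughly, that idea could look like the sketch below. This is an illustration with assumed names, not the attached CHUKWA-487.patch itself, and the real SeqFileWriter may use a different locking primitive than ReentrantLock:

        import java.util.concurrent.TimeUnit;
        import java.util.concurrent.locks.ReentrantLock;

        /** Sketch of a close() that gives up on the writer lock after 500 ms; not the actual patch. */
        public class LockBoundedClose {
          private final ReentrantLock lock = new ReentrantLock();

          public void close() {
            boolean locked = false;
            try {
              // Wait at most 500 ms; if a writer is stuck in NN retries, abort instead of deadlocking.
              locked = lock.tryLock(500, TimeUnit.MILLISECONDS);
              if (locked) {
                // closing the underlying SequenceFile would go here
              }
            } catch (InterruptedException e) {
              Thread.currentThread().interrupt();          // preserve the interrupt and fall through to abort
            } finally {
              if (locked) {
                lock.unlock();
              }
            }
          }
        }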

        billgraham Bill Graham added a comment -

        This approach makes sense to me w.r.t. the deadlock. I've deployed it and will see how it works.

        Ari, regarding your comment above about 'hoping the daemon respawns', how do we expect the daemon to do that? Is there logic in the chukwa start script to check for that case?

        asrabkin Ari Rabkin added a comment -

        The respawn logic is no longer in Chukwa. However, my impression was that most sites have their own daemon-watching approach in the background to respawn crashed daemons. For instance, Berkeley uses runit.

        billgraham Bill Graham added a comment -

        How difficult would it be to make this process smarter about recovering on its own? Ideally it would recover from a temporary node outage more gracefully than just shutting down.

        asrabkin Ari Rabkin added a comment -

        My sense of this is that failure-recovery logic is hard to write and hard to test. The bug that prompted this JIRA was a failure-handler bug. It's also not clear to me what the benefit is. Chukwa has mechanisms for agent-side retry, via CHUKWA-369.

        You can turn agent-side recovery on using the httpConnector.asyncAcks option.

        This should all be documented at some point.
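
        For reference, a minimal way to flip that option programmatically might look like the following. It assumes the property is boolean-valued and would normally live in the agent's configuration XML rather than in code, so treat it strictly as a sketch:

        import org.apache.hadoop.conf.Configuration;

        /** Illustration only; the property name comes from the comment above, the boolean type is an assumption. */
        public class AsyncAcksExample {
          public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setBoolean("httpConnector.asyncAcks", true);
            System.out.println("httpConnector.asyncAcks = " + conf.get("httpConnector.asyncAcks"));
          }
        }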

        asrabkin Ari Rabkin added a comment -

        Bill, did this patch work? If there were no problems with it in production, I'll commit it.

        billgraham Bill Graham added a comment -

        I deployed this patch into production not long after you submitted it, but we haven't had any NN failures since then so I can't tell if it's effective yet. I'm on vacation this week, but can try to force a failure next week to see how the collector handles it.

        asrabkin Ari Rabkin added a comment -

        That might be excessive. I mostly wanted to be reassured that the patch didn't obstruct normal operation.

        asrabkin Ari Rabkin added a comment -

        I just committed this.


          People

          • Assignee:
            Unassigned
            Reporter:
            billgraham Bill Graham
          • Votes:
            0
            Watchers:
            0
