FLUME-1175: RollingFileSink complains of Bad File Descriptor upon a reconfig event

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: v1.2.0
    • Fix Version/s: v1.4.0
    • Component/s: Sinks+Sources
    • Labels: None
    • Environment: CentOS 6.2 64-bit

      Description

      Steps:
      1) Create a config file that looks something like this:
      agent.channels = c1
      agent.sources = r1
      agent.sinks = k1
      #
      agent.channels.c1.type = MEMORY
      #
      agent.sources.r1.channels = c1
      agent.sources.r1.type = SEQ
      #
      agent.sinks.k1.channel = c1
      agent.sinks.k1.type = FILE_ROLL
      agent.sinks.k1.sink.directory = /var/log/flume-ng
      agent.sinks.k1.sink.rollInterval = 0

      2) Start the Flume NG agent

      3) touch the config file so that a reconfig event is triggered within 30 secs

      4) tail the output file to observe the sequence generator events:
      tail -f /var/log/flume-ng/XXXXXXXXXXXX

      5) Notice that the flow suddenly stops at the reconfig event (within 30 secs of touching the config file) and does not resume. The Flume log shows a Bad File Descriptor error for the RollingFileSink:

      2012-05-03 01:34:34,806 (SinkRunner-PollingRunner-DefaultSinkProcessor) [ERROR - org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:160)] Unable to deliver event. Exception follows.
      org.apache.flume.EventDeliveryException: Failed to process event: [Event headers = {timestamp=1336034074797, nanos=3762297996593382, pri=INFO, host=<mysupersecrethost>, FlumeOG=yes, execcmd=java.nio.HeapByteBuffer[pos=0 lim=24 cap=24], procsource=java.nio.HeapByteBuffer[pos=0 lim=6 cap=6], service=java.nio.HeapByteBuffer[pos=0 lim=4 cap=4]}, body.length = 26 ]
      at org.apache.flume.sink.RollingFileSink.process(RollingFileSink.java:201)
      at org.apache.flume.sink.DefaultSinkProcessor.process(DefaultSinkProcessor.java:68)
      at org.apache.flume.SinkRunner$PollingRunner.run(SinkRunner.java:147)
      at java.lang.Thread.run(Thread.java:662)
      Caused by: java.io.IOException: Bad file descriptor
      at java.io.FileOutputStream.writeBytes(Native Method)
      at java.io.FileOutputStream.write(FileOutputStream.java:282)
      at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
      at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
      at org.apache.flume.sink.RollingFileSink.process(RollingFileSink.java:193)
      ... 3 more

          Activity

          Will McQueen created issue -
          Will McQueen added a comment - edited

          I'm not sure, but what I think may be happening is:
          1) A reconfig event occurs.
          2) RollingFileSink finishes executing its process() method, or the sink runner thread calling RollingFileSink.process() gets interrupted somewhere in the middle of the call. Either way, I believe process() can return while the outputStream field still holds a non-null reference to some BufferedOutputStream object.
          3) As part of the reconfig steps, the sink's configure() is called next. configure() sets the 'directory' field, but the outputStream field still holds the old value.
          4) Next, stop() is called (again as part of the reconfig steps). The outputStream is flushed and closed, but the field is not nulled out afterwards.
          5) Next, start() is called. The reconfig steps are done.
          6) Next, the sink runner calls process(), which checks whether outputStream != null; that check passes, since outputStream is still pointing at the old, closed BufferedOutputStream, so the stale stream gets reused (see the sketch below).
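
          For illustration, the stale-stream reuse might look something like this (a sketch of the pattern just described, not the actual RollingFileSink source; the write/flush placement is assumed):

          // Sketch only: field names follow RollingFileSink, but this is not the real code.
          if (outputStream == null) {
            // A fresh stream is only opened when the field is null, so a closed
            // stream left over from before the reconfig gets silently reused.
            File currentFile = pathController.getCurrentFile();
            outputStream = new BufferedOutputStream(new FileOutputStream(currentFile));
          }
          // ... events are taken from the channel and written via the serializer ...
          outputStream.flush(); // after a reconfig, this is where the
                                // java.io.IOException: Bad file descriptor surfaces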

          So maybe one possible fix could be to insert a line in stop() to null-out the outputStream field (and also null-out the serializer field while we're at it)... something like this:

          if (serializer != null) {
            try {
              serializer.flush();
              serializer.beforeClose();
            } catch (IOException e) {
              logger.error("Unable to cleanup serializer. Exception follows.", e);
            } finally {
              serializer = null;
            }
          }
          
          if (outputStream != null) {
            logger.debug("Closing file {}", pathController.getCurrentFile());
            try {
              outputStream.flush();
              outputStream.close();
            } catch (IOException e) {
              logger.error("Unable to close outputStream. Exception follows.", e);
            } finally {
              outputStream = null;
            }
          }
          
          Will McQueen made changes -
          Field Original Value New Value
          Priority Major [ 3 ] Critical [ 2 ]
          Mike Percy made changes -
          Fix Version/s v1.3.0 [ 12322140 ]
          Fix Version/s v1.2.0 [ 12320243 ]
          Brock Noland made changes -
          Fix Version/s v1.4.0 [ 12323372 ]
          Fix Version/s v1.3.0 [ 12322140 ]
          Roshan Naik added a comment -

          Will McQueen, are you working on this?

          Will McQueen added a comment -

          Hi Roshan,

          Not at the moment, but if you have some time to take it up then that's fine with me.

          Cheers,
          Will

          Roshan Naik made changes -
          Assignee Roshan Naik [ roshan_naik ]
          Roshan Naik added a comment -

          OK, I have verified that nulling it out is the right solution.

          One other change I wanted to get opinions on before submitting a patch:

          The catch clauses in stop() swallow any exception from flush() or close(). It seems to me that if flush()/close() fails in the reconfig scenario, an exception should be thrown and Flume should be allowed to die. In a normal shutdown this is not necessary, but the stop() signature does not allow checked exceptions to be thrown.
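
          A rough sketch of what that could look like (not the committed change; FlumeException is just one possible unchecked type, chosen because the stop() signature has no throws clause):

          if (outputStream != null) {
            try {
              outputStream.close(); // close() already flushes the BufferedOutputStream
            } catch (IOException e) {
              // Fail loudly instead of only logging, as suggested above.
              throw new FlumeException("Unable to close output stream for "
                  + pathController.getCurrentFile(), e);
            } finally {
              outputStream = null;
            }
          }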

          Roshan Naik added a comment -

          Maybe that issue is a deeper topic than this JIRA?

          Roshan Naik added a comment -
          • nulling out outputStream in RollingFileSink.stop()
          • removing superfluous calls to outputStream.flush()
          Roshan Naik made changes -
          Attachment FLUME-1175.patch [ 12555560 ]
          Roshan Naik made changes -
          Remote Link This issue links to "Review Board Link (Web Link)" [ 11527 ]
          Roshan Naik made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Brock Noland added a comment -

          Roshan,

          Regarding logging the exception: yes, that is a deeper topic. I don't think our semantics around this are clear. Additionally, let's say you have 5 flows in an agent; do you want one of them to kill the whole agent? These things aren't perfectly clear. Without a clear statement of how things should work, I think what we should do is catch and log anything inside of stop()/configure()/start().
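
          A generic sketch of that catch-and-log policy (the helper below is hypothetical, not Flume's actual lifecycle code):

          // Hypothetical helper: run a lifecycle call and log, rather than propagate,
          // anything it throws.
          private void runQuietly(String what, Runnable lifecycleCall) {
            try {
              lifecycleCall.run();
            } catch (Exception e) {
              logger.error("Error during " + what + ". Exception follows.", e);
            }
          }

          // e.g. during a reconfig:
          //   runQuietly("sink stop", new Runnable() {
          //     @Override public void run() { sink.stop(); }
          //   });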

          Brock Noland added a comment -

          Thank you for the patch, Roshan! I have committed this to trunk and 1.4!

          Brock Noland made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          Hudson added a comment -

          Integrated in flume-trunk #337 (See https://builds.apache.org/job/flume-trunk/337/)
          FLUME-1175: RollingFileSink complains of Bad File Descriptor upon a reconfig event (Revision e858adee407bf76cf2a036433ed4173714505924)

          Result = SUCCESS
          brock : http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=e858adee407bf76cf2a036433ed4173714505924
          Files :

          • flume-ng-core/src/main/java/org/apache/flume/sink/RollingFileSink.java

            People

            • Assignee: Roshan Naik
            • Reporter: Will McQueen
            • Votes: 0
            • Watchers: 5
