Uploaded image for project: 'Flume'
  1. Flume
  2. FLUME-3083

Taildir source can miss events if file updated in same second as file close

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.7.0
    • Fix Version/s: 1.8.0
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      I suspect there is a scenario where the taildir source can miss reading events from a log file due to how the source determines whether a file has been updated. In ReliableTaildirEventReader:

      boolean updated = tf.getLastUpdated() < f.lastModified()
      ...
      tf.setNeedTail(updated);
      

      Consider this sequence of events from TaildirSource.process(). Assume they all happen within the same second and there is just a single log file.

      1. Call ReliableTaildirEventReader.updateTailFiles()
        • This call will set ReliableTaildirEventReader.updateTime to current time in milliseconds
        • Assume the underlying file has not been updated within the last idleTimeout milliseconds
      2. Due to idleness, the tail file's inode is added to TaildirSource.idleInodes in idleFileCheckerRunnable
      3. tf.needTail is false. Skip reading file.
      4. Underlying file is updated with events E1
      5. TaildirSource.closeTailFiles()
        • Call TaildirSource.tailFileProcess() before close to read any pending events
        • Events E1 are read and processed
        • Since events were read, call ReliableTaildirEventReader.commit() which updates the tail file's position and sets its last updated time to ReliableTaildirEventReader.updateTime from 1.a
        • Close file
      6. Events E2 are written to underlying file. File's modification time is in the same second as the tail file's last updated time.
      7. Since the time returned by File.lastModified() is the mtime in seconds converted to milliseconds the file's last modified time is less than the tail file's last updated time and taildir won't reopen the file to read E2.
        • This behaviour of File.lastModified() may be platform/jvm specific. I confirmed the behavior using OpenJDK 8 on Ubuntu precise.

      Can someone confirm this?

      1. FLUME-3083-0.patch
        2 kB
        eskrm
      2. FLUME-3083-1.patch
        2 kB
        eskrm

        Issue Links

          Activity

          Hide
          hudson Hudson added a comment -

          FAILURE: Integrated in Jenkins build Flume-trunk-hbase-1 #246 (See https://builds.apache.org/job/Flume-trunk-hbase-1/246/)
          FLUME-3083. Check byte position of file in update condition of Taildir (bessbd: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=dfa0627573b9a75a25dc7149a7d63c9bac953ff4)

          • (edit) flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/ReliableTaildirEventReader.java
          • (edit) flume-ng-sources/flume-taildir-source/src/test/java/org/apache/flume/source/taildir/TestTaildirEventReader.java
          Show
          hudson Hudson added a comment - FAILURE: Integrated in Jenkins build Flume-trunk-hbase-1 #246 (See https://builds.apache.org/job/Flume-trunk-hbase-1/246/ ) FLUME-3083 . Check byte position of file in update condition of Taildir (bessbd: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=dfa0627573b9a75a25dc7149a7d63c9bac953ff4 ) (edit) flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/ReliableTaildirEventReader.java (edit) flume-ng-sources/flume-taildir-source/src/test/java/org/apache/flume/source/taildir/TestTaildirEventReader.java
          Hide
          bessbd Bessenyei Balázs Donát added a comment -

          Satoshi Iijima: thank you for the review!

          Show
          bessbd Bessenyei Balázs Donát added a comment - Satoshi Iijima : thank you for the review!
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flume/pull/128

          Show
          githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/flume/pull/128
          Hide
          bessbd Bessenyei Balázs Donát added a comment -

          eskrm: thank you for the patch!

          Show
          bessbd Bessenyei Balázs Donát added a comment - eskrm : thank you for the patch!
          Hide
          jira-bot ASF subversion and git services added a comment -

          Commit dfa0627573b9a75a25dc7149a7d63c9bac953ff4 in flume's branch refs/heads/trunk from eskrm
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=dfa0627 ]

          FLUME-3083. Check byte position of file in update condition of Taildir Source

          This patch addresses an edge case of the Taildir Source wherein it can miss
          reading events written in the same second as the file closing.

          This closes #128

          Reviewers: Satoshi Iijima, Bessenyei Balázs Donát

          (eskrm via Bessenyei Balázs Donát)

          Show
          jira-bot ASF subversion and git services added a comment - Commit dfa0627573b9a75a25dc7149a7d63c9bac953ff4 in flume's branch refs/heads/trunk from eskrm [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=dfa0627 ] FLUME-3083 . Check byte position of file in update condition of Taildir Source This patch addresses an edge case of the Taildir Source wherein it can miss reading events written in the same second as the file closing. This closes #128 Reviewers: Satoshi Iijima, Bessenyei Balázs Donát (eskrm via Bessenyei Balázs Donát)
          Hide
          eskrm eskrm added a comment -

          The new file would also need to have the same path as well as inode. The reader will reset pos to 0 if the tracked path changes on an inode. I agree though, it's possible. I've added another patch that checks both conditions.

          Show
          eskrm eskrm added a comment - The new file would also need to have the same path as well as inode. The reader will reset pos to 0 if the tracked path changes on an inode. I agree though, it's possible. I've added another patch that checks both conditions.
          Hide
          iijima_satoshi Satoshi Iijima added a comment -

          eskrm,
          I found a potential corner case for the patch.
          When the tail file are truncated (deleted or archived to compressed file) and a new file is generated, the new file can occasionally have the same inode as the truncated file.
          Furthermore, the new file can have the same file size as the truncated file and the new file has not been written after that.
          Even in this case, miss reading events can be possible.

          The condition to check last modified time should be left and a new condition to check position (file size) had better be added as the OR condition.

          Show
          iijima_satoshi Satoshi Iijima added a comment - eskrm , I found a potential corner case for the patch. When the tail file are truncated (deleted or archived to compressed file) and a new file is generated, the new file can occasionally have the same inode as the truncated file. Furthermore, the new file can have the same file size as the truncated file and the new file has not been written after that. Even in this case, miss reading events can be possible. The condition to check last modified time should be left and a new condition to check position (file size) had better be added as the OR condition.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user eskrm opened a pull request:

          https://github.com/apache/flume/pull/128

          FLUME-3083. Change file update condition in Taildir Source from mtime to byte position

          This is to resolve an edge case where events can be missed during tail file close. More details are provided at https://issues.apache.org/jira/browse/FLUME-3083.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/eskrm/flume FLUME-3083

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flume/pull/128.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #128


          commit 82a8b42c15a12bb4d7cc78b834bb2c6e198f5822
          Author: eskrm <eskrm@users.noreply.github.com>
          Date: 2017-04-09T20:02:54Z

          FLUME-3083. Change file update condition in Taildir Source from mtime to byte position


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user eskrm opened a pull request: https://github.com/apache/flume/pull/128 FLUME-3083 . Change file update condition in Taildir Source from mtime to byte position This is to resolve an edge case where events can be missed during tail file close. More details are provided at https://issues.apache.org/jira/browse/FLUME-3083 . You can merge this pull request into a Git repository by running: $ git pull https://github.com/eskrm/flume FLUME-3083 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flume/pull/128.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #128 commit 82a8b42c15a12bb4d7cc78b834bb2c6e198f5822 Author: eskrm <eskrm@users.noreply.github.com> Date: 2017-04-09T20:02:54Z FLUME-3083 . Change file update condition in Taildir Source from mtime to byte position
          Hide
          iijima_satoshi Satoshi Iijima added a comment -

          eskrm,
          I understood your case.
          It can be possible in the case that event(s) have written at the same seconds of idleTimeout value from the last updated time.
          +1 the patch.

          Additionally, I think that longer-than-default idleTimeout is preferred in this case. That would reduce reopening the tail file.

          Show
          iijima_satoshi Satoshi Iijima added a comment - eskrm , I understood your case. It can be possible in the case that event(s) have written at the same seconds of idleTimeout value from the last updated time. +1 the patch. Additionally, I think that longer-than-default idleTimeout is preferred in this case. That would reduce reopening the tail file.
          Hide
          eskrm eskrm added a comment -

          Satoshi Iijima,

          Why wouldn't it be possible regardless of the value of idleTimeout? A tail file is closed from idleness based on its last updated at value. It's only reopened for reading if the file's mtime is strictly greater than the tail file's last updated at value. So even if the tail file was last updated in the current second after being idle for 2 minutes and then closed, if the file's mtime is within the same second of that value it won't reopen the file to read events written just after the last commit due to File.lastModified being in second granularity. That's the working theory anyway.

          A production system I work on is using the taildir source with the default idleTimeout value of 2 minutes. I observed missing events this past week on a few hosts each for a different 15m interval which is 1-1 with a log file. The thing they had in common was that they were closed and then written to in the same second and never reopened to read the remaining event(s). An example of one:

          The taildir position file:

          {
              "file": "/var/lib/flume-ng/taildir/logs/data.0.BBL1.2017-04-05T033000Z.15m.csv", 
              "inode": 11141305, 
              "pos": 88317952
          },
          

          The flume log line when this file was closed:

          /var/log/flume-ng/flume.log.5:05 Apr 2017 03:46:15,322 INFO  [PollableSourceRunner-TaildirSource-taildir] (org.apache.flume.source.taildir.TaildirSource.closeTailFiles:288)  - Closed file: /var/lib/flume-ng/taildir/logs/data.0.BBL1.2017-04-05T033000Z.15m.csv, inode: 11141305, pos: 88317952
          

          File's size and mtime:

          -rw-r--r-- 1 root root 88318908 2017-04-05 03:46:15.704907240 +0000 /var/lib/flume-ng/taildir/logs/data.0.BBL1.2017-04-05T033000Z.15m.csv
          

          The mtime is milliseconds after the close time, but within the same second, and taildir hasn't read the last 956 bytes.

          Show
          eskrm eskrm added a comment - Satoshi Iijima , Why wouldn't it be possible regardless of the value of idleTimeout? A tail file is closed from idleness based on its last updated at value. It's only reopened for reading if the file's mtime is strictly greater than the tail file's last updated at value. So even if the tail file was last updated in the current second after being idle for 2 minutes and then closed, if the file's mtime is within the same second of that value it won't reopen the file to read events written just after the last commit due to File.lastModified being in second granularity. That's the working theory anyway. A production system I work on is using the taildir source with the default idleTimeout value of 2 minutes. I observed missing events this past week on a few hosts each for a different 15m interval which is 1-1 with a log file. The thing they had in common was that they were closed and then written to in the same second and never reopened to read the remaining event(s). An example of one: The taildir position file: { "file": "/var/lib/flume-ng/taildir/logs/data.0.BBL1.2017-04-05T033000Z.15m.csv", "inode": 11141305, "pos": 88317952 }, The flume log line when this file was closed: /var/log/flume-ng/flume.log.5:05 Apr 2017 03:46:15,322 INFO [PollableSourceRunner-TaildirSource-taildir] (org.apache.flume.source.taildir.TaildirSource.closeTailFiles:288) - Closed file: /var/lib/flume-ng/taildir/logs/data.0.BBL1.2017-04-05T033000Z.15m.csv, inode: 11141305, pos: 88317952 File's size and mtime: -rw-r--r-- 1 root root 88318908 2017-04-05 03:46:15.704907240 +0000 /var/lib/flume-ng/taildir/logs/data.0.BBL1.2017-04-05T033000Z.15m.csv The mtime is milliseconds after the close time, but within the same second, and taildir hasn't read the last 956 bytes.
          Hide
          iijima_satoshi Satoshi Iijima added a comment -

          If idleTimeout property has been set less than 1000 milliseconds, missing events that you point out might be possible.
          But this setting is very inefficient because TaildirSource.closeTailFiles() would be called continually and needlessly.
          If idleTimeout has been properly set (default: 120000 milliseconds), current TaildirSource would not miss reading events.

          Show
          iijima_satoshi Satoshi Iijima added a comment - If idleTimeout property has been set less than 1000 milliseconds, missing events that you point out might be possible. But this setting is very inefficient because TaildirSource.closeTailFiles() would be called continually and needlessly. If idleTimeout has been properly set (default: 120000 milliseconds), current TaildirSource would not miss reading events.

            People

            • Assignee:
              eskrm eskrm
              Reporter:
              eskrm eskrm
            • Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development