Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.7.0
    • Component/s: Sinks+Sources
    • Labels: None

      Description

      This is a proposal to implement a new tailing source.
      This source watches the specified files and tails them in near real-time once appends to those files are detected.

      • This source is reliable and will not miss data even when the tailed files rotate.
      • It periodically writes the last read position of each file to a position file in JSON format.
      • If Flume is stopped or goes down for some reason, it can restart tailing from the position recorded in the existing position file.
      • It can add event headers to each tailed file group.

      The attached patch includes configuration documentation for this source.
      This source requires a Unix-style file system and Java 1.7 or later.
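      As an illustration of how a source like this might be wired up, here is a hypothetical flume.conf fragment. The agent/source/filegroup names and paths are invented, property names follow the attached configuration documentation, and channel setup is omitted:

```
a1.sources = r1
a1.sources.r1.type = TAILDIR
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/app/access.log
a1.sources.r1.filegroups.f2 = /var/log/app2/.*\.log
# Per-filegroup event headers, and the skipToEnd start-position option
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.skipToEnd = false
```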

      1. FLUME-2498.patch
        63 kB
        Satoshi Iijima
      2. FLUME-2498-2.patch
        63 kB
        Satoshi Iijima
      3. FLUME-2498-3.patch
        66 kB
        Roshan Naik
      4. FLUME-2498-4.patch
        67 kB
        Johny Rufus
      5. FLUME-2498-5.patch
        72 kB
        Roshan Naik

        Issue Links

          Activity

          Satoshi Iijima added a comment -

          Hari, Santiago, Otis, Roshan - Thank you for replying on the dev mailing list.
          I am sorry that I was unable to reply after that...
          I would like to discuss this source here.

          Satoshi Iijima added a comment -

          I compared this source with other source components that can tail or read files.
          If anything is incorrect, please let me know.

          Compare with tail-pollable-source (FLUME-2344) and jambalay-file-source
          http://vschart.com/compare/flume-taildir-source/vs/flume-tail-pollable-source-flume-2344/vs/flume-jambalay-file-source

          Compare with Spooling directory source and Exec source (tail -F)
          http://vschart.com/compare/flume-taildir-source/vs/flume-spooling-directory/vs/flume-exec-source-tail-f

          Otis Gospodnetic added a comment -

          Thanks!
          Questions:

          • what about Windows support? Should that be another row?
          • what does Pollable mean?
          • what does Append header mean/do? When is this used, how/why is it useful?
          Satoshi Iijima added a comment -

          I have added 'Windows support' and 'Minimum Java version' to the above links.
          I think it is possible to support Windows because Windows has a file ID instead of an inode.
          If someone later attaches a patch that lets this source run under Windows, I think that would be good, although I do not have a Windows environment.

          'Pollable' means that the source implements PollableSource, and that it continues to process even after the tailed files have been entirely consumed.

          'Append header' means that the source can append headers to the events.
          When the source tails multiple files, it is useful to be able to append headers to the events from each file, as the Spooling Directory source does.

          Satoshi Iijima added a comment -

          Below are answers to the questions posted to the mailing list.

          > Btw., because the position in the file is checkpointed periodically, does
          > that mean that it is possible that, after a restart, some number of lines
          > that have already been tailed, will be read again?

          Yes. They will not be read again.
          On restart this source will start reading from the last read position in position file.

          > - How does it know when to stop tailing the current file and switch to or start tailing another file
          > - When there is a backlog of many files being built up... how does it order the files for consumption

          This source does not impose an ordering, because it is basically intended to tail appended lines of files in near real-time.
          If there is a backlog of many files on start-up, one file will be selected in random order and read to EOF, then the next file will be selected in the same way.
          Using the 'skipToEnd' property, it can also start tailing from the EOF of the current files.

          > - Sounds like there is some C/C++ native code + JNI to work with inodes ? what api are you using.

          This source uses java.nio.file.Files.getAttribute() from the Java 7 API to identify the inode of a file.
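          As a minimal sketch of that API call (the class and file names here are hypothetical, not the patch's code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class InodeCheck {
    // Reads the inode number through the Java 7 "unix" attribute view.
    // This only works on Unix-style file systems, which is why the
    // source requires one.
    public static long inodeOf(Path path) throws IOException {
        return ((Number) Files.getAttribute(path, "unix:ino")).longValue();
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("taildir-demo", ".log");
        long before = inodeOf(log);
        // The inode survives a rename, which is how a tailer can keep
        // tracking a file across log rotation.
        Path rotated = Files.move(log, log.resolveSibling(log.getFileName() + ".rotated"));
        System.out.println(before == inodeOf(rotated)); // prints "true" on Unix
        Files.delete(rotated);
    }
}
```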

          > - does it auto delete the consumed files ?

          No, the consumed files do not need to be deleted by this source. The files to be tailed, and the position within each, are recorded in the position file.
          For example, a log file of an application, such as /var/log/app/access.log, can be specified directly in flume.conf.

          Otis Gospodnetic added a comment -

          Yes. They will not be read again.

          I can't tell if it's Yes or No to the "will lines be read again"

          If there is a backlog of many files on start-up, one file will be selected in random order

          Would it be possible to look at the timestamp on unread files? Or see if they have a numeric extension, like .1, .2, etc. and use some heuristics to try and read them in the correct order?

          No, the consumed files need not be deleted in this source.

          I think the person was asking whether this Taildir Source implementation deletes a file when it's done reading it or not. I think the answer is that it does NOT delete the file and that file deletion is somebody else's responsibility. Correct?

          Satoshi Iijima added a comment -

          It would be possible to create a patch to control the order of files for consumption in some way, for example by sorting the inode list with a comparator.

          That is correct. This source does not currently have an option to delete the files.

          Otis Gospodnetic added a comment -

          It would be possible to create a patch to control the order of files for consumption in some way, for example, sorting inode list using comparator.

          +1 for this.

          Can this source be used to tail a file where something is writing Avro or Thrift? Or just plain-text, line-oriented data?

          Satoshi Iijima added a comment -

          This source supports only plain text and is intended to tail files as newline-separated data.
          It would be possible to tail non-plain-text files if a patch implemented a deserializer in the TailFile class that could be specified in flume.conf.

          Juhani Connolly added a comment -

          Since most of the implementation details should be the same as an internal tool I wrote a while back, I should be able to answer a couple of the remaining queries.

          > I can't tell if it's Yes or No to the "will lines be read again"

          Duplicate reads are possible if the process is restarted and tailing has to be resumed, since checkpoints are only written periodically. We didn't see this as an issue, as Flume can create duplicate lines in other places; the objective is to prevent log loss, not duplication.

          > I think the person was asking whether this Taildir Source implementation deletes a file when it's done reading it or not. I think the answer is that it does NOT delete the file and that file deletion is somebody else's responsibility. Correct?

          This is correct. We use Flume and the source as an "invisible" entity. We have it running on many internal services that do not need to worry about its existence, as it works behind the scenes. We never had a need for it to delete the files, and for something tailing in real time, I suspect such a thing would be awkward. When would you delete a file that's actively being appended to? Once you're "done" reading, it may still get more appends. We close the file handles if there are no appends for a while, just to avoid hogging the file handle and so that log rotations and such are not obstructed, again with the objective that Flume/the tailer be as unobtrusive as possible.

          Satoshi Iijima added a comment -

          Thanks, Juhani-san.

          Satoshi Iijima added a comment -

          Updated the patch. I have made a fix to read multi-byte characters encoded in UTF-8, such as Japanese.

          Benjamin Fiorini added a comment - edited

          Great source; it works very well from what I could test!

          Even though it's easy to work around, it would be cool to also have:

          • regex in the directories as well (e.g. /var/spool/flume/.*/.*\.reports)
          • the ability to add the filename to the headers (could be useful if a regex is used in the filegroup)
          Satoshi Iijima added a comment -

          That sounds good. I agree with implementing these features once this patch is merged to the trunk repository.

          Otis Gospodnetic added a comment -

          +1. Too many people are asking about tailing, and this patch in particular, for it to be ignored by Flume committers, IMHO.

          Ashish Paliwal added a comment -

          Can't we add this as an experimental feature? Maybe we can incorporate user feedback as we get it.

          Satoshi Iijima added a comment -

          Hari or other committers, what about adding this as an experimental source?

          Otis Gospodnetic added a comment -

          There was talk on the mailing list about including this in 1.7. Could somebody please set Fix Version to 1.7 before this info gets lost?

          Satoshi Iijima added a comment -

          Set Fix Version to v1.7.0.

          jian jin added a comment -

          When could this be merged into trunk?

          Hari Shreedharan added a comment -

          I will review this over the next few days.

          flankwang added a comment -

          I want to merge this patch into apache-flume-1.5.2, but I don't know how to do it.

          jian jin added a comment -

          You cannot do that directly against the Flume source code; there will be some conflicts if you apply the patch.

          jian jin added a comment -

          Any progress? One comment about the implementation: it reads the content byte by byte. I do not know whether that is necessary; if it is not, it really impacts performance.

          sutanu das added a comment -

          Can this patch be backported to Flume 1.5, please?

          We run Hortonworks Flume 1.5.2.2, and they will not backport this patch for us.

          Reasons we need this patch:

          1. We want to restart log-file ingestion as events at the point where Flume stopped/recovered, while the log lines keep growing/appending.

          2. We want to get the log file even if logs rotate to new names, e.g. tail.log.x gets rotated to tail.log.y, where -F of the Exec source doesn't work and spoolDir doesn't work either (because of timestamp limitations).

          Hari Shreedharan added a comment -

          Johny Rufus/Roshan Naik - Do you think one of you would be able to take a look at this one?

          Roshan Naik added a comment -

          I am beginning to look at this patch.

          Roshan Naik added a comment -

          Given that this is a useful patch that has been waiting for some time, I will try to fast-track it. I am uploading a revised patch with the fixes noted below, and will make my review comments in the next comment:

          Build changes:

          • fixed merge issues onto the latest 1.7
          • changed version to 1.7.0-SNAPSHOT
          • added taildir-source to flume-ng-dist/pom.xml
          • fixed: flume-taildir-source/pom.xml had the wrong groupId

          Code/doc changes:

          • provided implementations for the abstract methods getBackOffSleepIncrement & getMaxBackOffSleepInterval
          • documented both of these new settings
          • provided support for a short name (instead of the FQCN): sources.s1.type = TAILDIR
          • documented the name of the header used by the byteOffsetHeader setting
          • fixed typos in error messages
          Roshan Naik added a comment -

          General Comments

          1. This patch seems relatively mature in its implementation. After making the above fixes, I gave it some testing on my Mac and tried to cover some potential corner cases, and it handled them pretty well.
          2. Like the filegroup feature.
          3. Like the fact that it can track many files at once.
          4. Handles the case when the event/line is not yet completely written.
          5. Seems like it is able to pick up appends to files that have previously been closed due to timeout. That's very nice!
          6. Is tolerant to deletion of a file and recreation of a new file with the same name (treats them as different files). Again, very nice!
          7. Ran code coverage on the unit tests. Coverage is pretty good (80% line coverage).

          Questions:

          1. I was not able to verify whether it also handles subdirectories. Can you confirm whether or not it does?
          2. It wasn't clear how often it commits to the position.json file. Intuitively I would say the JSON file should be updated for every batch committed into the channel.
          3. Can a regex be applied to the directory as well, and not just the file name?
          4. Windows: what areas of this implementation do you feel may break on Windows?
          5. Is there a limit on how many files it will track?

          Suggestions

          1. major - need to document that it will not delete or rename files, and that there is an expectation that this will be done externally (unlike spooldir)
          2. major - it definitely needs deserializer support; readevent() can forward to the configured deserializer.
          3. major - it does not have a max event size setting (i.e. line length for text files); it would be good to default to a large number (8k?). Deserializer support will automatically give this.
          4. major - files to consume should be selected in order of creation time by default.
          5. major - I think readline() has a bug: it treats \r without a \n immediately following it as a new line.
            The patch in FLUME-2508 might be useful for this.
          6. minor - if a file is being overwritten (instead of appended to), it could log an error and exclude that file.
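          On the readline() suggestion, here is a minimal sketch (a hypothetical helper, not the patch's actual code) of a reader that treats \n and \r\n as line terminators but keeps a lone \r as data:

```java
import java.io.IOException;
import java.io.PushbackInputStream;

public class LineReaderSketch {
    /**
     * Reads one line, treating "\n" and "\r\n" as terminators; a '\r' not
     * followed by '\n' is kept as line data rather than ending the line.
     * Returns null at EOF when no terminator was seen (an incomplete line,
     * to be retried on the next poll). Bytes are treated as single-byte
     * chars for brevity; a real implementation would decode UTF-8.
     */
    public static String readLine(PushbackInputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') {
                return sb.toString();          // plain "\n" terminator
            }
            if (c == '\r') {
                int next = in.read();          // look ahead one byte
                if (next == '\n') {
                    return sb.toString();      // "\r\n" terminator
                }
                if (next != -1) {
                    in.unread(next);           // push the lookahead byte back
                }
                sb.append('\r');               // lone '\r' is data, not a newline
                continue;
            }
            sb.append((char) c);
        }
        return null;                           // EOF before a terminator
    }
}
```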
          jian jin added a comment -

          A question:

          Could we improve readline(), which calls read() to get data byte by byte? That is slow.

          Satoshi Iijima added a comment - edited

          In my production environment, this source can tail more than several thousand appended lines per second, with a few percent CPU usage, on a host with 4 CPU cores. I think that is enough.

          Satoshi Iijima added a comment -

          Thank you for reviewing and updating the patch, Roshan.

          1. Was not able to verify if it handles subdirectories also ? can you confirm whether or not it handles it ?

          It cannot currently handle subdirectories, but it would be better to be able to track files in subdirectories as well.

          2. Wasn't clear how often it commits to the position.json file ? Intuitively i would say for every batch committed into the channel the json file should get updated.

          If position.json is updated for every batch committed, it impacts performance slightly.
          On the other hand, even if position.json is only updated at a regular interval, data loss does not occur when Flume restarts for some reason.
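          For illustration, a position file maintained this way holds one JSON record per tracked file; a hypothetical example (the inode, pos, and file values are invented):

```
[
  {"inode": 2496272, "pos": 166, "file": "/var/log/app/access.log"},
  {"inode": 2496273, "pos": 0, "file": "/var/log/app/error.log"}
]
```

          On restart, the source seeks each file to its recorded pos and resumes reading from there.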

          3. can a regex be applied to the directory also and not just file name ?

          This source cannot currently do that, but this feature sounds good.
          It would be good to implement these features (from questions 1 and 3) after this patch is merged to trunk.

          4. Windows : What areas in this implementation do you feel may break on Windows ?

          This source uses the inode to identify the uniqueness of a file. It would need to use the file ID instead of the inode on Windows.

          5. Is there some limit on how many files it will track ?

          Although I have not confirmed the limit in a test, there are many hosts where this source tracks several hundred files in my production environment.

          Otis Gospodnetic added a comment -

          Could this be used for tailing binary files?

          Roshan Naik added a comment -

          If it supported deserializers, you could give it a custom deserializer that splits the binary files into individual events... then yes.

          However, this source has a notion of checking whether it has reached EOF without reading a newline in the current event; this is one area that needs a bit of investigation to see whether the same behavior can be achieved with deserializer support.

          jian jin added a comment -

          If that is the case, I think it is enough. But are you using an SSD? I tested it locally, and it was not so fast.

          Roshan Naik added a comment - - edited

          Satoshi Iijima, do you think you can look into:

          • supporting deserializers
          • consuming files in order of creation time?
          Juhani Connolly added a comment -

          Pardon my being forward, but are these features we actually need before the patch can be released?

          I think we should fix any actual bugs (the \r\n issue you mentioned) and documentation (e.g. documenting that it is not appropriate for binary files), and then commit. After that, others are free to further improve on the source by adding deserializer support rather than further delaying inclusion. Committing without deserializer support does not strike me as harmful to users, just a missing feature that would be nice to have and would be an appropriate follow-up patch (same with most of the other suggestions).

          As you mentioned, it is pretty mature in implementation. It has been in production use for about a year now, on a very large number of servers. Trying to throw more features into this patch (rather than a separate one) is just going to mean additional debugging and delays. Inclusion has no impact on other components, so it is not harmful to them, and the main considerations should be "is it needed" (I would say yes) and "does it work as documented" (possibly needing a newline-handling fix and documentation on what it does/does not handle). Committing it opens it up to modification by more people to contribute the features they would like to see added.

          Ashish Paliwal added a comment -

          +1. I think once this gets committed, other users can provide patches.

          Roshan Naik added a comment -
          • Consuming in sorted order: This does seem important in order to avoid data loss and exceedingly long delays in data delivery. Basically, if it doesn't consume soon enough, there is a greater danger that another process deleting the log files will do so without the taildir source actually consuming them. Also, end users can easily get concerned if they see newer data first and not the older data, leading to suspicion of data loss. This seemed like a small change. For now there is no need for multiple types of ordering (as might have been previously discussed); it suffices to change the default scheme.
          • Deserializer support: I am OK with not supporting deserializers for now. I feel it might be possible to add this later and still remain backward compatible, as the default deserializer is LINE. I proposed this as it seemed a small change while also automatically addressing the max event size and newline issues.
          • The newline issue: that is definitely worth fixing to avoid bad data.
          • Max event size: good to have, but may not be a blocker.
          • Updating position.json on every commit can lead to excessive duplication in failure scenarios. Not clear if it would have a significant perf impact... but it is not a blocker IMO.
          • The rest, I guess, are trivial doc changes.

          I have been sensitive to ensuring that my review 'suggestions' were relatively small in terms of changes needed or time required. I don't think it was unreasonable to check what the author felt about being able to address them in this round and proceed accordingly... if not, potentially others or I might chip in and take care of it.

          Anyway, it was important to make those observations as part of the review. And to fast-track things, I need clarity from Satoshi on which of them he can address quickly so that I can proceed accordingly. So please let me know.

          Satoshi Iijima added a comment -
          • The new line issue
          • doc changes

          I think the above should be fixed before committing, too.
          The others would be nice to address in later patches.

          I do not have much time to address them now because I am assigned to other tasks.
          I am happy if Roshan or others address them.

          Roshan Naik added a comment -

          Let me take a stab at the ordering issue.

          Any volunteer to take a stab at the newline issue, i.e. handling both types of line endings correctly?

          Johny Rufus added a comment -

          Roshan Naik, let me look at the new line issue.

          Roshan Naik added a comment -

          Thanks Johny Rufus !
          You might be able to leverage the code in FLUME-2508
          It handles cases like a single file having both types of line endings (which is rare but does occur in mixed OS environments)

          Johny Rufus added a comment -

          Hi Roshan Naik, using System.lineSeparator() is going to return the current system's line separator, which may not match the file being processed.
          So typically we should:
          1) figure out the end of a line using '\n' (should work for both Unix and Windows)
          2) remove '\n' or '\r\n' at the end, depending on which one is present (should work for both Unix and Windows)

          Let me know if this sounds good.

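          The two-step scheme above can be sketched as a small helper (hypothetical code illustrating the approach discussed here, not Flume's actual implementation):

```java
public class LineEnding {
    // Strip a trailing line ending from a line that was delimited on '\n'.
    // Checking "\r\n" before "\n" handles both Windows and Unix files,
    // including a mixed file with both kinds of endings.
    static String stripLineEnding(String line) {
        if (line.endsWith("\r\n")) {
            return line.substring(0, line.length() - 2);
        }
        if (line.endsWith("\n")) {
            return line.substring(0, line.length() - 1);
        }
        return line; // e.g. last line at EOF without a newline
    }

    public static void main(String[] args) {
        System.out.println(stripLineEnding("unix line\n"));
        System.out.println(stripLineEnding("windows line\r\n"));
        System.out.println(stripLineEnding("no newline at EOF"));
    }
}
```

          A unit test feeding two or three line types into this function, as suggested below, covers the interesting cases cheaply.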
          Roshan Naik added a comment -

          Yes, I guess that sounds like a good idea. It would be good to have a little unit test for that function with two or three different types of lines feeding into it.

          Roshan Naik added a comment -

          If it works well, we should use your implementation for FLUME-2508 also.

          Johny Rufus added a comment -

          Sure, working on it, let me attach the patch with extra test case, once done

          Johny Rufus added a comment -

          Attached FLUME-2498-4.patch, which includes:
          1) handling of \n and \r\n, along with a test case to test the line boundaries
          2) a couple of doc additions mentioning that binary files are not supported and that this source does not rename, delete, or otherwise modify the file being tailed [both doc changes are as per the comments in this discussion thread]

          Roshan Naik added a comment - - edited

          Quick update: changing the order has proved a bit more complex than expected. I thought I had it, but the tests indicate otherwise.

          Still working on it.
          Also, it is becoming evident to me that a consumption order based on lastModified time is more appropriate than creation time, as old files may still be getting updated. That way, a file-deleting agent can remove files that have not been modified for a long time, rather than basing removal on creation time.

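          Consuming files oldest-first by modification time amounts to a sort before tailing begins. A minimal sketch of that ordering step (illustrative only; the surrounding file-discovery and tailing code is assumed):

```java
import java.io.File;
import java.util.Comparator;
import java.util.List;

public class ConsumeOrder {
    // Sort candidate files so the least recently modified come first.
    // Files that stopped receiving appends long ago are then consumed
    // before an external cleanup agent is likely to rotate them away.
    static List<File> byModificationTime(List<File> files) {
        files.sort(Comparator.comparingLong(File::lastModified));
        return files;
    }
}
```

          Switching to creation time would only mean comparing on a different file attribute in the same comparator.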
          Roshan Naik added a comment -

          Uploading patch.

          • Changed the file consume order to be based on modification time (let me know if you feel this is an inappropriate choice; I can easily switch to creation time if needed).
          • Also added a unit test to check the file consume order.
          • It includes the changes from patch v4 submitted by Johny (can you please verify this, Johny Rufus?).
          • Corrected the description of the 'filegroups' setting in the doc. Satoshi Iijima, can you please validate?

          Although I have a unit test to verify the consume order, it would be good if someone reviewed it to ensure I didn't mess up. Satoshi Iijima, Johny, Ashish, or others? I have kept my changes to a bare minimum to reduce churn.

          Johny Rufus added a comment -

          Roshan Naik, verified that the changes from patch v4 are in the latest patch.
          Will try to look at the consume-order changes in some time.

          Johny Rufus added a comment -

          +1 for the changes related to ConsumeOrder

          Satoshi Iijima added a comment -

          +1 for the doc changes of 'filegroups' setting.

          Roshan Naik added a comment -

          Seems like there are no blocker issues. And all changes have been reviewed by others and myself.
          So +1 from me.
          Will initiate the commit now.

          ASF subversion and git services added a comment -

          Commit 757a560db73c2e6fbec56deea4c753a45ccf9032 in flume's branch refs/heads/trunk from Roshan Naik
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=757a560 ]

          FLUME-2498. Implement Taildir Source

          (Satoshi Iijima via Roshan Naik)

          ASF subversion and git services added a comment -

          Commit d02013f4e1ee429b57f24bdfad72e6c6707d0653 in flume's branch refs/heads/flume-1.7 from Roshan Naik
          [ https://git-wip-us.apache.org/repos/asf?p=flume.git;h=d02013f ]

          FLUME-2498. Implement Taildir Source

          (Satoshi Iijima via Roshan Naik)

          Roshan Naik added a comment -

          Committed.
          trunk: 757a560db73c2e6fbec56deea4c753a45ccf9032
          flume-1.7: d02013f4e1ee429b57f24bdfad72e6c6707d0653

          Thanks Satoshi Iijima for this. This is a useful piece of functionality and surely users will appreciate it. Thanks for the patience.

          Also thanks to Johny Rufus for the newline fix and all who participated in this review.

          Hudson added a comment -

          SUCCESS: Integrated in Flume-trunk-hbase-1 #119 (See https://builds.apache.org/job/Flume-trunk-hbase-1/119/)
          FLUME-2498. Implement Taildir Source (roshan: http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=757a560db73c2e6fbec56deea4c753a45ccf9032)

          • flume-ng-sources/flume-taildir-source/src/test/java/org/apache/flume/source/taildir/TestTaildirEventReader.java
          • pom.xml
          • flume-ng-configuration/src/main/java/org/apache/flume/conf/source/SourceConfiguration.java
          • flume-ng-sources/flume-taildir-source/src/test/java/org/apache/flume/source/taildir/TestTaildirSource.java
          • flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/ReliableTaildirEventReader.java
          • flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirSource.java
          • flume-ng-configuration/src/main/java/org/apache/flume/conf/source/SourceType.java
          • flume-ng-sources/flume-taildir-source/pom.xml
          • flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TaildirSourceConfigurationConstants.java
          • flume-ng-dist/pom.xml
          • flume-ng-sources/flume-taildir-source/src/main/java/org/apache/flume/source/taildir/TailFile.java
          • flume-ng-sources/pom.xml
          • flume-ng-doc/sphinx/FlumeUserGuide.rst
          Satoshi Iijima added a comment -

          Thanks to Roshan for committing this. Thanks Johny and all other reviewer.

          Haralds Ulmanis added a comment -

          It is supposed to work with a regex path, but if the regex is in the directory part, it does not work,
          e.g. /var/log/.*/abc.log
          I looked at the code: only a regex in the file name works.

          Maybe add a file manager that adds files to the list, e.g.
          Simplified idea: on start, add all directories matching the regex (directory part) to inotify, and then process inotify create events:
          if dir -> add a watch
          if file -> add to the list of files to tail

          Regards

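          As a rough illustration of the directory-part limitation described above, matching the regex against the full path (rather than only the file name) could look like this. A sketch only: Files.walk stands in for the inotify-based watcher proposed here, and all names are hypothetical:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class RegexPathMatcher {
    // Walk the tree under 'base' and keep regular files whose full path
    // matches the regex, so a pattern like /var/log/.*/abc\.log can match
    // files in subdirectories, not just by file name.
    static List<Path> matchingFiles(Path base, Pattern fullPathRegex) throws IOException {
        try (Stream<Path> paths = Files.walk(base)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> fullPathRegex.matcher(p.toString()).matches())
                        .collect(Collectors.toList());
        }
    }
}
```

          A watcher-based version would register each matched directory for create events instead of rescanning, as the comment above suggests.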
          Roshan Naik added a comment -

          Haralds Ulmanis, could you open a JIRA for that feature request and consider submitting a patch for it?

          Haralds Ulmanis added a comment -

          OK... I'm already writing it. Not exactly a patch, but another module.

          Hari Shreedharan added a comment -

          This is already committed. Can you create a new JIRA and submit the patch there?

          Jun Seok Hong added a comment - - edited

          It was a mistake. I removed the file.

          mouwei added a comment -

          Hi,
          I found a bug in this tail source.

          When I use a regular expression to match files under a folder, and one of the files is rolled by log4j, that file's start position (which is used to record the tail position) is reset to 0, and then all of the matched files are read again.

          After checking the code, I found the following:
          process() updates all of the inode info via "existingInodes.addAll(reader.updateTailFiles());", but skipToEnd is set to "false" during that update:

              public List<Long> updateTailFiles() throws IOException {
                return updateTailFiles(false);
              }

          When the file is rolled, the code below is executed, so startPos is set to 0 and the file is read again:

              if (tf == null || !tf.getPath().equals(f.getAbsolutePath())) {
                long startPos = skipToEnd ? f.length() : 0;
                tf = openFile(f, headers, inode, startPos);
              }

          Has anyone encountered the same problem, or is there a setting I have missed?

          Jason Kushmaul added a comment -

          I added a new ticket, FLUME-2994, to add Windows support to Taildir; a patch is available.

          mouwei, you should create a new ticket.


            People

            • Assignee:
              Unassigned
              Reporter:
              Satoshi Iijima
            • Votes:
              9 Vote for this issue
              Watchers:
              26 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development