Flume / FLUME-486

When using tailDir in a single-node configuration, records are being lost

    Details

      Description

When using tailDir to ingest files with Flume, we're seeing large numbers of records go missing from the beginning of the file. If we use the text source on the same file, all records ingest successfully. Our workflow is very simple: one node per source and sink. The node reads data using tailDir and writes it out using escapedCustomDfs.
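For reference, the workflow described above would look roughly like this in Flume's 0.9.x dataflow-spec language. The node name and paths here are hypothetical placeholders, not taken from the report:

```
// Hypothetical single-node spec: tail a directory and write
// escaped output straight to HDFS (the configuration under discussion).
node1 : tailDir("/var/log/app/") | escapedCustomDfs("hdfs://namenode/flume/", "app-");
```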

        Activity

        Jonathan Hsieh added a comment -

        Does this problem occur when you use collectorSink instead of escapedCustomDfs?

escapedCustomDfs needs something to close it before it will flush data properly; the collectorSink has a "roller" that periodically closes one file out and opens a new one. Since the text source has an end, it can shut down the node when the end of the file is reached. tail will keep trying to read more data and will never shut itself down!

        -Jon.
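Jon's suggestion, sketched as a Flume 0.9.x spec with hypothetical node name and paths. If I recall the sink signature correctly, collectorSink's optional third argument is the roll period in milliseconds, which is what forces the periodic close-and-flush he describes:

```
// Hypothetical spec: the roller in collectorSink closes the current
// HDFS file every 30 s (30000 ms), so data is flushed without
// anything having to shut the node down.
node1 : tailDir("/var/log/app/") | collectorSink("hdfs://namenode/flume/", "app-", 30000);
```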

        Hide
        Disabled imported user added a comment -

        Hey Jon,

        It occurs when using escapedCustomDfs, but it doesn't appear to be a flushing problem: events are missing all the way across the data source, some from the very beginning of the file and others mid-way through and towards the end. The loss is non-deterministic at the moment, so we can't devise a workaround or develop a means of ensuring the integrity of the file.

        When we look at the stats for Flume, the number of events sent correlates directly with the number in the source file; however, a great many are missing from the version in HDFS. The other problem we're seeing is that when the custom decorator we've written is included in the workflow, custom DFS complains that the attributes we've added aren't there, as if the body has either changed or is missing. This doesn't happen with the text source. When we get time we'll do some more debugging; deadlines are looming and we need to progress.

        For the time being we're probably going to change approach and ingest only nightly, using 'text' when the log file rotates. This rather defeats our goal of streaming data as it arrives, but it does let us guarantee the integrity of the data. We have also experimented with vanilla tail, and the problem is present in both versions; I assume tailDir invokes vanilla tail when it finds a new file.

        Thanks for your help,

        Justin
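The nightly fallback Justin describes could be sketched like this in the same 0.9.x spec language (paths and node name are hypothetical). Because the one-shot text source has a defined end, the node can shut down cleanly after reading the rotated file, which is what makes the integrity guarantee possible:

```
// Hypothetical spec: ingest yesterday's rotated log once, after
// rotation, rather than tailing the live file.
node1 : text("/var/log/app/app.log.1") | collectorSink("hdfs://namenode/flume/", "app-");
```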

        Hide
        Jonathan Hsieh added a comment - edited

        I just noticed the version you are using – you should upgrade to 0.9.2, or to the soon-to-be-released 0.9.3. If you depend on rpms/debs, there will be a new release of those in a few weeks.

        Hide
        E. Sammer added a comment -

        Justin:

        Is this still an issue? Were you able to try a newer version of Flume? The tail source underwent some major refactoring and should be far more stable now.


          People

          • Assignee: Unassigned
          • Reporter: Disabled imported user
          • Votes: 0
          • Watchers: 1