Flume / FLUME-720

CollectorSink doesn't pass the new format parameter

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: v0.9.5
    • Fix Version/s: v0.9.5
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      CollectorSink doesn't properly pass the format parameter down to the EscapedCustomDfs sink.
      For example, this works as expected:
      collectorSource(54001) | escapedCustomDfs("hdfs://hadoop1-m1:8020/", "test", seqfile("SnappyCodec") );

      However, this ignores the format parameter and uses the codec defined in flume-conf.xml:
      collectorSource(54001) | collectorSink("hdfs://hadoop1-m1:8020/", "test-", 600000, seqfile("SnappyCodec") );

      By itself this bug would not be very serious. The problem is that escapedCustomDfs/customDfs use the same compressor and apply it to the whole file, in addition to the compression done natively by the sequence file. This leaves the sequence file double compressed and invalid.
      As far as I can tell, the only way to get a valid compressed sequence file is to set flume.collector.dfs.compress.codec to "None" in flume-site.xml and use the format parameter to specify which compression to use for the sequence file, except that doesn't work because of this bug.
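
      The effect of the extra whole-file compression can be illustrated outside Flume. The following is a minimal sketch, not Flume code: gzip is used as a stand-in for whatever codec is configured, the payload is a fake that only mimics the three leading SEQ bytes of a sequence file, and the helper name first_bytes is hypothetical.

```shell
#!/bin/sh
# Sketch only: gzip stands in for the collector's configured codec,
# and the payload is a fake, not a real sequence file.

# first_bytes FILE: print the first three bytes of FILE.
first_bytes() {
    head -c 3 "$1"
}

tmpdir=$(mktemp -d)

# A sequence file begins with the ASCII bytes "SEQ".
printf 'SEQ-fake-payload' > "$tmpdir/plain"

# Compressing the whole file a second time hides that header.
gzip -c "$tmpdir/plain" > "$tmpdir/wrapped"

plain_header=$(first_bytes "$tmpdir/plain")
# Hex-encode the wrapped header so the binary bytes are safe to print.
wrapped_header=$(first_bytes "$tmpdir/wrapped" | od -An -tx1 | tr -d ' \n')

echo "plain header:   $plain_header"
echo "wrapped header: $wrapped_header (hex)"

rm -rf "$tmpdir"
```

      A reader that expects the SEQ header at the start of the file would instead see the outer compressor's header, which is one plausible reason the double-compressed file is rejected.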

        Activity

        Jonathan Hsieh added a comment -

        You are correct about the double compression when the compression setting is set in the xml file as well as passed as an argument to the seqfile format. I've confirmed that collectorSink not taking the format as an argument is a problem. For now, here is a workaround:

        Replace the collectorSink with the following (escapedFormatDfs is the same as escapedCustomDfs, but better named):

        collector(600000) { escapedFormatDfs("hdfs://hadoop1-m1:8020/", "test-%{rolltag}", seqfile("SnappyCodec")) }

        Eran Kutner added a comment -

        That works. Thanks!

        Jonathan Hsieh added a comment -

        Here's a straightforward way to reproduce this problem:

        Create data that is supposed to be a seq file:
        bin/flume sink 'collectorSink("file:///tmp/bz","bzip",5000, seqfile("bzip2"))'
        ...
        Type stuff and write some events.

        Read the file that is supposed to be a seq file:
        bin/flume source 'seqfile("/tmp/bz/bzipxxxxxx")'

        The latter command will fail if the file is not a seq file. If you look at the generated files, you can check whether a file is an avrojson text file, or look for the magic bytes SEQ (sequence file) and the Java class names of the selected codec.
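
        The magic-byte check can also be scripted. This is a rough sketch under stated assumptions: classify_header is a hypothetical helper, not part of Flume, and it only inspects the first three bytes of a file (SEQ for a sequence file, BZh for a bzip2 stream, 0x1f 0x8b for a gzip stream).

```shell
#!/bin/sh
# Sketch only: classify a file by its leading bytes.
# classify_header is a hypothetical helper, not part of Flume.
classify_header() {
    # Hex-encode the first three bytes so binary data is safe to compare.
    magic=$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')
    case "$magic" in
        534551) echo "sequence file" ;;   # ASCII "SEQ"
        425a68) echo "bzip2 stream" ;;    # ASCII "BZh"
        1f8b*)  echo "gzip stream" ;;     # bytes 0x1f 0x8b
        *)      echo "other" ;;           # e.g. a text format
    esac
}
```

        It would be called on one of the generated files, for example classify_header /tmp/bz/bzipxxxxxx. A result other than "sequence file" suggests that the format parameter was not applied or that the file was wrapped by an outer codec.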

        Jonathan Hsieh added a comment -

        Review is here: https://review.cloudera.org/r/1886/

        Jonathan Hsieh added a comment -

        No review for 3 weeks. Committing.


          People

          • Assignee:
            Jonathan Hsieh
            Reporter:
            Eran Kutner