Flume
  1. Flume
  2. FLUME-1100

HDFSWriterFactory and HDFSFormatterFactory should allow extension

    Details

    • Type: Improvement Improvement
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: v1.1.0
    • Fix Version/s: v1.4.0
    • Component/s: None
    • Labels:
      None

      Description

      Users should be able to plugin a class for both factories

      1. FLUME-1100-pluggable-FlumeFormatter.patch
        25 kB
        Chris Birchall
      2. FLUME-1100-upgraded-factory.patch
        1.0 kB
        Ted Malaska
      3. FLUME-1100-3b.patch
        58 kB
        Mike Percy
      4. FLUME-1100-3d.patch
        58 kB
        Mike Percy

        Issue Links

          Activity

          Hide
          Chris Birchall added a comment -

          I have a patch for one half of this issue, namely making HDFSFormatterFactory pluggable.

          The code uses the same idea as EventSerializerFactory, where users can pass either one of a few hardcoded strings ("Text", "Writable") or the class name of a Builder class.

          The code is also available for review here: https://github.com/cb372/flume/tree/custom-hdfs-formatter

          If you think this is reasonable then I can easily rewrite HDFSWriterFactory in a similar way.

          Show
          Chris Birchall added a comment - I have a patch for one half of this issue, namely making HDFSFormatterFactory pluggable. The code uses the same idea as EventSerializerFactory, where users can pass either one of a few hardcoded strings ("Text", "Writable") or the class name of a Builder class. The code is also available for review here: https://github.com/cb372/flume/tree/custom-hdfs-formatter If you think this is reasonable then I can easily rewrite HDFSWriterFactory in a similar way.
          Hide
          Chris Birchall added a comment -

          Make HDFSFormatterFactory pluggable

          Show
          Chris Birchall added a comment - Make HDFSFormatterFactory pluggable
          Hide
          Ted Malaska added a comment -

          Hi Chris,

          I had this same problem at a client's site this week. I solved in a similar way.

          But my client and I would like your second option. Just extending DFSWriterFactory to allow a class path for the value of "hdfs.writeFormat" in the property file.

          Please let me know what you think. My patch file is "FLUME-1100-upgraded-factory.patch"

          Show
          Ted Malaska added a comment - Hi Chris, I had this same problem at a client's site this week. I solved in a similar way. But my client and I would like your second option. Just extending DFSWriterFactory to allow a class path for the value of "hdfs.writeFormat" in the property file. Please let me know what you think. My patch file is " FLUME-1100 -upgraded-factory.patch"
          Hide
          Chris Birchall added a comment -

          Hi Ted, thanks for the patch.

          I think your patch and mine do basically the same thing. The only difference is that using your patch the user sets "hdfs.writeFormat" to the class name of a FlumeFormatter implementation, whereas with my patch they set it to the class name of a Builder.

          The nice thing about using a Builder is that we can pass a configuration Context to its build() method, so we can supply custom configuration parameters for our FlumeFormatter in the Flume config file.

          e.g.

          agent_foo.sinks.hdfs-sink.writeFormat=com.mycompany.flume.MyCustomFormatter$Builder
          agent_foo.sinks.hdfs-sink.writeFormat.ignoreHeaders=foo,bar
          agent_foo.sinks.hdfs-sink.writeFormat.maxEventLengthBytes=10000

          Show
          Chris Birchall added a comment - Hi Ted, thanks for the patch. I think your patch and mine do basically the same thing. The only difference is that using your patch the user sets "hdfs.writeFormat" to the class name of a FlumeFormatter implementation, whereas with my patch they set it to the class name of a Builder. The nice thing about using a Builder is that we can pass a configuration Context to its build() method, so we can supply custom configuration parameters for our FlumeFormatter in the Flume config file. e.g. agent_foo.sinks.hdfs-sink.writeFormat=com.mycompany.flume.MyCustomFormatter$Builder agent_foo.sinks.hdfs-sink.writeFormat.ignoreHeaders=foo,bar agent_foo.sinks.hdfs-sink.writeFormat.maxEventLengthBytes=10000
          Hide
          Ted Malaska added a comment -

          Agreed. I like the idea of using the config.

          But I wonder what your thoughts are on using a builder approach vs using an approach like the org.apache.hadoop.conf.Configurable interface in Hadoop.

          It would seem like the configurable approach would require less code and give the same results.

          Show
          Ted Malaska added a comment - Agreed. I like the idea of using the config. But I wonder what your thoughts are on using a builder approach vs using an approach like the org.apache.hadoop.conf.Configurable interface in Hadoop. It would seem like the configurable approach would require less code and give the same results.
          Hide
          Chris Birchall added a comment -

          I think that would also work fine. I just tried to be consistent with (read: shamelessly copy-pasted) the builder pattern that was already being used elsewhere in Flume (see EventSerializerFactory).

          Also I tend to prefer passing configuration to an object at construction time and making the object immutable, rather than creating the object and then configuring it. But that's just a matter of taste. I'm not sure whether Flume has a policy on such matters.

          Show
          Chris Birchall added a comment - I think that would also work fine. I just tried to be consistent with (read: shamelessly copy-pasted) the builder pattern that was already being used elsewhere in Flume (see EventSerializerFactory). Also I tend to prefer passing configuration to an object at construction time and making the object immutable, rather than creating the object and then configuring it. But that's just a matter of taste. I'm not sure whether Flume has a policy on such matters.
          Hide
          Ted Malaska added a comment -

          Agreed. There should be a consistent way of building these types of objects and you are being consistent with EventSerializer. So your approach is more correct.

          Now in the thought of making things consistent I have a question about writerFactory. Why does the existing implementation allow writerFactory to be passed in through a constructor but doesn't allow the same for the formatterFactory?

          Does it make since for both of these factories to follow your builder pattern?

          Show
          Ted Malaska added a comment - Agreed. There should be a consistent way of building these types of objects and you are being consistent with EventSerializer. So your approach is more correct. Now in the thought of making things consistent I have a question about writerFactory. Why does the existing implementation allow writerFactory to be passed in through a constructor but doesn't allow the same for the formatterFactory? Does it make since for both of these factories to follow your builder pattern?
          Hide
          Chris Birchall added a comment -

          Yes, I think HDFSWriterFactory should be re-written in the same way:

          • Replace the hardcoded strings with an enum
          • Use a builder instead of direct construction

          I was planning to do this myself, but I was hoping to get some kind of comment from somebody core to the project before writing too much code. I'm not sure if I'm supposed to formally request a review, or if the review system is only for committers.

          Show
          Chris Birchall added a comment - Yes, I think HDFSWriterFactory should be re-written in the same way: Replace the hardcoded strings with an enum Use a builder instead of direct construction I was planning to do this myself, but I was hoping to get some kind of comment from somebody core to the project before writing too much code. I'm not sure if I'm supposed to formally request a review, or if the review system is only for committers.
          Hide
          Ted Malaska added a comment -

          I'm not sure, but the committers are most likely watching from high above.

          I would hope to hear from them after the holiday weekend.

          Show
          Ted Malaska added a comment - I'm not sure, but the committers are most likely watching from high above. I would hope to hear from them after the holiday weekend.
          Hide
          Hari Shreedharan added a comment -

          Hey guys,

          Sorry for the delay in response. Noticed that you guys were having a detailed discussion on some stuff here. Could you both please

          (1) briefly summarize the discussion (for the lazy ones like me)
          (2) decide which of the two patches should be reviewed.

          Once you have done that, please post the patch for review on reviewboard and we'd be glad to review and commit.

          Show
          Hari Shreedharan added a comment - Hey guys, Sorry for the delay in response. Noticed that you guys were having a detailed discussion on some stuff here. Could you both please (1) briefly summarize the discussion (for the lazy ones like me) (2) decide which of the two patches should be reviewed. Once you have done that, please post the patch for review on reviewboard and we'd be glad to review and commit.
          Hide
          Ted Malaska added a comment -

          Hey Hari,

          (1) Summarization: What started out as a issue to allow the users to inject custom formatters into Flume HDFSEventSink, turned into a discussion on how Flume as a whole will allow injection of custom classes. i.e. formatters, writers, EventSerializers, ...

          (2) Which patch: Now Chris's initial patch is still the winner because it strives to be consistant with existing injection code in Flume (i.e EventSerializers). But Chris's patch only focuses on injection of the formatter where as we both agree if this builder pattern is selected as the standard for flume then the patch is just the beginning of something much bigger. In that the builder pattern becomes the universal way in which things are injected. So things like the writer injection should also follow the same builder pattern.

          (3) Summary of other topics: You will also find in the above talks a discussion about the difference between the builder pattern and the Configurable interface pattern from Hadoop.

          Show
          Ted Malaska added a comment - Hey Hari, (1) Summarization: What started out as a issue to allow the users to inject custom formatters into Flume HDFSEventSink, turned into a discussion on how Flume as a whole will allow injection of custom classes. i.e. formatters, writers, EventSerializers, ... (2) Which patch: Now Chris's initial patch is still the winner because it strives to be consistant with existing injection code in Flume (i.e EventSerializers). But Chris's patch only focuses on injection of the formatter where as we both agree if this builder pattern is selected as the standard for flume then the patch is just the beginning of something much bigger. In that the builder pattern becomes the universal way in which things are injected. So things like the writer injection should also follow the same builder pattern. (3) Summary of other topics: You will also find in the above talks a discussion about the difference between the builder pattern and the Configurable interface pattern from Hadoop.
          Hide
          Hari Shreedharan added a comment -

          Thanks Ted. Sounds good.

          Chris - When the patch is ready for review, please upload to reviewboard so one of the committers can review.

          Show
          Hari Shreedharan added a comment - Thanks Ted. Sounds good. Chris - When the patch is ready for review, please upload to reviewboard so one of the committers can review.
          Hide
          Chris Birchall added a comment -

          Hari,

          Thanks for the comment. I've posted the patch for review.

          This is my first review, so let me know if there are any problems with the patch format, etc.
          I made it using "git diff --no-prefix trunk > FLUME-1100.patch"

          Show
          Chris Birchall added a comment - Hari, Thanks for the comment. I've posted the patch for review. This is my first review, so let me know if there are any problems with the patch format, etc. I made it using "git diff --no-prefix trunk > FLUME-1100 .patch"
          Hide
          Mike Percy added a comment -

          Just adding a link to the review here in the JIRA

          Show
          Mike Percy added a comment - Just adding a link to the review here in the JIRA
          Hide
          Mike Percy added a comment -

          Rebased Chris' test and removed a few whitespace-only changes. Unit tests pass.

          Show
          Mike Percy added a comment - Rebased Chris' test and removed a few whitespace-only changes. Unit tests pass.
          Hide
          Chris Birchall added a comment -

          Mike, thanks a lot for taking care of the rebasing of the patch. I promise I was planning to get around to it...

          Show
          Chris Birchall added a comment - Mike, thanks a lot for taking care of the rebasing of the patch. I promise I was planning to get around to it...
          Hide
          Mike Percy added a comment -

          No worries

          Show
          Mike Percy added a comment - No worries
          Hide
          Mike Percy added a comment -

          Had to rebase again!

          Show
          Mike Percy added a comment - Had to rebase again!
          Hide
          Mike Percy added a comment -

          Patch committed. Thanks for your contribution Chris!

          Trunk rev: 0dacb250aab933abc077f8f72a616e5005a06135

          Show
          Mike Percy added a comment - Patch committed. Thanks for your contribution Chris! Trunk rev: 0dacb250aab933abc077f8f72a616e5005a06135
          Hide
          Hudson added a comment -

          Integrated in flume-trunk #343 (See https://builds.apache.org/job/flume-trunk/343/)
          FLUME-1100. HDFSWriterFactory and HDFSFormatterFactory should allow extension. (Revision 0dacb250aab933abc077f8f72a616e5005a06135)

          Result = SUCCESS
          mpercy : http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=0dacb250aab933abc077f8f72a616e5005a06135
          Files :

          • flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestHDFSCompressedDataStream.java
          • flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSFormatterFactory.java
          • flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/HDFSBadSeqWriter.java
          • flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/HDFSBadDataStream.java
          • flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/MockHDFSWriter.java
          • flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSCompressedDataStream.java
          • flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSSequenceFile.java
          • flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestBucketWriter.java
          • flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestSeqFileFormatterFactory.java
          • flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSWritableFormatter.java
          • flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/SeqFileFormatter.java
          • flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/SeqFileFormatterType.java
          • flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/SeqFileFormatterFactory.java
          • flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSWriter.java
          • flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/MyCustomFormatter.java
          • flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java
          • flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSTextFormatter.java
          • flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java
          • flume-ng-core/src/main/java/org/apache/flume/sink/FlumeFormatter.java
          • flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSDataStream.java
          Show
          Hudson added a comment - Integrated in flume-trunk #343 (See https://builds.apache.org/job/flume-trunk/343/ ) FLUME-1100 . HDFSWriterFactory and HDFSFormatterFactory should allow extension. (Revision 0dacb250aab933abc077f8f72a616e5005a06135) Result = SUCCESS mpercy : http://git-wip-us.apache.org/repos/asf/flume/repo?p=flume.git&a=commit&h=0dacb250aab933abc077f8f72a616e5005a06135 Files : flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestHDFSCompressedDataStream.java flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSFormatterFactory.java flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/HDFSBadSeqWriter.java flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/HDFSBadDataStream.java flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/MockHDFSWriter.java flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSCompressedDataStream.java flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSSequenceFile.java flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestBucketWriter.java flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/TestSeqFileFormatterFactory.java flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSWritableFormatter.java flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/SeqFileFormatter.java flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/SeqFileFormatterType.java flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/SeqFileFormatterFactory.java flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSWriter.java flume-ng-sinks/flume-hdfs-sink/src/test/java/org/apache/flume/sink/hdfs/MyCustomFormatter.java flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSEventSink.java flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSTextFormatter.java flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/BucketWriter.java flume-ng-core/src/main/java/org/apache/flume/sink/FlumeFormatter.java flume-ng-sinks/flume-hdfs-sink/src/main/java/org/apache/flume/sink/hdfs/HDFSDataStream.java

            People

            • Assignee:
              Chris Birchall
              Reporter:
              Brock Noland
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development