  1. Flume
  2. FLUME-2352

HDFSCompressedDataStream should support appendBatch

    Details

    • Type: Improvement
    • Status: Patch Available
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.5.0
    • Fix Version/s: None
    • Component/s: Sinks+Sources
    • Labels:
      None

      Description

      Compressing events in a batch is much more efficient than compressing them one by one.
      I set hdfs.batchSize to 200000. When I use appendBatch() in BucketWriter, the append operation takes less than 1 second, while appending one by one can take about 10 seconds.

        Activity

        chenshangan521@163.com chenshangan added a comment -

        In our case, one batch contains more than 100 categories, and each category has its own BucketWriter, so each BucketWriter actually handles about 2000 messages per batch.

        hshreedharan Hari Shreedharan added a comment -

        Also, a batch size of 200000 is not realistic. I'd like to see whether batch sizes between 1000 and 10000 show any difference with events of about 500 bytes.

        hshreedharan Hari Shreedharan added a comment -

        This seems like a good idea. I did a quick review and it looks good. Since the serializer is the same for the life of the sink, we don't need to do an instanceof check every time we write an event. We only need to do it once and reuse the result. We should fix that.
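
The "check once and reuse" suggestion could look roughly like the sketch below. It is not taken from the patch: the class name `CachedCapabilityWriter` and the `Serializer` and `BatchSerializer` interfaces are hypothetical stand-ins, and events are plain strings to keep the example self-contained.

```java
import java.util.List;

public class CachedCapabilityWriter {
    /** Hypothetical stand-in for a per-event serializer. */
    public interface Serializer {
        void write(String event);
    }

    /** Hypothetical stand-in for a serializer that also accepts whole batches. */
    public interface BatchSerializer extends Serializer {
        void writeBatch(List<String> events);
    }

    private final Serializer serializer;
    // Non-null only when the serializer supports batches.
    // The instanceof check runs once, here in the constructor, not per event.
    private final BatchSerializer batchSerializer;

    public CachedCapabilityWriter(Serializer serializer) {
        this.serializer = serializer;
        this.batchSerializer = (serializer instanceof BatchSerializer)
                ? (BatchSerializer) serializer
                : null;
    }

    public boolean supportsBatch() {
        return batchSerializer != null;
    }

    /** Writes all events, using the cached capability instead of re-checking the type. */
    public void append(List<String> events) {
        if (batchSerializer != null) {
            batchSerializer.writeBatch(events);
        } else {
            for (String event : events) {
                serializer.write(event);
            }
        }
    }
}
```

Because the serializer does not change for the life of the writer, storing the cast result in a final field keeps the hot path free of type checks.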

        chenshangan521@163.com chenshangan added a comment -

        Hari Shreedharan - I've updated the patch with a test case.

        chenshangan521@163.com chenshangan added a comment - - edited

        batchSize: 200000

        mode                 all        take       append     sync
        append in batch      50254.6    47443.7    815.389    885.778
        append one by one    52779.9    29259      18647.8    2243.67

        Column meanings:
        all: overall time to process a batch
        take: time spent taking events from the channel
        append: time spent in the append operation
        sync: time spent in flush

        The append time decreases significantly, but the take time increases, and I don't know why.
        In another experiment, I sent a large file and measured the total time: batch append took only half the time of appending one by one.

        hshreedharan Hari Shreedharan added a comment -

        Do you have any specific data to prove that adding events as one batch helps? I believe compression codecs actually hold events in a buffer and compress only when a flush is called or the buffer is full.
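
The buffering behavior described here can be observed with a small experiment. The sketch below uses `java.util.zip.GZIPOutputStream` from the JDK as a stand-in; the sink in this issue uses Hadoop compression codecs, which may buffer differently, so treat this only as an illustration of the general idea. The class name `CodecBufferingSketch` is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CodecBufferingSketch {
    /** Returns the sink size after opening, after one small write, and after finish(). */
    public static int[] observeSizes() throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        GZIPOutputStream gzip = new GZIPOutputStream(sink);
        int afterOpen = sink.size();      // only the gzip header has been written

        byte[] event = "a small event body\n".getBytes(StandardCharsets.UTF_8);
        gzip.write(event);
        int afterWrite = sink.size();     // the deflater may still be holding the input

        gzip.finish();                    // pushes out remaining data plus the gzip trailer
        int afterFinish = sink.size();

        return new int[] {afterOpen, afterWrite, afterFinish};
    }

    public static void main(String[] args) throws IOException {
        int[] sizes = observeSizes();
        System.out.printf("open=%d write=%d finish=%d%n", sizes[0], sizes[1], sizes[2]);
    }
}
```

If the size after the write is the same as the size after opening, the event was held in the compressor's buffer rather than compressed immediately, which is the behavior the comment is asking about.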

        hshreedharan Hari Shreedharan added a comment -

        That sounds good. You'd need to support both interfaces though, so you'd need to do some reflection.

        chenshangan521@163.com chenshangan added a comment -

        Hari Shreedharan - What do you think about adding a new interface named BatchEventSerializer that extends the EventSerializer interface and adds a new method? I would handle the HDFSWriter interface in the same way.
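
A minimal sketch of this proposal, under stated assumptions: the interfaces below are simplified stand-ins that take plain strings (Flume's real EventSerializer has a different signature), and the `writeBatch` method name and the `appendBatch` helper are hypothetical, not taken from the patch. The point is that the existing interface stays untouched, so serializers compiled against it keep working, while new serializers can opt in to batching.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BatchSerializerSketch {
    /** Stand-in for the existing serializer contract; left unchanged. */
    public interface EventSerializer {
        void write(String event);
    }

    /** Hypothetical opt-in extension; existing implementations are unaffected. */
    public interface BatchEventSerializer extends EventSerializer {
        void writeBatch(List<String> events);
    }

    /** Uses the batch method when available, otherwise falls back to per-event writes. */
    public static void appendBatch(EventSerializer serializer, List<String> events) {
        if (serializer instanceof BatchEventSerializer) {
            ((BatchEventSerializer) serializer).writeBatch(events);
        } else {
            for (String event : events) {
                serializer.write(event);
            }
        }
    }

    /** Old-style serializer that only knows the per-event method. */
    public static class LegacySerializer implements EventSerializer {
        public final List<String> calls = new ArrayList<>();
        public void write(String event) { calls.add("write:" + event); }
    }

    /** New-style serializer that accepts a whole batch in one call. */
    public static class BatchingSerializer implements BatchEventSerializer {
        public final List<String> calls = new ArrayList<>();
        public void write(String event) { calls.add("write:" + event); }
        public void writeBatch(List<String> events) { calls.add("batch:" + events.size()); }
    }

    public static void main(String[] args) {
        LegacySerializer legacy = new LegacySerializer();
        BatchingSerializer batching = new BatchingSerializer();
        appendBatch(legacy, Arrays.asList("a", "b"));
        appendBatch(batching, Arrays.asList("a", "b"));
        System.out.println(legacy.calls + " " + batching.calls);
    }
}
```

The fallback loop is what lets both kinds of serializer be supported from the same call site.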

        hshreedharan Hari Shreedharan added a comment -

        chenshangan - I like the idea, but it is not possible to add a new method to the EventSerializer interface. The change is not binary compatible with older code, so it affects more than the built-in serializers: there are custom serializers out there that would break if the interface is changed. If possible, can you try to do this without changing the serializer interface?


          People

          • Assignee: chenshangan521@163.com chenshangan
          • Reporter: chenshangan521@163.com chenshangan
          • Votes: 0
          • Watchers: 3
