Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v1.3.0
    • Component/s: None
    • Labels:
      None

      Description

      Add a configuration option for the number of items to send to the channel in a single transaction.

      This will help a lot with FileChannel which needs to fsync every commit.
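
The effect described above can be illustrated with a minimal, self-contained sketch. This is not the patch itself and does not use the Flume API: `CountingChannel`, `batched`, `drain`, and the `batch_size` parameter are hypothetical names chosen for illustration. The stand-in channel only counts commits, on the assumption that each commit would cost one fsync in a durable channel.

```python
from typing import Iterable, Iterator, List


class CountingChannel:
    """Hypothetical stand-in for a durable channel: one commit ~ one fsync."""

    def __init__(self) -> None:
        self.events: List[str] = []
        self.commits = 0

    def put_batch(self, batch: List[str]) -> None:
        # A real durable channel would write the events and sync on commit;
        # this sketch only records the events and counts the commit.
        self.events.extend(batch)
        self.commits += 1


def batched(lines: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Group lines into lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    batch: List[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Flush the final, possibly partial, batch.
        yield batch


def drain(lines: Iterable[str], channel: CountingChannel, batch_size: int) -> int:
    """Send every line to the channel, one transaction per batch."""
    for batch in batched(lines, batch_size):
        channel.put_batch(batch)
    return channel.commits
```

Under these assumptions, sending 650 lines with a batch size of 1 costs 650 commits, while a batch size of 20 costs 33 (32 full batches plus one partial batch of 10).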

      1. FLUME-1361-2.patch
        6 kB
        Juhani Connolly

        Issue Links

          Activity

          NO NAME added a comment -

          Hey Juhani,

          If you could share any performance improvement you get from this (even roughly) that would be great.

          I was looking at:
          https://issues.apache.org/jira/browse/FLUME-1339

          with Hari, but my instinct is that adding event batching is really what you want for this, not necessarily building a standalone client.

          - Patrick
          Juhani Connolly added a comment -

          With a setup of:

          Exec source tailing tomcat logs
          Sending to file channel
          Which is drained by an avro sink

          With the current implementation of FileChannel and a single disk (so the checkpoint and data dirs are both on the same disk), we were getting only 10 events/sec throughput. From other discussions, and from my own assumptions that follow from them (please correct me if this is wrong), I gather this is because each commit triggers an fsync, which in turn triggers at least 2 seeks (one for the data dir, one for the checkpoint dir) plus seeks for everything else recently written to disk (e.g. tomcat logs). On a system with 2-3 disks dedicated exclusively to Flume, the writes would be sequential and probably not a problem.

          With this patch, we were getting the full throughput of our live logs (around 650 events per second per server). I have yet to test what the maximum is, but regardless, it solves what I believe will be a very common use case (a tailing exec source feeding a file channel).

          Apparently the review requests no longer get auto-linked, so I added a link to the review request. I'll fix up the docs tomorrow once I get back to my work computer.
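
The two throughput figures in the comment above can be related by a rough back-of-the-envelope calculation. The sketch below assumes, as a simplification not stated in the thread, that the disk sustains a fixed number of commits per second regardless of batch size, so that events/sec is roughly commits/sec times the batch size. The function name `min_batch_size` is hypothetical.

```python
import math


def min_batch_size(target_events_per_sec: float, commits_per_sec: float) -> int:
    """Smallest batch size that reaches the target rate.

    Assumes throughput ~= commits_per_sec * batch_size, i.e. the commit rate
    is bounded by seeks/fsyncs and does not change with batch size.
    """
    if commits_per_sec <= 0:
        raise ValueError("commits_per_sec must be positive")
    return math.ceil(target_events_per_sec / commits_per_sec)


# Figures from the comment: 10 events/sec with one event per commit
# (so about 10 commits/sec), and a live log rate of about 650 events/sec.
needed = min_batch_size(650, 10)
```

Under that assumption, `needed` comes out to 65 events per transaction. Real commit cost grows somewhat with batch size, so this is only an order-of-magnitude estimate.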

          NO NAME added a comment - - edited

          Hey Juhani,

          Yep - you've got it. The ideal setup for a FileChannel would be either:

          1) Using a dedicated disk for Flume and flushing to disk on every event.
          or
          2) Using a shared disk for Flume and batching disk syncs to prevent excess seeking.

          The first case is similar to using a WAL: frequent seeks but a dedicated disk, so you can get high throughput. If you try to use FileChannel with a shared disk and you are syncing on every event, throughput is going to be bad.

          So I'd expect adding batching to give better throughput, and it sounds like it does.

          One question is whether batching should happen as part of the source or if it should be a first-order feature of the channel, since people will have this problem with other types of sources (e.g. syslog source) whenever they want to do durable writes at high throughput.

          Juhani Connolly added a comment -

          It would be nice to see batching as part of the channel, and I've mentioned it on the mailing list before. I did it this way because we needed it now and it is simple; doing it on the channel side looks a lot more awkward and gives less control. Anyway, it's 3am here now, so I'm off to sleep. I'll address the comments on the review first thing tomorrow.

          NO NAME added a comment -

          I think it's fine to have this batching in the exec source as a short-term fix.

          Even if we add batching as a core component of Flume, people might still want this anyway, to batch the source at a different granularity.

          Hari Shreedharan added a comment -

          Patch committed. Thanks Juhani!


            People

            • Assignee:
              Juhani Connolly
              Reporter:
              Juhani Connolly
            • Votes:
              0
              Watchers:
              4

              Dates

              • Created:
                Updated:
                Resolved:
