Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.0
    • Component/s: None
    • Labels: None

    Description

      While testing ListenSyslog at various data rates, it was observed that a significant number of packets were dropped when using UDP.

      Attachments

        Activity

          bbende Bryan Bende added a comment -

          Been investigating this by running a series of tests on my laptop, and uncovered a few things...

          • There is a yield in the onTrigger method when polling the queue with a 100ms wait and getting nothing. This can hurt performance: yielding parks the processor and can mean missing a second's worth of messages when they are coming in at tens of thousands per second. Since the poll already waits up to 100ms, we don't really need the yield (see the sketch after this list).
          • The internal queue was hard-coded to a max capacity of 10, which seems too small to handle possible surges. It would be much better to let the user decide how much data to buffer in memory.
          • Running tests on my laptop where I send millions of messages over a few minutes, I would eventually see a checkpoint from the FlowFile repository with a stop-the-world pause of upwards of 10-11 seconds. During this time messages were still being read in from the channel, which could easily fill the queue, start blocking, and eventually back up to the OS buffer and potentially drop messages. It is not clear whether this would happen on a high-performance server, but after discussing with markap14 we determined that significantly reducing nifi.flowfile.repository.partitions in nifi.properties from 256 (8 was used in this case) would reduce the number of FileOutputStreams that need to be flushed and thus reduce the overall wait.
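
          The sketch below is a minimal illustration of the first two points only; it is not the actual ListenSyslog internals or the NiFi Processor API, and the names (PollSketch, messageQueue, process) are illustrative. It shows a bounded queue whose capacity could be made user-configurable, and a timed poll that serves as the back-off so no yield is needed.

              import java.util.concurrent.BlockingQueue;
              import java.util.concurrent.LinkedBlockingQueue;
              import java.util.concurrent.TimeUnit;

              public class PollSketch {
                  // Bounded queue; capacity is user-configurable rather than hard-coded to 10
                  private final BlockingQueue<String> messageQueue = new LinkedBlockingQueue<>(10_000);

                  public void onTrigger() throws InterruptedException {
                      // Timed poll: waits up to 100ms for a message, so the poll itself
                      // provides the back-off and no additional yield is required
                      final String message = messageQueue.poll(100, TimeUnit.MILLISECONDS);
                      if (message == null) {
                          return; // nothing arrived; just let the next trigger run
                      }
                      process(message);
                  }

                  private void process(final String message) {
                      System.out.println("processing: " + message);
                  }
              }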

          Using the previous 0.5.1 release, I was barely able to achieve 5k messages per second without any data loss.

          I then applied a patch that addresses the first two items above and tested with the following configuration, which seems to be a sweet spot on my laptop (a properties sketch of the repository settings follows the list):

          • JDK 1.8
          • 2GB Heap
          • G1GC
          • Reduced nifi.flowfile.repository.partitions to 8
          • Increased nifi.provenance.repository.rollover.time to 60 seconds
          • Set root logger to WARN
          • 2MB Socket Buffer
          • 10k Internal Queue size (default value from new patch)
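
          As a rough illustration of how the repository and logging settings above map onto the configuration files: the property keys are real NiFi properties, but the comments are mine, and the socket buffer and internal queue size are set as properties on the processor itself rather than here.

              # conf/nifi.properties (excerpt)
              # Fewer partitions means fewer FileOutputStreams to flush at checkpoint time
              nifi.flowfile.repository.partitions=8
              # Roll provenance over less frequently to reduce write pressure
              nifi.provenance.repository.rollover.time=60 secs

              # conf/logback.xml: set the root logger to WARN
              # <root level="WARN"> ... </root>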

          Test 1
          1 concurrent task, parsing on, batch size of 1: Up to 11k messages/sec with no loss
          4 concurrent tasks, parsing on, batch size of 1: Up to 15k messages/sec with no loss
          1 concurrent task, parsing off, batch size of 1000: Up to 53k messages/sec with no loss

          I will momentarily post the patch described above.

          bbende Bryan Bende added a comment -

          From further testing I realized that we could also benefit from doing a long poll within the loop that performs the batching in ListenSyslog, and that ListenRELP will have the same issues as ListenSyslog.

          The second patch needs to be applied on top of the first patch and addresses the points above (a sketch of the long-poll batching follows).
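
          A rough sketch of the long-poll batching idea, again with illustrative names rather than the actual ListenSyslog internals; the 20ms timeout matches the poll interval mentioned in the commit below:

              import java.util.ArrayList;
              import java.util.List;
              import java.util.concurrent.BlockingQueue;
              import java.util.concurrent.LinkedBlockingQueue;
              import java.util.concurrent.TimeUnit;

              public class BatchPollSketch {
                  private final BlockingQueue<String> messageQueue = new LinkedBlockingQueue<>(10_000);

                  // Drain up to maxBatchSize messages, long-polling inside the loop
                  // instead of busy-spinning when the queue is briefly empty
                  public List<String> nextBatch(final int maxBatchSize) throws InterruptedException {
                      final List<String> batch = new ArrayList<>(maxBatchSize);
                      while (batch.size() < maxBatchSize) {
                          // poll() returns null once no message arrives within 20ms
                          final String message = messageQueue.poll(20, TimeUnit.MILLISECONDS);
                          if (message == null) {
                              break; // end the batch when messages stop arriving promptly
                          }
                          batch.add(message);
                      }
                      return batch;
                  }
              }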

          joewitt Joe Witt added a comment -

          Very nice, Bryan. Quite a significant difference. +1


          jira-bot ASF subversion and git services added a comment -

          Commit 19e53962ca45fd00a46efd671b674d388ff10053 in nifi's branch refs/heads/master from bbende
          [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=19e5396 ]

          NIFI-1579 Performance improvements for ListenSyslog which include removing an unnecessary yield and exposing a configurable size for the internal queue used by the processor, changing ListenSyslog to use a 20ms poll and use a long poll when batching, also including same improvements for ListenRELP

          bbende Bryan Bende added a comment -

          Pushed to master


          People

             Assignee: bbende Bryan Bende
             Reporter: bbende Bryan Bende
             Votes: 0
             Watchers: 3

            Dates

              Created:
              Updated:
              Resolved: