Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.6.0
    • Component/s: None
    • Labels: None

    Description

      While testing ListenSyslog at various data rates, it was observed that a significant number of packets were dropped when using UDP.

      Attachments

        Activity

          bbende Bryan Bende added a comment -

          Been investigating this by running a series of tests on my laptop, and uncovered a few things...

          • There is a yield in the onTrigger method when polling the queue with a 100ms wait and getting nothing. This can hurt performance: yielding parks the processor and can mean missing a second's worth of messages when they are coming in at tens of thousands per second. Since the poll already waits up to 100ms, we don't really need the yield (see the sketch after this list).
          • The internal queue was hard-coded to a max capacity of 10, which seems too small to handle possible surges. It would be much better to let the user decide how much data to buffer in memory.
          • Running tests on my laptop where I send millions of messages over a few minutes, I would eventually see a checkpoint from the FlowFile repository with a stop-the-world pause of upwards of 10-11 seconds. During this time messages were still being read in from the channel, which could easily fill the queue, start blocking, and eventually back up to the OS buffer and potentially drop messages. It is not clear whether this would happen on a high-performance server, but after discussing with markap14 we determined that significantly reducing nifi.flowfile.repository.partitions in nifi.properties from 256 (8 was used in this case) would reduce the number of FileOutputStreams that need to be flushed and thus reduce the overall wait.
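
          The sketch below is a minimal illustration of the first two points only; it is not the actual ListenSyslog internals or the NiFi Processor API, and the names (PollSketch, messageQueue, process) are illustrative. It shows a bounded queue whose capacity could be made user-configurable, and a timed poll that serves as the back-off so no yield is needed.

              import java.util.concurrent.BlockingQueue;
              import java.util.concurrent.LinkedBlockingQueue;
              import java.util.concurrent.TimeUnit;

              public class PollSketch {
                  // Bounded queue; capacity is user-configurable rather than hard-coded to 10
                  private final BlockingQueue<String> messageQueue = new LinkedBlockingQueue<>(10_000);

                  public void onTrigger() throws InterruptedException {
                      // Timed poll: waits up to 100ms for a message, so the poll itself
                      // provides the back-off and no additional yield is required
                      final String message = messageQueue.poll(100, TimeUnit.MILLISECONDS);
                      if (message == null) {
                          return; // nothing arrived; just let the next trigger run
                      }
                      process(message);
                  }

                  private void process(final String message) {
                      System.out.println("processing: " + message);
                  }
              }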

          Using the previous 0.5.1 release, I was barely able to achieve 5k messages per second without any data loss.

          I then applied a patch that addresses the first two items above and tested with the following configuration, which seems to be a sweet spot on my laptop (a properties sketch of the repository settings follows the list):

          • JDK 1.8
          • 2GB Heap
          • G1GC
          • Reduced nifi.flowfile.repository.partitions to 8
          • Increased nifi.provenance.repository.rollover.time to 60 seconds
          • Set root logger to WARN
          • 2MB Socket Buffer
          • 10k Internal Queue size (default value from new patch)
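
          As a rough illustration of how the repository and logging settings above map onto the configuration files: the property keys are real NiFi properties, but the comments are mine, and the socket buffer and internal queue size are set as properties on the processor itself rather than here.

              # conf/nifi.properties (excerpt)
              # Fewer partitions means fewer FileOutputStreams to flush at checkpoint time
              nifi.flowfile.repository.partitions=8
              # Roll provenance over less frequently to reduce write pressure
              nifi.provenance.repository.rollover.time=60 secs

              # conf/logback.xml: set the root logger to WARN
              # <root level="WARN"> ... </root>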

          Test 1
          1 concurrent task, parsing on, batch size of 1: Up to 11k messages/sec with no loss
          4 concurrent tasks, parsing on, batch size of 1: Up to 15k messages/sec with no loss
          1 concurrent task, parsing off, batch size of 1000: Up to 53k messages/sec with no loss

          I will momentarily post the patch described above.

          bbende Bryan Bende added a comment -

          From further testing I realized that we could also benefit from doing a long poll within the loop that performs the batching in ListenSyslog, and that ListenRELP will have the same issues as ListenSyslog.

          The second patch needs to be applied on top of the first patch and addresses the points above (a sketch of the long-poll batching follows).
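
          A rough sketch of the long-poll batching idea, again with illustrative names rather than the actual ListenSyslog internals; the 20ms timeout matches the poll interval mentioned in the commit below:

              import java.util.ArrayList;
              import java.util.List;
              import java.util.concurrent.BlockingQueue;
              import java.util.concurrent.LinkedBlockingQueue;
              import java.util.concurrent.TimeUnit;

              public class BatchPollSketch {
                  private final BlockingQueue<String> messageQueue = new LinkedBlockingQueue<>(10_000);

                  // Drain up to maxBatchSize messages, long-polling inside the loop
                  // instead of busy-spinning when the queue is briefly empty
                  public List<String> nextBatch(final int maxBatchSize) throws InterruptedException {
                      final List<String> batch = new ArrayList<>(maxBatchSize);
                      while (batch.size() < maxBatchSize) {
                          // poll() returns null once no message arrives within 20ms
                          final String message = messageQueue.poll(20, TimeUnit.MILLISECONDS);
                          if (message == null) {
                              break; // end the batch when messages stop arriving promptly
                          }
                          batch.add(message);
                      }
                      return batch;
                  }
              }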

          joewitt Joe Witt added a comment -

          Very nice, Bryan. Quite a significant difference. +1


          jira-bot ASF subversion and git services added a comment -

          Commit 19e53962ca45fd00a46efd671b674d388ff10053 in nifi's branch refs/heads/master from bbende
          [ https://git-wip-us.apache.org/repos/asf?p=nifi.git;h=19e5396 ]

          NIFI-1579 Performance improvements for ListenSyslog which include removing an unnecessary yield and exposing a configurable size for the internal queue used by the processor, changing ListenSyslog to use a 20ms poll and use a long poll when batching, also including same improvements for ListenRELP

          bbende Bryan Bende added a comment -

          Pushed to master


          People

             Assignee: bbende Bryan Bende
             Reporter: bbende Bryan Bende
             Votes: 0
             Watchers: 3

            Dates

              Created:
              Updated:
              Resolved: