Been investigating this by running a series of tests on my laptop, and uncovered a few things...
- There is a yield in the onTrigger method when polling the queue with a 100ms wait returns nothing. This could hurt performance if it ends up yielding and missing a second's worth of messages while they are coming in at tens of thousands per second. Since we already have the 100ms poll, we don't really need a yield.
- The internal queue was hard coded to a max capacity of 10, which seems too small to handle possible surges. It would be much better to let the user decide how much data to buffer in memory.
- Running tests on my laptop where I send millions of messages over a few minutes, I would eventually see a checkpoint from the FlowFile repository with a stop-the-world pause of 10-11 seconds. During this time messages were still being read from the channel into the queue, which could easily fill the queue, start blocking, eventually back up to the OS buffer, and potentially drop messages. It is not clear if this would happen on a high performance server, but after discussing with markap14 we determined that reducing nifi.flowfile.repository.partitions in nifi.properties significantly from 256 (used 8 in this case) would reduce the number of FileOutputStreams that need to be flushed and thus reduce the overall wait.
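For illustration, here is a minimal sketch of the idea behind the first two items. It is not the actual patch: the class name BoundedMessageBuffer and its methods are hypothetical, and it only assumes a standard java.util.concurrent.LinkedBlockingQueue sitting between a reader thread and the code that onTrigger runs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; names are not taken from the NiFi code base.
public class BoundedMessageBuffer {

    private final BlockingQueue<String> queue;

    // Capacity is supplied by the caller (e.g. from a user-facing property)
    // instead of being hard coded.
    public BoundedMessageBuffer(int maxCapacity) {
        this.queue = new LinkedBlockingQueue<>(maxCapacity);
    }

    // Reader side: waits up to timeoutMillis for space.
    // Returns false if the queue is still full, so the caller can decide what to do.
    public boolean enqueue(String message, long timeoutMillis) {
        try {
            return queue.offer(message, timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    // Trigger side: the timed poll is the only wait. An empty result simply
    // returns to the caller; no additional yield is requested.
    public List<String> pollBatch(int maxBatchSize, long pollMillis) {
        List<String> batch = new ArrayList<>();
        try {
            String first = queue.poll(pollMillis, TimeUnit.MILLISECONDS);
            if (first == null) {
                return batch;
            }
            batch.add(first);
            // Take whatever else is already queued, up to the batch size.
            queue.drainTo(batch, maxBatchSize - 1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return batch;
    }

    public static void main(String[] args) {
        BoundedMessageBuffer buffer = new BoundedMessageBuffer(10_000);
        buffer.enqueue("message one", 100);
        buffer.enqueue("message two", 100);
        System.out.println(buffer.pollBatch(1000, 100));
    }
}
```

The point of the sketch is that the 100ms poll already keeps an idle trigger from spinning, so an empty poll can return right away and be retried.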
Using the previous 0.5.1 release, I was barely able to achieve 5k messages per second without any data loss.
I then applied a patch that addresses the first two items above, and tested with the following configuration, which seems to be a sweet spot on my laptop:
- JDK 1.8
- 2GB Heap
- G1GC
- Reduced nifi.flowfile.repository.partitions to 8
- Increased nifi.provenance.repository.rollover.time to 60 seconds
- Set root logger to WARN
- 2MB Socket Buffer
- 10k Internal Queue size (default value from new patch)
Test1
1 concurrent task, parsing on, batch size of 1: Up to 11k messages/sec with no loss
4 concurrent tasks, parsing on, batch size of 1: Up to 15k messages/sec with no loss
1 concurrent task, parsing off, batch size of 1000: Up to 53k messages/sec with no loss
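To put the queue sizes in perspective, here is a rough back-of-the-envelope calculation using the rates and capacities mentioned above. It assumes nothing drains the queue while it fills, which is a worst case, and the class and method names are made up for the example.

```java
// Hypothetical helper for rough sizing; the figures come from the tests above.
public class QueueHeadroom {

    // Milliseconds until a queue of the given capacity is full
    // if messages arrive at the given rate and nothing is removed.
    static double millisUntilFull(int capacity, int messagesPerSecond) {
        return capacity * 1000.0 / messagesPerSecond;
    }

    // Number of messages that arrive during a pause of the given length.
    static long messagesDuringPause(int messagesPerSecond, int pauseSeconds) {
        return (long) messagesPerSecond * pauseSeconds;
    }

    public static void main(String[] args) {
        int[] rates = {5_000, 11_000, 15_000, 53_000};
        for (int rate : rates) {
            System.out.printf(
                "%d msg/s: queue of 10 full in %.2f ms, queue of 10000 full in %.0f ms, "
                    + "10 s pause = %d messages%n",
                rate,
                millisUntilFull(10, rate),
                millisUntilFull(10_000, rate),
                messagesDuringPause(rate, 10));
        }
    }
}
```

At 11k messages/sec a 10 second pause amounts to about 110,000 messages, well beyond a 10k queue, so the larger queue helps with short surges but does not by itself cover a checkpoint of that length.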
I will momentarily post the patch described above.