Log4j 2 / LOG4J2-1430

Add optional support for Conversant DisruptorBlockingQueue in AsyncAppender

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.6.1
    • Fix Version/s: 2.7
    • Component/s: Appenders
    • Labels:
      None

      Description

      Conversant Disruptor provides an implementation of BlockingQueue that is much faster than ArrayBlockingQueue. I ran some benchmarks earlier and found it to be substantially faster:

      AsyncAppender/ArrayBlockingQueue

      Benchmark                                                       Mode  Samples        Score       Error  Units
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params    thrpt       20  1101267.173 ± 17583.204  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params    thrpt       20  1128269.255 ± 12188.910  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param      thrpt       20  1525470.805 ± 56515.933  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params     thrpt       20  1789434.196 ± 42733.475  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params     thrpt       20  1803276.278 ± 34938.176  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params     thrpt       20  1468550.776 ± 26402.286  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params     thrpt       20  1322304.349 ± 22417.997  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params     thrpt       20  1179756.489 ± 16502.276  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params     thrpt       20  1324660.677 ± 18893.944  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params     thrpt       20  1309365.962 ± 19602.489  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params     thrpt       20  1422144.180 ± 20815.042  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple      thrpt       20  1247862.372 ± 18300.764  ops/s
      

      AsyncAppender/DisruptorBlockingQueue

      Benchmark                                                       Mode  Samples        Score        Error  Units
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params    thrpt       20  3704735.586 ±  59766.253  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params    thrpt       20  3622175.410 ±  31975.353  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param      thrpt       20  6862480.428 ± 121473.276  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params     thrpt       20  6193288.988 ±  93545.144  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params     thrpt       20  5715621.712 ± 131878.581  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params     thrpt       20  5745187.005 ± 213854.016  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params     thrpt       20  5307137.396 ±  88135.709  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params     thrpt       20  4953015.419 ±  72100.403  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params     thrpt       20  4833836.418 ±  52919.314  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params     thrpt       20  4353791.507 ±  79047.812  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params     thrpt       20  4136761.624 ±  67804.253  ops/s
      o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple      thrpt       20  6719456.722 ± 187433.301  ops/s
      

      AsyncLogger

      Benchmark                                                Mode  Samples         Score        Error  Units
      o.a.l.l.p.j.AsyncLoggersBenchmark.throughput10Params    thrpt       20   5075883.371 ± 180465.316  ops/s
      o.a.l.l.p.j.AsyncLoggersBenchmark.throughput11Params    thrpt       20   4867362.030 ± 193909.465  ops/s
      o.a.l.l.p.j.AsyncLoggersBenchmark.throughput1Param      thrpt       20  10294733.024 ± 226536.965  ops/s
      o.a.l.l.p.j.AsyncLoggersBenchmark.throughput2Params     thrpt       20   9021650.667 ± 351102.255  ops/s
      o.a.l.l.p.j.AsyncLoggersBenchmark.throughput3Params     thrpt       20   8079337.905 ± 115824.975  ops/s
      o.a.l.l.p.j.AsyncLoggersBenchmark.throughput4Params     thrpt       20   7347356.788 ±  66598.738  ops/s
      o.a.l.l.p.j.AsyncLoggersBenchmark.throughput5Params     thrpt       20   6930636.174 ± 150072.908  ops/s
      o.a.l.l.p.j.AsyncLoggersBenchmark.throughput6Params     thrpt       20   6309567.300 ± 293709.787  ops/s
      o.a.l.l.p.j.AsyncLoggersBenchmark.throughput7Params     thrpt       20   6051997.196 ± 268405.087  ops/s
      o.a.l.l.p.j.AsyncLoggersBenchmark.throughput8Params     thrpt       20   5273376.623 ±  99168.461  ops/s
      o.a.l.l.p.j.AsyncLoggersBenchmark.throughput9Params     thrpt       20   5091137.594 ± 150617.444  ops/s
      o.a.l.l.p.j.AsyncLoggersBenchmark.throughputSimple      thrpt       20  11136623.731 ± 400350.272  ops/s
      
      1. log4jthread2cpu4core.log
        65 kB
        John Cairns
      2. log4jthread2cpu2core.log
        62 kB
        John Cairns
      3. log4jrafile.log
        6 kB
        John Cairns
      4. log4jperfnumactl.log
        32 kB
        John Cairns
      5. log4jHaswell2cpu4core.jpg
        57 kB
        John Cairns
      6. log4jHaswell2cpu2core.jpg
        60 kB
        John Cairns
      7. log4jHaswell1cpu12core.jpg
        56 kB
        John Cairns
      8. log4j2-1430-jctools-tmp-patch.txt
        5 kB
        Anthony Maire
      9. jctools-vs-conversant-service-time.png
        47 kB
        Anthony Maire
      10. conversantvsjctoolsnumthreads.jpg
        47 kB
        John Cairns
      11. AsyncLogBenchmarks.log
        21 kB
        John Cairns
      12. AsyncAppenderPerf01.txt
        6 kB
        Matt Sicker

        Activity

        ralph.goers@dslextreme.com Ralph Goers added a comment -

        I would prefer an implementation where no configuration is required. If the Disruptor is available then it should be used instead of the ArrayBlockingQueue. You can do that by having two BlockingQueueFactory implementations and just use the interface in the AsyncAppender. Then choose the Disruptor implementation of the factory if a particular class in it is available.
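The selection approach Ralph describes could be sketched roughly as follows. This is a hypothetical illustration, not the actual Log4j code: the interface name matches the one discussed in this issue, but the selector and the reflective construction are assumptions made to keep the sketch self-contained without the optional Conversant dependency.

```java
import java.lang.reflect.Constructor;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

interface BlockingQueueFactory<E> {
    BlockingQueue<E> create(int capacity);
}

final class QueueFactorySelector {
    /**
     * Probe the classpath for the optional Conversant queue; if present,
     * build it reflectively, otherwise fall back to ArrayBlockingQueue.
     */
    static <E> BlockingQueueFactory<E> select() {
        try {
            final Class<?> cls =
                Class.forName("com.conversantmedia.util.concurrent.DisruptorBlockingQueue");
            final Constructor<?> ctor = cls.getConstructor(int.class);
            // Conversant is available: create its queue via reflection.
            return capacity -> {
                try {
                    @SuppressWarnings("unchecked")
                    BlockingQueue<E> q = (BlockingQueue<E>) ctor.newInstance(capacity);
                    return q;
                } catch (ReflectiveOperationException e) {
                    throw new IllegalStateException(e);
                }
            };
        } catch (ReflectiveOperationException e) {
            // Optional dependency absent: use the JDK queue.
            return capacity -> new ArrayBlockingQueue<>(capacity);
        }
    }
}
```

Either way the AsyncAppender would only ever see the BlockingQueueFactory interface, never a concrete queue class.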

        jvz Matt Sicker added a comment -

        Sounds reasonable, I'll try it out that way.

        jvz Matt Sicker added a comment -

        One note about automatic use of Conversant: this makes all our tests using AsyncAppender use DisruptorBlockingQueue instead of ArrayBlockingQueue. I'm not sure if it's worth coding in a sort of unit test feature toggle here.

        jvz Matt Sicker added a comment -

        Alright, I have code available in the branch called feature/LOG4J2-1430. Please have a look. I'm aiming for 2.7 with this for semantic versioning reasons.

        garydgregory Gary Gregory added a comment -

        WRT https://issues.apache.org/jira/browse/LOG4J2-1430?focusedCommentId=15332361&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15332361 That's why the code is too clever. The queue used should be configuration based.
        jvz Matt Sicker added a comment - - edited

        I've updated the branch to be configurable instead of magical along with a bonus implementation for LinkedTransferQueue.

        I still need to re-run the performance tests with this additional implementation.

        ralph.goers@dslextreme.com Ralph Goers added a comment -

        Matt, while I can agree with Gary that making this configurable makes sense, it should not be configured via a system property. Use a Plugin or something similar to specify the implementation. If loading that fails, then log a warning and fall back to the ArrayBlockingQueue.

        We should never use a system property for something that can be configured in the configuration file.

        jvz Matt Sicker added a comment -

        Sounds good. I'll convert this into a plugin.

        jvz Matt Sicker added a comment - - edited

        Alright, I've converted BlockingQueueFactory into a plugin element. Please let me know what you think. Personally, the config looks slightly awkward now, but that can be improved.

        Edit: I'll update the documentation again when we've agreed on how to configure this.

        jvz Matt Sicker added a comment -

        Updates on configurability:

        • Converted BlockingQueueFactory into a generic plugin used as a plugin element inside an AsyncAppender. For example:
          <Async name="AsyncList" ...>
            <AppenderRef ref="List" />
            <DisruptorBlockingQueue spinPolicy="SPINNING"/>
          </Async>
          
        • Added attributes for spin policy (disruptor) and fairness (array blocking queue)
        • Updated the AsyncAppenderTest to use multiple config files corresponding to each queue factory plugin
        jvz Matt Sicker added a comment - - edited

        Some updated AsyncAppender benchmarks from a different machine:

        Benchmark                                                                                (configFileName)   Mode  Samples        Score        Error  Units
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params          perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20   421350.194 ± 118742.034  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params    perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20   732013.709 ± 184898.251  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params     perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20   620838.595 ± 134244.041  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params          perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20   543288.791 ±  82482.810  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params    perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20   721883.526 ±  66989.585  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params     perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20   639439.565 ± 125657.650  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param            perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20   824790.565 ± 171621.728  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param      perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  1028556.763 ± 109353.125  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param       perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20   568566.928 ± 269872.375  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20   594563.141 ±  70404.296  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20   845897.352 ± 228154.808  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20   848475.002 ±  91257.402  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20   544594.819 ±  62686.338  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20   987199.830 ±  33025.388  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20   859914.106 ± 121160.546  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20   461825.397 ±  62365.730  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20   864736.023 ± 160036.083  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20   932562.486 ± 203536.539  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20   521609.312 ±  69653.307  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20   779901.058 ± 293679.862  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20   946788.498 ± 219523.096  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20   553107.164 ±  68208.224  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20   645058.084 ±  50491.092  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20   840133.798 ± 110307.285  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20   536812.806 ± 117628.586  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20   903849.192 ± 157589.199  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  1042231.348 ± 158482.482  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20   462427.066 ± 127036.911  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  1086714.745 ±  61264.216  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20   901716.345 ±  51733.243  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20   373571.150 ± 141871.367  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20   996033.507 ± 192236.390  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20   852531.537 ±  21937.418  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple            perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20   706499.111 ± 122564.787  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple      perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  1250716.018 ± 118582.796  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple       perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20   410649.438 ± 238538.562  ops/s
        
        ralph.goers@dslextreme.com Ralph Goers added a comment -

        Honestly, there are so many numbers here that I can't really make sense of it.

        remkop@yahoo.com Remko Popma added a comment -

        What is the thread count for this test?
        Also, I haven't had a chance to look at the code yet, but for the TransferQueue, are you using the tryTransfer() method? This should make quite a difference compared to the normal offer() method.

        Anthony Maire Anthony Maire added a comment -

        There was a thread on the mechanical sympathy group about this queue: https://groups.google.com/forum/#!searchin/mechanical-sympathy/conversant/mechanical-sympathy/c5x0c2Zsfpc/BmcnRihGFAAJ
        I had a very quick look at the current code, and the same issues seem to still be present:

        • Contention between producers: it will not scale if there are more producers than cores. That's basically the difference between Disruptor 3.x and earlier versions; more details in the second half of this talk by Michael Barker (LMAX): https://www.youtube.com/watch?v=VBnLW9mKMh4
        • Potentially broken on 32-bit systems (writes to non-volatile longs are not guaranteed to be atomic)
        • Not sure that the padding mechanism used in their code works, since the JVM can re-order fields. The most reliable way to pad is to use inheritance

        Moreover, due to the asynchronous nature of this kind of queue, I'm not sure that we will always detect when the queue is empty, which is critical since the isEmpty() method is called to determine whether a flush is required when immediateFlush = false. Further investigation is needed on this subject
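The flush concern can be illustrated with a deliberately simplified consumer loop (hypothetical class and method names, not the actual AsyncAppender code): with immediateFlush = false, writes are buffered and the flush only happens once isEmpty() reports the queue drained, so a queue whose isEmpty() can lag reality may delay flushes.

```java
import java.util.concurrent.BlockingQueue;

// Simplified illustration of why isEmpty() matters with immediateFlush = false:
// the consumer batches writes and only "flushes" once the queue reports empty.
final class BatchingConsumer {
    private final BlockingQueue<String> queue;
    private final StringBuilder buffer = new StringBuilder();
    private int flushCount;

    BatchingConsumer(BlockingQueue<String> queue) {
        this.queue = queue;
    }

    /** Drain whatever is queued; flush only when the queue reports empty. */
    void drainOnce() {
        String event;
        while ((event = queue.poll()) != null) {
            buffer.append(event).append('\n'); // buffered write, not yet flushed
        }
        if (queue.isEmpty()) {
            flushCount++; // if isEmpty() lagged reality, this flush would be delayed
        }
    }

    int flushes() {
        return flushCount;
    }
}
```

If isEmpty() can transiently return false for a drained queue, the flush branch is skipped and buffered events linger until the next drain cycle.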

        jvz Matt Sicker added a comment -

        4 threads on a Core 2 Duo. In this benchmark, I just did a drop-in of LinkedTransferQueue as a BlockingQueue. I haven't tried out the transfer() family of methods yet, but when I looked at the code for the class, there's a generic xfer() method used for both BlockingQueue and TransferQueue methods.

        jvz Matt Sicker added a comment -

        This is the same test as in the OP, but now with all three BlockingQueues combined.

        jvz Matt Sicker added a comment -

        There is a disadvantage to using TransferQueue over BlockingQueue: only LinkedTransferQueue implements both. DisruptorBlockingQueue and ArrayBlockingQueue do not, but based on even preliminary testing, LinkedTransferQueue is a lot faster than ArrayBlockingQueue anyway.

        jvz Matt Sicker added a comment - - edited

        The 64-bit requirement of the code is a known issue. The author of the library has said that it's optimized for 64-bit Intel CPUs, so there's that as well.

        Edit: also, that thread you linked seemed to go off-topic rather quickly, but thanks for the info.

        Edit2: it seems one of the big criticisms was the unpredictable latency, but I wasn't under the impression that async logging was about low latency (at least on the consumer side). How big a problem this would be in practice remains to be seen. I'll make sure to include any caveats in the docs, however.

        jvz Matt Sicker added a comment - - edited

        Well now, I've gone and done an instanceof check for TransferQueue (which should get optimized away in theory; the instance it checks is final), and here's just the benchmark (4 threads, 2.4 GHz Intel Core i5, 2 cores) for the simple throughput version of AsyncAppender:

        Benchmark                                                                              (configFileName)   Mode  Samples        Score        Error  Units
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple          perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1476472.263 ±  91557.783  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple    perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  3863319.320 ± 514855.706  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple     perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  3949733.956 ±  28776.779  ops/s
        

        Here's the same benchmark run without the instanceof, but with all queues using offer() instead of tryTransfer():

        Benchmark                                                                              (configFileName)   Mode  Samples        Score        Error  Units
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple          perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1354706.385 ± 127150.211  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple    perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  4025877.702 ± 377436.276  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple     perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  3908696.192 ±  71955.810  ops/s
        

        Note: this benchmark was run on yet another different machine than the other two benchmarks posted so far.
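The instanceof dispatch described above can be sketched like this (a simplified stand-in for the actual AsyncAppender code, with hypothetical names):

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TransferQueue;

final class QueueHandoff {
    /**
     * Prefer tryTransfer() when the queue is a TransferQueue: if a consumer
     * is already waiting, the element is handed off directly without being
     * enqueued. Otherwise fall back to a plain non-blocking offer().
     */
    static <E> boolean enqueue(BlockingQueue<E> queue, E element) {
        if (queue instanceof TransferQueue) {
            // tryTransfer succeeds only if a consumer is currently waiting ...
            if (((TransferQueue<E>) queue).tryTransfer(element)) {
                return true;
            }
            // ... so still offer() when no consumer was ready to receive it.
        }
        return queue.offer(element);
    }
}
```

Since the queue field is final, the JIT can typically resolve the instanceof test once and eliminate the branch, which is why the check should be close to free in steady state.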

        jvz Matt Sicker added a comment - - edited

        Attached full results of the current AsyncAppender code. Jira isn't wide enough for this, so the text file is a bit easier to read. Benchmarks executed with 4 threads on the same machine as the above post.

        jvz Matt Sicker added a comment -

        Upon benchmarking, it appears the implementation of tryTransfer(E) is quite a bit faster than the implementation of offer(E). I've updated AsyncAppender as such.
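The dispatch this implies can be sketched as follows (illustrative names, not the actual AsyncAppender code). Note that tryTransfer() returns false without enqueuing when no consumer is waiting, so it needs an offer() fallback:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TransferQueue;

// Hypothetical sketch: prefer a direct hand-off to a parked consumer via
// tryTransfer() when the queue supports it, otherwise fall back to offer().
final class QueueHandoff {
    @SuppressWarnings("unchecked")
    static <E> boolean enqueue(BlockingQueue<E> queue, E event) {
        if (queue instanceof TransferQueue) {
            TransferQueue<E> tq = (TransferQueue<E>) queue;
            // tryTransfer succeeds only if a consumer is already waiting;
            // offer enqueues normally when no consumer is parked.
            return tq.tryTransfer(event) || tq.offer(event);
        }
        return queue.offer(event);
    }
}
```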

        Anthony Maire Anthony Maire added a comment -

        I made some tests with the branch to compare the Conversant DisruptorBlockingQueue with the MpscArrayQueue from JCTools ( https://github.com/JCTools/JCTools ). Since the JCTools queue isn't a BlockingQueue, I extended it to add the BlockingQueue methods, using LockSupport.parkNanos(1L) when waiting was needed.
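A minimal sketch of such an adapter, with java.util.concurrent.ConcurrentLinkedQueue standing in for the JCTools MpscArrayQueue (which is not assumed to be on the classpath here); the class name and the capacity handling are illustrative:

```java
import java.util.Queue;
import java.util.concurrent.locks.LockSupport;

// Sketch: turning a non-blocking Queue into a blocking one by spinning with
// LockSupport.parkNanos(1L), as described in the comment. The actual park
// duration is OS-dependent (roughly 1 ms on Windows, ~50 µs on Linux).
final class ParkNanosBlockingAdapter<E> {
    private final Queue<E> queue;
    private final int capacity; // emulates the bounded MpscArrayQueue

    ParkNanosBlockingAdapter(Queue<E> queue, int capacity) {
        this.queue = queue;
        this.capacity = capacity;
    }

    /** Blocking put: retry offer, parking briefly between attempts. */
    public void put(E e) throws InterruptedException {
        while (queue.size() >= capacity || !queue.offer(e)) {
            if (Thread.interrupted()) throw new InterruptedException();
            LockSupport.parkNanos(1L);
        }
    }

    /** Blocking take: retry poll with the same park-based wait. */
    public E take() throws InterruptedException {
        E e;
        while ((e = queue.poll()) == null) {
            if (Thread.interrupted()) throw new InterruptedException();
            LockSupport.parkNanos(1L);
        }
        return e;
    }
}
```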

        I first ran the JMH benchmark on my machine (Windows 7 on a Core i5-2500K, 4 cores, no HT) with 8 and 16 threads, and there was no noticeable difference between the two alternative implementations.

        Then, since the LockSupport.parkNanos(1L) behavior is highly OS-dependent (cf. http://vanillajava.blogspot.fr/2012/04/yield-sleep0-wait01-and-parknanos1.html ), I ran further tests on a Linux server: CentOS 6.7 with dual Xeon E5-2643 v3 (24 hyperthreads for the whole machine). There, the parkNanos pause was around 50µs instead of the roughly 1ms observed on Windows.

        There was no noticeable difference up to 8 threads. As the number of threads increased, the DisruptorBlockingQueue performed worse, with no clear advantage over ArrayBlockingQueue at 36 threads, while the MPSC queue sustained a roughly constant throughput.

        Here are the rough figures I got, in the following order: ABQ, Conversant Disruptor, JCTools MPSC (ops/s):

        1 thread   : 925K  3M     3.15M
        8 threads  : 600K  2.95M  3.15M
        12 threads : 550K  2.2M   3.10M
        16 threads : 500K  1.9M   3.6M
        24 threads : 600K  650K   3.45M
        36 threads : 530K  600K   3.5M
        

        I assume that the high pause time from parkNanos on Windows (used by the Conversant wait strategy too) was artificially reducing contention between producers, allowing it to perform pretty well.

        When I compare the latency profiles of both implementations (with JMH "SampleTime" mode) at 8 threads, JCTools performs better at the higher percentiles (the 99.9999% figure cannot be trusted because the number of samples is far too low).

        Conversant Disruptor results:
        
        Result: 2761.833 ±(99.9%) 26.908 ns/op [Average]
          Statistics: (min, avg, max) = (344.000, 2761.833, 16039936.000), stdev = 11613.072
          Confidence interval (99.9%): [2734.925, 2788.741]
          Samples, N = 2016821
                mean =   2761.833 ±(99.9%) 26.908 ns/op
                 min =    344.000 ns/op
          p( 0.0000) =    344.000 ns/op
          p(50.0000) =   2148.000 ns/op
          p(90.0000) =   5024.000 ns/op
          p(95.0000) =   6728.000 ns/op
          p(99.0000) =  11440.000 ns/op
          p(99.9000) =  18912.000 ns/op
          p(99.9900) =  27328.000 ns/op
          p(99.9990) =  35509.234 ns/op
          p(99.9999) = 2346156.268 ns/op
                 max = 16039936.000 ns/op
        
        
        
        JCTools MPSC results:
        
        Result: 3115.571 ±(99.9%) 23.081 ns/op [Average]
          Statistics: (min, avg, max) = (325.000, 3115.571, 6905856.000), stdev = 9438.165
          Confidence interval (99.9%): [3092.490, 3138.651]
          Samples, N = 1810522
                mean =   3115.571 ±(99.9%) 23.081 ns/op
                 min =    325.000 ns/op
          p( 0.0000) =    325.000 ns/op
          p(50.0000) =   2984.000 ns/op
          p(90.0000) =   4352.000 ns/op
          p(95.0000) =   4816.000 ns/op
          p(99.0000) =   5736.000 ns/op
          p(99.9000) =   6872.000 ns/op
          p(99.9900) =   7976.000 ns/op
          p(99.9990) =  10652.653 ns/op
          p(99.9999) = 6872656.978 ns/op
                 max = 6905856.000 ns/op
        

        Interesting fact: with 16 to 36 threads, top reported a CPU usage for the JCTools queue that never exceeded 1200% (while the Conversant queue was using the full CPU), so half of the machine's CPU cycles were idle. I think there is room for improvement, probably in my quick-and-dirty BlockingQueue implementation.

        Anyway, I'm really not sure that using Conversant DisruptorBlockingQueue is a good alternative to ABQ.

        remkop@yahoo.com Remko Popma added a comment - - edited

        I edited your comments (sorry) to use {{noformat}} instead of {{code}}. It means horizontal scrolling, but it shows the numbers aligned better.

        Also, I just saw the attached full performance results after the tryTransfer() code changes. The TransferQueue results look very good! This is JDK 7 with no extra dependency required, which is very nice.

        remkop@yahoo.com Remko Popma added a comment - - edited

        Anthony, thanks a lot for the analysis. I think it is great that we are taking a look at improving AsyncAppender, and the more ideas and options the better. Can you attach your JMH tests and test results?

        Quite impressive that MPSC queue has constant throughput with increasing threads. Does it have the same caveats as the Conversant Disruptor? I remember some of the code from Nitsan Wakart's blogs about his Queue experiments (which I think eventually resulted in JCTools) and what I see in the Conversant Disruptor code reminds me of Nitsan's code. Admittedly I have not looked in much detail.

        Anthony Maire Anthony Maire added a comment -

        I will attach my patch right after posting this comment. This is clearly a quick & dirty patch to show that this "DisruptorBlockingQueue" is not so performant compared to something much closer to the current "state of the art". The way the JCTools queue is converted to a BlockingQueue does matter (just as the choice of wait strategy matters for the "real" disruptor). I removed the LinkedTransferQueue when testing since I hit some OutOfMemoryErrors with it and didn't have time to investigate why.

        The JCTools queue is free of the caveats I spotted with the Conversant Disruptor. The main issue in a multi-producer scenario is that you have two orders to consider:

        • the order in which producers claim slots
        • the order in which data becomes available.

        The two queues solve this issue differently:

        • The Conversant Disruptor has an AtomicLong for each order, and CASes one of them with the value of the other => it forces threads to publish data in the same order the slots were claimed, which doesn't scale at all with many producers. It is not a lock-free algorithm, since a frozen thread can prevent the whole system from making progress.
        • The JCTools queue has a single "atomic" variable (although it's a volatile long, Unsafe is used to add the semantics of an AtomicLong). Elements are stored using an ordered store, which is a pretty cheap operation, and the consumer sees published elements as soon as they are available. The two orders are independent, so the only contention between producers is the CAS loop => if a producer is temporarily frozen, other producers can make progress (until the queue is full, of course) => it scales much better.

        About the other caveats:

        • JCTools is safe on 32-bit architectures.
        • Padding is inheritance-based, so I'm sure it works on current JVMs.
        • About the potential isEmpty() issue, I have to think a little more about it to be sure, but I believe it's OK for both the Conversant and JCTools queues, since isEmpty() will always be called by the single consumer thread, so there shouldn't be any issue.
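The single-counter scheme described above (a CAS loop to claim a slot, then an ordered store to publish) can be sketched as follows. This is an illustrative toy, not JCTools source: the real queue uses Unsafe, field padding, and proper cross-thread visibility for the consumer index, all omitted here for brevity.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReferenceArray;

// Toy MPSC ring buffer: producers contend only on the CAS loop that claims a
// slot; the element itself is published with an ordered store (lazySet), so a
// stalled producer does not block other producers from claiming later slots.
final class MpscSketch<E> {
    private final AtomicReferenceArray<E> buffer;
    private final AtomicLong producerIndex = new AtomicLong();
    private long consumerIndex; // single consumer; simplified (no volatile/padding)
    private final int mask;

    MpscSketch(int capacityPowerOfTwo) {
        this.buffer = new AtomicReferenceArray<>(capacityPowerOfTwo);
        this.mask = capacityPowerOfTwo - 1;
    }

    boolean offer(E e) {
        long claimed;
        do {
            claimed = producerIndex.get();
            if (claimed - consumerIndex >= buffer.length()) {
                return false; // queue full
            }
        } while (!producerIndex.compareAndSet(claimed, claimed + 1)); // claim a slot
        buffer.lazySet((int) (claimed & mask), e); // ordered store publishes the element
        return true;
    }

    E poll() {
        int idx = (int) (consumerIndex & mask);
        E e = buffer.get(idx);
        if (e == null) {
            return null; // slot not yet published, or queue empty
        }
        buffer.lazySet(idx, null); // free the slot
        consumerIndex++;
        return e;
    }
}
```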
        jvz Matt Sicker added a comment - - edited

        Thank you for the patch! I'm working on adapting it to the branch for this ticket. There were a couple other tests that needed to be updated to add support for the JCTools version.

        Edit: I've adapted your patch with minor changes and new tests added. Please have a look in the feature/LOG4J2-1430 branch.

        Edit2: I've also made the issue LOG4J2-1439 to support the JCTools addition to give it better visibility in the changelog (and give proper credit to Anthony).

        jvz Matt Sicker added a comment - - edited

        JCTools does present the same problem that LMAX Disruptor has: the use of sun.misc.Unsafe. On OpenJDK this is not an issue, but the use of internal APIs like that is inherently non-portable. The fastest portable implementation for AsyncAppender right now appears to be LinkedTransferQueue, but that's unbounded. Edit: LinkedTransferQueue also uses Unsafe, but that's in the OpenJDK implementation. Being part of the JDK means the class itself must be available on all JDKs.

        Anthony Maire Anthony Maire added a comment -

        In theory you are perfectly right, but:

        • Most of the JVMs currently on the market support Unsafe, especially those commonly used in the high-performance community.
        • The Unsafe features that are required should become standard in JDK 9 (http://openjdk.java.net/jeps/193), so it will be easy to make this queue implementation standard in the near future.
        • If you would like the JCTools queue to become the standard for AsyncAppender but this dependency on Unsafe is a blocking issue, I can write a portable version without Unsafe so that you can test it and make a comparison. It should be a little slower than the original version, but it should keep the same kind of characteristics under heavy contention. Just let me know if you're interested.
        jvz Matt Sicker added a comment -

        OSGi presents some fun issues regarding the use of internal classes (e.g., you have to explicitly export the sun.misc package, which isn't done by default).
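For illustration, exposing sun.misc to bundles is usually done through the standard OSGi launching property shown below (the property name comes from the OSGi Core specification; whether it is needed out of the box varies by framework):

```properties
# Append sun.misc to the packages exported by the system bundle
org.osgi.framework.system.packages.extra=sun.misc
```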

        I do certainly want to add the JCTools version to AsyncAppender, but just like the Conversant version, neither can be the default since they require third-party libraries. Plus, LinkedTransferQueue doesn't support a maximum capacity, so it doesn't make much sense to switch to it by default, either.

        The thing I'd be most interested in is seeing how well AsyncAppender+JCTools compares to AsyncLogger/LMAX. These could be two great choices depending on your use of logging.

        Show
        jvz Matt Sicker added a comment - OSGi provides some fun issues regarding the use of internal classes (e.g., you have to explicitly export the sun.misc package which isn't done by default). I do certainly want to add the JCTools version to AsyncAppender, but just like with the Conversant version, neither can be the default as they require third party libraries. Plus, LinkedTransferQueue doesn't support a maximum capacity, so it doesn't make much sense to change to that by default, either. The thing I'd be most interested in is seeing how well AsyncAppender+JCTools compares to AsyncLogger/LMAX. These could be two great choices depending on your use of logging.
        Hide
        john@2ad.com John Cairns added a comment -

        Matt Sicker There is a branch of Conversant Disruptor that includes the MPMC strategy of Vyukov if you would like to test with it: https://github.com/conversant/disruptor/tree/feature/diffeq

        It's called "MPMCBlockingQueue."

        In my testing this implementation is slower than DisruptorBlockingQueue. I'm not sure why others are arguing for supporting overthreading of producers. Is that a common use case or just a whiteboard problem? I'd recommend having the fastest configuration for the correct number of threads. That will be Conversant Disruptor.

        BTW, if you set SpinPolicy=SPINNING for your benchmarks, it will be more of an apples-to-apples comparison with the others, although I don't recommend this as the default setting.

        Also, in terms of compatibility with non-Intel x64 platforms, you can use Conversant Disruptor 1.2.6, which should be safe on architectures with a weaker memory model than Intel's.

        john@2ad.com John Cairns added a comment -

        For this test you should set SpinPolicy.SPINNING. It doesn't make sense to use a waiting strategy and then measure the overhead of the waiting, does it?

        jvz Matt Sicker added a comment -

        I've updated the two AsyncAppender (no location and with location) benchmarks to use spinPolicy="SPINNING". I'll re-run the benchmarks on my work computer (from the OP) for new comparisons.
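As a sketch, the benchmark configuration change amounts to something like the following (element and appender names here are illustrative of the feature branch rather than final syntax):

```xml
<Async name="Async">
  <AppenderRef ref="NoOp"/>
  <!-- Conversant queue with the busy-spin wait strategy for the benchmark run -->
  <DisruptorBlockingQueue spinPolicy="SPINNING"/>
</Async>
```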

        jvz Matt Sicker added a comment -

        As for overthreading producers, my main app at work has over a thousand threads but only 12 cores to run it on. It's not a low latency app, however.

        john@2ad.com John Cairns added a comment -

        Most of the blocking queue strategies discussed here will not work well in that case. You're going to pay a fairly high price for that level of thread contention. LinkedTransferQueue transfers might be the best choice for this case. Overall, you would be best served by reducing your thread contention and using Conversant Disruptor.

        remkop@yahoo.com Remko Popma added a comment - - edited

        Bear in mind that Log4j is just the logging library used in some application. It doesn't control the number of threads that perform logging. Ideally Log4j would perform well with a wide range of threads, although it is acceptable to say in documentation something like "for best performance, use configuration X when your application has up to x threads, use configuration Y when it has more threads." But I'm not willing to say "reduce your thread count" in the Log4j documentation.

        jvz Matt Sicker added a comment -

        I've rerun the AsyncAppender benchmarks. Specs: 4 threads, 4-core 2.7 GHz Intel Core i7.

        Benchmark                                                                                (configFileName)   Mode  Samples        Score        Error  Units
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params          perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1206845.073 ±  32553.206  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params    perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  3540280.706 ± 142095.468  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params     perf5AsyncApndMpscQNoLoc-noOpAppender.xml  thrpt       20  3721962.221 ± 265353.892  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput10Params     perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  3211162.495 ±  64729.301  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params          perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1153345.166 ±  13144.671  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params    perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  3408585.363 ±  44104.092  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params     perf5AsyncApndMpscQNoLoc-noOpAppender.xml  thrpt       20  3579712.530 ± 257860.511  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput11Params     perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  3172050.033 ±  28729.288  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param            perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1314547.903 ±  13798.123  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param      perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  6367206.729 ±  24910.719  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param       perf5AsyncApndMpscQNoLoc-noOpAppender.xml  thrpt       20  7627619.055 ±  38062.874  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput1Param       perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  5495479.783 ±  43059.188  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1733486.606 ±  61734.478  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  5920363.934 ±  42182.740  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params      perf5AsyncApndMpscQNoLoc-noOpAppender.xml  thrpt       20  6678693.818 ±  31373.839  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput2Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  4917474.007 ± 200825.228  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1265648.334 ±  27079.677  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  5308803.910 ± 378505.356  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params      perf5AsyncApndMpscQNoLoc-noOpAppender.xml  thrpt       20  6539553.090 ± 177574.592  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput3Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  4609151.476 ± 108555.837  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1292712.115 ±  44903.210  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  4905343.490 ±  84729.064  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params      perf5AsyncApndMpscQNoLoc-noOpAppender.xml  thrpt       20  6067199.645 ±  74495.874  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput4Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  4457522.896 ±  25518.207  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1225613.756 ±  14661.952  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  4646320.584 ± 105446.125  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params      perf5AsyncApndMpscQNoLoc-noOpAppender.xml  thrpt       20  5121846.498 ± 249702.467  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput5Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  4253456.611 ± 128742.328  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1258539.457 ±  28386.367  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  4409586.991 ± 311831.674  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params      perf5AsyncApndMpscQNoLoc-noOpAppender.xml  thrpt       20  5484760.055 ± 111687.688  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput6Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  3843838.446 ±  57626.978  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1349756.153 ±  36750.141  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  4238115.386 ± 175533.446  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params      perf5AsyncApndMpscQNoLoc-noOpAppender.xml  thrpt       20  4849398.916 ±  75327.693  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput7Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  3898326.787 ±  55680.200  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1356061.837 ±  24627.270  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  4073197.956 ±  30465.904  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params      perf5AsyncApndMpscQNoLoc-noOpAppender.xml  thrpt       20  4633998.411 ±  72966.371  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput8Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  3618730.696 ±  52408.085  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params           perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1122352.399 ±  30281.259  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params     perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  3324498.475 ± 207361.244  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params      perf5AsyncApndMpscQNoLoc-noOpAppender.xml  thrpt       20  3751204.111 ± 467556.500  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughput9Params      perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  3355028.955 ±  73455.171  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple            perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20  1133597.159 ±  16817.020  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple      perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  6367044.731 ± 435715.161  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple       perf5AsyncApndMpscQNoLoc-noOpAppender.xml  thrpt       20  6761526.363 ±  87207.625  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple       perf5AsyncApndXferQNoLoc-noOpAppender.xml  thrpt       20  5282794.529 ± 259637.289  ops/s
        
        Anthony Maire added a comment - edited

        I made further tests today, and while I still believe that the Conversant Disruptor is not the best alternative because of the coupling between the slot-claiming order and the data-publishing order, the main issue is not there.

        These throughput benchmarks only measure one thing: how much the consumer thread is impacted by the wait strategy. And as soon as the strategy does not use condition variables, the longer the producer threads are paused, the better the throughput will be.

        As I said in a previous comment, even with 24 or 36 threads, only 12 hyperthreads were busy with the JCTools queue under a LockSupport.parkNanos wait strategy. So most of the time, the queue is full. I tried to play with the implementation of the wait strategy in my JCTools blocking wrapper, to use something closer to the one used in the LMAX or Conversant Disruptor, with some spin/yield before falling back to park.

        The result is not what I expected: the yield loop made the JCTools queue throughput lower! I think it can be explained by false sharing: when the queue is nearly full, the producers write to slots that are adjacent to the one being read by the consumer. Every time an event is inserted into the queue, that cache line is evicted from the consumer core's L1 cache, causing a cache miss. If producers are paused for a longer time, the consumer can append events without these cache misses for a longer time, and the throughput is better.

        That's why the Conversant Disruptor was performing much better under Windows than under Linux: the pause time of parkNanos is 20 times longer on Windows, and that was enough to avoid the negative impact of the spin/yield loop in the wait strategy.
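The "spin, then yield, then park" producer-side wait strategy discussed above can be sketched as follows. This is an illustrative stand-in, not any project's actual code: the spin/yield counts and park duration are arbitrary assumptions, and a JDK ArrayBlockingQueue (whose offer() is lock-based) stands in for the lock-free queues being compared.

```java
import java.util.Queue;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.locks.LockSupport;

public final class SpinYieldParkOffer {
    private static final int SPIN_TRIES = 100;     // assumed tuning values,
    private static final int YIELD_TRIES = 100;    // not taken from any library
    private static final long PARK_NANOS = 1_000L; // actual pause is OS-dependent

    /** Keeps offering until the bounded queue accepts the element. */
    public static <E> void put(Queue<E> queue, E element) {
        int spins = 0;
        int yields = 0;
        while (!queue.offer(element)) {
            if (spins < SPIN_TRIES) {
                spins++;                           // busy-spin: cheapest if space frees up quickly
            } else if (yields < YIELD_TRIES) {
                yields++;
                Thread.yield();                    // give up the time slice but stay runnable
            } else {
                LockSupport.parkNanos(PARK_NANOS); // back off; wakes up without a condition variable
            }
        }
    }

    public static void main(String[] args) {
        Queue<Integer> q = new ArrayBlockingQueue<>(4);
        for (int i = 0; i < 4; i++) {
            put(q, i); // fits within capacity, so no waiting occurs
        }
        System.out.println(q.size());
    }
}
```

The thread-comment above explains why this shape can backfire: the faster the producers retry, the more cache misses they inflict on the consumer once the queue is nearly full.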

        The main question with these benchmarks is: what do we want to measure? How can we say that one implementation is more "performant" than another? Throughput with a no-op appender is maybe not the best metric: in real applications there will be real filters and appenders that cannot sustain such a throughput, so we are measuring something that does not exist in the real world.

        In my opinion, there are two use cases where an async appender is useful:

        • when we need to increase the maximal throughput of the underlying appender thanks to batching (cf. the RandomAccessFile appender with immediateFlush = false). The queue implementation has little impact on this
        • when we need to keep latency low in the application thread under a realistic load (i.e. we don't expect the queue to be always nearly full; if it is, then another part of the system needs improvement)

        As Remko said, Log4j is used in a lot of situations that we cannot control. And saying something like "you shouldn't have more active threads than cores in your app" is not a solution; if the server is shared with other apps it will not be enough. Being lock-free on the producer side (as long as the queue is not full) is a mandatory characteristic in my opinion.

        The intrinsic latency of the queue is often not the main characteristic: lots of things are done in the main thread before enqueuing, especially with parameterized messages. The call to the logger can take a few microseconds, whereas any reasonable high-performance collection will enqueue in something like 100 ns under low contention (and the underlying appender will usually saturate well before the queue sees high contention).

        Anthony Maire added a comment -

        I made another test with the JCTools queue using the "yield then park" wait strategy that caused the throughput to drop earlier. I changed the implementation of the put() method to use offerIfBelowThreshold() instead of offer() (it requires a snapshot version of JCTools; this method is broken in 1.2, and the fix was merged earlier today). This way, I made sure that at least 32 slots in the underlying buffer stay free, so producer threads no longer cause cache misses for the consumer thread.

        Now the throughput is roughly the same as the "pure park" wait strategy.

        That confirms the feeling I described in the previous post: for any good enough (i.e. non-blocking) queue implementation, the AsyncAppender throughput test only measures how badly the consumer thread is impacted by the wait strategy when the queue is full.
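The headroom idea described in this comment can be sketched with the standard library. JCTools exposes it directly as offerIfBelowThreshold(e, threshold); this stand-in approximates the behavior with a bounded JDK queue, and the headroom value of 32 mirrors the comment above. Class and method names are illustrative only.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.locks.LockSupport;

/** Refuses to enqueue unless a minimum number of slots stays free, so
 *  producers never touch the slots the consumer is about to read. */
public final class HeadroomQueue<E> {
    private final ArrayBlockingQueue<E> queue;
    private final int headroom;

    public HeadroomQueue(int capacity, int headroom) {
        this.queue = new ArrayBlockingQueue<>(capacity);
        this.headroom = headroom;
    }

    /** Enqueue only when more than 'headroom' slots remain free. */
    public boolean offerWithHeadroom(E e) {
        // Check-then-act is racy across producers; acceptable as a sketch,
        // since an overshoot only shrinks the headroom temporarily.
        if (queue.remainingCapacity() > headroom) {
            return queue.offer(e);
        }
        return false;
    }

    /** Blocking variant: park briefly until headroom is available. */
    public void put(E e) {
        while (!offerWithHeadroom(e)) {
            LockSupport.parkNanos(1_000L);
        }
    }

    public E poll() {
        return queue.poll();
    }

    public int size() {
        return queue.size();
    }
}
```

With this shape, the producers back off well before the queue is literally full, which is exactly what restored the throughput in the test above.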

        Matt Sicker added a comment -

        So what would you suggest for another benchmark? I was under the impression these tests used a trivial appender so that the appender's own performance wasn't being measured here (there are separate benchmarks for that).

        John Cairns added a comment -

        There is no way for a queue of this nature to have constant throughput with increasing thread counts. I haven't looked at his benchmark, but it sounds like it's overly dependent on waiting logic. A benchmark of these queues should not have any waiting involved; in that case you're just measuring park time.

        In my benchmarks for the MPMC queue, Conversant Disruptor is faster, but both queues degrade equally as the number of threads increases; if he isn't finding that, then the test is wrong. The number of cores is fixed, so throughput cannot go up as more threads contend for cores.

        Matt Sicker added a comment -

        I haven't run the benchmarks with a large number of threads yet. Let me try that out and I'll post the results. So far, my tests have used either threads=cores or threads=2*cores.

        John Cairns added a comment -

        All of the Disruptor approaches will have the same Achilles heel when it comes to large numbers of threads. This is because the nature of speculative racing demands that every core that misses on the speculation will spin or park. This does not mean that Conversant Disruptor is unusable in that case, only that this case is not a good one for the Disruptor strategy. If your users are using a high-performance queue then they will have to understand the limits of these queues. It's a terrible fine point to claim that a particular strategy works well with a middling number of threads and ignore the fact that none work well with high numbers of threads. Developers of performance-sensitive apps should use Conversant Disruptor and the lowest number of threads possible.

        John Cairns added a comment -

        These park and yield strategies require a lot of tuning and testing to get right. Jamming one out to cook the benchmark doesn't seem like a good idea to me.

        Matt Sicker added a comment -

        Here is the simple test benchmarked with 32 threads on a 4-core machine (the LinkedTransferQueue test timed out for some reason):

        Benchmark                                                                              (configFileName)   Mode  Samples        Score        Error  Units
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple          perf5AsyncApndNoLoc-noOpAppender.xml  thrpt       20   452156.178 ±   2895.732  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple    perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       20  1199652.525 ± 102706.289  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple     perf5AsyncApndMpscQNoLoc-noOpAppender.xml  thrpt       20  1595657.925 ± 137987.755  ops/s
        
        Anthony Maire added a comment -

        The approach of using a NoOp appender for testing the AsyncAppender isn't a bad idea at all. The "issue" is that it's easy to misinterpret the results.

        The issue is that we naturally expect the queue to be the bottleneck (and it is the bottleneck in a lot of ABQ tests), but with these high-performance queues, the single-threaded code between queue.take() and appender.append() is the bottleneck. With 8 producer threads, the consumer thread was 100% busy but the producers were not. So we are measuring something that doesn't reflect the real world (in a real application we don't expect the queue to be always full), and this benchmark is not a good way to show which queue implementation is best.

        I'm probably biased because I work for a low-latency company, but the best metric to evaluate the "performance" of the AsyncAppender is latency, and latency is really damn hard to measure (cf. this excellent talk by Gil Tene on the subject: https://www.infoq.com/presentations/latency-response-time ). For such cases, I usually measure latency percentiles at a load that is below maximal throughput (since all systems suck from a latency point of view when running near their saturation point) by recording all individual results in an HdrHistogram. I don't think that can be done with JMH; I usually write custom load generators for this.
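The "record every individual result, then read percentiles" approach described above can be sketched without external dependencies. HdrHistogram (the library named in the comment) does this in constant space with bounded relative error; this self-contained stand-in simply keeps the raw samples and sorts them, which is enough to illustrate the measurement idea.

```java
import java.util.Arrays;

/** Minimal stand-in for HdrHistogram-style latency recording:
 *  store every sample, read percentiles afterwards. */
public final class LatencyRecorder {
    private long[] samples = new long[1024];
    private int count;

    /** Record one latency sample, growing the backing array as needed. */
    public void record(long nanos) {
        if (count == samples.length) {
            samples = Arrays.copyOf(samples, samples.length * 2);
        }
        samples[count++] = nanos;
    }

    /** Value at the given percentile (0 < p <= 100), nearest-rank method. */
    public long valueAtPercentile(double p) {
        long[] sorted = Arrays.copyOf(samples, count);
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * count);
        return sorted[Math.max(0, rank - 1)];
    }

    public int count() {
        return count;
    }
}
```

The key point from the comment is how this is driven: samples are taken at a fixed offered load below saturation, so the percentiles describe the system under realistic conditions rather than at its choking point.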

        A major caveat (among many others) with measuring latency is that the production environment may be very different from the idealized environment you have when benchmarking. In particular, the system might be overloaded (i.e. there are more active threads on the whole machine, including kernel threads, than available cores) and/or virtualized. In that situation the kind of concurrency (wait-free, lock-free, ...) matters a lot, since threads can be suspended for several milliseconds.

        Showing this effect in a benchmark is hard, because it occurs at a very low frequency under a "production-like" load and there are other causes of low-frequency outliers. It may require lots of analysis. The usual way to show that an algorithm has poorer concurrency properties is to increase the throughput and/or the number of threads of the load generator until some candidates behave poorly: lock-free algorithms will scale better than blocking ones.

        But with the current AsyncAppender code, this kind of test is not possible, because the async thread isn't fast enough. To compare queues, the best approach is to compare them outside the AsyncAppender. Otherwise, we would need to optimize the AsyncAppender itself considerably to show differences between queues, which might be a pretty useless effort since the current code is probably efficient enough with a real underlying appender.

        Matt Sicker added a comment -

        Remko Popma: you developed these benchmarks, care to weigh in?

        Remko Popma added a comment - edited

        We try to standardize on JMH for benchmarking. Generally this is a good thing. In the case of Async Loggers/AsyncAppender JMH may not be a good fit. I asked for advice on this on the Mechanical Sympathy mailing list (Topic: "Garbage-free log4j: request for feedback on measurement methodology", starting April 21, 2016). The consensus there seemed to be:

        • Throughput testing: JMH may not be a good fit. JMH will rapidly generate requests and the queue will fill up, so we end up measuring how fast you can take things off the queue. Martin Thompson suggested to mock out the writing to disk in the appender, which is what we were already doing and are still doing with the NoOpAppender.
        • Latency: JMH is not intended to do latency testing under various workloads. Based on Gil Tene's suggestions and feedback we arrived at org.apache.logging.log4j.core.async.perftest.ResponseTimeTest for this purpose.

        An alternative for the throughput tests is to use the hand-rolled PerfTest and MultiThreadPerfTest classes in the org.apache.logging.log4j.core.async.perftest package. This is what I used to obtain the initial Async Logger performance test results for the Log4j 2.0-beta5 release. This test takes care not to do more iterations than the queue size, so it avoids the above problem where the queue is full. It does write to disk, which I hope makes the test more realistic. It is not as easy to use as the JMH benchmarks though. Perhaps we should look at improving these tests to make them easier to use. An initial improvement attempt is SimplePerfTest in the same package.
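The idea Remko describes — never doing more iterations per measured burst than the queue capacity, so enqueues never hit a full queue — can be sketched as follows. This is not the actual PerfTest code; class and method names are illustrative, and a JDK ArrayBlockingQueue stands in for whatever queue is under test.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public final class BurstThroughputSketch {

    /** Time the enqueue side only, for a burst no larger than the free capacity,
     *  so offer() never fails and we never measure full-queue behavior. */
    static long timeBurstNanos(BlockingQueue<String> queue, int burstSize) {
        long start = System.nanoTime();
        for (int i = 0; i < burstSize; i++) {
            queue.offer("msg"); // always succeeds: burstSize <= remaining capacity
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int capacity = 1024;
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(capacity);
        long nanos = timeBurstNanos(queue, capacity); // burst bounded by capacity
        queue.clear();                                // drain between bursts
        long opsPerSec = capacity * 1_000_000_000L / Math.max(1, nanos);
        System.out.println(opsPerSec + " ops/s (enqueue side only)");
    }
}
```

The trade-off, as the comment notes, is that this measures the producer path under headroom rather than sustained end-to-end throughput, which is why the hand-rolled tests also write to disk to stay realistic.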

        Anthony Maire added a comment -

        I had a look at the ResponseTimeTest; it's exactly what I was thinking of for a good latency benchmark. I will try to find some time to play with it.

        Anthony Maire added a comment -

        I made some additional tests on a dual Xeon server (with 24 hyperthreads) to see how both queues behave under heavy contention (running lots of threads on a 4-core machine does not show contention for this kind of queue, since the CAS operations will never be contended by a sleeping thread):

        • I first tried to make some tests with the original ResponseTimeTest and AsyncAppender configuration files. It was pretty hard to find differences between the two queues, because they both perform pretty well when the throughput is way below the AsyncThread max throughput, and the small differences between them are not significant because their amplitude is orders of magnitude below the GC pauses caused by the appender itself.
        • I tried to run 10 threads at 100K msg/s while in the background another process was spinning on 18 threads to simulate contention by other processes (without tuning, another JVM will run GC on 18 threads on this machine). As expected, I saw that the test with the Conversant queue was performing a little worse than the JCTools queue, but the difference is not that impressive.

        I made additional tests on the queues by themselves, not used in an AsyncAppender (by writing custom RunXXX classes in the ResponseTimeTest package):

        • 20 producer threads and 1 consumer thread, all with a spinning wait policy, so I expect to see 2100% CPU usage in top
        • Pacer uses a Yield wait strategy (service time was worse with spinning since there were more threads than physical cores)
        • 20-minute test (with 1 minute of warmup before)

        Since the AsyncThread was able to sustain a throughput of nearly 5 million msg/s on this machine, I tried to run the producer threads at 250K ops/s each (so roughly 6 billion operations for the whole test):

        • The JCTools queue was able to execute the test, with a maximum response time of 11.8 ms and a service-time 99.99th percentile of 2.8µs.
        • The Conversant queue failed: only 2.82 billion ops were made, the median response time is 10.4ms and the median service time is 3.1µs. Moreover, CPU usage never reached 2100%, but was often between 1600% and 1800% because of the LockSupport.parkNanos() call in Condition.progressiveYield() in their offer method when producers are contending with each other (this queue is not lock-free, although the contention impact is way lighter than in a lock-based implementation such as ABQ).

        I made another test with 50K ops/s per thread (1M ops/s total, way below the AsyncThread max throughput). Both queues are able to execute the test, but the Conversant queue's service time is worse than the JCTools one (which has nearly the same service time in the two tests). Look at the vertical line in the attached screenshot for the Conversant curve (blue): this 50µs gap matches the minimal pause achievable with LockSupport.parkNanos() on this machine. It shows that it's the same contention issue between producers which prevented this queue from succeeding in the 5M ops/s test.
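        The parkNanos() floor mentioned above is easy to observe directly. A minimal stdlib-only sketch (the measured floor is machine- and OS-dependent; the ~50µs figure is specific to the box used in the test above):

```java
import java.util.concurrent.locks.LockSupport;

public class ParkNanosFloor {
    /** Returns the actual elapsed nanoseconds of a parkNanos(requestedNanos) call. */
    static long measureParkNanos(long requestedNanos) {
        long start = System.nanoTime();
        LockSupport.parkNanos(requestedNanos);
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        long best = Long.MAX_VALUE;
        // Ask for a 1µs pause many times; the smallest observed pause
        // approximates the scheduler-imposed floor (often tens of µs).
        for (int i = 0; i < 10_000; i++) {
            best = Math.min(best, measureParkNanos(1_000L));
        }
        System.out.printf("minimum observed parkNanos(1µs) pause: %.1f µs%n", best / 1000.0);
    }
}
```

        Any producer that backs off through parkNanos() therefore pays at least this floor once it hits contention, which is what the vertical gap in the service-time curve shows.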

        => The JCTools queue provides more predictable latencies and a higher maximal throughput because of its lock-free nature.

        By the way, I wonder if it really makes sense to provide specific alternative queue support for the async appender, since this support implicitly means "we tested this queue, it rocks hard". Being able to inject a custom factory is maybe enough. Currently the ABQ is probably enough for 99% of the users of asynchronous logging, and "performance hardcore users" should use AsyncLogger since it is garbage-free and more efficient (and using the original LMAX disruptor offers some possibilities that will never be possible with a queue, such as LOG4J2-1424)...
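        For context, the "inject a custom factory" idea can be as small as a single functional interface. A hypothetical sketch (the interface and names below are illustrative only; the plugin API Log4j eventually ships may differ):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueFactorySketch {
    /**
     * Hypothetical injection point: the appender asks a factory for its queue
     * instead of hard-coding ArrayBlockingQueue, so any BlockingQueue
     * implementation (Conversant, a JCTools wrapper, LinkedTransferQueue, ...)
     * can be swapped in without touching appender code.
     */
    interface BlockingQueueFactory<E> {
        BlockingQueue<E> create(int capacity);
    }

    public static void main(String[] args) throws InterruptedException {
        // Default behaviour expressed through the factory: a plain ABQ.
        BlockingQueueFactory<String> factory = ArrayBlockingQueue::new;
        BlockingQueue<String> queue = factory.create(1024);
        queue.put("log event placeholder");
        System.out.println(queue.take()); // prints "log event placeholder"
    }
}
```

        With such a seam in place, shipping first-party support for a particular queue becomes a packaging and documentation question rather than a code change.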

        Matt Sicker added a comment -

        Do you have a patch or PR for these tests?

        Anthony Maire added a comment -

        I will finish the JCTools wrapper implementation (since there is a new JCTools release with a bug fix that I needed) and release a cleaner patch in the very near future.

        Matt Sicker added a comment -

        Ok, thank you for all the help!

        Anthony Maire added a comment -

        The patch has been attached to LOG4J2-1439.

        John Cairns added a comment -

        I've read some of the FUD being posted, so I wanted to post a head-to-head comparison to show what is really at stake here. In the attached plot you see the simplethroughput benchmark with Conversant Disruptor and MPSC isolated. You can immediately see that for a low number of threads, i.e., applications with fewer than 8 threads writing to the log simultaneously, Conversant Disruptor is superior without any question. Of course, this is the common case for an application where only a handful of threads write log messages at any given time.

        In the hypothetical case proposed where applications have dozens or more threads contending to write to the log, most engineers would be well served by altering their application prior to using any disruptor. Simply put, an application with tens of threads logging will not be bottlenecked by the queue. In some of these cases one could use MPSC as a way to mitigate losses for a time. But I could never imagine such an application functioning at its best.

        As further damning evidence against JCTools, in this comparison we are not comparing apples to apples: we are comparing the JCTools MPSC vs the Conversant MPMC. That should be a slam-dunk victory for JCTools, but it isn't. In fact, the over-protected Conversant Disruptor is outperforming the JCTools MPSC.

        Everyone reading this should carefully measure the performance in their own environment before deciding on which queue to use, but certainly anyone who exercises reasonable controls on their threading model will be best served by Conversant Disruptor.

        I ran these benchmarks on the latest 24 core Xeon processors with Hyperthreading disabled:
        /proc/cpuinfo reports:
        Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

        John Cairns added a comment -

        Here is the actual data I collected:

        Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
        24 cores - hyperthreading disabled

        2 thread:
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndDsrptrNoLoc-noOpAppender.xml thrpt 10 1696620.726 ± 650776.577 ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndMpscQNoLoc-noOpAppender.xml thrpt 10 1857956.303 ± 229077.731 ops/s

        3 thread:
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndDsrptrNoLoc-noOpAppender.xml thrpt 10 2617984.622 ± 13343.976 ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndMpscQNoLoc-noOpAppender.xml thrpt 10 2791841.273 ± 384054.383 ops/s

        4 thread:
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndDsrptrNoLoc-noOpAppender.xml thrpt 10 3049289.872 ± 84247.960 ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndMpscQNoLoc-noOpAppender.xml thrpt 10 2552907.350 ± 72146.621 ops/s

        5 thread:
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndDsrptrNoLoc-noOpAppender.xml thrpt 10 3027603.165 ± 31798.646 ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndMpscQNoLoc-noOpAppender.xml thrpt 10 2770962.194 ± 65659.289 ops/s

        6 thread:
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndDsrptrNoLoc-noOpAppender.xml thrpt 10 2963524.783 ± 48958.375 ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndMpscQNoLoc-noOpAppender.xml thrpt 10 2741728.798 ± 43767.165 ops/s

        7 thread:
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndDsrptrNoLoc-noOpAppender.xml thrpt 10 2226923.278 ± 67272.401 ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndMpscQNoLoc-noOpAppender.xml thrpt 10 1815899.328 ± 91079.403 ops/s

        8 thread:
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndDsrptrNoLoc-noOpAppender.xml thrpt 10 2498785.188 ± 54002.525 ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndMpscQNoLoc-noOpAppender.xml thrpt 10 2407604.284 ± 79145.961 ops/s

        9 thread:
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndDsrptrNoLoc-noOpAppender.xml thrpt 10 2182057.870 ± 175238.533 ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndMpscQNoLoc-noOpAppender.xml thrpt 10 2614756.139 ± 15393.854 ops/s

        10 thread:
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndDsrptrNoLoc-noOpAppender.xml thrpt 10 2174825.507 ± 22335.915 ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndMpscQNoLoc-noOpAppender.xml thrpt 10 2882521.436 ± 11394.649 ops/s

        15 thread:
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndDsrptrNoLoc-noOpAppender.xml thrpt 10 2181973.945 ± 29673.652 ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndMpscQNoLoc-noOpAppender.xml thrpt 10 2739778.681 ± 130812.015 ops/s

        20 thread:
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndDsrptrNoLoc-noOpAppender.xml thrpt 10 1990073.042 ± 370165.515 ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndMpscQNoLoc-noOpAppender.xml thrpt 10 2954562.222 ± 13491.607 ops/s

        25 thread:
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndDsrptrNoLoc-noOpAppender.xml thrpt 10 1606942.759 ± 19518.832 ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple perf5AsyncApndMpscQNoLoc-noOpAppender.xml thrpt 10 2769796.044 ± 623809.181 ops/s

        Remko Popma added a comment -

        Usually there is no absolute better or worse; there is a set of trade-offs, and it is up to our users to decide which is the most attractive option.

        To make progress on this ticket, there is the programming work, but perhaps more involved and more important is the documentation. If we support multiple options the docs need to clarify what the trade-offs are.

        By the way, are the following all the options or are there variations (or relevant config changes etc) that we should add to the list?

        • ArrayBlockingQueue
        • LinkedTransferQueue
        • Conversant MPMC (does a MPSC exist?)
        • JCTools MPSC

        For throughput comparison between the options, just using the AsyncAppenderLog4j2Benchmark.throughputSimple benchmark is probably fine, unless anyone sees a reason to also compare with a varying number of parameters. It is important to test throughput with a varying number of threads, like John did above and like we do with all throughput tests on the performance page.

        Are there any other quantifiable characteristics (other than throughput) that can be measured and compared? Response time latency? Pressure on the garbage collector? We should start to gather these numbers and create some graphs that compare the options.

        Then there are the non-quantifiable characteristics:

        • Does it need external dependencies?
        • Does it use the Unsafe?
        • Is it a bounded queue?
        • Any other?

        How do we present the non-quantifiable characteristics, do we simply write a few paragraphs, or do we compare them in a table or visualize them some other way?
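        As one way to present the choice to users, the configuration could name the queue implementation per appender. A hypothetical sketch (the element name DisruptorBlockingQueue and the exact nesting are assumptions here, since the plugin names were still under discussion at this point):

```xml
<Configuration status="warn">
  <Appenders>
    <File name="File" fileName="app.log"/>
    <Async name="Async">
      <AppenderRef ref="File"/>
      <!-- Hypothetical: select the queue implementation for this appender -->
      <DisruptorBlockingQueue/>
    </Async>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="Async"/>
    </Root>
  </Loggers>
</Configuration>
```

        A table in the manual could then map each queue element to its trade-offs (dependencies, Unsafe usage, boundedness, measured throughput).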

        Remko Popma added a comment - edited

        Matt, to avoid misunderstanding, this is your initiative and I am more than happy for you to take the lead on this one.
        I just thought that it would benefit everyone if it is clear what we need to do to help focus the discussion.

        John Cairns added a comment -

        Remko Popma, I agree wholeheartedly. I want to make sure the real measurements don't get lost in a sea of words and ambiguous "concerns" and fine points, of which I've heard many on this thread.

        For naming "Conversant Disruptor" would be correct.

        Currently we support MPMC and SPSC. I will add it to my project list to investigate an MPSC solution. Given that with <8 threads we have good performance compared to the JCTools solution, I think it is satisfactory to move forward with the MPMC at this time.

        One other non-quantifiable would be: is it a BlockingQueue, or does it need to be wrapped? I.e., the wrapping implementation may be less robust regardless of the merits of the queue.
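        To illustrate that wrapping concern: a non-BlockingQueue has to be adapted with its own waiting behaviour, and the adapter is exactly where robustness bugs can creep in (interrupt handling, capacity accounting, timeout math). A self-contained sketch backed by the JDK's ConcurrentLinkedQueue purely so it compiles on its own; a real adapter would wrap e.g. a JCTools queue, and this is illustrative only:

```java
import java.util.AbstractQueue;
import java.util.Collection;
import java.util.Iterator;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;

/** Minimal adapter turning a non-blocking queue into a BlockingQueue by parking. */
public class SpinBlockingQueue<E> extends AbstractQueue<E> implements BlockingQueue<E> {
    private final ConcurrentLinkedQueue<E> q = new ConcurrentLinkedQueue<>();
    private final int capacity;

    public SpinBlockingQueue(int capacity) { this.capacity = capacity; }

    @Override public boolean offer(E e) {
        // size() is O(n) on ConcurrentLinkedQueue; acceptable for a sketch only.
        if (q.size() >= capacity) return false;
        return q.offer(e);
    }
    @Override public E poll() { return q.poll(); }
    @Override public E peek() { return q.peek(); }

    @Override public void put(E e) throws InterruptedException {
        while (!offer(e)) {               // back off instead of taking a lock
            LockSupport.parkNanos(1L);
            if (Thread.interrupted()) throw new InterruptedException();
        }
    }
    @Override public E take() throws InterruptedException {
        E e;
        while ((e = poll()) == null) {
            LockSupport.parkNanos(1L);
            if (Thread.interrupted()) throw new InterruptedException();
        }
        return e;
    }
    @Override public boolean offer(E e, long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (!offer(e)) {
            if (System.nanoTime() >= deadline) return false;
            LockSupport.parkNanos(1L);
        }
        return true;
    }
    @Override public E poll(long timeout, TimeUnit unit) {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        E e;
        while ((e = poll()) == null) {
            if (System.nanoTime() >= deadline) return null;
            LockSupport.parkNanos(1L);
        }
        return e;
    }
    @Override public int remainingCapacity() { return capacity - q.size(); }
    @Override public int drainTo(Collection<? super E> c) { return drainTo(c, Integer.MAX_VALUE); }
    @Override public int drainTo(Collection<? super E> c, int max) {
        int n = 0; E e;
        while (n < max && (e = poll()) != null) { c.add(e); n++; }
        return n;
    }
    @Override public Iterator<E> iterator() { return q.iterator(); }
    @Override public int size() { return q.size(); }

    public static void main(String[] args) throws InterruptedException {
        SpinBlockingQueue<String> q = new SpinBlockingQueue<>(2);
        q.put("a");
        q.put("b");
        System.out.println(q.offer("c") + " " + q.take()); // prints "false a"
    }
}
```

        Note how much behaviour the adapter owns (park duration, interrupt checks, the non-atomic size/offer capacity check); each of those is a place where a wrapper can be subtly less robust than a queue that implements BlockingQueue natively.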

        As far as benchmarking, I would consider the queue capacity. For capacities below 128, many of these queues will get serious performance advantages by fitting into L1 cache; at least Conversant Disruptor does. On the other hand, for throughput benchmarks that saturate the queue, high capacity might actually exhibit much better results overall. It would be a good idea to try to measure that tradeoff.

        Matt Sicker added a comment -

        No, I appreciate the help. I agree that the documentation part is very important, which is why I haven't merged these two features into master yet.

        Matt Sicker added a comment -

        I'm going to need help pulling together benchmark data for all 4 implementations as my computer (Intel(R) Core(TM) i7-3820QM CPU @ 2.70GHz) is apparently not good enough for realistic benchmarks. We have some 12-core servers at work, but I doubt I could use any of them to do this. Also, spinning up an EC2 instance doesn't seem right as it could be affected by noisy neighbors.

        John Cairns added a comment -

        Matt Sicker what do you need? All four implementations in 1-20 threads on the latest Xeons? I can provide that no problem! Just let me know what you want.

        Matt Sicker added a comment -

        That would be great, thanks! We might devise some other benchmarks for the other scenarios presented above, too.

        Anthony Maire added a comment -

        If you have enough time (I don't have access for a while to the Haswell box I was using earlier), can you test the maximum throughput that a real appender can sustain? Can you please change the NoOpAppender used in the test to a RandomAccessFile appender with immediateFlush=false (I think that is the appender that offers the best throughput currently), with 8 threads and either the JCTools or Conversant queue?

        Basically, if all queue implementations can deliver more throughput than a real underlying appender can sustain, comparing the throughput of the queues does not make any sense, and only other characteristics such as offer latency and GC pressure really matter.

        Anthony Maire added a comment -

        LinkedBlockingQueue should be considered too, I think: it has better concurrency properties than ABQ, but it can still be bounded. It is basically a good tradeoff between ABQ and LTQ.

        Matt Sicker added a comment -

        Could you include AsyncLogger in the comparison?

        John Cairns added a comment -

        log4jhaswell2cpu2core.jpg: AsyncAppender Comparison with available queue strategies.

        Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
        @33x clock (3.3Ghz) 2 cpus with 2 cores, 1 core on each cpu selected

        John Cairns added a comment -

        log4jthread2cpu2core.log: AsyncAppender JMH output with available queue strategies and 2-20 threads

        Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
        @33x clock (3.3Ghz) 2 cpus with 2 cores, 1 core on each cpu selected

        John Cairns added a comment -

        log4jHaswell2cpu4core.jpg: AsyncAppender Comparison with available queue strategies.

        Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
        @29x clock (2.9Ghz) 2 cpus with 12 cores, 2 cores on each cpu selected

        Show
        john@2ad.com John Cairns added a comment - log4jHaswell2cpu4core.jpg: AsyncAppender Comparison with available queue strategies. Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz @29x clock (2.9Ghz) 2 cpus with 12 cores, 2 cores on each cpu selected
        John Cairns added a comment -

        log4jthread2cpu4core.log: AsyncAppender JMH output with available queue strategies.

        Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
        @29x clock (2.9GHz), 2 CPUs with 12 cores, 2 cores on each CPU selected
        John Cairns added a comment -

        log4jrafile.log: JMH output for the "real world" experiment involving the RandomAccessFile appender, 8 threads only.

        Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
        @29x clock (2.9GHz), 2 CPUs with 12 cores, 2 cores on each CPU selected
        John Cairns added a comment -

        Here you go. There does appear to be a difference in "real-world" performance depending on which queue is in play. This test is exactly as you requested: 8 threads, RandomAccessFile, immediateFlush=false.

        Benchmark                                                  (configFileName)                             Mode  Samples       Score         Error  Units
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndNoLoc-noOpAppender.xml        thrpt       10  216280.866  ±  12260.529  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       10  490709.762  ±  24567.828  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscQNoLoc-noOpAppender.xml   thrpt       10  403533.601  ± 278530.384  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndXferQNoLoc-noOpAppender.xml   thrpt       10  246531.593  ± 559182.197  ops/s
        John Cairns added a comment -

        AsyncLogBenchmarks.log: AsyncLoggersBenchmark throughputSimple test on Haswell:

        Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
        @33x clock (3.3GHz), 2 CPUs with 2 cores, 1 core on each CPU selected
        John Cairns added a comment -

        All of the above benchmarks have been posted.
        Anthony Maire added a comment - edited

        Hello John,

        Thank you for all these benchmarks, but I'm afraid they are not really relevant. Can you explain why you constrained the JVM to run on only 2 or 4 cores, with QPI transfers involved? I cannot see the point. As soon as the number of threads is high enough, we expect the queue to be mostly full, so for the non-JDK queues we will have a lot of threads spinning on very few cores, which is really bad.

        Can you either:

        • use a park-based strategy for the queues (i.e. WAITING for Conversant and the default PARK for JCTools) if this kind of taskset setup matches your main use cases: I cannot imagine an application with 10 times more active threads than available cores that wants to use a spin- or yield-based strategy;
        • not constrain the JVM, letting it use the 24 cores available on your box; or
        • constrain the JVM to a whole NUMA node (using numactl -N 1 -m 1 java -jar ...) but run no more than 11 spinning producer threads.

        Moreover, in such a heavily contended scenario there is usually a throughput vs. latency trade-off (if some threads are frozen, contention is released and overall throughput increases). I'm pretty sure that running the same benchmark with JMH "sample time" mode will show that the JCTools queue's latency is lower and more stable than the Conversant one's. That is also why, in the 2-core scenario, the JCTools queue performs worse than ABQ as the thread count increases: this kind of thing cannot occur with realistic loads (i.e. with no more spinning threads than available cores, or with a park-based waiting strategy).

        Kind regards,
        Remko Popma added a comment -

        It would be good if we can end up with a recommendation to users about the benefits/disadvantages of the various options under various circumstances. So, if the Conversant Disruptor gives better throughput when running on a box with very few cores, that is fine and that is useful advice to our users. (I would also like to do a latency comparison on the same configuration if possible.)

        I think we can agree that spinning wait strategies are not suitable for general purpose logging. Very few applications would be willing to sacrifice a whole core for better logging throughput/latency. If multiple wait strategies are available (configurable in Log4j) then we should document the options, but I would not want to emphasize this option. I certainly don't want to publish any benchmarks based on spinning wait. For other wait strategies we need to document the trade-offs: which has higher CPU usage? Which is better in resource-restricted environments (like virtual environments)? Which has better latency behaviour?
        John Cairns added a comment -

        Anthony, I'm afraid these benchmarks are perfectly relevant. I will write a detailed blog post about these results in a few days and post the link here so you can know my thoughts. For now, suffice it to say that I tested 2 configurations: a 4-core high-clock-rate box typically used by performance enthusiasts, and a 24-core business-class box typically used in enterprise environments. In both cases, I constrained the number of cores dedicated to logging to 1 or 2, as would be expected for most applications. I agree with you that a high number of threads is not relevant, but that was the question you introduced in the discussion. Personally, I would have only benchmarked a few threads, as that is the common case for 99% of applications using log4j, and it is also the best-case scenario for Conversant Disruptor. Any application with a high number of threads contending on the log would simply fill up the disk and fail. That's not a very interesting use case.

        Here are the results you requested on my 4-core 3.3GHz Haswell Xeon box. Both Conversant and JCTools are using a parking/waiting strategy. I have not used taskset to select a CPU. You can see that this really impairs the performance of both approaches, as we are now spending time waiting that we didn't before. I also believe this is fairly atypical, in that the majority of performance enthusiasts would use taskset to select the cores for their application. It is also not typically done in benchmarking, as it introduces contention with other system processes that we should not be measuring. I'm not sure how beneficial this is to anyone, but I hope it satisfies your curiosity:

        $ java -jar benchmarks.jar -jvmArgs '-server -XX:+AggressiveOpts -XX:CompileThreshold=100' ".*AsyncAppenderLog4j2Benchmark.throughputSimple" -t 8 -si true

        Benchmark                                                  (configFileName)                             Mode  Samples        Score         Error  Units
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndNoLoc-noOpAppender.xml        thrpt       10  1943898.065  ± 653186.710  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       10  2831824.627  ±  55738.821  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscQNoLoc-noOpAppender.xml   thrpt       10  2645833.072  ± 112520.607  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndXferQNoLoc-noOpAppender.xml   thrpt       10  1644976.944  ± 654820.239  ops/s
        Matt Sicker added a comment -

        Could you compare these with the LMAX implementation as well? AsyncLogger is relevant to the discussion, I think.
        John Cairns added a comment -

        Yes, SpinPolicy.WAITING is the correct choice for Conversant Disruptor; it is very highly tuned and performs very well.

        In a resource-constrained environment, someone could use SpinPolicy.BLOCKING (Java-style locking). However, this comes with performance trade-offs that some users may not anticipate, so WAITING is certainly the pragmatic default.

        From the standpoint of benchmarking, the point should be to determine which queue is best in typical application environments. In order to do that, you must isolate the performance of the queue by removing operating-system overhead and other things that impact the timing and are orthogonal to the choice of queue. That is why I have used taskset, shut down other processes, disabled hyperthreading and power saving, etc. That might not be the way every user deploys the queue; however, it allows us to understand more clearly which parts of the timings are due to the queue versus other considerations. If people end up using hyperthreading or do not use taskset, they will still get the same benefits from the queue that we demonstrate with these "cleaner" settings.

        Also, if you look at the benchmark with the "RandomAccessFile" appender, I think it gives a very "real world" measurement of how Conversant Disruptor will impact the performance of a typical application. Though the throughput of the Disruptor is not saturated, the application greatly benefits from the lower overhead.
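        A sketch of why the queue is pluggable here: AsyncAppender only needs a java.util.concurrent.BlockingQueue, so the Conversant queue can stand in wherever ArrayBlockingQueue is used today. The commented-out Conversant constructor and the SpinPolicy names mirror what is discussed in this thread (WAITING, BLOCKING); treat them as assumptions and verify against the library version you actually use. The runnable part below is JDK-only so the sketch is self-contained.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class QueueSwap {
    // The appender thread drains whatever BlockingQueue it is handed,
    // so the implementation is a pluggable choice.
    static BlockingQueue<String> newEventQueue(int capacity, boolean lowLatency) {
        if (lowLatency) {
            // With the Conversant artifact on the classpath this would be
            // (API as discussed in this thread -- an assumption, not verified here):
            //   return new com.conversantmedia.util.concurrent.DisruptorBlockingQueue<>(
            //           capacity, com.conversantmedia.util.concurrent.SpinPolicy.WAITING);
        }
        return new ArrayBlockingQueue<>(capacity); // the current default
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> q = newEventQueue(256, false);
        q.put("event");                 // producer side (logging thread)
        System.out.println(q.take());   // consumer side (appender thread)
    }
}
```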
        Remko Popma added a comment -

        John, Anthony, please take a look at the Log4j 2 performance page for an idea of what I would like to see.

        • Comparison of the various queue options with a wide range of threads. I like using 1, 2, 4, 8, 16, 32 and 64 threads. It does not need to be exactly these numbers, but this gives a good idea of behaviour across a wide range of threads without having to run too many perf tests. It is a good thing to include perf results when using more threads than physical cores.
        • Perf tests are done on "vanilla" configurations. We could do a comparison to show how throughput/latency behaviour changes if you use spinlocks, thread affinity or taskset, but the main results should be without these. (I'm okay with hyperthreading and power-saving mode enabled or disabled as long as we document what we did.) I realize this introduces noise that is not relevant to the choice of queue, but more important to me is to avoid giving the impression that users need to do things like thread affinity or taskset to achieve good performance with Log4j 2.

        Log4j is used in a wide variety of environments by a wide variety of applications. There is no single "typical use case", which makes it hard for us to tune it for optimal performance. The best we can do is show performance behaviour under various configurations. Users will pick the result that gives the best trade-off for their application and environment.

        About testing with RandomAccessFileAppender: JMH does not allow us to control the number of invocations, and it will always fill up the queue, after which performance drops to the throughput of the RandomAccessFileAppender. It may be interesting to show the difference between the various queues in this scenario, but my assumption so far has been that this scenario is a rare one. Async logging is only useful to absorb bursts of events. If the application's sustained logging rate is faster than the underlying appender can keep up with, the application is better off using plain synchronous logging, because any async logger would just introduce jitter. For this reason (and after a long discussion on this topic on the Mechanical Sympathy mailing list) the comparison of async logging mechanisms tested with JMH is done with a NoOpAppender. If nobody objects, I suggest we stick with that.
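        The burst-versus-sustained-rate point above can be sketched with plain JDK types (illustrative only, not the Log4j implementation): a bounded queue absorbs a burst until it fills, after which producers are throttled to the consumer's pace.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class BurstAbsorption {
    public static void main(String[] args) throws Exception {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(128);

        // Slow "appender" thread: drains at a fixed, limited rate.
        Thread appender = new Thread(() -> {
            try {
                while (true) {
                    queue.take();
                    TimeUnit.MILLISECONDS.sleep(1); // roughly 1K events/s
                }
            } catch (InterruptedException ignored) { }
        });
        appender.setDaemon(true);
        appender.start();

        // Burst: roughly the first 128 offers land in the queue immediately...
        int accepted = 0;
        for (int i = 0; i < 10_000; i++) {
            if (queue.offer(i)) accepted++; // non-blocking offer, fails when full
        }
        // ...then the queue is full, and sustained throughput is bounded by
        // how fast the appender drains it -- which is Remko's point about
        // JMH always filling the queue for a real appender.
        System.out.println("accepted without waiting: " + accepted);
    }
}
```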
        Anthony Maire added a comment -

        Hi Remko,

        I had a look at the performance page; the tests included there are far more relevant than the JMH throughput test, which measures something that will never exist in the real world. I will definitely try to run it with the various queue implementations when I have some time (but I'm currently pretty busy). However, I'm very surprised by the figures for AsyncLogger: how can the throughput still double when the thread count doubles, when there are already as many threads as available cores/hyperthreads? I suspect there is something wrong there. Did you investigate that point?

        About the tests with the RAF appender: I made some further tests using the rolling variant, with a maximum of 1 file, all writing to /dev/shm to avoid real fs/disk jitter. The maximum rate I got was around 900K msg/s, very far from the 3 to 7M msg/s I got with a no-op appender and the custom queues. Since the JMH test only measures how fast the appender thread can dequeue (cf. the huge increase with the JCTools queue when I changed the offer method to avoid false sharing between producer and consumer), doing comparisons on it is really pointless. The degree of contention in the real world will be much lower, even when it is not blocking on disk.
        Anthony Maire added a comment -

        Running up to 10 times more spinning threads than available cores can NEVER be relevant; it's like trying to measure the speed of a car without releasing the handbrake.

        For the "simulating a desktop system" and the "perf enthusiasts use taskset" parts, the issue is that you're using cores from different CPUs: in a desktop system, cores will never communicate through a QPI link, and perf enthusiasts will try to exploit NUMA locality as much as possible. The taskset setting you used is a counter-productive tuning, since it will cause lots of cross-node memory accesses.

        However, as I said in another comment, these throughput tests are not really relevant; I will try to run some other tests in the near future.
        John Cairns added a comment -

        Remko Popma, Anthony Maire: here are the benchmarks at 1, 2, 4, 8, 16, 32 and 64 threads, with numactl and without taskset. Both disruptors are set to WAITING/PARK respectively. I'm using the same 12-core Haswell box with the 29x clock multiplier as before (standard enterprise pizza box circa 2016).

        I'm not sure this result is any more or less relevant than the previous results, but it does match my interpretation of the data. numactl introduced some sort of degenerate pauses at high iteration counts, so I had to reduce the number of iterations to get a good result. Also, there is likely a concurrency bug in the Xfer version, as it appeared to deadlock in the 32-thread case; that data is lost. Although this result is not as careful, and IMHO not as good, as the previous one, at least it agrees overall.

        I'm satisfied that the evidence conclusively demonstrates the performance of Conversant Disruptor in a wide variety of real cases and benchmarks, even across a variety of experimental conditions.
        John Cairns added a comment -

        For the sake of curiosity, here is the result on a dual-core Core i7 laptop with hyperthreading and Turbo Boost (3.3GHz) enabled: Intel(R) Core(TM) i7-4600U CPU @ 2.10GHz, 4 threads on 4 virtual cores.

        This could be useful for people developing (open source) tools that are sometimes deployed on consumer-grade hardware. You can see the benefit of a high clock rate on a single CPU for all of the approaches.

        $ java -jar log4j-perf/target/benchmarks.jar -jvmArgs '-server -XX:+AggressiveOpts -XX:CompileThreshold=100' ".*AsyncAppenderLog4j2Benchmark.throughputSimple" -t 4 -si true

        Benchmark                                                  (configFileName)                             Mode  Samples        Score         Error  Units
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndNoLoc-noOpAppender.xml        thrpt       10  2400103.680  ±  44580.898  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndDsrptrNoLoc-noOpAppender.xml  thrpt       10  4644898.418  ± 238257.071  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndMpscQNoLoc-noOpAppender.xml   thrpt       10  4669573.290  ± 273457.358  ops/s
        o.a.l.l.p.j.AsyncAppenderLog4j2Benchmark.throughputSimple  perf5AsyncApndXferQNoLoc-noOpAppender.xml   thrpt       10  3302446.735  ± 362639.838  ops/s
        Hide
        john@2ad.com John Cairns added a comment -

        log4jHaswell1cpu12core.jpg uploaded for comparison purposes

        jvz Matt Sicker added a comment -

        John Cairns: we've discovered an issue after I merged the branch. Your library requires Java 8, but we target Java 7. I've proposed that we build with Java 8 and use the animal-sniffer plugin to prevent using anything that isn't available in Java 7, but nothing is final. We might need to delay this plugin because of this issue.
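
        For reference, the animal-sniffer check mentioned above might look roughly like this in the POM. This is a sketch of a typical setup, not the actual Log4j build change; the plugin version is illustrative, and `java17` is the Codehaus signature artifact covering the Java 1.7 API:

        ```xml
        <!-- Sketch: fail the build if code references APIs missing from
             Java 7, even when compiling with a Java 8 JDK. -->
        <plugin>
          <groupId>org.codehaus.mojo</groupId>
          <artifactId>animal-sniffer-maven-plugin</artifactId>
          <version>1.15</version>
          <configuration>
            <signature>
              <!-- "java17" here means Java 1.7 API signatures -->
              <groupId>org.codehaus.mojo.signature</groupId>
              <artifactId>java17</artifactId>
              <version>1.0</version>
            </signature>
          </configuration>
          <executions>
            <execution>
              <goals>
                <goal>check</goal>
              </goals>
            </execution>
          </executions>
        </plugin>
        ```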

        john@2ad.com John Cairns added a comment -

        Matt Sicker If it comes to that I could do a one-off Java 7 build. Let me know if this would help.

        jvz Matt Sicker added a comment -

        That would be helpful, yeah.

        ralph.goers@dslextreme.com Ralph Goers added a comment -

        As I mentioned on the mailing list, we can't do a release using the Java 8 compiler with Java 7 as the target.

        john@2ad.com John Cairns added a comment -

        I just uploaded a Java 7 repo to Nexus. Please try it with 1.2.8-JDK7 once your cache is refreshed.

              <dependency>
                <groupId>com.conversantmedia</groupId>
                <artifactId>disruptor</artifactId>
                <version>1.2.8-JDK7</version>
              </dependency>

        jvz Matt Sicker added a comment -

        Thanks, I've upgraded the dependency.

        jvz Matt Sicker added a comment -

        Code has been merged to master.
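
        For anyone landing here later, the merged feature is selected by nesting a queue factory element inside the Async appender. A minimal configuration sketch (appender names, file path, and buffer size are illustrative; Conversant's queue wants a power-of-two capacity):

        ```xml
        <Configuration status="warn">
          <Appenders>
            <File name="File" fileName="app.log">
              <PatternLayout pattern="%d %p %c{1.} [%t] %m%n"/>
            </File>
            <!-- AsyncAppender backed by Conversant's DisruptorBlockingQueue
                 instead of the default ArrayBlockingQueue -->
            <Async name="Async" bufferSize="262144">
              <AppenderRef ref="File"/>
              <DisruptorBlockingQueue/>
            </Async>
          </Appenders>
          <Loggers>
            <Root level="info">
              <AppenderRef ref="Async"/>
            </Root>
          </Loggers>
        </Configuration>
        ```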

        ralph.goers@dslextreme.com Ralph Goers added a comment - - edited

        Thanks for doing this. I hate to nitpick, but shouldn't this have been:

              <dependency>
                <groupId>com.conversantmedia</groupId>
                <artifactId>disruptor</artifactId>
                <version>1.2.8</version>
                <classifier>JDK7</classifier>
              </dependency>
        
        jvz Matt Sicker added a comment - - edited

        I was not aware of that feature. I'll try it out.

        Edit: that didn't work for me. I tried changing the parent POM to:

              <dependency>
                <groupId>com.conversantmedia</groupId>
                <artifactId>disruptor</artifactId>
                <version>${conversantDisruptorVersion}</version>
                <classifier>JDK7</classifier>
              </dependency>
        

        But I'm getting a build error:

        [ERROR] 'dependencies.dependency.version' for com.conversantmedia:disruptor:jar is missing. @ org.apache.logging.log4j:log4j-core:[unknown-version], /Users/matt/Documents/Programming/log4j/log4j-core/pom.xml, line 55, column 17
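
        If I recall Maven's resolution rules correctly, dependencyManagement entries are keyed on groupId:artifactId:type:classifier, so a managed entry without the classifier does not apply to a classified dependency, and the version is treated as missing. A sketch of what the parent POM would need (versions illustrative):

        ```xml
        <dependencyManagement>
          <dependencies>
            <!-- The <classifier> must appear in the managed entry too,
                 or it will not match the classified dependency declared
                 in log4j-core. -->
            <dependency>
              <groupId>com.conversantmedia</groupId>
              <artifactId>disruptor</artifactId>
              <version>1.2.8</version>
              <classifier>JDK7</classifier>
            </dependency>
          </dependencies>
        </dependencyManagement>
        ```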
        
        john@2ad.com John Cairns added a comment -

        I'll look into adding a classifier for a future release. Thanks Ralph Goers for the pointer.


          People

          • Assignee: jvz Matt Sicker
          • Reporter: jvz Matt Sicker
          • Votes: 0
          • Watchers: 6
