SOLR-10130

Serious performance degradation in Solr 6.4.1 due to the new metrics collection

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 6.4, 6.4.1
    • Fix Version/s: 6.4.2, 7.0
    • Component/s: metrics
    • Security Level: Public (Default Security Level. Issues are Public)
    • Labels:
    • Environment:

      Centos 7, OpenJDK 1.8.0 update 111

      Description

      We've stumbled on serious performance issues after upgrading to Solr 6.4.1. It looks like the new metrics collection system in MetricsDirectoryFactory is causing a major slowdown. This happens with an index configuration that, as far as I can see, has no metrics-specific configuration and uses luceneMatchVersion 5.5.0. In practice a moderate load will completely bog down the server, with Solr threads constantly using all CPU capacity (600% on a 6-core machine) under a load where we normally see an average CPU usage of < 50%.

      I took stack traces (I'll attach them) and noticed that the threads are spending time in com.codahale.metrics.Meter.mark. I tested building Solr 6.4.1 with the metrics collection disabled in MetricsDirectoryFactory getByte and getBytes methods and was unable to reproduce the issue.

      As far as I can see there are several issues:
      1. Collecting metrics on every single byte read is slow (see the illustrative sketch below).
      2. Having it enabled by default is not a good idea.
      3. The comment "enable coarse-grained metrics by default" at https://github.com/apache/lucene-solr/blob/branch_6x/solr/core/src/java/org/apache/solr/update/SolrIndexConfig.java#L104 implies that only coarse-grained metrics should be enabled by default, which contradicts collecting metrics on every single byte read.
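
      For illustration only (this is not code from Solr or from the attached patch), here is a minimal, self-contained sketch of why calling Dropwizard's Meter.mark() once per byte read adds up; the class name and buffer size are invented for the example:

      import com.codahale.metrics.Meter;
      import com.codahale.metrics.MetricRegistry;

      // Toy comparison: a plain per-byte read loop vs. the same loop with one
      // Meter.mark() per byte, which is the pattern behind the reported hotspot.
      public class PerByteMeterSketch {
          public static void main(String[] args) {
              byte[] data = new byte[64 * 1024 * 1024]; // 64 MB of dummy "index" data
              Meter bytesRead = new MetricRegistry().meter("bytesRead");

              long plainStart = System.nanoTime();
              long sum = 0;
              for (byte b : data) {
                  sum += b;                 // plain read
              }
              long plainMs = (System.nanoTime() - plainStart) / 1_000_000;

              long meteredStart = System.nanoTime();
              for (byte b : data) {
                  sum += b;
                  bytesRead.mark();         // one meter update per byte
              }
              long meteredMs = (System.nanoTime() - meteredStart) / 1_000_000;

              System.out.printf("plain: %d ms, metered: %d ms, sum=%d%n", plainMs, meteredMs, sum);
          }
      }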

      Attachments

      1. SOLR-10130.patch
        24 kB
        Andrzej Bialecki
      2. SOLR-10130.patch
        23 kB
        Andrzej Bialecki
      3. solr-8983-console-f1.log
        234 kB
        Ere Maijala

        Issue Links

          Activity

          wunder Walter Underwood added a comment -

          I’m seeing similar problems here. With 6.4.0, we were handling 6000 requests/minute. With 6.4.1 it is 1000 rpm with median response times around 2.5 seconds. I also switched to the G1 collector. I’m going to back that out and retest today to see if the performance comes back.

          Does disabling metrics fix it or do we need to go back to 6.4.0?

          wunder

          ab Andrzej Bialecki added a comment -

          Does disabling metrics fix it or do we need to go back to 6.4.0?

          Unfortunately no, these metrics are always turned on both in 6.4.0 and in 6.4.1. I'll upload a patch that disables this by default and allows turning it on via a solrconfig.xml knob.

          ab Andrzej Bialecki added a comment -

          This patch turns off Directory and Index metrics by default, and adds config knobs to selectively turn them on in solrconfig.xml (the defaults are all false now, so this section is optional):

          <config>
            ...
            <indexConfig>
              <mergeFactor>...
              ...
              <metrics>
                <bool name="directory">false</bool>
                <bool name="directoryDetails">false</bool>
                <bool name="merge">false</bool>
                <bool name="mergeDetails">false</bool>
              </metrics>
              ...
            </indexConfig>
          </config>

          ichattopadhyaya Ishan Chattopadhyaya added a comment - - edited

          I ran some benchmarks, with and without this patch.

          Benchmarking suite: https://github.com/chatman/solr-upgrade-tests/blob/master/BENCHMARKS.md
          Environment: packet.net, Type 0 server (https://www.packet.net/bare-metal/servers/type-0/)

          6.4.1 Without patch
          ------------------------
          java -cp target/org.apache.solr.tests.upgradetests-0.0.1-SNAPSHOT-jar-with-dependencies.jar:. org.apache.solr.tests.upgradetests.SimpleBenchmarks -v 72f75b2503fa0aa4f0aff76d439874feb923bb0e -Nodes 1 -Shards 1 -Replicas 1 -numDocs 100000 -threads 6 -benchmarkType generalIndexing
          
          Indexing times: 188,190
          
          6.4.1 With patch
          --------------------
          java -cp target/org.apache.solr.tests.upgradetests-0.0.1-SNAPSHOT-jar-with-dependencies.jar:. org.apache.solr.tests.upgradetests.SimpleBenchmarks -v 72f75b2503fa0aa4f0aff76d439874feb923bb0e -patchUrl https://issues.apache.org/jira/secure/attachment/12852444/SOLR-10130.patch -Nodes 1 -Shards 1 -Replicas 1 -numDocs 100000 -threads 6 -benchmarkType generalIndexing
          
          Indexing times: 171,165
          
          ab Andrzej Bialecki added a comment -

          I would've expected a much larger difference with this patch, if this indeed was the cause of the slowdown - the patch completely turns off metrics collection at directory and index level.

          With 6.4.0, we were handling 6000 requests/minute. With 6.4.1 it is 1000 rpm

          Walter Underwood This is odd, too - the same metrics code is present in both 6.4.1 and 6.4.0, with the same defaults, so I would expect that both versions should show similar performance. Could you please collect some stacktraces (or sample / profile) to verify that you see the same hotspots as Ere Maijala ?

          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          6.4.0 shows very similar numbers compared to 6.4.1

          6.4.0 Without patch
          --------------------
          java -cp target/org.apache.solr.tests.upgradetests-0.0.1-SNAPSHOT-jar-with-dependencies.jar:. org.apache.solr.tests.upgradetests.SimpleBenchmarks -v 680153de29c5b01d4a8afad88d4a7b84ab01e145 -Nodes 1 -Shards 1 -Replicas 1 -numDocs 100000 -threads 6 -benchmarkType generalIndexing
          
          Indexing times: 191,184
          
          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          6.3.0 was faster (same as 6.4.1 with patch).

          java -cp target/org.apache.solr.tests.upgradetests-0.0.1-SNAPSHOT-jar-with-dependencies.jar:. org.apache.solr.tests.upgradetests.SimpleBenchmarks -v 6fa26fe8553b7b65dee96da741f2c1adf4cb6216 -patchUrl http://147.75.108.131/LUCENE-7651.patch -Nodes 1 -Shards 1 -Replicas 1 -numDocs 100000 -threads 6 -benchmarkType generalIndexing
          Indexing times: 168,167
          
          ab Andrzej Bialecki added a comment - - edited

          I haven't been able to reproduce such a drastic slowdown using simple benchmarks - example results from indexing using the post tool, fairly representative of several runs on each branch:

          * branch_6_3
          real    4m14.804s
          user    0m0.883s
          sys     0m2.279s
          
          * branch_6_4
          real    5m0.987s
          user    0m0.910s
          sys     0m2.276s
          
          * jira/solr-10130
          real    4m38.097s
          user    0m0.881s
          sys     0m2.287s
          

          The profiler indeed shows that one of the hotspots on branch_6_4 is the Meter.mark code called from org.apache.solr.core.MetricsDirectoryFactory$MetricsInput.readByte. In my test the profiler showed that this consumes ~3% CPU, which is something we should avoid and turn off by default.

          However, this still doesn't explain the order of magnitude slowdown reported above.

          Ere Maijala and Walter Underwood - please apply the above patch in your environment and see what the impact is. It makes sense to make this change anyway, so I'm going to apply this or a similar version to all affected branches, but maybe there's more we can do here.

          monti Henrik added a comment -

          We've also seen performance degradation with SolrCloud on 6.4.1, as I've posted on solr-user ( http://lucene.472066.n3.nabble.com/Performance-degradation-after-upgrading-from-6-2-1-to-6-4-1-td4320226.html ):

          Here are a couple of graphs. As you can see, 6.4.1 was introduced 2/10 12:00:

          https://www.dropbox.com/s/qrc0wodain50azz/solr1.png?dl=0
          https://www.dropbox.com/s/sdk30imm8jlomz2/solr2.png?dl=0
          https://www.dropbox.com/s/rgd8bq86i3c5mga/solr2b.png?dl=0

          These are two very different usage scenarios:

          • Solr1 has constant updates and very volatile data (30 minutes TTL, 20 shards with no replicas, across 8 servers). Requests in the 99th percentile went from ~400ms to 1000-1500ms. (Hystrix cutoff at 1.5s)
          • Solr2 is a more traditional instance with long-lived data (updated once a day, 24 shards with 2 replicas, across 8 servers). Requests in the 99th percentile went from ~400ms to at least 1s. (Hystrix cutoff at 1s)
          wunder Walter Underwood added a comment -

          Sorry, the 6000 rpm was with 6.2.1, not 6.4.0.

          I've backrev'ed the cluster to 6.3.0 and I'll be running load benchmarks today.

          yseeley@gmail.com Yonik Seeley added a comment -

          It's just a matter of how many little IOs are involved in your request.
          I was easily able to reproduce a 5x slowdown with a prefix query that matches many terms.

          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          Thanks Yonik. I'm working on query performance benchmarks for this.

          wunder Walter Underwood added a comment -

          I have a JMeter-based load script I can share. It replays access logs. I reload the collection to clear caches, run warming queries, then run queries at a controlled rate. After, it calculates percentiles.

          This was a test of 6.4.1. Really slow. The errors are usually log lines with queries so long that they are truncated and end up with bad syntax. There is one column per request handler, so these results are for /auto, /mobile, /select, and /srp.

          Mon Feb 13 12:01:29 PST 2017 ; INFO testing solr-cloud.test.cheggnet.com:8983
          Mon Feb 13 12:01:29 PST 2017 ; INFO testing with 2000 requests/min
          Mon Feb 13 12:01:29 PST 2017 ; INFO testing with 240000 requests
          Mon Feb 13 12:01:29 PST 2017 : splitting log into cache warming (first 2000 lines) and benchmark for /home/wunder/2016-12-12-peak-questions-traffic-clean.log
          Mon Feb 13 12:01:36 PST 2017 : starting cache warming to solr-cloud.test.cheggnet.com:8983
          Mon Feb 13 12:24:29 PST 2017 : starting benchmarking to solr-cloud.test.cheggnet.com:8983
          Mon Feb 13 12:24:29 PST 2017 : benchmark should run for 120 minutes
          Mon Feb 13 12:24:29 PST 2017 : to get a count of requests sent so far, use "wc -l out-32688.jtl"
          Mon Feb 13 14:55:01 PST 2017 : WARNING 207 error responses from solr-cloud.test.cheggnet.com
          Mon Feb 13 14:55:01 PST 2017 : INFO Removing 207 error responses from JMeter output file before analysis
          Mon Feb 13 14:55:01 PST 2017 : analyzing results
          /home/wunder/search-test/load-test
          Mon Feb 13 14:55:04 PST 2017 : 25th percentiles are 3151.0,3389.0,9329.0,5647.0
          Mon Feb 13 14:55:04 PST 2017 : medians are 6101.0,10579.0,18692.0,8780.0
          Mon Feb 13 14:55:04 PST 2017 : 75th percentiles are 6871.0,12499.0,25000.0,12580.0
          Mon Feb 13 14:55:04 PST 2017 : 90th percentiles are 7593.0,13481.0,27623.0,14068.0
          Mon Feb 13 14:55:04 PST 2017 : 95th percentiles are 8079.0,14039.0,28566.0,16606.0
          Mon Feb 13 14:55:04 PST 2017 : full results are in test.csv

          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          I could reproduce a 1.6x slowdown for prefix queries.

          Benchmarking suite: https://github.com/chatman/solr-upgrade-tests/blob/master/BENCHMARKS.md
          Environment: packet.net, Type 0 server (https://www.packet.net/bare-metal/servers/type-0/)

          Prefix query times for 6.4.1
          ----------------------------
          java -cp target/org.apache.solr.tests.upgradetests-0.0.1-SNAPSHOT-jar-with-dependencies.jar:. org.apache.solr.tests.upgradetests.SimpleBenchmarks -v 72f75b2503fa0aa4f0aff76d439874feb923bb0e -Nodes 1 -Shards 1 -Replicas 1 -numDocs 10000 -threads 4 -benchmarkType generalQuerying
          
          Got results for prefix queries: 10000
          Max time (prefix queries): 2156ms
          Total time (prefix queries): 1324856ms
          
          Prefix query times for 6.3.0
          ----------------------------
          java -cp target/org.apache.solr.tests.upgradetests-0.0.1-SNAPSHOT-jar-with-dependencies.jar:. org.apache.solr.tests.upgradetests.SimpleBenchmarks -v 6fa26fe8553b7b65dee96da741f2c1adf4cb6216 -patchUrl http://147.75.108.131/LUCENE-7651.patch -Nodes 1 -Shards 1 -Replicas 1 -numDocs 10000 -threads 4 -benchmarkType generalQuerying
          
          Got results for prefix queries: 10000
          Max time (prefix queries): 1358ms
          Total time (prefix queries): 839534ms
          

          Notes:
          1. The -threads parameter here is the number of indexing threads; the number of querying threads is 4 times that, i.e. 16 in this case.
          2. Total time is the sum of all times, as reported in the response header's "QTime". Max time is the QTime for the worst performing query.

          ichattopadhyaya Ishan Chattopadhyaya added a comment -
          Prefix query times for 6.4.1 with SOLR-10130 patch
          -----------------------------------------------------------------
          java -cp target/org.apache.solr.tests.upgradetests-0.0.1-SNAPSHOT-jar-with-dependencies.jar:. org.apache.solr.tests.upgradetests.SimpleBenchmarks -v 72f75b2503fa0aa4f0aff76d439874feb923bb0e -patchUrl https://issues.apache.org/jira/secure/attachment/12852444/SOLR-10130.patch -Nodes 1 -Shards 1 -Replicas 1 -numDocs 10000 -threads 4 -benchmarkType generalQuerying
          
          Got results for prefix queries: 10000
          Max time (prefix queries): 1716
          Total time (prefix queries): 852266
          
          emaijala Ere Maijala added a comment -

          I don't have proper benchmarks at hand, but I can support others' findings about the serious query performance degradation. I suppose it's easily overlooked when testing with light concurrency, but with enough concurrent queries being served it gets CPU-heavy. We use queries with a lot of filters, so that may play a role too. I'll see if I can come up with reproducible-enough test results from our actual queries.

          tech@bidorbuy.co.za bidorbuy added a comment -

          Same issue here. It worked perfectly fine on Solr 6.2.0 and the CPU is thrashing on Solr 6.4.1. I didn't see this bug report and logged a duplicate - https://issues.apache.org/jira/browse/SOLR-10140 showing the slowdown in comparison.

          In our case, Solr 6.4.1 works perfectly fine under production load for about 1 hour and then the CPU starts thrashing. From the New Relic reports you will see that Solr 6.4.1 is using substantially more CPU than prior versions.

          ab Andrzej Bialecki added a comment -

          Patch with some cleanup and CHANGES.txt entry. I'll commit this shortly.

          jira-bot ASF subversion and git services added a comment -

          Commit a9eb001f44ca846b64d4ed6e46af316fe12ce3d0 in lucene-solr's branch refs/heads/branch_6_4 from Andrzej Bialecki
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=a9eb001 ]

          SOLR-10130 Serious performance degradation in Solr 6.4.1 due to the
          new metrics collection.

          jira-bot ASF subversion and git services added a comment -

          Commit 835c96ba97a01c61978535c0e8fe34708755dc28 in lucene-solr's branch refs/heads/branch_6x from Andrzej Bialecki
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=835c96b ]

          SOLR-10130 Serious performance degradation in Solr 6.4.1 due to the
          new metrics collection.

          jira-bot ASF subversion and git services added a comment -

          Commit b6f49dc1fb4ad6ef890ae1d09f6d4c0584bb6f64 in lucene-solr's branch refs/heads/master from Andrzej Bialecki
          [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=b6f49dc ]

          SOLR-10130 Serious performance degradation in Solr 6.4.1 due to the
          new metrics collection.

          ab Andrzej Bialecki added a comment -

          Fixed in branch_6_4, branch_6x and master.

          tech@bidorbuy.co.za bidorbuy added a comment -

          Andrzej Bialecki would I have to do a Solr build myself to get the patch in or should I rather wait for 6.4.2 (if so, any indication of when it would be released)?

          ab Andrzej Bialecki added a comment -

          You can of course build Solr yourself from branch_6_4, which contains the patch. I don't know of any specific timeline for 6.4.2, but this is a pretty serious issue so I think we should do it soon - let's discuss this on the mailing lists.

          monti Henrik added a comment -

          We just deployed the latest from branch_6_4 (a9eb001f44) and our systems are performing normally again. Thanks for your work on this Andrzej Bialecki !

          wunder Walter Underwood added a comment -

          I don't have hard numbers, but core recovery after a restart with 6.4.0 was taking a really long time. Maybe 30 minutes. Back-reved to 6.3.0, it is maybe five minutes.

          alessandro.benedetti Alessandro Benedetti added a comment -

          What was causing the query-time slowdown for prefix queries?
          Has this been discovered?

          ab Andrzej Bialecki added a comment -

          A prefix query that matches many terms causes many seek & read ops, which meant that the instrumentation in org.apache.solr.core.MetricsDirectoryFactory$MetricsInput.readByte was called for every small read. This normally wouldn't matter for regular Directory implementations, because they use caching extensively precisely to avoid the overhead of reading single bytes, but MetricsDirectory, being a wrapper on top of any Directory implementation, couldn't benefit from this caching and still maintain the read/write counters. The overhead of an individual Meter.mark call is on the order of microseconds, but invoking it a few million times resulted in a significant slowdown.
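
          As a simplified sketch of this wrapper pattern (the DataSource interface below is invented for the example and is not Lucene's IndexInput), the key point is that the metering wrapper sits above whatever buffering the delegate does internally, so every logical readByte() still pays for one Meter.mark():

          import com.codahale.metrics.Meter;

          // Hypothetical stand-in for an IndexInput-like abstraction.
          interface DataSource {
              byte readByte();
          }

          // Delegating wrapper: it cannot use the delegate's internal caching to
          // avoid per-byte accounting and still keep the read counters accurate,
          // so a query doing millions of tiny reads marks the meter millions of times.
          class MeteredDataSource implements DataSource {
              private final DataSource delegate;
              private final Meter readMeter;

              MeteredDataSource(DataSource delegate, Meter readMeter) {
                  this.delegate = delegate;
                  this.readMeter = readMeter;
              }

              @Override
              public byte readByte() {
                  readMeter.mark();            // per-byte accounting: the hotspot
                  return delegate.readByte();  // the delegate may be buffered, but the mark already happened
              }
          }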

          alessandro.benedetti Alessandro Benedetti added a comment -

          Thanks for the clear explanation!
          Good spot!

          wunder Walter Underwood added a comment -

          The slowdown is impressive under heavy query load. Here are two load benchmarks with a 16 node cluster, c4.8xlarge instances (36 CPUs, 60 GB RAM), 15.7 million docs, 4 shards, replication factor 4 using production query logs. These are very long text queries, up to 40 words. Benchmark runs for two or three hours, depending on my patience. Java 8u121, G1 collector.

          6.4.0 with 1000 requests/minute is running out of CPU. Median and 95th percentile response times for an ngram/prefix match are 7.5 and 9.8 seconds. For a word match, they are 11 and 25.4 seconds.

          6.3.0 with 6000 rpm, the times are 0.4 and 2.7 seconds, and 0.7 and 4.3 seconds, respectively. CPU usage is under 50%.

          Short version, 6.4 is 10X slower than 6.3 handling 1/6 the load.

          wunder Walter Underwood added a comment -

          Also, recovery is much, much slower in 6.4. Each core is about 8 GB. After a server process restart, the core is recovering for a few minutes in 6.3, but for about a half hour in 6.4.

          erickerickson Erick Erickson added a comment -

          Walter Underwood: If it's easy, could you try a manual fetchindex, which you can even do in cloud mode? See: https://cwiki.apache.org/confluence/display/solr/Index+Replication#IndexReplication-HTTPAPICommandsfortheReplicationHandler

          Or maybe just see if the logs show that this very long recovery happens when you have a full recovery, i.e. you're copying the full index down from the leader/master...

          wunder Walter Underwood added a comment -

          I'm looking at how long the core is marked "recovering" in the cloud view of the admin UI.

          There shouldn't be any recovery. The server process is restarted hours after the most recent update. I think this is how long it takes to get the core loaded and ready for search. Startup time, really.

          erickerickson Erick Erickson added a comment -

          Ah, ok. I take it this is a replica with a bunch of data in it? Although this doesn't make sense - there shouldn't be that much work to fire up a core absent tlog replay and the like - but it sounds like you're far beyond that, so I'm missing something.

          Is there any way you could pull/build a new version of Solr 6.4 (or apply the patch on this JIRA locally) and try? I'd hate to have the 6.4.2 release get out (coming soon, due to this) and not have fixed a different issue.

          ab Andrzej Bialecki added a comment -

          Is there any way you could pull/build a new version of Solr 6.4 (or apply the patch on this JIRA locally) and try? I'd hate to have the 6.4.2 release get out (coming soon, due to this) and not have fixed a different issue.

          I concur. Walter Underwood - we are not sure whether your situation is caused by the issue fixed here or by some other bug; it would be very helpful if you could try a build that contains this patch to see if it solves the problem in your environment.

          wunder Walter Underwood added a comment -

          This might be part of it:

          [wunder@new-solr-c01.test3]# ls -lh /solr/data/questions_shard2_replica1/data/tlog/
          total 4.7G
          -rw-r--r-- 1 bin bin 4.7G Feb 13 11:04 tlog.0000000000000000000
          [wunder@new-solr-c01.test3]# du -sh /solr/data/questions_shard2_replica1/data/*
          8.4G /solr/data/questions_shard2_replica1/data/index
          4.0K /solr/data/questions_shard2_replica1/data/snapshot_metadata
          4.7G /solr/data/questions_shard2_replica1/data/tlog

          Last Modified: 3 days ago
          Num Docs: 3683075
          Max Doc: 3683075
          Heap Memory Usage: -1
          Deleted Docs: 0
          Version: 2737
          Segment Count: 26
          Optimized: yes
          Current: yes

          erickerickson Erick Erickson added a comment -

          How on earth did you get a 4.7G tlog? It looks like you somehow didn't commit, shut the node down, and are now replaying a ton of docs (well, how much does 4.7G weigh anyway?) from the tlog.

          So, simple test:
          1> wait for the node to come up.
          2> ensure you've issued a hard commit (see the sketch at the end of this comment).
          3> try restarting.

          My claim is that the restart will be reasonable and the slowness you're seeing is a result of somehow shutting down without doing a commit. Of course depending on your autocommit interval you may not need to do the hard commit before restarting...

          Best,
          Erick
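
          For reference, a minimal sketch of step 2 using SolrJ (the base URL is hypothetical and the core name is taken from the listing above; a plain HTTP request to /update?commit=true achieves the same thing):

          import org.apache.solr.client.solrj.SolrClient;
          import org.apache.solr.client.solrj.impl.HttpSolrClient;

          public class HardCommitBeforeRestart {
              public static void main(String[] args) throws Exception {
                  // Hypothetical host; adjust the URL and core name for your cluster.
                  try (SolrClient client =
                           new HttpSolrClient.Builder("http://localhost:8983/solr/questions_shard2_replica1").build()) {
                      // Explicit hard commit: flushes buffered documents so the tlog
                      // does not have to be replayed in full on the next startup.
                      client.commit();
                  }
              }
          }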

          emaijala Ere Maijala added a comment -

          I still don't have proper benchmarks, but I've tested enough to say with fair confidence that this is fixed for us.

          ichattopadhyaya Ishan Chattopadhyaya added a comment -

          Adding a link to https://issues.apache.org/jira/browse/SOLR-10182 for backing out the changes that caused these perf degradations.

          elyograg Shawn Heisey added a comment -

          The reassignment was accidental, fixed. I hate the fact that Jira responds with real actions to just typing on the keyboard. I sometimes forget which window has the focus, assume that an SSH session I can clearly see is active, and find that I'm giving unknown commands to something that accepts keypresses as commands, like Thunderbird or Jira.

          elyograg Shawn Heisey added a comment -

          Have a question related to this issue. Somebody on the IRC channel running 6.4.2 is seeing continued performance degradation compared to 4.x. They were running an earlier 6.4.x release, until they were advised about this issue.

          Looking at the utilization for threads, the top threads on 6.4.2 are all named starting with qtp, which I believe means they are Jetty threads.

          https://gist.github.com/msporleder-work/7313ebedbdab2e178ca0aa2e889d006b

          If I'm not mistaken, we enabled container-level metrics with the changes that went into 6.4.0. If that's true, do we perhaps have those metrics dialed up to 11?

          ab Andrzej Bialecki added a comment -

          I don't think this should make much of a difference - InstrumentedQueuedThreadPool only exposes gauges, which basically don't add CPU overhead unless accessed, and InstrumentedHandler collects only a few specific metrics, so the overhead should also be minimal, on the order of microseconds per request.

          A drill-down into these threads to find their hot-spots would be useful.
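
          To make the gauge vs. meter distinction concrete, here is a minimal Dropwizard Metrics sketch (illustrative only, not the actual InstrumentedQueuedThreadPool code):

          import com.codahale.metrics.Gauge;
          import com.codahale.metrics.Meter;
          import com.codahale.metrics.MetricRegistry;

          import java.util.concurrent.Executors;
          import java.util.concurrent.ThreadPoolExecutor;

          public class GaugeVsMeterSketch {
              public static void main(String[] args) {
                  MetricRegistry registry = new MetricRegistry();
                  ThreadPoolExecutor pool = (ThreadPoolExecutor) Executors.newFixedThreadPool(8);

                  // A gauge is only a callback: nothing runs until something reads the
                  // registry, so it adds no cost on the request path.
                  registry.register("pool.active", (Gauge<Integer>) pool::getActiveCount);

                  // A meter updates its internal counters on every mark() call,
                  // i.e. it costs a little on every event it instruments.
                  Meter requests = registry.meter("requests");
                  requests.mark();

                  pool.shutdown();
              }
          }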

          msporleder Matthew Sporleder added a comment -

          Not sure I have the tooling right now for a full drill down, but here are some examples of a thread dump:

          "qtp968514068-37953" - Thread t@37953
             java.lang.Thread.State: RUNNABLE
              at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
              at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:269)
              at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:79)
              at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:86)
              - locked <2e101d2c> (a sun.nio.ch.Util$2)
              - locked <59c1f901> (a java.util.Collections$UnmodifiableSet)
              - locked <54c6a926> (a sun.nio.ch.EPollSelectorImpl)
              at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:97)
              at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:101)
              at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.select(ManagedSelector.java:243)
              at org.eclipse.jetty.io.ManagedSelector$SelectorProducer.produce(ManagedSelector.java:191)
              at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.executeProduceConsume(ExecuteProduceConsume.java:249)
              at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.produceConsume(ExecuteProduceConsume.java:148)
              at org.eclipse.jetty.util.thread.strategy.ExecuteProduceConsume.run(ExecuteProduceConsume.java:136)
              at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:671)
              at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:589)
              at java.lang.Thread.run(Thread.java:745)
          
             Locked ownable synchronizers:
              - None
          
          "qtp968514068-37952" - Thread t@37952
             java.lang.Thread.State: TIMED_WAITING
              at sun.misc.Unsafe.park(Native Method)
              - parking to wait for <2d78e562> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
              at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
              at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
              at org.eclipse.jetty.util.BlockingArrayQueue.poll(BlockingArrayQueue.java:392)
              at org.eclipse.jetty.util.thread.QueuedThreadPool.idleJobPoll(QueuedThreadPool.java:563)
              at org.eclipse.jetty.util.thread.QueuedThreadPool.access$800(QueuedThreadPool.java:48)
              at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:626)
              at java.lang.Thread.run(Thread.java:745)
          

          Most are in that TIMED_WAITING state, and most CPU time is spent in org.eclipse.jetty.util.BlockingArrayQueue.poll according to VisualVM.

          ab Andrzej Bialecki added a comment -

          Did you change the JDK version between these two installs? I found an old issue (but still open!) that may indicate it's a JDK bug: https://github.com/netty/netty/issues/327. There are other similar reports for Jetty, but for older versions... can't say whether that's relevant here.

          However, what these stacktraces do NOT show is anything related to metrics.

          msporleder Matthew Sporleder added a comment -

          Both are running java version "1.8.0_45" sun jdk

          elyograg Shawn Heisey added a comment -

          Metrics was just a theory, sounds like that's not it. Thanks Andrzej Bialecki for the assist.


            People

            • Assignee: Andrzej Bialecki
            • Reporter: Ere Maijala
            • Votes: 1
            • Watchers: 31
