Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-10542

FieldSource exists implementations can avoid value retrieval


    • Improvement
    • Status: Closed
    • Minor
    • Resolution: Fixed
    • None
    • 9.2
    • None
    • None
    • New


      While looking at LUCENE-10534, found that *FieldSource exists implementation after LUCENE-7407 can avoid value lookup when just checking for exists.

      Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace bottom being first call top being last call

      Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original)

      LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily:

      • FilterNumericDocValues#longValue()
      • advance()

      This makes sense based on looking at the code (copied below to make it easier to see at once)

            private long getValueForDoc(int doc) throws IOException {
              if (doc < lastDocID) {
                throw new IllegalArgumentException(
                    "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
              lastDocID = doc;
              int curDocID = arr.docID();
              if (doc > curDocID) {
                curDocID = arr.advance(doc);
              if (doc == curDocID) {
                return arr.longValue();
              } else {
                return 0;

      LongFieldSource#exists - doesn't care about the actual longValue. Just that there was a value found when iterating through the doc values.

            public boolean exists(int doc) throws IOException {
              return arr.docID() == doc;

      So putting this all together for exists calling getValueForDoc, we spent ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% of time making exists not care about the actual value and just return if doc == curDocID basically.

      This 50% extra is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence.

      I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change showing what this would look like:

      Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847.

      Benchmark Mode Cnt Score and Error Units
      ----------------------------------------------------------------- -------
      ------------------ -------
      MyBenchmark.testMaxFloatFunction thrpt 25 64.159 ± 2.031 ops/s
      MyBenchmark.testNewMaxFloatFunction thrpt 25 94.997 ± 2.365 ops/s
      MyBenchmark.testMaxFloatFunctionNewFloatFieldSource thrpt 25 123.191 ± 9.291 ops/s
      MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource thrpt 25 123.817 ± 6.191 ops/s
      MyBenchmark.testMaxFloatFunctionRareField thrpt 25 244.921 ± 6.439 ops/s
      MyBenchmark.testNewMaxFloatFunctionRareField thrpt 25 239.288 ± 5.136 ops/s
      MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField thrpt 25 271.521 ± 3.870 ops/s
      MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField thrpt 25 279.334 ± 10.511 ops/s

      Source: https://github.com/risdenk/lucene-jmh


        1. flamegraph_getValueForDoc.png
          535 kB
          Kevin Risden

        Issue Links



              krisden Kevin Risden
              krisden Kevin Risden
              0 Vote for this issue
              3 Start watching this issue



                Time Tracking

                  Original Estimate - Not Specified
                  Not Specified
                  Remaining Estimate - 0h
                  Time Spent - 0.5h