[LUCENE-10542] FieldSource exists implementations can avoid value retrieval - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 9.2
Component/s: None
Labels:
None

Lucene Fields:

New

Description

While looking at ~~LUCENE-10534~~, found that *FieldSource exists implementation after ~~LUCENE-7407~~ can avoid value lookup when just checking for exists.

Flamegraphs - x axis = time spent as a percentage of time being profiled, y axis = stack trace bottom being first call top being last call

Looking only at the left most getValueForDoc highlight only (and it helps to make it bigger or download the original)

LongFieldSource#exists spends MOST of its time doing a LongFieldSource#getValueForDoc. LongFieldSource#getValueForDoc spends its time doing two things primarily:

FilterNumericDocValues#longValue()
advance()

This makes sense based on looking at the code (copied below to make it easier to see at once)
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L72

      private long getValueForDoc(int doc) throws IOException {
        if (doc < lastDocID) {
          throw new IllegalArgumentException(
              "docs were sent out-of-order: lastDocID=" + lastDocID + " vs docID=" + doc);
        }
        lastDocID = doc;
        int curDocID = arr.docID();
        if (doc > curDocID) {
          curDocID = arr.advance(doc);
        }
        if (doc == curDocID) {
          return arr.longValue();
        } else {
          return 0;
        }
      }

LongFieldSource#exists - doesn't care about the actual longValue. Just that there was a value found when iterating through the doc values.
https://github.com/apache/lucene/blob/main/lucene/queries/src/java/org/apache/lucene/queries/function/valuesource/LongFieldSource.java#L95

      @Override
      public boolean exists(int doc) throws IOException {
        getValueForDoc(doc);
        return arr.docID() == doc;
      }

So putting this all together for exists calling getValueForDoc, we spent ~50% of the time trying to get the long value when we don't need it in exists. We can save that 50% of time making exists not care about the actual value and just return if doc == curDocID basically.

This 50% extra is exaggerated in MaxFloatFunction (and other places) since exists() is being called a bunch. Eventually the value will be needed from longVal(), but if we call exists() say 3 times for every longVal(), we are spending a lot of time computing the value when we only need to check for existence.

I found the same pattern in DoubleFieldSource, EnumFieldSource, FloatFieldSource, IntFieldSource, LongFieldSource. I put together a change showing what this would look like:

Simple JMH performance tests comparing the original FloatFieldSource to the new ones from PR #847.

Benchmark	Mode	Cnt	Score and Error	Units
-----------------------------------------------------------------	-------		------------------	-------
MyBenchmark.testMaxFloatFunction	thrpt	25	64.159 ± 2.031	ops/s
MyBenchmark.testNewMaxFloatFunction	thrpt	25	94.997 ± 2.365	ops/s
MyBenchmark.testMaxFloatFunctionNewFloatFieldSource	thrpt	25	123.191 ± 9.291	ops/s
MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSource	thrpt	25	123.817 ± 6.191	ops/s
MyBenchmark.testMaxFloatFunctionRareField	thrpt	25	244.921 ± 6.439	ops/s
MyBenchmark.testNewMaxFloatFunctionRareField	thrpt	25	239.288 ± 5.136	ops/s
MyBenchmark.testMaxFloatFunctionNewFloatFieldSourceRareField	thrpt	25	271.521 ± 3.870	ops/s
MyBenchmark.testNewMaxFloatFunctionNewFloatFieldSourceRareField	thrpt	25	279.334 ± 10.511	ops/s

Source: https://github.com/risdenk/lucene-jmh

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

flamegraph_getValueForDoc.png
27/Apr/22 17:54
535 kB
Kevin Risden

Issue Links

relates to

LUCENE-7407 Explore switching doc values to an iterator API

Resolved

LUCENE-10534 MinFloatFunction / MaxFloatFunction calls exists twice

Closed

links to

GitHub Pull Request #847

Activity

People

Assignee:: Kevin Risden

Reporter:: Kevin Risden

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 27/Apr/22 17:48

Updated:: 28/Aug/22 16:39

Resolved:: 30/Apr/22 02:50

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

0.5h