Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-9399

Wrong cost calculation in prefix/wildcard queries

    XMLWordPrintableJSON

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 8.5.2
    • Fix Version/s: None
    • Component/s: core/other
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      When running a prefix query that matches more than 16 terms a BitSet is built for that subquery with the DocIdSetBuilder.
      The cost of the subquery calculated in that class too.
      If I get it right, the idea behind the calculation is to get the frequency of every matched term, sum it all and then divide by the average number of terms per doc. It makes sense but it seems like the sum is not calculated properly:

      public void add(DocIdSetIterator iter) throws IOException {
       if (bitSet != null) {
       bitSet.or(iter);
       return;
       }
       int cost = (int) Math.min(Integer.MAX_VALUE, iter.cost());
       BulkAdder adder = grow(cost);

      Instead of adding the frequency of every term, it stops after the `grow()` creates a bitset. From that moment add() will not call grow and the "counter" variable that stands for the sum of all frequencies will not grow anymore.

      In the end, it can create a prefix query with a full bitset (or almost full) that has a low cost. That query can become a leader in a conjunction query.

      It's important to say that even if the function is fixed as mentioned above, the cost might still result in a much lower cost than it should.
      For example, If we look for all the terms which start with "strin*" in a java_code field. String term and 20 more terms will match that query. The String Term will match most of the documents but the cost will be pretty low because of the calculation (sum_of_frequencies / avarage_number_of_terms_per_doc).

      Maybe we can count every new bit added to the bitset.

        Attachments

          Activity

            People

            • Assignee:
              Unassigned
              Reporter:
              Nirf8 Nir Finkelstein
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: