Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-9399

Wrong cost calculation in prefix/wildcard queries

Details

    • Bug
    • Status: Open
    • Minor
    • Resolution: Unresolved
    • 8.5.2
    • None
    • core/other
    • None
    • New

    Description

      When running a prefix query that matches more than 16 terms a BitSet is built for that subquery with the DocIdSetBuilder.
      The cost of the subquery calculated in that class too.
      If I get it right, the idea behind the calculation is to get the frequency of every matched term, sum it all and then divide by the average number of terms per doc. It makes sense but it seems like the sum is not calculated properly:

      public void add(DocIdSetIterator iter) throws IOException {
       if (bitSet != null) {
       bitSet.or(iter);
       return;
       }
       int cost = (int) Math.min(Integer.MAX_VALUE, iter.cost());
       BulkAdder adder = grow(cost);

      Instead of adding the frequency of every term, it stops after the `grow()` creates a bitset. From that moment add() will not call grow and the "counter" variable that stands for the sum of all frequencies will not grow anymore.

      In the end, it can create a prefix query with a full bitset (or almost full) that has a low cost. That query can become a leader in a conjunction query.

      It's important to say that even if the function is fixed as mentioned above, the cost might still result in a much lower cost than it should.
      For example, If we look for all the terms which start with "strin*" in a java_code field. String term and 20 more terms will match that query. The String Term will match most of the documents but the cost will be pretty low because of the calculation (sum_of_frequencies / avarage_number_of_terms_per_doc).

      Maybe we can count every new bit added to the bitset.

      Attachments

        Activity

          People

            Unassigned Unassigned
            Nirf8 Nir Finkelstein
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

              Created:
              Updated: