Lucy
  1. Lucy
  2. LUCY-183

Eliminate spurious "extra" query normalization

    Details

    • Type: Bug Bug
    • Status: Resolved
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: 0.1.0 (incubating), 0.2.0 (incubating), 0.2.1 (incubating)
    • Component/s: Search
    • Labels:
      None

      Description

      Query normalization is supposed to scale all scores uniformly by a simple
      multiplier, but the child nodes in complex queries are presently getting
      "extra" normalization applied to them. This has the effect of scaling
      different subqueries by different amounts, changing the balance of the
      subqueries within a complex query, interfering with IDF weighting and subtly
      degrading relevancy.

      1. normalization.patch
        25 kB
        Marvin Humphrey

        Activity

        Hide
        Marvin Humphrey added a comment -

        The supplied "normalize.patch" file adds a boolean "subordinate" parameter to
        Query_Make_Compiler() which defaults to false and indicates whether a Query is
        a child of another. If "subordinate" is indeed false (because the Query is a
        top-most node), then Query_Make_Compiler() implementations are now supposed to
        invoke Compiler_Normalize() on the newly created Compiler object.

        Giving Make_Compiler() an extra parameter and responsibility for invoking
        Normalize() is technically an API change, but it should have little impact on
        code in the wild. The only query types with ranking affected by this bug are
        those with child nodes, such as ANDQuery, ORQuery and RequiredOptionalQuery,
        but I'm unaware of anyone who has subclassed those. Single-node Query
        subclasses (e.g. LucyX::Search::WildCardQuery) will need to have their
        normalize() calls moved from their Compiler constructors to make_compiler(),
        but IDF for a WildCardQuery is an imprecise notion to begin with.

        This patch is also safe for any existing Searcher subclasses or other classes
        which currently invoke Make_Compiler() – the default value of "false" for
        "subordinate" is correct for such situations.

        The patch applies cleanly against both trunk and the 0.2 branch; I plan to
        commit it against both in a day or so.

        Show
        Marvin Humphrey added a comment - The supplied "normalize.patch" file adds a boolean "subordinate" parameter to Query_Make_Compiler() which defaults to false and indicates whether a Query is a child of another. If "subordinate" is indeed false (because the Query is a top-most node), then Query_Make_Compiler() implementations are now supposed to invoke Compiler_Normalize() on the newly created Compiler object. Giving Make_Compiler() an extra parameter and responsibility for invoking Normalize() is technically an API change, but it should have little impact on code in the wild. The only query types with ranking affected by this bug are those with child nodes, such as ANDQuery, ORQuery and RequiredOptionalQuery, but I'm unaware of anyone who has subclassed those. Single-node Query subclasses (e.g. LucyX::Search::WildCardQuery) will need to have their normalize() calls moved from their Compiler constructors to make_compiler(), but IDF for a WildCardQuery is an imprecise notion to begin with. This patch is also safe for any existing Searcher subclasses or other classes which currently invoke Make_Compiler() – the default value of "false" for "subordinate" is correct for such situations. The patch applies cleanly against both trunk and the 0.2 branch; I plan to commit it against both in a day or so.
        Hide
        David E. Wheeler added a comment -

        The fix seems to have broken something else. See this test failure, especially this:

        
        #   Failed test 'Should have results for simple search'
        #   at t/base.t line 274.
        #     Structures begin differing at:
        #          $got->{hits}[0]{excerpt} = 'This is the <strong>pair</strong><strong> README file. Here you will find all thingds related to </strong><strong>pair</strong>, including installation information'
        #     $expected->{hits}[0]{excerpt} = 'This is the <strong>pair</strong> README file. Here you will find all thingds related to <strong>pair</strong>, including installation information'
        
        Show
        David E. Wheeler added a comment - The fix seems to have broken something else. See this test failure , especially this: # Failed test 'Should have results for simple search' # at t/base.t line 274. # Structures begin differing at: # $got->{hits}[0]{excerpt} = 'This is the <strong>pair</strong><strong> README file. Here you will find all thingds related to </strong><strong>pair</strong>, including installation information' # $expected->{hits}[0]{excerpt} = 'This is the <strong>pair</strong> README file. Here you will find all thingds related to <strong>pair</strong>, including installation information'
        Hide
        Marvin Humphrey added a comment -

        That snippet shows a problem with highlighting, which is almost certainly
        related to LUCY-182 rather than this issue.

        The test failure in question also includes score differences, which
        are indeed likely a result of the changes made in this issue. However,
        scores should be more accurate now, so there's shouldn't be a need
        to reopen. The downstream test failures are regrettable, but the exact
        scores generated by Lucy are not part of our public API.

        In contrast, it looks like we're going to have to reopen LUCY-182.

        Show
        Marvin Humphrey added a comment - That snippet shows a problem with highlighting, which is almost certainly related to LUCY-182 rather than this issue. The test failure in question also includes score differences, which are indeed likely a result of the changes made in this issue. However, scores should be more accurate now, so there's shouldn't be a need to reopen. The downstream test failures are regrettable, but the exact scores generated by Lucy are not part of our public API. In contrast, it looks like we're going to have to reopen LUCY-182 .
        Hide
        David E. Wheeler added a comment -

        Yeah, I've fixed the issue with scores here.

        Show
        David E. Wheeler added a comment - Yeah, I've fixed the issue with scores here .
        Hide
        David E. Wheeler added a comment -

        And sorry for bitching on the wrong ticket.

        Show
        David E. Wheeler added a comment - And sorry for bitching on the wrong ticket.

          People

          • Assignee:
            Marvin Humphrey
            Reporter:
            Marvin Humphrey
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development