Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.0
    • Component/s: core/query/scoring
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      wildcardquery has logic to rewrite to termquery if there is no wildcard character, but

      • it needs to pass along the boost if it does this
      • if the user asked for a 'constant score' rewriteMethod, it should rewrite to a constant score query for consistency.

      additionally, if the query is really a prefixquery, it would be nice to rewrite to prefix query.
      both will enumerate the same number of terms, but prefixquery has a simpler comparison function.

      1. LUCENE-1951_bwcompatbranch.patch
        0.9 kB
        Robert Muir
      2. LUCENE-1951.patch
        9 kB
        Robert Muir
      3. LUCENE-1951.patch
        6 kB
        Robert Muir

        Activity

        Hide
        Robert Muir added a comment -

        here are some stats for rewriting wildcards that should be prefix.
        i query on a field with about ~10M numeric terms (a unique database id), average length 10 characters or so.
        i copied this into ramdirectory to try to rule out i/o a bit (its only 1GB index and i use 4GB heap)

        I look for all the ones starting with "1" (about 1.5 million of these). I did 3 runs, 100 queries each.
        here are average times for each.

        Run wildcardquery("1*") prefixquery("1")
        1 1181ms 973ms
        2 1179ms 966ms
        3 1079ms 963ms

        So, its not a big optimization, but seems consistent, and maybe more important if avg term length is longer: in this case wildcard's comparison function might have to do even more work.

        I'll work on a patch to fix the boost/constant score and include a prefixquery rewrite for this case.

        Show
        Robert Muir added a comment - here are some stats for rewriting wildcards that should be prefix. i query on a field with about ~10M numeric terms (a unique database id), average length 10 characters or so. i copied this into ramdirectory to try to rule out i/o a bit (its only 1GB index and i use 4GB heap) I look for all the ones starting with "1" (about 1.5 million of these). I did 3 runs, 100 queries each. here are average times for each. Run wildcardquery("1*") prefixquery("1") 1 1181ms 973ms 2 1179ms 966ms 3 1079ms 963ms So, its not a big optimization, but seems consistent, and maybe more important if avg term length is longer: in this case wildcard's comparison function might have to do even more work. I'll work on a patch to fix the boost/constant score and include a prefixquery rewrite for this case.
        Hide
        Robert Muir added a comment -

        patch for these issues.
        note: the existing TestWildCardQuery.testTermWithoutWildcard had a bad test, and needs to be fixed for bw-compat branch.

        it did the following:

        Query wq = new WildcardQuery(new Term("field", "nowildcard"));
        wq = searcher.rewrite(wq);
        assertTrue(wq instanceof TermQuery);
        

        this is not correct, it should only be TermQuery when rewriteMethod is SCORING_BOOLEAN_QUERY_REWRITE. and this is not the default, constant score is.

        easiest way to fix the old test is to setRewriteMethod(SCORING_BOOLEAN_QUERY_REWRITE), its the only time it should rewrite to TermQuery.

        Show
        Robert Muir added a comment - patch for these issues. note: the existing TestWildCardQuery.testTermWithoutWildcard had a bad test, and needs to be fixed for bw-compat branch. it did the following: Query wq = new WildcardQuery( new Term( "field" , "nowildcard" )); wq = searcher.rewrite(wq); assertTrue(wq instanceof TermQuery); this is not correct, it should only be TermQuery when rewriteMethod is SCORING_BOOLEAN_QUERY_REWRITE. and this is not the default, constant score is. easiest way to fix the old test is to setRewriteMethod(SCORING_BOOLEAN_QUERY_REWRITE), its the only time it should rewrite to TermQuery.
        Hide
        Robert Muir added a comment -

        patch to the back compat branch to fix the buggy wildcard rewrite test.

        Show
        Robert Muir added a comment - patch to the back compat branch to fix the buggy wildcard rewrite test.
        Hide
        Michael McCandless added a comment -

        Patch looks good, thanks Robert! And those are good perf numbers;
        rewriting to PrefixQuery seems a clear win.

        The only thing that makes me nervous here is we've baked-in MTQ's
        rewrite logic into WildcardQuery.rewrite. Ie, MTQ in general accepts
        any rewrite method, and so conceivably one could create their own
        rewrite method and then see that it's unused in the special case where
        WildcardQuery is a single term.

        And while it's true today that if the rewrite method != scoring
        boolean query, it must be a constant scoring one, that could
        conceivably some day change.

        Maybe a different approach would be to make a degenerate
        "SingleTermEnum" (subclasses FilteredTermEnum) that produces only a
        single term? Then in getEnum we could return that, instead, so the
        rewrite method remains intact?

        Show
        Michael McCandless added a comment - Patch looks good, thanks Robert! And those are good perf numbers; rewriting to PrefixQuery seems a clear win. The only thing that makes me nervous here is we've baked-in MTQ's rewrite logic into WildcardQuery.rewrite. Ie, MTQ in general accepts any rewrite method, and so conceivably one could create their own rewrite method and then see that it's unused in the special case where WildcardQuery is a single term. And while it's true today that if the rewrite method != scoring boolean query, it must be a constant scoring one, that could conceivably some day change. Maybe a different approach would be to make a degenerate "SingleTermEnum" (subclasses FilteredTermEnum) that produces only a single term? Then in getEnum we could return that, instead, so the rewrite method remains intact?
        Hide
        Robert Muir added a comment -

        Michael, I thought about this problem too, but didnt know what to do.

        I rather like the SingleTermEnum idea. I'll do it.

        Show
        Robert Muir added a comment - Michael, I thought about this problem too, but didnt know what to do. I rather like the SingleTermEnum idea. I'll do it.
        Hide
        Robert Muir added a comment -

        think there would be objection to making this proposed SingleTermEnum public?

        I would like to use it in LUCENE-1606 (contrib) to have consistency there as well.

        Show
        Robert Muir added a comment - think there would be objection to making this proposed SingleTermEnum public? I would like to use it in LUCENE-1606 (contrib) to have consistency there as well.
        Hide
        Michael McCandless added a comment -

        think there would be objection to making this proposed SingleTermEnum public?

        I think that's fine.

        Show
        Michael McCandless added a comment - think there would be objection to making this proposed SingleTermEnum public? I think that's fine.
        Hide
        Robert Muir added a comment -

        updated patch, using SingleTermEnum instead of TermQuery rewrite when there are no wildcards to preserve all the MultiTermQuery semantics.

        Show
        Robert Muir added a comment - updated patch, using SingleTermEnum instead of TermQuery rewrite when there are no wildcards to preserve all the MultiTermQuery semantics.
        Hide
        Michael McCandless added a comment -

        Patch looks good Robert! Thanks. I'll commit soon.

        Show
        Michael McCandless added a comment - Patch looks good Robert! Thanks. I'll commit soon.
        Hide
        Robert Muir added a comment -

        Michael, cool. The bw_compat patch is still valid with these changes.

        I will mention one concern, just for the record (you can tell me if it is an issue).

        These tests test that for example, a WildcardQuery with SCORING_REWRITE rewrites to a TermQuery, which is correct, but now its a bit wierd how this happens.

        SingleTermEnum -> MultiTermQuery -> BooleanQuery with one term -> TermQuery.

        I couldnt think of a better way to test the correct behavior, but it is testing a bit more than just what happens in WildcardQuery...

        Show
        Robert Muir added a comment - Michael, cool. The bw_compat patch is still valid with these changes. I will mention one concern, just for the record (you can tell me if it is an issue). These tests test that for example, a WildcardQuery with SCORING_REWRITE rewrites to a TermQuery, which is correct, but now its a bit wierd how this happens. SingleTermEnum -> MultiTermQuery -> BooleanQuery with one term -> TermQuery. I couldnt think of a better way to test the correct behavior, but it is testing a bit more than just what happens in WildcardQuery...
        Hide
        Michael McCandless added a comment -

        That is a rather roundabout way to arrive at the TermQuery, but I think the test is fine?

        Show
        Michael McCandless added a comment - That is a rather roundabout way to arrive at the TermQuery, but I think the test is fine?
        Hide
        Robert Muir added a comment -

        That is a rather roundabout way to arrive at the TermQuery, but I think the test is fine?

        Ok, that was my only concern, the test. I like the SingleTermEnum otherwise, I think it will reduce maintenance.

        Show
        Robert Muir added a comment - That is a rather roundabout way to arrive at the TermQuery, but I think the test is fine? Ok, that was my only concern, the test. I like the SingleTermEnum otherwise, I think it will reduce maintenance.
        Hide
        Michael McCandless added a comment -

        Thanks Robert!

        Show
        Michael McCandless added a comment - Thanks Robert!

          People

          • Assignee:
            Michael McCandless
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development