Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.1, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff/blocker for LUCENE-5879.

      It seems like PrefixQuery should "simply" be an AutomatonQuery rather than specializing its own TermsEnum ... with maybe some performance improvements to ByteRunAutomaton.run to short-circuit once it's in a "sink state", AutomatonTermsEnum could be just as fast as PrefixTermsEnum.

      If we can do this it will make LUCENE-5879 simpler.

      1. LUCENE-6367.patch
        31 kB
        Michael McCandless
      2. LUCENE-6367.patch
        21 kB
        Michael McCandless

        Issue Links

          Activity

          Hide
          Michael McCandless added a comment -

          Patch, cutting over PrefixQuery to AutomatonQuery and removing
          PrefixTermsEnum.

          I explored the optimization of having Byte/CharRunAutomaton.run
          optimize (short-circuit) when you're in a sink state but it became
          quite difficult/invasive fixing all callers of .step to handle this.
          With LUCENE-5879 we also need to know the sink state under-the-hood,
          but that's separate from fixing .run to make use of it.

          So I backed out that opto and tried just doing the PrefixQuery cutover
          without optimizing for sink states. I'm running a perf test w/
          luceneutil and it looks like the impact is trivial (well within
          noise). Net/net I think it's fine to "just cutover" without the
          invasive opto?

          I also changed PrefixQuery's semantics to apply to full binary space
          terms, not just UTF-8 space. While this is technically a change in
          behavior, it won't impact users who index only unicode terms. It's
          also necessary for LUCENE-5879, because if prefixing is done only in
          unicode space (like today), then the resulting binary space automaton
          will not have a sink state and auto-prefix can't apply.

          If this part is somehow controversial I can revert and try to do it
          only with LUCENE-5879 instead... if it's OK, I'll add some tests
          showing that PrefixQuery on binary terms works.

          Show
          Michael McCandless added a comment - Patch, cutting over PrefixQuery to AutomatonQuery and removing PrefixTermsEnum. I explored the optimization of having Byte/CharRunAutomaton.run optimize (short-circuit) when you're in a sink state but it became quite difficult/invasive fixing all callers of .step to handle this. With LUCENE-5879 we also need to know the sink state under-the-hood, but that's separate from fixing .run to make use of it. So I backed out that opto and tried just doing the PrefixQuery cutover without optimizing for sink states. I'm running a perf test w/ luceneutil and it looks like the impact is trivial (well within noise). Net/net I think it's fine to "just cutover" without the invasive opto? I also changed PrefixQuery's semantics to apply to full binary space terms, not just UTF-8 space. While this is technically a change in behavior, it won't impact users who index only unicode terms. It's also necessary for LUCENE-5879 , because if prefixing is done only in unicode space (like today), then the resulting binary space automaton will not have a sink state and auto-prefix can't apply. If this part is somehow controversial I can revert and try to do it only with LUCENE-5879 instead... if it's OK, I'll add some tests showing that PrefixQuery on binary terms works.
          Hide
          Robert Muir added a comment -

          i think the boolean is fine. For the other subclasses, they are taking "patterns" and inspecting characters for that reason, so thats different.

          Show
          Robert Muir added a comment - i think the boolean is fine. For the other subclasses, they are taking "patterns" and inspecting characters for that reason, so thats different.
          Hide
          Michael McCandless added a comment -

          New patch, adding tests for binary terms to PrefixQuery.

          I also added a new Operations.getSingleton, to work for both unicode
          and binary automata, and fixed CompiledAutomaton to use that when it's
          trying to "simplify" the incoming automaton.

          Show
          Michael McCandless added a comment - New patch, adding tests for binary terms to PrefixQuery. I also added a new Operations.getSingleton, to work for both unicode and binary automata, and fixed CompiledAutomaton to use that when it's trying to "simplify" the incoming automaton.
          Hide
          ASF subversion and git services added a comment -

          Commit 1669522 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1669522 ]

          LUCENE-6367: PrefixQuery now subclasses AutomatonQuery

          Show
          ASF subversion and git services added a comment - Commit 1669522 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1669522 ] LUCENE-6367 : PrefixQuery now subclasses AutomatonQuery
          Hide
          ASF subversion and git services added a comment -

          Commit 1669526 from Michael McCandless in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1669526 ]

          LUCENE-6367: PrefixQuery now subclasses AutomatonQuery

          Show
          ASF subversion and git services added a comment - Commit 1669526 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1669526 ] LUCENE-6367 : PrefixQuery now subclasses AutomatonQuery
          Hide
          ASF subversion and git services added a comment -

          Commit 1669577 from Michael McCandless in branch 'dev/trunk'
          [ https://svn.apache.org/r1669577 ]

          LUCENE-6367: correct CHANGES entry: PrefixQuery already operated in binary term space

          Show
          ASF subversion and git services added a comment - Commit 1669577 from Michael McCandless in branch 'dev/trunk' [ https://svn.apache.org/r1669577 ] LUCENE-6367 : correct CHANGES entry: PrefixQuery already operated in binary term space
          Hide
          ASF subversion and git services added a comment -

          Commit 1669578 from Michael McCandless in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1669578 ]

          LUCENE-6367: correct CHANGES entry: PrefixQuery already operated in binary term space

          Show
          ASF subversion and git services added a comment - Commit 1669578 from Michael McCandless in branch 'dev/branches/branch_5x' [ https://svn.apache.org/r1669578 ] LUCENE-6367 : correct CHANGES entry: PrefixQuery already operated in binary term space
          Hide
          Timothy Potter added a comment -

          Bulk close after 5.1 release

          Show
          Timothy Potter added a comment - Bulk close after 5.1 release

            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development