Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-7627

Improve TermsEnum automaton filtering APIs

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 6.4
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      To filter a TermsEnum by a CompiledAutomaton, we currently have a number of different possibilities:

      • Terms.intersect(CompiledAutomaton, BytesRef) - efficient, but only works on NORMAL type automata
      • CompiledAutomaton.getTerms(Terms) - efficient, works on all automaton types, but requires a Terms instead of a TermsEnum, so no use for eg SortedDocValues.termsEnum()
      • AutomatonTermsEnum - takes a TermsEnum, so it's more general than the Terms methods above, but agian only works on NORMAL automata

      It's easy to do the wrong thing here, and at the moment we only guard against incorrect usage via runtime checks (see eg LUCENE-7576, https://github.com/flaxsearch/marple/issues/24). We should try and clean this up.

      1. LUCENE-7627.patch
        5 kB
        Alan Woodward
      2. LUCENE-7627.patch
        5 kB
        Alan Woodward

        Activity

        Hide
        romseygeek Alan Woodward added a comment -

        The immediate problem we faced in marple can be fixed by adding SortedDocValues#termsEnum(CompiledAutomaton) and SortedSetDocValues#termsEnum(CompiledAutomaton) methods, as in this patch.

        I think the next step could be to address the TODO in CompiledAutomaton#getTermsEnum(Terms), adding an optional startTerm parameter. Then Terms.intersect() can delegate to that method instead of throwing exceptions.

        That leaves AutomatonTermsEnum, which I think can be fixed by making the constructor private, and adding factory methods to do the right thing depending on the automaton type. Plus some javadocs which point out that if you have a Terms, then calling intersect directly on the Terms instance is likely to be more efficient than calling iterator() and passing that to ATE.

        Show
        romseygeek Alan Woodward added a comment - The immediate problem we faced in marple can be fixed by adding SortedDocValues#termsEnum(CompiledAutomaton) and SortedSetDocValues#termsEnum(CompiledAutomaton) methods, as in this patch. I think the next step could be to address the TODO in CompiledAutomaton#getTermsEnum(Terms), adding an optional startTerm parameter. Then Terms.intersect() can delegate to that method instead of throwing exceptions. That leaves AutomatonTermsEnum, which I think can be fixed by making the constructor private, and adding factory methods to do the right thing depending on the automaton type. Plus some javadocs which point out that if you have a Terms, then calling intersect directly on the Terms instance is likely to be more efficient than calling iterator() and passing that to ATE.
        Hide
        romseygeek Alan Woodward added a comment -

        I think the next step could be to address the TODO in CompiledAutomaton#getTermsEnum(Terms)

        Although this might be trickier than it looks, as the contract on Terms is that you get a TermsEnum which will only return terms > startTerm, but it's initially unpositioned - which makes handling Type.ALL complicated...

        Show
        romseygeek Alan Woodward added a comment - I think the next step could be to address the TODO in CompiledAutomaton#getTermsEnum(Terms) Although this might be trickier than it looks, as the contract on Terms is that you get a TermsEnum which will only return terms > startTerm, but it's initially unpositioned - which makes handling Type.ALL complicated...
        Hide
        mikemccand Michael McCandless added a comment -

        Thanks Alan Woodward, I think the patch is good, except maybe we can name the new methods intersect so it's clearer that we are intersecting with the incoming automaton?

        Show
        mikemccand Michael McCandless added a comment - Thanks Alan Woodward , I think the patch is good, except maybe we can name the new methods intersect so it's clearer that we are intersecting with the incoming automaton?
        Hide
        romseygeek Alan Woodward added a comment -

        OK, termsEnum(CompiledAutomaton) is now intersect(CA). Running tests now, I'd like to get this in for 6.4 as it means we can use it in marple quickly. The next steps we can do in a follow-up issue.

        Show
        romseygeek Alan Woodward added a comment - OK, termsEnum(CompiledAutomaton) is now intersect(CA). Running tests now, I'd like to get this in for 6.4 as it means we can use it in marple quickly. The next steps we can do in a follow-up issue.
        Hide
        jira-bot ASF subversion and git services added a comment -

        Commit c2c758bb71e621b1d8c5d8b228b8dfe4ec50acfe in lucene-solr's branch refs/heads/branch_6x from Alan Woodward
        [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c2c758b ]

        LUCENE-7627: Add #intersect(CompiledAutomaton) to Sorted*DocValues

        Show
        jira-bot ASF subversion and git services added a comment - Commit c2c758bb71e621b1d8c5d8b228b8dfe4ec50acfe in lucene-solr's branch refs/heads/branch_6x from Alan Woodward [ https://git-wip-us.apache.org/repos/asf?p=lucene-solr.git;h=c2c758b ] LUCENE-7627 : Add #intersect(CompiledAutomaton) to Sorted*DocValues
        Hide
        romseygeek Alan Woodward added a comment -

        Thanks for the review Mike!

        Show
        romseygeek Alan Woodward added a comment - Thanks for the review Mike!

          People

          • Assignee:
            romseygeek Alan Woodward
            Reporter:
            romseygeek Alan Woodward
          • Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development