Lucene - Core
  1. Lucene - Core
  2. LUCENE-3832

Port BasicAutomata.stringUnion from Brics to Lucene

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-BETA, 5.0
    • Component/s: core/FSTs
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Brics has my code to build Automaton from a set of sorted strings in one step (Daciuk/Mihov's algorithm again). This should be easily portable to Lucene and is quite useful.

      1. LUCENE-3832.patch
        29 kB
        Dawid Weiss
      2. LUCENE-3832.patch
        15 kB
        Dawid Weiss

        Activity

        Hide
        Dawid Weiss added a comment -

        A patch with ported stringunion.

        Show
        Dawid Weiss added a comment - A patch with ported stringunion.
        Hide
        Michael McCandless added a comment -

        +1

        Then we can remove the class from test-framework and cutover all uses to this new one?

        Show
        Michael McCandless added a comment - +1 Then we can remove the class from test-framework and cutover all uses to this new one?
        Hide
        Dawid Weiss added a comment -

        Huh? Sorry, I don't get you – what do you mean?

        Show
        Dawid Weiss added a comment - Huh? Sorry, I don't get you – what do you mean?
        Hide
        Uwe Schindler added a comment -

        In test-framework is a helper method to build a Daciuk/Mihov automaton.

        With the patch, this somehow makes TermsFilter from contrib obsolete or should we maybe port that to an AutomatonQuery / MTQWF(AutomatonQuery)? If it simply subclasses AQ it could be more performant if the array has terms which are following each other in terms index.

        Show
        Uwe Schindler added a comment - In test-framework is a helper method to build a Daciuk/Mihov automaton. With the patch, this somehow makes TermsFilter from contrib obsolete or should we maybe port that to an AutomatonQuery / MTQWF(AutomatonQuery)? If it simply subclasses AQ it could be more performant if the array has terms which are following each other in terms index.
        Hide
        Michael McCandless added a comment -

        Sorry, what I meant was: we already have (and use, from tests only) this algorithm, in lucene/test-framework/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java.

        I agree we should promote it to core: it seems quite useful!

        Actually I think there are slight differences vs the attached patch (looks like Robert cutover to CharsRef/BytesRef), so I guess we need to reconcile those... or maybe just move the existing one from test-framework to StringUnionOperations (if there are no important differences ).

        Show
        Michael McCandless added a comment - Sorry, what I meant was: we already have (and use, from tests only) this algorithm, in lucene/test-framework/src/java/org/apache/lucene/util/automaton/DaciukMihovAutomatonBuilder.java . I agree we should promote it to core: it seems quite useful! Actually I think there are slight differences vs the attached patch (looks like Robert cutover to CharsRef/BytesRef), so I guess we need to reconcile those... or maybe just move the existing one from test-framework to StringUnionOperations (if there are no important differences ).
        Hide
        Dawid Weiss added a comment -

        I don't know if the original version is a derivative of what I placed in the patch or not (I wrote the one originally in brics). There will be ordering differences between the one based on char and UTF8 so it's not exactly the same; somebody should perhaps look at it from a wider perspective. If you have spare cycles, feel free to take over – this was really a 5 minute effort and I can't take a deeper look at it at the moment.

        Show
        Dawid Weiss added a comment - I don't know if the original version is a derivative of what I placed in the patch or not (I wrote the one originally in brics). There will be ordering differences between the one based on char and UTF8 so it's not exactly the same; somebody should perhaps look at it from a wider perspective. If you have spare cycles, feel free to take over – this was really a 5 minute effort and I can't take a deeper look at it at the moment.
        Hide
        Robert Muir added a comment -

        The version in test-framework was something i quickly hacked together
        (converted Dawid's implementation to binary order) because some tests
        added for blocktree (TestTermsEnum?) were really slow without it.

        Show
        Robert Muir added a comment - The version in test-framework was something i quickly hacked together (converted Dawid's implementation to binary order) because some tests added for blocktree (TestTermsEnum?) were really slow without it.
        Hide
        Dawid Weiss added a comment -

        We should then just move Robert's port instead of what I pulled from Brics. UTF8 order is omnipresent in Lucene so it should really be that (possibly with a method accepting an iterator/iterable of char sequences and sorting internally?).

        Show
        Dawid Weiss added a comment - We should then just move Robert's port instead of what I pulled from Brics. UTF8 order is omnipresent in Lucene so it should really be that (possibly with a method accepting an iterator/iterable of char sequences and sorting internally?).
        Hide
        Michael McCandless added a comment -

        We should then just move Robert's port instead of what I pulled from Brics.

        +1, as long as we can reconcile that the diffs are not "meaningful" Maybe you fixed a bug since we added this to Lucene!

        EG a small diff is your version does the .sort() for the caller, but Lucene's version requires caller do the .sort()...

        Show
        Michael McCandless added a comment - We should then just move Robert's port instead of what I pulled from Brics. +1, as long as we can reconcile that the diffs are not "meaningful" Maybe you fixed a bug since we added this to Lucene! EG a small diff is your version does the .sort() for the caller, but Lucene's version requires caller do the .sort()...
        Hide
        Robert Muir added a comment -

        removing 3.6, automaton package doesnt exist there.

        Show
        Robert Muir added a comment - removing 3.6, automaton package doesnt exist there.
        Hide
        Dawid Weiss added a comment -

        A patch moving DaciukMihovAutomatonBuilder to automata package and hiding it effectively under BasicAutomata.makeStringUnion.

        Minor cleanups.

        Show
        Dawid Weiss added a comment - A patch moving DaciukMihovAutomatonBuilder to automata package and hiding it effectively under BasicAutomata.makeStringUnion. Minor cleanups.
        Hide
        Dawid Weiss added a comment -

        Passes tests for me, I'll commit it in soon if there are no objections.

        Show
        Dawid Weiss added a comment - Passes tests for me, I'll commit it in soon if there are no objections.
        Hide
        Robert Muir added a comment -

        I don't like the one that takes CharSequence and makes a new arraylist and sorts

        Show
        Robert Muir added a comment - I don't like the one that takes CharSequence and makes a new arraylist and sorts
        Hide
        Dawid Weiss added a comment -

        I don't like many things. For example the fact that it accepts UTF-8 but returns an automaton with codepoints on transitions – this isn't intuitive

        I think that method fits well with other methods in that class (accepting strings). Maybe it's a good idea to cutover to utf8 entirely but just having Collection<utf8> and everything else on char sequences seems odd to me.

        Show
        Dawid Weiss added a comment - I don't like many things. For example the fact that it accepts UTF-8 but returns an automaton with codepoints on transitions – this isn't intuitive I think that method fits well with other methods in that class (accepting strings). Maybe it's a good idea to cutover to utf8 entirely but just having Collection<utf8> and everything else on char sequences seems odd to me.
        Hide
        Robert Muir added a comment -

        For example the fact that it accepts UTF-8 but returns an automaton with codepoints on transitions – this isn't intuitive

        Whats unintuitive? UTF-8 and UTF-32 share the same order.

        Show
        Robert Muir added a comment - For example the fact that it accepts UTF-8 but returns an automaton with codepoints on transitions – this isn't intuitive Whats unintuitive? UTF-8 and UTF-32 share the same order.
        Hide
        Dawid Weiss added a comment -

        You pass an array of bytes and get codepoint (ints) as transitions.

        Show
        Dawid Weiss added a comment - You pass an array of bytes and get codepoint (ints) as transitions.
        Hide
        Robert Muir added a comment -

        So how is passing an array of shorts better?

        Show
        Robert Muir added a comment - So how is passing an array of shorts better?
        Hide
        Dawid Weiss added a comment -

        My whining doesn't mean I want to change this because I realize most of the data in Lucene already comes as UTF8 and it wouldn't make sense to convert it back and forth. This doesn't change my feeling that it isn't intuitive when you look at the method's signature.

        Of course it's explained in the javadoc what it actually does, but so is conversion/ sorting in that other method that you don't like (and that I think is generally useful considering the rest of this package).

        Show
        Dawid Weiss added a comment - My whining doesn't mean I want to change this because I realize most of the data in Lucene already comes as UTF8 and it wouldn't make sense to convert it back and forth. This doesn't change my feeling that it isn't intuitive when you look at the method's signature. Of course it's explained in the javadoc what it actually does, but so is conversion/ sorting in that other method that you don't like (and that I think is generally useful considering the rest of this package).
        Hide
        Dawid Weiss added a comment -

        It'd be possible to avoid the sort and still make makeUnionOfStrings(Collection<String>) possible if we also exposed Utf16AsUtf8 comparator... Then the automaton builder could accept either byte[] or char[] and just do its job without any additional overhead of copying.

        Ok, I'll remove that method if you don't like it. If there's a need we can revert it using the trick above.

        Show
        Dawid Weiss added a comment - It'd be possible to avoid the sort and still make makeUnionOfStrings(Collection<String>) possible if we also exposed Utf16AsUtf8 comparator... Then the automaton builder could accept either byte[] or char[] and just do its job without any additional overhead of copying. Ok, I'll remove that method if you don't like it. If there's a need we can revert it using the trick above.
        Hide
        Hoss Man added a comment -

        hoss20120711-manual-post-40alpha-change

        Show
        Hoss Man added a comment - hoss20120711-manual-post-40alpha-change

          People

          • Assignee:
            Dawid Weiss
            Reporter:
            Dawid Weiss
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development