Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 5.0
    • Fix Version/s: 5.3, 6.0
    • Component/s: core/other
    • Labels:
    • Lucene Fields:
      New, Patch Available

      Description

      Replaced Operations.getFiniteStrings() with an optimized FiniteStringsIterator.

      Benefits:
      Avoid huge hash set of finite strings.
      Avoid massive object/array creation during processing.

      "Downside":
      Iteration order changed, so when iterating with a limit, the result may differ slightly. Old: emit the current node (if accepted), then recurse. New: recurse, then emit the current node (if accepted).

      The old method Operations.getFiniteStrings() still exists, because it simplifies the tests. It is now implemented using the new FiniteStringsIterator.
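
      The iteration style described above can be sketched with a toy stand-in. This is a minimal, self-contained illustration of the null-terminated next() pattern (no hasNext() look-ahead); ToyFiniteStringsIterator and its String elements are hypothetical names for illustration, not the real Lucene classes, which work on IntsRef.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Toy stand-in for the real FiniteStringsIterator: next() returns null
// at the end of iteration instead of offering a hasNext() look-ahead.
class ToyFiniteStringsIterator {
    private final Deque<String> pending;

    ToyFiniteStringsIterator(List<String> finiteStrings) {
        this.pending = new ArrayDeque<>(finiteStrings);
    }

    /** Returns the next finite string, or null when iteration is exhausted. */
    String next() {
        return pending.poll(); // no look-ahead, so internal buffers could be reused
    }
}
```

      Callers then loop in the style discussed later in this thread: `for (String s; (s = it.next()) != null; ) { ... }`.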

      1. FiniteStrings_noreuse.patch
        57 kB
        Markus Heiden
      2. FiniteStrings_reuse.patch
        58 kB
        Markus Heiden
      3. LUCENE-6365.patch
        56 kB
        Michael McCandless

        Activity

        Markus Heiden added a comment -

        Patch containing the proposed changes.

        Dawid Weiss added a comment -

        I looked at the patch (briefly, without delving into all the details – I'm on vacation).

        I like the idea overall. The only thing that bugs me is the explicitness of FiniteStringsIterator. I would rather have something like:

        Iterator<IntsRef> Operations.iterateFiniteStrings(...)
        

        so that we can leave FiniteStringsIterator as a package-private implementation and not proliferate it around. If Iterator interface gives you a headache (which it can) then I'd leave FiniteStringsIterator as an explicit return type but leave the factory method inside Operations.

        The assignments inside the conditional part of for loops are very likely to drive somebody crazy sooner or later.

        Again, I only looked at the patch, I don't even have a proper environment with me to check on all the details.

        Markus Heiden added a comment -

        I prefer a "standard" Iterator too. But implementing it would cost my implementation some of its benefits, because an iterator needs a look-ahead for hasNext().

        I chose the current implementation because:
        1) I can reuse the iterator for different iterations (similar to Transition); see AnalyzingSuggester. That way the internal stack does not need to be reallocated, which avoids not just the initial allocation but also the resizes during iteration. BTW: may we choose a bigger initial size for the stack?
        2) By avoiding a look-ahead, my implementation can return the internal IntsRefs (from the IntsRefBuilder used) in next(), without the need for a deep copy.

        Robert Muir added a comment -

        BTW: May we choose a bigger initial size for the stack?

        No, as a library this is too much: it impacts too many uses and makes the library too difficult to use (special JVM flags must be supplied).

        Markus Heiden added a comment -

        I was talking about the stack-like structure inside the iterator, not the JVM stack size. This stack holds the PathNodes of the current finite string. It has an initial size of just 4, so it has to be dynamically resized in almost all use cases.

        Michael McCandless added a comment -

        I like this change overall ... it means you can iterate over many more finite strings if you want. I think a dedicated iterator is OK; just mark it @lucene.experimental so we are free to improve it later.

        Iteration order changed, so when iterating with a limit, the result may differ slightly.

        I think that's fine and it's an impl detail (which strings you'll get when you hit the limit) ... can you update the javadocs to say so?

        Dawid Weiss added a comment -

        Oops, apologies Markus, this slipped my agenda somehow. As for implementing Iterator – OK, you don't have to stick to Java's one, I understand the look-ahead is hairy. But I'd still change those for loops with compound assignment/comparison to something less cryptic (like a simple while loop with the variable moved out). The JIT doesn't care about it anyway, I bet (variables undergo liveness analysis).

        Markus Heiden added a comment - edited

        Are you talking about this?

        for (IntsRef finiteString; (finiteString = iterator.next()) != null;)
        

        For me it is the standard iteration pattern for non-look-ahead iterations, e.g. iterating over an input stream (see FileCopyUtils in the Spring Framework).

        Does this one look better for you?

        for (IntsRef finiteString = iterator.next(); finiteString != null; finiteString = iterator.next())
        

        I like my version better, because it is shorter and iterator.next() is not duplicated, but I will use yours if you prefer it.

        A simple while loop looks even more bloated to me. It unnecessarily widens the scope of finiteString and splits things that belong together, both of which are error-prone:

        IntsRef finiteString = iterator.next();
        while (finiteString != null) {
           // do something
        
           finiteString = iterator.next();
        }
        

        Something different:
        I marked Operations.getFiniteStrings() as deprecated in my patch, because it should be replaced by the new iterator. But I am considering removing the deprecation, because this method is easier to use for single iterations over small finite-string sets and makes some test cases simpler. What do you think?

        Again something different:
        What about the initial stack size in the new iterator (which needs to be at least as big as the maximum length of the iterated finite strings)? May I raise it from 4 to e.g. 16? In my estimation that would suffice for roughly 90% of all cases.

        Dawid Weiss added a comment -
        for (IntsRef finiteString; (finiteString = iterator.next()) != null;)
        

        Many people have strong feelings about assignments in conditional expressions. In fact I just recently stumbled upon an Eclipse JDT refactoring bug that was evaluating these (and refactoring these) incorrectly. My comment was actually meant to go together with the "why don't we make it an iterator" one... If you did that then the problem of what kind of loop it is pretty much disappears.

        Anyway, this is really minor and a matter of style rather than correctness. It can go in as-is.

        I marked Operations.getFiniteStrings() as deprecated in my patch [...] What do you think?

        No strong opinion. If it's only used from tests then you can mark it as deprecated I guess; no need to support redundant code.

        May I raise it from 4 to e.g. 16?

        It very likely won't matter in practice at all. I think increasing it to 16 won't do anybody any harm (you could try to squeeze it into a single cache line, but I think that's overkill and premature optimization; it'll vanish in other processing noise).

        Michael McCandless added a comment -

        Can you add in the javadocs of the new iterator that the order that accepted strings are returned is implementation dependent and free to change across releases?

        I marked Operations.getFiniteStrings() as deprecated

        Can we simply remove it? The Operations methods are experimental and dangerous ... it need not live on?

        Markus Heiden added a comment - edited

        Sorry for the delay.

        I moved Operations.getFiniteStrings() to TestOperations.getFiniteStrings(), because using the iterator for assertions is a pain. Production code using this method has been replaced by direct usage of the new iterator.

        I got one problem with that:
        I am not sure if the implementation change of CompletionTokenStream is OK, because I set the position attribute at the end of the iteration instead of at the start of the iteration. The tests run fine, but someone should review that.

        I marked the new iterator as @lucene.experimental and added a comment that the iteration order may change.

        Michael McCandless added a comment -

        Thanks Markus Heiden, new patch looks great.

        Can we remove the limit from FiniteStringsIterator.init? Seems like this ("abort iteration after N items") should be the caller's job?

        Can we just pass the automaton to FSI's ctor? I don't think we need a reuse API here...

        I am not sure if the implementation change of CompletionTokenStream is OK, because I set the position attribute at the end of the iteration instead of at the start of the iteration. The tests run fine, but someone should review that.

        It is weird that CompletionTokenStream hijacks PositionIncrementAttribute like that, and I can't see anywhere that reads from that (and indeed tests pass if I comment it out). Maybe Areek Zillur knows? I think we should just remove it?

        Areek Zillur added a comment -

        +1 to removing PositionIncrementAttr from CompletionTokenStream.

        Markus Heiden added a comment -

        Removed PositionIncrementAttribute from CompletionTokenStream.

        Markus Heiden added a comment -

        Removing init(): One of my design goals was to be able to reuse this iterator, mainly to optimize the performance of the AnalyzingSuggester. The gain mainly comes from avoiding the repeated step-by-step allocation of the stack inside the iterator. So maybe I can provide an additional constructor that combines the default constructor with the init() method?

        Removing limit support: I like the idea, but as I tried to implement the change, it turned out there are some downsides: at least AnalyzingSuggester and CompletionTokenStream use this feature. Especially annoying is changing the implementation of CompletionTokenStream, because it has no for loop. Maybe I can add a second init() method without a limit, so that -1 need not be passed each time no limitation is needed?

        Dawid Weiss added a comment -

        > mainly to optimize the performance of the AnalyzingSuggester. This mainly comes from avoiding the step by step allocation of the stack inside the iterator over and over again.

        I wouldn't worry about it, Markus. With a TLAB these allocations are fairly cheap. You could just add a reasonably large initial stack and that's it. If it needs to reallocate, so be it. Unless it becomes a real bottleneck in real life (which it won't, I assure you), there is no need for premature optimizations at the cost of making the code more complex.

        Markus Heiden added a comment - edited

        For the normal use case I agree. But I had problems with long build times of lookups for big dictionaries (using the AnalyzingSuggester). I profiled the creation of the lookups, and one hotspot was the allocation of the internal stack. One problem is that its initial size is too small (4 entries), so the internal stack gets resized over and over again. I will increase its initial size to 16.
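
        The resize cost described here is the classic amortized-doubling pattern; a minimal self-contained sketch follows. GrowableIntStack is a hypothetical name for illustration, not the actual Lucene structure (which holds PathNodes rather than ints).

```java
import java.util.Arrays;

// Growable stack with a larger initial capacity, as proposed above:
// starting at 16 instead of 4 avoids regrowth for most finite strings.
class GrowableIntStack {
    private int[] items = new int[16];
    private int size;

    void push(int v) {
        if (size == items.length) {
            // Doubling gives amortized O(1) pushes, but each regrowth
            // still allocates and copies - hence the bigger initial size.
            items = Arrays.copyOf(items, items.length * 2);
        }
        items[size++] = v;
    }

    int pop() {
        return items[--size];
    }

    int size() {
        return size;
    }
}
```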

        Michael McCandless added a comment -

        I would really prefer not to make API compromises (reuse, init method) for such optimizations, nor for the "limit" case (this is really the caller's responsibility...).

        Especially annoying is to change the implementation of CompletionTokenStream, because it has no for loop.

        It's fine to just add a member to the class tracking how many strings it has pulled so far from the iterator...
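
        The caller-side counting suggested here might look like the following sketch. The types and the name firstN are illustrative assumptions (a plain Queue stands in for the real iterator), not the actual CompletionTokenStream code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

class CallerLimitDemo {
    /** Pulls at most 'limit' strings from a null-terminated source. */
    static List<String> firstN(Queue<String> source, int limit) {
        List<String> out = new ArrayList<>();
        int count = 0; // the member suggested above: how many strings were pulled
        for (String s; count < limit && (s = source.poll()) != null; ) {
            out.add(s);
            count++;
        }
        return out;
    }
}
```

        The limit thus stays entirely on the caller's side; the iterator itself needs no limit parameter.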

        Also, you can remove that //TODO: make this return a Iterator<IntsRef> instead? since you are doing exactly that, here...

        Markus Heiden added a comment -

        The Iterator interface is not possible with the current implementation, because the lookahead of hasNext() would destroy the current value provided by the previous next(). Anyway, I removed that comment.

        I now provide both interfaces (single use and multiple use) in the newest patch, so for the default case you can use the more intuitive one (single use). I hope you like it.

        I kept the limit inside the iterator, but provided the new class LimitedFiniteStringsIterator for it, because it was too complicated to push the limit into the already complex iteration in AnalyzingSuggester etc.
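
        In the spirit of the LimitedFiniteStringsIterator described here, a limiting decorator over a null-terminated next() can be sketched as follows. The names and types are toy stand-ins for illustration; the real class wraps a FiniteStringsIterator over IntsRef.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;

// Toy sketch of a limit-enforcing wrapper: stops iteration after
// 'limit' results, with a negative limit meaning "unlimited".
class LimitedStringsIterator {
    private final Queue<String> source; // stand-in for the wrapped iterator
    private int remaining;

    LimitedStringsIterator(List<String> strings, int limit) {
        this.source = new ArrayDeque<>(strings);
        this.remaining = (limit < 0) ? Integer.MAX_VALUE : limit;
    }

    /** Next string, or null once the limit is hit or the source is exhausted. */
    String next() {
        if (remaining == 0) {
            return null; // limit reached: end iteration early
        }
        String s = source.poll();
        if (s != null) {
            remaining--;
        }
        return s;
    }
}
```

        Callers that need a cap use the wrapper; all others keep the plain iterator, so the core API stays free of limit handling.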

        Michael McCandless added a comment -

        The Iterator interface is not possible with the current implementation, because the lookahead of hasNext() would destroy the current value provided by the previous next(). Anyway, I removed that comment.

        OK it's fine if we don't implement Java's Iterator; the spirit of that TODO can still be removed

        I provided now both interfaces (single use and multiple use) with the newest patch, so for the default case you can use the more intuitive one (single use). I hope you like it.

        Sorry, I really don't want to pollute such a low level API with reuse ... can you remove the init reuse method?

        I kept the limit inside the iterator, but provided the new class LimitedFiniteStringsIterator for it. Because it was too complicated to transfer the limit into the yet complex iteration in AnalyzingSuggester etc.

        OK that seems like a good compromise...

        Markus Heiden added a comment -

        I adapted my patch to the latest changes in trunk.

        I think the reuse of the iterator is one core part of this whole patch. I tried to rework the API of the iterator so that the reuse case and the no-reuse case are handled in a similar way. I hope you like it now (at least a bit). Lucene does this kind of reuse already, e.g. see Transition.

        FuzzyCompletionQuery was added recently and relies on the old big set of finite strings. I am not sure how to rework it. Currently it still uses the set; maybe it would be better to use the iterator inside FuzzyCompletionWeight, but that would mean recomputing the finite strings over and over again. What do you think?

        BTW, topoSortStates() is implemented identically by AnalyzingSuggester and CompletionTokenStream. Maybe it should be moved to one place, maybe to Operations?

        Michael McCandless added a comment -

        I adapted my patch to the latest changes in trunk.

        Thanks.

        I think the reuse of the iterator is one core part of this whole patch. I tried to rework the api of the iterator so that the reuse case and the no-reuse case are handled in a similar way. I hope you like it now (at least a bit).

        Alas I still don't think this is an appropriate place for object reuse.

        Lucene does this kind of reuse already, e.g. see Transition.

        That's true: Lucene does reuse objects in many low-level places, but this is ugly and cancerous and dangerous (it can easily cause bugs, e.g. accidentally reusing one iterator across threads) and anti-Java, etc., and it really should be used only sparingly; it should be the exception, not the rule. I don't think this API qualifies, i.e. it's a bad tradeoff to pollute the API to eke out a bit of GC perf gain that in real usage would be negligible, because the cost of building an automaton and the cost of consuming each iterated string would normally dwarf the small GC cost of creating a new iterator per automaton. APIs are hard enough to "get right" as it is ...

        FuzzyCompletionQuery has been added lately and relies on the old big set of finite strings. I am not sure how to rework it. Currently it still uses the set, maybe it is better to use the iterator inside of FuzzyCompletionWeight, but this means recomputing the finite strings over and over again. What do you think?

        It's fine to leave this as the full Set<String> for now. It's no worse than before. Progress, not perfection...

        BTW topoSortStates() is implemented by AnalyzingSuggester and CompletionTokenStream identically. Maybe it should be moved to one place, maybe to Operations?

        Woops, I'll go move that to Operations now, good idea, thank you!

        ASF subversion and git services added a comment -

        Commit 1689046 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1689046 ]

        LUCENE-6365: add Operations.topoSort

        ASF subversion and git services added a comment -

        Commit 1689047 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1689047 ]

        LUCENE-6365: add Operations.topoSort

        Markus Heiden added a comment -

        I see I am not able to convince you. So I attached a version of the patch with the reuse API eliminated.

        I agree that reuse is not good design, but the profiler pointed me to that spot. I already did a patch for Automaton (LUCENE-5959) for the same reasons.

        It would be nice, if Automaton knows the size of the longest word it produces. That would eliminated the resizing of the internal stack array inside the iterator and could convince me that reuse is not needed for the iterator.

        ASF subversion and git services added a comment -

        Commit 1689079 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1689079 ]

        LUCENE-6365: fix buggy Operations.topoSort; add test

        ASF subversion and git services added a comment -

        Commit 1689081 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1689081 ]

        LUCENE-6365: fix buggy Operations.topoSort; add test

        Michael McCandless added a comment -

        Thanks Markus Heiden, the patch looks good.

        I do still think this is a very nice change even without the reuse!

        Was it accidental to remove @lucene.experimental from CompletionTokenStream's javadocs? I put that back.

        I also made a couple additional members of FiniteStringsIterator final, and fixed up the javadocs a bit ... new patch attached.

        I'll commit soon, thank you!

        Uwe Schindler added a comment -

        I agree that reuse is no good design, but the profiler pointed me to that spot. I already did a patch for Automaton (LUCENE-5959) for the same reasons.

        The problem with profilers is that they don't run the code with all HotSpot optimizations turned on (because they insert extra instructions into loops to take measurement points). In general, object creation is in most cases not bad, except when objects tend to live a long time (e.g., across several method calls). Short-lived objects rarely affect GC, because they can be allocated on the stack automatically (escape analysis) — and those are in most cases not shown correctly by a profiler, because escape analysis is mostly broken under profilers.

        Reuse is only suitable for cases where the object goes through many stages of a query, needs to be cached between method calls, involves disk I/O, ... This is not the case here.

        The patch without reuse looks fine, although I don't fully understand it (not my area of work).

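To illustrate the escape-analysis point above, a small sketch (illustrative only, not code from the patch): the temporary object below never escapes its loop iteration, so a modern JIT can typically stack-allocate or eliminate it entirely, even though an instrumenting profiler may still attribute cost to the allocation:

```java
// Sketch: short-lived, non-escaping allocations in a hot loop.
public class EscapeAnalysisSketch {

    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
        int manhattan() { return Math.abs(x) + Math.abs(y); }
    }

    public static long sum(int n) {
        long total = 0;
        for (int i = 0; i < n; i++) {
            // 'p' never escapes this iteration, so escape analysis
            // can usually avoid a heap allocation here.
            Point p = new Point(i, -i);
            total += p.manhattan();
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(3)); // 0 + 2 + 4 = 6
    }
}
```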
        Markus Heiden added a comment -

        @Michael: The removal of @lucene.experimental was a mistake of mine during merging. Thanks for your rework and your patience.

        @Uwe: I measured the CPU runtime in sampling mode, so (almost) no additional overhead should occur. I did the reuse because there is not just one allocation of the array, but many: during runtime the array is resized over and over again, because the initial size was rather small (4 entries). I changed that to 16 so the resizing occurs less frequently. My test case was the build of a dictionary of hundreds of thousands of words, so even small things accumulate.

        A better solution to that problem would be if automatons knew the length of their longest word. In that case the above-mentioned array could initially be sized right. But I don't know if that length is always known during construction of automatons.

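The resizing cost described above can be sketched like this (hypothetical code, not the actual iterator internals): a stack backed by an array that doubles when full, where a larger initial capacity removes the early copies that dominate for short words:

```java
import java.util.Arrays;

// Sketch of a growable int stack: each overflow copies the whole
// backing array, so the initial capacity decides how many copies
// happen before reaching a typical word length.
public class IntStackSketch {
    private int[] stack;
    private int size = 0;
    private int resizes = 0; // counts array copies, for illustration

    public IntStackSketch(int initialCapacity) {
        stack = new int[initialCapacity];
    }

    public void push(int value) {
        if (size == stack.length) {
            stack = Arrays.copyOf(stack, stack.length * 2); // full copy
            resizes++;
        }
        stack[size++] = value;
    }

    public int resizes() { return resizes; }

    public static void main(String[] args) {
        IntStackSketch small = new IntStackSketch(4);
        IntStackSketch large = new IntStackSketch(16);
        for (int i = 0; i < 16; i++) { small.push(i); large.push(i); }
        System.out.println(small.resizes() + " vs " + large.resizes()); // 2 vs 0
    }
}
```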
        ASF subversion and git services added a comment -

        Commit 1689192 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1689192 ]

        LUCENE-6365: switch to iterator API to get all finite strings from an Automaton

        ASF subversion and git services added a comment -

        Commit 1689193 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1689193 ]

        LUCENE-6365: switch to iterator API to get all finite strings from an Automaton

        Michael McCandless added a comment -

        Markus Heiden, maybe try the Honest Profiler: https://github.com/RichardWarburton/honest-profiler/wiki ?

        I haven't tried it yet, but it apparently avoids the safepoint sampling bias ...

        Michael McCandless added a comment -

        Hmm this failure just appeared on Jenkins:

           [junit4] Started J0 PID(3197@localhost).
           [junit4] Suite: org.apache.lucene.util.automaton.FiniteStringsIteratorTest
           [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=FiniteStringsIteratorTest -Dtests.method=testRandomFiniteStrings1 -Dtests.seed=4A938C5F6E728DCC -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es_CU -Dtests.timezone=America/Porto_Velho -Dtests.asserts=true -Dtests.file.encoding=UTF-8
           [junit4] FAILURE 0.26s | FiniteStringsIteratorTest.testRandomFiniteStrings1 <<<
           [junit4]    > Throwable #1: java.lang.AssertionError: expected:<445> but was:<446>
           [junit4]    > 	at __randomizedtesting.SeedInfo.seed([4A938C5F6E728DCC:17BF51EE97CD2662]:0)
           [junit4]    > 	at org.apache.lucene.util.automaton.FiniteStringsIteratorTest.assertFiniteStringsRecursive(FiniteStringsIteratorTest.java:204)
           [junit4]    > 	at org.apache.lucene.util.automaton.FiniteStringsIteratorTest.testRandomFiniteStrings1(FiniteStringsIteratorTest.java:82)
           [junit4]    > 	at java.lang.Thread.run(Thread.java:745)
           [junit4]   2> NOTE: test params are: codec=Asserting(Lucene53): {}, docValues:{}, sim=RandomSimilarityProvider(queryNorm=true,coord=no): {}, locale=es_CU, timezone=America/Porto_Velho
           [junit4]   2> NOTE: Linux 3.13.0-46-generic amd64/Oracle Corporation 1.8.0_40 (64-bit)/cpus=8,threads=1,free=483134008,total=504889344
           [junit4]   2> NOTE: All tests run in this JVM: [FiniteStringsIteratorTest]
           [junit4] Completed [1/1] in 0.41s, 1 test, 1 failure <<< FAILURES!
        

        It reproduces ... I haven't dug into it yet.

        Markus Heiden added a comment -

        I have executed the tests in lucene/core again, and they do not fail on the state of trunk from when I created the patch. There must be changes elsewhere that lead to this failure.

        Dawid Weiss added a comment - edited

        I don't think so. The test in question (and the seed in question) fails because FiniteStringsIterator returns the same sequence twice ("h"); I just compared the outputs of the recursive and non-recursive implementations.

        Looks like the seed in question adds the same string to the input automaton twice ("h") and then doesn't minimize/determinize. getFiniteStringsRecursive makes sure no string is emitted twice; FiniteStringsIterator doesn't have this check.

        Perhaps we should require that the input automaton doesn't have duplicates (Mike?). Alternatively, we could add compare-to-previous to the iterator and skip duplicates.

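The compare-to-previous idea could look roughly like this (a hypothetical generic wrapper, not Lucene code): it only removes duplicates that are emitted adjacently, so it assumes duplicates appear next to each other, e.g. in sorted emission order:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.Objects;

// Wraps an iterator and skips elements equal to the previously
// returned one (adjacent-duplicate elimination).
public class DedupIterator<T> implements Iterator<T> {
    private final Iterator<T> in;
    private T buffered;          // next distinct element, if any
    private boolean hasBuffered;
    private T previous;          // last element returned
    private boolean hasPrevious;

    public DedupIterator(Iterator<T> in) { this.in = in; }

    @Override
    public boolean hasNext() {
        while (!hasBuffered && in.hasNext()) {
            T candidate = in.next();
            if (!hasPrevious || !Objects.equals(candidate, previous)) {
                buffered = candidate;
                hasBuffered = true;
            }
        }
        return hasBuffered;
    }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        previous = buffered;
        hasPrevious = true;
        hasBuffered = false;
        return previous;
    }

    public static void main(String[] args) {
        Iterator<String> it = new DedupIterator<>(Arrays.asList("g", "h", "h", "i").iterator());
        List<String> out = new ArrayList<>();
        while (it.hasNext()) out.add(it.next());
        System.out.println(out); // [g, h, i]
    }
}
```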
        Markus Heiden added a comment - edited

        Sorry, I forgot to use the seed.

        De-duplication was previously provided by the big set containing the result, which has been removed, so no de-duplication takes place anymore. Should this be the responsibility of the iterator at all?

        Dawid Weiss added a comment -

        I think it makes even more sense to leave the code as is (without removing duplicates). There should be an assertion that no duplicates occur in the automaton, but let Michael McCandless confirm this.

        Michael McCandless added a comment -

        Ahh thanks for digging Dawid Weiss and Markus Heiden.

        Markus Heiden: usually all that's necessary to reproduce a test failure is to copy/paste the exact text after "Reproduce with: ...", in this case:

        ant test  -Dtestcase=FiniteStringsIteratorTest -Dtests.method=testRandomFiniteStrings1 -Dtests.seed=4A938C5F6E728DCC -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=es_CU -Dtests.timezone=America/Porto_Velho -Dtests.asserts=true -Dtests.file.encoding=UTF-8
        

        That usually reproduces the failure, though sometimes you'll need the exact JVM version, added JVM flags, etc.

        I agree it should not be this iterator's job to deal with duplicates: I think if you pass a non-minimal automaton to it, it's fair game for it to return dups ... so this is a test bug.

        Michael McCandless added a comment -

        I'll fix the test to de-dup itself, and add a comment in FiniteStringsIterator's javadocs about this ...

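Fixing the test this way means de-duplicating the random inputs before building the automaton, rather than making the iterator de-dup its output. A minimal sketch of that approach (illustrative names, not the actual test code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Sketch: drop duplicate test inputs up front so the iterator (which
// may legitimately emit dups for a non-minimal automaton) is compared
// against a duplicate-free expectation.
public class UniqueInputsSketch {

    public static List<String> uniqueInputs(List<String> raw) {
        // LinkedHashSet keeps insertion order while removing duplicates.
        return new ArrayList<>(new LinkedHashSet<>(raw));
    }

    public static void main(String[] args) {
        System.out.println(uniqueInputs(Arrays.asList("h", "abc", "h"))); // [h, abc]
    }
}
```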
        ASF subversion and git services added a comment -

        Commit 1689404 from Michael McCandless in branch 'dev/trunk'
        [ https://svn.apache.org/r1689404 ]

        LUCENE-6365: fix test to not add duplicate strings

        ASF subversion and git services added a comment -

        Commit 1689405 from Michael McCandless in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1689405 ]

        LUCENE-6365: fix test to not add duplicate strings

        Dawid Weiss added a comment - edited

        Thanks Mike, thanks Markus.

        Shalin Shekhar Mangar added a comment -

        Bulk close for 5.3.0 release


          People

          • Assignee: Unassigned
          • Reporter: Markus Heiden
          • Votes: 1
          • Watchers: 7
