Lucene - Core
  1. Lucene - Core
  2. LUCENE-2933

Two-stage state expansion for the FST: distance-from-root and child-count criteria.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None

      Description

      In the current implementation FST states are expanded into a binary search on labels (from a vector of transitions) when the child count of a state exceeds a given predefined threshold (NUM_ARCS_FIXED_ARRAY). This threshold affects automaton size and traversal speed (as it turns out when benchmarked). For some degenerate data sets, close-to-the-root nodes could have a small number of children (below the threshold) and yet be traversed on every single seek.

      A fix of this is to introduce two control thresholds:
      EXPAND state if (distance-from-root <= MIN_DISTANCE || children-count >= NUM_ARCS_FIXED_ARRAY)

      My plan is to create a data set that will prove this first and then to implement the workaround above.

      1. LUCENE-2933.patch
        12 kB
        Dawid Weiss
      2. out copy.png
        214 kB
        Dawid Weiss

        Activity

        Transition Time In Source Status Execution Times Last Executer Last Execution Date
        Open Open Resolved Resolved
        2d 1h 27m 1 Dawid Weiss 23/Feb/11 10:17
        Resolved Resolved Closed Closed
        807d 26m 1 Uwe Schindler 10/May/13 11:44
        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Dawid Weiss made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Lucene Fields [New]
        Resolution Fixed [ 1 ]
        Hide
        Dawid Weiss added a comment -

        In trunk.

        Show
        Dawid Weiss added a comment - In trunk.
        Hide
        Michael McCandless added a comment -

        Patch looks great Dawid! I like the blue = expanded node dot coloring.

        +1 to commit!

        Show
        Michael McCandless added a comment - Patch looks great Dawid! I like the blue = expanded node dot coloring. +1 to commit!
        Dawid Weiss made changes -
        Attachment out copy.png [ 12471573 ]
        Hide
        Dawid Weiss added a comment -

        A visual confirmation that this really works (other than JUnit tests).

        Show
        Dawid Weiss added a comment - A visual confirmation that this really works (other than JUnit tests).
        Hide
        Dawid Weiss added a comment -

        Guys, can you peek at this and review it? It's of minimal priority, but I checked and this change:

        a) doesn't seem to degenerate FST compilation performance
        b) hurt traversal performance on real data (checked against the allterms-20110115),
        c) improves worst case behavior on degenerate input sequences.

        If approved, I will commit it in.

        Show
        Dawid Weiss added a comment - Guys, can you peek at this and review it? It's of minimal priority, but I checked and this change: a) doesn't seem to degenerate FST compilation performance b) hurt traversal performance on real data (checked against the allterms-20110115), c) improves worst case behavior on degenerate input sequences. If approved, I will commit it in.
        Dawid Weiss made changes -
        Attachment LUCENE-2933.patch [ 12471572 ]
        Hide
        Dawid Weiss added a comment -

        Patch and tests.

        Show
        Dawid Weiss added a comment - Patch and tests.
        Dawid Weiss made changes -
        Attachment LUCENE-2933.patch [ 12471560 ]
        Dawid Weiss made changes -
        Attachment LUCENE-2933.patch [ 12471560 ]
        Hide
        Dawid Weiss added a comment -

        Implementation of node depth expansion into an array. This patch also piggybacks package-scope getBytes() so that other classes can access the internal representation (specifically, Util.toDot)

        Show
        Dawid Weiss added a comment - Implementation of node depth expansion into an array. This patch also piggybacks package-scope getBytes() so that other classes can access the internal representation (specifically, Util.toDot)
        Dawid Weiss made changes -
        Field Original Value New Value
        Assignee Dawid Weiss [ dweiss ]
        Hide
        Dawid Weiss added a comment -

        Thanks Uwe!

        Show
        Dawid Weiss added a comment - Thanks Uwe!
        Hide
        Uwe Schindler added a comment -

        Dawid: I added you as Committer to JIRA!

        Show
        Uwe Schindler added a comment - Dawid: I added you as Committer to JIRA!
        Hide
        Simon Willnauer added a comment -

        I can't assign issues to myself, is that all right?

        some jira admin needs to enable this. Grant has JIRA admin rights afaik.

        simon

        Show
        Simon Willnauer added a comment - I can't assign issues to myself, is that all right? some jira admin needs to enable this. Grant has JIRA admin rights afaik. simon
        Hide
        Dawid Weiss added a comment -

        I can't assign issues to myself, is that all right?

        Show
        Dawid Weiss added a comment - I can't assign issues to myself, is that all right?
        Dawid Weiss created issue -

          People

          • Assignee:
            Dawid Weiss
            Reporter:
            Dawid Weiss
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development