Lucene - Core
  1. Lucene - Core
  2. LUCENE-2933

Two-stage state expansion for the FST: distance-from-root and child-count criteria.

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Trivial Trivial
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-ALPHA
    • Component/s: core/index
    • Labels:
      None

      Description

      In the current implementation FST states are expanded into a binary search on labels (from a vector of transitions) when the child count of a state exceeds a given predefined threshold (NUM_ARCS_FIXED_ARRAY). This threshold affects automaton size and traversal speed (as it turns out when benchmarked). For some degenerate data sets, close-to-the-root nodes could have a small number of children (below the threshold) and yet be traversed on every single seek.

      A fix of this is to introduce two control thresholds:
      EXPAND state if (distance-from-root <= MIN_DISTANCE || children-count >= NUM_ARCS_FIXED_ARRAY)

      My plan is to create a data set that will prove this first and then to implement the workaround above.

      1. LUCENE-2933.patch
        12 kB
        Dawid Weiss
      2. out copy.png
        214 kB
        Dawid Weiss

        Activity

        Hide
        Dawid Weiss added a comment -

        In trunk.

        Show
        Dawid Weiss added a comment - In trunk.
        Hide
        Michael McCandless added a comment -

        Patch looks great Dawid! I like the blue = expanded node dot coloring.

        +1 to commit!

        Show
        Michael McCandless added a comment - Patch looks great Dawid! I like the blue = expanded node dot coloring. +1 to commit!
        Hide
        Dawid Weiss added a comment -

        A visual confirmation that this really works (other than JUnit tests).

        Show
        Dawid Weiss added a comment - A visual confirmation that this really works (other than JUnit tests).
        Hide
        Dawid Weiss added a comment -

        Guys, can you peek at this and review it? It's of minimal priority, but I checked and this change:

        a) doesn't seem to degenerate FST compilation performance
        b) hurt traversal performance on real data (checked against the allterms-20110115),
        c) improves worst case behavior on degenerate input sequences.

        If approved, I will commit it in.

        Show
        Dawid Weiss added a comment - Guys, can you peek at this and review it? It's of minimal priority, but I checked and this change: a) doesn't seem to degenerate FST compilation performance b) hurt traversal performance on real data (checked against the allterms-20110115), c) improves worst case behavior on degenerate input sequences. If approved, I will commit it in.
        Hide
        Dawid Weiss added a comment -

        Patch and tests.

        Show
        Dawid Weiss added a comment - Patch and tests.
        Hide
        Dawid Weiss added a comment -

        Implementation of node depth expansion into an array. This patch also piggybacks package-scope getBytes() so that other classes can access the internal representation (specifically, Util.toDot)

        Show
        Dawid Weiss added a comment - Implementation of node depth expansion into an array. This patch also piggybacks package-scope getBytes() so that other classes can access the internal representation (specifically, Util.toDot)
        Hide
        Dawid Weiss added a comment -

        Thanks Uwe!

        Show
        Dawid Weiss added a comment - Thanks Uwe!
        Hide
        Uwe Schindler added a comment -

        Dawid: I added you as Committer to JIRA!

        Show
        Uwe Schindler added a comment - Dawid: I added you as Committer to JIRA!
        Hide
        Simon Willnauer added a comment -

        I can't assign issues to myself, is that all right?

        some jira admin needs to enable this. Grant has JIRA admin rights afaik.

        simon

        Show
        Simon Willnauer added a comment - I can't assign issues to myself, is that all right? some jira admin needs to enable this. Grant has JIRA admin rights afaik. simon
        Hide
        Dawid Weiss added a comment -

        I can't assign issues to myself, is that all right?

        Show
        Dawid Weiss added a comment - I can't assign issues to myself, is that all right?

          People

          • Assignee:
            Dawid Weiss
            Reporter:
            Dawid Weiss
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development