Uploaded image for project: 'Lucene - Core'
  1. Lucene - Core
  2. LUCENE-9983

Stop sorting determinize powersets unnecessarily

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • None
    • New

    Description

      Spinoff from LUCENE-9981.

      Today, our Operations.determinize implementation builds powersets of all subsets of NFA states that "belong" in the same determinized state, using this algorithm.

      To hold each powerset, we use a malleable SortedIntSet and periodically freeze it to a FrozenIntSet, also sorted.  We pay a high price to keep these growing maps of int key, int value sorted by key, e.g. upgrading to a TreeMap once the map is large enough (> 30 entries).

      But I think sorting is entirely unnecessary here!  Really all we need is the ability to add/delete keys from the map, and hashCode / equals (by key only – ignoring value!), and to freeze the map (a small optimization that we could skip initially).  We only use these maps to lookup in the (growing) determinized automaton whether this powerset has already been seen.

      Maybe we could simply poach the IntIntScatterMap implementation from HPPC?  And then change its hashCode/{{equals }}to only use keys (not values).

      This change should be a big speedup for the kinds of (admittedly adversarial) regexps we saw on LUCENE-9981.

       

      Attachments

        Activity

          People

            Unassigned Unassigned
            mikemccand Michael McCandless
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 5h 40m
                5h 40m