  1. Lucene - Core
  2. LUCENE-3297

FST doesn't fully share common prefix across all outputs

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: core/FSTs
    • Labels: None
    • Lucene Fields: New

      Description

      FST tries to share prefixes of outputs when possible. However, in the (I think unusual in practice) case where all outputs share a common prefix, FST really ought to store that prefix just once, on the root arc; instead it is only able to push it back as far as the N root arcs, so the shared prefix is stored N times. It's sort of an off-by-one in how far back the pushing goes...

      One (synthetic) example where this makes a big difference is the new Test2BPostings test when it uses MemoryCodec: the test has 26 terms (the letters of the alphabet), and each term has exactly the same long (~85 MB) all-1s byte[] as its postings. If this issue were fixed, the resulting FST would be only ~85 MB; as it stands, it needs ~85 * 26 MB (roughly 2.2 GB).
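
      The size difference can be illustrated with a toy model. This is only a sketch, not Lucene's FST builder: it treats the FST as a plain trie, assumes each arc stores the part of the common output prefix of the terms beneath it that no ancestor arc has already emitted, and scales the ~85 MB postings down to 85 bytes. The names common_prefix, pushed_bytes and root_shared_bytes are hypothetical helpers invented for this illustration.

```python
import string


def common_prefix(outputs):
    """Longest common prefix of a non-empty list of bytes objects."""
    prefix = outputs[0]
    for out in outputs[1:]:
        n = 0
        while n < min(len(prefix), len(out)) and prefix[n] == out[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def pushed_bytes(entries, depth=0, emitted=0):
    """Output bytes stored when prefixes are pushed back only as far as the
    arcs leaving each node (toy model of the behavior described above).

    entries: dict mapping term (str) -> output (bytes); all terms share
             term[:depth], and `emitted` bytes of every output are already
             stored on ancestor arcs.
    """
    total = 0
    groups = {}
    for term, out in entries.items():
        if len(term) == depth:
            # Term ends at this node: the rest of its output is stored here.
            total += len(out) - emitted
        else:
            groups.setdefault(term[depth], {})[term] = out
    for sub in groups.values():
        shared = len(common_prefix(list(sub.values())))
        total += shared - emitted  # bytes stored on this arc
        total += pushed_bytes(sub, depth + 1, shared)
    return total


def root_shared_bytes(entries):
    """Same model, but the prefix common to ALL outputs is stored once."""
    root = len(common_prefix(list(entries.values())))
    stripped = {term: out[root:] for term, out in entries.items()}
    return root + pushed_bytes(stripped)


# 26 single-letter terms, each with the same all-1s output
# (85 bytes standing in for ~85 MB).
entries = {letter: b"\x01" * 85 for letter in string.ascii_lowercase}

print(pushed_bytes(entries))       # one copy of the output per root arc
print(root_shared_bytes(entries))  # a single shared copy
```

      Under these assumptions the first count grows with the number of root arcs while the second stays at the size of one output, which matches the ~85 * 26 MB versus ~85 MB figures above.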

            People

            • Assignee: Unassigned
            • Reporter: Michael McCandless (mikemccand)
            • Votes: 0
            • Watchers: 2
