Lucene - Core / LUCENE-3681

FST.BYTE2 should save as fixed 2 byte not as vInt

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      We currently write BYTE1 labels as a single byte but BYTE2 and BYTE4 labels as vInt, which I think is confusing. Also, for the FST for the new Kuromoji analyzer (LUCENE-3305), writing BYTE2 labels as a fixed 2 bytes instead shrank the FST and ran faster, presumably because more values were >= 16384 (3 bytes as vInt) than were < 128 (1 byte as vInt).
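      To illustrate the size trade-off described above, here is a minimal sketch, not the patch itself. The class and method names are hypothetical, and the high-byte-first order for the fixed 2-byte form is an assumption. It shows that a 16-bit label costs 1, 2 or 3 bytes as vInt depending on its value, but always 2 bytes in the fixed form.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper for illustration only; not part of Lucene.
final class LabelEncodingSketch {

    // Variable-length int: 7 payload bits per byte, low bits first,
    // high bit set on every byte except the last. Assumes v >= 0.
    static void writeVInt(ByteArrayOutputStream out, int v) {
        while ((v & ~0x7F) != 0) {
            out.write((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.write(v);
    }

    // Fixed-width form: the low 16 bits of v as exactly two bytes
    // (high byte first; the byte order is an assumption of this sketch).
    static void writeFixed2(ByteArrayOutputStream out, int v) {
        out.write((v >>> 8) & 0xFF);
        out.write(v & 0xFF);
    }

    // Number of bytes the vInt form needs for v.
    static int vIntSize(int v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, v);
        return out.size();
    }

    // Number of bytes the fixed form needs for v (always 2).
    static int fixed2Size(int v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeFixed2(out, v);
        return out.size();
    }

    public static void main(String[] args) {
        int[] samples = {0, 127, 128, 16383, 16384, 65535};
        for (int v : samples) {
            System.out.println(v + ": vInt=" + vIntSize(v) + " bytes, fixed=" + fixed2Size(v) + " bytes");
        }
    }
}
```

      The fixed form loses one byte on every label below 128 and gains one byte on every label at or above 16384, so it is smaller overall whenever the second group outnumbers the first.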

      Separately, the whole INPUT_TYPE is very confusing... really all it's doing is "declaring" the allowed range of the characters of the input alphabet, and the only things that use it are the write/readLabel methods (well, and some confusing sugar methods in Builder!). Not sure how to fix that yet...

      It's a simple change, but it changes the FST binary format, so any users with existing FSTs will have to rebuild them (FST is marked experimental...).

      Attachment: LUCENE-3681.patch (3 kB, Michael McCandless)

          Activity

          Michael McCandless added a comment -

          I like the idea of moving read/writeLabel to BytesReader/Writer and then specializing...

          Robert Muir added a comment -

          +1 to this patch as a first step. I agree we should try to improve this later and make it more extensible, even if it's only small steps like moving readLabel() to BytesReader with concrete impls for BYTE1/2/4; later maybe this can be made customizable or something like that.
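          One way the "concrete impls for BYTE1/2/4" idea could look is sketched below. This is only an illustration under assumptions: the enum LabelCodecSketch and its methods are hypothetical and are not the Lucene BytesReader API, a java.nio.ByteBuffer stands in for the FST's byte storage, and BYTE4 is assumed to stay vInt.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: one label codec per input type, so the
// read/write logic is specialized instead of switching on INPUT_TYPE.
enum LabelCodecSketch {
    BYTE1 {
        @Override void write(ByteBuffer out, int label) { out.put((byte) label); }
        @Override int read(ByteBuffer in) { return in.get() & 0xFF; }
    },
    BYTE2 {
        // Fixed 2 bytes (ByteBuffer defaults to big-endian order).
        @Override void write(ByteBuffer out, int label) { out.putShort((short) label); }
        @Override int read(ByteBuffer in) { return in.getShort() & 0xFFFF; }
    },
    BYTE4 {
        // Assumed to remain a vInt: 7 payload bits per byte, low bits first.
        @Override void write(ByteBuffer out, int label) {
            while ((label & ~0x7F) != 0) {
                out.put((byte) ((label & 0x7F) | 0x80));
                label >>>= 7;
            }
            out.put((byte) label);
        }
        @Override int read(ByteBuffer in) {
            int b = in.get();
            int v = b & 0x7F;
            for (int shift = 7; (b & 0x80) != 0; shift += 7) {
                b = in.get();
                v |= (b & 0x7F) << shift;
            }
            return v;
        }
    };

    abstract void write(ByteBuffer out, int label);
    abstract int read(ByteBuffer in);

    // Writes a label, then reads it back through the same codec.
    static int roundTrip(LabelCodecSketch codec, int label) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        codec.write(buf, label);
        buf.flip();
        return codec.read(buf);
    }

    // Number of bytes the codec uses for this label.
    static int encodedLength(LabelCodecSketch codec, int label) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        codec.write(buf, label);
        return buf.position();
    }
}
```

          With this shape, a caller picks the codec once when the FST is created and the per-label code path has no branching on the input type.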

          Michael McCandless added a comment -

          Simple patch...

          Dawid Weiss added a comment -

          This is related to that old issue I don't have the time to work on. It would be nice to have a top-level symbol set/algebra definition and build things on top of that.


            People

            • Assignee:
              Michael McCandless
              Reporter:
              Michael McCandless
            • Votes:
              0
              Watchers:
              1