Lucene - Core
  1. Lucene - Core
  2. LUCENE-3695

FST Builder methods need fixing,documentation,or improved type safety

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.6, 4.0-ALPHA
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Its confusing the way an FST Builder has 4 add() methods, and you get assertion errors (what happens if assertions are disabled?) if you use the wrong one:

      For reference we have 3 FST input types:

      • BYTE1 (byte)
      • BYTE2 (char)
      • BYTE4 (int)

      For the builder add() method signatures we have:

      • add(BytesRef)
      • add(char[], int offset, int len)
      • add(CharSequence)
      • add(IntsRef)

      But certain methods only work with certain FST input types, and these mappings are not the ones you think.

      For example, you would think that if you have a char-based FST you should use add(char[]) or add(CharSequence), but this is not the case: those add methods actually only work with int-based FST (they use codePointAt() to extract codepoints). Instead, you have to use add(IntsRef) for the char-based one.

      The worst is if you use the wrong one, you get an assertion error, but i'm not sure what happens if assertions are disabled.

      Maybe the ultimate solution is to parameterize FST's generics on input too (FST<input,output>) and just require BytesRef/CharsRef/IntsRef as the parameter? Then you could just have add(), and this might clean up FSTEnum too (it would no longer need that InputOutput class but maybe could use Map.Entry<input,output> or something?

      I think the documentation is improving but i still notice add(BytesRef) has no javadoc at all, and it only works with BYTE1, so I think we still have some work to do even if we want to just pursue a documentation fix.

      1. LUCENE-3695.patch
        24 kB
        Michael McCandless
      2. LUCENE-3695.patch
        20 kB
        Michael McCandless

        Activity

        Hide
        Michael McCandless added a comment -

        +1 to doing something here! It's very confusing now.

        The Builder today only operates on IntsRef inputs; the other add methods taking char[]/CharSequence, BytesRef are sugar, translating to IntsRef. Maybe... we should move these elsewhere (eg Util.XXX) and rename them to reflect that they are just converting XXX to IntsRef? Then Builder would only have add(IntsRef[]).

        The "INPUT_TYPE" really describes the allowed range of the input int labels...

        I'd love to parameterize by input type as well, but I think it's tricky (Uwe!?)? Ideally Builder, FST, and the FSTEnums would take <K,V>; FST is just like a SortedMap.

        Show
        Michael McCandless added a comment - +1 to doing something here! It's very confusing now. The Builder today only operates on IntsRef inputs; the other add methods taking char[]/CharSequence, BytesRef are sugar, translating to IntsRef. Maybe... we should move these elsewhere (eg Util.XXX) and rename them to reflect that they are just converting XXX to IntsRef? Then Builder would only have add(IntsRef[]). The "INPUT_TYPE" really describes the allowed range of the input int labels... I'd love to parameterize by input type as well, but I think it's tricky (Uwe!?)? Ideally Builder, FST, and the FSTEnums would take <K,V>; FST is just like a SortedMap.
        Hide
        Dawid Weiss added a comment -

        Just a note. Ideally it would be <<K extends Comparable>, <V extends ByteArraySerializable>, AlgebraOperations<V>>; K is Comparable for binary lookups inside array-expanded nodes, V is arbitrary as long as it can be serialized into bytes and we need certain algebraic operations defined for V... eh, this would be close to perfection

        Show
        Dawid Weiss added a comment - Just a note. Ideally it would be <<K extends Comparable>, <V extends ByteArraySerializable>, AlgebraOperations<V>>; K is Comparable for binary lookups inside array-expanded nodes, V is arbitrary as long as it can be serialized into bytes and we need certain algebraic operations defined for V... eh, this would be close to perfection
        Hide
        Michael McCandless added a comment -

        Patch, just moving the confusing sugar out of FST into Util... this makes it explicit that the Builder only takes IntsRef, and you have to pre-convert other things to IntsRef first...

        Show
        Michael McCandless added a comment - Patch, just moving the confusing sugar out of FST into Util... this makes it explicit that the Builder only takes IntsRef, and you have to pre-convert other things to IntsRef first...
        Hide
        Michael McCandless added a comment -

        Another iteration, removing confusing sugar from Util, and removing an iffy assert.

        Show
        Michael McCandless added a comment - Another iteration, removing confusing sugar from Util, and removing an iffy assert.
        Hide
        Michael McCandless added a comment -

        OK I committed de-sugaring... but we can leave this open for any more improvements.

        Show
        Michael McCandless added a comment - OK I committed de-sugaring... but we can leave this open for any more improvements.
        Hide
        Robert Muir added a comment -

        Lets resolve this for now? we made progress. We know we can further improve the FST APIs,
        but this fix solves the worst of the confusion in my opinion.

        Show
        Robert Muir added a comment - Lets resolve this for now? we made progress. We know we can further improve the FST APIs, but this fix solves the worst of the confusion in my opinion.

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development