Lucene - Core
  LUCENE-4310

NormalizeCharMap.build creates utf32-keyed automaton and uses it with utf16 keys

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0, 6.0
    • Component/s: None
    • Labels: None
    • Lucene Fields: New

      Description

      The NormalizeCharMap#build method is inconsistent with how MappingCharFilter later uses the resulting FST:

              final org.apache.lucene.util.fst.Builder<CharsRef> builder = new org.apache.lucene.util.fst.Builder<CharsRef>(FST.INPUT_TYPE.BYTE2, outputs);
              final IntsRef scratch = new IntsRef();
              for(Map.Entry<String,String> ent : pendingPairs.entrySet()) {
                builder.add(Util.toUTF32(ent.getKey(), scratch),
                            new CharsRef(ent.getValue()));
      

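      (Note the mismatch: the builder is declared with FST.INPUT_TYPE.BYTE2, i.e. 16-bit labels, but the keys are added via Util.toUTF32, i.e. as UTF-32 code points.)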
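
To make the mismatch concrete, here is a minimal plain-Java sketch (no Lucene dependency) of the two label sequences the same key produces. The class and method names are hypothetical and only illustrate the idea; `toCodePoints` stands in for what a UTF-32 conversion would roughly yield, and `toCodeUnits` for what a reader working one `char` at a time sees.

```java
import java.util.Arrays;

public class KeyUnitsDemo {
    // Key as UTF-32 code points: one int per Unicode character.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Key as UTF-16 code units: one int per Java char.
    static int[] toCodeUnits(String s) {
        int[] units = new int[s.length()];
        for (int i = 0; i < s.length(); i++) {
            units[i] = s.charAt(i);
        }
        return units;
    }

    public static void main(String[] args) {
        // U+1D122 lies outside the BMP, so it needs a surrogate pair in UTF-16.
        String key = new String(Character.toChars(0x1D122));
        System.out.println("code points: " + Arrays.toString(toCodePoints(key)));
        System.out.println("code units:  " + Arrays.toString(toCodeUnits(key)));
    }
}
```

For keys made only of BMP characters the two sequences are identical, which is presumably why the inconsistency could go unnoticed.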

      1. LUCENE-4310.patch
        4 kB
        Michael McCandless
      2. LUCENE-4310.patch
        3 kB
        Michael McCandless

        Activity

        Michael McCandless added a comment -

        Eek, how awful! I'll fix.

        Dawid Weiss added a comment -

        Thanks Mike!

        Michael McCandless added a comment -

        Patch.

        The FST needs to be created with UTF-16 code units, since that's what we work with at mapping time.
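
As a rough illustration of why the key encoding has to match what is read at mapping time, the sketch below uses a plain `HashMap` as a hypothetical stand-in for the FST (it is not the Lucene API, and all names are made up for the example). The lookup reads the input one `char` at a time, so only a table keyed by UTF-16 code units can match a supplementary character.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MappingLookupDemo {
    // Labels as UTF-32 code points (the inconsistent variant).
    static List<Integer> codePointLabels(String s) {
        List<Integer> labels = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            labels.add(cp);
            i += Character.charCount(cp);
        }
        return labels;
    }

    // Labels as UTF-16 code units (what a char-based reader produces).
    static List<Integer> codeUnitLabels(String s) {
        List<Integer> labels = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            labels.add((int) s.charAt(i));
        }
        return labels;
    }

    // Toy matcher: walks the input char by char, as assumed for mapping time.
    static String lookup(Map<List<Integer>, String> table, String input) {
        return table.get(codeUnitLabels(input));
    }

    public static void main(String[] args) {
        String key = new String(Character.toChars(0x1D122));

        Map<List<Integer>, String> utf32Keyed = new HashMap<>();
        utf32Keyed.put(codePointLabels(key), "x");

        Map<List<Integer>, String> utf16Keyed = new HashMap<>();
        utf16Keyed.put(codeUnitLabels(key), "x");

        System.out.println("UTF-32 keyed: " + lookup(utf32Keyed, key));
        System.out.println("UTF-16 keyed: " + lookup(utf16Keyed, key));
    }
}
```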

        Robert Muir added a comment -

        Where is the sort done? If you do this, you need to sort with the funky comparator too.

        Michael McCandless added a comment -

        We sort using TreeMap<String,String>, so we should be OK (it sorts by UTF-16 order). But to be sure(r) I added a U+FF01 test char too.
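
The ordering point can be checked with a small plain-Java sketch. It assumes nothing about Lucene; the class and helper names are hypothetical. `TreeMap<String,String>` uses `String.compareTo`, which compares UTF-16 code units, so a supplementary character (whose high surrogate is in the D800 range) sorts before U+FF01, whereas code point order puts it after. That is presumably why U+FF01 is a useful test character.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

public class SortOrderDemo {
    // Compares two strings by Unicode code point rather than by UTF-16 code unit.
    static int compareByCodePoint(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            int ca = a.codePointAt(i);
            int cb = b.codePointAt(j);
            if (ca != cb) {
                return Integer.compare(ca, cb);
            }
            i += Character.charCount(ca);
            j += Character.charCount(cb);
        }
        // One string is a prefix of the other: the shorter one sorts first.
        return Integer.compare(a.length() - i, b.length() - j);
    }

    // Key order as a TreeMap<String,String> iterates it (UTF-16 code unit order).
    static List<String> utf16Order(List<String> keys) {
        TreeMap<String, String> map = new TreeMap<>();
        for (String k : keys) {
            map.put(k, "");
        }
        return new ArrayList<>(map.keySet());
    }

    // Key order when sorted by code point instead.
    static List<String> codePointOrder(List<String> keys) {
        List<String> sorted = new ArrayList<>(keys);
        sorted.sort(SortOrderDemo::compareByCodePoint);
        return sorted;
    }

    public static void main(String[] args) {
        String fullwidthBang = "\uFF01";
        String supplementary = new String(Character.toChars(0x1D122));
        List<String> keys = Arrays.asList(fullwidthBang, supplementary);

        boolean sameOrder = utf16Order(keys).equals(codePointOrder(keys));
        System.out.println("UTF-16 order equals code point order: " + sameOrder);
    }
}
```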

        Robert Muir added a comment -

        Thanks: +1

        Michael McCandless added a comment -

        Thanks Dawid!

        Uwe Schindler added a comment -

        Closed after release.


          People

          • Assignee: Michael McCandless
          • Reporter: Dawid Weiss
          • Votes: 0
          • Watchers: 2