SOLR-1078

WordDelimiterFilter breaks words incorrectly at Thai vowels

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels:
      None
    • Environment:

      Ubuntu 8.10 64bit
      Java 1.6.0_10

      Description

      With any schema.xml configuration that includes

      <filter class="solr.WordDelimiterFilterFactory" />

      Thai text is broken into words incorrectly.


      Example: "ผู้ ใหญ่ บ้าน"

      Wrong result: 0 => "ผ", 1 => "ใหญ", 2 => "บ", 3 => "าน"

      Expected result: 0 => "ผู้", 1 => "ใหญ่", 2 => "บ้าน"


      Example2: "ผู้ใหญ่บ้าน" (no space)

      Wrong result: 0 => "ผ", 1 => "ใหญ", 2 => "บ", 3 => "าน" (same result)

      Expected result: 0 => "ผู้ใหญ่บ้าน"


      There's a similar problem with Drupal (http://drupal.org/node/335928)

      1. SOLR-1078.patch
        4 kB
        Yonik Seeley

        Activity

        Yonik Seeley added a comment -

        Are these characters all in the basic multilingual plane?

        Here is the relevant code showing how WordDelimiterFilter classifies characters:

          [...]
            } else if (Character.isLowerCase(ch)) {
              return LOWER;
            } else if (Character.isLetter(ch)) {
              return UPPER;
            } else {
              return SUBWORD_DELIM;
            }
        
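A minimal sketch of why these three branches produce the tokens reported in the description. It assumes a much simplified filter that only splits on characters classified as SUBWORD_DELIM; the class name `OldClassifierDemo` and the helper `split` are hypothetical, and the elided `[...]` part of the original method is not reproduced.

```java
import java.util.ArrayList;
import java.util.List;

public class OldClassifierDemo {
    static final int LOWER = 1, UPPER = 2, SUBWORD_DELIM = 3;

    // Mirrors only the three branches quoted in the comment above.
    static int charType(char ch) {
        if (Character.isLowerCase(ch)) {
            return LOWER;
        } else if (Character.isLetter(ch)) {
            return UPPER;
        } else {
            return SUBWORD_DELIM;
        }
    }

    // Hypothetical helper: cut the input at every SUBWORD_DELIM character.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (charType(ch) == SUBWORD_DELIM) {
                if (cur.length() > 0) {
                    parts.add(cur.toString());
                    cur.setLength(0);
                }
            } else {
                cur.append(ch);
            }
        }
        if (cur.length() > 0) {
            parts.add(cur.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        // The no-space example from the description, written with Unicode
        // escapes so the source compiles regardless of file encoding.
        String word = "\u0E1C\u0E39\u0E49\u0E43\u0E2B\u0E0D\u0E48\u0E1A\u0E49\u0E32\u0E19";
        System.out.println(split(word));
    }
}
```

Under these assumptions the combining vowels and tone marks fall into the final `else` branch, so the word comes back as the four fragments "ผ", "ใหญ", "บ", "าน" listed as the wrong result.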
        Robert Muir added a comment -

        Thai vowels are neither; for them, Character.getType(ch) == Character.NON_SPACING_MARK.
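This can be checked directly with the standard `java.lang.Character` API. The sketch below inspects the three code units of "ผู้" (the first word of the example); the class name `ThaiCharTypes` and the helper `describe` are hypothetical.

```java
public class ThaiCharTypes {
    // Hypothetical helper: name the two general categories relevant here.
    static String describe(char ch) {
        switch (Character.getType(ch)) {
            case Character.NON_SPACING_MARK:
                return "NON_SPACING_MARK";
            case Character.OTHER_LETTER:
                return "OTHER_LETTER";
            default:
                return "type " + Character.getType(ch);
        }
    }

    public static void main(String[] args) {
        // U+0E1C (consonant), U+0E39 (vowel sign), U+0E49 (tone mark)
        String word = "\u0E1C\u0E39\u0E49";
        for (char ch : word.toCharArray()) {
            System.out.printf("U+%04X isLetter=%b isLowerCase=%b %s%n",
                    (int) ch, Character.isLetter(ch), Character.isLowerCase(ch), describe(ch));
        }
    }
}
```

The consonant reports as a letter, while the vowel sign and tone mark are neither letters nor lowercase, which is why they fall through to the delimiter branch.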

        Yonik Seeley added a comment -

        Thanks for the tip Robert.
        Here's a patch that should improve things (and works for both examples given here).
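The attached patch is not reproduced on this page. As an illustration only, one way to act on the tip is to classify by `Character.getType` and treat combining marks as part of the word. Everything in this sketch is an assumption rather than the committed code: the class name `MarkAwareClassifierDemo`, the `ALPHA` constant's value, the helper `split`, and the exact set of categories.

```java
import java.util.ArrayList;
import java.util.List;

public class MarkAwareClassifierDemo {
    static final int LOWER = 1, UPPER = 2, ALPHA = 3, SUBWORD_DELIM = 4;

    // Assumed classification: marks count as word characters.
    // Digit handling is omitted because the quoted snippet elides it.
    static int charType(char ch) {
        switch (Character.getType(ch)) {
            case Character.UPPERCASE_LETTER:
                return UPPER;
            case Character.LOWERCASE_LETTER:
                return LOWER;
            case Character.TITLECASE_LETTER:
            case Character.MODIFIER_LETTER:
            case Character.OTHER_LETTER:
            case Character.NON_SPACING_MARK:
            case Character.ENCLOSING_MARK:
            case Character.COMBINING_SPACING_MARK:
                return ALPHA;
            default:
                return SUBWORD_DELIM;
        }
    }

    // Hypothetical helper: cut the input at every SUBWORD_DELIM character.
    static List<String> split(String s) {
        List<String> parts = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char ch = s.charAt(i);
            if (charType(ch) == SUBWORD_DELIM) {
                if (cur.length() > 0) {
                    parts.add(cur.toString());
                    cur.setLength(0);
                }
            } else {
                cur.append(ch);
            }
        }
        if (cur.length() > 0) {
            parts.add(cur.toString());
        }
        return parts;
    }

    public static void main(String[] args) {
        // Both examples from the description, as Unicode escapes.
        String spaced = "\u0E1C\u0E39\u0E49 \u0E43\u0E2B\u0E0D\u0E48 \u0E1A\u0E49\u0E32\u0E19";
        String joined = "\u0E1C\u0E39\u0E49\u0E43\u0E2B\u0E0D\u0E48\u0E1A\u0E49\u0E32\u0E19";
        System.out.println(split(spaced));
        System.out.println(split(joined));
    }
}
```

With marks kept inside the word, the spaced example yields three tokens and the unspaced example stays as a single token, matching the expected results in the description.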

        Robert Muir added a comment -

        Looks pretty good... I was concerned that the obvious fix would break the split-on-case-change behavior.

        I think you want to include MODIFIER_SYMBOL, though.

        Yonik Seeley added a comment -

        Committed. Not perfect, but it should be much, much better for other languages.

        Yonik Seeley added a comment -

        MODIFIER_SYMBOL as an ALPHA?

        Robert Muir added a comment -

        I think so: U+005E CIRCUMFLEX ACCENT, U+0060 GRAVE ACCENT, etc.
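For reference, a small check that the two characters named here do report as MODIFIER_SYMBOL in Java. The class name `ModifierSymbolCheck` and the helper `isModifierSymbol` are hypothetical; whether the filter should treat this category as ALPHA is the open question in this thread and is not settled by the sketch.

```java
public class ModifierSymbolCheck {
    // True when the character's general category is Sk (modifier symbol).
    static boolean isModifierSymbol(char ch) {
        return Character.getType(ch) == Character.MODIFIER_SYMBOL;
    }

    public static void main(String[] args) {
        char[] candidates = { '^', '`', 'a' };
        for (char ch : candidates) {
            System.out.printf("U+%04X modifierSymbol=%b%n", (int) ch, isModifierSymbol(ch));
        }
    }
}
```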

        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4


          People

          • Assignee:
            Unassigned
            Reporter:
            SIriwat Aumngamsup
          • Votes:
            0
            Watchers:
            0