Solr / SOLR-1078

WordDelimiterFilter does wrong word breaking for Thai vowels


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels: None
    • Environment: Ubuntu 8.10 64bit, Java 1.6.0_10

    Description

      Any schema.xml configuration that includes

      <filter class="solr.WordDelimiterFilterFactory" />

      breaks Thai text at the wrong positions.


      Example: "ผู้ ใหญ่ บ้าน"

      Wrong result: 0 => "ผ", 1 => "ใหญ", 2 => "บ", 3 => "าน"

      Expected result: 0 => "ผู้", 1 => "ใหญ่", 2 => "บ้าน"


      Example2: "ผู้ใหญ่บ้าน" (no space)

      Wrong result: 0 => "ผ", 1 => "ใหญ", 2 => "บ", 3 => "าน" (same result)

      Expected result: 0 => "ผู้ใหญ่บ้าน"
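
The wrong tokens above are consistent with the Thai vowel and tone marks being treated as delimiters. This is an assumption about the cause, not something confirmed from the WordDelimiterFilter source. The marks U+0E39, U+0E48 and U+0E49 are Unicode non-spacing marks (category Mn), so `Character.isLetter` returns false for them. The sketch below is not Solr code. `ThaiSplitSketch`, `split` and `isNonSpacingMark` are hypothetical names used only for illustration. Splitting on every non-letter reproduces the wrong result, while also accepting non-spacing marks as word characters gives the expected one. The input string is the spaced example written with Unicode escapes.

```java
import java.util.ArrayList;
import java.util.List;

public class ThaiSplitSketch {

    // Hypothetical helper: true when c is a Unicode non-spacing mark (category Mn),
    // for example the Thai vowel U+0E39 or the tone marks U+0E48 and U+0E49.
    static boolean isNonSpacingMark(char c) {
        return Character.getType(c) == Character.NON_SPACING_MARK;
    }

    // Splits text into runs of word characters.
    // keepMarks == false: only Character.isLetter counts as a word character.
    // keepMarks == true:  non-spacing marks count as word characters too.
    static List<String> split(String text, boolean keepMarks) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            boolean wordChar = Character.isLetter(c) || (keepMarks && isNonSpacingMark(c));
            if (wordChar) {
                current.append(c);
            } else if (current.length() > 0) {
                // A non-word character ends the current token.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The spaced example from this report, as Unicode escapes.
        String spaced = "\u0E1C\u0E39\u0E49 \u0E43\u0E2B\u0E0D\u0E48 \u0E1A\u0E49\u0E32\u0E19";
        System.out.println(split(spaced, false).size()); // 4 tokens, as in the wrong result
        System.out.println(split(spaced, true).size());  // 3 tokens, as in the expected result
    }
}
```

With `keepMarks` set to true, the same method also keeps the unspaced example as a single token. Real Thai word segmentation needs a dictionary-based tokenizer, which this sketch does not attempt.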


      A similar problem has been reported for Drupal (http://drupal.org/node/335928).

          People

            Assignee: Unassigned
            Reporter: SIriwat Aumngamsup (noomz)