Lucene - Core / LUCENE-503

Contrib: ThaiAnalyzer to enable Thai full-text search in Lucene

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.4
    • Fix Version/s: None
    • Component/s: modules/analysis
    • Labels:
      None

      Description

      Thai text doesn't have spaces between words. Usually, a dictionary-based algorithm is used to break a string into words. For Lucene to be usable for Thai, an Analyzer that knows how to break Thai words is needed.

      I've implemented such an Analyzer, ThaiAnalyzer, using the ICU4J DictionaryBasedBreakIterator for word breaking. I'll upload the code later.

      I'm normally a C++ programmer and very new to Java. Please review the code for any problems. One possible problem is that it requires ICU4J. I don't know whether this is OK.

      1. TestThaiAnalyzer.java
        2 kB
        Samphan Raruenrom
      2. ThaiAnalyzer.java
        1 kB
        Samphan Raruenrom
      3. ThaiWordFilter.java
        2 kB
        Samphan Raruenrom

        Activity

        samphan Samphan Raruenrom added a comment -

        ThaiAnalyzer, which simply returns a TokenFilter chain with ThaiWordFilter in the middle

        samphan Samphan Raruenrom added a comment -

        ThaiWordFilter, which uses java.text.BreakIterator to break Thai words into tokens
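
As a rough illustration of the approach described here (not the attached ThaiWordFilter itself), the sketch below uses only the JDK's java.text.BreakIterator with a Thai locale to split a string into words. The class name ThaiBreakSketch and the method words are hypothetical. Where a run of Thai text is split depends on the dictionary shipped with the JDK in use, so no output is shown for it.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch; not the ThaiWordFilter attached to this issue.
public class ThaiBreakSketch {

    // Splits text into word tokens using the JDK's word BreakIterator for Thai.
    // Segments that contain no letter or digit (spaces, punctuation) are skipped.
    static List<String> words(String text) {
        BreakIterator breaker = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        breaker.setText(text);
        List<String> out = new ArrayList<>();
        int start = breaker.first();
        for (int end = breaker.next(); end != BreakIterator.DONE; start = end, end = breaker.next()) {
            String piece = text.substring(start, end);
            if (piece.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(piece);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Mixed English and Thai input; the Thai run is segmented by the JDK's dictionary.
        System.out.println(words("Lucene ค้นหาภาษาไทย"));
    }
}
```

Text without spaces is handed to the iterator as a whole, and the iterator decides the boundaries, which is why the result can vary between JDK versions.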

        samphan Samphan Raruenrom added a comment -

        I've changed the code to use java.text.BreakIterator instead of ICU4J, which removes the dependency on ICU4J. ThaiAnalyzer has been tested intensively by several groups of developers in at least two production systems (by To-Be-One Technology, who support the development), so it is quite stable. The code is rather small because I tried to make it as efficient and easy to read as possible. It was tested with Lucene 1.4 and lately with Lucene 1.9.1.

        lucenebugs@danielnaber.de Daniel Naber added a comment -

        Thanks for your contribution. We're currently preparing Lucene 2.0, and as feature updates are only planned for the release after 2.0, it will take some more time to integrate this.

        Two remarks:

        -It uses the English stop words, does that make sense?
        -Could you write some test cases, similar maybe to those for the French analyzer?

        samphan Samphan Raruenrom added a comment -

        > -It uses the english stop words, does that make sense?

        Yes. Thai text usually mixes in English words here and there, so English stop words should apply, but this is arguable. I'll consult with the developer community.

        > -Could you write some test cases, similar maybe to those for the French analyzer?

        OK. I'm thinking of writing them.
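
To make the point about mixed text concrete, here is a minimal sketch (not the attached code) that drops English stop words from an already tokenized list while leaving Thai tokens untouched. The class name, the method name, and the four-word stop list are hypothetical samples, not Lucene's actual English stop word list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch; not part of the code attached to this issue.
public class MixedStopWordSketch {

    // Small sample stop list for illustration only (an assumption).
    static final Set<String> STOP = new HashSet<>(Arrays.asList("a", "the", "of", "and"));

    // Returns the tokens that are not in the sample English stop list.
    // Thai tokens never match the list, so they always pass through.
    static List<String> dropStopWords(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP.contains(token.toLowerCase(Locale.ROOT))) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(dropStopWords(Arrays.asList("the", "ภาษา", "of", "Lucene")));
    }
}
```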

        arthit Arthit Suriyawongkul added a comment -

        Related projects/implementations:

        SansarnLook, based on Lucene, with an additional ThaiAnalyzer
        ref: http://sansarn.com/look/technique/
        file: http://sansarn.com/look/download/

        Pichai Ongvasith's ThaiAnalyzer
        ref: http://mail-archives.apache.org/mod_mbox/lucene-java-dev/200402.mbox/%3C20040218004604.61360.qmail@web41501.mail.yahoo.com%3E
        file: http://pichai.netfirms.com/thai_analyzer.zip

        samphan Samphan Raruenrom added a comment -

        Added a TestThaiAnalyzer JUnit test, modified from TestFrenchAnalyzer. The Thai words are picked so that changing the dictionary (or the algorithm in the JDK) should not affect the test.

        samphan Samphan Raruenrom added a comment -

        All the code has been tested with Lucene 2.0.0.
        Thanks, Art, for the info/URLs. I didn't know about Pichai's work before I started this project. However, I heard about NECTEC's SansarnLook when I visited them and talked about my ThaiAnalyzer. My goal for this job is for the code to be included in Lucene so that Thai works out of the box. So no more reinventing the wheel.

        hossman Hoss Man added a comment -

        I don't know anything about the Thai language ... but this code is clean, fairly easy to follow, and has tests that pass.

        If no one (who knows something about Thai) sees anything wrong with this implementation and objects, I'll commit it sometime this weekend.

        hossman Hoss Man added a comment -

        Committed.


          People

          • Assignee:
            hossman Hoss Man
            Reporter:
            samphan Samphan Raruenrom