Lucene - Core / LUCENE-2653

ThaiAnalyzer assumes things about your jre

    Details

    • Lucene Fields:
      New

      Description

      The ThaiAnalyzer/ThaiWordFilter depends on BreakIterator.getWordInstance(new Locale("th")) returning a dictionary-based break iterator that can segment Thai phrases into words (Thai does not put whitespace between words).

      But it is non-standard for a JRE to specialize this locale in this way. It's nice, but you can't depend on it.
      For example, if you are running on an IBM JRE, this analyzer/wordfilter is completely "broken" in the sense that it won't do what it claims to do.
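To make the dependency concrete, here is a minimal sketch (not part of ThaiWordFilter) that asks the running JRE to segment a Thai phrase. The names ThaiBreakProbe and segment are made up for illustration, and Locale.forLanguageTag("th") stands in for new Locale("th"). On a JRE with a dictionary-based iterator the phrase should come back as more than one token; on a JRE without one it may come back unsplit.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class ThaiBreakProbe {
    /** Splits text at the word boundaries reported by the JRE's Thai BreakIterator. */
    static List<String> segment(String text) {
        // Same locale as new Locale("th") in the issue description.
        BreakIterator bi = BreakIterator.getWordInstance(Locale.forLanguageTag("th"));
        bi.setText(text);
        List<String> words = new ArrayList<>();
        int start = bi.first();
        // Walk consecutive boundaries and cut the text between them.
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            words.add(text.substring(start, end));
        }
        return words;
    }

    public static void main(String[] args) {
        // A Thai phrase meaning "Thai language", written with Unicode escapes; it has no whitespace.
        String phrase = "\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22";
        // The number of tokens printed depends on the JRE, which is the point of this issue.
        System.out.println(segment(phrase));
    }
}
```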

      At a minimum, we need to document this and suggest that users look at ICUTokenizer for Thai, which always has this break iterator and is not JRE-dependent.

      Better would be to check statically that the thing actually works.
      When creating a new ThaiWordFilter we could clone() the BreakIterator, which is often cheaper than making a new one anyway.
      We could throw an exception if it's not supported, and add a boolean so the user knows whether it works.
      And we could refer to this boolean with Assume.assumeTrue in its tests.
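The check-and-clone idea above could look roughly like the following. This is a sketch under assumptions, not the committed patch: the class name ThaiSegmenterGuard, the probe phrase, the firstWordEnd helper, and the use of isBoundary(4) as the test are all illustrative. Only the DBBI_AVAILABLE name and the exception message are taken from the patch comment on this issue.

```java
import java.text.BreakIterator;
import java.util.Locale;

public class ThaiSegmenterGuard {
    // Prototype iterator, created once; each instance works on its own clone.
    private static final BreakIterator PROTO =
            BreakIterator.getWordInstance(Locale.forLanguageTag("th"));

    /** True if this JRE appears to provide dictionary-based Thai segmentation. */
    public static final boolean DBBI_AVAILABLE;

    static {
        // Assumed probe: a 7-character Thai phrase ("Thai language") made of a
        // 4-character word followed by a 3-character word. A dictionary-based
        // iterator is expected to report a boundary at offset 4.
        PROTO.setText("\u0E20\u0E32\u0E29\u0E32\u0E44\u0E17\u0E22");
        DBBI_AVAILABLE = PROTO.isBoundary(4);
    }

    private final BreakIterator breaker;

    public ThaiSegmenterGuard() {
        if (!DBBI_AVAILABLE) {
            throw new UnsupportedOperationException(
                    "This JRE does not have support for Thai segmentation");
        }
        // Cloning the prototype avoids another getWordInstance lookup.
        breaker = (BreakIterator) PROTO.clone();
    }

    /** Returns the end offset of the first word, or BreakIterator.DONE for empty text. */
    public int firstWordEnd(String text) {
        breaker.setText(text);
        return breaker.next();
    }
}
```

Tests could then be skipped on unsupported JREs with Assume.assumeTrue(ThaiSegmenterGuard.DBBI_AVAILABLE).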

        Activity

        Robert Muir created issue -
        Robert Muir added a comment -

        Here's a patch: it detects statically whether the BreakIterator for the Thai locale will actually work at all,
        and sets a boolean DBBI_AVAILABLE.

        In the ctor, if this is false, it throws UOE("This JRE does not have support for Thai segmentation").

        I also added docs referring to ICUTokenizer in case you need this across all JREs, and put
        Assume.assumeTrue(ThaiWordFilter.DBBI_AVAILABLE) in the tests.

        Robert Muir made changes -
        Field Original Value New Value
        Attachment LUCENE-2653.patch [ 12454946 ]
        Simon Willnauer added a comment -

        Looks good to me, Robert! Make sure you add a CHANGES.TXT entry. Could that have been a backwards-compatibility break, since it did not do what it claimed to do?

        simon

        Robert Muir added a comment -

        "Could that have been a backwards-compatibility break, since it did not do what it claimed to do?"

        I don't understand the question. ThaiWordFilter has always been broken this way; it is broken by design.

        Simon Willnauer added a comment -

        "I don't understand the question. ThaiWordFilter has always been broken this way; it is broken by design."

        Could somebody have used the broken behavior and come to rely on it? Just making sure it's not a backwards-compatibility break that we should document.

        Robert Muir added a comment -

        No, in this case the filter does not work at all; it does nothing.

        rmuir committed 998688 (57 files)
        Reviews: none

        LUCENE-2653: ThaiAnalyzer assumes things about your jre

        Lucene branch_3x
        Robert Muir added a comment -

        Committed revision 998684 (trunk), 998688 (3x)

        Robert Muir made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Assignee Robert Muir [ rcmuir ]
        Fix Version/s 3.1 [ 12314822 ]
        Fix Version/s 4.0 [ 12314025 ]
        Resolution Fixed [ 1 ]
        Robert Muir added a comment -

        Reopening for a possible 2.9.4/3.0.3 backport.

        Robert Muir made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        Robert Muir made changes -
        Fix Version/s 2.9.4 [ 12315148 ]
        Fix Version/s 3.0.3 [ 12315147 ]
        Robert Muir added a comment -

        I'm gonna shoot for a documentation-only fix here for 2.9.x and 3.0.x as well.
        It's a no-risk "fix" that at least alerts people that this won't work on, e.g., the IBM JDK.

        rmuir committed 1028791 (27 files)
        Reviews: none

        LUCENE-2653: document that ThaiWordFilter doesn't work on all JREs

        Lucene lucene_2_9
        Robert Muir added a comment -

        Committed documentation about this in:
        Revision 1028789 for 3.0.x
        Revision 1028791 for 2.9.x

        Robert Muir made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Uwe Schindler made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Mark Thomas made changes -
        Workflow jira [ 12520887 ] Default workflow, editable Closed status [ 12564319 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12564319 ] jira [ 12584849 ]
        Shai Erera made changes -
        Component/s modules/analysis [ 12310230 ]
        Component/s contrib/analyzers [ 12312333 ]

          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir