Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      While adding tests for offsets and type to ThaiAnalyzer, I discovered that it does not type Thai numeric digits correctly.
      ThaiAnalyzer uses StandardTokenizer, and this is really an issue with the grammar, which adds the entire [:Thai:] block to ALPHANUM.

      I propose that ALPHANUM be defined a little differently in the grammar.
      Instead of adding whole blocks to it, [:letter:] should be allowed to have diacritics/signs/combining marks attached to it.

      This would allow the [:thai:] hack to be removed completely, would allow StandardTokenizer to parse complex writing systems such as those of Indian languages, and would fix LUCENE-1545.
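
The proposal above can be illustrated with a minimal Java sketch. It is not the StandardTokenizer grammar (which is a JFlex file); it only shows the idea of typing by Unicode general category instead of by block, so that Thai digits come out as numbers while letters may carry combining marks. The names TokenTypeSketch, Type, isMark, and classify are hypothetical and exist only for this illustration.

```java
public class TokenTypeSketch {
    /** Hypothetical token types, loosely mirroring ALPHANUM and NUM. */
    enum Type { ALPHANUM, NUM, OTHER }

    /** True for combining marks (general categories Mn, Mc, Me). */
    static boolean isMark(int cp) {
        int t = Character.getType(cp);
        return t == Character.NON_SPACING_MARK
            || t == Character.COMBINING_SPACING_MARK
            || t == Character.ENCLOSING_MARK;
    }

    /** Types a token by general category rather than by Unicode block. */
    static Type classify(String token) {
        boolean sawLetter = false;
        boolean sawDigit = false;
        boolean canAttachMark = false; // true right after a letter or another mark
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isLetter(cp)) {
                sawLetter = true;
                canAttachMark = true;
            } else if (Character.isDigit(cp)) {
                sawDigit = true;
                canAttachMark = false;
            } else if (isMark(cp) && canAttachMark) {
                // a mark attached to the preceding letter: stays part of the token
            } else {
                return Type.OTHER;
            }
        }
        if (sawLetter) {
            return Type.ALPHANUM;
        }
        return sawDigit ? Type.NUM : Type.OTHER;
    }
}
```

With this rule, a run of Thai digits such as "\u0E51\u0E52\u0E53" is typed as NUM, while a Devanagari word whose letters carry vowel signs and a virama is still typed as ALPHANUM without any per-script special case.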

          Activity

          Robert Muir created issue -
          Steve Rowe added a comment -

          +1 (I was involved in perpetuating the Thai grammar hack)

          FWIW, JFlex 1.5, which hopefully will be released in the next few months, will have better Unicode support, including general category, script, and block property support, as well as the ability to select the Unicode version. This will simplify the grammar. (Note that JFlex 1.5-generated scanners will require Java 1.5, so we won't be using it in Lucene until after Lucene 3.0 has been released.)

          Robert Muir added a comment -

          Steven, I have been watching that JFlex 1.5 branch with great anticipation.

          Do you think it will support characters outside of the BMP?

          (My hope is that it might perform better than the ICU RBBI for some other things I am working on.)

          Steve Rowe added a comment -

          "Steven I have been watching that jflex 1.5 branch with great anticipation"

          Cool! If you mention this on the jflex-devel mailing list, you may be able to help nudge Gerwin Klein (JFlex founder and main developer) into starting work on merging the 1.5 branch onto the trunk.

          "Do you think it will support characters outside of the BMP?"

          As you may already know, the 1.5 branch does not yet include above-BMP support. However, this is definitely a future goal.

          My guess is that 1.5.0 will be BMP-only, and that 1.5.X or 1.6 will add above-BMP support. (This is my guess because the Unicode properties code is present and functional in the branch now, but no work has yet been done to add above-BMP support.)

          Robert Muir added a comment -

          Steven, even without >BMP support, the 1.5 branch would make the grammar file clearer and more maintainable.
          Otherwise, code point ranges must be used.

          I'll take your advice and send the nudge.

          I think that, for clarity, it would be best for this issue to wait for the 1.5.0 version of JFlex.
          Even without >BMP support, we should still be able to function.
          For example, surrogate pairs with a lead surrogate in D840-D87F point to the SIP, so they should be typed as CJK.

          For reference (I haven't looked at JFlex), above-BMP support might require new data structures. I think ICU uses things like tries / compact arrays to deal with the fact that thousands of code points share the same property value, etc.
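
The surrogate arithmetic behind the D840-D87F remark can be checked with a short Java sketch. The SIP (Supplementary Ideographic Plane) covers U+20000 to U+2FFFF, and the lead surrogate of a supplementary code point is 0xD800 plus the top bits of (codePoint - 0x10000). The class and method names below are hypothetical; Character.highSurrogate is a standard JDK method (Java 7 and later).

```java
public class SipSurrogateSketch {
    /** Lead (high) surrogate computed by hand, for supplementary code points only. */
    static char leadSurrogate(int codePoint) {
        return (char) (0xD800 + ((codePoint - 0x10000) >> 10));
    }

    /** True if this UTF-16 code unit is a lead surrogate that points into the SIP. */
    static boolean isSipLeadSurrogate(char c) {
        return c >= 0xD840 && c <= 0xD87F;
    }

    public static void main(String[] args) {
        // First and last code points of the SIP.
        System.out.printf("%04X%n", (int) leadSurrogate(0x20000));
        System.out.printf("%04X%n", (int) leadSurrogate(0x2FFFF));
    }
}
```

A BMP-only scanner could therefore treat a code unit in that range, followed by any trail surrogate, as a CJK character, without decoding the full code point.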

          Steve Rowe added a comment -

          "I think for this issue it would be best to wait for the 1.5.0 version of jflex for clarity."

          +0, in that the arrival time for 1.5.0 is unknown, but I'll defer to your judgment.

          "for reference (haven't looked at jflex), above-bmp support might require new data structures. I think ICU uses things like tries / compactarrays to deal with the fact you have thousands of codepoints with the same property value, etc."

          Thanks for the heads-up. The above-BMP property values for the currently supported properties are now encoded on the 1.5 branch as range pairs (they just aren't accessible yet because of the BMP limit). Since JFlex is a regular expression engine, code for handling large character sets (as sets of ranges) is already built-in, so I don't anticipate this will be a problem. The main thing will just be to switch from char to int for character representation.
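
The idea of storing a property as range pairs, with int rather than char for code points, can be sketched as follows. This is not JFlex's internal data structure, which the comment above does not show; it is only a plain sorted-ranges lookup under that assumption. The class name and the two example ranges (the CJK Unified Ideographs block and the Extension B block) are chosen for illustration.

```java
public class RangeSetSketch {
    private final int[] starts; // range starts, sorted ascending
    private final int[] ends;   // matching inclusive range ends

    /** Ranges must be sorted and non-overlapping: {start, end} pairs, end inclusive. */
    RangeSetSketch(int[][] ranges) {
        starts = new int[ranges.length];
        ends = new int[ranges.length];
        for (int i = 0; i < ranges.length; i++) {
            starts[i] = ranges[i][0];
            ends[i] = ranges[i][1];
        }
    }

    /** Binary search over the ranges; takes an int so above-BMP code points work. */
    boolean contains(int codePoint) {
        int lo = 0;
        int hi = starts.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            if (codePoint < starts[mid]) {
                hi = mid - 1;
            } else if (codePoint > ends[mid]) {
                lo = mid + 1;
            } else {
                return true;
            }
        }
        return false;
    }

    /** Illustrative data only: two CJK blocks, one in the BMP and one in the SIP. */
    static RangeSetSketch exampleCjk() {
        return new RangeSetSketch(new int[][] {
            {0x4E00, 0x9FFF},
            {0x20000, 0x2A6DF},
        });
    }
}
```

Because each entry covers a whole run of code points, thousands of characters with the same property value cost one pair each, and the only change needed for above-BMP support in this sketch is the int parameter.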

          Robert Muir added a comment -

          Steven, thanks for the information, and the range representation sounds interesting.

          So I'll let others comment on whether they want this fixed pre-1.5.0. In that case we could define macros in JFlex that represent what we want, with comments indicating how they will be defined in the future JFlex.
          Either way, a specific Unicode version should be selected, with the macros either defined from that Unicode version or that version specified to JFlex 1.5.0. Unicode 5.1 sounds good to me.

          The matchVersion could be used to ensure that back compat always works.
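
The matchVersion idea can be sketched in Java as a version gate around the new behavior. The enum below is a stand-in written for this example, not Lucene's own Version class, and the choice of 3.1 as the cutoff is an assumption taken from the Fix Version listed on this issue.

```java
public class MatchVersionSketch {
    /** Hypothetical stand-in for a release version enum, declared oldest first. */
    enum Version {
        V30, V31;

        boolean onOrAfter(Version other) {
            return compareTo(other) >= 0;
        }
    }

    /**
     * Token type reported for a run of Thai digits. Older match versions keep
     * the old typing so that existing indexes and queries stay consistent.
     */
    static String typeOfThaiDigits(Version matchVersion) {
        return matchVersion.onOrAfter(Version.V31) ? "<NUM>" : "<ALPHANUM>";
    }
}
```

Callers that pass the older version keep getting the old type, and only callers that opt in to the newer version see the corrected one.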

          Robert Muir added a comment -

          Related to this issue, Steven has added support for Unicode text segmentation properties to the 1.5 dev branch of JFlex: http://sourceforge.net/mailarchive/message.php?msg_name=4A747D60.4090904%40odyssey.net

          We should be able to start prototyping a different definition of ALPHANUM, etc. that solves this issue (and improves tokenization of many languages!)

          Robert Muir made changes -
          Field Original Value New Value
          Link This issue is part of LUCENE-2167 [ LUCENE-2167 ]
          Robert Muir added a comment -

          Fixed in LUCENE-2167.

          Robert Muir made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Fix Version/s 3.1 [ 12314822 ]
          Fix Version/s 4.0 [ 12314025 ]
          Resolution Fixed [ 1 ]
          Mark Thomas made changes -
          Workflow jira [ 12466368 ] Default workflow, editable Closed status [ 12563813 ]
          Mark Thomas made changes -
          Workflow Default workflow, editable Closed status [ 12563813 ] jira [ 12585344 ]
          Grant Ingersoll added a comment -

          Bulk close for 3.1

          Grant Ingersoll made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Shai Erera made changes -
          Component/s modules/analysis [ 12310230 ]
          Component/s contrib/analyzers [ 12312333 ]
          Transition            Time In Source Status   Execution Times   Last Executer     Last Execution Date
          Open -> Resolved      466d 14h 18m            1                 Robert Muir       29/Sep/10 06:51
          Resolved -> Closed    182d 9h 58m             1                 Grant Ingersoll   30/Mar/11 16:50

            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0
              Watchers:
              0

              Dates

              • Created:
                Updated:
                Resolved: