Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      A TokenFilter that folds all Unicode decimal digits (http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Decimal_Number:]) to the ASCII digits 0-9.

      Historically, many of the affected analyzers couldn't tokenize numbers at all, but they now use StandardTokenizer for numeric and alphanumeric tokens. In practice, though, text usually contains a mix of ASCII digits and "native" digits, and today that makes searching difficult.

      Note that this only affects decimal digits, hence the name DecimalDigitFilter. So there is no processing of Chinese numerals or anything crazy like that.
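
The folding described above can be illustrated with plain JDK calls. This is only a minimal sketch of the idea, not the code from the patch: the class name `DigitFolding` and the method `foldDigits` are hypothetical, and it works on a whole `String` rather than on a token's term buffer. It assumes that `Character.isDigit(int)` (true for General_Category=Decimal_Number) and `Character.digit(int, 10)` are sufficient to identify and map each digit.

```java
// Hypothetical helper for illustration only; not part of Lucene.
final class DigitFolding {
    private DigitFolding() {}

    /** Returns a copy of input with every non-ASCII Unicode decimal digit replaced by 0-9. */
    static String foldDigits(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // Read a full code point so supplementary digits (surrogate pairs) are handled.
            int cp = input.codePointAt(i);
            if (cp > 0x7F && Character.isDigit(cp)) {
                // Character.digit returns the decimal value 0..9 for a decimal digit.
                out.append((char) ('0' + Character.digit(cp, 10)));
            } else {
                // ASCII and all non-digit code points pass through unchanged.
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

With this sketch, Arabic-Indic or Devanagari digits come out as ASCII digits, while a Chinese numeral such as 三 is left alone because it is not in the Decimal_Number category. Note that a digit outside the Basic Multilingual Plane occupies two Java chars but folds to one, so the output can be shorter than the input.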

          Activity

          Robert Muir added a comment -

          patch.

          Adrien Grand added a comment -

          +1

          Uwe Schindler added a comment -

          +1

          ASF subversion and git services added a comment -

          Commit 1695898 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1695898 ]

          LUCENE-6737: Add DecimalDigitFilter which folds unicode digits to basic latin

          ASF subversion and git services added a comment -

          Commit 1695908 from Robert Muir in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1695908 ]

          LUCENE-6737: Add DecimalDigitFilter which folds unicode digits to basic latin

          Ramkumar Aiyengar added a comment -

          ICU folding does this, right? This patch is still useful even if so, in case you don't want the full folding or don't want to use ICU. Just curious, really.

          Robert Muir added a comment -

          It does, among other dangerous foldings you may not want. Additionally, it can't improve the behaviour of all these language analyzers, because ICU is optional. So this is just a simple filter, like Lowercase, to improve the situation.

          Hoss Man added a comment -

          I think there may be a bug here for some digits. I created a new issue, LUCENE-6914, in case it's non-trivial to fix and doesn't get resolved before 5.4 is released.

          Uwe Schindler added a comment -

          Just an idea: could we expand the autogenerated UnicodeData.java to include ICU-extracted digits, like UnicodeWhitespaceTokenizer? Just in case the Java data is strange (although I think it is a bug in the filter, as Hoss said).

          Uwe Schindler added a comment -

          Ignore my last comment: The filter needs more Unicode info than Character#isDigit().


            People

            • Assignee:
              Unassigned
              Reporter:
              Robert Muir
            • Votes:
              0
              Watchers:
              3
