Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 5.4, 6.0
    • Component/s: None
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      A TokenFilter that folds all Unicode decimal digits (http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Decimal_Number:]) to the ASCII digits 0-9.

      Historically, a lot of the impacted analyzers couldn't tokenize numbers at all, but now they use StandardTokenizer for numeric and alphanumeric tokens. In practice, though, text usually contains a mix of ASCII digits and "native" digits, and today that makes searching difficult.

      Note that this only impacts decimal digits, hence the name DecimalDigitFilter. So there is no processing of Chinese numerals or anything crazy like that.
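
The folding described above can be illustrated with a minimal standalone sketch. This is not the code from the patch and does not use the Lucene TokenFilter API; the class name `DigitFoldingSketch` and the method `foldDigits` are hypothetical names chosen for illustration. It assumes that the JDK's `Character.isDigit(int)` (true for code points in the Decimal_Number category) is an adequate test, which depends on the Unicode data shipped with the Java version in use.

```java
public final class DigitFoldingSketch {
    private DigitFoldingSketch() {}

    /**
     * Returns a copy of the input in which every non-ASCII Unicode decimal
     * digit is replaced by the corresponding ASCII digit 0-9.
     * Hypothetical helper for illustration only.
     */
    public static String foldDigits(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // Iterate by code point so supplementary characters are handled whole.
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);

            // Numeric value in radix 10, or -1 if cp is not a valid digit.
            int value = (cp > 0x7F && Character.isDigit(cp)) ? Character.digit(cp, 10) : -1;
            if (value >= 0) {
                out.append((char) ('0' + value));
            } else {
                // ASCII and all non-digit code points pass through unchanged.
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }
}
```

With this sketch, a string mixing Arabic-Indic digits and ASCII digits folds to ASCII digits only, while letters and other numerals (such as Chinese numerals, which are not decimal digits) are left untouched.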

          Activity

          thetaphi Uwe Schindler added a comment -

          Ignore my last comment: The filter needs more Unicode info than Character#isDigit().

          thetaphi Uwe Schindler added a comment -

          Just an idea: could we extend the autogenerated UnicodeData.java with digits extracted from ICU, as is done for UnicodeWhitespaceTokenizer? Just in case the Java data is strange (although I think it is a bug in the filter, as Hoss said).

          hossman Hoss Man added a comment -

          I think there may be a bug here for some digits. I created a new issue, LUCENE-6914, in case it's non-trivial to fix and doesn't get resolved before 5.4 is released.

          rcmuir Robert Muir added a comment -

          It does, among other dangerous foldings you may not want. Additionally, it can't improve the behaviour of all these languages' Analyzers, as ICU is optional. So this is just a simple filter, like Lowercase, to improve the situation.

          andyetitmoves Ramkumar Aiyengar added a comment -

          ICU folding does this, right? This patch is still useful even so, in case you don't want to do the full folding or don't want to use ICU. Just curious, really.

          jira-bot ASF subversion and git services added a comment -

          Commit 1695908 from Robert Muir in branch 'dev/branches/branch_5x'
          [ https://svn.apache.org/r1695908 ]

          LUCENE-6737: Add DecimalDigitFilter which folds unicode digits to basic latin

          jira-bot ASF subversion and git services added a comment -

          Commit 1695898 from Robert Muir in branch 'dev/trunk'
          [ https://svn.apache.org/r1695898 ]

          LUCENE-6737: Add DecimalDigitFilter which folds unicode digits to basic latin

          thetaphi Uwe Schindler added a comment -

          +1

          jpountz Adrien Grand added a comment -

          +1

          rcmuir Robert Muir added a comment -

          patch.


            People

            • Assignee:
              Unassigned
              Reporter:
              rcmuir Robert Muir
            • Votes:
              0
              Watchers:
              3
