Lucene - Core / LUCENE-1963

ArabicAnalyzer: Lowercase before Stopfilter

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 2.9.1, 3.0
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      ArabicAnalyzer lowercases text in case there is non-Arabic text mixed in.
      It also lets you set a custom stopword list (you might augment the Arabic list with some English stopwords, for example).

      In that case it is helpful to lowercase before the StopFilter, so that these non-Arabic stopwords match regardless of case.
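
To illustrate why the order matters, here is a minimal sketch in plain Java. It does not use the Lucene API: the class `StopOrderDemo`, its helper methods, and the mixed stopword list are all hypothetical, and the "stop then lowercase" ordering is only what the issue title implies about the earlier behavior. It assumes stopword matching is case-sensitive and that the custom list holds lowercase English entries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopOrderDemo {
    // Hypothetical custom stopword list: Arabic entries plus lowercase English ones.
    static final Set<String> STOPWORDS = Set.of("في", "من", "the", "of");

    // Stand-in for a lowercasing filter.
    static List<String> lowercase(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    // Stand-in for a case-sensitive stopword filter.
    static List<String> removeStopwords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!STOPWORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Stopwords are removed first, so "The" never matches the entry "the".
    static List<String> stopThenLowercase(List<String> tokens) {
        return lowercase(removeStopwords(tokens));
    }

    // Tokens are lowercased first, so "The" becomes "the" and is removed.
    static List<String> lowercaseThenStop(List<String> tokens) {
        return removeStopwords(lowercase(tokens));
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("The", "كتاب", "of", "Lucene");
        System.out.println(stopThenLowercase(tokens)); // "the" survives
        System.out.println(lowercaseThenStop(tokens)); // "the" is removed
    }
}
```

With stop-then-lowercase, the capitalized "The" slips past the stopword check and ends up in the output as "the"; with lowercase-then-stop it is dropped. This is also why an index built with the old order may need rebuilding when non-Arabic stopwords are in use.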

      1. LUCENE-1963.patch
        1 kB
        Robert Muir
      2. LUCENE-1963.patch
        3 kB
        Robert Muir
      3. LUCENE-1963_branch.patch
        4 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        simple patch, but will need to warn in CHANGES.txt that folks should reindex, if they are using non-Arabic stopwords.

        Robert Muir added a comment -

        if no one objects, I'd like to commit this for 3.0 at the end of the day.

        Robert Muir added a comment -

        This patch also updates the javadocs to reflect the new order of operations in ArabicAnalyzer, to avoid confusing users.

        DM Smith added a comment -

        can you commit it to 2.9.1 too? (For those stuck on Java 1.4, there is no 3.0).

        Robert Muir added a comment -

        DM Smith wrote: "can you commit it to 2.9.1 too? (For those stuck on Java 1.4, there is no 3.0)."

        Can someone comment on this one for me?
        I don't think it's too much of a stretch to consider this a bug, even if it does not affect Arabic text.

        Robert Muir added a comment -

        Committed revision 823534.
        (if it is ok to apply this to 2.9 branch as DM requested, we should reopen)

        Mark Miller added a comment -

        Your issue - if you can stretch it to bugish territory, I'd +1 it. I'd be wary of getting into porting features to 2.9.1 - but I wouldn't have a problem with this one myself.

        Robert Muir added a comment -

        Mark, I think the problem is really that I overlooked this use case in LUCENE-1758, because Arabic is not case sensitive.

        It won't affect the default usage of the Analyzer (where all the stopwords are in Arabic and lowercase is a no-op).

        I am going to also set fix for 2.9.1 and give a day or two for people to comment if they disagree with applying to 2.9 branch.
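
As a small illustration of the no-op claim in the comment above, the following plain-Java sketch checks that lowercasing leaves Arabic words unchanged, since Arabic script has no letter case. The class name `ArabicLowercaseNoOp`, the helper, and the sample words are hypothetical and not taken from Lucene's stopword list.

```java
import java.util.List;
import java.util.Locale;

public class ArabicLowercaseNoOp {
    // True when lowercasing returns an identical string.
    static boolean unchangedByLowercase(String s) {
        return s.toLowerCase(Locale.ROOT).equals(s);
    }

    public static void main(String[] args) {
        // Sample Arabic words chosen for illustration only.
        for (String word : List.of("من", "في", "على")) {
            System.out.println(unchangedByLowercase(word)); // true: no case to fold
        }
        System.out.println(unchangedByLowercase("The")); // false: becomes "the"
    }
}
```

Because the Arabic entries pass through lowercasing untouched, swapping the order of the two steps only changes results when the text or the stopword list contains cased, non-Arabic words.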

        Robert Muir added a comment -

        It's been a few days, and no one objected to applying this fix to the branch.

        But I do not have permissions to commit to the branch... can someone commit this for me? Attached is the patch.

        Michael McCandless added a comment -

        Committed on 2.9.x. Thanks Robert!

        Michael McCandless added a comment -

        Bulk close all 2.9.1 issues.


          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir
