Lucene - Core / LUCENE-1966

Arabic Analyzer: Stopwords list needs enhancement

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Trivial
    • Resolution: Fixed
    • Affects Version/s: 2.9
    • Fix Version/s: 3.0
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      The provided Arabic stopwords list needs some enhancements (e.g. it contains a lot of words that are not stopwords, and it needs some cleanup). A patch will be provided with this issue.

      1. arabic-stopwords-comments.txt
        2 kB
        Basem Narmok
      2. LUCENE-1966.patch
        3 kB
        Basem Narmok
      3. LUCENE-1966.patch
        4 kB
        Basem Narmok

        Activity

        Robert Muir added a comment -

        Committed revision 825110.

        Thanks Basem!

        Robert Muir added a comment -

        Basem, yes I think the improvements are good.

        My question is really: is it OK to commit this for 3.0 or should we wait for 3.1?

        Basem Narmok added a comment -

        Seems good.

        BTW, with FAST ESP we never used stopwords: hits from stopwords get low relevancy (keywords with a high number of hits = low value, low importance, so less relevant), so such hits never make it into the top results. Also, removing stopwords affects phrase search, so most search engines avoid removing them. But in the end it depends on the client's application and what she really wants, as enterprise search can have very specific needs that differ from Internet search.

        Anyway, I am still testing the Arabic Analyzer and will provide more comments soon, but the stopwords are good for now.

        Robert Muir added a comment -

        Before I commit this, I want to solicit any comments/concerns about backwards compatibility, assuming the following notice:

        Changes in runtime behavior
        
         * LUCENE-1966: Modified and cleaned the default Arabic stopwords list used
           by ArabicAnalyzer. You'll need to fully re-index any previously created 
           indexes.  (Basem Narmok via Robert Muir)
        

        I know contrib has no backwards-compatibility guarantee, but I just want to double-check.
        Perhaps in the future someone might help fix the Persian stopwords file too, so this may happen again.

        Robert Muir added a comment -

        Basem, ok! Thanks a lot for your help here. I will commit soon.

        Basem Narmok added a comment -

        Oh, my mistake, sorry. Yes, please remove the last two on lines 123 & 124.

        No, they are just duplicates of the ones on lines 72 & 73.

        Robert Muir added a comment -

        Basem, I can simply remove 123 & 124 if this is the case, but I did not want to do this without checking first.

        The reason is, I wonder if perhaps you intended for these two to be أيضاً and ايضاً (with fathatan).

        Robert Muir added a comment - edited

        Basem, I meant: there are two entries for أيضا, and two entries for ايضا (total of four)

        edit: here are the relevant line numbers from the new stopwords.txt:

        Lines 72 and 73:

        ايضا
        أيضا
        

        Lines 123 and 124:

        ايضا
        أيضا
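
The repeated entries quoted above (lines 72/73 and 123/124) are the kind of thing a small check can catch before a list is committed. The following is a minimal sketch, not part of the patch: the class and method names are hypothetical, and it assumes the file has one word per line with `#` marking comment lines.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class StopwordDuplicates {
    /** Maps each entry that occurs more than once to the 1-based line numbers where it appears. */
    static Map<String, List<Integer>> findDuplicates(List<String> lines) {
        Map<String, List<Integer>> seen = new LinkedHashMap<>();
        for (int i = 0; i < lines.size(); i++) {
            String word = lines.get(i).trim();
            // Skip blank lines and comment lines (assumed here to start with '#').
            if (word.isEmpty() || word.startsWith("#")) {
                continue;
            }
            seen.computeIfAbsent(word, k -> new ArrayList<>()).add(i + 1);
        }
        // Keep only the entries that were seen on two or more lines.
        seen.values().removeIf(lineNumbers -> lineNumbers.size() < 2);
        return seen;
    }
}
```

Passing the lines of the stopwords file (for example from `java.nio.file.Files.readAllLines`) returns an empty map when every entry is unique. Comparison is by exact string, so spellings that differ only in hamza or diacritics are reported as distinct words.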
        
        Basem Narmok added a comment -

        Hi Robert,

        Regarding ايضا / أيضا ...

        No, not by accident. I included both forms (normalized and unnormalized). Arabic users tend to use both spellings on the internet; another example is words like أي / اي.
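
To illustrate why both spellings have to be listed: أيضا and ايضا differ in their first code point (U+0623 versus U+0627), so an exact-match lookup treats them as different words. The sketch below is a hypothetical helper, not something from the patch, that adds the bare-alef spelling of every entry to a set. The class and method names are made up for the example.

```java
import java.util.LinkedHashSet;
import java.util.Set;

class AlefVariantsSketch {
    // Replaces ALEF WITH HAMZA ABOVE (U+0623), ALEF WITH HAMZA BELOW (U+0625)
    // and ALEF WITH MADDA ABOVE (U+0622) with bare ALEF (U+0627).
    static String withBareAlef(String word) {
        return word.replace('\u0623', '\u0627')
                   .replace('\u0625', '\u0627')
                   .replace('\u0622', '\u0627');
    }

    // Returns the original entries plus the bare-alef spelling of each one.
    static Set<String> withBothSpellings(Set<String> stopwords) {
        Set<String> expanded = new LinkedHashSet<>(stopwords);
        for (String word : stopwords) {
            expanded.add(withBareAlef(word));
        }
        return expanded;
    }
}
```

A list that is matched against un-normalized tokens needs both forms present, whether they are typed by hand, as here, or generated.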

        Robert Muir added a comment -

        Basem, thanks. I like the new list.

        I have one very minor question: in the list we have أيضا / ايضا twice.

        I wanted to check with you, is this by accident or did you have some other spellings in mind?

        If it is by accident, let me know, I can just remove the duplicates before committing.

        Basem Narmok added a comment -

        Robert, you are correct. To solve the problem we have two options:
        1- to remove words like علي and وفي
        2- to use an unnormalized stopwords list, before the normalization filter.

        I think the second option is best, so this patch only modifies the list (unnormalized). Please try it.
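
The second option can be sketched in plain Java as follows. This is an illustration only, not the ArabicAnalyzer code: `normalize` is a simplified stand-in that folds only dotless yeh (U+0649) to yeh (U+064A), and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class StopBeforeNormalizeSketch {
    // Simplified stand-in for Arabic normalization: folds only
    // ALEF MAKSURA (U+0649, dotless yeh) to YEH (U+064A).
    static String normalize(String token) {
        return token.replace('\u0649', '\u064A');
    }

    // Drops stopwords by their original spelling, then normalizes what is left.
    static List<String> stopThenNormalize(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (stopwords.contains(token)) {
                continue; // matched against the un-normalized list
            }
            kept.add(normalize(token));
        }
        return kept;
    }
}
```

With an un-normalized list containing على, the preposition is removed while the name علي passes through, because the lookup happens before the two spellings are folded together.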

        Robert Muir added a comment -

        Basem, thanks for the patch, and the comments.

        One thing I noticed: if I apply the patch, على (the stopword) will not be filtered as a stopword. This is because it will be normalized to علي (the name).

        So, if we are going to normalize before stopfilter, I think we need to make sure the stopwords do not contain yeh without dots, or else these will not work. This is one example of why I was scared to apply normalization before stopwords, because by doing so, we cause على and علي to conflate.

        Let me know what you think about this.
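
The conflation described above can be reproduced with a small, self-contained sketch. It does not use Lucene's classes: `normalize` is a simplified stand-in that folds only dotless yeh (U+0649) to yeh (U+064A), and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class NormalizeBeforeStopSketch {
    // Simplified stand-in for Arabic normalization: folds only
    // ALEF MAKSURA (U+0649, "yeh without dots") to YEH (U+064A).
    static String normalize(String token) {
        return token.replace('\u0649', '\u064A');
    }

    // Normalizes every token first, then drops those found in the stop set.
    static List<String> normalizeThenStop(List<String> tokens, Set<String> stopwords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String normalized = normalize(token);
            if (!stopwords.contains(normalized)) {
                kept.add(normalized);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String ala = "\u0639\u0644\u0649"; // the stopword "on", ends in ALEF MAKSURA
        String ali = "\u0639\u0644\u064A"; // the name Ali, ends in YEH
        List<String> tokens = Arrays.asList(ala, ali);

        // Un-normalized entry: never matches, because tokens are normalized before lookup.
        Set<String> rawList = new HashSet<>(Arrays.asList(ala));
        System.out.println(normalizeThenStop(tokens, rawList).size()); // 2: nothing filtered

        // Normalized entry: matches both forms, so the name is removed as well.
        Set<String> normalizedList = new HashSet<>(Arrays.asList(ali));
        System.out.println(normalizeThenStop(tokens, normalizedList).size()); // 0: both filtered
    }
}
```

Either way the result is wrong for one of the two words once normalization runs first: the stopword slips through, or the name is filtered along with it.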

        Basem Narmok added a comment -

        Please see arabic-stopwords-comments.txt for my comments on the list: what I changed and why.

        The patch provides an updated Arabic stopwords file and modifies ArabicAnalyzer to filter stopwords after normalization, as the provided list contains normalized Arabic stopwords.

        Best,


          People

          • Assignee:
            Robert Muir
            Reporter:
            Basem Narmok