Lucene - Core
LUCENE-1758

improve arabic analyzer: light8 -> light10

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      Someone mentioned on the java-user list that the Arabic analysis was not as good as they would like.

      This patch adds the لل- prefix to the stemmer (moving from the light8 algorithm to light10).
      In the light10 paper, this improves precision from .390 to .413.
      The authors mention that this is not statistically significant, but it makes linguistic sense and has at least been shown not to hurt.

      In the future, I hope openrelevance will allow us to try some more approaches.
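
As a rough illustration of the change described above, the sketch below strips one leading prefix from a word, first with a light8-style prefix list and then with the same list plus لل. Only the addition of لل comes from this issue. The class and method names, the other prefixes in the list, and the minimum-remainder rule are assumptions made for illustration, and this is not the code in the attached patch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class LightPrefixSketch {
    // Assumed light8-style prefixes, written as Unicode escapes so the file
    // compiles regardless of source encoding.
    public static final List<String> LIGHT8 = Arrays.asList(
        "\u0648\u0627\u0644", // waw + alef + lam
        "\u0628\u0627\u0644", // beh + alef + lam
        "\u0643\u0627\u0644", // kaf + alef + lam
        "\u0641\u0627\u0644", // feh + alef + lam
        "\u0627\u0644",       // alef + lam
        "\u0648");            // waw

    // light10 as described in this issue: the same list plus lam + lam.
    public static final List<String> LIGHT10 = withExtra(LIGHT8, "\u0644\u0644");

    static List<String> withExtra(List<String> base, String extra) {
        List<String> out = new ArrayList<>(base);
        out.add(extra);
        return out;
    }

    // Removes the first matching prefix, if enough of the word would remain.
    public static String stripPrefix(String word, List<String> prefixes) {
        for (String p : prefixes) {
            // Assumed rule: keep at least 2 characters (3 after a one-letter prefix).
            int minRemainder = (p.length() == 1) ? 3 : 2;
            if (word.startsWith(p) && word.length() - p.length() >= minRemainder) {
                return word.substring(p.length());
            }
        }
        return word;
    }

    public static void main(String[] args) {
        // lam + lam + kaf + teh + alef + beh ("for the book")
        String word = "\u0644\u0644\u0643\u062A\u0627\u0628";
        System.out.println(stripPrefix(word, LIGHT8).length());  // 6: unchanged
        System.out.println(stripPrefix(word, LIGHT10).length()); // 4: prefix removed
    }
}
```

With the light8-style list the word is left untouched, so it would not match a query for the bare noun. With the extra prefix both forms reduce to the same four-letter term.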

      1. LUCENE-1758.txt
        2 kB
        Robert Muir
      2. LUCENE-1758.patch
        7 kB
        Robert Muir
      3. LUCENE-1758.patch
        10 kB
        Robert Muir
      4. LUCENE-1758.patch
        11 kB
        Robert Muir

        Activity

        Shai Erera made changes -
        Component/s modules/analysis [ 12310230 ]
        Component/s contrib/analyzers [ 12312333 ]
        Mark Thomas made changes -
        Workflow Default workflow, editable Closed status [ 12563701 ] jira [ 12585260 ]
        Mark Thomas made changes -
        Workflow jira [ 12471413 ] Default workflow, editable Closed status [ 12563701 ]
        Mark Miller made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Robert Muir made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Robert Muir added a comment -

        committed revision 801348.

        Robert Muir made changes -
        Attachment LUCENE-1758.patch [ 12415547 ]
        Robert Muir added a comment -

        Added a note under "Changes in Runtime Behavior" warning any existing users of ArabicAnalyzer that they should reindex.

        Mark Miller added a comment -

        It's not released, but Arabic Analyzer has been around long enough that it's certainly in use by at least a couple of people. Let's be very explicit in the contrib changes (which I'm sure none of these users will read) about the lowercase compat break.

        Robert Muir added a comment -

        If there are no objections to this one, I would like to resolve it soon.

        Michael McCandless made changes -
        Assignee Michael McCandless [ mikemccand ] Robert Muir [ rcmuir ]
        Michael McCandless added a comment -

        Welcome aboard Robert!

        Robert Muir made changes -
        Attachment LUCENE-1758.patch [ 12414981 ]
        Robert Muir added a comment -

        Added LowerCaseFilter, and replaced the "TODO: more tests" placeholder with some tests.

        Michael McCandless added a comment -

        perhaps both this and LUCENE-1628 should include LowerCaseFilter.

        That seems reasonable?

        Robert Muir added a comment -

        I think it is probably ready. The only other easy improvement I can think of at the moment is that perhaps both this and LUCENE-1628 should include LowerCaseFilter.
        This has nothing to do with Arabic (which does not have case); it is just user-friendliness for any English content that is encountered.

        example from java-user: http://www.gossamer-threads.com/lists/lucene/java-user/75631#75631
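
To illustrate the point about mixed-language content: lowercasing leaves Arabic tokens unchanged, since the script has no case, but it lets differently cased English tokens match. The sketch below uses plain `String.toLowerCase` rather than Lucene's LowerCaseFilter, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class LowercaseSketch {
    // Lowercases every token; a stand-in for what a lowercasing filter does
    // to each term in a token stream.
    public static List<String> lowercaseTokens(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            // Locale.ROOT avoids locale-specific surprises (e.g. Turkish dotless i).
            out.add(t.toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // kaf + teh + alef + beh ("book"), as escapes to stay encoding-independent
        String arabic = "\u0643\u062A\u0627\u0628";
        List<String> result = lowercaseTokens(Arrays.asList("Lucene", arabic, "IBM"));
        System.out.println(result.get(0));                // lucene
        System.out.println(result.get(1).equals(arabic)); // true: Arabic is unaffected
        System.out.println(result.get(2));                // ibm
    }
}
```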

        Michael McCandless made changes -
        Assignee Michael McCandless [ mikemccand ]
        Michael McCandless made changes -
        Fix Version/s 2.9 [ 12312682 ]
        Michael McCandless added a comment -

        If it's ready to commit, then let's get it into 2.9?

        Robert Muir added a comment -

        I am curious, could this be considered for 2.9?

        Mostly because Arabic Analyzer is unreleased (so no back compat issues): I think the combination of لل + stopwords improvement will really help.

        More details are available at: http://www.mtholyoke.edu/~lballest/Pubs/arab_stem05.pdf

        Robert Muir made changes -
        Attachment LUCENE-1758.patch [ 12414568 ]
        Robert Muir added a comment -

        Also updated the stopwords list; it was in need of much improvement.

        Robert Muir made changes -
        Field Original Value New Value
        Attachment LUCENE-1758.txt [ 12414357 ]
        Robert Muir added a comment -

        patch to change from light8 to light10

        Robert Muir created issue -

          People

          • Assignee:
            Robert Muir
            Reporter:
            Robert Muir
          • Votes:
            0
            Watchers:
            0
