[LUCENE-1406] new Arabic Analyzer (Apache license) - ASF JIRA

Details

Type: New Feature
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 2.9
Component/s: modules/analysis
Labels: None

Lucene Fields: Patch Available

Description

I've noticed there is no Arabic analyzer for Lucene, most likely because Tim Buckwalter's morphological dictionary is GPL.

However, it is not necessary to have a full morphological analysis engine for quality Arabic search.
This implementation follows the light-8s algorithm presented in the following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf

As you can see from the paper, the improvement from this method over searching surface forms (as Lucene currently does) is significant, with almost 100% improvement in average precision.

While I personally don't think all the choices were the best, and some easy improvements are still possible, the major motivation for implementing it exactly the way it is presented in the paper is that the algorithm is TREC-tested, so the precision/recall improvements to Lucene are already documented.

For a stopword list, I used a list present at http://members.unine.ch/jacques.savoy/clef/index.html simply because the creator of this list documents the data as BSD-licensed.

This implementation (Analyzer) consists of the above-mentioned stopword list plus two filters:
ArabicNormalizationFilter: performs orthographic normalization (such as hamza seated on alif, alif maksura, teh marbuta, removal of harakat, tatweel, etc.)
ArabicStemFilter: performs Arabic light stemming
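
To make the normalization step concrete, here is a minimal standalone sketch of in-place orthographic normalization on a character buffer. It is not the code from the attached patch: the class name `ArabicNormalizeSketch`, the method name `normalize`, and the exact mapping choices (hamza-carrying alif forms to bare alif, alif maksura to yeh, teh marbuta to heh, and dropping tatweel plus the diacritics U+064B through U+0652) are assumptions chosen for illustration.

```java
// Hypothetical sketch; names and mapping choices are assumptions,
// not taken from the LUCENE-1406 patch.
final class ArabicNormalizeSketch {

    /**
     * Normalizes buf[0..len) in place and returns the new length.
     * Characters are only rewritten or dropped, so the write index
     * never overtakes the read index and no new array is needed.
     */
    static int normalize(char[] buf, int len) {
        int out = 0;
        for (int i = 0; i < len; i++) {
            char c = buf[i];
            switch (c) {
                case '\u0622': // alif with madda above
                case '\u0623': // alif with hamza above
                case '\u0625': // alif with hamza below
                    buf[out++] = '\u0627'; // bare alif
                    break;
                case '\u0649': // alif maksura
                    buf[out++] = '\u064A'; // yeh
                    break;
                case '\u0629': // teh marbuta
                    buf[out++] = '\u0647'; // heh
                    break;
                case '\u0640': // tatweel (elongation mark): drop it
                    break;
                default:
                    // Drop fathatan..sukun; this range is the assumed
                    // meaning of "harakat" in this sketch.
                    boolean isDiacritic = c >= '\u064B' && c <= '\u0652';
                    if (!isDiacritic) {
                        buf[out++] = c;
                    }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A vocalized name starting with alif-with-hamza-above.
        char[] word = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F".toCharArray();
        int n = normalize(word, word.length);
        System.out.println(new String(word, 0, n));
    }
}
```

The method returns the shortened length rather than a new string, so a caller holding a reusable buffer only has to update its length afterwards.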

Both filters operate directly on the term buffer for maximum performance. There is no object creation in this Analyzer.
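
As a rough illustration of light stemming done in place on a character buffer, the sketch below strips at most one prefix and then any matching suffixes, adjusting only a length counter. Everything here is an assumption for illustration: the class `LightStemSketch`, the small affix tables (a subset picked for the example, not the patch's or the paper's full lists), and the `MIN_STEM` threshold of two remaining characters.

```java
// Hypothetical sketch; affix tables and thresholds are assumptions,
// not the tables from the LUCENE-1406 patch or the cited paper.
final class LightStemSketch {

    // Prefixes, longest first so that the conjunction + article
    // combinations are tried before the bare article.
    static final char[][] PREFIXES = {
        "\u0648\u0627\u0644".toCharArray(), // waw + alif + lam
        "\u0628\u0627\u0644".toCharArray(), // beh + alif + lam
        "\u0627\u0644".toCharArray(),       // alif + lam (definite article)
    };

    static final char[][] SUFFIXES = {
        "\u0647\u0627".toCharArray(), // heh + alif
        "\u0627\u062A".toCharArray(), // alif + teh
        "\u064A\u0646".toCharArray(), // yeh + noon
        "\u0629".toCharArray(),       // teh marbuta
    };

    // Assumed minimum number of characters that must remain after a strip.
    static final int MIN_STEM = 2;

    /** Stems buf[0..len) in place and returns the new length. */
    static int stem(char[] buf, int len) {
        for (char[] p : PREFIXES) {
            if (startsWith(buf, len, p)) {
                // Shift the remainder left; arraycopy handles the overlap.
                System.arraycopy(buf, p.length, buf, 0, len - p.length);
                len -= p.length;
                break; // at most one prefix is removed
            }
        }
        for (char[] s : SUFFIXES) {
            if (endsWith(buf, len, s)) {
                len -= s.length; // dropping a suffix only shortens the length
            }
        }
        return len;
    }

    static boolean startsWith(char[] buf, int len, char[] p) {
        if (len - p.length < MIN_STEM) return false;
        for (int i = 0; i < p.length; i++) {
            if (buf[i] != p[i]) return false;
        }
        return true;
    }

    static boolean endsWith(char[] buf, int len, char[] s) {
        if (len - s.length < MIN_STEM) return false;
        int start = len - s.length;
        for (int i = 0; i < s.length; i++) {
            if (buf[start + i] != s[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        // "al" + four-letter noun; the article is stripped.
        char[] word = "\u0627\u0644\u0643\u062A\u0627\u0628".toCharArray();
        int n = stem(word, word.length);
        System.out.println(new String(word, 0, n));
    }
}
```

The affix tables are allocated once when the class loads; `stem` itself allocates nothing per call, which is the property the description above is after.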

There are no external dependencies. I've indexed about half a billion words of Arabic text and tested against that.

If there are any issues with this implementation I am willing to fix them. I use lucene on a daily basis and would like to give something back. Thanks.

Attachments

LUCENE-1406.patch (34 kB), uploaded 26/Sep/08 15:34 by Robert Muir

People

Assignee: Grant Ingersoll

Reporter: Robert Muir

Votes: 1

Watchers: 0

Dates

Created: 26/Sep/08 11:56

Updated: 28/Aug/22 11:53

Resolved: 21/Oct/08 14:59