Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.9
    • Component/s: modules/analysis
    • Labels: None
    • Lucene Fields: New, Patch Available

      Description

      A simple Persian analyzer.

      I measured TREC scores with the benchmark package against http://ece.ut.ac.ir/DBRG/Hamshahri/; the results are below:

      SimpleAnalyzer:
      SUMMARY
      Search Seconds: 0.012
      DocName Seconds: 0.020
      Num Points: 981.015
      Num Good Points: 33.738
      Max Good Points: 36.185
      Average Precision: 0.374
      MRR: 0.667
      Recall: 0.905
      Precision At 1: 0.585
      Precision At 2: 0.531
      Precision At 3: 0.513
      Precision At 4: 0.496
      Precision At 5: 0.486
      Precision At 6: 0.487
      Precision At 7: 0.479
      Precision At 8: 0.465
      Precision At 9: 0.458
      Precision At 10: 0.460
      Precision At 11: 0.453
      Precision At 12: 0.453
      Precision At 13: 0.445
      Precision At 14: 0.438
      Precision At 15: 0.438
      Precision At 16: 0.438
      Precision At 17: 0.429
      Precision At 18: 0.429
      Precision At 19: 0.419
      Precision At 20: 0.415

      PersianAnalyzer:
      SUMMARY
      Search Seconds: 0.004
      DocName Seconds: 0.011
      Num Points: 987.692
      Num Good Points: 36.123
      Max Good Points: 36.185
      Average Precision: 0.481
      MRR: 0.833
      Recall: 0.998
      Precision At 1: 0.754
      Precision At 2: 0.715
      Precision At 3: 0.646
      Precision At 4: 0.646
      Precision At 5: 0.631
      Precision At 6: 0.621
      Precision At 7: 0.593
      Precision At 8: 0.577
      Precision At 9: 0.573
      Precision At 10: 0.566
      Precision At 11: 0.572
      Precision At 12: 0.562
      Precision At 13: 0.554
      Precision At 14: 0.549
      Precision At 15: 0.542
      Precision At 16: 0.538
      Precision At 17: 0.533
      Precision At 18: 0.527
      Precision At 19: 0.525
      Precision At 20: 0.518
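For readers unfamiliar with the figures above, the following is a minimal sketch of how "Precision At k" and MRR are conventionally defined for a single query. It uses the textbook definitions, which may differ in detail from what the benchmark package computes, and the SUMMARY values are presumably averages over all queries. The class name, method names, and document ids are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative only: textbook ranking metrics, not the benchmark package's code. */
final class RankMetricsSketch {

    /** Fraction of the top k ranked documents that are relevant. */
    static double precisionAt(int k, List<String> ranked, Set<String> relevant) {
        int hits = 0;
        int limit = Math.min(k, ranked.size());
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        return (double) hits / k;
    }

    /** 1 / rank of the first relevant document, or 0 if none was retrieved. */
    static double reciprocalRank(List<String> ranked, Set<String> relevant) {
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                return 1.0 / (i + 1);
            }
        }
        return 0.0;
    }

    public static void main(String[] args) {
        // Hypothetical document ids for a single query.
        List<String> ranked = Arrays.asList("d1", "d2", "d3", "d4");
        Set<String> relevant = new HashSet<>(Arrays.asList("d2", "d4"));
        System.out.println("P@2 = " + precisionAt(2, ranked, relevant));
        System.out.println("RR  = " + reciprocalRank(ranked, relevant));
    }
}
```

Averaging `reciprocalRank` over every query gives MRR, and averaging `precisionAt(k, ...)` over every query gives one "Precision At k" row.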

      1. LUCENE-1628.txt
        28 kB
        Robert Muir
      2. LUCENE-1628.patch
        17 kB
        Robert Muir
      3. LUCENE-1628.patch
        19 kB
        Robert Muir
      4. LUCENE-1628.patch
        27 kB
        Mark Miller
      5. LUCENE-1628.patch
        28 kB
        Robert Muir
      6. LUCENE-1628.patch
        29 kB
        Robert Muir
      7. LUCENE-1628.patch
        31 kB
        Robert Muir

        Activity

        Robert Muir added a comment -

        Committed revision 802955.

        Robert Muir added a comment -

        implement reusableTokenStream here too.

        Robert Muir added a comment -

        I have been looking this over, and I think this one is ready. Any comments/concerns?

        Robert Muir added a comment -

        add LowerCaseFilter, consistent with the Arabic analyzer; it's user-friendly for the common case where there is also some English text.
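The reasoning in the comment above can be illustrated without any Lucene classes. The sketch below uses plain `String.toLowerCase(Locale.ROOT)` as a stand-in for a lowercasing filter (an assumption; it is not the patch's code) to show that lowercasing folds Latin-script terms while leaving Persian letters, which have no case, unchanged. The class and method names are hypothetical.

```java
import java.util.Locale;

/** Illustrative only: lowercasing mixed Persian/English terms with the JDK. */
final class MixedScriptLowercaseSketch {

    /** Stand-in for a lowercasing token filter; assumes locale-independent folding. */
    static String normalizeCase(String term) {
        return term.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        // A Persian word spelled with Unicode escapes so the source stays ASCII.
        String persian = "\u06A9\u062A\u0627\u0628";
        // English term is folded to lower case.
        System.out.println(normalizeCase("Lucene").equals("lucene"));
        // Persian term passes through unchanged because its letters are caseless.
        System.out.println(normalizeCase(persian).equals(persian));
    }
}
```

With this behavior, a query for "lucene" can match "Lucene" in a mostly Persian document, while Persian terms are unaffected.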

        Michael McCandless added a comment -

        Robert/Mark is this one ready to be committed?

        Robert Muir added a comment -

        analyzers/fa -> analyzers/common/fa
        make PersianNormalizationFilter final
        switch to new API.

        Robert Muir added a comment -

        Mark, I think we should figure out a plan with the order of this, LUCENE-1460, and LUCENE-1728...
        I'm not really sure what the best order would be, I think just as long as we coordinate it won't be difficult.

        If we move things around in 1728 it might make life difficult; I'm trying to avoid that.

        Robert Muir added a comment -

        mark, I will be more careful in the future!

        problem is some habit of mine (ctrl-I versus a real format)... also it's a tad difficult to spot real warnings (not a real excuse) because of the deprecated token API... hopefully we can fix that soon!

        thanks for cleaning it up.

        Mark Miller added a comment -

        Robert wrote: "mark, i'm sorry you had to reformat it."

        No worries - I certainly didn't have to. I just ran it because I recently re-added it to Eclipse today. Certainly wasn't necessary, and perhaps there is more than one of these files floating around out there with a slight difference?

        No big deal at all, just wanted to mention the change - I wouldn't have even made the patch other than to remove the imports and they are not a big deal either. There are a bunch in Lucene right now. And there is some crazy, wacky formatting as well. It's easy to be anal about the small stuff when someone else has done all the work on the big stuff.

        Robert Muir added a comment -

        mark, i'm sorry you had to reformat it.

        I am using the Lucene formatter file; apparently something slipped through, along with the unused imports... ugh.

        Mark Miller added a comment -

        Thanks a lot Robert, looks great!

        Here is a quick tiny update that's been formatted with the Lucene/Solr Eclipse formatter file and with a few unused imports removed.

        I'd rather just wait for some finalization with the new token api, but I'll defer to Mike on whether we wait to commit or not.

        Robert Muir added a comment -

        add additional tests, showing behavior of this analyzer as a whole.

        Robert Muir added a comment -

        Mark, no problem.

        I will upload a new patch showing some behavior of the analyzer as a whole...

        Mark Miller added a comment -

        Should we add a couple of tests, Robert?

        +public class TestPersianAnalyzer extends TestCase {
        +
        +  /** This test fails with NPE when the
        +   * stopwords file is missing in classpath */
        +  public void testResourcesAvailable() {
        +    new PersianAnalyzer();
        +  }
        +
        +  /* TODO: more tests */
        +
        +}
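The failure mode named in the test comment above (an NPE when the stopwords file is missing from the classpath) typically arises because `Class.getResourceAsStream` returns `null` for a missing resource. The sketch below is not the patch's loader; it is a hypothetical, self-contained helper showing one way to load a UTF-8 word list from the classpath and fail with a descriptive message instead. The class name, method name, and the `#` comment convention are assumptions made for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

/** Illustrative only: classpath word-list loading with an explicit missing-file check. */
final class StopwordResourceSketch {

    /** Reads one word per line; blank lines and lines starting with '#' are skipped (assumed format). */
    static Set<String> loadStopwords(Class<?> anchor, String resourceName) throws IOException {
        InputStream in = anchor.getResourceAsStream(resourceName);
        if (in == null) {
            // getResourceAsStream returns null rather than throwing when the resource is absent.
            throw new IOException("stopword resource not found on classpath: " + resourceName);
        }
        Set<String> words = new HashSet<>();
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty() && !word.startsWith("#")) {
                    words.add(word);
                }
            }
        }
        return words;
    }
}
```

A test like `testResourcesAvailable` would then fail with a message naming the missing file rather than a bare `NullPointerException`.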

        Michael McCandless added a comment -

        I think we should go ahead and commit this and cutover to the new API as a separate step?

        Mark Miller added a comment -

        Okay, fair enough. I figured you'd know better than me, just wanted to check. Certainly if we have other code that way, no reason to change it here. And of course it makes sense that you would still run into issues with the comments - garbage at best.

        I only ever use apply to/from clipboard, so I have luckily never seen that issue.

        We should be good to put this in then - I'll wait till we get squared away with the new token api patch then commit.

        Robert Muir added a comment -

        mark: thanks for the followup on the licenses!

        wrt non-English text, I will say that if you set encoding to UTF-8 (such as in Eclipse under project>properties>text encoding) then things are fine.
        the Ant build also does the right thing, and there are definitely other analyzers that behave like this too, and will break if things aren't set right.

        also, if you do not set encoding to UTF-8, most editors (such as Eclipse) will not be able to save the file, and will error out with encoding issues... even if the text is inside a comment!

        not really (ok a little) trying to talk you out of this, but I'm just not sure it would really help anything...

        that being said... (my) Eclipse still jacks up if you team->apply patch from file. if you open the patch in Notepad, ctrl-a,ctrl-c, and then team->apply patch from clipboard, it works fine... very annoying!

        Mark Miller added a comment -

        Looks pretty good. Not sure if we should update to the new token api here or just commit and hit it with the other issue. I guess we might as well get it here first.

        Is it better to put the raw text in there like that (in the tests) or do you think it would be better to use utf8 codes with maybe the raw text in a comment? I'm just remembering running into issues with such things in a past life as I moved around source code.
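As a rough illustration of the alternative raised in the question above, Java source can spell non-ASCII characters as `\uXXXX` Unicode escapes, which keeps the file pure ASCII and therefore independent of the editor's or compiler's encoding setting. This is only a sketch of that option, not what the patch does; the class and constant names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Illustrative only: a Persian letter written as a Unicode escape in ASCII-only source. */
final class UnicodeEscapeSketch {

    // ARABIC LETTER KEHEH (U+06A9), written as an escape so this file contains no raw Persian text.
    static final String KEHEH = "\u06A9";

    public static void main(String[] args) {
        // One UTF-16 code unit in memory.
        System.out.println(KEHEH.length());
        // The code point the escape denotes, in hex.
        System.out.println(Integer.toHexString(KEHEH.codePointAt(0)));
        // Two bytes once encoded as UTF-8.
        System.out.println(KEHEH.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

The trade-off is readability: escapes survive any encoding mismatch, while raw text is easier to review but requires the toolchain to be set to UTF-8.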

        Mark Miller added a comment -

        Robert wrote: "mark, on the same topic: if possible, at some time it would be great to know which licenses are OK, and which ones are not."

        Found it.

        No Problem:

        • Apache License 2.0
        • ASL 1.1
        • BSD
        • MIT/X11
        • NCSA
        • W3C Software license
        • X.Net
        • zlib/libpng

        with some hassle:

        • CDDL 1.0
        • CPL 1.0
        • EPL 1.0
        • IPL 1.0
        • MPL 1.0 and MPL 1.1
        • SPL 1.0

        http://www.apache.org/legal/3party.html

        Robert Muir added a comment -

        mark, on the same topic: if possible, at some time it would be great to know which licenses are OK, and which ones are not.

        Mark Miller added a comment -

        Okay, I see that the stopword list for Arabic was committed by Grant with the BSD license. I'll take that as an "it's okay" unless anyone speaks up.

        Thanks for all these great Analyzers, Robert.

        Mark Miller added a comment -

        Thanks Robert, looks cool.

        Anyone know what the policy on the stop word list being BSD license is? I assume it's compatible with Apache? What's our BSD license policy? I don't see anything definitive on a quick mailing list search.

        - Mark
        Robert Muir added a comment -

        Farsi stopwords file moved to resources folder, plus a test to ensure it loads.

        Robert Muir added a comment -

        patch file


          People

          • Assignee: Robert Muir
          • Reporter: Robert Muir
          • Votes: 0
          • Watchers: 0
