Details

    • Lucene Fields:
      New, Patch Available

      Description

      Adds analysis for Irish.

      The stemmer is generated from a snowball stemmer. I've sent it to Martin Porter, who says it will be added during the week.

      1. irish.sbl
        2 kB
        Jim Regan
      2. LUCENE-3883.patch
        40 kB
        Jim Regan
      3. LUCENE-3883.patch
        42 kB
        Robert Muir
      4. LUCENE-3883.patch
        50 kB
        Robert Muir

        Activity

        Jim Regan added a comment -

        Patch adding Irish analysis

        Jim Regan added a comment -

        Patch, redone from top level of svn.

        Robert Muir added a comment -

        Thanks Jim! This looks really nicely done...

        Out of curiosity, could you share your snowball rules (the .sbl) with us?

        Uwe Schindler added a comment -

        Hi,

        very funny lowercase filter! One thing: it does not actually throw ArrayIndexOutOfBoundsException in the filter because of how CharTermAttributeImpl is implemented internally, but theoretically there is a length check missing. The nUpper/tUpper stuff can get out of bounds if the length of the term is 0 or 1 (which are valid lengths). But that's only a minor complaint about the code. Otherwise looks great. Just appearing from no Irish support at all! Really needed!

        Uwe
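As a rough illustration of the length check discussed above, here is a minimal sketch in plain Java. It is not the patch's code: the class name `IrishLowerCaseSketch` and the String-based `lower` helper are assumptions made for this example, whereas the real filter works on the CharTermAttribute buffer. The second character is only inspected when the term has more than one character, so terms of length 0 or 1 fall through to ordinary lowercasing.

```java
import java.util.Locale;

// Hypothetical helper, for illustration only; not part of the patch.
final class IrishLowerCaseSketch {

  // Upper-case vowels, including the accented forms used in Irish.
  private static boolean isUpperVowel(char c) {
    return "AEIOU\u00C1\u00C9\u00CD\u00D3\u00DA".indexOf(c) >= 0;
  }

  // Lowercases a term, turning nAthair/tAthair into n-athair/t-athair.
  static String lower(String term) {
    // Length guard first: never touch index 1 of an empty or one-char term.
    if (term.length() > 1
        && (term.charAt(0) == 'n' || term.charAt(0) == 't')
        && isUpperVowel(term.charAt(1))) {
      return term.charAt(0) + "-" + term.substring(1).toLowerCase(Locale.ROOT);
    }
    return term.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(lower("nAthair"));
    System.out.println(lower("n"));
  }
}
```

Putting the length test first in the condition relies on short-circuit evaluation, so the guard costs nothing for longer terms.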

        Robert Muir added a comment -

        By the way I created LUCENE-3884 to move the ElisionFilter out of the french package
        into a more general .util package. That doesn't need to hold up this issue: it just
        reminded me we should move it because it's not really French-specific.

        Jim Regan added a comment -

        Irish snowball script

        Jim Regan added a comment -

        Yeah, it's quite an odd thing (Scots Gaelic has a similar phenomenon, but they consistently keep the hyphen), but it does help with the stemmer in those cases to know that the t or n at the start of the word is due only to mutation.

        Jim Regan added a comment -

        I'm not sure if I actually needed to use the ElisionFilter, because the stemmer handles those - because of the initial mutation in Irish, trimming the start of the word is more important than trimming the end. I was copying the Catalan analyser, and using ElisionFilter seemed like The Thing To Do.

        Jim Regan added a comment -

        New version of patch, also checking that chLen (array length) > 1

        Robert Muir added a comment -

        Thanks for updating the patch Jim!

        one concern doing some very very rudimentary testing:

        we have special lowercasing for situations like nAthair -> n-athair,

        which the snowball rules then strip:

        define initial_morph as (
          [substring] among (
            'h-' 'n-' 't-' //nAthair -> n-athair, but alone are problematic
            (delete)
        

        The problem is if the input initially comes as n-athair, Unicode break rules
        will split this up on the hyphen into two tokens {n, athair}.
        You can visualize this at http://unicode.org/cldr/utility/breaks.jsp

        This means we can add many spurious 'n' tokens in the index...

        So we have two potential solutions to this:

        1. we can simply add 'n', 'h', 't', etc to the stopwords list. This is the simplest solution. Would this be too aggressive?
        2. we can add a CharFilter for IrishAnalyzer to prevent this splitting from happening. This is more complex.
        Robert Muir added a comment -

        Hmm another downside of #1 is that with a simple stopfilter approach, position increments won't line up
        if we have a phrase query of "n-athair" with indexed nAthair.

        So I start to lean towards #2 since it would be a better solution... but I'm going to think about it
        and see if I come up with any other ideas.

        Separately, what about h- when succeeded by a vowel? Is there actually usually a hyphen here?
        (Wikipedia says no, playing around with GaelSpell seems to agree, but I don't know anything about this language!)
        Would this case be too aggressive to handle?

        Robert Muir added a comment -

        To make matters worse: this exact example of splitting on hyphen for this Irish case is
        actually mentioned on http://en.wikipedia.org/wiki/Hyphen#In_computing

        From there it seems like the right thing to do is heuristically convert to
        U+2011 (non-breaking hyphen), but this only affects Unicode line-break rules,
        not word-break rules.

        So it seems like the least hackish workaround would be for a charfilter to
        convert n-athair -> nAthair (to prevent the tokenizer from splitting it up),
        since the IrishLowerCaseFilter will convert it back and stem it anyway.

        I'll see if I can hack something up.
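The charfilter idea floated in this comment could look roughly like the sketch below. It is plain Java over a String, not a Lucene CharFilter, and the class name `HyphenMutationRewriteSketch`, the regular expression, and the restriction to n and t are all assumptions made for illustration. A real CharFilter would also have to correct offsets, since dropping the hyphen shortens the text.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical pre-tokenization rewrite, for illustration only.
final class HyphenMutationRewriteSketch {

  // n- or t- at the start of a word, followed by a lower-case vowel.
  // The lookbehind rejects matches in the middle of a word (e.g. sean-nos).
  private static final Pattern HYPHENATED = Pattern.compile(
      "(?<![\\p{L}\\p{N}])([nt])-([aeiou\u00E1\u00E9\u00ED\u00F3\u00FA])");

  // Rewrites n-athair to nAthair so a tokenizer will not split on the hyphen.
  static String rewrite(String text) {
    Matcher m = HYPHENATED.matcher(text);
    StringBuffer sb = new StringBuffer();
    while (m.find()) {
      String replacement = m.group(1) + m.group(2).toUpperCase(Locale.ROOT);
      m.appendReplacement(sb, Matcher.quoteReplacement(replacement));
    }
    m.appendTail(sb);
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(rewrite("an t-athair"));
  }
}
```

The rewritten form is the one the special lowercasing already understands, so the later stages would see the same input as for text that was written without the hyphen.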

        Robert Muir added a comment -

        updated patch, with a simple solution to the hyphen-phrasequery-problem:

        I added a special stopset just for these:

          /**
           * When StandardTokenizer splits t‑athair into {t, athair}, we don't
           * want to cause a position increment, otherwise there will be problems
           * with phrase queries versus tAthair (which would not have a gap).
           */
          private static final CharArraySet HYPHENATIONS = CharArraySet.unmodifiableSet(
              new CharArraySet(Version.LUCENE_CURRENT,
                  Arrays.asList(
                      "h", "n", "t"
                  ), true));
        

        This is used with enablePositionIncrements=false to ensure no gap is added... I also added a simple test for this.
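To see why the position increments matter, here is a small simulation in plain Java. It does not use Lucene; the class `HyphenStopSketch` and its `filter` method are hypothetical stand-ins for a stop filter, and positions are simply counted from 1. With increments enabled, the dropped "t" leaves a gap before "athair"; with them disabled, "athair" lands at the same position it would have had if the input were tAthair.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical simulation of stop-word removal, for illustration only.
final class HyphenStopSketch {

  static final Set<String> HYPHENATIONS = new HashSet<>(Arrays.asList("h", "n", "t"));

  // Returns "term@position" for every token that survives the stop set.
  static List<String> filter(List<String> tokens, boolean enablePositionIncrements) {
    List<String> out = new ArrayList<>();
    int position = 0;
    int skipped = 0;
    for (String token : tokens) {
      if (HYPHENATIONS.contains(token)) {
        skipped++;          // remember how many tokens were dropped
        continue;
      }
      // Only account for the dropped tokens when increments are enabled.
      position += 1 + (enablePositionIncrements ? skipped : 0);
      skipped = 0;
      out.add(token + "@" + position);
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> split = Arrays.asList("an", "t", "athair");
    System.out.println(filter(split, true));
    System.out.println(filter(split, false));
  }
}
```

A phrase query built from the unsplit form expects "an" and "athair" at adjacent positions, which is only the case in the second call.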

        Jim Regan added a comment -

        Wow! Thanks Robert!

        There isn't usually a hyphen with 'h' before a vowel, but I've started to see it recently – there are no native Irish words beginning with 'h', so it used to be relatively unambiguous that a 'h' was a mutation, but with an increase of scientific literature in Irish, there are more Greek and Latin loan words being added which do begin with 'h', so it's no longer clear.

        Robert Muir added a comment -

        Thanks Jim. Personally I think this patch is ready to be committed.

        I'm just going to wait a bit in case you get any feedback from Martin or other snowball developers,
        but I won't wait too long.

        Jim Regan added a comment -

        Great

        Regarding the initial 'h', I asked Kevin Scannell (among other feathers in his cap, he created the dictionary used in GaelSpell, and ran an Irish-language search engine), who said:
        "I looked carefully at how often initial h is a prefix vs not a while ago. I can send you those data - non-prefixes might be more common than you'd think in running text bc of proper names, English mixed in, etc. So upshot is it's a bad idea to strip all initial h's with no hyphen following.
        As far as h- (with hyphen) goes, it's non-standard but common enough that I'd leave it in the stemmer. Not like there would be false positives in that case if the hyphen is there."
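The conservative rule described in the quote can be sketched as follows. This is a plain Java illustration rather than the Snowball rules themselves, and the class name `HyphenPrefixSketch` and its method are made up for the example: a leading h, n or t is removed only when a hyphen follows it, so words that merely begin with h are left alone.

```java
// Hypothetical helper, for illustration only; the real logic lives in irish.sbl.
final class HyphenPrefixSketch {

  // Strips h-, n- or t- from the start of a term; anything else is unchanged.
  static String stripHyphenatedPrefix(String term) {
    boolean hyphenated = term.length() > 2
        && term.charAt(1) == '-'
        && "hnt".indexOf(term.charAt(0)) >= 0;
    return hyphenated ? term.substring(2) : term;
  }

  public static void main(String[] args) {
    System.out.println(stripHyphenatedPrefix("h-athair"));
    System.out.println(stripHyphenatedPrefix("hata"));
  }
}
```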

        Robert Muir added a comment -

        This makes sense to me, I agree with the conservative approach here!

        David Smiley added a comment -

        How ironic this issue is created nearly on St. Patrick's Day.

        Jim Regan added a comment -

        It was on my mind, a little. I made the stemmer on the 15th, and on the 17th I made ICU transliteration rules for Irish->IPA, but that's not quite relevant here.

        Robert Muir added a comment -

        Same patch but with the solr pieces too (factory/test for the lowercasefilter, text_ga fieldtype, resources synced, etc).

        Robert Muir added a comment -

        Thank you very much Jim! I just committed this.

        Jim Regan added a comment -

        Yay! Thanks for all your help!

        Jim Regan added a comment -

        Just to follow up, the Irish stemmer is now available from the Snowball site: http://snowball.tartarus.org/otherapps/oregan/intro.html

        Robert Muir added a comment -

        Thanks Jim! I already removed our local copy of the irish.sbl as it's now available on
        the snowball site.

        I have to investigate the Czech implementation, I think we should make it available
        as well, since it also supports stemming of derivational endings: Dawid opened
        LUCENE-4042 for that.

        Thanks for contributing these to snowball.

        Jim Regan added a comment -

        I wouldn't recommend the aggressive mode, and I regret that I left it uncommented. If you really think an alternative would be welcome, it would be quite easy to get the best of both (in fact, I spent roughly half the time on that trying to beat Snowball into overstemming to match the original).


          People

          • Assignee:
            Robert Muir
            Reporter:
            Jim Regan
          • Votes:
            0
            Watchers:
            2
