Thanks for updating the patch, Jim!
One concern from some very rudimentary testing:
We have special lowercasing for situations like nAthair -> n-athair,
and the Snowball rules then strip that prefix:
define initial_morph as (
[substring] among (
'h-' 'n-' 't-' //nAthair -> n-athair, but alone are problematic
The problem is that if the input already arrives as n-athair, the Unicode break rules
will split it on the hyphen into two tokens ('n' and 'athair').
You can visualize this at http://unicode.org/cldr/utility/breaks.jsp
This means we could end up adding many spurious 'n' tokens to the index...
So we have two potential solutions to this:
- we can simply add 'n', 'h', 't', etc. to the stopwords list. This is the simplest solution. Would this be too aggressive?
- we can add a CharFilter for IrishAnalyzer to prevent this splitting from happening. This is more complex.
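To make the two options concrete, here is a rough, Lucene-free sketch in plain Java. Everything in it is hypothetical: the class name IrishHyphenSketch, the naiveTokens() helper (a crude stand-in for the real Unicode break rules, it just splits on non-letters), and the choice to rewrite n-athair back to nAthair so the existing special lowercasing can handle it after tokenization. A real CharFilter would also need to correct offsets, since the rewrite drops one character.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of the patch and not Lucene API.
class IrishHyphenSketch {

    // Crude stand-in for the tokenizer: split on anything that is not a letter.
    // This reproduces the split on the hyphen described above.
    static List<String> naiveTokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.split("[^\\p{L}]+")) {
            if (!t.isEmpty()) {
                out.add(t);
            }
        }
        return out;
    }

    // Option 1: treat the stray prefix letters as stopwords.
    static final Set<String> EXTRA_STOPWORDS = Set.of("h", "n", "t");

    static List<String> dropStopwords(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (!EXTRA_STOPWORDS.contains(t)) {
                out.add(t);
            }
        }
        return out;
    }

    // Option 2: before tokenizing, rewrite a word-initial h-/n-/t- followed by
    // a vowel into the nAthair form (assumed normalization, one of several possible).
    static final Pattern INITIAL_MORPH =
        Pattern.compile("(?<![\\p{L}-])([hnt])-([aeiouáéíóú])");

    static String joinInitialMorph(String text) {
        return INITIAL_MORPH.matcher(text).replaceAll(
            m -> m.group(1) + m.group(2).toUpperCase(Locale.ROOT));
    }
}
```

With option 1 the 'n' token is dropped after the split, but so is any legitimate single-letter 'n', 'h' or 't' token, which is the "too aggressive" worry. With option 2 the input never splits, at the cost of an extra component in the analysis chain.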