Details
Type: Improvement
Status: Resolved
Priority: Major
Resolution: Won't Fix
Lucene Fields: New
Description
// Since I am a .NET programmer, the sample code is in C#, but I don't think that will make it hard to understand.
Assume an input text like "İ" and an analyzer like the one below:
public class SomeAnalyzer : Analyzer
{
    public override TokenStream TokenStream(string fieldName, System.IO.TextReader reader)
    {
        TokenStream t = new SomeTokenizer(reader);
        t = new Lucene.Net.Analysis.ASCIIFoldingFilter(t);
        t = new LowerCaseFilter(t);
        return t;
    }
}
ASCIIFoldingFilter will return "I", and LowerCaseFilter will then return
"i" if the locale is "en-US",
or
"ı" if the locale is "tr-TR" (which means this token would have to be fed into another instance of ASCIIFoldingFilter).
So calling LowerCaseFilter before ASCIIFoldingFilter would be one solution, but a better approach would be to add
a new constructor to LowerCaseFilter that forces it to use a specific locale.
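The locale dependence described above can be reproduced without any Lucene classes. The following is a minimal sketch in Java rather than C#, using only JDK calls; the class name LocaleLowerCaseDemo and the helper lower are made up for illustration, and the string "I" stands in for what ASCIIFoldingFilter is reported to emit for "İ".

```java
import java.util.Locale;

// Illustration only: shows that lowercasing "I" depends on the locale in use.
public class LocaleLowerCaseDemo {

    // Lowercases text under an explicitly chosen locale instead of the JVM default.
    static String lower(String text, Locale locale) {
        return text.toLowerCase(locale);
    }

    public static void main(String[] args) {
        Locale english = Locale.forLanguageTag("en-US");
        Locale turkish = Locale.forLanguageTag("tr-TR");

        // Stand-in for the token produced by ASCII folding of "İ".
        String folded = "I";

        System.out.println(lower(folded, english)); // ASCII "i"
        System.out.println(lower(folded, turkish)); // dotless "ı" (U+0131)
    }
}
```

Under the Turkish locale the result is no longer ASCII, which is why the order of the two filters, or the locale used for lowercasing, matters.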
public sealed class LowerCaseFilter : TokenFilter
{
    /* +++ */ System.Globalization.CultureInfo CultureInfo = System.Globalization.CultureInfo.CurrentCulture;

    public LowerCaseFilter(TokenStream input) : base(input)
    {
    }

    /* +++ */ public LowerCaseFilter(TokenStream input, System.Globalization.CultureInfo CultureInfo) : base(input)
    /* +++ */ {
    /* +++ */     this.CultureInfo = CultureInfo;
    /* +++ */ }

    public override Token Next(Token result)
    {
        result = Input.Next(result);
        if (result != null)
        {
            char[] buffer = result.TermBuffer();
            int length = result.termLength;
            for (int i = 0; i < length; i++)
                /* +++ */ buffer[i] = System.Char.ToLower(buffer[i], CultureInfo);
            return result;
        }
        else
            return null;
    }
}
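For readers on the Java side, the same idea of capturing a locale once in the constructor might look roughly like the sketch below. FixedLocaleLowerCaser is a hypothetical helper, not a Lucene class, and it is not the filter itself. It returns a String instead of editing the buffer in place, because locale-aware lowercasing in Java goes through String.toLowerCase(Locale) and can change the length of the text.

```java
import java.util.Locale;

// Hypothetical helper (not part of Lucene): captures a locale once, mirroring
// the proposed extra constructor, instead of reading the default on every call.
public class FixedLocaleLowerCaser {
    private final Locale locale;

    // Existing behavior: fall back to whatever locale the JVM is running under.
    public FixedLocaleLowerCaser() {
        this(Locale.getDefault());
    }

    // Proposed behavior: the caller pins the locale explicitly.
    public FixedLocaleLowerCaser(Locale locale) {
        this.locale = locale;
    }

    // Lowercases the first `length` chars of a term buffer under the pinned locale.
    public String lowerCase(char[] buffer, int length) {
        return new String(buffer, 0, length).toLowerCase(locale);
    }

    public static void main(String[] args) {
        char[] term = {'I', 'D'};
        FixedLocaleLowerCaser root = new FixedLocaleLowerCaser(Locale.ROOT);
        FixedLocaleLowerCaser turkish = new FixedLocaleLowerCaser(Locale.forLanguageTag("tr-TR"));
        System.out.println(root.lowerCase(term, term.length));    // "id"
        System.out.println(turkish.lowerCase(term, term.length)); // "ıd" with dotless ı
    }
}
```

With the locale pinned at construction time, the output no longer depends on the machine's default locale, which is the point of the proposed constructor.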
DIGY