Solr / SOLR-822

CharFilter - normalize characters before tokenizer

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: Schema and Analysis
    • Labels:
      None

      Description

      A new plugin which can be placed in front of <tokenizer/>.

      <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
        <analyzer>
          <charFilter class="solr.MappingCharFilterFactory" mapping="mapping_ja.txt" />
          <tokenizer class="solr.MappingCJKTokenizerFactory"/>
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
          <filter class="solr.LowerCaseFilterFactory"/>
        </analyzer>
      </fieldType>
      

      Multiple <charFilter/>s can be chained. I'll post a JPEG file showing a character normalization sample soon.

      MOTIVATION:
      In Japan, there are two types of tokenizers: N-gram (CJKTokenizer) and morphological analyzers.
      When we use a morphological analyzer, because the analyzer uses a Japanese dictionary to detect terms,
      we need to normalize characters before tokenization.
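
      For context, each line of a mapping file rewrites a source character sequence to a target sequence. The entries below are illustrative only (written in the style of the mapping files attached to this issue; the authoritative syntax is defined by sample_mapping_ja.txt):

      ```
      # illustrative entries: fullwidth to halfwidth Latin
      "Ａ" => "A"
      "！" => "!"
      ```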

      I'll post a patch soon, too.

      1. character-normalization.JPG
        30 kB
        Koji Sekiguchi
      2. japanese-h-to-k-mapping.txt
        3 kB
        Mark Bennett
      3. sample_mapping_ja.txt
        2 kB
        Koji Sekiguchi
      4. sample_mapping_ja.txt
        1 kB
        Koji Sekiguchi
      5. SOLR-822.patch
        60 kB
        Koji Sekiguchi
      6. SOLR-822.patch
        57 kB
        Koji Sekiguchi
      7. SOLR-822.patch
        48 kB
        Koji Sekiguchi
      8. SOLR-822.patch
        52 kB
        Koji Sekiguchi
      9. SOLR-822.patch
        52 kB
        Koji Sekiguchi
      10. SOLR-822-for-1.3.patch
        60 kB
        Koji Sekiguchi
      11. SOLR-822-renameMethod.patch
        7 kB
        Koji Sekiguchi

        Activity

        Koji Sekiguchi added a comment -

        Forgive me if I've gotten something wrong with the German and Chinese languages.

        Koji Sekiguchi added a comment -

        Patch attached. It includes MappingCharFilter and its factory as a sample charFilter.

        Known bug:
        analysis.jsp is not yet supported in this patch. This can be fixed.

        Koji Sekiguchi added a comment -

        Known bug:
        analysis.jsp is not yet supported in this patch. This can be fixed.

        Now the patch fixes the bug.

        Koji Sekiguchi added a comment -

        The sample_mapping_ja.txt file is attached. To use it, open schema.xml in an editor and define the textCharNorm fieldType and text_cjk field as follows:

        <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
          <analyzer>
            <charFilter class="solr.MappingCharFilterFactory" mapping="sample_mapping_ja.txt"/>
            <tokenizer class="solr.MappingCJKTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
        </fieldType>
        
        <field name="text_cjk" type="textCharNorm" indexed="true" stored="true"/>
        

        Then start Solr and access analysis.jsp.

        Todd Feak added a comment -

        Seems like a very flexible way to solve the issue, as well as SOLR-814 and SOLR-815. It should also work for existing filters like LowerCase. It has the potential to be faster than the filters, as it doesn't have to perform the same replacement multiple times when a particular character is replicated into multiple tokens, as in NGramTokenizer or CJKTokenizer.

        I didn't look in depth at the patch (it's a good-sized patch to look through), but I wanted to verify at least two things. First, I assume that this only affects indexing and searching, not the actual document field contents? Second, is it easy to create a MappingCharFilter subclass with a hardcoded map built in? I don't think users should all have to recreate the same mapping files if we can just embed them.

        However, what about Lucene? Is this something that should exist in Lucene first, then be expanded to Solr? Are Lucene users in need of a similar functionality?

        Todd Feak added a comment -

        Oh, and another thought. Can it support characters written as "\uff01" format in the mapping file?
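
        For illustration, a mapping-file loader could decode that notation with a small helper like the sketch below (hypothetical code, not the patch's actual parser):

        ```java
        public class UnicodeEscapeDemo {
            // Decode occurrences of \uNNNN (written as 6 literal characters) into the char they name
            static String decode(String s) {
                StringBuilder sb = new StringBuilder();
                int i = 0;
                while (i < s.length()) {
                    if (s.charAt(i) == '\\' && i + 5 < s.length() && s.charAt(i + 1) == 'u') {
                        sb.append((char) Integer.parseInt(s.substring(i + 2, i + 6), 16));
                        i += 6;
                    } else {
                        sb.append(s.charAt(i++));
                    }
                }
                return sb.toString();
            }

            public static void main(String[] args) {
                // The 6-character sequence \uff01 becomes the fullwidth exclamation mark
                System.out.println(decode("\\uff01").equals("\uff01")); // true
            }
        }
        ```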

        Walter Underwood added a comment -

        Yes, it should be in Lucene. Like this: http://webui.sourcelabs.com/lucene/issues/1343

        There are (at least) four kinds of character mapping:

        Unicode normalization from decomposed to composed forms (always safe).

        Unicode normalization from compatibility forms to standard forms (may change the look, like fullwidth to halfwidth Latin).

        Language-specific normalization, like "oe" to "ö" (German-only).

        Mappings that improve search but are linguistically dodgy, like stripping accents and mapping katakana to hiragana.

        wunder
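
        The first two kinds correspond to the NFC and NFKC forms in java.text.Normalizer; a small illustration (not part of the patch):

        ```java
        import java.text.Normalizer;

        public class NormalizeKindsDemo {
            public static void main(String[] args) {
                // Kind 1: decomposed "e" + combining acute accent -> composed "é" (NFC, always safe)
                String nfc = Normalizer.normalize("e\u0301", Normalizer.Form.NFC);
                System.out.println(nfc.equals("\u00e9")); // true

                // Kind 2: compatibility form, fullwidth "Ａ" -> "A" (NFKC, may change the look)
                String nfkc = Normalizer.normalize("\uFF21", Normalizer.Form.NFKC);
                System.out.println(nfkc.equals("A")); // true
            }
        }
        ```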

        Koji Sekiguchi added a comment -

        First, I assume that this only affects indexing and searching, not the actual document field contents?

        Right. This only affects indexing and searching.

        Second, is it easy to create a MappingCharFilter subclass with a hardcoded map built in?

        I didn't expect such a use case, but it is a must-have.

        Can it support characters written as "\uff01" format in the mapping file?

        The patch doesn't support this format yet, but that is a must-have, too.

        Hoss Man added a comment -

        Koji:

        1) the patch is a little hard to read ... there seems to be a ton of unrelated whitespace changes (some in files that don't seem like they need to be modified for this functionality at all)

        2) the motivation of adding a new type of plugin that has direct access to the "stream of characters" in the Reader before the tokenizer gets access to it seems like a great idea, but i'm a little unclear as to the specific reason for some of the new apis: CharReader, CharFilter, CharStream. What value do these add beyond something like...

        public abstract class ReaderWrapperFactory {
          public void init(Map<String,String> args) { ... }
          public Map<String,String> getArgs() { ... }
          public Reader create(Reader input) { 
             return input;
          }
        }
        

        ?

        Koji Sekiguchi added a comment -

        Hoss,

        Sorry about the unrelated whitespace changes in the patch. I'll remove them in the next patch.

        About CharStream, CharReader and CharFilter classes, I created CharFilterFactory:

        public interface CharFilterFactory {
          public void init(Map<String,String> args);
          public Map<String,String> getArgs();
          public CharStream create(CharStream input);
        }
        

        instead of the ReaderWrapperFactory mentioned by Hoss. CharFilterFactory is a factory for CharFilter, which reads a CharStream and outputs a CharStream. CharStream is a Reader but has a correctPosition() method:

        public abstract class CharStream extends Reader {
          public abstract int correctPosition( int currentPos );
        }
        

        The method will be called by CharFilters and the Tokenizer (in this case, the Tokenizer should be CharStream "aware") to correct the start/end offsets of tokens, because CharFilters may convert 1 char to 2 chars or the other way around. The following is a sample implementation of the method:

        MappingCharFilter.java
        private List<PosCorrectMap> pcmList;
        
        public int correctPosition( int currentPos ){
          currentPos = input.correctPosition( currentPos );
          if( pcmList.isEmpty() ) return currentPos;
          for( int i = pcmList.size() - 1; i >= 0; i-- ){
            if( currentPos >= pcmList.get( i ).pos )
              return currentPos + pcmList.get( i ).cumulativeDiff;
          }
          return currentPos;
        }
        
        static class PosCorrectMap {
          int pos;
          int cumulativeDiff;
          public PosCorrectMap( int pos, int cumulativeDiff ){
            this.pos = pos;
            this.cumulativeDiff = cumulativeDiff;
          }
        }
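
        To make the lookup concrete, here is a standalone sketch of the same cumulative-diff logic (hypothetical offsets; the chained input.correctPosition() call is omitted):

        ```java
        import java.util.ArrayList;
        import java.util.List;

        public class PosCorrectDemo {
            static class PosCorrectMap {
                int pos;
                int cumulativeDiff;
                PosCorrectMap(int pos, int cumulativeDiff) {
                    this.pos = pos;
                    this.cumulativeDiff = cumulativeDiff;
                }
            }

            // Same lookup as MappingCharFilter.correctPosition(), minus the chained input
            static int correct(List<PosCorrectMap> pcmList, int currentPos) {
                for (int i = pcmList.size() - 1; i >= 0; i--) {
                    if (currentPos >= pcmList.get(i).pos)
                        return currentPos + pcmList.get(i).cumulativeDiff;
                }
                return currentPos;
            }

            public static void main(String[] args) {
                // Suppose a 2-char sequence at offset 3 was mapped to 1 char: every
                // position >= 3 in the filtered stream is 1 char behind the original.
                List<PosCorrectMap> pcm = new ArrayList<>();
                pcm.add(new PosCorrectMap(3, 1));
                System.out.println(correct(pcm, 2)); // 2 (before the mapping, unchanged)
                System.out.println(correct(pcm, 5)); // 6 (shifted back to the original offset)
            }
        }
        ```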
        

        There is another CharStream class, CharReader. It is a Reader wrapper, necessary to take a Reader and output a CharStream. CharReader is a concrete class and is instantiated in TokenizerChain.

        Does that make sense to you?

        Koji Sekiguchi added a comment -

        I think I found a bug... the correctPosition() returns incorrect position. I'm working on that...

        Koji Sekiguchi added a comment -

        I think I found a bug... the correctPosition() returns incorrect position. I'm working on that...

        Attached patch fixes the problem. It also includes:

        • some unit tests
        • Javadoc for CharStream, CharReader and CharFilter
        • rename correctPosition() to correctOffset() and make it final in CharFilter:
        public final int correctOffset(int currentOff) {
          return input.correctOffset( correctPosition( currentOff ) );
        }
        
        protected int correctPosition( int pos ){
          return pos;
        }
        

        then correctOffset() calls correctPosition(). correctPosition() can be overridden in a subclass of CharFilter to correct the position.

        • rename MappingCJKTokenizer to CharStreamAwareCJKTokenizer

        TODO:

        1. support \uNNNN style in mapping.txt
        2. add StopCharFilter
        Hoss Man added a comment -

        Does that make sense to you?

        yes, definitely ... but still a few questions:

        1) if i understand correctly: another use case beyond character normalization could be refactoring the existing HTMLStrip___Tokenizers so that instead people would use an HTMLStripCharFilter and then whatever tokenizer they like, correct?

        2) based on your explanation, shouldn't CharFilterFactory be renamed CharStreamFactory? ... there's no requirement that implementations produce a CharFilter, as long as they produce a CharStream, correct?

        3) should CharStream extend FilterReader?

        One thing that worries me is the interaction of CharStreams with their corrected positions and Tokenizers that may not know about CharStream at all. Obviously that could just be an unsupported case (i.e., if you want to use some CharStreamFactories, you'd better use a TokenizerFactory that can handle it) but i still suspect some people could easily be bitten by this.

        i wonder if we could protect people from this. perhaps a new CharStreamTokenizerFactory interface that must be implemented by any TokenizerFactory that knows about CharStreams (with a single "public TokenStream create(CharStream input)"). if a fieldType uses any CharStreamFactory, it's an initialization error unless the TokenizerFactory is also a CharStreamTokenizerFactory.

        Something else to consider: it seems like a lot of future headache could be simplified if the CharStream API were committed in lucene-java so that the Tokenizer API and all of the existing OOTB Tokenizers could know about it.

        Koji Sekiguchi added a comment -

        Hoss, sorry for the late reply.

        1) if i understand correctly: another use case beyond character normalization could be refactoring the existing HTMLStrip___Tokenizers so that instead people would use an HTMLStripCharFilter and then whatever tokenizer they like, correct?

        Correct.

        3) should CharStream extend FilterReader?

        I think we need all these classes to construct the CharFilter framework - CharStream, CharReader and CharFilter. CharReader and CharFilter are subclasses of CharStream. CharStream has an abstract method correctOffset():

        public abstract class CharStream extends Reader {
          /**
           * called by CharFilter(s) and Tokenizer to correct token offset.
           *
           * @param currentOff current offset
           * @return corrected token offset
           */
          public abstract int correctOffset( int currentOff );
        }
        

        CharStream extends Reader instead of FilterReader because FilterReader has a Reader member that I don't need. Instead, CharReader has a Reader and CharFilter has a CharStream. The role of CharReader is to wrap a Reader and make it a CharStream.

        public final class CharReader extends CharStream {
          protected Reader input;
          public CharReader( Reader in ){
            input = in;
          }
          @Override
          public int correctOffset(int currentOff) {
            return currentOff;
          }
          :
        }
        

        Then CharReader is placed at the beginning of the char-filter chain. Now that we have a CharStream, CharFilters can be used
        to organize a filter chain. I made correctOffset() final in CharFilter.

        public abstract class CharFilter extends CharStream {
          protected CharStream input;
          protected CharFilter( CharStream in ){
            input = in;
          }
          protected int correctPosition( int pos ){
            return pos;
          }
          @Override
          public final int correctOffset(int currentOff) {
            return input.correctOffset( correctPosition( currentOff ) );
          }
          :
        }
        

        Subclass of CharFilter can override correctPosition() method to correct current position.
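
        Putting the pieces together, offset corrections compose through the chain; a self-contained sketch using simplified versions of the classes above (ShiftFilter is hypothetical, and the real classes do actual character filtering in read()):

        ```java
        import java.io.IOException;
        import java.io.Reader;
        import java.io.StringReader;

        public class CharFilterChainDemo {

            static abstract class CharStream extends Reader {
                public abstract int correctOffset(int currentOff);
            }

            // Wraps a plain Reader at the head of the chain; offsets pass through unchanged
            static final class CharReader extends CharStream {
                private final Reader input;
                CharReader(Reader in) { input = in; }
                @Override public int correctOffset(int currentOff) { return currentOff; }
                @Override public int read(char[] cbuf, int off, int len) throws IOException {
                    return input.read(cbuf, off, len);
                }
                @Override public void close() throws IOException { input.close(); }
            }

            static abstract class CharFilter extends CharStream {
                protected final CharStream input;
                protected CharFilter(CharStream in) { input = in; }
                protected int correctPosition(int pos) { return pos; }
                @Override public final int correctOffset(int currentOff) {
                    return input.correctOffset(correctPosition(currentOff));
                }
                @Override public int read(char[] cbuf, int off, int len) throws IOException {
                    return input.read(cbuf, off, len);
                }
                @Override public void close() throws IOException { input.close(); }
            }

            // Hypothetical filter: pretends every offset is shifted by a fixed amount
            static final class ShiftFilter extends CharFilter {
                private final int shift;
                ShiftFilter(CharStream in, int shift) { super(in); this.shift = shift; }
                @Override protected int correctPosition(int pos) { return pos + shift; }
            }

            public static void main(String[] args) {
                // Two chained filters: each correction delegates to the filter beneath it
                CharStream cs = new ShiftFilter(
                        new ShiftFilter(new CharReader(new StringReader("abc")), 1), 2);
                System.out.println(cs.correctOffset(0)); // 3 (= 0 + 2 + 1)
            }
        }
        ```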

        2) based on your explanation, shouldn't CharFilterFactory be renamed CharStreamFactory ? ... there's no requirement that implementations produce a CharFilter, as long as they produce a ChaStream, correct?

        Yes, CharFilterFactory creates a CharStream, but I like CharFilterFactory because 1) the factory will instantiate CharFilter (not CharStream) and 2) the return type of TokenFilterFactory.create() is TokenStream although it instantiates TokenFilter.

        Something else to consider: it seems like a lot of future headache could be simplied if the CharStream API was committed in lucene-java so that the Tokenizer API and all of the existing OOTB Tokenizers could know about it.

        Agreed. I'll open a ticket in Lucene.

        Koji Sekiguchi added a comment -

        Agreed. I'll open a ticket in Lucene.

        Before opening a ticket, I'm seeking comments:
        http://www.nabble.com/Proposal-for-introducing-CharFilter-to20327007.html

        Koji Sekiguchi added a comment -

        The patch includes:

        • '\uNNNN' style supported in mapping.txt
        • mapping-ISOLatin1Accent.txt
        • CharStreamAwareWhitespaceTokenizer
        • <charFilter/> in example/solr/conf/schema.xml
          <!-- charFilter + WhitespaceTokenizer  -->
          <fieldType name="textCharNorm" class="solr.TextField" positionIncrementGap="100" >
            <analyzer>
              <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
              <tokenizer class="solr.CharStreamAwareWhitespaceTokenizerFactory"/>
            </analyzer>
          </fieldType>
          
        Koji Sekiguchi added a comment -

        added:

        • support multiple mapping files (SOLR-663)
        • an abstract base class - BaseCharFilter and moved PosCorrectMap to the base class
        • more test code
          I'll commit in a few days if there are no objections.
        Koji Sekiguchi added a comment -

        Committed revision 713902.

        Koji Sekiguchi added a comment -

        patch file for Solr 1.3.0 users.

        Peter Wolanin added a comment -

        Is there an issue for CharStream API in lucene? The e-mail thread looks like people were generally in support.

        Koji Sekiguchi added a comment -

        Is there an issue for CharStream API in lucene? The e-mail thread looks like people were generally in support.

        Oops. The pointer to the Lucene ticket was missing. It is LUCENE-1466.

        Koji Sekiguchi added a comment -

        Reopening because the correctPosition() method in the CharFilter class is not about token position, but token offset. It should be renamed before releasing Solr 1.4.

        Koji Sekiguchi added a comment -

        I plan to commit shortly.

        Koji Sekiguchi added a comment -

        Committed revision 755945.

        Otis Gospodnetic added a comment -

        Todd's comment from Oct 23, 2008 caught my attention:

        It should also work for existing filters like LowerCase. Seems like it has the potential to be faster then the filters, as it doesn't have to perform the same replacement multiple times if a particular character is replicated into multiple tokens, like in NGramTokenizer or CJKTokenizer.

        Couldn't we replace LowerCaseFilter then? Or does LCF still have some unique value? Ah, it does - it makes it possible to put it after something like WordDelimiterFilterFactory. Lowercasing at the very beginning would make it impossible for WDFF to do its job. Never mind. Leaving for posterity.
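
        A toy illustration of why the case information must survive until WordDelimiterFilter runs (a stand-in regex split, not the actual WDFF logic):

        ```java
        import java.util.Arrays;
        import java.util.List;

        public class CaseSplitDemo {
            // Toy stand-in for WordDelimiterFilter's case-transition split
            static List<String> splitOnCase(String token) {
                return Arrays.asList(token.split("(?<=[a-z])(?=[A-Z])"));
            }

            public static void main(String[] args) {
                System.out.println(splitOnCase("PowerShot")); // [Power, Shot]
                // Lowercased first, there is no transition left to split on:
                System.out.println(splitOnCase("powershot")); // [powershot]
            }
        }
        ```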

        Mark Bennett added a comment - - edited

        In SOLR-814 it was suggested that some systems might want to normalize all Hiragana characters to their Katakana counterparts.

        Although this is not universally agreed to, if somebody wanted to do it, I believe the attached mapping file would perform that task when used with this 822 patch. I don't speak Japanese and don't have test content yet, so I'm not 100% sure it works, but wanted to upload it as a start.

        Lance Norskog added a comment -

        Please update the Wiki for this feature. http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters?highlight=char+filters
        Koji Sekiguchi added a comment -

        Please update the Wiki for this feature.

        Done.

        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4

        Victor Yap added a comment -

        An old comment's link has been moved. Originally: http://webui.sourcelabs.com/lucene/issues/1343 Moved to: https://issues.apache.org/jira/browse/LUCENE-1343

          People

          • Assignee:
            Koji Sekiguchi
            Reporter:
            Koji Sekiguchi
          • Votes:
            0
            Watchers:
            3

            Dates

            • Created:
              Updated:
              Resolved:
