Lucene - Core
LUCENE-2295

Create a MaxFieldLengthAnalyzer to wrap any other Analyzer and provide the same functionality as MaxFieldLength provided on IndexWriter

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.1, 4.0-ALPHA
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New, Patch Available

      Description

      A spinoff from LUCENE-2294. Instead of asking the user to specify the requested MFL limit on IndexWriter, we can get rid of this setting entirely by providing an Analyzer which wraps any other Analyzer and its TokenStream with a TokenFilter that keeps track of the number of tokens produced and stops once the limit has been reached.

      This will remove any count tracking from IW's indexing, which is done even when UNLIMITED is specified for MFL.

      Let's try to do it for 3.1.
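
      For illustration, a minimal usage sketch of the idea. The wrapper name LimitTokenCountAnalyzer is an assumption (the summary suggests MaxFieldLengthAnalyzer, and later comments refer to a LimitTokenCountFilter); the snippet only shows how such a wrapper would replace IW's setting.

      {code:java}
// Sketch only: wrap any Analyzer so each TokenStream stops after a fixed
// number of tokens, instead of configuring maxFieldLength on IndexWriter.
// "LimitTokenCountAnalyzer" is an assumed name, not necessarily the final API.
// "directory" is an existing org.apache.lucene.store.Directory.
Analyzer base = new StandardAnalyzer(Version.LUCENE_31);
Analyzer limited = new LimitTokenCountAnalyzer(base, 10000);

IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_31, limited);
IndexWriter writer = new IndexWriter(directory, conf); // no MFL setting needed on IW
      {code}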

      Attachments

      1. LUCENE-2295.patch
        11 kB
        Uwe Schindler
      2. LUCENE-2295-trunk.patch
        13 kB
        Uwe Schindler
      3. LUCENE-2295-2-3x.patch
        23 kB
        Shai Erera
      4. LUCENE-2295-2-trunk.patch
        24 kB
        Shai Erera

        Activity

        Grant Ingersoll added a comment -

        Bulk close for 3.1

        Shai Erera added a comment -

        Committed revision 1060340 (trunk).
        Committed revision 1060342 (3x).

        Robert Muir added a comment -

        Shai, the patch looks good to me.

        I would also say that with your trunk patch we can resolve SOLR-2086,
        because then the maxFieldLength is totally implemented in the analyzer,
        so tools like analysis.jsp will automatically work with it.

        Shai Erera added a comment -

        Patch against trunk - removes maxFieldLength handling from all the code.

        Shai Erera added a comment -

        Patch against 3x. Removed the get/set from IWC and changed code which used it. I also added some clarifying notes to the deprecation note in IW.setMaxFieldLength.

        I will post a separate patch for trunk where this setting will be removed altogether.

        Robert Muir added a comment -

        Hi Shai, that sounds like the right solution to me!

        Shai Erera added a comment -

        I think the changes to 3x are less complicated than they seem - we don't need to deprecate anything more than we already did. IndexWriterConfig was introduced in 3.1 and all IW ctors are already deprecated. So we can just remove the get/setMaxFieldLength from IWC and be done with it, plus some jdocs.

        Is that the intention behind the reopening of the issue?

        Uwe Schindler added a comment -

        +1, I missed Mike's comment after resolving this issue!

        Robert Muir added a comment -

        I'm reopening this for discussion (both because it came up on the mailing list and Mike's question was never answered).

        I think the functionality added here is more flexible than the IW setting, and we should consider deprecating/removing it.

        Michael McCandless added a comment -

        Further investigations showed that there is some difference between using this filter/analyzer and the current setting in IndexWriter. IndexWriter uses the given MaxFieldLength as a maximum value for all instances of the same field name. So if you add 100 fields "foo" (each with 1,000 terms) and have the default of 10,000 tokens, DocInverter will index 10 of these field instances (10,000 terms in total) and the rest will be suppressed.

        In LUCENE-2450 I'm experimenting with having multi-valued fields be handled entirely by an analyzer stage, ie, the logical concatenation of tokens (with gaps) would be "hidden" from IW, and IW would think it's dealing with a single token stream. In this model, if you then appended the new LimitTokenCountFilter to the end, I think it'd result in the same behavior as maxFieldLength today.

        But, even before we eventually switch to that model... can't we still deprecate (on 3x) IW's maxFieldLength (remove from trunk) now? I realize the limiting is different (applying the limit pre vs. post concatenation), but I think the javadocs can explain this difference? I think it's unlikely apps are relying on this specific interaction of truncation and multi-valued fields...

        Uwe Schindler added a comment -

        Committed revision: 949445 (trunk), 949446 (3.x)

        Uwe Schindler added a comment -

        Updated patch for trunk.

        Uwe Schindler added a comment -

        Further investigations showed that there is some difference between using this filter/analyzer and the current setting in IndexWriter. IndexWriter uses the given MaxFieldLength as a maximum value for all instances of the same field name. So if you add 100 fields "foo" (each with 1,000 terms) and have the default of 10,000 tokens, DocInverter will index 10 of these field instances (10,000 terms in total) and the rest will be suppressed.

        If you use the Filter, the limit is per TokenStream, so the above example will index all field instances and produce 100,000 terms.

        But the current IndexWriter code has a bug, too: the check for too many terms is done after the first token of each input stream is indexed, so in the above example IW will index 10,089 terms, because once the limit is reached, each remaining stream will still index one term. This could be fixed (if really needed, as the MaxFieldLength in IW should be deprecated) by moving the check up and not even trying to index the field or create the TokenStream.

        I just wanted to note this difference here for further discussion.
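
        For concreteness, a small sketch of the scenario described above (a code fragment; "writer" and "thousandTermsText" are hypothetical). Each added Field instance yields its own TokenStream, so a per-stream filter limits each of the 100 instances independently, whereas IW's maxFieldLength counts across all instances of "foo" in the document.

        {code:java}
// Sketch of the multi-valued scenario above: 100 instances of field "foo",
// each producing 1,000 terms. "thousandTermsText" is a hypothetical String.
Document doc = new Document();
for (int i = 0; i < 100; i++) {
  doc.add(new Field("foo", thousandTermsText, Field.Store.NO, Field.Index.ANALYZED));
}
writer.addDocument(doc); // IW's MFL keeps ~10,000 terms; the per-stream filter keeps all 100,000
        {code}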

        Robert Muir added a comment -

        Add this to the list of potential future checks in BaseTokenStreamTestCase.

        Perhaps we can use this new limiting analyzer wrapper in assertAnalyzesToReuse to help determine if TokenStreams implement reset() correctly, versus just assuming the entire stream will be consumed completely.

        Uwe Schindler added a comment - edited

        Here is a first patch.

        Robert and I found a bug in CharTokenizer: it does not correctly set the endOffset when the underlying reader is not exhausted. This is fixed in the patch, too. The bug in CharTokenizer was also present when somebody used MaxFieldLength with IW – possible highlighting problems.

        Nevertheless, CharTokenizer should be rewritten; the code is not easily understandable. Too many states and branches, and possibly uninitialized variables (which I fixed with two asserts).

        Deprecating IW.MaxFieldLength is not yet included; this is just the new Filter/Analyzer.

        Uwe Schindler added a comment -

        After some discussion with rmuir, we realized that explicit reuse of the filter does not make sense. Maintaining the Map<String,ReuseableStream> costs more resources and maintenance than simply creating a class instance that has no initialization cost.
        The simple implementation of the Analyzer would be:

        • override reusableTokenStream so that it delegates to the inner analyzer and wraps the result with the filter. The cost of creating the filter is negligible, as the filter has no initialization cost (it uses no attributes, does not create attribute maps, ...)
        • override tokenStream to do the same, but delegate to the inner analyzer's tokenStream method instead.
        • Make this analyzer final, otherwise we need VirtualMethod (also for the TokenFilter, of course)
        • Override the rest of the methods in Analyzer and simply delegate. Don't forget the posIncr gap methods and so on!

        I will supply a patch with filter and analyzer later.
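
        A rough sketch of the delegating analyzer described in the list above (class and constructor names are assumptions, not necessarily the committed API; the token-limiting filter itself is sketched further below):

        {code:java}
import java.io.IOException;
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Fieldable;

/**
 * Sketch only: wraps every TokenStream produced by the delegate analyzer
 * with a token-count-limiting filter. Stream reuse is left entirely to the
 * delegate; no caching map is maintained here.
 */
public final class LimitTokenCountAnalyzer extends Analyzer {
  private final Analyzer delegate;
  private final int maxTokenCount;

  public LimitTokenCountAnalyzer(Analyzer delegate, int maxTokenCount) {
    this.delegate = delegate;
    this.maxTokenCount = maxTokenCount;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Wrapping is cheap: the filter registers no attributes of its own.
    return new LimitTokenCountFilter(delegate.tokenStream(fieldName, reader), maxTokenCount);
  }

  @Override
  public TokenStream reusableTokenStream(String fieldName, Reader reader) throws IOException {
    // Thin wrapper around the delegate's reused stream.
    return new LimitTokenCountFilter(delegate.reusableTokenStream(fieldName, reader), maxTokenCount);
  }

  @Override
  public int getPositionIncrementGap(String fieldName) {
    return delegate.getPositionIncrementGap(fieldName);
  }

  @Override
  public int getOffsetGap(Fieldable field) {
    return delegate.getOffsetGap(field);
  }
}
        {code}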

        Uwe Schindler added a comment - edited

        In the indexer, the backwards-compatibility code is hairy: if a max limit is set, the indexer just wraps our Filter around the TokenStream. This limits the number of tokens in both cases:

        1. An ANALYZED field (wrapping the TokenStream obtained from the Analyzer)
        2. A Field with a ctor taking a TokenStream: wrap the Field's TokenStream

        We need some cache here, too – the Analyzer cannot be used because of (2).

        For backwards compatibility we leave the code as it is and just remove the limiting code in 4.0. We just add comments noting what needs to be removed.

        Uwe Schindler added a comment -

        The TokenFilter is quite easy, only a few lines of code:

        • no attributes to be registered
        • use a counter that starts at 0
        • override incrementToken() to increment the counter whenever the input returns a token, and return false once the counter reaches the limit or the input is exhausted
        • reset() resets the counter
        • no other methods need to be overridden (this emulates the original behaviour of MaxFieldLength)

        The Analyzer is more complicated, as it should respect reusable streams. It should work like QueryAutoStopWordAnalyzer and maintain a Map of field names to cached streams. To detect whether reusableTokenStream has reused a stream, compare with the cache; if it is a new stream, wrap it.
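
        A minimal sketch of the filter described in the list above (the name LimitTokenCountFilter is taken from later comments on this issue; details may differ from the committed patch):

        {code:java}
import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

/**
 * Sketch only: stops the wrapped stream after maxTokenCount tokens.
 * Registers no attributes of its own, so wrapping is essentially free,
 * emulating IndexWriter's original MaxFieldLength behaviour.
 */
public final class LimitTokenCountFilter extends TokenFilter {
  private final int maxTokenCount;
  private int tokenCount = 0;

  public LimitTokenCountFilter(TokenStream in, int maxTokenCount) {
    super(in);
    this.maxTokenCount = maxTokenCount;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (tokenCount < maxTokenCount && input.incrementToken()) {
      tokenCount++;
      return true;
    }
    return false; // limit reached or input exhausted
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    tokenCount = 0;
  }
}
        {code}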

        Shai Erera added a comment -

        This will open the door for a more flexible field-length limit - today MFL dictates the field length for all fields, while with this Analyzer (TokenFilter, actually) someone could wrap the TokenStream only for specific fields, and/or have fields with different limits. Just a thought.
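
        As an illustration of that thought, a hypothetical per-field variant could be built the same way (not part of this issue's patch; names, including the LimitTokenCountFilter it reuses, are assumptions):

        {code:java}
import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;

/**
 * Hypothetical sketch: each field can get its own token limit; fields
 * without an entry in the map are left unlimited.
 */
public final class PerFieldLimitAnalyzer extends Analyzer {
  private final Analyzer delegate;
  private final Map<String, Integer> limits; // field name -> max token count

  public PerFieldLimitAnalyzer(Analyzer delegate, Map<String, Integer> limits) {
    this.delegate = delegate;
    this.limits = limits;
  }

  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream ts = delegate.tokenStream(fieldName, reader);
    Integer limit = limits.get(fieldName);
    return limit == null ? ts : new LimitTokenCountFilter(ts, limit.intValue());
  }
}
        {code}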


          People

          • Assignee:
            Uwe Schindler
          • Reporter:
            Shai Erera
          • Votes: 0
          • Watchers: 0