Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 4.0-BETA, Trunk
    • Component/s: modules/analysis
    • Labels:
      None
    • Lucene Fields:
      New

      Description

      Spinoff from SOLR-3684.

      Most lucene tokenizers have some buffer size, e.g. in CharTokenizer/ICUTokenizer its char[4096].

      But the jflex tokenizers use char[16384] by default, which seems overkill. I'm not sure we really see any performance bonus by having such a huge buffer size as a default.

      There is a jflex parameter to set this: I think we should consider reducing it.

      In a configuration like solr, tokenizers are reused per-thread-per-field,
      so these can easily stack up in RAM.

      Additionally CharFilters are not reused so the configuration in e.g.
      HtmlStripCharFilter might not be great since its per-document garbage.

        Activity

        Hide
        Steve Rowe added a comment -

        +1.

        For tokenizers, the buffer needs to be able to hold a token (and its trailing context, if lookahead is used), but nothing more. 16k tokens are likely extremely rare. 4k seems reasonable to me - it's still way bigger than most people are likely to hit over normal text input.

        HTMLStripCharFilter is a bit different, since it buffers HTML constructs rather than tokens. In the face of malformed input (e.g. an opening angle bracket '<' with no closing angle bracket '>'), the scanner might buffer the entire remaining input. In contrast, LegacyHTMLStripCharFilter, the pre-JFlex implementation, limits this kind of buffering, to 8k max chars IIRC.

        Show
        Steve Rowe added a comment - +1. For tokenizers, the buffer needs to be able to hold a token (and its trailing context, if lookahead is used), but nothing more. 16k tokens are likely extremely rare. 4k seems reasonable to me - it's still way bigger than most people are likely to hit over normal text input. HTMLStripCharFilter is a bit different, since it buffers HTML constructs rather than tokens. In the face of malformed input (e.g. an opening angle bracket '<' with no closing angle bracket '>'), the scanner might buffer the entire remaining input. In contrast, LegacyHTMLStripCharFilter , the pre-JFlex implementation, limits this kind of buffering, to 8k max chars IIRC.
        Hide
        Robert Muir added a comment -

        For tokenizers, the buffer needs to be able to hold a token (and its trailing context, if lookahead is used), but nothing more. 16k tokens are likely extremely rare. 4k seems reasonable to me - it's still way bigger than most people are likely to hit over normal text input.

        Yes, I think its reasonable too: especially since maxTokenLength is 255 by default.

        HTMLStripCharFilter is a bit different, since it buffers HTML constructs rather than tokens. In the face of malformed input (e.g. an opening angle bracket '<' with no closing angle bracket '>'), the scanner might buffer the entire remaining input. In contrast, LegacyHTMLStripCharFilter, the pre-JFlex implementation, limits this kind of buffering, to 8k max chars IIRC.

        OK, I can leave this one alone. We can revisit if we can make CharFilters reusable (not simple to do cleanly today). Its not as much of an issue since nothing is hanging on to it.

        I'll work up a patch.

        Show
        Robert Muir added a comment - For tokenizers, the buffer needs to be able to hold a token (and its trailing context, if lookahead is used), but nothing more. 16k tokens are likely extremely rare. 4k seems reasonable to me - it's still way bigger than most people are likely to hit over normal text input. Yes, I think its reasonable too: especially since maxTokenLength is 255 by default. HTMLStripCharFilter is a bit different, since it buffers HTML constructs rather than tokens. In the face of malformed input (e.g. an opening angle bracket '<' with no closing angle bracket '>'), the scanner might buffer the entire remaining input. In contrast, LegacyHTMLStripCharFilter, the pre-JFlex implementation, limits this kind of buffering, to 8k max chars IIRC. OK, I can leave this one alone. We can revisit if we can make CharFilters reusable (not simple to do cleanly today). Its not as much of an issue since nothing is hanging on to it. I'll work up a patch.
        Hide
        Robert Muir added a comment -

        Here's a patch: with regenerations.

        Note that, by default 'ant jflex' gave me an error for all the includes (as of jflex r612).

        So thats why you see changes like:

        -%include src/java/org/apache/lucene/analysis/charfilter/HTMLCharacterEntities.jflex
        +%include HTMLCharacterEntities.jflex
        

        It seems jflex now expects these file paths to be relative to the input file?

        Show
        Robert Muir added a comment - Here's a patch: with regenerations. Note that, by default 'ant jflex' gave me an error for all the includes (as of jflex r612). So thats why you see changes like: -%include src/java/org/apache/lucene/analysis/charfilter/HTMLCharacterEntities.jflex +%include HTMLCharacterEntities.jflex It seems jflex now expects these file paths to be relative to the input file?
        Hide
        Steve Rowe added a comment - - edited

        It seems jflex now expects these file paths to be relative to the input file?

        Gerwin Klein recently fixed JFlex issue 3420809, with exactly this change.

        Show
        Steve Rowe added a comment - - edited It seems jflex now expects these file paths to be relative to the input file? Gerwin Klein recently fixed JFlex issue 3420809 , with exactly this change.
        Hide
        Robert Muir added a comment -

        OK thanks, that explains it! I'd like to commit this if there are no objections.

        Show
        Robert Muir added a comment - OK thanks, that explains it! I'd like to commit this if there are no objections.
        Hide
        Steve Rowe added a comment -

        I'd like to commit this if there are no objections.

        +1, patch looks good.

        Show
        Steve Rowe added a comment - I'd like to commit this if there are no objections. +1, patch looks good.

          People

          • Assignee:
            Unassigned
            Reporter:
            Robert Muir
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development