Details

    • Lucene Fields:
      New

      Description

      After indexing a large document, the lexer buffer may remain large indefinitely. This sub-issue resets the lexer buffer to its default size on reset(Reader).

      This is done on the enclosing issue.
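
As a rough illustration of the problem described above, the following sketch uses a simplified stand-in for a scanner whose buffer doubles when it fills up. The class `GrowingScanner`, its methods, and the tiny default size are hypothetical names chosen for this example; this is not JFlex's generated code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;

/**
 * Simplified stand-in for a generated scanner (hypothetical, NOT JFlex code).
 * It treats the whole input as one long token, which forces the buffer to grow.
 */
final class GrowingScanner {
    // Hypothetical default size, kept tiny so growth is easy to observe.
    static final int DEFAULT_SIZE = 16;

    private char[] buffer = new char[DEFAULT_SIZE];
    private Reader in;
    private int end; // number of valid chars currently held in the buffer

    GrowingScanner(Reader in) {
        this.in = in;
    }

    /** Reads all remaining input, doubling the buffer whenever it is full. */
    int readAll() {
        try {
            while (true) {
                if (end == buffer.length) {
                    char[] bigger = new char[buffer.length * 2];
                    System.arraycopy(buffer, 0, bigger, 0, end);
                    buffer = bigger;
                }
                int n = in.read(buffer, end, buffer.length - end);
                if (n < 0) {
                    return end;
                }
                end += n;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Naive reset: swaps the input but keeps whatever buffer size was reached. */
    void naiveReset(Reader r) {
        in = r;
        end = 0;
    }

    int capacity() {
        return buffer.length;
    }
}
```

With this sketch, reading a 1000-character input grows the buffer from 16 to 1024 chars, and a following `naiveReset` with a 5-character input leaves the capacity at 1024. That retained capacity is the memory this sub-issue wants to give back.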

      1. LUCENE-2384-3x.patch
        25 kB
        Uwe Schindler
      2. LUCENE-2384-trunk.patch
        9 kB
        Uwe Schindler
      3. reset.diff
        2 kB
        Ruben Laguna

          Activity

          Ruben Laguna added a comment -

          The mailing list discussion that originated this is [1]

          [1] http://lucene.markmail.org/thread/ndmcgffg2mnwjo47

          Robert Muir added a comment -

          If tokenizers like StandardTokenizer just end up reading things into RAM anyway, we should remove Reader from the Tokenizer interface.

          Supporting Reader instead of simply tokenizing the entire doc makes our tokenizers very complex (see CharTokenizer).
          It would be nice to remove this complexity if the objective doesn't really work anyway.

          Uwe Schindler added a comment -

          For JFlex this does not help, as the JFlex-generated code always needs a Reader. JFlex is a special case: the lexer does not need to load the whole document into memory; it only sometimes needs a large look-ahead/look-behind buffer.

          Ruben Laguna added a comment - - edited

          Patch to reset the zzBuffer when the input is reset. The code is taken from https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@web38901.mail.mud.yahoo.com so I can't really grant a license to use it, but I think the author released it into the public domain by posting it to the mailing list.

          I tested it and it seems to work for me. I'm including it here in case somebody wants to apply the patch directly to 3.0.1 (although it's better to wait for 3.1).

          Robert Muir added a comment -

          For JFlex this does not help, as the JFlex-generated code always needs a Reader.

          This can be fixed. Currently all I/O in all tokenizers is broken and buggy, and does not correctly handle special cases around their 'buffering'.

          The only one that is correct is CharTokenizer, but at what cost? It has so much complexity because of this Reader issue.

          We should stop pretending that we can really stream docs with Reader.
          We should stop pretending that 8GB documents or the like exist, where we can't just analyze the whole doc at once and make things simple.
          And then we can fix the Lucene tokenizers to be correct.
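
To make the proposal above concrete, here is a minimal sketch of the "analyze the whole doc at once" style: drain the Reader into a String first, then tokenize with a plain loop and no incremental buffer management. The class `WholeDocTokenizer` and its methods are hypothetical names for this example, not part of Lucene, and the letter-or-digit rule is a deliberately crude stand-in for real tokenization (it ignores surrogate pairs, offsets, and attributes).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical example class; not a Lucene API. */
final class WholeDocTokenizer {

    /** Drains the reader completely into memory. */
    static String readFully(Reader in) {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[4096];
        try {
            int n;
            while ((n = in.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Splits the whole document into runs of letters or digits. */
    static List<String> tokenize(Reader in) {
        String text = readFully(in);
        List<String> tokens = new ArrayList<>();
        int start = -1; // start index of the current token, or -1 if none
        for (int i = 0; i <= text.length(); i++) {
            boolean inToken = i < text.length()
                    && Character.isLetterOrDigit(text.charAt(i));
            if (inToken && start < 0) {
                start = i;
            } else if (!inToken && start >= 0) {
                tokens.add(text.substring(start, i));
                start = -1;
            }
        }
        return tokens;
    }
}
```

The trade-off is the one debated in this thread: the tokenizer logic becomes trivial because every character is addressable at once, at the cost of holding the entire document in memory.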

          Uwe Schindler added a comment -

          Patch to reset the zzBuffer when the input is reset. The code is taken from https://sourceforge.net/mailarchive/message.php?msg_id=444070.38422.qm@web38901.mail.mud.yahoo.com so I can't really grant a license to use it, but I think the author released it into the public domain by posting it to the mailing list.
          I tested it and it seems to work for me. I'm including it here in case somebody wants to apply the patch directly to 3.0.1 (although it's better to wait for 3.1).

          Your fix adds additional complexity. Just reset the buffer back to the default ZZ_BUFFERSIZE on reset if it has grown. Your patch always reallocates a new buffer.

          Use this:

          public final void reset(Reader r) {
            // reset to default buffer size, if buffer has grown
            if (zzBuffer.length > ZZ_BUFFERSIZE) {
              zzBuffer = new char[ZZ_BUFFERSIZE];
            }
            yyreset(r);
          }
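
The snippet above depends on fields and methods of the generated scanner. As a self-contained way to try out the same idea, the sketch below wraps it in a mock class. `ShrinkOnResetScanner`, `simulateGrowth`, `buffer()`, and the small `ZZ_BUFFERSIZE` value are assumptions made for this example, and `yyreset` here is only a stand-in that swaps the input, not the real generated method.

```java
import java.io.Reader;
import java.util.Arrays;

/** Mock scanner (hypothetical) showing shrink-on-reset only when the buffer has grown. */
final class ShrinkOnResetScanner {
    // Hypothetical small default so the effect is easy to see.
    static final int ZZ_BUFFERSIZE = 16;

    private char[] zzBuffer = new char[ZZ_BUFFERSIZE];
    private Reader zzReader;

    /** Test hook: pretends the scanner enlarged its buffer for a long token. */
    void simulateGrowth(int newSize) {
        zzBuffer = Arrays.copyOf(zzBuffer, Math.max(newSize, zzBuffer.length));
    }

    /** Stand-in for the generated yyreset(Reader); here it only swaps the input. */
    private void yyreset(Reader r) {
        zzReader = r;
    }

    /** Same logic as the snippet above: reallocate only if the buffer has grown. */
    void reset(Reader r) {
        if (zzBuffer.length > ZZ_BUFFERSIZE) {
            zzBuffer = new char[ZZ_BUFFERSIZE];
        }
        yyreset(r);
    }

    char[] buffer() {
        return zzBuffer;
    }
}
```

In this mock, a reset on an ungrown scanner keeps the very same array instance, while a reset after growth swaps in a fresh default-sized array. Skipping the allocation in the common case is the difference from a patch that reallocates on every reset.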
          
          Uwe Schindler added a comment -

          Committed revision: 932163

          Uwe Schindler added a comment -

          The zzBuffer bug is fixed in JFlex r591; we should add a version check and remove the code. Also, WikipediaTokenizer's files should be regenerated.

          Uwe Schindler added a comment -

          Patches for 3.x and trunk. The 3.x patch also contains the lost merge of the JFlex 1.5 update in WikipediaTokenizer.

          Uwe Schindler added a comment -

          Committed:

          • trunk revision: 945130
          • 3x revision: 945133
          Robert Muir added a comment -

          Reopening for possible 2.9.4/3.0.3 backport.

          Uwe Schindler added a comment -

          Backported to 3.0 branch revision: 1028739
          Backported to 2.9 branch revision: 1028744


            People

            • Assignee:
              Uwe Schindler
              Reporter:
              Uwe Schindler