SOLR-3881

frequent OOM in LanguageIdentifierUpdateProcessor

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 4.0
    • Fix Version/s: 4.10.4, 5.0, 6.0
    • Component/s: update
    • Labels: None
    • Environment:

      CentOS 6.x, JDK 1.6, (java -server -Xms2G -Xmx2G -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=....)

      Description

      We are seeing frequent failures in which Solr runs out of memory. Here is the stack trace we observe when this happens:

      Caused by: java.lang.OutOfMemoryError: Java heap space
              at java.util.Arrays.copyOf(Arrays.java:2882)
              at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:100)
              at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:390)
              at java.lang.StringBuffer.append(StringBuffer.java:224)
              at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.concatFields(LanguageIdentifierUpdateProcessor.java:286)
              at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.process(LanguageIdentifierUpdateProcessor.java:189)
              at org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor.processAdd(LanguageIdentifierUpdateProcessor.java:171)
              at org.apache.solr.handler.BinaryUpdateRequestHandler$2.update(BinaryUpdateRequestHandler.java:90)
              at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readOuterMostDocIterator(JavaBinUpdateRequestCodec.java:140)
              at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readIterator(JavaBinUpdateRequestCodec.java:120)
              at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:221)
              at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec$1.readNamedList(JavaBinUpdateRequestCodec.java:105)
              at org.apache.solr.common.util.JavaBinCodec.readVal(JavaBinCodec.java:186)
              at org.apache.solr.common.util.JavaBinCodec.unmarshal(JavaBinCodec.java:112)
              at org.apache.solr.client.solrj.request.JavaBinUpdateRequestCodec.unmarshal(JavaBinUpdateRequestCodec.java:147)
              at org.apache.solr.handler.BinaryUpdateRequestHandler.parseAndLoadDocs(BinaryUpdateRequestHandler.java:100)
              at org.apache.solr.handler.BinaryUpdateRequestHandler.access$000(BinaryUpdateRequestHandler.java:47)
              at org.apache.solr.handler.BinaryUpdateRequestHandler$1.load(BinaryUpdateRequestHandler.java:58)
              at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:59)
              at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
              at org.apache.solr.core.SolrCore.execute(SolrCore.java:1540)
              at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:435)
              at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:256)
              at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1337)
              at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:484)
              at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
              at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:524)
              at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:233)
              at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1065)
              at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:413)
              at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:192)
              at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:999)
      
      Attachments

      1. SOLR-3881.patch
        21 kB
        Steve Rowe
      2. SOLR-3881.patch
        21 kB
        Steve Rowe
      3. SOLR-3881.patch
        20 kB
        Vitaliy Zhovtyuk
      4. SOLR-3881.patch
        26 kB
        Vitaliy Zhovtyuk
      5. SOLR-3881.patch
        13 kB
        Vitaliy Zhovtyuk
      6. SOLR-3881.patch
        8 kB
        Vitaliy Zhovtyuk
      7. SOLR-3881.patch
        7 kB
        Tomás Fernández Löbbe

        Activity

        Rob Tulloh added a comment -

        One possible solution is to limit the size of the string that is selected for concatenation. I don't know if there is a good way to guess which part of the string to select; just picking the first 64K seems like a reasonable choice, absent any heuristics suggesting otherwise. Thoughts?

        In LanguageIdentifierUpdateProcessor.concatFields:

          String fieldValue = (String) doc.getFieldValue(fieldName);
          if (fieldValue != null && fieldValue.length() > 0) {
            // cap each field's contribution at 64K chars to bound memory use
            sb.append(fieldValue.substring(0, Math.min(65536, fieldValue.length())));
          }
        
        Hoss Man added a comment -

        One possible solution is to limit the size of the string that is selected for concatenation.

        I don't know if there is any way to make LanguageIdentifierUpdateProcessor more memory efficient (in particular, I'm not sure why it needs to concat the field values instead of operating on them directly), but if you want to give langId just the first N characters of another field, that should already be possible w/o code changes by wiring together the CloneFieldUpdateProcessorFactory with the TruncateFieldUpdateProcessorFactory.

        Something like this should work...

         ...
         <processor class="solr.CloneFieldUpdateProcessorFactory">
           <str name="source">GIANT_HONKING_STRING_FIELD</str>
           <str name="dest">truncated_string_field_for_lang_detect</str>
         </processor>
         <processor class="solr.TruncateFieldUpdateProcessorFactory">
           <str name="fieldName">truncated_string_field_for_lang_detect</str>
           <int name="maxLength">65536</int>
         </processor>
         <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
           <!-- <str name="langid.fl">title,subject,GIANT_HONKING_STRING_FIELD</str> -->
           <str name="langid.fl">title,subject,truncated_string_field_for_lang_detect</str>
           ...
         </processor>
         <processor class="solr.IgnoreFieldUpdateProcessorFactory">
           <str name="fieldName">truncated_string_field_for_lang_detect</str>
         </processor>
         ...
        

        Neither CloneFieldUpdateProcessorFactory nor TruncateFieldUpdateProcessorFactory will make a full copy of the original String value, and TruncateFieldUpdateProcessorFactory will only make a truncated copy if the source is longer than the configured max (and even then, whether any copy is actually made really just depends on how the JVM implements substring). IgnoreFieldUpdateProcessorFactory will ensure that the truncated copy is freed up for GC as soon as you are done with LangId.

        Jan Høydahl added a comment -

        I'm sure it's possible to optimize the memory footprint somehow. The reason we concat all "fl" fields before detection was originally that Tika's detector gets better the longer the input text is. So while detection on individual short fields has a high risk of mis-detection, the resulting concatenated string has a better chance.

        A configurable max-cap in the concatenation may make sense, as the detection accuracy flattens out after some threshold.

        Perhaps we could also avoid the expandCapacity() and Arrays.copyOf() calls if we pre-allocate the StringBuffer with the theoretical max size, being the size of our SolrInputDoc. If the StringBuffer is at 10kb and needs an extra 10b for an append, it will allocate a new buffer of (10kb+1)*2 capacity, which is a waste. We should also switch to StringBuilder, which is more performant.
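
        A minimal sketch of that pre-allocation idea (the helper name and the +1-per-separator accounting are illustrative, not from any patch here):

          import java.util.Collection;
          import org.apache.solr.common.SolrInputDocument;

          // Hypothetical helper: size the builder up front so append() never
          // has to grow (and copy) its backing array via expandCapacity().
          static StringBuilder preSizedBuilder(SolrInputDocument doc, Collection<String> inputFields) {
            int expectedChars = 0;
            for (String fieldName : inputFields) {
              Object value = doc.getFieldValue(fieldName);
              if (value instanceof String) {
                expectedChars += ((String) value).length() + 1; // +1 for a separator space
              }
            }
            return new StringBuilder(expectedChars); // single up-front allocation
          }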

        Hoss Man added a comment -

        The reason why we concat all "fl" fields before detection was originally because Tika's detector gets better and better the longer input text you have.

        But is it possible to give Tika a String[] or List<String> instead of concatenating everything into a single String?

        Robert Muir added a comment -

        The langdetect implementation can append each piece at a time.

        It can also take a reader, append(Reader), but that is really just syntactic sugar that forwards to append(String) without exceeding Detector.max_text_length.

        Seems like the concatenating stuff should be pushed out of the base class into the Tika impl.
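
        A minimal sketch of that per-piece flow against the langdetect API (it assumes the detector profiles are already loaded; the field values and the 20000-char cap are illustrative):

          import java.util.List;
          import com.cybozu.labs.langdetect.Detector;
          import com.cybozu.labs.langdetect.DetectorFactory;
          import com.cybozu.labs.langdetect.LangDetectException;

          // Hypothetical sketch: feed each field value to the detector
          // separately instead of concatenating one giant String first.
          static String detectLanguage(List<String> fieldValues) throws LangDetectException {
            Detector detector = DetectorFactory.create();  // profiles loaded beforehand
            detector.setMaxTextLength(20000);              // cap chars the detector buffers
            for (String value : fieldValues) {
              detector.append(value);                      // append() honors max_text_length
              detector.append(" ");                        // separator between values
            }
            return detector.detect();                      // e.g. "en", "de"
          }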

        Jan Høydahl added a comment -

        Probably built-in truncation is enough to avoid the OOMs, and we could refactor the multi-string append later if necessary.

        Tomás Fernández Löbbe added a comment -

        Do you think it would make more sense to limit each append (for the different fields) or to limit the total size of the buffer/builder (stop appending fields once the maximum is reached)? Both ways would prevent the OOM; however, they could give different results.
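
        To make the two options concrete, a hedged sketch of both caps in one concatenation loop (the names maxFieldValueChars and maxTotalChars anticipate the parameters this issue eventually adopted):

          import java.util.List;

          // Illustrative only: the per-field cap truncates each value
          // independently; the total cap stops the loop once the buffer is full.
          static String concatWithLimits(List<String> values, int maxFieldValueChars, int maxTotalChars) {
            StringBuilder sb = new StringBuilder();
            for (String value : values) {
              sb.append(value, 0, Math.min(maxFieldValueChars, value.length()));
              sb.append(' ');
              if (sb.length() >= maxTotalChars) {
                sb.setLength(maxTotalChars); // trim any overshoot from the last append
                break;
              }
            }
            return sb.toString();
          }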

        Tomás Fernández Löbbe added a comment -

        Added the proposed changes

        Steve Rowe added a comment -

        Bulk move 4.4 issues to 4.5 and 5.0

        Vitaliy Zhovtyuk added a comment -

        Updated to latest trunk.
        Fixed multivalue support.
        Added a string size calculation as the StringBuilder capacity, used to prevent multiple array allocations on append. (Maybe this also needs to be configurable, for large documents only.)

        Uwe Schindler added a comment -

        Move issue to Solr 4.9.

        Steve Rowe added a comment -

        Added string size calculation as string builder capacity. Used to prevent multiple array allocation on append. (Maybe also need to be configurable - for large documents only)

        Vitaliy Zhovtyuk, I agree - I think we should have two configurable limits: max chars per field value (already in Tomás Fernández Löbbe's patch and your updated patches), and max total chars (not there yet).

        Tomás wrote:

        Do you think it would make more sense to limit each append (for the different fields) or to limit the total size of the buffer/builder (stop appending fields when the maximum was reached)? Both ways would prevent OOM, however they could give different results.

        I think we should have both limits.

        I think it's more important, though, to do as Robert Muir said earlier in this issue:

        The langdetect implementation can append each piece at a time.

        It can also take reader: append(Reader), but that is really just syntactic sugar forwarding to append(String)
        and not exceeding the Detector.max_text_length.

        Seems like the concatenating stuff should be pushed out of the base class into the Tika impl.

        See http://language-detection.googlecode.com/svn/trunk/doc/com/cybozu/labs/langdetect/Detector.html#setMaxTextLength(int) - the default is 10K chars - we can pass the configured max total chars here.

        We should also set default maxima for both per-value and total chars, rather than MAX_INT, as in the current patch.

        Steve Rowe added a comment - edited

        See http://language-detection.googlecode.com/svn/trunk/doc/com/cybozu/labs/langdetect/Detector.html#setMaxTextLength(int) - the default is 10K chars - we can pass the configured max total chars here.

        The default is actually 10K bytes, not chars, so we'd need to divide by two when passing the configured max total chars.

        Edit: Disregard the above comment; the javadocs refer to "10KB" as the default max text length, but Detector.append() uses the max_text_length config as a max number of chars.

        Vitaliy Zhovtyuk added a comment -

        Added a global limit to the concatenated string.
        Added a limit on the detector via detector.setMaxTextLength(maxTotalAppendSize).

        About moving org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor#concatFields to org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor: it's a bit unclear, because concatFields is used in both org.apache.solr.update.processor.TikaLanguageIdentifierUpdateProcessor and org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor.

        Steve Rowe added a comment -

        Vitaliy, those changes look good.

        About moving concatFields() to the Tika language identifier: I think the way to go is just to move the whole method there, then change the detectLanguage() method to take the SolrInputDocument instead of a String. You don't need to carry over the field[] parameter from concatFields(), since the data member inputFields will be accessible everywhere it's needed.

        I should have mentioned previously: I don't like the maxAppendSize and maxTotalAppendSize names - "size" is ambiguous (could refer to bytes, chars, whatever), and "append" refers to an internal operation... I'd like to see "append"=>"field value" and "size"=>"chars": maxFieldValueChars, and maxTotalChars (since appending doesn't need to be mentioned for the global limit). The same thing goes for the default constants and the test method names.

        Some minor issues I found with your patch:

        1. As I said previously: "We should also set default maxima for both per-value and total chars, rather than MAX_INT, as in the current patch."
        2. The total chars default should be its own setting; I was thinking we could make it double the per-value default?
        3. It's better not to reorder import statements unless you're already making significant changes to them; it distracts from the meat of the change. (You reordered them in LangDetectLanguageIdentifierUpdateProcessor and LanguageIdentifierUpdateProcessorFactoryTestCase)
        4. In LanguageIdentifierUpdateProcessor.concatFields(), when you trim the concatenated text to maxTotalAppendSize, I think StringBuilder.setLength(maxTotalAppendSize); would be more efficient than StringBuilder.delete(maxTotalAppendSize, sb.length() - 1);
        5. In addition to the test you added for the global limit, we should also test using both the per-value and global limits at the same time.
        Vitaliy Zhovtyuk added a comment -

        About moving concatFields() to the tika language identifier: I think the way to go is just move the whole method there, then change the detectLanguage() method to take the SolrInputDocument instead of a String. You don't need to carry over the field[] parameter from concatFields(), since data member inputFields will be accessible everywhere it's needed.
        [VZ] This call looks cleaner now; I changed inputFields to private to reduce its visibility scope.

        I should have mentioned previously: I don't like the maxAppendSize and maxTotalAppendSize names - "size" is ambiguous (could refer to bytes, chars, whatever), and "append" refers to an internal operation... I'd like to see "append"=>"field value" and "size"=>"chars": maxFieldValueChars, and maxTotalChars (since appending doesn't need to be mentioned for the global limit). The same thing goes for the default constants and the test method names.
        [VZ] Renamed parameters and test methods

        Some minor issues I found with your patch:
        As I said previously: "We should also set default maxima for both per-value and total chars, rather than MAX_INT, as in the current patch."
        The total chars default should be its own setting; I was thinking we could make it double the per-value default?
        [VZ] added default value to maxTotalChars and changed both to 10K like in com.cybozu.labs.langdetect.Detector.maxLength

        It's better not to reorder import statements unless you're already making significant changes to them; it distracts from the meat of the change. (You reordered them in LangDetectLanguageIdentifierUpdateProcessor and LanguageIdentifierUpdateProcessorFactoryTestCase)
        [VZ] That was IDE optimization putting imports in alphabetical order - restored the original order.

        In LanguageIdentifierUpdateProcessor.concatFields(), when you trim the concatenated text to maxTotalAppendSize, I think StringBuilder.setLength(maxTotalAppendSize); would be more efficient than StringBuilder.delete(maxTotalAppendSize, sb.length() - 1);
        [VZ] Yep, cleaned that

        In addition to the test you added for the global limit, we should also test using both the per-value and global limits at the same time.
        [VZ] Tests for both limits added

        Steve Rowe added a comment -

        Vitaliy, thanks for the changes.

        I see a few more issues in your latest patch:

        1. LangDetectLanguageIdentifierUpdateProcessor.detectLanguage() still uses concatFields(), but it shouldn't – that was the whole point about moving it to TikaLanguageIdentifierUpdateProcessor; instead, LangDetectLanguageIdentifierUpdateProcessor.detectLanguage() should loop over inputFields and call detector.append() (similarly to what concatFields() does).
        2. concatFields() and getExpectedSize() should move to TikaLanguageIdentifierUpdateProcessor.
        3. LanguageIdentifierUpdateProcessor.getExpectedSize() still takes a maxAppendSize, which didn't get renamed, but that param could be removed entirely, since maxFieldValueChars is available as a data member.
        4. There are a bunch of whitespace changes in LanguageIdentifierUpdateProcessorFactoryTestCase.java - it makes reviewing patches significantly harder when they include changes like this. Your IDE should have settings that make it stop doing this.
        5. There is still some import reordering in TikaLanguageIdentifierUpdateProcessor.java.

        One last thing:

        The total chars default should be its own setting; I was thinking we could make it double the per-value default?

        [VZ] added default value to maxTotalChars and changed both to 10K like in com.cybozu.labs.langdetect.Detector.maxLength

        Thanks for adding the total chars default, but you didn't make it double the field value chars default, as I suggested. Not sure if that's better - if the user specifies multiple fields and the first one is the only one that's used to determine the language because it's larger than the total char default, is that an issue? I was thinking that it would be better to visit at least one other field (hence the idea of total = 2 * per-field), but that wouldn't fully address the issue. What do you think?

        Vitaliy Zhovtyuk added a comment -

        1. LangDetectLanguageIdentifierUpdateProcessor.detectLanguage() still uses concatFields(), but it shouldn't – that was the whole point about moving it to TikaLanguageIdentifierUpdateProcessor; instead, LangDetectLanguageIdentifierUpdateProcessor.detectLanguage() should loop over inputFields and call detector.append() (similarly to what concatFields() does).
        [VZ] LangDetectLanguageIdentifierUpdateProcessor.detectLanguage() changed to use the old flow, with the limit applied per field and the max total set on the detector.
        Each field value is appended to the detector.

        2. concatFields() and getExpectedSize() should move to TikaLanguageIdentifierUpdateProcessor.
        [VZ] Moved to TikaLanguageIdentifierUpdateProcessor. Tests using concatFields() moved to TikaLanguageIdentifierUpdateProcessorFactoryTest.

        3. LanguageIdentifierUpdateProcessor.getExpectedSize() still takes a maxAppendSize, which didn't get renamed, but that param could be removed entirely, since maxFieldValueChars is available as a data member.
        [VZ] Argument removed.

        4. There are a bunch of whitespace changes in LanguageIdentifierUpdateProcessorFactoryTestCase.java - it makes reviewing patches significantly harder when they include changes like this. Your IDE should have settings that make it stop doing this.
        [VZ] Whitespaces removed.

        5. There is still some import reordering in TikaLanguageIdentifierUpdateProcessor.java.
        [VZ] Fixed.

        One last thing:
        The total chars default should be its own setting; I was thinking we could make it double the per-value default?
        [VZ] added default value to maxTotalChars and changed both to 10K like in com.cybozu.labs.langdetect.Detector.maxLength
        Thanks for adding the total chars default, but you didn't make it double the field value chars default, as I suggested. Not sure if that's better - if the user specifies multiple fields and the first one is the only one that's used to determine the language because it's larger than the total char default, is that an issue? I was thinking that it would be better to visit at least one other field (hence the idea of total = 2 * per-field), but that wouldn't fully address the issue. What do you think?
        [VZ] I think in most cases it will be only one field, but since both parameters are optional we should not restrict the result if only the per-field limit is specified as more than 10K.
        Updated the total default value to 20K.

        Steve Rowe added a comment -

        Thanks Vitaliy, looks great.

        [VZ] ... since both parameters are optional we should not restrict the result if only the per-field limit is specified as more than 10K.

        I agree - I added code to LanguageIdentifierUpdateProcessor.initParams() to check if maxFieldValueChars > maxTotalChars, and if so, set maxTotalChars to the value of maxFieldValueChars and log a warning.

        Attaching a version of the patch that includes the above change, as well as a CHANGES.txt entry.

        I'll commit when Subversion lets me (can't connect right now).

        Steve Rowe added a comment -

        I added code to LanguageIdentifierUpdateProcessor.initParams() to check if maxFieldValueChars > maxTotalChars, and if so, set maxTotalChars to the value of maxFieldValueChars and log a warning.

        The above was too simplistic. If maxFieldValueChars > maxTotalChars and the user only set maxTotalChars, then the reverse should happen: maxFieldValueChars should be set to maxTotalChars.

        The attached patch fixes this, and allows the tests to pass.
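
        A minimal sketch of that reconciliation (the helper, its flag, and its names are illustrative, not the committed code, which also logs a warning):

          // Illustrative only: whichever limit the user left at its default
          // yields to the one they set explicitly.
          static int[] reconcileLimits(int maxFieldValueChars, int maxTotalChars, boolean userSetTotalChars) {
            if (maxFieldValueChars > maxTotalChars) {
              if (userSetTotalChars) {
                maxFieldValueChars = maxTotalChars; // user capped the total: per-field follows it
              } else {
                maxTotalChars = maxFieldValueChars; // user raised per-field: total follows it
              }
            }
            return new int[] { maxFieldValueChars, maxTotalChars };
          }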

        ASF subversion and git services added a comment -

        Commit 1643377 from steve_rowe in branch 'dev/trunk'
        [ https://svn.apache.org/r1643377 ]

        SOLR-3881: Avoid OOMs in LanguageIdentifierUpdateProcessor:

        • Added langid.maxFieldValueChars and langid.maxTotalChars params to limit
          input, by default 10k and 20k chars, respectively.
        • Moved input concatenation to Tika implementation; the langdetect
          implementation instead appends each input piece via the langdetect API.
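
        In solrconfig.xml the new limits sit alongside the existing langid params; a hedged example (the field names and the langid.langField value are illustrative):

          <processor class="solr.LangDetectLanguageIdentifierUpdateProcessorFactory">
            <str name="langid.fl">title,subject,body</str>
            <str name="langid.langField">language_s</str>
            <!-- added by SOLR-3881: cap chars taken per field value and in total -->
            <int name="langid.maxFieldValueChars">10000</int>
            <int name="langid.maxTotalChars">20000</int>
          </processor>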
        ASF subversion and git services added a comment -

        Commit 1643390 from steve_rowe in branch 'dev/branches/branch_5x'
        [ https://svn.apache.org/r1643390 ]

        SOLR-3881: Avoid OOMs in LanguageIdentifierUpdateProcessor:

        • Added langid.maxFieldValueChars and langid.maxTotalChars params to limit
          input, by default 10k and 20k chars, respectively.
        • Moved input concatenation to Tika implementation; the langdetect
          implementation instead appends each input piece via the langdetect API.
          (merged trunk r1643377)
        Steve Rowe added a comment -

        Committed to trunk and branch_5x.

        Thanks all!

        Anshum Gupta added a comment -

        Bulk close after 5.0 release.

        Shalin Shekhar Mangar added a comment -

        Reopening to backport to 4.10.4

        ASF subversion and git services added a comment -

        Commit 1662436 from shalin@apache.org in branch 'dev/branches/lucene_solr_4_10'
        [ https://svn.apache.org/r1662436 ]

        SOLR-3881: Avoid OOMs in LanguageIdentifierUpdateProcessor:

        • Added langid.maxFieldValueChars and langid.maxTotalChars params to limit
          input, by default 10k and 20k chars, respectively.
        • Moved input concatenation to Tika implementation; the langdetect
          implementation instead appends each input piece via the langdetect API.
        Michael McCandless added a comment -

        Bulk close for 4.10.4 release


          People

          • Assignee: Steve Rowe
          • Reporter: Rob Tulloh
          • Votes: 0
          • Watchers: 10
