Solr / SOLR-16589

Large fields with large="true" can be truncated when using unicode values

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 9.0, 9.1
    • Fix Version/s: main (10.0), 9.2, 9.1.1
    • Component/s: search
    • Labels: None

    Description

      Summary

      For fields declared with large="true", large values (the case the option is intended for) can be truncated in Solr 9.0 and later when they contain multi-byte Unicode characters.

      Example fieldtype definition:

      <fieldtype name="string_large"  class="solr.TextField" multiValued="false" indexed="false" stored="true" omitNorms="true" large="true" />

      Cause

      This appears to be a bug introduced with https://issues.apache.org/jira/browse/LUCENE-8805 / https://github.com/apache/lucene/issues/9849

      The current code is here:
      https://github.com/apache/solr/blob/main/solr/core/src/java/org/apache/solr/search/SolrDocumentFetcher.java#L511
       

      public void stringField(FieldInfo fieldInfo, String value) throws IOException {
          Objects.requireNonNull(value, "String value should not be null");
          bytesRef.bytes = value.getBytes(StandardCharsets.UTF_8);
          // Bug: value.length() counts UTF-16 chars, not the UTF-8 bytes stored above
          bytesRef.length = value.length();
      

       
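      This concerns the handling of "large" fields specifically.

      `value.length()` counts UTF-16 code units, while the array returned by `getBytes` holds UTF-8 bytes. For any value containing non-ASCII characters the byte count exceeds `value.length()`, so setting `bytesRef.length` to the string length cuts off the trailing bytes, hence the truncation.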
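
      The mismatch can be reproduced outside Solr. The following is a minimal sketch using only the JDK; the class and method names are hypothetical and not part of Solr. It decodes only the first `value.length()` bytes of the UTF-8 encoding, which is the effect the incorrect length has on a reader.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    /** Decodes only the first value.length() bytes, mimicking the incorrect length. */
    static String decodeWithCharCount(String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        // value.length() counts UTF-16 code units, not UTF-8 bytes
        return new String(bytes, 0, value.length(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String value = "h\u00e9llo"; // U+00E9 takes two bytes in UTF-8
        System.out.println(value.length());                                // 5
        System.out.println(value.getBytes(StandardCharsets.UTF_8).length); // 6
        // Only 5 of the 6 bytes are decoded, so the final 'o' is lost
        System.out.println(decodeWithCharCount(value));
    }
}
```

      Pure ASCII values are unaffected because each character encodes to exactly one byte, which may explain why the problem only shows up with Unicode content.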

      Fix

      bytesRef.length = bytesRef.bytes.length;
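
      As a rough check of the proposed assignment, the sketch below takes the length from the encoded array and verifies that the value round-trips. `Ref` is a hypothetical stand-in for Lucene's `BytesRef` that models only the `bytes` and `length` fields; it is an assumption made for illustration, not the Solr code.

```java
import java.nio.charset.StandardCharsets;

class LargeFieldFixSketch {
    /** Hypothetical stand-in for Lucene's BytesRef; models only the fields used here. */
    static final class Ref {
        byte[] bytes = new byte[0];
        int length;

        String utf8ToString() {
            return new String(bytes, 0, length, StandardCharsets.UTF_8);
        }
    }

    /** Fills a Ref as the proposed fix does: the length comes from the encoded array. */
    static Ref fill(String value) {
        Ref ref = new Ref();
        ref.bytes = value.getBytes(StandardCharsets.UTF_8);
        ref.length = ref.bytes.length; // byte count, not value.length()
        return ref;
    }

    public static void main(String[] args) {
        String value = "na\u00efve \u2603"; // 7 chars, 10 UTF-8 bytes
        Ref ref = fill(value);
        System.out.println(ref.length);                       // 10
        System.out.println(ref.utf8ToString().equals(value)); // true
    }
}
```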

       

            People

              Assignee: Kevin Risden (krisden)
              Reporter: Nikolas Osvalds (nosvalds)
              Votes: 0
              Watchers: 4

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 40m