Solr / SOLR-32

Result of select request is not well-formed XML when text field contains non-ASCII chars and ampersand

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.1.0
    • Component/s: search
    • Labels:
      None
    • Environment:

      Seen when running with the supplied Jetty container, macosx, JDK 1.5.0_06

      Description

      Starting with the supplied start.jar, the ampersand from this field is not correctly escaped in the XML search results provided by the select page:

      <?xml version="1.0" encoding="UTF-8"?>
      <add>
      <doc>
      <field name="id">amp-test-one</field>
      <field name="content">Les événements chez Bonnie &amp; Clyde.</field>
      </doc>
      </add>

      The "content" field is defined as a "text" field in the schema.

      Adding this document to the index and querying on "id:amp-test-one" returns:
      ...
      <doc>
      <str name="content">Les événements chez Bonnie & Clyde.&amp; Clyde.</str>
      <str name="id">amp-test-one</str>
      </doc>

      The first "Bonnie & Clyde" is unescaped, and it is followed by the correctly escaped &amp;.
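A response like the one above can be checked for well-formedness with a few lines of Java. The following is a minimal sketch using the JDK's built-in DOM parser. The class and method names (`WellFormedCheck`, `isWellFormed`) are hypothetical and not part of Solr, and the test strings are shortened versions of the fragment in this report.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: reports whether a string parses as well-formed XML. */
class WellFormedCheck {

    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser rejected the input, e.g. because of a raw '&'.
            return false;
        } catch (IOException | ParserConfigurationException e) {
            // Not a verdict on the input; the parser itself could not run.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Shortened form of the broken response: raw '&' before the escaped one.
        String broken = "<doc><str name=\"content\">Bonnie & Clyde.&amp; Clyde.</str></doc>";
        // What a correctly escaped response would look like.
        String fixed = "<doc><str name=\"content\">Bonnie &amp; Clyde.</str></doc>";
        System.out.println("broken: " + isWellFormed(broken));
        System.out.println("fixed:  " + isWellFormed(fixed));
    }
}
```

The default parser may also print a fatal-error message to stderr for the broken input; that is expected and does not affect the return value.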

      Browsing the index with Luke shows that the field is correctly stored.

      I think this might be a Jetty bug: patching Solr's util/XML class to avoid the use of Writer.write(String,start,len) fixes the problem. Maybe the Jetty ServletWriter gets confused by the presence of non-ASCII chars?

      It looks like the class used String.substring(...) before. Writer.write might be faster, but it seems to be broken in that environment.

      Here are my changes to util/XML.java:

      Index: src/java/org/apache/solr/util/XML.java
      ===================================================================
      --- src/java/org/apache/solr/util/XML.java (revision 422655)
      +++ src/java/org/apache/solr/util/XML.java (working copy)
      @@ -159,8 +159,8 @@
             }
             if (subst != null) {
               if (start<i) {
      -          // out.write(str.substring(start,i));
      -          out.write(str, start, i-start);
      +          out.write(str.substring(start,i));
      +          // out.write(str, start, i-start);
                 // n+=i-start;
               }
               out.write(subst);
      @@ -172,8 +172,8 @@
               out.write(str);
               // n += str.length();
             } else if (start<str.length()) {
      -        // out.write(str.substring(start));
      -        out.write(str, start, str.length()-start);
      +        out.write(str.substring(start));
      +        // out.write(str, start, str.length()-start);
               // n += str.length()-start;
             }
             // return n;
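For readers without the full source at hand, the patched hunks can be read in the context of a complete escape loop. The following is a self-contained sketch, not the actual contents of util/XML.java: the class name, method names, and the set of escaped characters are assumptions made for illustration. Only the two `substring` writes mirror the patch.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Illustrative sketch of an escape loop that copies runs via substring(). */
class XmlEscapeSketch {

    /** Writes str to out, replacing '&', '<' and '>' with XML entities. */
    static void escapeCharData(String str, Writer out) throws IOException {
        int start = 0; // start of the run that has not been written yet
        for (int i = 0; i < str.length(); i++) {
            String subst;
            switch (str.charAt(i)) {
                case '&': subst = "&amp;"; break;
                case '<': subst = "&lt;"; break;
                case '>': subst = "&gt;"; break;
                default:  subst = null;
            }
            if (subst != null) {
                if (start < i) {
                    // Workaround from the patch: write a whole String
                    // instead of out.write(str, start, i - start).
                    out.write(str.substring(start, i));
                }
                out.write(subst);
                start = i + 1;
            }
        }
        if (start == 0) {
            out.write(str); // nothing needed escaping
        } else if (start < str.length()) {
            out.write(str.substring(start)); // trailing unescaped run
        }
    }

    /** Convenience wrapper that escapes into a String. */
    static String escape(String str) {
        StringWriter sw = new StringWriter();
        try {
            escapeCharData(str, sw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sw.toString();
    }

    public static void main(String[] args) {
        // \u00e9 is 'é'; written as an escape so the source file stays ASCII.
        System.out.println(escape("Les \u00e9v\u00e9nements chez Bonnie & Clyde."));
    }
}
```

The trade-off is the one noted in the report: `substring` allocates a temporary String for each run, whereas the offset form of `write` does not, so the workaround gives up a little speed in exchange for correct output.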

        Activity

        Yonik Seeley added a comment -

        Yes, we had been having problems all along with Jetty and its UTF-8 writer.
        I just committed this (correctness before performance...)
        Thanks for tracking down the problem!

        Mike Klaas added a comment -

        Just wanted to confirm this problem in my application. It is an odd interaction between Unicode and Jetty's I/O stack: it seems to occur only when an offset write() into a String containing Unicode characters is preceded by a non-Unicode write (writing the whole string is fine, as is writing char arrays).

        The attached patch fixed the problem, though it was only necessary to convert the first write() to substring.
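The write sequence described in this comment can be turned into a small check that is run against any Writer implementation. The sketch below is an assumption-laden illustration: `OffsetWriteCheck` and its methods are hypothetical names, and the baseline shown is the JDK's own `OutputStreamWriter`, not Jetty's writer. To test a container's writer, a factory for that writer would have to be passed in instead.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Hypothetical harness for the "ASCII write, then offset write" sequence. */
class OffsetWriteCheck {

    /** Baseline factory: the JDK's UTF-8 writer. */
    static Writer jdkWriter(OutputStream os) {
        return new OutputStreamWriter(os, StandardCharsets.UTF_8);
    }

    /** Writes prefix, then text[off, off+len), and returns the decoded bytes. */
    static String writeSequence(Function<OutputStream, Writer> writerFactory,
                                String prefix, String text, int off, int len) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (Writer out = writerFactory.apply(bytes)) {
            out.write(prefix);         // plain ASCII write first
            out.write(text, off, len); // then an offset write into a non-ASCII String
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    /** True if the offset write produced the same text as substring() would. */
    static boolean offsetWriteIsCorrect(Function<OutputStream, Writer> writerFactory,
                                        String prefix, String text, int off, int len) {
        String expected = prefix + text.substring(off, off + len);
        return expected.equals(writeSequence(writerFactory, prefix, text, off, len));
    }

    public static void main(String[] args) {
        // \u00e9 is 'é'; written as an escape so the source file stays ASCII.
        String text = "Les \u00e9v\u00e9nements chez Bonnie & Clyde.";
        int off = text.indexOf('&') + 1; // start just after the ampersand
        System.out.println(offsetWriteIsCorrect(
                OffsetWriteCheck::jdkWriter, "<str>", text, off, text.length() - off));
    }
}
```

With the JDK writer this check is expected to pass; a writer affected by the problem described here would return text that differs from the `substring` result.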

        Yonik Seeley added a comment -

        Anyone know if this is still a problem in the latest Jetty6? Someone might want to follow up with them on it.

        Hoss Man added a comment -

        This bug was modified as part of a bulk update using the criteria...

        • Marked ("Resolved" or "Closed") and "Fixed"
        • Had no "Fix Version" versions
        • Was listed in the CHANGES.txt for 1.1

        The Fix Version for all 38 issues found was set to 1.1, email notification
        was suppressed to prevent excessive email.

        For a list of all the issues modified, search jira comments for this
        (hopefully) unique string: 20080415hossman3


          People

          • Assignee:
            Yonik Seeley
            Reporter:
            Bertrand Delacretaz
          • Votes:
            0
            Watchers:
            0
