Solr
  1. Solr
  2. SOLR-1091

"phps" (serialized PHP) writer produces invalid output

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: 1.3
    • Fix Version/s: 1.4
    • Component/s: search
    • Labels:
      None
    • Environment:

      Sun JRE 1.6.0 on Centos 5

      Description

      The serialized PHP output writer can outputs invalid string lengths for certain (unusual) input values. Specifically, I had a document containing the following 6 byte character sequence: \xED\xAF\x80\xED\xB1\xB8

      I was able to create a document in the index containing this value without issue; however, when fetching the document back out using the serialized PHP writer, it returns a string like the following:

      s:4:"􀁸";

      Note that the string length specified is 4, while the string is actually 6 bytes long.

      When using PHP's native serialize() function, it correctly sets the length to 6:

      1. php -r 'var_dump(serialize("\xED\xAF\x80\xED\xB1\xB8"));'
        string(13) "s:6:"􀁸";"

      The "wt=php" writer, which produces output to be parsed with eval(), doesn't have any trouble with this string.

      1. SOLR-1091.patch
        3 kB
        Yonik Seeley
      2. SOLR-1091.patch
        3 kB
        Yonik Seeley

        Activity

        Hide
        Grant Ingersoll added a comment -

        Bulk close for Solr 1.4

        Show
        Grant Ingersoll added a comment - Bulk close for Solr 1.4
        Hide
        Yonik Seeley added a comment -

        committed suggestions.

        Show
        Yonik Seeley added a comment - committed suggestions.
        Hide
        Hoss Man added a comment -

        reopening because of discssion on list and yonik's new patch.

        Yonik: this looks good, but i would suggest changing the system property to "solr.phps.cesu8" so people don't set it and then later forget what it's for.

        The javadoc are also a little abiguous about what people should set it to, how about...

        In order to support PHP Serialized strings with a proper byte count, This ResponseWriter 
        must know if the Writers passed to it will result in an output of CESU-8 (UTF-8 w/o support 
        for large code points outside of the BMP)
        
        Currently Solr assumes that all Jetty servlet containers (detected using the "jetty.home" 
        system property) use CESU-8 instead of UTF-8 (verified to the current release of 6.1.20).
        
        In installations where Solr auto-detects incorrectly, the Solr Administrator should set the
        "solr.phps.cesu8" system property to either "true" or "false" accordingly.
        
        Show
        Hoss Man added a comment - reopening because of discssion on list and yonik's new patch. Yonik: this looks good, but i would suggest changing the system property to "solr.phps.cesu8" so people don't set it and then later forget what it's for. The javadoc are also a little abiguous about what people should set it to, how about... In order to support PHP Serialized strings with a proper byte count, This ResponseWriter must know if the Writers passed to it will result in an output of CESU-8 (UTF-8 w/o support for large code points outside of the BMP) Currently Solr assumes that all Jetty servlet containers (detected using the "jetty.home" system property) use CESU-8 instead of UTF-8 (verified to the current release of 6.1.20). In installations where Solr auto-detects incorrectly, the Solr Administrator should set the "solr.phps.cesu8" system property to either " true " or " false " accordingly.
        Hide
        Yonik Seeley added a comment -

        Additional patch that allows overriding of the CESU-8 guess with the CESU8 system property.

        Show
        Yonik Seeley added a comment - Additional patch that allows overriding of the CESU-8 guess with the CESU8 system property.
        Hide
        Yonik Seeley added a comment -

        committed.

        Show
        Yonik Seeley added a comment - committed.
        Hide
        Yonik Seeley added a comment -

        Here's a patch that can handle the modified UTF8 that Jetty puts out, as well as speeding up the normal UTF8 case using Lucene's UTF8 encoding.

        modified UTF8 support is switched on if the jetty.home property is set (jetty does this by default).

        Show
        Yonik Seeley added a comment - Here's a patch that can handle the modified UTF8 that Jetty puts out, as well as speeding up the normal UTF8 case using Lucene's UTF8 encoding. modified UTF8 support is switched on if the jetty.home property is set (jetty does this by default).
        Hide
        frank farmer added a comment -

        Thanks for looking in to this. I take it this issue can be closed, and I should go open a ticket against Jetty?

        Show
        frank farmer added a comment - Thanks for looking in to this. I take it this issue can be closed, and I should go open a ticket against Jetty?
        Hide
        Yonik Seeley added a comment -

        Confirmed - it's jetty. The latest version of Jetty exhibits the same behavior - it produces incorrect UT8 for code points outside the BMP. I tried with LucidWorks/Solr (Tomcat based) and it works correctly by only outputting 4 bytes.

        Show
        Yonik Seeley added a comment - Confirmed - it's jetty. The latest version of Jetty exhibits the same behavior - it produces incorrect UT8 for code points outside the BMP. I tried with LucidWorks/Solr (Tomcat based) and it works correctly by only outputting 4 bytes.
        Hide
        Yonik Seeley added a comment -

        Echoing the param from the python writer (which escapes chars outside the ascii range) shows that the internal UTF-16 string after decoding the invalid UTF8 is \udbc0\udc78

        This represents unicode code point 1048696, which encoded into UTF8 should be
        f4 80 81 b8 (4 bytes).

        Thus, I'm thinking that it's perhaps a jetty bug in not being able to handle characters outside the BMP?

        Show
        Yonik Seeley added a comment - Echoing the param from the python writer (which escapes chars outside the ascii range) shows that the internal UTF-16 string after decoding the invalid UTF8 is \udbc0\udc78 This represents unicode code point 1048696, which encoded into UTF8 should be f4 80 81 b8 (4 bytes). Thus, I'm thinking that it's perhaps a jetty bug in not being able to handle characters outside the BMP?
        Hide
        Yonik Seeley added a comment -

        Hmmm, I thought this might be a decoding error. Instead it perhaps looks like the JVM not handling UTF8 encoding the same as Jetty.

        As one can see by this:
        $ curl --trace - 'http://localhost:8983/solr/select?q=abcdef&a=%ED%AF%80%ED%B1%B8&wt=phps'

        The ascii portion of the dump looks like:
        s:4:"......";

        So the JVM is telling us that the string will serialize to 4 bytes, and when Jetty actually does the serialization, it comes out to 6.

        Java's String.getBytes() does have the following warning:

        • <p> The behavior of this method when this string cannot be encoded in
        • the given charset is unspecified.
        Show
        Yonik Seeley added a comment - Hmmm, I thought this might be a decoding error. Instead it perhaps looks like the JVM not handling UTF8 encoding the same as Jetty. As one can see by this: $ curl --trace - 'http://localhost:8983/solr/select?q=abcdef&a=%ED%AF%80%ED%B1%B8&wt=phps' The ascii portion of the dump looks like: s:4:"......"; So the JVM is telling us that the string will serialize to 4 bytes, and when Jetty actually does the serialization, it comes out to 6. Java's String.getBytes() does have the following warning: <p> The behavior of this method when this string cannot be encoded in the given charset is unspecified.
        Hide
        frank farmer added a comment -

        My concern is not that solr do anything specific with this garbled data, only that wt=phps always returns a string that can be run through unserialize() without error.

        Here's the exact case in which I encountered this bug, which may help explain why I reported this issue in the first place:

        1) Somehow, a user inserted the aforementioned sequence of bytes in some user-editable content in my application.
        2) My code blindly passed that data directly into solr (in retrospect, I should probably be filtering anything that's not valid UTF-8)
        3) Users ran queries which included the affected document
        4) My code tried to unserialize() the output, and failed with a PHP error (simply replacing the offending "s:4:" with "s:6:" caused the output to unserialize without issue, however). This caused my users to be unable to retrieve results for many queries.

        Long story short, if you let users insert arbitrary byte sequences into your index (which I'll admit is naive, but I'm sure I'm not the only one who's done this), and you use wt=phps, a malicious user can effectively cause a DoS.

        Again, I don't care about actually getting these bytes back out of solr unmangled. I only care that the output of wt=phps make it through unserialize() without causing a PHP error.

        Show
        frank farmer added a comment - My concern is not that solr do anything specific with this garbled data, only that wt=phps always returns a string that can be run through unserialize() without error. Here's the exact case in which I encountered this bug, which may help explain why I reported this issue in the first place: 1) Somehow, a user inserted the aforementioned sequence of bytes in some user-editable content in my application. 2) My code blindly passed that data directly into solr (in retrospect, I should probably be filtering anything that's not valid UTF-8) 3) Users ran queries which included the affected document 4) My code tried to unserialize() the output, and failed with a PHP error (simply replacing the offending "s:4:" with "s:6:" caused the output to unserialize without issue, however). This caused my users to be unable to retrieve results for many queries. Long story short, if you let users insert arbitrary byte sequences into your index (which I'll admit is naive, but I'm sure I'm not the only one who's done this), and you use wt=phps, a malicious user can effectively cause a DoS. Again, I don't care about actually getting these bytes back out of solr unmangled. I only care that the output of wt=phps make it through unserialize() without causing a PHP error.
        Hide
        frank farmer added a comment -

        As far as I know, the string is garbage. However, PHP's serialize() operates on bytes, not characters. So any 6 byte string of data (whether those bytes are valid unicode, or just binary gibberish) should always be preceded by "s:6:".

        6 bytes in, 6 bytes out

        Show
        frank farmer added a comment - As far as I know, the string is garbage. However, PHP's serialize() operates on bytes, not characters. So any 6 byte string of data (whether those bytes are valid unicode, or just binary gibberish) should always be preceded by "s:6:". 6 bytes in, 6 bytes out
        Hide
        Yonik Seeley added a comment -

        Looking closer at the byte sequence, this really looks like invalid UTF8 since the 8th bit is set on every byte. The decoder is probably just doing the best that it can with this, but the result isn't going to be what you want in any case.

        Show
        Yonik Seeley added a comment - Looking closer at the byte sequence, this really looks like invalid UTF8 since the 8th bit is set on every byte. The decoder is probably just doing the best that it can with this, but the result isn't going to be what you want in any case.
        Hide
        Yonik Seeley added a comment -

        Thanks Frank, do you know what unicode code points this input corresponds to?
        I'm still not sure if those bytes are valid UTF8 (which would explain why you don't get 6 chars back) - if the input is incorrectly encoded, the output may not be what you expect.

        Show
        Yonik Seeley added a comment - Thanks Frank, do you know what unicode code points this input corresponds to? I'm still not sure if those bytes are valid UTF8 (which would explain why you don't get 6 chars back) - if the input is incorrectly encoded, the output may not be what you expect.
        Hide
        frank farmer added a comment -

        http://localhost:8983/solr/select?q=abcdef&a=%ED%AF%80%ED%B1%B8&wt=phps

        Clearly 6 bytes, but again solr's output claims only 4.

        Show
        frank farmer added a comment - http://localhost:8983/solr/select?q=abcdef&a=%ED%AF%80%ED%B1%B8&wt=phps Clearly 6 bytes, but again solr's output claims only 4.
        Hide
        Yonik Seeley added a comment -

        If someone can give a URL that demonstrates incorrect behavior before 1.4 is released, we can look at it.

        Since parameters are echoed back by default, please provide something of the following form to help us reproduce:
        http://localhost:8983/solr/select?q=abcdef&a=problematic_string&wt=phps

        Show
        Yonik Seeley added a comment - If someone can give a URL that demonstrates incorrect behavior before 1.4 is released, we can look at it. Since parameters are echoed back by default, please provide something of the following form to help us reproduce: http://localhost:8983/solr/select?q=abcdef&a=problematic_string&wt=phps
        Hide
        Yonik Seeley added a comment -

        Does anyone know if there is actually a bug here, and if so, how it should be fixed?

        Show
        Yonik Seeley added a comment - Does anyone know if there is actually a bug here, and if so, how it should be fixed?
        Hide
        Donovan Jimenez added a comment - - edited

        the character sequence specified seems to be windows 1252 - based on just getting it to look like what's in the description in a text editor.

        I see it getting the utf-8 byte length now, but is the output writer guaranteed to be outputting in utf-8? sorry if it's a naive question, I do see the content-type is text/plain; charset=utf-8 - does that change the output writer's encoding?

        if it IS guaranteed, then maybe its how the character data is added - i.e. the add xml was interpreted utf-8 but the data was actually windows-1252 or similar.

        Show
        Donovan Jimenez added a comment - - edited the character sequence specified seems to be windows 1252 - based on just getting it to look like what's in the description in a text editor. I see it getting the utf-8 byte length now, but is the output writer guaranteed to be outputting in utf-8? sorry if it's a naive question, I do see the content-type is text/plain; charset=utf-8 - does that change the output writer's encoding? if it IS guaranteed, then maybe its how the character data is added - i.e. the add xml was interpreted utf-8 but the data was actually windows-1252 or similar.
        Hide
        Yonik Seeley added a comment -

        Is this valid unicode? The serialized PHP writer already calculates the size of the UTF-8 encoded string, so it's difficult to see what's going on.

        Show
        Yonik Seeley added a comment - Is this valid unicode? The serialized PHP writer already calculates the size of the UTF-8 encoded string, so it's difficult to see what's going on.
        Hide
        Donovan Jimenez added a comment -

        I haven't look at the java code, but my guess would be its setting string length as the character length (which is encoding dependent) where serialized php actually uses the byte length so it can be binary safe (independent of encodings used for textual strings).

        Show
        Donovan Jimenez added a comment - I haven't look at the java code, but my guess would be its setting string length as the character length (which is encoding dependent) where serialized php actually uses the byte length so it can be binary safe (independent of encodings used for textual strings).

          People

          • Assignee:
            Yonik Seeley
            Reporter:
            frank farmer
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development