CouchDB / COUCHDB-1176

CouchDB accepts data which it cannot replicate (invalid UTF-8 json during replication)

    Details

    • Type: Bug
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 1.0.1, 1.0.2
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      CentOS 5.5 64bit

      Description

      CouchDB appears to treat some Unicode characters as illegal when it parses escaped Unicode values (\uXXXX) during insert or update of a document. The same characters can, however, be inserted into the database by sending them UTF-8 encoded instead of escaped. One example is the code point 0xFFFE, which is escaped as \uFFFE and is represented in UTF-8 by the three consecutive bytes 0xEF 0xBF 0xBE.
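The two encodings described above can be sketched in Python. This is only an illustration: the field name "abc" is taken from the error message in this report, and the assumption that the attached files contain a single-field document like this one is mine, since their contents are not shown here.

```python
import json

# Hypothetical document; "abc" is the field name seen in the error message.
doc = {"abc": "\ufffe"}

# Escaped form: the character is written as the six ASCII characters \ufffe.
escaped_body = json.dumps(doc, ensure_ascii=True).encode("ascii")

# UTF-8 form: the character is written as the raw bytes 0xEF 0xBF 0xBE.
utf8_body = json.dumps(doc, ensure_ascii=False).encode("utf-8")

print(escaped_body)
print(utf8_body)

# Both bodies describe the same JSON document.
print(json.loads(escaped_body) == json.loads(utf8_body))
```

The two bodies differ only in how the one character is spelled on the wire, which is why accepting one and rejecting the other is inconsistent.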

      Even though documents sent in UTF-8 encoding are inserted without errors, CouchDB always serves them in the escaped form. This leads to the actual problem we currently have: if documents containing such rejected characters are inserted into CouchDB using UTF-8 encoding, an attempt to replicate the database aborts at the first of those documents with an error like this:
      {"error":"json_encode","reason":"{bad_term,{nocatch,{invalid_json,<<\"[{\\\"ok\\\":{\\\"_id\\\":\\\"192058c4f81afc66c5bf883548004331\\\",\\\"_rev\\\":\\\"1-ad1c9dcee520d12abdf948d91e31cf15\\\",\\\"abc\\\":\\\"\\\\ufffe\\\",\\\"_revisions\\\":{\\\"start\\\":1,\\\"ids\\\":[\\\"ad1c9dcee520d12abdf948d91e31cf15\\\"]}}}]\\n\">>}}}"}

      Here are steps to reproduce:

      curl -X PUT http://localhost:5984/replicationtest_source
      curl -X PUT http://localhost:5984/replicationtest_target

      1. Should fail
        curl -H "Content-Type:application/json" -X POST -d @fffe_escaped.json http://localhost:5984/replicationtest_source
      2. Should succeed
        curl -H "Content-Type:application/json" -X POST -d @fffe_utf8.json http://localhost:5984/replicationtest_source
      3. Should fail with a json_encode error caused by the previously inserted document
        curl -H "Content-Type:application/json" -X POST -d "{\"source\":\"http://localhost:5984/replicationtest_source\",\"target\":\"replicationtest_target\"}" http://localhost:5984/_replicate

      If anyone has a quick fix for this (how to accept "invalid" escaped unicode characters at least during replication), we would be more than happy to test it.

      1. fffe_utf8.json
        0.0 kB
        Jaakko Sipari
      2. fffe_escaped.json
        0.0 kB
        Jaakko Sipari
      3. COUCHDB-1176.patch
        2 kB
        Paul Joseph Davis

        Activity

        Nuutti Kotivuori added a comment -

        Looks like that fixes the issue, but I wonder about its performance. There probably is a reason for the fast path decoding of UTF-8 strings.

        For actually fast decoding of UTF-8, you can see the excellent source code by Bjoern Hoehrmann here: http://bjoern.hoehrmann.de/utf-8/decoder/dfa/

        It is in C, but adapting the same approach (DFA) to Erlang is likely to produce really good results, if speed matters here.
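As a rough illustration of the state-machine idea, here is a simplified Python validator. It is not Hoehrmann's table-driven decoder and it is not Erlang; it is a hand-written sketch that follows the well-formed byte ranges of the Unicode standard, and the function name is hypothetical.

```python
def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (simplified state machine)."""
    need = 0               # continuation bytes still expected
    lo, hi = 0x80, 0xBF    # allowed range for the next continuation byte
    for b in data:
        if need == 0:
            if b <= 0x7F:
                continue                       # ASCII
            elif 0xC2 <= b <= 0xDF:
                need = 1                       # 2-byte sequence
            elif b == 0xE0:
                need, lo = 2, 0xA0             # exclude overlong 3-byte forms
            elif 0xE1 <= b <= 0xEC or 0xEE <= b <= 0xEF:
                need = 2
            elif b == 0xED:
                need, hi = 2, 0x9F             # exclude surrogates
            elif b == 0xF0:
                need, lo = 3, 0x90             # exclude overlong 4-byte forms
            elif 0xF1 <= b <= 0xF3:
                need = 3
            elif b == 0xF4:
                need, hi = 3, 0x8F             # exclude values above 0x10FFFF
            else:
                return False                   # invalid lead byte
        else:
            if not lo <= b <= hi:
                return False
            lo, hi = 0x80, 0xBF                # later continuations use the full range
            need -= 1
    return need == 0                           # reject truncated sequences


print(is_well_formed_utf8(bytes([239, 191, 190])))  # the bytes from this report
print(is_well_formed_utf8(b"\xed\xa0\x80"))         # an encoded surrogate
```

Note that a validator of this kind checks byte-level well-formedness only. The bytes 0xEF 0xBF 0xBE pass it, so rejecting 0xFFFE would still need a separate check on the decoded code point.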

        Paul Joseph Davis added a comment -

        I should've noted that, amusingly, xmerl_ucs:from_utf8([239, 191, 190]) returns [65534] without error. Hence the need to wrap the return value in to_utf8 to check that it survives the round trip.

        Paul Joseph Davis added a comment -

        This should fix it.

        Nuutti Kotivuori added a comment -

        The bug is in mochijson2.erl, where tokenize_string_fast (which is hand-written) allows invalid UTF-8, whereas tokenize_string uses xmerl_ucs:to_utf8 to convert escapes to UTF-8. This is directly from the documentation of xmerl:

        %%% UTF-8 support
        %%% Possible errors encoding UTF-8:
        %%% - Non-character values (something other than 0 .. 2^31-1).
        %%% - Surrogate pair code in string.
        %%% - 16#FFFE or 16#FFFF character in string.

        Either the same values should be rejected by tokenize_string_fast, or both places should accept the values.
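One way to picture "the same rule in both places" is a single check applied to the decoded code points, regardless of whether they arrived escaped or as raw UTF-8. The Python sketch below mirrors the three conditions in the quoted xmerl comment; the function names are hypothetical, and reading "surrogate pair code" as the range 0xD800..0xDFFF is my assumption.

```python
def rejected_by_xmerl_rules(cp: int) -> bool:
    """True if the quoted xmerl comment lists this code point as an error."""
    return (
        not 0 <= cp <= 2**31 - 1       # outside 0 .. 2^31-1
        or 0xD800 <= cp <= 0xDFFF      # surrogate code (assumed range)
        or cp in (0xFFFE, 0xFFFF)      # explicitly listed values
    )


def find_rejected(value):
    """Walk a parsed JSON value and collect offending code points from its strings."""
    if isinstance(value, str):
        return [ord(ch) for ch in value if rejected_by_xmerl_rules(ord(ch))]
    if isinstance(value, dict):
        found = []
        for key, item in value.items():
            found += find_rejected(key) + find_rejected(item)
        return found
    if isinstance(value, list):
        return [cp for item in value for cp in find_rejected(item)]
    return []  # numbers, booleans and null carry no characters


print(find_rejected({"abc": "\ufffe"}))
print(find_rejected({"abc": "plain text"}))
```

Running such a check after parsing, rather than inside one of the two tokenizer paths, is what would make the escaped and UTF-8 forms behave identically. Whether the right outcome is to reject or to accept these values is the open question in this issue.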

        Pasi Eronen added a comment -

        Also tested with branches/1.0.x and branches/1.1.x (as of today), with the same result.

        Jaakko Sipari added a comment -

        Here are the files to be used with the curl commands.


          People

          • Assignee: Unassigned
          • Reporter: Jaakko Sipari
          • Votes: 1
          • Watchers: 2
