Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.1.1
    • Fix Version/s: 1.2, 1.2.1
    • Component/s: Replication
    • Labels:
    • Environment:

      Description

      It seems our replication has stopped, reporting an error:

      [emulator] Error in process <0.21599.306> nocatch,{invalid_json,<<0 bytes>>,[
      {couch_util,json_decode,1},
      {couch_rep_reader,'-open_doc_revs/3-lc$^1/1-1-',1},
      {couch_rep_reader,'-open_doc_revs/3-lc$^1/1-1-',1},
      {couch_rep_reader,open_doc_revs,3},
      {couch_rep_reader,'spawn_document_request/4-fun-0'...

      It was all working until we upgraded some other couches in our replication "web" from couch 1.0.3 to couch 1.1.1. We then set off database and view compactions, and sometime overnight some of the replication links stopped.

      I have curled the command myself, both as a multipart message and a single JSON response (with the header "Accept: application/json"), and it can be parsed correctly by Python simplejson - I have attached it here as well, called "troublecurl-redacted.txt" - though it is 18.8 MB. The request takes about 6 seconds.

      I don't quite understand why it is reported as invalid JSON. Other reports similar to this that I googled mentioned blank document ids, but I can't see any of these.

      1. old_rep_qs_encoding.patch
        2 kB
        Filipe Manana
      2. 9c94ed0e23508f4ec3d18f8949c06a5b replicaton from wireshark cut.txt
        324 kB
        Alex Markham
      3. replication error wireshark.txt
        8 kB
        Alex Markham
      4. source couch error.log
        7 kB
        Alex Markham
      5. target couch error.log
        23 kB
        Alex Markham

        Activity

        Filipe Manana added a comment -

        This seems similar to COUCHDB-1327.

        Can you try replicating after setting "http_pipeline_size" to 1 (in the .ini config, or via Futon, or REST api)?
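        (A minimal sketch of setting this at runtime from an attached Erlang shell, assuming the option lives in the [replicator] section; the exact key name may differ between versions:)

            %% set the replicator pipeline size to 1 for this node only,
            %% without persisting the change back to the .ini file
            couch_config:set("replicator", "max_http_pipeline_size", "1", false).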

        Robert Newson added a comment -

        It seems an attempt is being made to decode an empty Erlang binary as JSON, which correctly fails. So the real question is: why is that happening?
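        (For reference, a quick sketch of what that failure looks like from a couch node's Erlang shell; it matches the nocatch {invalid_json, <<0 bytes>>, ...} term in the description:)

            %% couch_util:json_decode/1 throws when it cannot parse its input,
            %% and an empty binary is not valid JSON
            1> couch_util:json_decode(<<>>).
            ** exception throw: {invalid_json,<<>>}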

        Alex Markham added a comment -

        Still happens when [replicator] max_http_pipeline_size=1 is set on both source and target of the replication (both couchdb 1.1.1).

        Filipe Manana added a comment -

        Alex, the only cases I've seen this happening are due to the pipeline, or to documents in the source with an empty ID (or one consisting only of white space). Or perhaps there's some firewall in the middle interfering.
        Also, for a quick test, do you think you can try pull replicating with the 1.2.x branch? (It's not an official release, only for testing.) That branch has a new replicator.

        thanks

        Alex Markham added a comment - edited

        I did a tcp dump while the crash happened. At the time of the stack trace, this tcp call (attached as "replication error wireshark.txt") is the only one, and it seems mochiweb responds with this:

        HTTP/1.1 400 Bad Request
        Server: MochiWeb/1.0 (Any of you quaids got a smint?)
        Date: Wed, 16 Nov 2011 11:20:27 GMT
        Content-Length: 0

        Then there is a TCP reset.

        But I can't see any 400 responses listed in the log of the source couch.

        Edit: When I curl the request, as seen in "replication error wireshark.txt", I get a 400 error until I reduce the size of the query by 5 characters (by shortening one of the open_revs ids), and then it succeeds. Is there some kind of request size limit being hit in Mochiweb before it gets to couchdb proper?

        Alex Markham added a comment -

        Also attached "9c94ed0e23508f4ec3d18f8949c06a5b replicaton from wireshark cut.txt", which looks like the last replication transfer before the mochiweb 400 error. It is a large 19 MB request, but I've snipped out the middle bit.

        This request is ~3 seconds before the stack trace. Could the two blank lines at the bottom, after the JSON close but before the 0, be the zero bytes to look for?

        1
        ]
        1

        0

        Filipe Manana added a comment -

        Alex, good observation.
        The last 1 is a chunk which contains only a \n character. This is intentionally added by couch for all json responses:

        https://github.com/apache/couchdb/blob/1.1.x/src/couchdb/couch_httpd.erl#L654

        I'm unsure whether this would be the cause of your problem (chunks are separated by \r\n). If you're able to build from source, you can try commenting out the line which sends the chunk containing only the \n. This change would be on the source database server.
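        (Reading the tail Alex pasted as chunked transfer encoding, the pieces line up like this; an annotated sketch, not an exact byte dump:)

            1      <- chunk size (one byte)
            ]      <- chunk data: the closing bracket of the JSON body
            1      <- chunk size (one byte)
            \n     <- chunk data: the single newline couch appends, which renders as a blank line
            0      <- final zero-length chunk terminating the response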

        Alex Markham added a comment -

        I've just seen that the replication request to which mochiweb responds with the "400 Bad Request" error is 8197 bytes. When I remove 5 chars (to make it exactly 8 kB) then it works.
        Is there a setting for the max mochiweb request size? Or otherwise a limit to stop the couch replicator sending a request larger than 8192 bytes?

        Filipe Manana added a comment -

        Alex, that 8197 byte length is for what exactly?
        I selected the whole HTTP request region in Emacs (starting from GET up to the blank line after Content-Length: 0) and Emacs tells me that there are 7124 characters in the request (including the CRLFs separating headers).

        Your issue, however, has happened before. Can you try the following patch:

        Index: src/couchdb/couch_rep_reader.erl
        ===================================================================
        --- src/couchdb/couch_rep_reader.erl (revision 1177549)
        +++ src/couchdb/couch_rep_reader.erl (working copy)
        @@ -177,7 +177,7 @@
             hd(State#state.opened_seqs).

         split_revlist(Rev, {[CurrentAcc|Rest], BaseLength, Length}) ->
        -    case Length+size(Rev)+3 > 8192 of
        +    case Length+size(Rev)+3 > 8000 of
             false ->
                 {[[Rev|CurrentAcc] | Rest], BaseLength, Length+size(Rev)+3};
             true ->

        Pasted at http://friendpaste.com/3DVTw8HgpvcUsOECEO1eoI as well.

        Thanks for debugging this

        Alex Markham added a comment -

        The 8197 bytes included everything in the curl, i.e. the 'hostname:5984/database/doc?args'.
        I'll try this patch, thanks for the help!

        Alex Markham added a comment -

        Hi Filipe, that patch fixed it and now replication works on all our nodes again. Our couchdb 1.1.1 is now patched for this issue and COUCHDB-1333 and seems to be working smoothly.
        Thanks for the help!

        Filipe Manana added a comment -

        Alex, that's great news.
        I found that the problem is that mochiweb (our http server) doesn't allow request lines to exceed 8192 characters. We were accounting for this, but ignoring the extra characters added by "GET " and the trailing " HTTP/1.1\r\n" in the first request line. So my final patch is now this:

        http://friendpaste.com/1TWqKb1Ac2hmKYh7VgNMLO

        Index: src/couchdb/couch_rep_reader.erl
        ===================================================================
        --- src/couchdb/couch_rep_reader.erl (revision 1177549)
        +++ src/couchdb/couch_rep_reader.erl (working copy)
        @@ -177,7 +177,7 @@
             hd(State#state.opened_seqs).

         split_revlist(Rev, {[CurrentAcc|Rest], BaseLength, Length}) ->
        -    case Length+size(Rev)+3 > 8192 of
        +    case Length+size(Rev)+3 >= 8192 of
             false ->
                 {[[Rev|CurrentAcc] | Rest], BaseLength, Length+size(Rev)+3};
             true ->
        @@ -214,7 +214,9 @@
             %% MochiWeb into multiple requests
             BaseQS = [{revs,true}, {latest,true}, {att_encoding_info,true}],
             BaseReq = DbS#http_db{resource=encode_doc_id(DocId), qs=BaseQS},
        -    BaseLength = length(couch_rep_httpc:full_url(BaseReq) ++ "&open_revs=[]"),
        +    BaseLength = length(
        +        "GET " ++ couch_rep_httpc:full_url(BaseReq) ++
        +        "&open_revs=[]" ++ " HTTP/1.1\r\n"),
             {RevLists, _, _} = lists:foldl(fun split_revlist/2,
                 {[[]], BaseLength, BaseLength}, couch_doc:revs_to_strs(Revs)),
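        (For reference, the request-line overhead that was previously ignored works out to 15 bytes; a quick sketch, not part of the patch:)

            %% "GET " contributes 4 bytes and " HTTP/1.1\r\n" contributes 11,
            %% so a URL of 8178 bytes or more already pushes the request line past 8192
            1> length("GET ") + length(" HTTP/1.1\r\n").
            15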

        Thanks for testing and helping debug it.

        Filipe Manana added a comment -

        Fix applied to master, 1.2.x and 1.1.x

        Alex Markham added a comment -

        That 2nd patch doesn't work for me; I get the invalid JSON error again. I think it's because it doesn't include the other extra args after open_revs, such as '&revs=true&latest=true&att_encoding_info=true'.

        Robert Newson added a comment -

        Actually it does (they're baked in by couch_rep_httpc:full_url/1). I think the mistake is later, when accounting for ?open_revs=[].

        Filipe Manana added a comment -

        Although not visible in your capture log, a possible mistake here is that the code is incrementing by +1 for each of the following characters: "[", "]", ",", "\"".
        Some of them must be percent encoded, and others, like the double quote, normally are (http://en.wikipedia.org/wiki/Percent-encoding).
        We should increment by +3 for each of them, not +1.
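        (A one-line illustration of why +1 under-counts: once percent-encoded, each of those delimiters occupies three bytes:)

            %% "[" -> "%5B", "]" -> "%5D", "," -> "%2C", "\"" -> "%22"
            1> length("%5B").
            3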

        Alex, new patch (on top of the previous one):

        http://friendpaste.com/3cPf7tFlNOeTZedlmgfDu2

        Filipe Manana added a comment -

        One thing I noticed is that couch_rep_httpc:full_url/1 is not percent encoding values in the query string.
        It should, especially for those which are JSON, like ?open_revs.

        RFC 3986 says so:
        http://tools.ietf.org/html/rfc3986#section-2.2

        This won't fix Alex's problem, however. But proxies and other middleware might have trouble with it, and it's better to follow the spec. Patch attached to address this.

        For Alex's problem, perhaps it's just better to lower the limit from 8192 to something significantly lower (6000, 7000).

        Adam Kocoloski added a comment -

        +1 on lowering the limit to something significantly below 8k. This ticket is at least the third time we've had to address a replication bug related to too-long URLs for docs with lots of revisions. The performance penalty for making multiple requests unnecessarily is insignificant.

        Filipe Manana added a comment -

        Reduced the limit to 7000 on all active branches and applied the percent encoding of query string values on master and 1.2.x only.


          People

          • Assignee: Unassigned
          • Reporter: Alex Markham
          • Votes: 0
          • Watchers: 1
