  1. CouchDB
  2. COUCHDB-2626

Explain N last replication failures

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Replication
    • Labels: None

    Description

      This has become quite a popular question: "I ran a replication of over 9K documents, but at the end the replication reports N document write failures. How can I get the IDs of those documents and the reason they failed?" The common answer is to parse the logs for the errors. Not cool.

      The idea is to include in the replication stats a list of the document IDs that failed to be stored, together with the reason for each failure. This could look like:

      {
          "history": [
              {
                  "doc_write_failures": 2,
                  "doc_write_failures_explained": {
                      "foo": {
                          "1-abc": {
                              "error": "forbidden",
                              "reason": "bad field bar"
                          },
                          "1-cde": {
                              "error": "forbidden",
                              "reason": "bad field baz"
                          }
                      }
                  },
                  "docs_read": 10,
                  "docs_written": 10,
                  "end_last_seq": 28,
                  "end_time": "Sun, 11 Aug 2013 20:38:50 GMT",
                  "missing_checked": 10,
                  "missing_found": 10,
                  "recorded_seq": 28,
                  "session_id": "142a35854a08e205c47174d91b1f9628",
                  "start_last_seq": 1,
                  "start_time": "Sun, 11 Aug 2013 20:38:50 GMT"
              }
          ],
          "ok": true,
          "replication_id_version": 3,
          "session_id": "142a35854a08e205c47174d91b1f9628",
          "source_last_seq": 28
      }
      

      I.e. just add a mapping keyed by document ID, where each value is a mapping from revision to the error info.
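
      As a minimal sketch of how a client might consume that mapping, assuming the proposed "doc_write_failures_explained" field existed (it is only a suggestion in this issue, not an actual CouchDB field, and the helper name list_failures is hypothetical):

```python
import json


def list_failures(response):
    """Yield (doc_id, rev, error, reason) for every explained failure.

    Assumes the hypothetical "doc_write_failures_explained" field proposed
    in this issue: {doc_id: {rev: {"error": ..., "reason": ...}}}.
    """
    for entry in response.get("history", []):
        explained = entry.get("doc_write_failures_explained", {})
        for doc_id, revs in explained.items():
            for rev, info in revs.items():
                yield doc_id, rev, info.get("error"), info.get("reason")


# Trimmed copy of the sample response from the description.
sample = json.loads("""
{
    "history": [
        {
            "doc_write_failures": 2,
            "doc_write_failures_explained": {
                "foo": {
                    "1-abc": {"error": "forbidden", "reason": "bad field bar"},
                    "1-cde": {"error": "forbidden", "reason": "bad field baz"}
                }
            }
        }
    ],
    "ok": true
}
""")

failures = list(list_failures(sample))
for doc_id, rev, error, reason in failures:
    print(f"{doc_id} @ {rev}: {error} ({reason})")
```

      Entries without the field simply yield nothing, so the same loop would work against a response that lacks it.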

      However, we shouldn't collect all the failure explanations. You may easily have thousands of failures because of a bug in a validate_doc_update function, and in that case the explanations would make the checkpoint documents far too large. To avoid this, the number of stored explanations could be capped at some configurable value, like 50, keeping only the last N failures. This is usually enough to understand the source of the problem, fix it and rerun the replication.
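
      The capping could be sketched roughly as below. This is illustrative Python, not replicator code: the FailureLog class, its method names and the default limit of 50 are assumptions made for the example. The total counter keeps growing while only the most recent explanations are retained.

```python
from collections import deque


class FailureLog:
    """Count every write failure but keep details for only the last `limit`."""

    def __init__(self, limit=50):
        self.total = 0
        # A deque with maxlen drops the oldest entry once it is full.
        self._recent = deque(maxlen=limit)

    def record(self, doc_id, rev, error, reason):
        self.total += 1
        self._recent.append((doc_id, rev, {"error": error, "reason": reason}))

    def to_stats(self):
        """Build the two stats fields in the shape proposed in this issue."""
        explained = {}
        for doc_id, rev, info in self._recent:
            explained.setdefault(doc_id, {})[rev] = info
        return {
            "doc_write_failures": self.total,
            "doc_write_failures_explained": explained,
        }


# Usage: three failures with a limit of two keeps only the last two.
log = FailureLog(limit=2)
log.record("a", "1-abc", "forbidden", "bad field bar")
log.record("b", "1-cde", "forbidden", "bad field baz")
log.record("c", "1-fgh", "forbidden", "bad field qux")
stats = log.to_stats()
```

      Keeping the full count next to the truncated details lets a reader of the stats see that explanations were dropped.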

          People

            Assignee: Unassigned
            Reporter: Alexander Shorin (kxepal)