[COUCHDB-2626] Explain N last replication failures - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Major
Resolution: Won't Fix
Affects Version/s: None
Fix Version/s: None
Component/s: Replication
Labels: None

Description

This has become quite a popular question: "I ran a replication of over 9K documents, but at the end the replication reported N document write failures. How can I get the IDs of those documents and the reason for each failure?" The common answer is to parse the logs for the errors. Not cool.

The idea is to include in the replication stats a list of the document IDs that failed to be stored, along with the reason for each failure. This could look like:

{
    "history": [
        {
            "doc_write_failures": 2,
            "doc_write_failures_explained": {
                "foo": {
                    "1-abc": {
                        "error": "forbidden",
                        "reason": "bad field bar"
                    },
                    "1-cde": {
                        "error": "forbidden",
                        "reason": "bad field baz"
                    }
                }
            },
            "docs_read": 10,
            "docs_written": 10,
            "end_last_seq": 28,
            "end_time": "Sun, 11 Aug 2013 20:38:50 GMT",
            "missing_checked": 10,
            "missing_found": 10,
            "recorded_seq": 28,
            "session_id": "142a35854a08e205c47174d91b1f9628",
            "start_last_seq": 1,
            "start_time": "Sun, 11 Aug 2013 20:38:50 GMT"
        }
    ],
    "ok": true,
    "replication_id_version": 3,
    "session_id": "142a35854a08e205c47174d91b1f9628",
    "source_last_seq": 28
}

That is, just add a mapping keyed by document ID, where each value is itself a mapping from revision to the error info.
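To illustrate how a client might consume the mapping described above, here is a minimal sketch in Python. It assumes the `doc_write_failures_explained` field exactly as proposed in this issue (it is a proposal, not a field of any released CouchDB response), and the helper name `iter_write_failures` is hypothetical.

```python
def iter_write_failures(response):
    """Yield (doc_id, rev, error, reason) tuples from a replication response.

    Assumes the *proposed* `doc_write_failures_explained` field:
    {doc_id: {rev: {"error": ..., "reason": ...}}}.
    History entries without the field are skipped.
    """
    for session in response.get("history", []):
        explained = session.get("doc_write_failures_explained", {})
        for doc_id, revs in explained.items():
            for rev, info in revs.items():
                yield doc_id, rev, info.get("error"), info.get("reason")


# Trimmed-down version of the sample response from the description.
sample = {
    "history": [
        {
            "doc_write_failures": 2,
            "doc_write_failures_explained": {
                "foo": {
                    "1-abc": {"error": "forbidden", "reason": "bad field bar"},
                    "1-cde": {"error": "forbidden", "reason": "bad field baz"},
                }
            },
        }
    ],
    "ok": True,
}

for doc_id, rev, error, reason in iter_write_failures(sample):
    print(f"{doc_id} @ {rev}: {error} ({reason})")
```

With a flat stream of tuples like this, the failures can be filtered or grouped by reason without touching the server logs.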

However, we shouldn't collect all the failure explanations: you can easily get thousands of failures because of a bug in a validate_doc_update function, and in that case the checkpoint documents would grow too large. To avoid this, the number of stored explanations could be capped by a configurable limit, say 50, so that only the last N failures are kept. This is usually enough to understand the source of the problem, fix it, and rerun the replication.
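The "keep only the last N" idea can be sketched with a bounded queue. The following is an illustration in Python only; the CouchDB replicator itself is written in Erlang, and the class name `FailureLog`, its methods, and the `limit` parameter are all hypothetical names, not part of any existing API. The total failure count is tracked separately so that `doc_write_failures` stays accurate even after older explanations are dropped.

```python
from collections import deque


class FailureLog:
    """Keeps the total failure count plus details for the last `limit` failures."""

    def __init__(self, limit=50):
        # deque(maxlen=...) silently discards the oldest entry when full.
        self._recent = deque(maxlen=limit)
        self.total = 0

    def record(self, doc_id, rev, error, reason):
        self.total += 1
        self._recent.append((doc_id, rev, {"error": error, "reason": reason}))

    def to_stats(self):
        """Build the two stats fields in the shape proposed in this issue."""
        explained = {}
        for doc_id, rev, info in self._recent:
            explained.setdefault(doc_id, {})[rev] = info
        return {
            "doc_write_failures": self.total,
            "doc_write_failures_explained": explained,
        }


# Usage: with limit=2, the first of three failures is dropped from the details.
log = FailureLog(limit=2)
log.record("a", "1-x", "forbidden", "bad field bar")
log.record("b", "1-y", "forbidden", "bad field baz")
log.record("c", "1-z", "forbidden", "bad field qux")
stats = log.to_stats()
print(stats["doc_write_failures"], sorted(stats["doc_write_failures_explained"]))
```

Because the cap bounds the size of the stored explanations, a buggy validation function that rejects every document adds at most N entries to the checkpoint.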

People

Assignee: Unassigned

Reporter: Alexander Shorin

Votes: 0

Watchers: 2

Dates

Created: 25/Feb/15 23:09

Updated: 02/Sep/18 07:02

Resolved: 02/Sep/18 07:02