  1. Hadoop HDFS
  2. HDFS-14603 Über-JIRA: HDFS RBF stabilization phase II
  3. HDFS-15885

RBF: Data loss when Router setup connection timeout

Details

    • Type: Sub-task
    • Status: Open
    • Priority: Critical
    • Resolution: Unresolved
    • Component/s: rbf

    Description

      I recently hit a corner case that can lose data; it is very similar to HDFS-15079.
      Consider the following case:
      A. The Client first sends a `create` RPC request to Router A. Router A tries to set up a new connection to the NameNode for this request, but the connection is not established in time.
      B. The Client fails over to Router B because the request times out (60s by default, IIRC).
      C. Router B works normally (handling the `create` and `complete` RPCs) and returns to the Client.
      D. After a while (more than 10min), Router A recovers and sends `create` to the NameNode again; the file is overwritten and the data is lost.
      I should note that in my deployment the Router forwards the RPC with the Client's own ClientId and CallId, rather than ones generated by the Router.
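
The sequence A to D can be illustrated with a toy model. This is only a sketch under assumptions: `ToyNameNode` is a hypothetical stand-in (not Hadoop's actual RetryCache or NameNode code), it deduplicates calls by ClientId plus CallId for 10 minutes (the default expiry cited in this issue), and `create` is assumed to overwrite an existing file.

```java
import java.util.HashMap;
import java.util.Map;

final class LateRetryDemo {
    /** Hypothetical toy NameNode: a retry cache keyed by ClientId + CallId. */
    static final class ToyNameNode {
        static final long EXPIRY_MS = 10 * 60 * 1000L; // 10 min, per the issue text

        private final Map<String, Long> retryCache = new HashMap<>(); // call key -> first seen
        private final Map<String, String> files = new HashMap<>();    // path -> content

        /** Returns true if the create was executed, false if it was treated as a retry. */
        boolean create(String clientId, int callId, String path, long nowMs) {
            String key = clientId + "#" + callId;
            Long seenAt = retryCache.get(key);
            if (seenAt != null && nowMs - seenAt < EXPIRY_MS) {
                return false; // still cached: recognized as a retry, file untouched
            }
            retryCache.put(key, nowMs);
            files.put(path, ""); // assumed overwrite semantics: file starts empty again
            return true;
        }

        void write(String path, String data) {
            files.put(path, data);
        }

        String read(String path) {
            return files.get(path);
        }
    }

    /** Replays steps C and D; delayMs is when Router A's stale create finally arrives. */
    static String contentAfterLateCreate(long delayMs) {
        ToyNameNode nn = new ToyNameNode();
        nn.create("client-1", 7, "/data/f", 0L);      // step C: create via Router B
        nn.write("/data/f", "payload");               // step C: data written and completed
        nn.create("client-1", 7, "/data/f", delayMs); // step D: same call via Router A
        return nn.read("/data/f");
    }

    public static void main(String[] args) {
        System.out.println("retry after 5 min:  '" + contentAfterLateCreate(5 * 60 * 1000L) + "'");
        System.out.println("retry after 15 min: '" + contentAfterLateCreate(15 * 60 * 1000L) + "'");
    }
}
```

In this model the 5-minute retry is deduplicated and the content survives, while the 15-minute retry is treated as a fresh `create` and empties the file.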
      After digging deeper, we found that connection setup can take a very long time when there are network issues. In the worst case it takes (60 * 3 + 45 * 20) * 5 seconds (5400 seconds, i.e. 90 minutes), far longer than the 10min RetryCache expiry time. This duration is governed by `maxRetriesOnSocketTimeouts`,
      `connectionTimeout`, `maxRetriesOnSasl` and `rpcTimeout`. In that case the delayed request is no longer covered by the `RetryCache` (10min by default) on the NameNode side.
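
The worst-case figure can be checked with a few lines of arithmetic. This is a minimal sketch: the issue does not spell out which factor corresponds to which setting, so the parameter names below are placeholders that simply mirror the shape of the formula.

```java
final class SetupBudget {
    /** Evaluates (waitA * countA + waitB * countB) * outerAttempts, in seconds. */
    static long worstCaseSeconds(long waitA, long countA,
                                 long waitB, long countB,
                                 long outerAttempts) {
        return (waitA * countA + waitB * countB) * outerAttempts;
    }

    public static void main(String[] args) {
        // Values taken from the formula in the issue: (60 * 3 + 45 * 20) * 5
        long worst = worstCaseSeconds(60, 3, 45, 20, 5);
        long retryCacheExpirySeconds = 10 * 60; // 10 min default cited in the issue

        System.out.println(worst + " s = " + (worst / 60) + " min");
        System.out.println("exceeds RetryCache expiry: " + (worst > retryCacheExpirySeconds));
    }
}
```

With these inputs the worst case is 5400 s (90 min), nine times the 10-minute expiry, so a request stuck in connection setup can easily outlive its RetryCache entry.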
      IMO, we should offer basic configuration suggestions for the Router (especially for the RPC layer) to avoid this data loss case.

          People

            Assignee: Unassigned
            Reporter: Xiaoqiao He (hexiaoqiao)