Details

    • Type: Wish
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      We have repeatedly seen users fail to bring up an ensemble because the snapshot transfer times out. We should be able to do better than this and calibrate initLimit dynamically. Concretely, servers could increase the initLimit value (e.g., by doubling it or incrementing it by 1) upon socket timeouts. The tricky part is that both ends of the connection need to increase it.
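A minimal sketch of the doubling variant, assuming initLimit is counted in ticks of tickTime and that a cap is wanted so the timeout cannot grow without bound. The class name `InitLimitBackoff` and the `maxTicks` cap are hypothetical and not part of ZooKeeper. The sketch covers only one end of the connection, so it does not address the agreement problem mentioned above.

```java
/** Sketch: grow an initLimit-style timeout (in ticks) after each sync timeout. */
final class InitLimitBackoff {
    private final int maxTicks;  // hypothetical upper bound, not a ZooKeeper setting
    private int currentTicks;

    InitLimitBackoff(int configuredTicks, int maxTicks) {
        if (configuredTicks <= 0 || maxTicks < configuredTicks) {
            throw new IllegalArgumentException("need 0 < configuredTicks <= maxTicks");
        }
        this.currentTicks = configuredTicks;
        this.maxTicks = maxTicks;
    }

    /** Call when a snapshot transfer hits a socket timeout; returns the new limit. */
    int onSyncTimeout() {
        // Double the limit, saturating at maxTicks; long arithmetic avoids int overflow.
        currentTicks = (int) Math.min((long) currentTicks * 2, (long) maxTicks);
        return currentTicks;
    }

    int currentTicks() {
        return currentTicks;
    }

    /** Socket timeout in milliseconds for the current limit. */
    long timeoutMillis(int tickTimeMillis) {
        return (long) currentTicks * tickTimeMillis;
    }
}
```

With a configured value of 10 ticks and a cap of 60, successive timeouts would yield 20, 40, 60, and then stay at 60.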

        Activity

        fanster.z Jacky007 added a comment -

        In ZAB v1.0:
        FOLLOWERINFO --> Leader
        LEADERINFO <-- Leader
        ACK EPOCH --> Leader
        do send snap/diff/trunc
        UPTODATE <-- Leader

        The leader can calibrate initLimit and tell the peers the new value via LEADERINFO.
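One way to picture this propagation is a LEADERINFO payload that carries the leader's current initLimit next to a protocol version. The layout below (two big-endian ints) and the class `LeaderInfoPayload` are assumptions made for illustration, not the actual ZooKeeper wire format.

```java
import java.nio.ByteBuffer;

/** Sketch: a hypothetical LEADERINFO payload that also carries the leader's initLimit. */
final class LeaderInfoPayload {
    final int protocolVersion;
    final int initLimitTicks;

    LeaderInfoPayload(int protocolVersion, int initLimitTicks) {
        this.protocolVersion = protocolVersion;
        this.initLimitTicks = initLimitTicks;
    }

    /** Leader side: serialize version and initLimit as two 4-byte ints. */
    byte[] encode() {
        return ByteBuffer.allocate(8)
                .putInt(protocolVersion)
                .putInt(initLimitTicks)
                .array();
    }

    /** Follower side: read the leader's initLimit, or keep the local one if absent. */
    static LeaderInfoPayload decode(byte[] data, int localInitLimitTicks) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int version = buf.getInt();
        // A leader that does not send the extra field leaves the follower on its own config.
        int limit = buf.remaining() >= 4 ? buf.getInt() : localInitLimitTicks;
        return new LeaderInfoPayload(version, limit);
    }
}
```

Falling back to the local value when the field is missing is one possible way to stay compatible with leaders that do not send it. Whether that is acceptable depends on how mixed-version ensembles should behave.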

        fpj Flavio Junqueira added a comment -

        Sounds like a good way of propagating it. To change the value of initLimit, I was thinking that either we have a backoff mechanism or we use the size of the snapshot and assume some minimum amount of bandwidth. Right now I think I prefer the exponential backoff option. Any thoughts?

        hdeng Hongchao Deng added a comment -

        Would each server back off locally at every timeout?

        If that is the idea, here is a problematic case:

        Cluster: A, B, C with default initLimit = 1s
        1. A talks to B, and after some time they both have initLimit = 3s
        2. A then talks to C, but A and C now have different initLimit values

        Am I misunderstanding anything here?
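The case above can be reproduced in a toy model, assuming each timeout adds one second at both endpoints of the affected link and nothing is propagated to other servers. The class `LocalBackoffScenario` is purely illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch: purely local backoff, where only the two endpoints of a link adjust. */
final class LocalBackoffScenario {
    private final Map<String, Integer> initLimitSeconds = new HashMap<>();

    LocalBackoffScenario(int defaultSeconds, String... servers) {
        for (String server : servers) {
            initLimitSeconds.put(server, defaultSeconds);
        }
    }

    /** A timeout on the link between a and b: each endpoint adds one second. */
    void timeoutBetween(String a, String b) {
        initLimitSeconds.merge(a, 1, Integer::sum);
        initLimitSeconds.merge(b, 1, Integer::sum);
    }

    int limitOf(String server) {
        return initLimitSeconds.get(server);
    }

    /** True if both servers would use the same initLimit when they connect. */
    boolean agree(String a, String b) {
        return limitOf(a) == limitOf(b);
    }
}
```

Starting from 1s on A, B, and C, two timeouts between A and B leave A and B at 3s while C stays at 1s, so `agree("A", "C")` is false.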

        fanster.z Jacky007 added a comment -

        I don't like complicated mechanisms. The formula initLimit + (snapshot+log) * failedPeersNum / bandwidth works in our internal environment.
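The formula can be sketched as follows. The units are assumptions, since the comment does not state them: sizes in bytes, bandwidth in bytes per second, and the limit in seconds. Rounding the extra time up to a whole second is also a choice made here, and `InitLimitEstimate` is a hypothetical name.

```java
/** Sketch of: initLimit + (snapshot + log) * failedPeersNum / bandwidth. */
final class InitLimitEstimate {
    private InitLimitEstimate() {
    }

    static long adjustedInitLimitSeconds(long baseInitLimitSeconds,
                                         long snapshotBytes,
                                         long logBytes,
                                         int failedPeers,
                                         long bandwidthBytesPerSecond) {
        if (bandwidthBytesPerSecond <= 0) {
            throw new IllegalArgumentException("bandwidth must be positive");
        }
        if (baseInitLimitSeconds < 0 || snapshotBytes < 0 || logBytes < 0 || failedPeers < 0) {
            throw new IllegalArgumentException("inputs must be non-negative");
        }
        // Total bytes to send again to the peers that failed to sync.
        long bytesToResend =
                Math.multiplyExact(Math.addExact(snapshotBytes, logBytes), (long) failedPeers);
        // Extra transfer time, rounded up to a whole second (assumed rounding).
        long extraSeconds = bytesToResend / bandwidthBytesPerSecond
                + (bytesToResend % bandwidthBytesPerSecond == 0 ? 0 : 1);
        return Math.addExact(baseInitLimitSeconds, extraSeconds);
    }
}
```

For example, a 10s base limit, a 900 MB snapshot plus a 100 MB log, two failed peers, and 100 MB/s of bandwidth would give 10 + 20 = 30 seconds under these assumptions.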

        fpj Flavio Junqueira added a comment -

        Jacky007, do you have a patch you're willing to share? It'd be great to start from what you have.

        Hongchao Deng, the prospective leader increases the initLimit value as Jacky007 describes in his comment and propagates it. I was trying to propose a scheme that doesn't require estimating the transfer time and that can adjust dynamically.


          People

          • Assignee: Unassigned
          • Reporter: fpj Flavio Junqueira
          • Votes: 1
          • Watchers: 7
