Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: QuorumJournalManager (HDFS-3077)
    • Component/s: None
    • Labels: None

      Description

      This is a potential optimization we can add to the JournalNode: when one of the nodes is lagging behind the others (e.g., because its local disk is slower or there was a network blip), it receives edits after they've already been committed to a majority. The node can tell this because the committed txid included in the request info is higher than the highest txid in the actual batch to be written. In that case, we know the batch has already been fsynced to a quorum of nodes, so we can skip the fsync() on the laggy node, helping it catch back up.
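
      A minimal sketch of the check described above, with hypothetical method and parameter names (the actual change is in the attached hdfs-3885.txt):

        // Sketch only: decide whether a JournalNode must fsync a batch of edits.
        class LaggingWriteCheck {
          /**
           * @param committedTxId highest txid the writer reports as committed on a
           *                      quorum (carried in the request info)
           * @param firstTxnId    first txid in this batch
           * @param numTxns       number of txns in this batch
           * @return false when the committed txid is higher than the batch's highest
           *         txid, i.e. the batch is already durable on a quorum and the
           *         local fdatasync() can be skipped
           */
          static boolean shouldFsync(long committedTxId, long firstTxnId, int numTxns) {
            long lastTxnId = firstTxnId + numTxns - 1;
            boolean alreadyDurableOnQuorum = lastTxnId < committedTxId;
            return !alreadyDurableOnQuorum;
          }
        }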

      Attachments

      1. hdfs-3885.txt (11 kB, Todd Lipcon)

        Activity

        stepinto Chao Shi added a comment -

        A similar optimization to save network latency: sync logs to a lagging node in larger batches. I'd guess a batch of 512K or 1MB should be much more efficient.

        Note that this can also work for uncommitted transactions. Imagine this with 3 JNs: Tx1 is committed by JN1 and JN2, QJM is writing Tx2, and JN3 is lagging, so we have Tx1 and Tx2 in its queue. We can send them to JN3 in one batch.

        Implementing this would require more changes to the current code structure, which simply uses a single-threaded executor as the queue; a rough sketch follows.
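
        A rough sketch of the coalescing idea, with hypothetical names (BatchCoalescer, MAX_BATCH_BYTES, and the Queue<byte[]> shape are illustrative, not the current QJM code):

          import java.io.ByteArrayOutputStream;
          import java.util.Queue;

          // Sketch only: drain queued edit batches into one payload under a size cap.
          class BatchCoalescer {
            private static final int MAX_BATCH_BYTES = 1 << 20; // 1 MB cap, per the suggestion above

            static byte[] coalesce(Queue<byte[]> pending) {
              ByteArrayOutputStream out = new ByteArrayOutputStream();
              // Always take at least one batch, even if it alone exceeds the cap.
              if (!pending.isEmpty()) {
                byte[] first = pending.poll();
                out.write(first, 0, first.length);
              }
              // Keep appending whole batches while they still fit under the cap.
              while (!pending.isEmpty()
                  && out.size() + pending.peek().length <= MAX_BATCH_BYTES) {
                byte[] next = pending.poll();
                out.write(next, 0, next.length);
              }
              return out.toByteArray();
            }
          }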

        tlipcon Todd Lipcon added a comment -

        Another good idea, Chao.

        I'm going to quickly implement the simpler idea described in the "description" field of this JIRA, since it's trivial to do so. We can then open another JIRA to actually aggregate larger batches "in the queue" to avoid the round-trip times.

        tlipcon Todd Lipcon added a comment -

        It wasn't easy to figure out how to write a unit test for this change, but I verified as follows:

        • Started a 3-node QJM cluster
        • Ran strace -efdatasync,write -f -p <pid of one JN>
        • Wrote lots of txns to the NN; this showed a lot of fdatasync and write calls, mostly alternating (write a chunk, fsync, write a chunk, fsync, etc.)
        • kill -STOPped that JN for 10-15 seconds
        • kill -CONT'd that JN
        • Saw a bunch of write() calls with no fdatasync calls while it was still catching up; after it caught up, it started syncing again

        I also verified that it caught up much faster with this change in place.
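
        For reference, the manual check above boils down to the following shell steps (the JN pid is a placeholder; strace flags as in the comment, with -p added to attach to the running process):

          # attach to one JournalNode and trace only write/fdatasync syscalls
          strace -f -e trace=write,fdatasync -p <JN_PID>

          # in another shell: pause the JN so it falls behind the quorum, then resume it
          kill -STOP <JN_PID>; sleep 15; kill -CONT <JN_PID>

          # while the JN catches up, the trace shows write() calls with no
          # interleaved fdatasync(); once caught up, fdatasync() resumes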

        atm Aaron T. Myers added a comment -

        +1, patch looks great, and the testing you did sounds good.

        tlipcon Todd Lipcon added a comment -

        Committed to branch, thanks for the review.


          People

          • Assignee: tlipcon Todd Lipcon
          • Reporter: tlipcon Todd Lipcon
          • Votes: 0
          • Watchers: 4
