Hadoop Common / HADOOP-1306

DFS Scalability: Reduce the number of getAdditionalBlock RPCs on the namenode

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      One of the most-frequently-invoked RPCs in the namenode is the addBlock() RPC. The DFSClient uses this RPC to allocate one more block for a file that it is currently operating upon. The scalability of the namenode will improve if we can decrease the number of addBlock() RPCs. One idea that we want to discuss here is to make addBlock() return more than one block. This proposal came out of a discussion I had with Ben Reed.

      Let's say that addBlock() returns n blocks for the file. The namenode already tracks these blocks using the pendingCreates data structure. The client guarantees that these n blocks will be used in order. The client also guarantees that if it cannot use a block (due to whatever reason), it will inform the namenode using the abandonBlock() RPC. These RPCs are already supported.

      Another possible optimization: since the namenode has to allocate n blocks for a file, should it use the same set of datanodes for this set of blocks? My proposal is that if n is a small number (e.g. 3), it is prudent to allocate the same set of datanodes to host all replicas for this set of blocks. This will reduce the CPU spent in chooseTargets().
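
      To make the proposal concrete, here is a minimal sketch of what a batched allocation call could look like. The issue does not fix an API, so the method name addBlocks(), its parameters, and the LocatedBlock[] return type are assumptions for illustration only; Block and LocatedBlock stand in for the existing DFS types.

        import java.io.IOException;

        // Hypothetical batched-allocation interface; not the actual ClientProtocol.
        public interface BatchedBlockAllocation {

          /**
           * Allocate up to maxBlocks new blocks for the file src, returned in
           * the order the client must write them. The namenode would record
           * all of them in pendingCreates, as it does for a single addBlock().
           */
          LocatedBlock[] addBlocks(String src, String clientName, int maxBlocks)
              throws IOException;

          /**
           * Already-supported escape hatch: the client hands back any block
           * from the batch that it cannot use, one call per unused block.
           */
          void abandonBlock(Block b, String src) throws IOException;
        }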

      Attachments

      1. fineGrainLocks3.patch (42 kB, dhruba borthakur)

        Activity

        dhruba borthakur added a comment -

        The slowness of the getAdditionalBlock RPC has been addressed by HADOOP-1269, HADOOP-1187, HADOOP-1149, and HADOOP-1073.

        Hairong Kuang added a comment -

        Just did a quick look at the patch. Should we make all methods of NetworkTopology not synchronized after a read/write lock is introduced?

        Konstantin Shvachko added a comment -
        1. Indentation in BlocksMap is wrong in most cases.
        2. noClone is not used in the constructor. And the constructor itself is not used anywhere.
          NodeIterator(DatanodeDescriptor[] nodes, boolean noClone) {
            arr = nodes;
          }
        3. I do not understand how BlocksMap.nodeIterator() works. The entries are not protected by
          the lock. Even though you clone them, what happens if the original array is modified by
          another thread?
        4. I think removal of version consistency on the DataNode was not intended for this patch.
        5. Three unused imports in DatanodeDescriptor (only one of which is new though).
        6. ConcurrentSkipListMap is a Java 6 class. Have we officially switched to Java 6?
        7. I praise the introduction of PendingCreates that merges everything pending into one class.
        8. An example of things I do not understand from PendingCreates.java
          boolean removeBlock(UTF8 file, Block b) {
            FileUnderConstruction v = pendingCreates.get(file);
            if (v != null) {
              synchronized (v) {
                // ...removing block here...
              }
            }
          }

          So what happens if another thread completes or deletes the pending file v after obtaining v but before entering the synchronized section on v?
          Even if it works fine in this case, I don't think this is the right direction, because supporting that kind of trick will be hard.
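
          For illustration, the same snippet annotated with where that window sits (FileUnderConstruction, pendingCreates, and the elided block removal are from the quoted code; the comments are editorial):

          FileUnderConstruction v = pendingCreates.get(file);
          // Window: another thread may complete or delete the pending file
          // here, removing v from pendingCreates before this thread locks it.
          if (v != null) {
            synchronized (v) {
              // v may no longer be the entry in the map, so the block removal
              // below can update a stale object whose state nobody reads.
              // ...removing block here...
            }
          }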

        dhruba borthakur added a comment -

        Merged patch with latest trunk.

        dhruba borthakur added a comment -

        Here is a first version of changing the locking so that getAdditionalBlock and addStoredBlock occur without any global locks. I have seen that randomwriter and dfsio, which used to fail on a 1000-node cluster, now run successfully with this patch.

        1. NetworkTopology has reader/writer locks. This map hardly changes but is used very frequently. Now, multiple open() calls can proceed in parallel (see the sketch after this comment).

        2. The pending blocks and pending files are put into a new class called PendingCreates.java. This helps lock them together.

        3. The BlocksMap is protected by a reader/writer lock.

        4. In the common case (when the file is still in pendingCreates), addStoredBlock() does not acquire the global fsnamesystem lock.

        5. The datanodeMap was already using a lock object associated with it to protect modifications to it. The patch makes sure that this lock is taken in all places where the datanodeMap is modified.

        6. The Host2NodesMap has its own read/write lock. This will be merged with the datanodeMap when we move to a much finer locking model in the future.

        This patch is for code review purposes only. Some additional locking is needed for processReport (still to be done), but I would like some comments on the changes I have made.

        I would have liked a finer-grained locking model that allows all filesystem methods to be highly concurrent, but that approach was deemed too complex for the short term. I am putting out this patch to get feedback on whether this medium-term approach is acceptable.
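
        For reference, a minimal sketch of the reader/writer-lock pattern described in items 1, 3, and 6 above, applied to a NetworkTopology-style lookup structure. This is an illustration under assumed names (RackMapSketch, hostToRack, getRack, addHost), not code from the patch.

        import java.util.HashMap;
        import java.util.Map;
        import java.util.concurrent.locks.ReentrantReadWriteLock;

        // Many readers (e.g. block placement during open()/getAdditionalBlock())
        // proceed in parallel; the rare topology updates take the write lock.
        class RackMapSketch {
          private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
          private final Map<String, String> hostToRack = new HashMap<String, String>();

          String getRack(String host) {              // frequent, shared access
            lock.readLock().lock();
            try {
              return hostToRack.get(host);
            } finally {
              lock.readLock().unlock();
            }
          }

          void addHost(String host, String rack) {   // rare, exclusive access
            lock.writeLock().lock();
            try {
              hostToRack.put(host, rack);
            } finally {
              lock.writeLock().unlock();
            }
          }
        }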

        Raghu Angadi added a comment -

        This will be useful. A couple of thoughts:

        1. Requiring abandonBlock() is not necessary. fileComplete() can provide the actual number of blocks used, and the namenode deletes the rest. This allows the namenode to provide more blocks at a time without worrying about the extra cost of abandonBlock().
        2. We can make fileCreate() provide a set of blocks, avoiding one more addBlock().
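
        For illustration, a rough sketch of the shape this alternative could take; the signatures below are assumptions, not existing ClientProtocol methods, and LocatedBlock stands in for the existing DFS type.

        import java.io.IOException;

        // Hypothetical client-namenode calls for Raghu's suggestion.
        public interface EagerAllocationSketch {

          /**
           * Create the file and hand back an initial batch of blocks, saving
           * the first addBlock() round trip.
           */
          LocatedBlock[] create(String src, String clientName, int initialBlocks)
              throws IOException;

          /**
           * On completion the client reports how many blocks it actually
           * wrote; the namenode frees the unused tail of the batch, so no
           * abandonBlock() calls are needed.
           */
          boolean complete(String src, String clientName, int blocksUsed)
              throws IOException;
        }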

          People

          • Assignee:
            Unassigned
          • Reporter:
            dhruba borthakur
          • Votes:
            0
          • Watchers:
            0
