Hadoop HDFS: HDFS-385

Design a pluggable interface to place replicas of blocks in HDFS

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.21.0, 1.2.0, 1-win
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      New experimental API BlockPlacementPolicy allows investigating alternate rules for locating block replicas.

      Description

      The current HDFS code typically places one replica on the local rack, the second replica on a random remote rack, and the third replica on a random node of that remote rack. This algorithm is baked into the NameNode's code. It would be nice to make the block placement algorithm a pluggable interface. This would allow experimentation with different placement algorithms based on workloads, availability guarantees, and failure models.

      1. BlockPlacementPluggable.txt
        16 kB
        dhruba borthakur
      2. BlockPlacementPluggable2.txt
        61 kB
        dhruba borthakur
      3. BlockPlacementPluggable3.txt
        70 kB
        dhruba borthakur
      4. BlockPlacementPluggable4.txt
        79 kB
        dhruba borthakur
      5. BlockPlacementPluggable4.txt
        76 kB
        dhruba borthakur
      6. BlockPlacementPluggable5.txt
        79 kB
        dhruba borthakur
      7. BlockPlacementPluggable6.txt
        77 kB
        dhruba borthakur
      8. BlockPlacementPluggable7.txt
        77 kB
        dhruba borthakur
      9. blockplacementpolicy2-branch-1.patch
        85 kB
        Sumadhur Reddy Bolli
      10. blockplacementpolicy2-branch-1-win.patch
        84 kB
        Sumadhur Reddy Bolli
      11. blockplacementpolicy3-branch-1.patch
        85 kB
        Sumadhur Reddy Bolli
      12. blockplacementpolicy3-branch-1-win.patch
        84 kB
        Sumadhur Reddy Bolli
      13. blockplacementpolicy-branch-1.patch
        28 kB
        Sumadhur Reddy Bolli
      14. blockplacementpolicy-branch-1-win.patch
        27 kB
        Sumadhur Reddy Bolli
      15. rat094.txt
        0.4 kB
        stack

        Issue Links

          Activity

          dhruba borthakur created issue -
          Lohit Vijayarenu added a comment -

          It would be good if this could also be used while re-balancing. Even better if there is an option to change policy of a file.

          dhruba borthakur added a comment -

           Another alternative would be to allow an application to specify that a set of files have the same "block affinity". HDFS will try to allocate blocks of these files on a small set of datanodes.

          dhruba borthakur made changes -
          Field Original Value New Value
          Assignee dhruba borthakur [ dhruba ]
          dhruba borthakur added a comment -

           This patch makes the HDFS block placement algorithm pluggable. The Namenode uses the configuration parameter dfs.block.replicator.classname to find the class that chooses block locations. The default policy remains the same as before and can be found in ReplicationTargetChooser.java.

          A new block placement policy has to implement the interface specified in BlockPlacementInterface.java

           I would have liked the interface to specify the name of the file to which the block belongs. But the name of the file cannot be materialized cheaply when the block manager is doing block replication.
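
           As a rough illustration of what such a plugin might look like, here is a minimal sketch. The class name, method signatures and the toy placement logic are assumptions based on this discussion, not the exact API in the attached patch; it also assumes the plugin lives in the same package as the policy classes, so HDFS-internal imports are omitted.

           import java.util.List;
           import org.apache.hadoop.conf.Configuration;
           import org.apache.hadoop.net.NetworkTopology;

           // Hypothetical plugin: place the first replica on the writer and ignore the
           // remaining replica count. A real policy would consult the cluster topology
           // and load statistics handed to initialize().
           public class WriterOnlyPlacementPolicy extends BlockPlacementPolicy {

             private NetworkTopology clusterMap;

             public void initialize(Configuration conf, FSClusterStats stats,
                                    NetworkTopology clusterMap) {
               this.clusterMap = clusterMap;  // topology supplied by the NameNode
             }

             public DatanodeDescriptor[] chooseTarget(String srcPath, int numOfReplicas,
                 DatanodeDescriptor writer, List<DatanodeDescriptor> chosenNodes,
                 long blocksize) {
               if (writer != null && numOfReplicas > 0) {
                 return new DatanodeDescriptor[] { writer };
               }
               return new DatanodeDescriptor[0];
             }
           }

           The NameNode would then be pointed at the class through the dfs.block.replicator.classname property in its configuration.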

          dhruba borthakur made changes -
          Attachment BlockPlacementPluggable.txt [ 12407075 ]
          Hong Tang added a comment -

          @dhruba

          any specific block replication policies you might be interested in experimenting with?

          additionally, Ben Reed pointed me to this paper in NSDI 09 where the authors described a scheme on how to allow applications to express storage cues (some of them can be replication policies):

          Flexible, Wide-Area Storage for Distributed Systems with WheelFS
          Jeremy Stribling, MIT CSAIL; Yair Sovran, New York University; Irene Zhang and Xavid Pretzer, MIT CSAIL; Jinyang Li, New York University; M. Frans Kaashoek and Robert Morris, MIT CSAIL http://www.usenix.org/events/nsdi09/tech/full_papers/stribling/stribling.pdf

          dhruba borthakur added a comment -

          @Hong: thanks for the paper. I will look at it shortly. Thanks.

          > any specific block replication policies you might be interested in experimenting with?

           I am interested in co-locating blocks from two specified data-sets on the same set of datanodes.

          Stefan Will added a comment -

          Here's something I'd love to have: A way to lock a complete replica of a set of files to one or more nodes. Together with HADOOP-4801 this would allow Lucene indexes to be efficiently served right out of hdfs, rather than having to create yet another copy on the local disk.

          dhruba borthakur made changes -
          Link This issue is related to HADOOP-4801 [ HADOOP-4801 ]
          dhruba borthakur added a comment - - edited

           @Stefan: I completely agree with you. This patch should enable researchers to experiment with various modes of HDFS block placement without changing core HDFS code. I plan on using this to co-locate blocks from hdfs datasets that are frequently scanned together in a small number of datanodes so that such a "join" operation gets better node/rack locality.

          Tom White added a comment -

          Dhruba, These look like good changes - glad to see this moving forward. More comments below:

          • Can BlockPlacementInterface be an abstract class? I would also change its name to not have the "Interface" suffix, something like ReplicationPolicy, or BlockPlacementPolicy. ReplicationTargetChooser could be renamed something like DoubleRackReplicationPolicy or DoubleRackBlockPlacementPolicy or similar, to better describe its role.
          • Why doesn't ReplicationPolicy simply pass through verifyBlockPlacement()? It seems odd that it's doing extra work here.
          • BlockPlacementInterface#chooseTarget(). Make excludedNodes a List<DatanodeDescriptor>. Implementations may choose to turn it into a map if they need to, but for the interface, it should just be a list, shouldn't it?
          • For future evolution, can we pass a Configuration to the initialize() method, rather than the considerLoad boolean?
           • Rather than passing the full FSNamesystem to the initialize method, it would be preferable to create an interface for the part that the block placement strategy needs. Something like FSNamespaceStats, which only needs getTotalLoad() for the moment. I think this is an acceptable use of an interface, since it is only used by developers writing a new block placement strategy. There's a similar situation for job scheduling in MapReduce: JobTracker implements the package-private TaskTrackerManager interface so that TaskScheduler doesn't have to pull in the whole JobTracker. This helps a lot with testing.
          • These changes should make it possible to unit test ReplicationTargetChooser directly. This could be another Jira.
          Hairong Kuang added a comment -

           Dhruba, what's the implication of a pluggable block placement policy for the balancer, under-replicated, and over-replicated blocks? Currently they all assume there is only one replication policy. Is it possible that a block is created with one replication policy and then later in its life a different policy is used to handle its over/under-replication and balancing? Or do we need to persist the initial placement policy on disk?

          dhruba borthakur added a comment -

          @Tom: thanks for the comments. I will work on it.

           @Hairong: It might be nice if we can have a replication policy per file. But this feature can be handled inside the plugin code. The plugin code can look at the name of the file (or file metadata stored externally) and decide on a replication policy. My vote is not to store the replication policy persistently in the namenode metadata. In the current code, the initial block placement, the balancer and the handling of over-replication/under-replication should all be using the new block-placement interface. I will look at this angle some more.

          dhruba borthakur added a comment -

           @Hairong: fsck uses verifyReplication() to check whether replicas are placed according to the configured policy, and this method is part of the proposed BlockPlacement policy interface. Balancer does not use the configured policy but ensures that the number of unique racks for the block is not reduced. Thus, both these components should work OK with an externally configured replication policy, shouldn't they?

          Jingkei Ly added a comment -

           Hi Dhruba. I'm really glad you raised this JIRA because I need something like this in order to be able to co-locate blocks based on the filename of the file the blocks belong to - I think similar to what you mentioned you would like to do. However, I'm not sure how you would do this with the interface you are proposing, as BlockPlacementInterface#chooseTarget() isn't passed anything to identify the block or filename that is being written - am I missing something? How would this be done?

          dhruba borthakur added a comment -

           I would like to figure out how to expose the filename in the API. It can be easily done when this API is invoked at the time when a new block is being allocated to a file. However, when the replicator takes an under-replicated block and tries to get a new replica location for that block, it is costly to find the filename.

           The filename can be deduced by traversing the INode tree that is maintained by the namenode. But it is a costly operation because one has to traverse the entire branch starting from the specified INode to the root. One option is to pass the INode into the block-placement algorithm. If the algorithm needs the complete path name of the file in question, then it has to do the costly operation of generating the full path name of the file. However, this makes the interface kind of less elegant. I am still debating how to do it right.

          Hairong Kuang added a comment -

           > Balancer does not use the configured policy but ensures that the number of unique racks for the block is not reduced. Thus, both these components should work OK with an externally configured replication policy, shouldn't they?
           Replication policies may have requirements other than the number of racks, for example co-location. I don't think the balancer handles this right now.

          dhruba borthakur added a comment -

          > Replication policies may have other requirements other than number of racks,

          Of course, this is true. But this depends on what type of replication policy one can come up with and I would like to leave this for a future time. "co-location" of blocks would typically be based on their pathnames, and the balancer could be extended to invoke the same BlockPlacement policy interface to adhere to the policy.

          However, in this patch, I would like to expose the name of the file via the BlockPlacement policy interface. Any ideas here?

          Jingkei Ly added a comment -

          > However, in this patch, I would like to expose the name of the file via the BlockPlacement policy interface. Any ideas here?

          I think having two versions of BlockPlacementInterface#chooseTarget() would be the most efficient - one that accepts the filename (to be called from FSNamesystem#getAdditionalBlock()) and another that accepts the INode (to be called from FSNamesystem#computeReplicationWorkForBlock()). As you said, it does make the interface rather inelegant, though.

          An alternative is to pass the Block object to chooseTarget() and let the plugin-code look up the INode itself in the FSNamesystem map - not particularly efficient, but perhaps plugin-code could cache INodes to filenames to mitigate it a bit.

          Hairong Kuang added a comment -

          > I would like to leave this for a future time
           I do not think ignoring the balancer in this design is a good idea. Yahoo runs the balancer almost 24/7 on all clusters. We do not want the balancer to break whatever block placement policy is configured on the cluster.

           Dhruba, I think this jira makes a big change to the core of DFS. It's not trivial at all. If we ever want to make this change, we should have a design document first, explaining the semantics of the pluggable placement policy and how all block placement related components should change to support the feature.

          Owen O'Malley added a comment -

          I agree with Hairong. I think any answer that ignores the rebalancer is broken. Especially since in order to be stable under the rebalancer, hdfs probably needs to store metadata with the files or blocks that is placement specific.

          -1.

          dhruba borthakur added a comment -

          > Especially since in order to be stable under the rebalancer

           Oh guys, you are going too far! I am talking about a faster cycle of innovation and iteration. A pluggable interface allows the hadoop community to experiment with newer methods of block placement. Only once such a placement algorithm proves beneficial and helpful does the related question of "how to make the balancer work with the new placement policy" come into the picture. If experiments prove that there isn't any viable alternative pluggable policy, then the question of "does the balancer work with a pluggable policy" is moot.

          > hdfs probably needs to store metadata with the files or blocks

           I do not like this approach. It makes hdfs heavy, clunky and difficult to maintain. Have you seen what happened to other file systems that tried to do everything inside them, e.g. DCE-DFS? It is possible that HDFS might allow generic blobs to be stored with files (aka extended file attributes) where application-specific data can be stored. But it should be disassociated from a "requirement" that the archival policy has to be stored with the file metadata.

           Again folks, I agree completely with you that a "finished product" needs to encompass the "balancer". But to start experimenting to figure out whether a different placement policy is beneficial at all, I need the pluggability feature, otherwise I have to keep changing my hadoop source code every time I want to experiment. My experiments will probably take three to six months, especially because I want to benchmark results at large scale.

          For installations that go with the default policy, there is no impact at all.

          Jingkei Ly added a comment -

          > hdfs probably needs to store metadata with the files or blocks
           Instead of storing replication policy metadata with the blocks, could you keep the responsibility for this with the replica placement plugin code?

           So, assuming the balancer is also updated to use the block placement interface: if a cluster has to support multiple replication policies, it could be the plugin code's responsibility to decide which policy to use based on the file owner/permissions/filename for the block. One advantage is that all the replication code for the cluster is encompassed in one place.

          dhruba borthakur added a comment -

          After sleeping over it, I think it is necessary to ensure that the balancer does at least the bare minimum to work elegantly with an external block placement policy.

          > if a cluster has to support multiple replication policies, it could be the plugin-code's responsiblity to decide which policy to use based on the file owner/permissions/filename for the block

           That's my plan. One of my ideas is to change the block placement policy for a file directory based on access patterns. The plugin will analyze a set of past access patterns (stored in an external db) to decide what type of placement is "currently" best for a dataset.

          Konstantin Shvachko added a comment -

           > The plugin will analyze a set of past access patterns (stored in an external db)

           What external db? Dhruba, could you please elaborate on what you are trying to do, maybe in the form of a document?
           Mixing different placement policies in the same (instance of) file system is not a good idea imo. The policies may contradict each other.
           I would rather allow formatting a file system with a specific policy and then keep it constant for the lifespan of the system.
          This gives enough space for experimenting with different policies.
          But I agree that the balancer and the fsck should be policy aware.

          dhruba borthakur added a comment -

          > What external db? Dhruba, could you please elaborate what you

          We have deployed the scripts from HADOOP-3708 to collect (online) job logs into a mysql DB. This DB also contains the hive query that the job makes. It is easy to see what datasets are being used together in most queries. My idea is to dynamically co-locate blocks based on these access patterns. I will present those ideas in a separate JIRA (once this JIRA gets through)

          > I would rather allow to format a file system with a specific policy and then keep it constant for the lifespan of the system.

          That would be a fine goal for your cluster. However, I would like the API to be more flexible than that. Does it sound reasonable?

          dhruba borthakur added a comment -

          I incorporated all of Tom's comments. Thanks Tom.

          This patch ensures that the following code paths conform to the configured block placement policy:
           1. Allocation of a new block for a file
           2. Deletion of excess replicas of a block
           3. Creation of new replicas of a block
           4. Movement of blocks triggered by the balancer
           5. Verification of block placement by fsck

          dhruba borthakur made changes -
          Attachment BlockPlacementPluggable2.txt [ 12409493 ]
          Jingkei Ly added a comment -

          I have a use case for controlling the replication targets of a block based on filename at file creation time - is it possible to accommodate this in your proposed API?

          Tom White added a comment -

          Hi Dhruba,

          A couple more comments:

          > BlockPlacementInterface#chooseTarget(). Make excludedNodes a List<DatanodeDescriptor>. Implementations may choose to turn it into a map if they need to, but for the interface, it should just be a list, shouldn't it?

          I think you missed this change.

          I'm not convinced that ReplicationPolicyChooser is needed. Couldn't we add a static method (e.g. getInstance()) to BlockPlacementPolicy to construct a BlockPlacementPolicy from the dfs.block.replicator.classname property? We can add an overloaded chooseTarget() method to BlockPlacementPolicy which doesn't take a chosenNodes argument (BTW this is misspelt as "choosenNodes" in BlockPlacementPolicy).
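
           For what it is worth, a rough sketch of such a factory, assuming it lives as a static method on BlockPlacementPolicy (the default policy class name and the initialize() arguments below are assumptions, not the committed code):

           // Sketch only: construct the configured policy, falling back to the shipped
           // implementation when dfs.block.replicator.classname is not set.
           public static BlockPlacementPolicy getInstance(Configuration conf,
                                                          FSClusterStats stats,
                                                          NetworkTopology clusterMap) {
             Class<? extends BlockPlacementPolicy> clazz = conf.getClass(
                 "dfs.block.replicator.classname",
                 BlockPlacementPolicyDefault.class,    // assumed name of the default policy
                 BlockPlacementPolicy.class);
             BlockPlacementPolicy policy = ReflectionUtils.newInstance(clazz, conf);
             policy.initialize(conf, stats, clusterMap);
             return policy;
           }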

          Owen O'Malley made changes -
          Project Hadoop Common [ 12310240 ] HDFS [ 12310942 ]
          Key HADOOP-3799 HDFS-385
          Component/s dfs [ 12310710 ]
          dhruba borthakur added a comment -

          Thanks Tom for the comments. I removed ReplicationTargetChooser.java and added a static method BlockPlacementPolicy.getInstance(). Also, I am passing in the filename to the BlockPlacementPolicy.chooseTarget() when a block is newly created for a file.

           Can you please explain how I can replace the HashMap with a List for excludeNodes? (Most implementations of a block placement policy would anyway have to check whether a node already belongs to this list before inserting it into the excludeList.)

          dhruba borthakur made changes -
          Attachment BlockPlacementPluggable3.txt [ 12411508 ]
          Tom White added a comment -

          > Can you pl explain on how I can replace the HashMap with a List for excludeNodes. (most implementations of a block placement policy would anyway have to check if a node already belongs to this list before inserting it to the excludeList)

          I was suggesting this since it is bad style to use a concrete implementation (HashMap) over an interface (Map). It was not clear to me why this was a Map, since the key and value are always the same object - this is really just a Set. From an API design point of view, I feel there should be some symmetry between List<DatanodeDescriptor> choosenNodes (still misspelt BTW) and HashMap<Node, Node> excludedNodes.

          Looking at this more, however, the callers of BlockPlacementPolicy.chooseTarget() always pass in a null excludedNodes, so I would suggest that it can be removed from the public interface.

          Jingkei Ly added a comment -

           > Also, I am passing in the filename to the BlockPlacementPolicy.chooseTarget() when a block is newly created for a file.

           I presume this is to accommodate the use case of specifying block location based on filename? I noticed that you pass null for the filename in BlockManager.computeReplicationWorkForBlock() because it is costly to extract the full file path - however, this would mean a replication scheme based on a filename specified at file creation time would be undone if a block finds itself under/over-replicated (and similarly for the Balancer).

          To lift this limitation, would it make sense to let BlockPlacementPolicy.chooseTarget() accept a Block, and provide access to BlockMap.getINode() in FSClusterStats so that a user can go through the expensive filename extraction phase if they wish?

          In my use case, I would like to place blocks based on only the final part of the file name (so part-00001 goes to DataNode 1), so I think to retrieve the name in my case would be just BlockMap.getINode().getLocalName() (although the visibility of those methods is a whole separate problem).

          dhruba borthakur added a comment -

           @Jingkei: would it work if I add another method BlockPlacementPolicy.chooseTarget() that takes an INode instead of a pathname?

          dhruba borthakur added a comment -

          Incorporated Tom's review comments.

           Added a new method to BlockPlacementPolicy so that one can choose targets by passing in the INode instead of the filename.

          dhruba borthakur made changes -
          Attachment BlockPlacementPluggable4.txt [ 12411722 ]
          dhruba borthakur added a comment -

          Ready for review. This introduced a new interface FSInodeName that allows the pluggable API to retrieve the name of a file.

          dhruba borthakur made changes -
          Attachment BlockPlacementPluggable4.txt [ 12411733 ]
          Jingkei Ly added a comment -

           > Would it work if I add another method BlockPlacementPolicy.chooseTarget() that takes an INode instead of a pathname?

          I'm a little concerned that the Balancer and Fsck will contradict a policy based on filename because BlockPlacementPolicy.isValidMove() and BlockPlacementPolicy.verifyBlockPlacement() don't have a way of materialising the Block's filename. I think there is a similar problem with FSNamesystem.chooseExcessReplicates(), which calls BlockPlacementPolicy.isValidDelete().

          dhruba borthakur added a comment -

          > I'm a little concerned that the Balancer and Fsck will contradict a policy based on filename because BlockPlacementPolicy.isValidMove() and BlockPlacementPolicy.verifyBlockPlacement()

           I agree that there isn't an elegant way of materializing a filename during a rebalance operation. One workaround is for you to run fsck to find the mapping from blocks to files. Then you can use this information in your modified Balancer to do what is appropriate for you. You can also use the tool in HADOOP-5019 for this purpose.

          dhruba borthakur added a comment -

          @Tom: would appreciate it a lot if you can review it one more time. Thanks.

          @Jingkei: can you use HDFS-207 to address the issue you have raised?

          Jingkei Ly added a comment -

           > Can you use HDFS-207 to address the issue you have raised?

          Yes, I think your patch combined with the changes proposed in HDFS-207 would provide an acceptable workaround. Thanks.

          Tom White added a comment -

          Dhruba, the latest patch addresses my earlier comments. Thanks.

          dhruba borthakur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Fix Version/s 0.21.0 [ 12314046 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12411733/BlockPlacementPluggable4.txt
          against trunk revision 790733.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/4/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/4/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/4/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/4/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          This patch is ready for commit. I would like to see if anybody else would like to review this JIRA.

          Hong Tang added a comment -

          Some minor nits:

          • In class BlockPlacementPolicy, the javadoc for chooseTarget contains an unused parameter excludedNodes.
          • There are two versions of abstract chooseTarget(), is it possible to provide a default implementation for chooseTarget(FSInodeName srcInode, int numOfReplicas, DatanodeDescriptor writer, List<DatanodeDescriptor> chosenNodes, long blocksize) on top of chooseTarget(String srcPath, int numOfReplicas, DatanodeDescriptor writer, List<DatanodeDescriptor> chosenNodes, long blocksize) ?
           • There is asymmetry in chooseTarget, which takes a list of DatanodeDescriptor and returns an array of DatanodeDescriptor. Why not return a List too?
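
           On the second nit above, one way the base class could supply a default is plain delegation, assuming FSInodeName exposes the file's path through some accessor (the getFullPathName() name below is hypothetical):

           // Sketch: default implementation of the inode-based overload, delegating to
           // the path-based abstract method. FSInodeName#getFullPathName() is hypothetical.
           DatanodeDescriptor[] chooseTarget(FSInodeName srcInode, int numOfReplicas,
               DatanodeDescriptor writer, List<DatanodeDescriptor> chosenNodes,
               long blocksize) {
             return chooseTarget(srcInode.getFullPathName(), numOfReplicas, writer,
                                 chosenNodes, blocksize);
           }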
          Matei Zaharia added a comment -

          Hey Dhruba, a couple of quick questions:

          • Why is excludedNodes a HashMap? A HashSet should perform at least as well and use less memory.
          • Why is verifyBlockPlacement part of the abstract API? It seems to be something that can be checked regardless of the replication policy as long as you know the network topology. Is this just in case the policy does not want to use the existing NetworkTopology class?
          Tsz Wo Nicholas Sze added a comment -

          > Why is excludedNodes a HashMap? A HashSet should perform at least as well and use less memory.

           It is unbelievable that java.util.HashSet is indeed backed by a HashMap. I learned this some time ago. The following is copied from the HashSet API: "This class implements the Set interface, backed by a hash table (actually a HashMap instance)..."

          Matei Zaharia added a comment -

          Wow, pretty sad

          dhruba borthakur added a comment -

           @Hong:
           • javadoc for chooseTarget contains an unused parameter: I will fix.
           • two versions of chooseTarget: I will make the default as you suggested.
           • asymmetry for chooseTarget which takes a list of DatanodeDescriptor: I would like to leave it as it is because lots of code in the namenode depends on this behaviour. Is this ok with you?

          @Matei:
           HashMap vs HashSet: not much difference, as explained by Nicholas. Also, the latest patch does not have an excludedNodes parameter.

          • verifyBlockPlacement part of the abstract API? This method is used by fsck to verify that the block satisfies the placement policy.
          Hong Tang added a comment -

          @Dhruba, Fine with me (obviously you have more context than I do to justify this).

          dhruba borthakur added a comment -

          Cancelling patch to incorporate Hong's review comments.

          dhruba borthakur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          dhruba borthakur added a comment -

           Incorporated Hong's review comments. I did not change the "asymmetry for chooseTarget which takes a list of DatanodeDescriptor" because this list is actually generated by the BlockManager and represents the datanodes where the block currently resides. It would take one additional memory allocation if I changed this list to an array.

          dhruba borthakur made changes -
          Attachment BlockPlacementPluggable5.txt [ 12412900 ]
          dhruba borthakur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          dhruba borthakur added a comment -

          +1 overall.
          [exec]
          [exec] +1 @author. The patch does not contain any @author tags.
          [exec]
          [exec] +1 tests included. The patch appears to include 3 new or modified tests.
          [exec]
          [exec] +1 javadoc. The javadoc tool did not generate any warning messages.
          [exec]
          [exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
          [exec]
          [exec] +1 findbugs. The patch does not introduce any new Findbugs warnings.
          [exec]
          [exec] +1 release audit. The applied patch does not increase the total number of release audit warnings.

          Matei Zaharia added a comment -

          Thanks for your response Dhruba. With verifyBlockPlacement, I meant that I was wondering whether any subclass will have a different implementation than the one in the default policy. It seems that the method could be defined in BlockPlacementPolicy rather than being made abstract, so that it can be shared by policies.
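To make the suggestion concrete: a shared, non-abstract verifyBlockPlacement could simply count how many distinct racks the current replica locations cover and report how many more are needed. The sketch below is illustrative only; it uses plain rack-name strings in place of the real HDFS types, and its signature is an assumption, not the one in the patch.

    import java.util.HashSet;
    import java.util.Set;

    // Illustrative sketch: replicaRacks stands in for the rack names of a block's
    // current replica datanodes (roughly what DatanodeInfo.getNetworkLocation() returns).
    public class PlacementVerifySketch {
      // Returns how many additional racks are still needed; 0 means the block complies.
      public static int additionalRacksNeeded(String[] replicaRacks, int minRacks) {
        Set<String> racks = new HashSet<String>();
        for (String rack : replicaRacks) {
          racks.add(rack);
        }
        return Math.max(0, minRacks - racks.size());
      }
    }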

          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12412900/BlockPlacementPluggable5.txt
          against trunk revision 790733.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          -1 core tests. The patch failed core unit tests.

          -1 contrib tests. The patch failed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/8/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/8/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/8/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-vesta.apache.org/8/console

          This message is automatically generated.

          Sanjay Radia added a comment -

          Dhruba
          what is the usage model? Is it along the following lines:

          Say I start with policy 1 plugged into NN and Balancer.

          Later I change to policy 2 for both NN and Balancer.
          Newly created files will follow policy 2.
          Existing files will retain policy 1. Fsck will report violations for policy 2 for the old files; correct?
          If one starts the balancer (with policy 2) then it will move blocks around so that old files match policy 2.

          It would be an admin error to configure NN and Balancer with different policies; correct? There is no check for this; correct?

          Q. The policy manager is global to the file system. Can it have its own config to to do different policies for different subtrees?

          It would be useful to get the usage model documented.

          dhruba borthakur added a comment -

          The usage model is to define a policy for the entire cluster when you create the cluster. This is especially useful when you have an HDFS instance on Amazon EC2, for example. This is not intended to be dynamic in any shape or form for a given cluster.

          > Existing files will retain policy 1. Fsck will report violations for policy 2 for the old files; correct?
          Correct.

          > It would be an admin error to configure NN and Balancer with different policies; correct? There is no check for this; correct?
          Correct.

          > Q. The policy manager is global to the file system. Can it have its own config to to do different policies for different subtrees?
          Sure can. I do not have a use-case for now that needs different policies for different files. But when it is required, we can always do that.

          You could also use this to co-locate blocks of the same file in the same set of datanodes. But here again, I do not see different policies for different files.
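For the cluster-wide usage model described above, the policy class would be resolved once from configuration when the NameNode starts. One plausible way to do that with standard Hadoop Configuration and ReflectionUtils calls is sketched below; the key name follows the dfs.block.replicator.classname parameter used by this patch, but the loading code shown here is an assumption rather than the committed implementation.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ReflectionUtils;

    // Hypothetical sketch of cluster-wide policy selection at NameNode startup.
    // Object.class is used as the fallback only to keep the sketch self-contained;
    // in the real code the fallback is the default placement policy class.
    public class PolicyLoadingSketch {
      public static Object loadConfiguredPolicy(Configuration conf) {
        Class<?> policyClass =
            conf.getClass("dfs.block.replicator.classname", Object.class);
        return ReflectionUtils.newInstance(policyClass, conf);
      }
    }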

          Hong Tang added a comment -

          @sanjay, my understanding is that in the scenario you described, policy 2 has to be compatible with policy 1. This could mean two things: (1) more general constraints being added to policy 2, where policy 2 does not violate any constraints of policy 1, e.g., I want to have a copy of this file close to an existing file (and obey the general constraints); (2) specialization on individual files or directories by user hints, such as: I want this file to be replicated twice with no need to be on two different racks.

          The paper I cited earlier offers an interesting way of expressing per-file or per-directory hints. But we may consider supporting it through something similar to Mac OS's extended attributes.

          Flexible, Wide-Area Storage for Distributed Systems with WheelFS
          Jeremy Stribling, MIT CSAIL; Yair Sovran, New York University; Irene Zhang and Xavid Pretzer, MIT CSAIL; Jinyang Li, New York University; M. Frans Kaashoek and Robert Morris, MIT CSAIL http://www.usenix.org/events/nsdi09/tech/full_papers/stribling/stribling.pdf

          Konstantin Shvachko added a comment -

          Dhruba, could you please clarify this:
          If I don't want to play tricks with replication policy, can I consider your patch as a mere refactoring? I think this is important, because we can talk about consistency of different replication policies mixed together, but this is more an experimental exercise. What really matters is that we guarantee semantic backward compatibility under the current policy.
          As I commented here before, I would prefer to keep the replication policy constant for an instance of a file system, that is, allow changing it only at format time. But this is optional for me as long as the existing policy works as it used to.

          Sanjay Radia added a comment -

          >The usage model is to define a policy for the entire cluster when you create the cluster. This is especially useful when you have an HDFS instance on Amazon EC2, for example. This is not intended to be dynamic in any shape or form for a given cluster.

          Given the above, should the system record the policy in the fsImage to prevent it from being changed?
          Similarly, should the balancer check to see if it has the same policy as the NN?
          In the past folks have complained that hadoop is too easy to misconfigure.

          >> I'm a little concerned that the Balancer and Fsck will contradict a policy based on filename because BlockPlacementPolicy.isValidMove() and BlockPlacementPolicy.verifyBlockPlacement()
          >I agree that there isn't an elegant way of materializing a filename during a rebalance operation. One workaround is for you to run a fsck to find the mapping from blocks to a file. Then you can use this information in your modified Balancer to do what is appropriate for you. You can also use the tool in HADOOP-5019 for this purpose.

          These are hard problems which indicate that this work is experimental and that it will be a while before we figure out the right APIs.
          However, the experimentation is useful, and as long as it does not impact the base code in a negative way, we should be able to add such features to Hadoop after careful review.
          We should mark such new experimental APIs as "unstable" so that we are free to change them down the road.

          dhruba borthakur added a comment -

          I agree that this API might have to evolve over time. We should mark it as "unstable" in bold letters.

          > In the past folks have complained that hadoop is too easy to misconfigure.

          The default policy should work well for 99.9% of people out there. Only a system admin can change the default policy. And one has to write Java code to implement a new policy... making it even tougher for most people to change policy.

          >Given the above should the system record the policy in the fsImage to prevent it from being changed? Similarly should the balancer check to see if it has the same policy as the NN?
          This can be done. This is mostly to reduce configuration errors, right? If so, can we defer it till we see it being a problem?

          > However the experimentation is useful and as long it does not impact the base code in a negative way, we should be able to add such features to hadoop after careful review.

          Thanks. Please review the code and provide some feedback if you so desire.
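To give a feel for the "write Java code to implement a new policy" step, here is a deliberately simplified skeleton. The abstract base below is hypothetical; the real BlockPlacementPolicy abstract class takes richer arguments (writer node, excluded nodes, block size, file information, cluster topology), so this only shows the overall shape of a plug-in policy, not the actual API.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical stand-in for the real abstract class; signatures are simplified.
    abstract class PlacementPolicySketch {
      // Choose target nodes (represented here as host names) for new replicas.
      abstract List<String> chooseTarget(String srcPath, int numReplicas, List<String> liveNodes);
    }

    // A deliberately naive policy: take the first N live nodes and ignore racks entirely.
    class FirstNNodesPolicy extends PlacementPolicySketch {
      @Override
      List<String> chooseTarget(String srcPath, int numReplicas, List<String> liveNodes) {
        List<String> targets = new ArrayList<String>();
        for (String node : liveNodes) {
          if (targets.size() >= numReplicas) {
            break;
          }
          targets.add(node);
        }
        return targets;
      }
    }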

          Robert Chansler added a comment -

          Experiments with replica placement are important, and I want to read the resulting papers. But a production implementation of an alternate policy has implications beyond just writing a class that implements a new interface. Perhaps we need something like a Black Box Warning to remind folk that using this feature to store real bytes is:

          • Dangerous
          • Really dangerous
          • Not something that will be supported in the future if it stands in the way of other development
          • Not something where others can be expected to volunteer help either to relieve the dangers or cope with the consequences.
          dhruba borthakur added a comment -

          I am all for marking this API as experimental and can change anytime.

          I will also open a JIRA (and patch) for exposing a fileid for an HDFS file.

          Sanjay Radia added a comment -

          Let me explain a little more on why I proposed the API to be experimental.
          For example, the filename parameter couples the namespace layer more closely to the block layer. This will make separating the block management out more difficult.
          FileIds (Hadoop-487) would be preferable.
          The other reason for making this API experimental is its impact on HDFS internal structure - we already find the block management code to be very messy and are trying to figure out how to change it while keeping the NN stable and reliable.

          HDFS-487 (fileIds) may not get resolved in a short time.
          If 487 is not settled, how can we better future-proof the current 385 proposal (but still mark it experimental)?
          One thought that comes to mind is changing the filename parameter to a cookie.
          Initially the cookie would be implemented using the filename and later could be changed to be the fileId.
          Thoughts on this approach?

          Sanjay Radia added a comment -

          >>The usage model is to define a policy for the entire cluster when you create the cluster. This is especially useful when you have an HDFS instance on Amazon EC2, for example. This is not intended to be dynamic in any shape or form for a given cluster.

          >Given the above should the system record the policy in the fsImage to prevent it from being changed?

          This is a little trickier than I originally thought. If the policy is fixed then it should be recorded in the fsimage at creation time - indeed the policy class name could be the policy name.
          On further thinking, we may want to allow compatible policies.
          So further thought is required on how to record the policy while allowing compatible changes.

          Shall we postpone the recording of the policy to another future JIRA?

          dhruba borthakur added a comment -

          The cookie approach has a major disadvantage. Let's assume that the cookie is implemented as a pathname for Hadoop 0.21. Then somebody builds an app using this API that parses the cookie as an HDFS pathname. For Hadoop 0.22, we change the cookie to be a fileid. The earlier application compiles against the Hadoop 0.22 release, but at runtime the app will fail because it is unable to parse the cookie as a pathname. Instead, if we change the API signature for Hadoop 0.22 to reflect that the pathname is not available anymore (instead a fileid is available), then the app will fail at compile time itself, which might be better than failing at runtime. No?

          dhruba borthakur added a comment -

          > Shall we postpone the recording of the policy to another future JIRA?

          +1. We should mark this API as experimental in nature and avoid persisting it in the fsimage.

          Hairong Kuang added a comment -

          Sorry for giving comments at this late stage. I was out of the country in the past few weeks.
          1. FsInodeName#getLocalName should return the full path name of the file not the name of the last component of the path. Better rename the method name to be getFullPathName. The implementation of this interface in INode should return the full path name as well. Also to allow FsInodeName to be able to return more information about a file in the future, better change the interface name to be FsInodeInfo.
          2. FsInodeName should also be a parameter to verifyBlockPlacement and isValidMove. This makes the interface consistent and makes it easier to get a file name for a placement policy like colocation.
          3. Would it provide better readability if the name of the class BlockPlacementPolicyDefault is changed to FirstLocalTwoRemoteBlockPlacementPolicy?
          4. For the default policy, isValidDelete should not be empty. It should implement the deletion policy in trunk. So other placement policy can override the default deletion policy.
          5. Even with the isValidMove API, the default balancer does not work with the colocation placement policy because it moves one block at a time. I'd like to propose removing the isValidMove API until we can figure out a general balancer implementation. For this round, the default balancer could check with the NameNode in the very beginning whether the NameNode runs the default block placement policy. If not, it stops running.
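Point 1 above amounts to an interface along the following lines. This is one reading of the suggestion, not the exact code in the committed patch, which may expose more than the path.

    // Sketch of the renamed interface suggested in point 1: a hook from the block layer
    // back to per-file information held by the namespace. The method set is an assumption.
    public interface FsInodeInfo {
      // Full path of the file whose block is being placed, e.g. "/user/foo/part-00000".
      String getFullPathName();
    }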

          Owen O'Malley added a comment -

          By the way, now that the balancer has been addressed, I remove my -1.

          dhruba borthakur added a comment -

          Incorporate review comments.

          dhruba borthakur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          dhruba borthakur added a comment -

          1. FsInodeName#getLocalName should return the full path name of the file not the name of the last component of the path. Better rename the method name to be getFullPathName. The implementation of this interface in INode should return the full path name as well. Also to allow FsInodeName to be able to return more information about a file in the future, better change the interface name to be FsInodeInfo.
          Done

          2. FsInodeName should also be a parameter to verifyBlockPlacement and isValidMove. This makes the interface consistent and makes it easier to get a file name for a placement policy like colocation.
          Done.

          3. Would it provide better readability if the name of the class BlockPlacementPolicyDefault is changed to FirstLocalTwoRemoteBlockPlacementPolicy?
          I left it as BlockPlacementPolicy and BlockPlacementPolicyDefault just because if we decide to change the default policy later on, we would not want to rename the file itself. Please let me know if this is acceptable to you.

          4. For the default policy, isValidDelete should not be empty. It should implement the deletion policy in trunk. So other placement policy can override the default deletion policy.
          Done

          5. Even with the isValidMove API, the default balancer does not work with the colocation placement policy because it moves one block at a time. I'd like to propose removing the isValidMove API until we can figure out a general balancer implementation. For this round, the default balancer could check with the NameNode in the very beginning whether the NameNode runs the default block placement policy. If not, it stops running.
          Done. I made the Balancer check that the configured policy is BlockPlacementPolicyDefault. I stayed away from adding an API to the NN at present, especially because this API is experimental in nature.

          Would appreciate it if you can review it one more time. Thanks.
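The Balancer guard described in the answer to point 5 can be pictured as a startup check along these lines. The key name, the default value, and the string comparison are assumptions made to keep the sketch self-contained; the actual patch may perform the check differently.

    import org.apache.hadoop.conf.Configuration;

    // Hypothetical sketch: refuse to run the Balancer unless the cluster is configured
    // with the default block placement policy.
    public class BalancerPolicyCheckSketch {
      public static void checkReplicationPolicyCompatibility(Configuration conf) {
        String configured = conf.get("dfs.block.replicator.classname",
            "BlockPlacementPolicyDefault");
        if (!configured.endsWith("BlockPlacementPolicyDefault")) {
          throw new UnsupportedOperationException(
              "Balancer only supports BlockPlacementPolicyDefault, but "
              + configured + " is configured.");
        }
      }
    }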

          dhruba borthakur made changes -
          Attachment BlockPlacementPluggable6.txt [ 12418319 ]
          dhruba borthakur added a comment -

          will some kind folks please review this one? thanks.

          dhruba borthakur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          -1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12418319/BlockPlacementPluggable6.txt
          against trunk revision 811493.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          -1 javadoc. The javadoc tool appears to have generated 1 warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/13/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/13/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/13/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/13/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          Fix javadoc warnings.

          dhruba borthakur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          Tags fb
          dhruba borthakur added a comment -

          Fix javadoc warnings.

          dhruba borthakur made changes -
          Attachment BlockPlacementPluggable7.txt [ 12418882 ]
          dhruba borthakur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          dhruba borthakur added a comment -

          dunno why hadoopQA tests did not trigger. Cancelling to trigger HudsonQA

          dhruba borthakur made changes -
          Status Patch Available [ 10002 ] Open [ 1 ]
          dhruba borthakur added a comment -

          Trigger HudsonQA

          dhruba borthakur made changes -
          Status Open [ 1 ] Patch Available [ 10002 ]
          Hadoop QA added a comment -

          +1 overall. Here are the results of testing the latest attachment
          http://issues.apache.org/jira/secure/attachment/12418882/BlockPlacementPluggable7.txt
          against trunk revision 813101.

          +1 @author. The patch does not contain any @author tags.

          +1 tests included. The patch appears to include 3 new or modified tests.

          +1 javadoc. The javadoc tool did not generate any warning messages.

          +1 javac. The applied patch does not increase the total number of javac compiler warnings.

          +1 findbugs. The patch does not introduce any new Findbugs warnings.

          +1 release audit. The applied patch does not increase the total number of release audit warnings.

          +1 core tests. The patch passed core unit tests.

          +1 contrib tests. The patch passed contrib unit tests.

          Test results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/21/testReport/
          Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/21/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
          Checkstyle results: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/21/artifact/trunk/build/test/checkstyle-errors.html
          Console output: http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/21/console

          This message is automatically generated.

          dhruba borthakur added a comment -

          This patch is ready to be committed. This has been reviewed by Tom, Sanjay and Hairong, but if somebody wants to review it again, now is the time to do so. Thanks.

          Sanjay Radia added a comment -

          Reviewed the patch.
          +1

          dhruba borthakur added a comment -

          I will commit this in a day.

          dhruba borthakur added a comment -

          I just committed this.

          dhruba borthakur made changes -
          Status Patch Available [ 10002 ] Resolved [ 5 ]
          Hadoop Flags [Reviewed]
          Release Note An experimental API that allows an module external to HDFS to specify the policy that HDFS should use to allocate new blocks replicas of a file.
          Resolution Fixed [ 1 ]
          Tags fb
          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk-Commit #34 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk-Commit/34/)
          HDFS-385. Add support for an API that allows a module external
          to HDFS to specify how HDFS blocks should be placed. (dhruba)

          Hudson added a comment -

          Integrated in Hadoop-Hdfs-trunk #84 (See http://hudson.zones.apache.org/hudson/job/Hadoop-Hdfs-trunk/84/)
          HDFS-385. Add support for an API that allows a module external
          to HDFS to specify how HDFS blocks should be placed. (dhruba)

          Hudson added a comment -

          Integrated in Hdfs-Patch-h5.grid.sp2.yahoo.net #29 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h5.grid.sp2.yahoo.net/29/)
          HDFS-385. Add support for an API that allows a module external
          to HDFS to specify how HDFS blocks should be placed. (dhruba)

          Konstantin Shvachko made changes -
          Link This issue relates to HDFS-629 [ HDFS-629 ]
          Robert Chansler added a comment -

          Is it possible this change led to the difficulty reported in HDFS-623?

          Hudson added a comment -

          Integrated in Hdfs-Patch-h2.grid.sp2.yahoo.net #11 (See http://hudson.zones.apache.org/hudson/job/Hdfs-Patch-h2.grid.sp2.yahoo.net/11/)

          Robert Chansler added a comment -

          Editorial pass over all release notes prior to publication of 0.21.

          Robert Chansler made changes -
          Release Note An experimental API that allows an module external to HDFS to specify the policy that HDFS should use to allocate new blocks replicas of a file. New experimental API BlockPlacementPolicy allows investigating alternate rules for locating block replicas.
          Tom White made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          Jeff Hammerbacher made changes -
          Link This issue relates to HDFS-1451 [ HDFS-1451 ]
          Tsz Wo Nicholas Sze added a comment -

          Created HDFS-1629 to slightly change BlockPlacementPolicy. See if you would like to review the patch.

          Harsh J made changes -
          Link This issue is duplicated by HDFS-3564 [ HDFS-3564 ]
          Tsz Wo Nicholas Sze made changes -
          Link This issue relates to HDFS-3564 [ HDFS-3564 ]
          Tsz Wo Nicholas Sze made changes -
          Link This issue is duplicated by HDFS-3564 [ HDFS-3564 ]
          Sumadhur Reddy Bolli added a comment -

          Patches to port the pluggable interface to branch-1 and branch-1-win.

          Sumadhur Reddy Bolli made changes -
          Attachment blockplacementpolicy-branch-1.patch [ 12536257 ]
          Attachment blockplacementpolicy-branch-1-win.patch [ 12536258 ]
          Sumadhur Reddy Bolli made changes -
          Link This issue relates to HDFS-3649 [ HDFS-3649 ]
          stack added a comment -

          Patch for 0.94 (Build #359 failed because of this)

          stack made changes -
          Attachment rat094.txt [ 12537666 ]
          stack added a comment -

          Ignore above attachment. Wrong issue (I don't seem to have delete perms).

          Tsz Wo Nicholas Sze added a comment -

          Hi Suma, BlockPlacementPolicyDefault is not found in the patch. Forgot to add the new files?

          Sumadhur Reddy Bolli added a comment -

          Sorry. I missed newly added files in the previous patches. Submitted new patches with all files. Please review.

          Sumadhur Reddy Bolli made changes -
          Attachment blockplacementpolicy2-branch-1.patch [ 12538415 ]
          Attachment blockplacementpolicy2-branch-1-win.patch [ 12538416 ]
          Tsz Wo Nicholas Sze added a comment -

          Thanks for the update. Some comments:

          • Add "srcPath" as a parameter of FSNamesystem.chooseDatanode(..) and then pass "path" in NamenodeWebHdfsMethods.chooseDatanode(..).
          • Add the following comments to FSNamesystem.computeReplicationWorkForBlock(..) as in the original patch.
                 // choose replication targets: NOT HOLDING THE GLOBAL LOCK
            +    // It is costly to extract the filename for which chooseTargets is called,
            +    // so for now we pass in the Inode itself.
            
          • Please remove the tabs since we do not use tabs in Hadoop. Indentation is two spaces.

          Please also run all the unit tests (ant test-core) since Jenkins doesn't work for branch-1. Thanks!

          Sumadhur Reddy Bolli added a comment -

          Thanks for the feedback Nicholas. Addressed all the comments in the new patches. Please review.

          Sumadhur Reddy Bolli made changes -
          Attachment blockplacementpolicy3-branch-1.patch [ 12538967 ]
          Attachment blockplacementpolicy3-branch-1-win.patch [ 12538968 ]
          Sumadhur Reddy Bolli added a comment -

          I also ran all the unit tests and ensured that they pass.

          Tsz Wo Nicholas Sze added a comment -

          +1 the branch-1 patch looks good.

          Tsz Wo Nicholas Sze added a comment -

          I have committed the branch-1 and branch-1-win patches. Thanks, Suma!

          Suresh Srinivas made changes -
          Fix Version/s 1.2.0 [ 12321657 ]
          Fix Version/s 1-win [ 12320362 ]

            People

            • Assignee: dhruba borthakur
            • Reporter: dhruba borthakur
            • Votes: 4
            • Watchers: 36
