Details
Improvement
Status: Closed
Minor
Resolution: Fixed
Reviewed
Description
In versions before 2.6, the Balancer moves blocks between Datanodes.
In versions 2.6 and later, the Balancer moves blocks between StorageGroups (introduced by HDFS-6584).
The function DBlock.isLocatedOn(StorageGroup loc), in class DBlock extends Locations<StorageGroup>, is flawed and may cause 2 replicas to end up on the same node after running the Balancer.
For example:
We have 2 nodes. Each node has two storages.
We have (DN0, SSD), (DN0, DISK), (DN1, SSD), (DN1, DISK).
We have a block with ONE_SSD storage policy.
The block has 2 replicas. They are in (DN0,SSD) and (DN1,DISK).
The replica in (DN0,SSD) should not be moved to (DN1,SSD) when the Balancer runs.
Otherwise DN1 would hold 2 replicas of the same block.
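The example above can be sketched in a few lines of Java. This is only an illustration, not the Hadoop code: the StorageGroup and DBlock classes below are simplified, hypothetical stand-ins, and the sketch assumes the flaw is that the location check compares (datanode, storage type) pairs instead of datanodes. The method names isLocatedOnStorage and isLocatedOnNode are invented for the comparison.

```java
import java.util.ArrayList;
import java.util.List;

public class BalancerLocationSketch {
    /** Hypothetical stand-in for a (datanode, storage type) pair. */
    static final class StorageGroup {
        final String datanode;
        final String storageType;

        StorageGroup(String datanode, String storageType) {
            this.datanode = datanode;
            this.storageType = storageType;
        }
    }

    /** Hypothetical stand-in for a block and its replica locations. */
    static final class DBlock {
        private final List<StorageGroup> locations = new ArrayList<>();

        void addLocation(StorageGroup g) {
            locations.add(g);
        }

        /** Assumed flawed check: matches only the exact (datanode, storage type) pair. */
        boolean isLocatedOnStorage(StorageGroup loc) {
            for (StorageGroup g : locations) {
                if (g.datanode.equals(loc.datanode)
                        && g.storageType.equals(loc.storageType)) {
                    return true;
                }
            }
            return false;
        }

        /** Node-level check: matches any replica on the same datanode. */
        boolean isLocatedOnNode(StorageGroup loc) {
            for (StorageGroup g : locations) {
                if (g.datanode.equals(loc.datanode)) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        // Replicas on (DN0,SSD) and (DN1,DISK), as in the example.
        DBlock block = new DBlock();
        block.addLocation(new StorageGroup("DN0", "SSD"));
        block.addLocation(new StorageGroup("DN1", "DISK"));

        // Candidate target for moving the (DN0,SSD) replica.
        StorageGroup target = new StorageGroup("DN1", "SSD");

        // Storage-level check does not see the replica already on DN1.
        System.out.println("storage-level: " + block.isLocatedOnStorage(target));
        // Node-level check sees it, so the move would not be scheduled.
        System.out.println("node-level: " + block.isLocatedOnNode(target));
    }
}
```

With the storage-level check, (DN1,SSD) looks like a valid target even though DN1 already holds a replica on DISK. A node-level check rules the target out before any move is scheduled.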
--------------
UPDATE (thanks szetszwo for pointing it out):
This bug will NOT cause 2 replicas to end up on the same node after running the Balancer, because the Datanode rejects the move.
Instead, we see a lot of ERROR messages when running the test.
2015-04-27 10:08:15,809 ERROR datanode.DataNode (DataXceiver.java:run(277)) - host1.foo.com:59537:DataXceiver error processing REPLACE_BLOCK operation src: /127.0.0.1:52532 dst: /127.0.0.1:59537
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-264794661-9.96.1.34-1430100451121:blk_1073741825_1001 already exists in state FINALIZED and thus cannot be created.
    at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.createTemporary(FsDatasetImpl.java:1447)
    at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.<init>(BlockReceiver.java:186)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.replaceBlock(DataXceiver.java:1158)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.opReplaceBlock(Receiver.java:229)
    at org.apache.hadoop.hdfs.protocol.datatransfer.Receiver.processOp(Receiver.java:77)
    at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:250)
    at java.lang.Thread.run(Thread.java:722)
In the test, the Balancer runs 5 to 20 iterations before it exits.
This is inefficient.
The Balancer should not schedule such a move in the first place, even though the move will fail anyway. In the test, it should exit after 5 iterations.