Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Description
Upgrade versions:
Pre-upgrade hash: https://github.com/apache/ozone/commit/6ee6c357678676661ebb3181a56622c79b487bc1
Post-upgrade hash: https://github.com/apache/ozone/commit/46b6f3def1d84ca769affb4d3f0d84dece6e8567
Scenario:
Write an EC file (5 GB) with the RS-3-2-1024K policy (in this case) before the upgrade. After the upgrade, shut down either 2 parity nodes (this case) or 2 data nodes, since the policy tolerates 2 DN failures. Check whether reconstruction happens after some time.
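To make the failure tolerance in the scenario concrete, here is a minimal sketch that parses a policy name such as RS-3-2-1024K and checks how many lost replicas it can survive. The helper names (`ECPolicy`, `parse_ec_policy`, `is_recoverable`) are hypothetical and not part of Ozone; the only assumption is the usual Reed-Solomon property that a group survives the loss of up to `parity` replicas.

```python
from typing import NamedTuple


class ECPolicy(NamedTuple):
    codec: str
    data: int
    parity: int
    chunk_size_bytes: int


def parse_ec_policy(name: str) -> ECPolicy:
    """Parse a policy name of the form CODEC-DATA-PARITY-<N>K, e.g. RS-3-2-1024K."""
    codec, data, parity, chunk = name.upper().split("-")
    if not chunk.endswith("K"):
        raise ValueError(f"unexpected chunk size suffix in {name!r}")
    return ECPolicy(codec, int(data), int(parity), int(chunk[:-1]) * 1024)


def is_recoverable(policy: ECPolicy, lost_replicas: int) -> bool:
    """A Reed-Solomon group survives the loss of up to `parity` replicas."""
    return 0 <= lost_replicas <= policy.parity


policy = parse_ec_policy("RS-3-2-1024K")
print(policy)
print(is_recoverable(policy, 2))  # two stopped DNs, as in this scenario
print(is_recoverable(policy, 3))  # one more loss than the policy tolerates
```

With 3 data and 2 parity replicas, stopping two DNs leaves exactly the 3 replicas needed to rebuild the other two, which is why reconstruction is expected here.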
Observed Behavior:
1. Data was successfully written pre-upgrade using Freon.
File name: o3://ozone1711558189/ec-construct-vol/ec-construct-buck/ec-construction/0
2. Post-upgrade, stop two of the DNs, in this case the parity nodes obtained from one of the containers storing the above file's data.
ozone admin container info 1004 --json
2024-03-27 21:35:15,065|INFO|MainThread|machine.py:232 - run()||GUID=183f2d10-e3a7-407f-adb5-b87f3e3af53b|Exit Code: 0
2024-03-27 21:35:15,098|INFO|MainThread|ozone.py:723 - find_ec_data_parity_hosts()|parity hosts: ['DN-4', 'DN-3']
2024-03-27 21:35:15,098|INFO|MainThread|ozone.py:724 - find_ec_data_parity_hosts()|data hosts: ['DN-8', 'DN-5', 'DN-1']
2024-03-27 21:35:15,311|INFO|MainThread|cm_apilib.py:1214 - stopComponent()|Initiating stop of OZONE_DATANODE at host DN-4
2024-03-27 21:35:15,349|INFO|MainThread|cm_apilib.py:1218 - stopComponent()|Command name = Stop , ID = 2860
2024-03-27 21:35:15,580|INFO|MainThread|cm_apilib.py:1214 - stopComponent()|Initiating stop of OZONE_DATANODE at host DN-3
2024-03-27 21:35:15,609|INFO|MainThread|cm_apilib.py:1218 - stopComponent()|Command name = Stop , ID = 2862
Nodes DN-3 and DN-4 are stopped.
3. Read the file's data (online reconstruction) and compute the checksum: it matched.
4. Wait for reconstruction to happen. The test waited for 20 minutes, but only 3 DNs were present even after that:
['DN-5', 'DN-1', 'DN-8']
In fact, after 10 hours (at the time of writing), there are still only 3 DNs:
date Thu Mar 28 08:39:16 UTC 2024 ozone admin container info 1004 --json { "containerInfo" : { "state" : "CLOSED", "stateEnterTime" : "2024-03-27T18:43:51.934Z", "replicationConfig" : { "data" : 3, "parity" : 2, "ecChunkSize" : 1048576, "codec" : "RS", "requiredNodes" : 5, "replicationType" : "EC" }, "usedBytes" : 1342177280, "numberOfKeys" : 5, "lastUsed" : "2024-03-28T08:39:24.535189Z", "owner" : "om1", "containerID" : 1004, "deleteTransactionId" : 0, "sequenceId" : 0, "deleted" : false, "open" : false }, "pipeline" : { "id" : { "id" : "73532c14-40ac-4924-9353-2f18ab0d63f2" }, "replicationConfig" : { "data" : 3, "parity" : 2, "ecChunkSize" : 1048576, "codec" : "RS", "requiredNodes" : 5, "replicationType" : "EC" }, "nodesInOrder" : [ { "level" : 0, "cost" : 0, "uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880", "uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880", "ipAddress" : "10.140.37.12", "hostName" : "DN-5", "ports" : [ { "name" : "HTTPS", "value" : 9883 }, { "name" : "CLIENT_RPC", "value" : 9864 }, { "name" : "REPLICATION", "value" : 9886 }, { "name" : "RATIS", "value" : 9858 }, { "name" : "RATIS_ADMIN", "value" : 9857 }, { "name" : "RATIS_SERVER", "value" : 9856 }, { "name" : "STANDALONE", "value" : 9859 } ], "setupTime" : 0, "persistedOpState" : "IN_SERVICE", "persistedOpStateExpiryEpochSec" : 0, "initialVersion" : 0, "currentVersion" : 1, "decommissioned" : false, "maintenance" : false, "signature" : -662262523, "networkLocation" : "/default", "networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880", "networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880", "numOfLeaves" : 1 }, { "level" : 0, "cost" : 0, "uuid" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c", "uuidString" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c", "ipAddress" : "10.140.40.9", "hostName" : "DN-1", "ports" : [ { "name" : "HTTPS", "value" : 9883 }, { "name" : "CLIENT_RPC", "value" : 9864 }, { "name" : "REPLICATION", "value" : 9886 }, { "name" : "RATIS", "value" : 9858 }, { "name" : 
"RATIS_ADMIN", "value" : 9857 }, { "name" : "RATIS_SERVER", "value" : 9856 }, { "name" : "STANDALONE", "value" : 9859 } ], "setupTime" : 0, "persistedOpState" : "IN_SERVICE", "persistedOpStateExpiryEpochSec" : 0, "initialVersion" : 0, "currentVersion" : 1, "decommissioned" : false, "maintenance" : false, "signature" : -1387859873, "networkLocation" : "/default", "networkName" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c", "networkFullPath" : "/default/d8afb52b-5f4c-4d94-9286-7c3cfd6c315c", "numOfLeaves" : 1 }, { "level" : 0, "cost" : 0, "uuid" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e", "uuidString" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e", "ipAddress" : "10.140.137.128", "hostName" : "DN-8", "ports" : [ { "name" : "HTTPS", "value" : 9883 }, { "name" : "CLIENT_RPC", "value" : 9864 }, { "name" : "REPLICATION", "value" : 9886 }, { "name" : "RATIS", "value" : 9858 }, { "name" : "RATIS_ADMIN", "value" : 9857 }, { "name" : "RATIS_SERVER", "value" : 9856 }, { "name" : "STANDALONE", "value" : 9859 } ], "setupTime" : 0, "persistedOpState" : "IN_SERVICE", "persistedOpStateExpiryEpochSec" : 0, "initialVersion" : 0, "currentVersion" : 1, "decommissioned" : false, "maintenance" : false, "signature" : 1098159392, "networkLocation" : "/default", "networkName" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e", "networkFullPath" : "/default/ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e", "numOfLeaves" : 1 } ], "creationTimestamp" : "2024-03-28T08:39:24.480Z", "stateEnterTime" : "2024-03-28T08:39:24.545517Z", "leaderNode" : { "level" : 0, "cost" : 0, "uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880", "uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880", "ipAddress" : "10.140.37.12", "hostName" : "DN-5", "ports" : [ { "name" : "HTTPS", "value" : 9883 }, { "name" : "CLIENT_RPC", "value" : 9864 }, { "name" : "REPLICATION", "value" : 9886 }, { "name" : "RATIS", "value" : 9858 }, { "name" : "RATIS_ADMIN", "value" : 9857 }, { "name" : "RATIS_SERVER", "value" : 9856 }, { "name" : "STANDALONE", "value" : 9859 } ], 
"setupTime" : 0, "persistedOpState" : "IN_SERVICE", "persistedOpStateExpiryEpochSec" : 0, "initialVersion" : 0, "currentVersion" : 1, "decommissioned" : false, "maintenance" : false, "signature" : -662262523, "networkLocation" : "/default", "networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880", "networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880", "numOfLeaves" : 1 }, "firstNode" : { "level" : 0, "cost" : 0, "uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880", "uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880", "ipAddress" : "10.140.37.12", "hostName" : "DN-5", "ports" : [ { "name" : "HTTPS", "value" : 9883 }, { "name" : "CLIENT_RPC", "value" : 9864 }, { "name" : "REPLICATION", "value" : 9886 }, { "name" : "RATIS", "value" : 9858 }, { "name" : "RATIS_ADMIN", "value" : 9857 }, { "name" : "RATIS_SERVER", "value" : 9856 }, { "name" : "STANDALONE", "value" : 9859 } ], "setupTime" : 0, "persistedOpState" : "IN_SERVICE", "persistedOpStateExpiryEpochSec" : 0, "initialVersion" : 0, "currentVersion" : 1, "decommissioned" : false, "maintenance" : false, "signature" : -662262523, "networkLocation" : "/default", "networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880", "networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880", "numOfLeaves" : 1 }, "closestNode" : { "level" : 0, "cost" : 0, "uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880", "uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880", "ipAddress" : "10.140.37.12", "hostName" : "DN-5", "ports" : [ { "name" : "HTTPS", "value" : 9883 }, { "name" : "CLIENT_RPC", "value" : 9864 }, { "name" : "REPLICATION", "value" : 9886 }, { "name" : "RATIS", "value" : 9858 }, { "name" : "RATIS_ADMIN", "value" : 9857 }, { "name" : "RATIS_SERVER", "value" : 9856 }, { "name" : "STANDALONE", "value" : 9859 } ], "setupTime" : 0, "persistedOpState" : "IN_SERVICE", "persistedOpStateExpiryEpochSec" : 0, "initialVersion" : 0, "currentVersion" : 1, "decommissioned" : false, "maintenance" : false, "signature" : -662262523, 
"networkLocation" : "/default", "networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880", "networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880", "numOfLeaves" : 1 }, "allocationTimeout" : false, "healthy" : true, "pipelineState" : "ALLOCATED", "nodes" : [ { "level" : 0, "cost" : 0, "uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880", "uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880", "ipAddress" : "10.140.37.12", "hostName" : "DN-5", "ports" : [ { "name" : "HTTPS", "value" : 9883 }, { "name" : "CLIENT_RPC", "value" : 9864 }, { "name" : "REPLICATION", "value" : 9886 }, { "name" : "RATIS", "value" : 9858 }, { "name" : "RATIS_ADMIN", "value" : 9857 }, { "name" : "RATIS_SERVER", "value" : 9856 }, { "name" : "STANDALONE", "value" : 9859 } ], "setupTime" : 0, "persistedOpState" : "IN_SERVICE", "persistedOpStateExpiryEpochSec" : 0, "initialVersion" : 0, "currentVersion" : 1, "decommissioned" : false, "maintenance" : false, "signature" : -662262523, "networkLocation" : "/default", "networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880", "networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880", "numOfLeaves" : 1 }, { "level" : 0, "cost" : 0, "uuid" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c", "uuidString" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c", "ipAddress" : "10.140.40.9", "hostName" : "DN-1", "ports" : [ { "name" : "HTTPS", "value" : 9883 }, { "name" : "CLIENT_RPC", "value" : 9864 }, { "name" : "REPLICATION", "value" : 9886 }, { "name" : "RATIS", "value" : 9858 }, { "name" : "RATIS_ADMIN", "value" : 9857 }, { "name" : "RATIS_SERVER", "value" : 9856 }, { "name" : "STANDALONE", "value" : 9859 } ], "setupTime" : 0, "persistedOpState" : "IN_SERVICE", "persistedOpStateExpiryEpochSec" : 0, "initialVersion" : 0, "currentVersion" : 1, "decommissioned" : false, "maintenance" : false, "signature" : -1387859873, "networkLocation" : "/default", "networkName" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c", "networkFullPath" : 
"/default/d8afb52b-5f4c-4d94-9286-7c3cfd6c315c", "numOfLeaves" : 1 }, { "level" : 0, "cost" : 0, "uuid" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e", "uuidString" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e", "ipAddress" : "10.140.137.128", "hostName" : "DN-8", "ports" : [ { "name" : "HTTPS", "value" : 9883 }, { "name" : "CLIENT_RPC", "value" : 9864 }, { "name" : "REPLICATION", "value" : 9886 }, { "name" : "RATIS", "value" : 9858 }, { "name" : "RATIS_ADMIN", "value" : 9857 }, { "name" : "RATIS_SERVER", "value" : 9856 }, { "name" : "STANDALONE", "value" : 9859 } ], "setupTime" : 0, "persistedOpState" : "IN_SERVICE", "persistedOpStateExpiryEpochSec" : 0, "initialVersion" : 0, "currentVersion" : 1, "decommissioned" : false, "maintenance" : false, "signature" : 1098159392, "networkLocation" : "/default", "networkName" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e", "networkFullPath" : "/default/ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e", "numOfLeaves" : 1 } ], "empty" : false, "type" : "EC" }, "replicas" : [ { "containerID" : 1004, "state" : "CLOSED", "datanodeDetails" : { "level" : 0, "cost" : 0, "uuid" : "6179347f-5824-41d4-b722-f1dbc5f14880", "uuidString" : "6179347f-5824-41d4-b722-f1dbc5f14880", "ipAddress" : "10.140.37.12", "hostName" : "DN-5z", "ports" : [ { "name" : "HTTPS", "value" : 9883 }, { "name" : "CLIENT_RPC", "value" : 9864 }, { "name" : "REPLICATION", "value" : 9886 }, { "name" : "RATIS", "value" : 9858 }, { "name" : "RATIS_ADMIN", "value" : 9857 }, { "name" : "RATIS_SERVER", "value" : 9856 }, { "name" : "STANDALONE", "value" : 9859 } ], "setupTime" : 0, "persistedOpState" : "IN_SERVICE", "persistedOpStateExpiryEpochSec" : 0, "initialVersion" : 0, "currentVersion" : 1, "decommissioned" : false, "maintenance" : false, "signature" : -662262523, "networkLocation" : "/default", "networkName" : "6179347f-5824-41d4-b722-f1dbc5f14880", "networkFullPath" : "/default/6179347f-5824-41d4-b722-f1dbc5f14880", "numOfLeaves" : 1 }, "placeOfBirth" : 
"6179347f-5824-41d4-b722-f1dbc5f14880", "sequenceId" : 0, "keyCount" : 5, "bytesUsed" : 1342177280, "replicaIndex" : 2 }, { "containerID" : 1004, "state" : "CLOSED", "datanodeDetails" : { "level" : 0, "cost" : 0, "uuid" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c", "uuidString" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c", "ipAddress" : "10.140.40.9", "hostName" : "DN-1", "ports" : [ { "name" : "HTTPS", "value" : 9883 }, { "name" : "CLIENT_RPC", "value" : 9864 }, { "name" : "REPLICATION", "value" : 9886 }, { "name" : "RATIS", "value" : 9858 }, { "name" : "RATIS_ADMIN", "value" : 9857 }, { "name" : "RATIS_SERVER", "value" : 9856 }, { "name" : "STANDALONE", "value" : 9859 } ], "setupTime" : 0, "persistedOpState" : "IN_SERVICE", "persistedOpStateExpiryEpochSec" : 0, "initialVersion" : 0, "currentVersion" : 1, "decommissioned" : false, "maintenance" : false, "signature" : -1387859873, "networkLocation" : "/default", "networkName" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c", "networkFullPath" : "/default/d8afb52b-5f4c-4d94-9286-7c3cfd6c315c", "numOfLeaves" : 1 }, "placeOfBirth" : "d8afb52b-5f4c-4d94-9286-7c3cfd6c315c", "sequenceId" : 0, "keyCount" : 5, "bytesUsed" : 1342177280, "replicaIndex" : 3 }, { "containerID" : 1004, "state" : "CLOSED", "datanodeDetails" : { "level" : 0, "cost" : 0, "uuid" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e", "uuidString" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e", "ipAddress" : "10.140.137.128", "hostName" : "DN-8", "ports" : [ { "name" : "HTTPS", "value" : 9883 }, { "name" : "CLIENT_RPC", "value" : 9864 }, { "name" : "REPLICATION", "value" : 9886 }, { "name" : "RATIS", "value" : 9858 }, { "name" : "RATIS_ADMIN", "value" : 9857 }, { "name" : "RATIS_SERVER", "value" : 9856 }, { "name" : "STANDALONE", "value" : 9859 } ], "setupTime" : 0, "persistedOpState" : "IN_SERVICE", "persistedOpStateExpiryEpochSec" : 0, "initialVersion" : 0, "currentVersion" : 1, "decommissioned" : false, "maintenance" : false, "signature" : 1098159392, "networkLocation" : 
"/default", "networkName" : "ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e", "networkFullPath" : "/default/ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e", "numOfLeaves" : 1 }, "placeOfBirth" : "711656cf-a99e-4b2c-8c35-f015ee94889c", "sequenceId" : 0, "keyCount" : 5, "bytesUsed" : 1342177280, "replicaIndex" : 1 } ] }
The SCM logs show that SCM is still sending reconstructECContainersCommand:
2024-03-28 08:36:56,748 INFO [Under Replicated Processor]-org.apache.hadoop.hdds.scm.container.replication.ReplicationManager: Sending command [reconstructECContainersCommand: containerID: 1004, replicationConfig: EC{rs-3-2-1024k}, sources: [ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e(DN-8/10.140.137.128) replicaIndex: 1, 6179347f-5824-41d4-b722-f1dbc5f14880(DN-5/10.140.37.12) replicaIndex: 2, d8afb52b-5f4c-4d94-9286-7c3cfd6c315c(DN-1/10.140.40.9) replicaIndex: 3], targets: [572ed33d-a834-4d80-be35-7b1b19c8bd74(DN-7/10.140.234.130), 711656cf-a99e-4b2c-8c35-f015ee94889c(DN-2/10.140.45.129)], missingIndexes: [4, 5]] for container ContainerInfo{id=#1004, state=CLOSED, stateEnterTime=2024-03-27T18:43:51.934Z, pipelineID=PipelineID=53f5587f-9e6c-465d-a0cb-b82d10c227d3, owner=om1} to 572ed33d-a834-4d80-be35-7b1b19c8bd74(DN-7/10.140.234.130) with datanode deadline 1711615886747 and scm deadline 1711615916747
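The two deadlines in the command above are epoch timestamps in milliseconds. The following sketch decodes them; `epoch_ms_to_utc` is a hypothetical helper, and the comparison with the log timestamp assumes the SCM log is written in UTC.

```python
from datetime import datetime, timedelta, timezone


def epoch_ms_to_utc(ms: int) -> datetime:
    """Convert epoch milliseconds to an aware UTC datetime using integer math."""
    seconds, millis = divmod(ms, 1000)
    return datetime.fromtimestamp(seconds, tz=timezone.utc) + timedelta(milliseconds=millis)


# Values copied from the SCM log line above
dn_deadline = epoch_ms_to_utc(1711615886747)
scm_deadline = epoch_ms_to_utc(1711615916747)

print(dn_deadline.isoformat())   # 2024-03-28T08:51:26.747000+00:00
print(scm_deadline.isoformat())  # 2024-03-28T08:51:56.747000+00:00
print(scm_deadline - dn_deadline)  # 0:00:30
```

If the log timestamps are UTC, the datanode deadline falls roughly 14.5 minutes after the command was sent at 08:36:56, and the SCM deadline is 30 seconds later.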
One of the target DNs, DN-7, is logging the warnings below.
2024-03-28 08:37:14,982 WARN [ContainerReplicationThread-5]-org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask: FAILED reconstructECContainersCommand: containerID=1004, replication=rs-3-2-1024k, missingIndexes=[4, 5], sources={1=ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e(DN-8/10.140.137.128), 2=6179347f-5824-41d4-b722-f1dbc5f14880(DN-5/10.140.37.12), 3=d8afb52b-5f4c-4d94-9286-7c3cfd6c315c(DN-1/10.140.40.9)}, targets={4=572ed33d-a834-4d80-be35-7b1b19c8bd74(DN-7/10.140.234.130), 5=711656cf-a99e-4b2c-8c35-f015ee94889c(DN-2/10.140.45.129)} after 10639 ms
java.io.IOException: None of the block data have checksum which means 2(parity)+1 blocks are not present
    at org.apache.hadoop.hdds.scm.storage.ECBlockOutputStream.executePutBlock(ECBlockOutputStream.java:156)
    at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECBlockGroup(ECReconstructionCoordinator.java:325)
    at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinator.reconstructECContainerGroup(ECReconstructionCoordinator.java:171)
    at org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask.runTask(ECReconstructionCoordinatorTask.java:68)
    at org.apache.hadoop.ozone.container.replication.ReplicationSupervisor$TaskRunner.run(ReplicationSupervisor.java:359)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)
2024-03-28 08:37:14,982 WARN [ContainerReplicationThread-5]-org.apache.hadoop.ozone.container.replication.ReplicationSupervisor: Failed FAILED reconstructECContainersCommand: containerID=1004, replication=rs-3-2-1024k, missingIndexes=[4, 5], sources={1=ef7ae3e9-5ec3-49d6-9b93-1c687009bc1e(DN-8/10.140.137.128), 2=6179347f-5824-41d4-b722-f1dbc5f14880(DN-5/10.140.37.12), 3=d8afb52b-5f4c-4d94-9286-7c3cfd6c315c(DN-1/10.140.40.9)}, targets={4=572ed33d-a834-4d80-be35-7b1b19c8bd74(DN-7/10.140.234.130), 5=711656cf-a99e-4b2c-8c35-f015ee94889c(DN-2/10.140.45.129)}
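To find how often a target DN hits this failure, its log can be scanned for the FAILED line. The sketch below is written against the exact message format shown above; the regular expression and the helper name `parse_failed_reconstruction` are assumptions for illustration and may need adjusting for other Ozone versions.

```python
import re
from typing import List, Optional, Tuple

# Pattern modelled on the WARN line shown above (an assumption, not an Ozone API).
FAILED_RE = re.compile(
    r"FAILED reconstructECContainersCommand: containerID=(?P<cid>\d+), "
    r"replication=(?P<repl>[\w-]+), missingIndexes=\[(?P<missing>[\d, ]*)\]"
)


def parse_failed_reconstruction(line: str) -> Optional[Tuple[int, str, List[int]]]:
    """Return (containerID, replication, missingIndexes), or None if the line does not match."""
    match = FAILED_RE.search(line)
    if match is None:
        return None
    missing = [int(x) for x in match.group("missing").split(",") if x.strip()]
    return int(match.group("cid")), match.group("repl"), missing


# Shortened copy of the DN-7 warning; the sources/targets part is omitted here.
sample = (
    "2024-03-28 08:37:14,982 WARN [ContainerReplicationThread-5]-"
    "org.apache.hadoop.ozone.container.ec.reconstruction.ECReconstructionCoordinatorTask: "
    "FAILED reconstructECContainersCommand: containerID=1004, "
    "replication=rs-3-2-1024k, missingIndexes=[4, 5], after 10639 ms"
)
print(parse_failed_reconstruction(sample))  # (1004, 'rs-3-2-1024k', [4, 5])
```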
Expected Behavior: Reconstruction should have happened
Note: This reproduces fairly consistently.
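One way to check the expected behavior is to compare the replica indexes reported by `ozone admin container info <id> --json` against the indexes the replication config requires. The sketch below uses only fields that appear in the output quoted in this report (`containerInfo.replicationConfig.data`, `.parity`, and `replicas[].replicaIndex`); `missing_replica_indexes` is a hypothetical helper, and the sample is a trimmed-down stand-in for the real output.

```python
import json
from typing import List


def missing_replica_indexes(container_info: dict) -> List[int]:
    """List the EC replica indexes (1-based) with no reported replica."""
    cfg = container_info["containerInfo"]["replicationConfig"]
    required = cfg["data"] + cfg["parity"]
    present = {replica["replicaIndex"] for replica in container_info["replicas"]}
    return [index for index in range(1, required + 1) if index not in present]


# Trimmed-down sample carrying only the fields used above.
sample = json.loads("""
{
  "containerInfo": {"replicationConfig": {"data": 3, "parity": 2}},
  "replicas": [
    {"replicaIndex": 2},
    {"replicaIndex": 3},
    {"replicaIndex": 1}
  ]
}
""")
print(missing_replica_indexes(sample))  # [4, 5]
```

An empty list would indicate that reconstruction completed; here the result matches the missingIndexes: [4, 5] seen in the SCM command.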
cc: siddhant