[HDDS-5971] [disabled] TestHDDSUpgrade fails to allocate pipeline after finalization - ASF JIRA

Details

Type: Sub-task
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: None
Fix Version/s: None
Component/s: None
Labels: None

Description

TestHDDSUpgrade frequently hits the Maven global test timeout (about 1 hr), causing the integration (filesystem-hdds) check to fail. The class's JUnit timeout is set to 11000000 ms (just over 3 hrs), so the Maven limit is reached first.
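The relationship between the two limits can be checked with a few lines of Java. This is only a sketch of the arithmetic: the class and method names are made up for illustration, and the one-hour Maven figure is the approximate value quoted above, not a setting read from the build.

```java
import java.time.Duration;

final class TimeoutComparison {
    // JUnit class timeout quoted in the description.
    static final Duration JUNIT_CLASS_TIMEOUT = Duration.ofMillis(11_000_000L);
    // Approximate Maven global test timeout quoted in the description (assumed to be 1 hour).
    static final Duration MAVEN_GLOBAL_TIMEOUT = Duration.ofHours(1);

    /** Returns true when the outer limit expires before the inner one can fire. */
    static boolean outerLimitFiresFirst(Duration inner, Duration outer) {
        return outer.compareTo(inner) < 0;
    }

    public static void main(String[] args) {
        System.out.println("JUnit timeout in whole minutes: " + JUNIT_CLASS_TIMEOUT.toMinutes());
        System.out.println("Maven limit fires first: "
            + outerLimitFiresFirst(JUNIT_CLASS_TIMEOUT, MAVEN_GLOBAL_TIMEOUT));
    }
}
```

With these values the JUnit timeout can never be the one that ends a stuck test, which matches the CI job being killed at the Maven level.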

I've seen this at least 3 times recently in CI runs for new PRs. We need to investigate why some test cases can get stuck for so long. I ran the test class locally in IntelliJ and it finished in 5 min 55 sec.

CC avijayan erose

Failing run:

https://github.com/apache/ozone/runs/4160837361

Found this in the above run's artifact bundle: "No healthy node found to allocate container"?

org.apache.hadoop.hdds.upgrade.TestHDDSUpgrade-output.txt

2021-11-10 04:46:13,552 [Time-limited test] INFO  upgrade.UpgradeFinalizer (SCMUpgradeFinalizer.java:postFinalizeUpgrade(115)) - Waiting for at least one open pipeline after SCM finalization.
2021-11-10 04:46:18,553 [Time-limited test] INFO  upgrade.UpgradeFinalizer (SCMUpgradeFinalizer.java:postFinalizeUpgrade(115)) - Waiting for at least one open pipeline after SCM finalization.
2021-11-10 04:46:18,569 [RatisPipelineUtilsThread - 0] ERROR scm.SCMCommonPlacementPolicy (SCMCommonPlacementPolicy.java:filterNodesWithSpace(171)) - Unable to find enough nodes that meet the space requirement of 1073741824 bytes for metadata and 5368709120 bytes for data in healthy node set. Required 3. Found 2.
2021-11-10 04:46:23,553 [Time-limited test] INFO  upgrade.UpgradeFinalizer (SCMUpgradeFinalizer.java:postFinalizeUpgrade(115)) - Waiting for at least one open pipeline after SCM finalization.
2021-11-10 04:46:24,033 [ReplicationMonitor] ERROR scm.SCMCommonPlacementPolicy (SCMCommonPlacementPolicy.java:chooseDatanodes(140)) - No healthy node found to allocate container.
2021-11-10 04:46:24,033 [ReplicationMonitor] WARN  container.ReplicationManager (ReplicationManager.java:handleUnderReplicatedContainer(1199)) - Exception while replicating container 2.
org.apache.hadoop.hdds.scm.exceptions.SCMException: No healthy node found to allocate container.
	at org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:141)
	at org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRandom.chooseDatanodes(SCMContainerPlacementRandom.java:78)
	at org.apache.hadoop.hdds.scm.container.ReplicationManager.handleUnderReplicatedContainer(ReplicationManager.java:1163)
	at org.apache.hadoop.hdds.scm.container.ReplicationManager.processContainer(ReplicationManager.java:519)
	at java.util.ArrayList.forEach(ArrayList.java:1259)
	at org.apache.hadoop.hdds.scm.container.ReplicationManager.processAll(ReplicationManager.java:369)
	at org.apache.hadoop.hdds.scm.container.ReplicationManager.run(ReplicationManager.java:383)
	at java.lang.Thread.run(Thread.java:748)
2021-11-10 04:46:24,033 [ReplicationMonitor] INFO  container.ReplicationManager (ReplicationManager.java:processAll(371)) - Replication Monitor Thread took 3 milliseconds for processing 2 containers.
2021-11-10 04:46:28,554 [Time-limited test] INFO  upgrade.UpgradeFinalizer (SCMUpgradeFinalizer.java:postFinalizeUpgrade(115)) - Waiting for at least one open pipeline after SCM finalization.
2021-11-10 04:46:33,556 [Time-limited test] INFO  upgrade.UpgradeFinalizer (SCMUpgradeFinalizer.java:postFinalizeUpgrade(115)) - Waiting for at least one open pipeline after SCM finalization.
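To make the "Required 3. Found 2." line easier to reason about, here is a minimal sketch of a space-based node filter. It is not the Ozone SCMCommonPlacementPolicy code: the Node class, the method names, and the sample free-space values are all hypothetical. Only the two byte requirements and the required node count come from the log above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class SpaceFilterSketch {
    // Requirements quoted in the log line: 1 GiB for metadata, 5 GiB for data.
    static final long METADATA_REQUIRED_BYTES = 1073741824L;
    static final long DATA_REQUIRED_BYTES = 5368709120L;

    /** Hypothetical stand-in for a datanode's reported free space. */
    static final class Node {
        final String name;
        final long metadataFreeBytes;
        final long dataFreeBytes;

        Node(String name, long metadataFreeBytes, long dataFreeBytes) {
            this.name = name;
            this.metadataFreeBytes = metadataFreeBytes;
            this.dataFreeBytes = dataFreeBytes;
        }
    }

    /** Keeps only the nodes that satisfy both space requirements. */
    static List<Node> nodesWithEnoughSpace(List<Node> healthyNodes) {
        List<Node> result = new ArrayList<>();
        for (Node node : healthyNodes) {
            if (node.metadataFreeBytes >= METADATA_REQUIRED_BYTES
                    && node.dataFreeBytes >= DATA_REQUIRED_BYTES) {
                result.add(node);
            }
        }
        return result;
    }

    /** A pipeline can be allocated only if enough nodes pass the filter. */
    static boolean canAllocate(List<Node> healthyNodes, int requiredNodes) {
        return nodesWithEnoughSpace(healthyNodes).size() >= requiredNodes;
    }

    public static void main(String[] args) {
        long gib = 1024L * 1024L * 1024L;
        // Illustrative values only: three healthy nodes, one short on data space.
        List<Node> healthy = Arrays.asList(
            new Node("dn1", 2 * gib, 10 * gib),
            new Node("dn2", 2 * gib, 10 * gib),
            new Node("dn3", 2 * gib, 1 * gib));
        System.out.println("Required 3. Found " + nodesWithEnoughSpace(healthy).size() + ".");
        System.out.println("Can allocate: " + canAllocate(healthy, 3));
    }
}
```

Under this model, a single datanode that reports too little free space (or is missing from the healthy set) is enough to block a three-node pipeline indefinitely. Whether that is what happened in the failing run still needs to be confirmed from the artifact bundle.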

Attachments

- screenshot-1.jpg, 11/Nov/21 03:39, 67 kB, Siyao Meng
- 4390545403-it-filesystem-hdds.zip, 02/Dec/21 05:55, 332 kB, Siyao Meng

People

Assignee: Unassigned

Reporter: Siyao Meng

Votes: 0

Watchers: 3

Dates

Created: 11/Nov/21 03:35

Updated: 17/Jan/24 13:16