Details

Type: Bug
Priority: Major
Status: Resolved
Resolution: Cannot Reproduce
Description
15-node physical cluster; all Datanodes are up and running.
A client using 16 threads to write 16,000 x 10 MB+ files with the FsStress utility
(https://github.com/arp7/FsPerfTest) fails with the errors below.
This is an intermittent issue.
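For reference, a minimal sketch that approximates this workload (it is not the FsStress code itself): multiple threads each writing 10 MB keys through the Ozone client API. The volume/bucket names, key naming scheme, and thread/key counts below are illustrative assumptions.

{code:java}
import org.apache.hadoop.hdds.conf.OzoneConfiguration;
import org.apache.hadoop.ozone.client.OzoneBucket;
import org.apache.hadoop.ozone.client.OzoneClient;
import org.apache.hadoop.ozone.client.OzoneClientFactory;
import org.apache.hadoop.ozone.client.io.OzoneOutputStream;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LargeKeyWriteRepro {
  private static final int THREADS = 16;                 // 16 client threads, as in the test
  private static final int KEYS_PER_THREAD = 1000;       // 16 x 1000 = 16,000 keys total
  private static final int KEY_SIZE = 10 * 1024 * 1024;  // 10 MB per key

  public static void main(String[] args) throws Exception {
    OzoneConfiguration conf = new OzoneConfiguration();
    try (OzoneClient client = OzoneClientFactory.getRpcClient(conf)) {
      // Assumes volume "vol1" and bucket "bucket1" already exist.
      OzoneBucket bucket = client.getObjectStore()
          .getVolume("vol1").getBucket("bucket1");
      byte[] payload = new byte[KEY_SIZE];

      ExecutorService pool = Executors.newFixedThreadPool(THREADS);
      for (int t = 0; t < THREADS; t++) {
        final int threadId = t;
        pool.submit(() -> {
          for (int i = 0; i < KEYS_PER_THREAD; i++) {
            String keyName = "key-" + threadId + "-" + i;
            try (OzoneOutputStream out = bucket.createKey(keyName, KEY_SIZE)) {
              // KeyOutputStream splits this write across blocks/containers.
              out.write(payload);
            } catch (Exception e) {
              // Intermittent failures surface here as IOExceptions wrapping the
              // Ratis AlreadyClosedException / NotLeaderException seen in the logs.
              System.err.println("FAILED " + keyName + ": " + e);
            }
          }
        });
      }
      pool.shutdown();
      pool.awaitTermination(2, TimeUnit.HOURS);
    }
  }
}
{code}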
Server-side exceptions:
19/04/22 10:13:32 ERROR io.KeyOutputStream: Try to allocate more blocks for write failed, already allocated 0 blocks for this write.
19/04/18 14:33:23 WARN io.KeyOutputStream: Encountered exception java.io.IOException: Unexpected Storage Container Exception: java.util.concurrent.CompletionException: java.util.concurrent.CompletionException: org.apache.ratis.protocol.AlreadyClosedException: SlidingWindow$Client client-ADE7F801D3AD->RAFT is closed.. The last committed block length is 0, uncommitted data length is 10485760 retry count 0
Client-side exceptions:
FAILED org.apache.ratis.protocol.NotLeaderException: Server c6e64cc4-91e9-4b36-83e4-6d84a4e71b7f is not the leader (f44c1413-0847-45e3-982d-ac3aec15dffc:10.17.200.23:9858). Request must be sent to leader., logIndex=0, commits[c6e64cc4-91e9-4b36-83e4-6d84a4e71b7f:c131161, 287eccfb-8461-419a-8732-529d042380b3:c131161, f44c1413-0847-45e3-982d-ac3aec15dffc:c131161]
With small keys (<1 MB), and with large keys written from a single thread, the client-side exceptions above are infrequent. However, with multiple threads writing 10 MB+ keys, the exceptions occur roughly 50% of the time and eventually cause write failures. I have attached the logs from one such failed pipeline.
Datanode Logs.zip