Details
Description
Hello there.
I've build spark 3.2.2 for my cluster which has hadoop 3.1.2 and scala 2.12 (pom.xml is attached).
build script:
cd spark && \
./build/mvn -Pyarn -Dhadoop.version=3.1.2 -Pscala-2.12 -Phive -Phive-thriftserver -DskipTests clean package
It was working fine but a few applications has got strage error and warning form time to time.
It always looks like datanode connection lost and shuffle reading issues.
2022-11-16 22:18:25,423 ERROR server.TransportChannelHandler: Connection to s00abd02node9.company.com/10.x.y.163:35143 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.shuffle.io.connectionTimeout if this is wrong. 2022-11-16 22:18:25,423 ERROR client.TransportResponseHandler: Still have 5 requests outstanding when connection from s00abd02node9.company.com/10.x.y.163:35143 is closed 2022-11-16 22:18:25,423 WARN netty.NettyBlockTransferService: Error while trying to get the host local dirs for [16] 2022-11-16 22:18:25,425 ERROR storage.ShuffleBlockFetcherIterator: Error occurred while fetching host local blocks
So when it happend application will go to retry and fail after 2nd start.
Can anybody help?