Details
- Type: Improvement
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- 2.4.0
- None
Description
There are a couple of code paths in the SASL fallback handling that result in misleading exceptions being printed to the logs. One is when a timeout occurs during authentication; for example:
19/03/15 11:21:37 WARN crypto.AuthClientBootstrap: New auth protocol failed, trying SASL.
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
    at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
    at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:258)
    at org.apache.spark.network.crypto.AuthClientBootstrap.doSparkAuth(AuthClientBootstrap.java:105)
    at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:79)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:262)
    at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:192)
    at org.apache.spark.network.shuffle.ExternalShuffleClient.lambda$fetchBlocks$0(ExternalShuffleClient.java:100)
    at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
    ...
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
    at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
    at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
    at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:254)
    ... 38 more
19/03/15 11:21:38 WARN server.TransportChannelHandler: Exception in connection from vc1033.halxg.cloudera.com/10.17.216.43:7337
java.lang.IllegalArgumentException: Frame length should be positive: -3702202170875367528
    at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
The IllegalArgumentException should not happen. It occurs only because the code ignores the timeout and retries, at which point the remote side is in a different state and does not expect the message.
The same line that prints that exception can also produce a noisy log message when the remote side (e.g. an old shuffle service) does not understand the new auth protocol. Because it is logged as a warning, it looks like something is wrong, when the fallback is just doing what is expected.
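To make the two cases concrete, here is a minimal, self-contained sketch of one way the fallback decision could be structured. It is not the actual Spark code: the class `AuthFallbackSketch`, the method `authenticate`, and the `saslFallbackEnabled` flag are hypothetical names used only for illustration, and the auth steps are modeled as plain `Runnable`s. It assumes a timeout surfaces as a `RuntimeException` whose cause is a `TimeoutException`, as in the trace above.

```java
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; names and structure are assumptions, not Spark's API.
public class AuthFallbackSketch {

    /**
     * Runs the new auth step and falls back to SASL only when that is safe.
     * Returns "new" or "sasl" to indicate which path completed.
     */
    public static String authenticate(Runnable newAuth, Runnable saslAuth,
                                      boolean saslFallbackEnabled) {
        try {
            newAuth.run();
            return "new";
        } catch (RuntimeException e) {
            // After a timeout the remote side may be in a different state, so
            // retrying with SASL on the same channel would send a message it
            // does not expect. Treat the timeout as fatal instead.
            if (!saslFallbackEnabled || e.getCause() instanceof TimeoutException) {
                throw e;
            }
            // A remote side that does not speak the new protocol is an expected
            // case, so report it quietly without a stack trace.
            System.out.println("INFO New auth protocol failed, trying SASL.");
            saslAuth.run();
            return "sasl";
        }
    }

    public static void main(String[] args) {
        // Remote side rejects the new protocol: fall back to SASL.
        String path = authenticate(
            () -> { throw new RuntimeException(new IllegalStateException("unknown message")); },
            () -> { },
            true);
        System.out.println("Authenticated via: " + path);

        // Timeout during the new protocol: rethrown, no SASL attempt.
        try {
            authenticate(
                () -> { throw new RuntimeException(new TimeoutException("Timeout waiting for task.")); },
                () -> { },
                true);
        } catch (RuntimeException e) {
            System.out.println("Timeout propagated: " + e.getCause().getMessage());
        }
    }
}
```

With this shape, a timeout fails the connection attempt locally rather than triggering a second handshake, and the expected "old shuffle service" case no longer looks like an error in the logs.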