Spark / SPARK-27219

Misleading exceptions in transport code's SASL fallback path

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 3.0.0
    • Component/s: Spark Core
    • Labels: None

    Description

      A couple of code paths in the SASL fallback handling print misleading exceptions to the logs. One of them is triggered when a timeout occurs during authentication; for example:

      19/03/15 11:21:37 WARN crypto.AuthClientBootstrap: New auth protocol failed, trying SASL.
      java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
              at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
              at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:258)
              at org.apache.spark.network.crypto.AuthClientBootstrap.doSparkAuth(AuthClientBootstrap.java:105)
              at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:79)
              at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:262)
              at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:192)
              at org.apache.spark.network.shuffle.ExternalShuffleClient.lambda$fetchBlocks$0(ExternalShuffleClient.java:100)
              at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
      ...
      Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
              at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
              at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
              at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:254)
              ... 38 more
      19/03/15 11:21:38 WARN server.TransportChannelHandler: Exception in connection from vc1033.halxg.cloudera.com/10.17.216.43:7337
      java.lang.IllegalArgumentException: Frame length should be positive: -3702202170875367528
              at org.spark_project.guava.base.Preconditions.checkArgument(Preconditions.java:119)
      

      The IllegalArgumentException shouldn't happen. It occurs only because the code ignores the timeout and retries with SASL, at which point the remote side is in a different state and doesn't expect the message.
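
      To illustrate why an unexpected message can surface as a frame-length error, here is a minimal sketch. It assumes a length-prefixed framing with an 8-byte signed length, which is an assumption for illustration and not taken from this report. FrameLengthSketch, readFrameLength and frame are hypothetical names. When the receiver reads bytes that are not a frame header, the first eight bytes decode to an arbitrary, possibly negative, number:

```java
import java.nio.ByteBuffer;

// Hypothetical sketch, not Spark's actual decoder.
final class FrameLengthSketch {

  // Reads an assumed 8-byte big-endian length prefix and validates it.
  static long readFrameLength(ByteBuffer in) {
    long frameLength = in.getLong();
    if (frameLength <= 0) {
      throw new IllegalArgumentException(
          "Frame length should be positive: " + frameLength);
    }
    return frameLength;
  }

  // Builds a well-formed frame: length prefix followed by the payload.
  static ByteBuffer frame(byte[] payload) {
    ByteBuffer buf = ByteBuffer.allocate(Long.BYTES + payload.length);
    buf.putLong(payload.length);
    buf.put(payload);
    buf.flip();
    return buf;
  }

  public static void main(String[] args) {
    // A well-formed frame decodes to its payload length.
    System.out.println(readFrameLength(frame(new byte[] {1, 2, 3})));

    // Bytes from some other protocol state are not a length prefix; with the
    // high bit set they decode to a negative value and are rejected.
    ByteBuffer unexpected =
        ByteBuffer.wrap(new byte[] {(byte) 0xCC, 1, 2, 3, 4, 5, 6, 7});
    try {
      readFrameLength(unexpected);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```

      Under that assumption the error is a symptom of the two sides disagreeing about protocol state, not of a corrupt network.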

      The same line that prints that exception also produces a noisy log message when the remote side (e.g. an old shuffle service) does not understand the new auth protocol. Because the message is logged as a warning, it looks like something is wrong, when the code is just doing what's expected.
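
      One way to address both problems is sketched below. It is an illustration, not the patch that resolved this issue. AuthFallbackSketch, Handshake, Outcome and authenticate are hypothetical names, and the only detail taken from the log above is that the timeout arrives as the cause of a RuntimeException. A timeout is rethrown instead of triggering the fallback, and the expected fallback is logged below warning level:

```java
import java.util.concurrent.TimeoutException;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch of a fallback policy; names are not Spark's.
final class AuthFallbackSketch {
  private static final Logger LOG = Logger.getLogger("AuthFallbackSketch");

  enum Outcome { NEW_PROTOCOL, SASL_FALLBACK }

  // Stand-in for one authentication attempt against the remote side.
  interface Handshake { void run(); }

  static Outcome authenticate(Handshake newAuth, Handshake saslAuth,
                              boolean saslFallbackEnabled) {
    try {
      newAuth.run();
      return Outcome.NEW_PROTOCOL;
    } catch (RuntimeException e) {
      // A timeout says nothing about which protocol the peer speaks, so
      // retrying with SASL would only confuse the remote side. Rethrow it.
      if (!saslFallbackEnabled || e.getCause() instanceof TimeoutException) {
        throw e;
      }
      // Falling back is expected with an old peer: keep the stack trace
      // for debug logging only and stay below WARN otherwise.
      if (LOG.isLoggable(Level.FINE)) {
        LOG.log(Level.FINE, "New auth protocol failed, trying SASL.", e);
      } else {
        LOG.info("New auth protocol failed, trying SASL.");
      }
      saslAuth.run();
      return Outcome.SASL_FALLBACK;
    }
  }

  public static void main(String[] args) {
    // An old peer rejects the new protocol, so the client falls back.
    Outcome outcome = authenticate(
        () -> { throw new IllegalStateException("unknown message"); },
        () -> { },
        true);
    System.out.println(outcome);
  }
}
```

      The sketch still cannot tell a peer that lacks the new protocol from other failures, so it falls back on anything except a timeout.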

            People

              Assignee: Marcelo Masiero Vanzin (vanzin)
              Reporter: Marcelo Masiero Vanzin (vanzin)