Spark / SPARK-20799

Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None
    • Environment:

      Hadoop 2.8.0 binaries

    Description

      We are getting the following exception:

      org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.

      Combining the following factors will cause it (a minimal repro sketch follows the list):

      • Use S3
      • Use the ORC format
      • Don't apply partitioning on the data
      • Embed AWS credentials in the path
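
      A minimal sketch of a read that triggers this; the bucket, path, and inline credentials below are hypothetical placeholders:

      import org.apache.spark.sql.SparkSession

      // Hypothetical example: un-partitioned ORC data read through S3N with
      // the AWS credentials embedded directly in the URL.
      val spark = SparkSession.builder.getOrCreate()
      val df = spark.read.orc("s3n://ACCESS_KEY:SECRET_KEY@my-bucket/events/orc")
      // => org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.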

      The problem is in PartitioningAwareFileIndex.allFiles():

      leafDirToChildrenFiles.get(qualifiedPath)
                .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
                .getOrElse(Array.empty)
      

      leafDirToChildrenFiles uses the path WITHOUT credentials as its key, while qualifiedPath contains the path WITH credentials.
      So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, no data is read, and the schema cannot be inferred.
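
      The mismatch can be illustrated with Hadoop's Path class, whose equality is based on the full URI; the bucket and credentials below are hypothetical placeholders:

      import org.apache.hadoop.fs.Path

      // The leaf-file maps are keyed on the path with the secrets stripped,
      // while the lookup uses the qualified path that still carries them.
      val keyInMap  = new Path("s3n://my-bucket/events/orc")
      val qualified = new Path("s3n://ACCESS_KEY:SECRET_KEY@my-bucket/events/orc")

      println(keyInMap == qualified) // false, so allFiles() falls through to Array.empty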

      Spark does log the warning "S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future.", but that warning should not mean it stops working altogether.

      Workaround:
      Move the AWS credentials from the path to the SparkSession

      SparkSession.builder
        .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
        .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
        .getOrCreate()
      


          Activity

          stevel@apache.org Steve Loughran added a comment -

          Spark does log the warning "S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future.", but that warning should not mean it stops working altogether.

          It probably will stop working at some point in the future, as putting secrets in URIs is too dangerous: everything logs them assuming they aren't sensitive data. The S3xLoginHelper not only warns you, it does a best-effort attempt to strip the secrets out of the public URI, hence the logs and the messages telling you off.

          Prior to Hadoop 2.8, the sole defensible use case for secrets in URIs was that it was the only way to have different logins on different buckets. In Hadoop 2.8 we added the ability to configure any of the fs.s3a. options on a per-bucket basis, including the secret logins, endpoints, and other important values.

          I see what may be happening; in which case it probably constitutes a Hadoop regression: if the filesystem's URI is converted to a string it will have these stripped, so if something is going path -> URI -> String -> path, the secrets will be lost.

          If you are seeing this stack trace, it means you are using Hadoop 2.8 or something else with the HADOOP-3733 patch in it. What version of Hadoop (or HDP, CDH..) are you using? If it is based on the full Apache 2.8 release, you get

          1. per-bucket config to allow you to configure each bucket separately
          2. the ability to use JCEKS files to keep the secrets out of the configs (see the sketch after this list)
          3. session token support.
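
          As a sketch of item 2, assuming the AWS keys have already been stored in a JCEKS keystore with the hadoop credential CLI; the keystore location and property wiring below are assumptions to verify against your Hadoop version's S3A documentation:

          import org.apache.spark.sql.SparkSession

          // Point the Hadoop configuration at a JCEKS credential store so the
          // AWS secrets live outside the Spark config and off the URL.
          // The jceks:// location is a hypothetical example.
          val spark = SparkSession.builder
            .config("spark.hadoop.hadoop.security.credential.provider.path",
              "jceks://hdfs@namenode:8020/user/spark/s3.jceks")
            .getOrCreate()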

          Accordingly, if you state the version, I may be able to look @ what's happening in a bit more detail

          jzijlstra Jork Zijlstra added a comment -

          Hi Steve,

          Thanks for the quick response. We indeed don't need the credentials to be on the path anymore.

          I forgot to mention the version we are running: Spark 2.1.1 with Hadoop 2.8.0.
          Is there any other information you need?

          Regards, Jork

          dongjoon Dongjoon Hyun added a comment -

          Hi, Jork Zijlstra. What about Parquet?

          jzijlstra Jork Zijlstra added a comment -

          Dongjoon Hyun
          I don't know, since we don't use Parquet files. But I can of course generate one from the ORC. I will try this tomorrow and let you know.

          stevel@apache.org Steve Loughran added a comment -

          If what I think is happening is right, then it's the security tightening of HADOOP-3733 which has stopped this. It is sort-of-a-regression, but as it has a security benefit ("stops leaking your secrets through logs") it's not something we want to revert. Anyway, it never worked if you had a "/" in your secret key, so the sole reason it worked for you in the past is that you don't (see: I know something about your secret credentials).

          Hadoop 2.8 is way better for S3A support all round, so I'd encourage you to stay and play. In particular,

          1. switch from s3n:// to s3a:// for your URLs, to get the new high-performance client
          2. try setting fs.s3a.experimental.fadvise=random in your settings; you should see a significant speedup in ORC input (see the sketch after this list).
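
          A sketch of passing that setting through the Spark session; the property name is taken verbatim from this comment, so verify the exact key against your Hadoop version's S3A documentation:

          import org.apache.spark.sql.SparkSession

          // Forward the S3A fadvise hint via Spark's spark.hadoop.* prefix.
          val spark = SparkSession.builder
            .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
            .getOrCreate()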

          If the use case here is that you want to use separate credentials for a specific bucket, you can use per-bucket config now:

          fs.s3a.bucket.site-2.access.key=my access key
          fs.s3a.bucket.site-2.access.secret=my access secret
          

          then when you refer to s3a://site-2/path, the specific key & secret for that bucket are picked up. This is why you shouldn't need to use inline secrets at all.

          jzijlstra Jork Zijlstra added a comment - - edited

          Hi Dongjoon Hyun,

          Sorry that it took some time to test the Parquet file. Our Spark cluster for the notebook got updated to Spark 2.1.1, but it wouldn't play nice with the notebook version, especially when using s3a paths. Using s3n paths I could generate the non-partitioned Parquet file.

          It also seems to be a problem with Parquet files. It throws the same error.

          Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;

          Steve Loughran
          Thanks for the settings. I'm trying to get the notebook to play nice with s3a paths and am playing with and exploring the options now.

          Don't you mean

          fs.s3a.bucket.site-2.access.key=my access key
          fs.s3a.bucket.site-2.secret.key=my access secret
          

          Regards, jork


            People

            • Assignee:
              Unassigned
            • Reporter:
              jzijlstra Jork Zijlstra
            • Votes:
              1
            • Watchers:
              3
