[SPARK-20799] Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL


Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None
    • Environment: Hadoop 2.8.0 binaries

    Description

      We are getting the following exception:

      org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.

      Combining the following factors triggers it (a minimal reproduction sketch follows the list):

      • Use S3
      • Use format ORC
      • Don't apply partitioning on the data
      • Embed AWS credentials in the path
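
      A minimal reproduction sketch; the bucket name and credentials below are hypothetical placeholders:

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder.getOrCreate()

      // Unpartitioned ORC data with AWS credentials embedded in the URL (all values hypothetical)
      val df = spark.read
        .format("orc")
        .load("s3n://ACCESS_KEY:SECRET_KEY@some-bucket/events/")
      // => org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC.
      //    It must be specified manually.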

      The problem is in PartitioningAwareFileIndex.allFiles():

      leafDirToChildrenFiles.get(qualifiedPath)
                .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
                .getOrElse(Array.empty)
      

      leafDirToChildrenFiles uses the path WITHOUT credentials as its key, while qualifiedPath contains the path WITH credentials.
      As a result, leafDirToChildrenFiles.get(qualifiedPath) finds no files, so no data is read and the schema cannot be inferred.
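
      A small sketch of the mismatch, using hypothetical paths: org.apache.hadoop.fs.Path equality compares the full URI, so a path that still carries user info never equals the stripped key.

      import org.apache.hadoop.fs.Path

      val qualifiedPath = new Path("s3n://ACCESS_KEY:SECRET_KEY@some-bucket/events") // WITH credentials
      val indexKey      = new Path("s3n://some-bucket/events")                       // WITHOUT credentials

      qualifiedPath == indexKey // false: the user info makes the URIs differ,
                                // so the map lookup misses and allFiles() returns empty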

      Spark does log a warning (S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future.), but a deprecation warning should not mean that the feature silently stops working.

      Workaround:
      Move the AWS credentials from the path into the SparkSession configuration:

      SparkSession.builder
      	.config("spark.hadoop.fs.s3n.awsAccessKeyId", awsAccessKeyId)         // awsAccessKeyId: String holding the key
      	.config("spark.hadoop.fs.s3n.awsSecretAccessKey", awsSecretAccessKey) // awsSecretAccessKey: String holding the secret
      	.getOrCreate()
      
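      For a session that is already running, setting the keys on the Hadoop configuration directly should be equivalent (a sketch, reusing the same hypothetical variables):

      spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", awsAccessKeyId)
      spark.sparkContext.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey)

      // The path no longer needs to carry the secrets
      spark.read.format("orc").load("s3n://some-bucket/events/")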


            People

              Assignee: Unassigned
              Reporter: Jork Zijlstra (jzijlstra)