  Spark / SPARK-20799

Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None
    • Environment:

      Hadoop 2.8.0 binaries

      Description

      We are getting the following exception:

      org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.

      The combination of the following factors triggers it:

      • Use S3
      • Use format ORC
      • Don't apply partitioning to the data
      • Embed AWS credentials in the path

      The problem is in def allFiles() of PartitioningAwareFileIndex:

      leafDirToChildrenFiles.get(qualifiedPath)
                .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
                .getOrElse(Array.empty)
      

      leafDirToChildrenFiles uses the path WITHOUT credentials as its key while the qualifiedPath contains the path WITH credentials.
      As a result, leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, no data is read, and the schema cannot be inferred.
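
      The mismatch described above can be illustrated without Spark. The following is a minimal sketch, not the Spark implementation: stripUserInfo and lookup are hypothetical helpers, the map is a plain Scala Map standing in for leafDirToChildrenFiles, and the bucket name and credentials are placeholders.

```scala
import java.net.URI

object KeyMismatchSketch {
  // Hypothetical helper: rebuilds the URI without its user-info part,
  // standing in for whatever strips the credentials during file listing.
  def stripUserInfo(path: String): String = {
    val u = new URI(path)
    new URI(u.getScheme, null, u.getHost, u.getPort, u.getPath, u.getQuery, u.getFragment).toString
  }

  // Simplified stand-in for the lookup in allFiles(): exact match on the key.
  def lookup(index: Map[String, Array[String]], path: String): Array[String] =
    index.getOrElse(path, Array.empty[String])

  def main(args: Array[String]): Unit = {
    val qualifiedPath = "s3n://ACCESS_KEY:SECRET_KEY@my-bucket/data" // placeholder values
    // The index is keyed by the path WITHOUT credentials.
    val index = Map(stripUserInfo(qualifiedPath) -> Array("part-00000.orc"))

    println(lookup(index, qualifiedPath).length)                // 0: key with credentials misses
    println(lookup(index, stripUserInfo(qualifiedPath)).length) // 1: key without credentials hits
  }
}
```

      Because the lookup is an exact match on the key, the path with credentials yields an empty file list even though the files were listed successfully.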

      Spark does log the warning "S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future.", but a warning alone should not mean that reading from such a path stops working.

      Workaround:
      Move the AWS credentials from the path to the SparkSession configuration:

      // awsAccessKeyId and awsSecretAccessKey are values supplied by the caller
      SparkSession.builder
      	.config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
      	.config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
      	.getOrCreate()
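
      For existing paths that already embed credentials, the move can be sketched as a small helper that separates the credentials from the path. This is an illustration under assumptions, not part of Spark: SplitPath, split and sparkConf are hypothetical names, and the sketch assumes the secret contains no characters that prevent java.net.URI from parsing the user-info part.

```scala
import java.net.URI

object CredentialSplitSketch {
  // Hypothetical holder for the pieces recovered from a credentialed URI.
  final case class SplitPath(cleanPath: String, accessKey: String, secretKey: String)

  // Returns None when the path carries no "key:secret@" user-info part.
  def split(path: String): Option[SplitPath] = {
    val uri = new URI(path)
    Option(uri.getUserInfo).map(_.split(":", 2)).collect {
      case Array(key, secret) =>
        val clean = new URI(uri.getScheme, null, uri.getHost, uri.getPort,
          uri.getPath, uri.getQuery, uri.getFragment)
        SplitPath(clean.toString, key, secret)
    }
  }

  // Builds the two settings named in the workaround.
  def sparkConf(p: SplitPath): Map[String, String] = Map(
    "spark.hadoop.fs.s3n.awsAccessKeyId" -> p.accessKey,
    "spark.hadoop.fs.s3n.awsSecretAccessKey" -> p.secretKey
  )
}
```

      Each entry of the resulting map could then be passed to SparkSession.builder.config(key, value), and cleanPath used as the read path.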
      

              People

              • Assignee:
                Unassigned
              • Reporter:
                jzijlstra Jork Zijlstra
              • Votes:
                1
              • Watchers:
                4
