Spark / SPARK-20799

Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None
    • Environment:

      Hadoop 2.8.0 binaries

      Description

      We are getting the following exception:

      org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.

      The following combination of factors triggers it:

      • Use S3
      • Use format ORC
      • Don't apply partitioning to the data
      • Embed AWS credentials in the path

      The problem is in the allFiles() method of PartitioningAwareFileIndex:

      leafDirToChildrenFiles.get(qualifiedPath)
                .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
                .getOrElse(Array.empty)
      

      leafDirToChildrenFiles is keyed by the path WITHOUT credentials, while qualifiedPath contains the path WITH credentials.
      As a result, leafDirToChildrenFiles.get(qualifiedPath) finds no files, no data is read, and the schema cannot be inferred.
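
      The key mismatch can be illustrated without Spark. The following is a minimal sketch, not Spark's actual code: the object name, the stripCredentials helper, the bucket name, and the map contents are all hypothetical, and plain java.net.URI values and strings stand in for Hadoop Path and FileStatus objects.

```scala
import java.net.URI

object KeyMismatchSketch {
  // Hypothetical helper: rebuild a URI without its user-info (credentials) part.
  def stripCredentials(uri: URI): URI =
    new URI(uri.getScheme, null, uri.getHost, uri.getPort,
      uri.getPath, uri.getQuery, uri.getFragment)

  // Path as the user passed it, with made-up secrets embedded.
  val qualifiedPath: URI = new URI("s3n://ACCESSKEY:SECRET@example-bucket/data/orc")

  // Stand-in for leafDirToChildrenFiles: keyed by the path WITHOUT credentials.
  val leafDirToChildrenFiles: Map[URI, Array[String]] =
    Map(stripCredentials(qualifiedPath) -> Array("part-00000.orc"))

  // Same shape as the lookup in allFiles(): a miss falls through to an empty array.
  def filesFor(path: URI): Array[String] =
    leafDirToChildrenFiles.get(path).getOrElse(Array.empty[String])
}
```

      In this sketch, filesFor(qualifiedPath) yields an empty array because the key with credentials never matches, while filesFor(stripCredentials(qualifiedPath)) finds the file.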

      Spark does log the warning "S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future.", but that warning should not mean that reading from such a path stops working.

      Workaround:
      Move the AWS credentials out of the path and into the SparkSession configuration:

      SparkSession.builder
      	.config("spark.hadoop.fs.s3n.awsAccessKeyId", awsAccessKeyId)
      	.config("spark.hadoop.fs.s3n.awsSecretAccessKey", awsSecretAccessKey)
      	.getOrCreate()
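
      When paths with embedded secrets come from existing configuration, the credentials can be split off before the path reaches Spark. This is a minimal sketch, assuming the secret contains no characters (such as "/") that would break URI parsing; the object name, the splitCredentials helper, and the example URL are hypothetical.

```scala
import java.net.URI

object CredentialSplitSketch {
  // Hypothetical helper: returns the URL without user-info,
  // plus (accessKey, secretKey) when both are present in the URL.
  def splitCredentials(url: String): (String, Option[(String, String)]) = {
    val uri = new URI(url)
    // User-info has the form "accessKey:secretKey"; split on the first colon only.
    val creds = Option(uri.getUserInfo).map(_.split(":", 2)).collect {
      case Array(key, secret) => (key, secret)
    }
    // Rebuild the URI with the user-info component left out.
    val clean = new URI(uri.getScheme, null, uri.getHost, uri.getPort,
      uri.getPath, uri.getQuery, uri.getFragment)
    (clean.toString, creds)
  }
}
```

      The returned key and secret can then be passed to the spark.hadoop.fs.s3n.* settings shown in the workaround, and the cleaned URL to the reader.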
      

            People

            • Assignee: Unassigned
            • Reporter: Jork Zijlstra (jzijlstra)
