Details
- Type: Bug
- Status: Closed
- Priority: Minor
- Resolution: Won't Fix
- Affects Version/s: 2.1.1
- Fix Version/s: None
- Component/s: None
- Environment: Hadoop 2.8.0 binaries
Description
We are getting the following exception:

    org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.

Combining the following factors triggers it (a minimal reproduction sketch is shown after the list):
- Use S3
- Use the ORC format
- Don't apply partitioning to the data
- Embed the AWS credentials in the path
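A minimal sketch of the failing read, assuming a hypothetical bucket and placeholder credentials (the real path and keys are not part of this report):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("orc-s3-repro").getOrCreate()

    // Unpartitioned ORC data, with the AWS credentials embedded in the URI.
    // On Spark 2.1.1 built against the Hadoop 2.8.0 binaries this fails with:
    // org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.
    val df = spark.read
      .format("orc")
      .load("s3n://ACCESS_KEY:SECRET_KEY@my-bucket/path/to/orc")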
The problem is in PartitioningAwareFileIndex.allFiles():

    leafDirToChildrenFiles.get(qualifiedPath)
      .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
      .getOrElse(Array.empty)

leafDirToChildrenFiles uses the path WITHOUT credentials as its key, while qualifiedPath contains the path WITH credentials.
As a result leafDirToChildrenFiles.get(qualifiedPath) finds no files, so no data is read and the schema cannot be inferred.
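The mismatch can be illustrated with a small standalone sketch (the paths and map values are made up; only the failed lookup matters):

    import org.apache.hadoop.fs.Path

    // The index keys are stored with the credentials stripped from the URI
    // (which is what Hadoop 2.8's S3xLoginHelper does when the filesystem is set up)...
    val leafDirToChildrenFiles: Map[Path, Array[String]] =
      Map(new Path("s3n://my-bucket/path/to/orc") -> Array("part-00000.orc"))

    // ...but the lookup uses the user-supplied, credential-carrying path.
    val qualifiedPath = new Path("s3n://ACCESS_KEY:SECRET_KEY@my-bucket/path/to/orc")

    // Path equality compares the full URI, so the lookup returns None and
    // allFiles() ends up empty, which is why schema inference fails.
    println(leafDirToChildrenFiles.get(qualifiedPath)) // None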
Spark does log the warning "S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future.", but "insecure and may be unsupported in future" should not mean it already stops working.
Workaround:
Move the AWS credentials from the path to the SparkSession configuration:

    SparkSession.builder
      .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
      .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
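Put together, the workaround looks roughly like this; the bucket name and the use of environment variables for the keys are illustrative, not taken from the report:

    import org.apache.spark.sql.SparkSession

    // Credentials go into the Hadoop configuration instead of the URI...
    val spark = SparkSession.builder()
      .config("spark.hadoop.fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
      .config("spark.hadoop.fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
      .getOrCreate()

    // ...so the path no longer carries login details and schema inference works again.
    val df = spark.read.format("orc").load("s3n://my-bucket/path/to/orc")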
Attachments
Issue Links
- is broken by
  - HADOOP-14439 regression: secret stripping from S3x URIs breaks some downstream code (Resolved)
  - HADOOP-3733 "s3:" URLs break when Secret Key contains a slash, even if encoded (Resolved)
- is related to
  - HADOOP-14833 Remove s3a user:secret authentication (Resolved)