Spark / SPARK-20799

Unable to infer schema for ORC/Parquet on S3N when secrets are in the URL

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.1.1
    • Fix Version/s: None
    • Component/s: SQL
    • Labels:
      None
    • Environment:

      Hadoop 2.8.0 binaries

    Description

      We are getting the following exception:

      org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.

      Combining the following factors will cause it (a minimal repro sketch follows the list):

      • Use S3
      • Use the ORC format
      • Don't apply partitioning on the data
      • Embed AWS credentials in the path
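
      A minimal sketch of a read that triggers this; the bucket, path, and inline credentials below are hypothetical placeholders:

      import org.apache.spark.sql.SparkSession

      // Hypothetical example: un-partitioned ORC data read through S3N with
      // the AWS credentials embedded directly in the URL.
      val spark = SparkSession.builder.getOrCreate()
      val df = spark.read.orc("s3n://ACCESS_KEY:SECRET_KEY@my-bucket/events/orc")
      // => org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.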

      The problem is in PartitioningAwareFileIndex.allFiles():

      leafDirToChildrenFiles.get(qualifiedPath)
                .orElse { leafFiles.get(qualifiedPath).map(Array(_)) }
                .getOrElse(Array.empty)
      

      leafDirToChildrenFiles uses the path WITHOUT credentials as its key, while qualifiedPath contains the path WITH credentials.
      So leafDirToChildrenFiles.get(qualifiedPath) doesn't find any files, no data is read, and the schema cannot be inferred.
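
      The mismatch can be illustrated with Hadoop's Path class, whose equality is based on the full URI; the bucket and credentials below are hypothetical placeholders:

      import org.apache.hadoop.fs.Path

      // The leaf-file maps are keyed on the path with the secrets stripped,
      // while the lookup uses the qualified path that still carries them.
      val keyInMap  = new Path("s3n://my-bucket/events/orc")
      val qualified = new Path("s3n://ACCESS_KEY:SECRET_KEY@my-bucket/events/orc")

      println(keyInMap == qualified) // false, so allFiles() falls through to Array.empty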

      Spark does log the warning "S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future.", but that warning should not mean it stops working altogether.

      Workaround:
      Move the AWS credentials from the path to the SparkSession

      SparkSession.builder
        .config("spark.hadoop.fs.s3n.awsAccessKeyId", {awsAccessKeyId})
        .config("spark.hadoop.fs.s3n.awsSecretAccessKey", {awsSecretAccessKey})
        .getOrCreate()
      


          Activity

          stevel@apache.org Steve Loughran added a comment -

          Spark does log the warning "S3xLoginHelper:90 - The Filesystem URI contains login details. This is insecure and may be unsupported in future.", but that warning should not mean it stops working altogether.

          It probably will stop working at some point in the future, as putting secrets in URIs is too dangerous: everything logs them assuming they aren't sensitive data. The S3xLoginHelper not only warns you, it does a best-effort attempt to strip the secrets out of the public URI, hence the logs and the messages telling you off.

          Prior to Hadoop 2.8, the sole defensible use case for secrets in URIs was that it was the only way to have different logins on different buckets. In Hadoop 2.8 we added the ability to configure any of the fs.s3a. options on a per-bucket basis, including the secret logins, endpoints, and other important values.

          I see what may be happening; in which case it probably constitutes a Hadoop regression: if the filesystem's URI is converted to a string it will have these stripped, so if something is going path -> URI -> String -> path, the secrets will be lost.

          If you are seeing this stack trace, it means you are using Hadoop 2.8 or something else with the HADOOP-3733 patch in it. What version of Hadoop (or HDP, CDH..) are you using? If it is based on the full Apache 2.8 release, you get

          1. per-bucket config to allow you to configure each bucket separately
          2. the ability to use JCEKS files to keep the secrets out of the configs (see the sketch after this list)
          3. session token support.
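
          As a sketch of item 2, assuming the AWS keys have already been stored in a JCEKS keystore with the hadoop credential CLI; the keystore location and property wiring below are assumptions to verify against your Hadoop version's S3A documentation:

          import org.apache.spark.sql.SparkSession

          // Point the Hadoop configuration at a JCEKS credential store so the
          // AWS secrets live outside the Spark config and off the URL.
          // The jceks:// location is a hypothetical example.
          val spark = SparkSession.builder
            .config("spark.hadoop.hadoop.security.credential.provider.path",
              "jceks://hdfs@namenode:8020/user/spark/s3.jceks")
            .getOrCreate()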

          Accordingly, if you state the version, I may be able to look @ what's happening in a bit more detail

          jzijlstra Jork Zijlstra added a comment -

          Hi Steve,

          Thanks for the quick response. We indeed don't need the credentials to be on the path anymore.

          I forgot to mention the version we are running: Spark 2.1.1 with Hadoop 2.8.0.
          Is there any other information you need?

          Regards, Jork

          dongjoon Dongjoon Hyun added a comment -

          Hi, Jork Zijlstra. What about Parquet?

          jzijlstra Jork Zijlstra added a comment -

          Dongjoon Hyun
          I don't know, since we don't use Parquet files. But I can of course generate one from the ORC. I will try this tomorrow and let you know.

          stevel@apache.org Steve Loughran added a comment -

          If what I think is happening is right, then it's the security tightening of HADOOP-3733 which has stopped this. It is sort-of-a-regression, but as it has a security benefit ("stops leaking your secrets through logs") it's not something we want to revert. Anyway, it never worked if you had a "/" in your secret key, so the sole reason it worked for you in the past is that you don't (see: I know something about your secret credentials).

          Hadoop 2.8 is way better for S3A support all round, so I'd encourage you to stay and play. In particular,

          1. switch from s3n:// to s3a:// for your URLs, to get the new high-performance client
          2. try setting fs.s3a.experimental.fadvise=random in your settings; you should see a significant speedup in ORC input (see the sketch after this list).
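
          A sketch of passing that setting through the Spark session; the property name is taken verbatim from this comment, so verify the exact key against your Hadoop version's S3A documentation:

          import org.apache.spark.sql.SparkSession

          // Forward the S3A fadvise hint via Spark's spark.hadoop.* prefix.
          val spark = SparkSession.builder
            .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
            .getOrCreate()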

          If the use case here is that you want to use separate credentials for a specific bucket, you can use per-bucket config now:

          fs.s3a.bucket.site-2.access.key=my access key
          fs.s3a.bucket.site-2.access.secret=my access secret
          

          then when you refer to s3a://site-2/path, the specific key & secret for that bucket are picked up. This is why you shouldn't need to use inline secrets at all.

          jzijlstra Jork Zijlstra added a comment - - edited

          Hi Dongjoon Hyun,

          Sorry that it took some time to test the Parquet file. Our Spark cluster for the notebook got updated to Spark 2.1.1, but it wouldn't play nice with the notebook version, especially when using s3a paths. Using s3n paths I could generate the non-partitioned Parquet file.

          It also seems to be a problem with Parquet files. It throws the same error.

          Exception in thread "main" org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;

          Steve Loughran
          Thanks for the settings. I'm trying to get the notebook to play nice with s3a paths and am playing with and exploring the options now.

          Don't you mean

          fs.s3a.bucket.site-2.access.key=my access key
          fs.s3a.bucket.site-2.secret.key=my access secret
          

          Regards, jork


            People

            • Assignee:
              Unassigned
            • Reporter:
              jzijlstra Jork Zijlstra
            • Votes:
              1
            • Watchers:
              3
