[SPARK-21797] spark cannot read partitioned data in S3 that are partly in glacier - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Not A Problem
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: Spark Core
Labels:
- glacier
- partitions
- read
- s3
Environment:

Amazon EMR

Description

I have a Parquet dataset in S3, partitioned by date (dt), whose oldest partitions are stored in AWS Glacier to save money. For instance, we have:

s3://my-bucket/my-dataset/dt=2017-07-01/    [in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-09/    [in glacier]
s3://my-bucket/my-dataset/dt=2017-07-10/    [not in glacier]
...
s3://my-bucket/my-dataset/dt=2017-07-24/    [not in glacier]

I want to read this dataset, but only the subset of dates that are not yet in Glacier, e.g.:

import org.apache.spark.sql.functions.col

val from = "2017-07-15"
val to = "2017-08-24"
val path = "s3://my-bucket/my-dataset/"
val X = spark.read.parquet(path).where(col("dt").between(from, to))

Unfortunately, I get the following exception:

java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)

It seems that Spark cannot handle a partitioned dataset when some of its partitions are in Glacier. I could always read each date individually, add a dt column holding that date, and reduce(_ union _) at the end, but that is not pretty and should not be necessary.
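
For reference, the per-date workaround described above could look roughly like the following. This is only a minimal sketch that has not been run on EMR: `datesBetween` and `readDates` are hypothetical helper names, and it assumes that every date in the list has an existing partition directory that is not in Glacier (a missing directory or an empty date list would make it fail).

```scala
import java.time.LocalDate
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object ReadAvailablePartitions {
  // Inclusive list of ISO dates (yyyy-MM-dd) between `from` and `to`.
  def datesBetween(from: String, to: String): Seq[String] = {
    val start = LocalDate.parse(from)
    val end = LocalDate.parse(to)
    Iterator.iterate(start)(_.plusDays(1))
      .takeWhile(d => !d.isAfter(end))
      .map(_.toString)
      .toList
  }

  // Read each dt=... directory on its own, re-add the partition value as a
  // string column (it is not inferred when the partition directory itself is
  // the read path), then union everything together.
  // Assumes basePath ends with "/" and dates is non-empty.
  def readDates(spark: SparkSession, basePath: String, dates: Seq[String]): DataFrame =
    dates
      .map(dt => spark.read.parquet(s"${basePath}dt=$dt/").withColumn("dt", lit(dt)))
      .reduce(_ union _)
}
```

With the layout above, usage would be something like `ReadAvailablePartitions.readDates(spark, "s3://my-bucket/my-dataset/", ReadAvailablePartitions.datesBetween("2017-07-15", "2017-07-24"))`. As far as I know, passing the same partition paths to a single `spark.read.option("basePath", path).parquet(paths: _*)` call should keep the dt column without the manual union, but I have not verified that against Glacier-backed objects.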

Is there any way to read the available data in the datastore even when older data is in Glacier?

Issue Links

is related to: HADOOP-14837 Handle S3A "glacier" data (Open)

People

Assignee: Unassigned

Reporter: Boris Clémençon

Votes: 0

Watchers: 5

Dates

Created: 21/Aug/17 13:35

Updated: 03/Apr/19 05:51

Resolved: 21/Aug/17 14:10