  Spark / SPARK-21797

Spark cannot read partitioned data in S3 that is partly in Glacier



    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Not A Problem
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: Spark Core
    • Environment:

      Amazon EMR


    • Description:

      I have a dataset in Parquet in S3, partitioned by date (dt), with the oldest dates stored in AWS Glacier to save some money. For instance, we have...

      s3://my-bucket/my-dataset/dt=2017-07-01/    [in glacier]
      s3://my-bucket/my-dataset/dt=2017-07-09/    [in glacier]
      s3://my-bucket/my-dataset/dt=2017-07-10/    [not in glacier]
      s3://my-bucket/my-dataset/dt=2017-07-24/    [not in glacier]

      I want to read this dataset, but only a subset of dates that are not yet in Glacier, e.g.:

      import org.apache.spark.sql.functions.col

      val from = "2017-07-15"
      val to = "2017-08-24"
      val path = "s3://my-bucket/my-dataset/"
      val X = spark.read.parquet(path).where(col("dt").between(from, to))

      Unfortunately, I get the following exception:

      java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)

      It seems that Spark does not like partitioned datasets when some partitions are in Glacier. I could always read each date specifically, add the dt column with the corresponding date, and reduce(_ union _) at the end (a sketch follows below), but that is not pretty and it should not be necessary.
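
      A minimal sketch of that workaround, assuming the list of dates still available outside Glacier is known in advance (availableDates below is a hypothetical placeholder):

      import org.apache.spark.sql.functions.lit

      // Hypothetical list of partitions known not to be archived in Glacier
      val availableDates = Seq("2017-07-10", "2017-07-24")

      // Read each partition directory directly. The dt column is lost when
      // reading below the partition root, so it is re-added with lit().
      val X = availableDates
        .map(dt => spark.read.parquet(s"s3://my-bucket/my-dataset/dt=$dt/")
          .withColumn("dt", lit(dt)))
        .reduce(_ union _)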

      Is there any tip to read the available data in the datastore even with old data in Glacier?





    • Assignee: Boris Clémençon (clemencb)


              • Created:
