BEAM-3681: S3FileSystem fails when copying empty files

Details

    • Type: Bug
    • Status: Resolved
    • Priority: P2
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: io-java-aws
    • Labels: None

    Description

      A simple write to S3 with the direct runner sometimes fails when it ends up trying to write 'empty' shards to S3.

      // Minimal pipeline that reproduces the failure; options.getOutput() is the S3 output location.
      Pipeline pipeline = Pipeline.create(options);
      pipeline
          .apply("CreateSomeData", Create.of("1", "2", "3"))
          .apply("WriteToFS", TextIO.write().to(options.getOutput()));
      pipeline.run();

      The related exception is:

      Exception in thread "main" org.apache.beam.sdk.Pipeline$PipelineExecutionException: java.io.IOException: com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 402E99C2F602AD09; S3 Extended Request ID: SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=), S3 Extended Request ID: SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=
          at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:342)
          at org.apache.beam.runners.direct.DirectRunner$DirectPipelineResult.waitUntilFinish(DirectRunner.java:312)
          at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:206)
          at org.apache.beam.runners.direct.DirectRunner.run(DirectRunner.java:62)
          at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
          at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
          at org.apache.beam.samples.ingest.amazon.IngestToS3.main(IngestToS3.java:82)
      Caused by: java.io.IOException: com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 402E99C2F602AD09; S3 Extended Request ID: SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=), S3 Extended Request ID: SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=
          at org.apache.beam.sdk.io.aws.s3.S3FileSystem.copy(S3FileSystem.java:563)
          at org.apache.beam.sdk.io.aws.s3.S3FileSystem.lambda$copy$4(S3FileSystem.java:495)
          at org.apache.beam.sdk.io.aws.s3.S3FileSystem.lambda$callTasks$8(S3FileSystem.java:642)
          at org.apache.beam.sdk.util.MoreFutures.lambda$supplyAsync$0(MoreFutures.java:100)
          at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
      Caused by: com.amazonaws.services.s3.model.AmazonS3Exception: The XML you provided was not well-formed or did not validate against our published schema (Service: Amazon S3; Status Code: 400; Error Code: MalformedXML; Request ID: 402E99C2F602AD09; S3 Extended Request ID: SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=), S3 Extended Request ID: SDdU8AqW2mfZuG1xcKUSNeHiR0IUKcRCpZ1Wjx7sAor1CdYf8f+0dDIcQpvr3GXgqwsyk5PGWVE=
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.handleErrorResponse(AmazonHttpClient.java:1639)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1304)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1056)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:743)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:717)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:699)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:667)
          at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:649)
          at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:513)
          at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4325)
          at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:4272)
          at com.amazonaws.services.s3.AmazonS3Client.completeMultipartUpload(AmazonS3Client.java:3065)
          at org.apache.beam.sdk.io.aws.s3.S3FileSystem.copy(S3FileSystem.java:561)
          at org.apache.beam.sdk.io.aws.s3.S3FileSystem.lambda$copy$4(S3FileSystem.java:495)
          at org.apache.beam.sdk.io.aws.s3.S3FileSystem.lambda$callTasks$8(S3FileSystem.java:642)
          at org.apache.beam.sdk.util.MoreFutures.lambda$supplyAsync$0(MoreFutures.java:100)
          at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1626)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at java.lang.Thread.run(Thread.java:748)

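      After further investigation, I found that FileBasedSink can produce empty output files, and the copy method of S3FileSystem fails when copying an empty file. The trace shows the failure inside completeMultipartUpload, so copy appears to go through a multipart upload even when the source object has zero bytes.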
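      One plausible reading of the trace, offered as an assumption rather than a confirmed diagnosis: a multipart copy that splits the source into byte ranges produces no parts at all for a zero-byte object, so the completion request carries an empty part list, which S3 rejects as MalformedXML. The sketch below models only that arithmetic with the standard library. The class CopyPlanSketch, its methods, and the strategy names are hypothetical and are not taken from S3FileSystem; the guard shown is one possible approach, not necessarily the fix that was applied.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of multipart-copy planning; not the actual S3FileSystem code. */
final class CopyPlanSketch {

    /** Returns the inclusive byte ranges {first, last} a multipart copy would request, one per part. */
    static List<long[]> partRanges(long objectSize, long partSize) {
        if (partSize <= 0) {
            throw new IllegalArgumentException("partSize must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        // For objectSize == 0 the loop body never runs, so the part list stays empty.
        for (long first = 0; first < objectSize; first += partSize) {
            long last = Math.min(objectSize, first + partSize) - 1;
            ranges.add(new long[] {first, last});
        }
        return ranges;
    }

    /** Picks a single-request copy for objects below the threshold, which includes empty objects. */
    static String chooseStrategy(long objectSize, long multipartThreshold) {
        return objectSize < multipartThreshold ? "SINGLE_REQUEST_COPY" : "MULTIPART_COPY";
    }

    public static void main(String[] args) {
        System.out.println(partRanges(0, 5).size());   // prints 0: nothing to list in the completion request
        System.out.println(partRanges(12, 5).size());  // prints 3: ranges 0-4, 5-9, 10-11
        System.out.println(chooseStrategy(0, 5));      // prints SINGLE_REQUEST_COPY
    }
}
```

      Routing small objects to a single copy request sidesteps the empty part list entirely. S3 accepts a single copy request for objects up to 5 GB, so a threshold at or below that size would be a natural choice in a real implementation.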

       

       

            People

              Assignee: Ismaël Mejía (iemejia)
              Reporter: Ismaël Mejía (iemejia)

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 4h 10m