Beam / BEAM-2500

Add support for S3 as an Apache Beam FileSystem

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: sdk-java-extensions
    • Labels: None

      Description

      Note that this is for providing direct integration with S3 as an Apache Beam FileSystem.

      There is already support for using the Hadoop S3 connector by depending on the Hadoop File System module [1] and configuring HadoopFileSystemOptions [2] with an S3 configuration [3].

      1: https://github.com/apache/beam/tree/master/sdks/java/io/hadoop-file-system
      2: https://github.com/apache/beam/blob/master/sdks/java/io/hadoop-file-system/src/main/java/org/apache/beam/sdk/io/hdfs/HadoopFileSystemOptions.java#L53
      3: https://wiki.apache.org/hadoop/AmazonS3
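
      For reference, a minimal sketch of that Hadoop-based route (a sketch only: the access/secret keys and bucket paths are placeholders, and it assumes the hadoop-aws/S3A connector is on the classpath):

      import java.util.Collections;
      import org.apache.beam.sdk.Pipeline;
      import org.apache.beam.sdk.io.TextIO;
      import org.apache.beam.sdk.io.hdfs.HadoopFileSystemOptions;
      import org.apache.beam.sdk.options.PipelineOptionsFactory;
      import org.apache.hadoop.conf.Configuration;

      public class S3ViaHadoopFileSystem {
        public static void main(String[] args) {
          // S3A settings handed to the Hadoop FileSystem module; values are placeholders.
          Configuration s3aConf = new Configuration();
          s3aConf.set("fs.s3a.access.key", "ACCESS_KEY_ID");
          s3aConf.set("fs.s3a.secret.key", "SECRET_ACCESS_KEY");

          HadoopFileSystemOptions options =
              PipelineOptionsFactory.fromArgs(args).withValidation().as(HadoopFileSystemOptions.class);
          options.setHdfsConfiguration(Collections.singletonList(s3aConf));

          Pipeline p = Pipeline.create(options);
          p.apply(TextIO.read().from("s3a://my-bucket/input/*"))
              .apply(TextIO.write().to("s3a://my-bucket/output/part"));
          p.run().waitUntilFinish();
        }
      }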

      Attachments: hadoop_fs_patch.patch (7 kB, Guillaume Balaine)


          Activity

          githubbot ASF GitHub Bot added a comment -

          GitHub user lukecwik opened a pull request:

          https://github.com/apache/beam-site/pull/257

          BEAM-2500 List Amazon S3 File System as a planned I/O.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/lukecwik/incubator-beam-site asf-site

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/beam-site/pull/257.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #257


          commit a1643b1360a3628e6fd6ed6812df5cfb360e8873
          Author: Luke Cwik <lcwik@google.com>
          Date: 2017-06-22T17:52:03Z

          BEAM-2500 List Amazon S3 File System as a planned I/O.


          stevel@apache.org Steve Loughran added a comment -

          It's not clear that the S3 clients from EMR or Apache (S3A) work with Beam until you have the tests. There is certainly a report on StackOverflow which implies that Beam depends on a read(ByteBuffer) operation that is not implemented by S3A (nor, indeed, by Azure WASB).

          igosuki Guillaume Balaine added a comment -

          I got s3a to work on a simple aggregation job: I just write text files to s3a and include "org.apache.hadoop" % "hadoop-aws" % "2.7.3". Is there anything we're missing? The only trouble I had was in debugging, where my filename policy was formatting ':' characters into file names, which gave a wrong ResourceId in Beam.

          stevel@apache.org Steve Loughran added a comment -

          Thanks for the update.

          ':' is a trouble-spot character in Hadoop filesystems; I think it may nominally be forbidden, because it complicates URLs too much.

          igosuki Guillaume Balaine added a comment -

          Thanks, that's fine really; the only trouble was that I had to dig through some example code to figure it out, because no stack traces surface in Beam. Resolving a ResourceId with such a path from another folder gives you an incomplete URI, where the base path is truncated:

          (s3a://mybucket/myfolder/somefilename.fmt).resolve(somefilename-12:30-13:30.fmt) -> ResourceId{URI{somefilename-12:30-13:30.fmt}}
          while
          (s3a://mybucket/myfolder/somefilename.fmt).resolve(somefilename-12.30-13.30.fmt) -> ResourceId{URI{s3a://mybucket/myfolder/somefilename-12.30-13.30.fmt}}

          So people need to be aware of their file name policies in Beam.

          On another note, reads don't work because S3 input streams don't implement ByteBufferReadable, as you mentioned here: https://stackoverflow.com/questions/44792884/apache-beam-unable-to-read-text-file-from-s3-using-hadoop-file-system-sdk, so I guess fixing that would be enough to resolve this issue.

          stevel@apache.org Steve Loughran added a comment -
          • Do a patch for that and the ByteBuffer and I'll look at it... it's not too hard.
          • S3A ought to be able to do some escaping of the path, when I think about it, through the normal URL %20-style escaping. It is very hard to avoid double escaping though, as we are always creating paths, new paths, going from path to S3 key and back again.
          stevel@apache.org Steve Loughran added a comment -

          Also, as well as any fixes to S3A, it'd be sweet if Beam itself added some integration tests for this. The more downstream tests the better, as that's where we find differences in expectations.

          igosuki Guillaume Balaine added a comment - edited

          This patch works for me: hadoop_fs_patch.patch. Thanks, Steve.

          igosuki Guillaume Balaine added a comment -

          Sorry about that; I discovered a bug later down the line. We actually have to set the position in the buffer manually after reading into the backing array, like this:

          @Override
          public int read(ByteBuffer dst) throws IOException {
            if (closed) {
              throw new IOException("Channel is closed");
            }
            int read = 0;
            try {
              read = inputStream.read(dst);
            } catch (UnsupportedOperationException e) {
              // Fallback read: the stream is not ByteBufferReadable, so read into the
              // backing array and advance the buffer's position manually.
              read = inputStream.read(dst.array());
              if (read > 0) {
                dst.position(dst.position() + read);
              }
            }
            return read;
          }

          jbonofre Jean-Baptiste Onofré added a comment -

          Thanks for the update. On my side, I'm going to take a look when I'm back from vacation.

          iemejia Ismaël Mejía added a comment -

          I created BEAM-2790 to move the discussion about the issue with reading from S3 using HadoopFileSystem into its own thread. Remember that this JIRA was created for the 'native' (non Hadoop-based) implementation of S3.
          I basically applied a minimal version of the patch that Guillaume Balaine contributed (thanks!). If there are any further issues with the Hadoop-based implementation, please report them in the new JIRA or create a new issue.

          jmarble Jacob Marble added a comment - edited

          I'm interested in implementing S3 support. Not being familiar with Beam internals, and without committing myself to anything, perhaps someone can comment on my research notes.

          GCS is probably a good template. Implement FileSystem, ResourceId, FileSystemRegistrar, PipelineOptions, PipelineOptionsRegistrar:
          https://github.com/apache/beam/tree/master/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/extensions/gcp

          For interacting with S3, this is probably the preferred SDK:
          http://mvnrepository.com/artifact/com.amazonaws/aws-java-sdk-s3

          Some specifics about implementing FileSystem:

          FileSystem.copy()

          • AmazonS3Client.copyObject(String sourceBucketName, String sourceKey, String destinationBucketName, String destinationKey)
          • max upload size is 5GB, which is probably fine to start, but need to use multipart upload to get full 5TB limit

          FileSystem.create()

          • AmazonS3Client.putObject(String bucketName, String key, InputStream input, ObjectMetadata metadata)
          • max upload size is 5GB, which is probably fine to start, but need to use multipart upload to get full 5TB limit

          FileSystem.delete()

          • AmazonS3Client.deleteObjects(DeleteObjectsRequest deleteObjectsRequest)

          FileSystem.getScheme()

          • return "s3"

          FileSystem.match()

          • j.o.apache.beam.sdk.extensions.util.gcsfs.GcsPath and GcsUtil in the same package have some good ideas

          FileSystem.matchNewResource()

          • Look at GcsPath and GcsUtil

          FileSystem.open()

          • AmazonS3Client.getObject(String bucketName, String key)

          FileSystem.rename()

          • Can't find anything in AmazonS3Client; perhaps call FileSystem.copy(), then FileSystem.delete()

          I'm not clear about how to register the s3 FileSystem as mentioned in the FileSystemRegistrar Javadoc:

          "FileSystem creators have the ability to provide a registrar by creating a ServiceLoader entry and a concrete implementation of this interface.

          It is optional but recommended to use one of the many build time tools such as AutoService to generate the necessary META-INF files automatically."
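
          On the registration question, a rough sketch of an AutoService-based registrar, modeled on the GCS one (S3FileSystem and S3Options are the hypothetical classes from the notes above, and the exact fromOptions signature should be checked against the FileSystemRegistrar Javadoc):

          import com.google.auto.service.AutoService;
          import com.google.common.collect.ImmutableList;
          import org.apache.beam.sdk.io.FileSystem;
          import org.apache.beam.sdk.io.FileSystemRegistrar;
          import org.apache.beam.sdk.options.PipelineOptions;

          // @AutoService writes the META-INF/services/org.apache.beam.sdk.io.FileSystemRegistrar
          // entry so the registrar is discovered via ServiceLoader at runtime.
          @AutoService(FileSystemRegistrar.class)
          public class S3FileSystemRegistrar implements FileSystemRegistrar {
            @Override
            public Iterable<FileSystem<?>> fromOptions(PipelineOptions options) {
              // S3FileSystem and S3Options are hypothetical, per the notes above.
              return ImmutableList.of(new S3FileSystem(options.as(S3Options.class)));
            }
          }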

          chamikara Chamikara Jayalath added a comment -

          Thanks for looking into this Jacob. I'll try to answer some of your questions.

          I agree that GCS is a good template. It's pretty well battle tested and used by all Dataflow pipelines.

          The 5GB limit for FileSystem.copy(), though it might be a good start, might not be enough for production use. Beam can parallelize reading of large files, so there may be many users who are used to reading a small number of large files from their pipelines. Note that we use the copy() operation to finalize files written using FileBasedSink implementations, so we might need to copy large files there as well. It'll be good if copy() can be implemented using multipart upload, as you mentioned.

          FileSystem.create() is to create an empty WritableByteChannel that will be written to later. So we'll have to have a way to stream bytes into S3 (some implementation of WritableByteChannel). I'm not sure if the S3 client library already supports this.

          's3' for the scheme sounds good to me.

          For efficient parallelized reading, FileSystem.open() should return an efficiently seekable SeekableByteChannel.

          Using copy() + delete() combination for rename() is fine.

          We might have to address issues related to providing credentials for accessing S3. See the following JIRA, where some details related to this were discussed when Dmitry Demeshchuk looked into adding an S3 file system for the Python SDK.
          https://issues.apache.org/jira/browse/BEAM-2572
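
          On the seekable-read point above, one way to serve a read at an arbitrary position is a ranged GET per request (a sketch only, using the v1 AWS Java SDK; the re-open-per-read strategy and the bucket/key handling are assumptions, and a real SeekableByteChannel would keep the connection open across sequential reads):

          import com.amazonaws.services.s3.AmazonS3;
          import com.amazonaws.services.s3.model.GetObjectRequest;
          import com.amazonaws.services.s3.model.S3Object;
          import java.io.IOException;
          import java.io.InputStream;
          import java.nio.ByteBuffer;

          public class S3RangedReads {
            /** Reads up to dst.remaining() bytes starting at position, using a ranged GET. */
            public static int readAt(AmazonS3 s3, String bucket, String key, long position, ByteBuffer dst)
                throws IOException {
              GetObjectRequest request = new GetObjectRequest(bucket, key);
              // HTTP ranges are inclusive, so request exactly the bytes the caller wants.
              request.setRange(position, position + dst.remaining() - 1);
              try (S3Object object = s3.getObject(request);
                  InputStream in = object.getObjectContent()) {
                byte[] chunk = new byte[dst.remaining()];
                int total = 0;
                int n;
                while (total < chunk.length && (n = in.read(chunk, total, chunk.length - total)) > 0) {
                  total += n;
                }
                dst.put(chunk, 0, total);
                return total;
              }
            }
          }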

          lcwik Luke Cwik added a comment -

          Performing multipart download/upload will become important, as 5 GiB has limited use, but start off implementing the simpler thing; multipart upload/download can come later.

          http://docs.aws.amazon.com/AmazonS3/latest/API/RESTObjectCOPY.html
          Amazon supports an efficient copy operation if you specify "x-amz-copy-source" as a header: you don't need to upload the bytes, and it just adds some metadata that points to the same set of bytes. Depending on which Amazon S3 Java library you use, it may or may not expose this flexibility.
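
          With the v1 Java SDK, that server-side copy is reachable via copyObject, and the over-5GB case via part copies against a multipart upload; a rough sketch (part size, class names, and the omission of error handling are assumptions of this sketch):

          import com.amazonaws.services.s3.AmazonS3;
          import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
          import com.amazonaws.services.s3.model.CopyPartRequest;
          import com.amazonaws.services.s3.model.CopyPartResult;
          import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
          import com.amazonaws.services.s3.model.PartETag;
          import java.util.ArrayList;
          import java.util.List;

          public class S3Copies {
            // Server-side copy; no bytes pass through the client, but limited to 5GB per request.
            public static void simpleCopy(AmazonS3 s3, String srcBucket, String srcKey,
                String dstBucket, String dstKey) {
              s3.copyObject(srcBucket, srcKey, dstBucket, dstKey);
            }

            // Multipart copy for larger objects: copy byte ranges of the source as parts of a new upload.
            public static void multipartCopy(AmazonS3 s3, String srcBucket, String srcKey,
                String dstBucket, String dstKey, long objectSize) {
              long partSize = 64L * 1024 * 1024; // assumed part size
              String uploadId = s3.initiateMultipartUpload(
                  new InitiateMultipartUploadRequest(dstBucket, dstKey)).getUploadId();
              List<PartETag> etags = new ArrayList<>();
              int partNumber = 1;
              for (long offset = 0; offset < objectSize; offset += partSize, partNumber++) {
                long lastByte = Math.min(offset + partSize, objectSize) - 1;
                CopyPartResult result = s3.copyPart(new CopyPartRequest()
                    .withSourceBucketName(srcBucket).withSourceKey(srcKey)
                    .withDestinationBucketName(dstBucket).withDestinationKey(dstKey)
                    .withUploadId(uploadId)
                    .withPartNumber(partNumber)
                    .withFirstByte(offset)
                    .withLastByte(lastByte));
                etags.add(result.getPartETag());
              }
              s3.completeMultipartUpload(
                  new CompleteMultipartUploadRequest(dstBucket, dstKey, uploadId, etags));
            }
          }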

          jmarble Jacob Marble added a comment -

          Chamikara, thanks for your comment. I'll switch my implementation to multipart after I have something working, just got the simple 5GB version written. I'll also give closer consideration to the credentials question after I have the harder parts complete. For now, just using flags via PipelineOptions.

          So I have completed enough of this to test it out, except for one problem. S3 requires the content length before writing any data, or else the client buffers the entire content in memory before writing. I have added contentLength to my S3CreateOptions, but how can that value be set before S3FileSystem.create() is called?

          jmarble Jacob Marble added a comment -

          Luke, that put-copy API is called by the Java SDK, but it's weird because it still has the 5GB limit without switching to its multipart form.

          jmarble Jacob Marble added a comment -

          Multipart upload could get us around the content length requirement, but it's awkward. An object can be 5TB, and a multipart upload can have 10,000 parts, so I would have to read 500MB at a time into memory and ship those chunks. Bad idea.

          I still can't see how Beam can indicate content length to a FileSystem sink. I'll move on to source stuff for a while.

          lcwik Luke Cwik added a comment -

          The GCS implementation uses a fixed-size buffer of 32 or 64 MB (I don't remember which), so buffering in memory is a common practice. 500 MB does seem like a lot, but even if you went with 32 MB, you could support files up to ~320 GB.

          jmarble Jacob Marble added a comment -

          Luke, thanks! I may have found what you're referencing in AbstractGoogleAsyncWriteChannel.uploadBufferSize. The default is 64MB, or 8MB when the JVM's max memory is under 512MB:

          GCS_UPLOAD_GRANULARITY = 8 * 1024 * 1024;
          UPLOAD_CHUNK_SIZE_DEFAULT =
              Runtime.getRuntime().maxMemory() < 512 * 1024 * 1024
                  ? GCS_UPLOAD_GRANULARITY
                  : 8 * GCS_UPLOAD_GRANULARITY;
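
          For comparison, a fixed-buffer multipart uploader exposed as a WritableByteChannel could be sketched like this (the 8MB chunk size mirrors GCS_UPLOAD_GRANULARITY above; the class names, the empty-object case, and abort-on-failure handling are assumptions left out of this sketch):

          import com.amazonaws.services.s3.AmazonS3;
          import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
          import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
          import com.amazonaws.services.s3.model.PartETag;
          import com.amazonaws.services.s3.model.UploadPartRequest;
          import java.io.ByteArrayInputStream;
          import java.io.IOException;
          import java.nio.ByteBuffer;
          import java.nio.channels.WritableByteChannel;
          import java.util.ArrayList;
          import java.util.List;

          /** Buffers a fixed-size chunk in memory and ships it as one part of a multipart upload. */
          public class S3MultipartWriteChannel implements WritableByteChannel {
            private static final int CHUNK_SIZE = 8 * 1024 * 1024; // assumed buffer size

            private final AmazonS3 s3;
            private final String bucket;
            private final String key;
            private final String uploadId;
            private final ByteBuffer chunk = ByteBuffer.allocate(CHUNK_SIZE);
            private final List<PartETag> etags = new ArrayList<>();
            private int partNumber = 1;
            private boolean open = true;

            public S3MultipartWriteChannel(AmazonS3 s3, String bucket, String key) {
              this.s3 = s3;
              this.bucket = bucket;
              this.key = key;
              this.uploadId =
                  s3.initiateMultipartUpload(new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
            }

            @Override
            public int write(ByteBuffer src) throws IOException {
              int written = 0;
              while (src.hasRemaining()) {
                if (!chunk.hasRemaining()) {
                  flushChunk();
                }
                int n = Math.min(chunk.remaining(), src.remaining());
                byte[] bytes = new byte[n];
                src.get(bytes);
                chunk.put(bytes);
                written += n;
              }
              return written;
            }

            // Uploads the buffered bytes as the next part, then resets the buffer.
            private void flushChunk() {
              chunk.flip();
              byte[] part = new byte[chunk.remaining()];
              chunk.get(part);
              UploadPartRequest request = new UploadPartRequest()
                  .withBucketName(bucket).withKey(key)
                  .withUploadId(uploadId)
                  .withPartNumber(partNumber++)
                  .withPartSize(part.length)
                  .withInputStream(new ByteArrayInputStream(part));
              etags.add(s3.uploadPart(request).getPartETag());
              chunk.clear();
            }

            @Override
            public boolean isOpen() {
              return open;
            }

            @Override
            public void close() throws IOException {
              if (!open) {
                return;
              }
              open = false;
              if (chunk.position() > 0) {
                flushChunk();
              }
              s3.completeMultipartUpload(
                  new CompleteMultipartUploadRequest(bucket, key, uploadId, etags));
            }
          }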

          stevel@apache.org Steve Loughran added a comment -

          This is how Hadoop does its multipart upload

          An OutputStream which switches to MPU (multipart upload) once the amount of data exceeds the block size:

          https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ABlockOutputStream.java

          Options to use: heap, ByteBuffer, or HDD pool for storage:
          https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ADataBlocks.java

          The default is HDD because the others are fussier about thread configuration: a mismatch between generation rate and upload bandwidth will cause OOM failures, which invariably happens halfway through a distcp.

          If you want to evolve the Hadoop FS APIs for better blobstore integration, that's something to play with (HADOOP-9565 has discussed it for ages). Issue: there is a broad set of differences between the stores, and the lowest common denominator is too limited. Justification: making things look like a directory tree with operations like rename() is even worse, and there is no copy() in the API at present.

          I'd go for the core set of verbs: PUT, LIST, COPY, plus the ability to query the FS for its semantics (consistency model, etc.). A cross-store multipart upload would be trickier.

          stevel@apache.org Steve Loughran added a comment -

          "So we'll have to have a way to stream bytes into S3 (some implementation of WritableByteChannel). I'm not sure if the S3 client library already supports this."

          Yes, it takes an input stream through its transfer manager, but it needs one supporting mark/reset if you want the manager to handle a transient failure of the write of a block of data.

          jmarble Jacob Marble added a comment -

          Here is a working implementation. I'm going to use it for a work project, kick the tires. Plenty of TODOs etc.

          https://github.com/Kochava/beam-s3

          stevel@apache.org Steve Loughran added a comment -

          Looking at the code, the main thing I'd highlight is that using Regions.fromName(region) means that you can't support any region which isn't explicitly handled in the version of the SDK you build and ship, or handle non-AWS implementations. Best just to let people declare the endpoint, and let them deal with the stack traces if they get it wrong.
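
          For example, building the client from a user-supplied endpoint rather than the Regions enum could look like this (a sketch; the endpoint and signing-region strings are placeholders):

          import com.amazonaws.client.builder.AwsClientBuilder;
          import com.amazonaws.services.s3.AmazonS3;
          import com.amazonaws.services.s3.AmazonS3ClientBuilder;

          public class S3ClientFromEndpoint {
            public static AmazonS3 build(String endpoint, String signingRegion) {
              // Accept any endpoint string (including non-AWS, S3-compatible stores)
              // instead of restricting to the Regions enum baked into the SDK.
              return AmazonS3ClientBuilder.standard()
                  .withEndpointConfiguration(
                      new AwsClientBuilder.EndpointConfiguration(endpoint, signingRegion))
                  .build();
            }
          }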

          jmarble Jacob Marble added a comment -

          Thanks, Steve. Fixed.

          jmarble Jacob Marble added a comment -

          I see this warning thousands of times when reading from S3:

          Sep 20, 2017 8:45:04 AM com.amazonaws.services.s3.internal.S3AbortableInputStream close
          WARNING: Not all bytes were read from the S3ObjectInputStream, aborting HTTP connection. This is likely an error and may result in sub-optimal behavior. Request only the bytes you need via a ranged GET or drain the input stream after use.

          It looks like TextIO requests bytes n through m, but consumes fewer than m-n bytes, then closes the channel (the channel wraps a stream). Am I wrong? Is m-n predictably small enough that I should drain the stream at close?
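
          For what it's worth, one common way to avoid paying the drain cost when a channel is closed early is to abort the underlying S3ObjectInputStream rather than letting close() read the rest (a sketch; whether draining or aborting is cheaper depends on how many bytes remain and whether the connection can be reused):

          import com.amazonaws.services.s3.model.S3ObjectInputStream;
          import java.io.IOException;

          public class S3Streams {
            /**
             * Closes an S3 object stream that may not have been fully consumed. abort() drops the
             * HTTP connection instead of reading the remaining bytes, so close() has nothing left to drain.
             */
            public static void closeEarly(S3ObjectInputStream in) throws IOException {
              in.abort();
              in.close();
            }
          }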

          jmarble Jacob Marble added a comment - edited

          Hmm, I think there's a bug in my code. Please ignore for now.


            People

            • Assignee: Unassigned
            • Reporter: lcwik Luke Cwik
            • Votes: 1
            • Watchers: 11
