[HADOOP-13560] S3ABlockOutputStream to support huge (many GB) file writes - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 2.9.0
Fix Version/s: 2.8.0, 3.0.0-alpha2
Component/s: fs/s3
Labels:
None

Target Version/s:

2.9.0
Hadoop Flags:

Incompatible change
Release Note:
This mechanism replaces the (experimental) fast output stream of Hadoop 2.7.x, combining better scalability options with instrumentation. Consult the S3A documentation to see the extra configuration operations.

Description

There's two output stream mechanisms in Hadooop 2.7.x, neither of which handle massive multi-GB files that well.

"classic": buffer everything to HDD until to the close() operation; time to close becomes O(data); as is available disk space. Fails to exploit exploit idle bandwidth, and on EC2 VMs with not much HDD capacity (especially completing with HDFS storage), can fill up the disk.

S3AFastOutputStream uploads data in partition-sized blocks, buffering via byte arrays. Avoids disk problems and as it writes as soon as the first partition is ready, close() time is O(outstanding-data). However: needs tuning to reduce amount of data buffered. Get it wrong, and the first clue you get may be that the process goes OOM or is killed by YARN. Which is a shame, as get it right and operations which generates lots of data, complete much faster, including distcp.

This patch proposes a new output stream, a successor to both, S3ABlockOutputStream.

uses block upload model of S3AFastOutputStream
supports buffering via: HDD, heap and (recycled) byte buffer, offering a choice between memory and HDD use. HDD: no OOM problems on small JVMs/need to tune.
Uses the fast output stream mechanism of limiting queue size for data to upload. Even when buffering via HDD, you may need to limit that use.
lots of instrumentation to see what's being written.
good defaults out the box (e.g buffer to HDD, partition size to strike a good balance of early upload and scaleability)
robust against transient failures. The AWS SDK retries a PUT on failure; the entire block may need to be replayed, so HDD input cannot be buffered via java.io.BufferedInputStream. It has also surfaced in testing that if the final commit of a multipart option fails, it isn't retried —at least in the current SDK in use. Do that ourselves.
use roundrobin directory allocation, for most effective disk use
take an AWS SDK com.amazonaws.event.ProgressListener for progress callbacks, giving more detail on the operation. (It actually takes a org.apache.hadoop.util.Progressable, but if that also implements the AWS interface, that is used instead.

All of this to come with scale tests

generate large files using all buffer mechanisms
Do a large copy/rname and verify that the copy really works, including metadata
be configurable with sizes up to muti-GB, which also means that the test timeouts need to be configurable to match the time it can take.
As they are slow, make them optional, using the -Dscale switch to enable.

Verifying large file rename is important on its own, as it is needed for very large commit operations for committers using rename

The goal here is to implement a single, object stream which can be used for all outputs, tuneable as to whether to use disk or memory, and on queue sizes, but otherwise be all that's needed. We can do future development on this, remove its predecessor S3AFastOutputStream, so keeping docs and testing down, and leave the original S3AOutputStream alone for regression testing/fallback.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending

HADOOP-13560-branch-2-001.patch
01/Sep/16 17:18
53 kB
Steve Loughran
HADOOP-13560-branch-2-002.patch
06/Sep/16 20:00
107 kB
Steve Loughran
HADOOP-13560-branch-2-003.patch
20/Sep/16 19:18
161 kB
Steve Loughran
HADOOP-13560-branch-2-004.patch
21/Sep/16 22:17
164 kB
Steve Loughran

Issue Links

breaks

HADOOP-13735 ITestS3AFileContextStatistics.testStatistics() failing

Resolved

HADOOP-14028 S3A BlockOutputStreams doesn't delete temporary files in multipart uploads or handle part upload failures

Resolved

contains

HADOOP-13567 S3AFileSystem to override getStorageStatistics() and so serve up its statistics

Resolved

HADOOP-13569 S3AFastOutputStream to take ProgressListener in file create()

Resolved

depends upon

HADOOP-13703 S3ABlockOutputStream to pass Yetus & Jenkins

Resolved

duplicates

HADOOP-13703 S3ABlockOutputStream to pass Yetus & Jenkins

Resolved

incorporates

HADOOP-12979 IOE in S3a: ${hadoop.tmp.dir}/s3a not configured

Resolved

is depended upon by

SPARK-19111 S3 Mesos history upload fails silently if too large

Resolved

IMPALA-3394 Writes to S3 require equivalent local disk space

Resolved

relates to

HADOOP-11183 Memory-based S3AOutputstream

Closed

supercedes

HADOOP-12387 Improve S3AFastOutputStream memory management

Resolved

HADOOP-13262 set multipart delete timeout to 5 * 60s in S3ATestUtils.createTestFileSystem

Resolved

HADOOP-13281 S3A to track multipart upload count, size, duration

Resolved

HADOOP-13531 S3A output streams to share a single LocalDirAllocator for round-robin drive use

Resolved

HADOOP-13566 NPE in S3AFastOutputStream.write

Resolved

links to

GitHub Pull Request #125

GitHub Pull Request #130

(1 duplicates, 1 incorporates, 2 is depended upon by, 1 relates to, 5 supercedes, 2 links to)

Activity

People

Assignee:: Steve Loughran

Reporter:: Steve Loughran

Votes:: 0 Vote for this issue

Watchers:: 14 Start watching this issue

Dates

Created:: 30/Aug/16 12:31

Updated:: 01/Apr/17 12:39

Resolved:: 18/Oct/16 20:19