S3 writes through Impala work as follows:
1) PlanFragmentExecutor calls sink->Open() which initializes the table writer(s) (parquet, text, etc.)
2) PlanFragmentExecutor calls sink->Send(), which goes through the writer for the corresponding file format (HdfsTextTableWriter, HdfsParquetTableWriter, etc.). These writers ultimately call HdfsTableWriter::Write().
3) HdfsTableWriter::Write() calls hdfsWrite(), a libHDFS function.
4) libHDFS determines which filesystem it is writing to and calls the appropriate write() function; in the S3A case, this is implemented by S3AFileSystem.java.
5) S3AFileSystem uses S3AOutputStream, which buffers all writes for a file to local disk.
6) When our table writer calls Close(), it ultimately ends up in S3AOutputStream::close(), which only then uploads the file to S3. S3AOutputStream::write() itself only writes to the local disk.
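The buffering behavior in steps 5 and 6 can be sketched with a toy model. This is not the real S3AOutputStream; the class and field names are illustrative, and an in-memory buffer stands in for the temp file on local disk. The point is that nothing reaches "S3" until close():

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

// Toy model of the behavior described above: write() only appends to a
// local buffer (standing in for the temp file on local disk); the single
// upload to "S3" happens in close().
class BufferThenUploadStream extends OutputStream {
    private final ByteArrayOutputStream localDisk = new ByteArrayOutputStream();
    long uploadedBytes = 0;   // bytes that have reached "S3"

    @Override
    public void write(int b) {
        localDisk.write(b);   // spills to local disk only, never to S3
    }

    @Override
    public void close() {
        // The whole file must fit on local disk before this point.
        uploadedBytes = localDisk.size();
        localDisk.reset();
    }
}
```

So local disk usage grows linearly with the size of the file being written, which is exactly the problem described next.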
The problem is that the local disk may not have enough space to buffer all of these writes, which causes the INSERT to fail. (When writing a 50GB file, we don't want to require that the node have 50GB of free local disk.)
Problem with HdfsTextTableWriter:
- It buffers everything into a single file, no matter how large.
HdfsParquetTableWriter splits the output across multiple files with a default size of 256MB, so it is not as bad as HdfsTextTableWriter: each file is closed, and therefore uploaded, once it reaches 256MB (or whatever the parquet file size is set to).
Possible fixes:
1) Patch libHDFS to modify S3AOutputStream so that writes are streamed to S3 instead of being uploaded all at once in Close().
2) Think about a longer-term, more permanent fix (such as bypassing libHDFS for S3 and using the AWS SDK directly).
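Fix (1) could look roughly like the following sketch. uploadPart() here is a stand-in for an S3 multipart part upload; the real patch would drive S3's multipart-upload API from inside S3AOutputStream, and all names below are hypothetical. The key property is that local buffering is bounded by the part size rather than the file size:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

// Sketch of streaming writes: buffer at most one part locally, upload a
// part whenever the buffer fills, and upload the final (short) part plus
// a "complete" call in close(). Peak local buffering == partSize.
class StreamingUploadStream extends OutputStream {
    private final int partSize;
    private final ByteArrayOutputStream part = new ByteArrayOutputStream();
    int partsUploaded = 0;
    int maxBuffered = 0;    // peak local buffering, bounded by partSize

    StreamingUploadStream(int partSize) { this.partSize = partSize; }

    @Override
    public void write(int b) {
        part.write(b);
        maxBuffered = Math.max(maxBuffered, part.size());
        if (part.size() >= partSize) uploadPart();
    }

    private void uploadPart() {  // stand-in for an S3 multipart part upload
        partsUploaded++;
        part.reset();
    }

    @Override
    public void close() {
        // Final part, then (in the real thing) complete the multipart upload.
        if (part.size() > 0) uploadPart();
    }
}
```

With, say, a 64MB part size, a 50GB INSERT would need only 64MB of local buffer instead of 50GB, at the cost of tracking multipart-upload state (upload id, part ETags) for abort/retry handling.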