S3 writes through Impala work as follows:
1) PlanFragmentExecutor calls sink->Open() which initializes the table writer(s) (parquet, text, etc.)
2) PlanFragmentExecutor calls sink->Send(), which goes through the writer for the corresponding file format (HdfsTextTableWriter, HdfsParquetTableWriter, etc.). These writers ultimately call HdfsTableWriter::Write().
3) HdfsTableWriter::Write() calls hdfsWrite(), a libHDFS function.
4) libHDFS determines which filesystem it is writing to and calls the appropriate write() function; in the S3A case, this is implemented by S3AFileSystem.java.
5) S3AFileSystem uses S3AOutputStream, which buffers all writes for a file to local disk.
6) When our table writer calls Close(), it ultimately ends up in S3AOutputStream::close(), which only then uploads the file to S3. S3AOutputStream::write() itself only writes to the local disk.
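The buffering behavior in steps 5 and 6 can be sketched with a toy model. This is not the real S3AOutputStream; the class and field names are illustrative, and an in-memory buffer stands in for the temp file on local disk. The point is that nothing reaches "S3" until close():

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

// Toy model of the behavior described above: write() only appends to a
// local buffer (standing in for the temp file on local disk); the single
// upload to "S3" happens in close().
class BufferThenUploadStream extends OutputStream {
    private final ByteArrayOutputStream localDisk = new ByteArrayOutputStream();
    long uploadedBytes = 0;   // bytes that have reached "S3"

    @Override
    public void write(int b) {
        localDisk.write(b);   // spills to local disk only, never to S3
    }

    @Override
    public void close() {
        // The whole file must fit on local disk before this point.
        uploadedBytes = localDisk.size();
        localDisk.reset();
    }
}
```

So local disk usage grows linearly with the size of the file being written, which is exactly the problem described next.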
The problem is that the local disk may not have enough space to buffer all of these writes, which causes the INSERT to fail. (When writing a 50GB file, we don't want to require that the node have 50GB of free local disk.)
Problem with HdfsTextTableWriter:
- It buffers everything into a single file, no matter how large.
HdfsParquetTableWriter splits the output across multiple files with a default size of 256MB, so it is not as bad as HdfsTextTableWriter: each file is closed, and therefore uploaded, once it reaches 256MB (or whatever the parquet file size is set to).
Possible fixes:
1) Patch libHDFS to modify S3AOutputStream so that writes are streamed to S3 instead of being uploaded all at once in Close().
2) Think about a longer-term, more permanent fix (such as bypassing libHDFS for S3 and using the AWS SDK directly).
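Fix (1) could look roughly like the following sketch. uploadPart() here is a stand-in for an S3 multipart part upload; the real patch would drive S3's multipart-upload API from inside S3AOutputStream, and all names below are hypothetical. The key property is that local buffering is bounded by the part size rather than the file size:

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;

// Sketch of streaming writes: buffer at most one part locally, upload a
// part whenever the buffer fills, and upload the final (short) part plus
// a "complete" call in close(). Peak local buffering == partSize.
class StreamingUploadStream extends OutputStream {
    private final int partSize;
    private final ByteArrayOutputStream part = new ByteArrayOutputStream();
    int partsUploaded = 0;
    int maxBuffered = 0;    // peak local buffering, bounded by partSize

    StreamingUploadStream(int partSize) { this.partSize = partSize; }

    @Override
    public void write(int b) {
        part.write(b);
        maxBuffered = Math.max(maxBuffered, part.size());
        if (part.size() >= partSize) uploadPart();
    }

    private void uploadPart() {  // stand-in for an S3 multipart part upload
        partsUploaded++;
        part.reset();
    }

    @Override
    public void close() {
        // Final part, then (in the real thing) complete the multipart upload.
        if (part.size() > 0) uploadPart();
    }
}
```

With, say, a 64MB part size, a 50GB INSERT would need only 64MB of local buffer instead of 50GB, at the cost of tracking multipart-upload state (upload id, part ETags) for abort/retry handling.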