[IMPALA-3394] Writes to S3 require equivalent local disk space


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Won't Fix
    • Affects Version/s: Impala 2.6.0
    • Fix Version/s: None
    • Component/s: Backend

    Description

      S3 writes through Impala work as follows:

      1) PlanFragmentExecutor calls sink->Open(), which initializes the table writer(s) (Parquet, text, etc.).

      2) PlanFragmentExecutor calls sink->Send(), which goes through the writer for the corresponding file format (HdfsTextTableWriter, HdfsParquetTableWriter, etc.). These writers ultimately call HdfsTableWriter::Write().

      3) HdfsTableWriter::Write() calls hdfsWrite(), which is a libHDFS function.

      4) libHDFS determines which filesystem it's writing to and calls the appropriate write() function. In the S3A case, it uses S3AFileSystem.java:
      https://github.com/apache/hadoop/blob/2e1d0ff4e901b8313c8d71869735b94ed8bc40a0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AFileSystem.java

      5) S3AFileSystem uses S3AOutputStream, which buffers all the writes for an output file into a temporary file on the local disk:
      https://github.com/apache/hadoop/blob/2e1d0ff4e901b8313c8d71869735b94ed8bc40a0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AOutputStream.java#L99

      6) When our table writer calls Close(), it ultimately ends up calling S3AOutputStream::close(), and only then is the data uploaded to S3; S3AOutputStream::write() only ever writes to the local disk (see the sketch after this list).
      https://github.com/apache/hadoop/blob/2e1d0ff4e901b8313c8d71869735b94ed8bc40a0/hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3AOutputStream.java#L120
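
      To make the buffering behavior concrete, the following is a minimal sketch of the pattern that steps 4)-6) describe: every write() lands on the local disk, and the object is only uploaded to S3 when close() is called. This is not the actual S3AOutputStream source (linked above); the class name, fields, and the single putObject() call are illustrative assumptions.

      {code:java}
      // Illustrative only: a stripped-down version of the "spool to local disk,
      // upload on close" pattern. Class and member names are invented; see the
      // S3AOutputStream link in step 5) for the real Hadoop code.
      import java.io.BufferedOutputStream;
      import java.io.File;
      import java.io.FileOutputStream;
      import java.io.IOException;
      import java.io.OutputStream;

      import com.amazonaws.services.s3.AmazonS3;
      import com.amazonaws.services.s3.model.PutObjectRequest;

      class BufferingS3OutputStream extends OutputStream {
        private final AmazonS3 s3;
        private final String bucket;
        private final String key;
        private final File backupFile;            // temp file on the local disk
        private final OutputStream backupStream;  // every byte written goes here first

        BufferingS3OutputStream(AmazonS3 s3, String bucket, String key) throws IOException {
          this.s3 = s3;
          this.bucket = bucket;
          this.key = key;
          this.backupFile = File.createTempFile("s3-output-", ".tmp");
          this.backupStream = new BufferedOutputStream(new FileOutputStream(backupFile));
        }

        @Override
        public void write(int b) throws IOException {
          backupStream.write(b);                  // local disk only; nothing reaches S3 yet
        }

        @Override
        public void write(byte[] b, int off, int len) throws IOException {
          backupStream.write(b, off, len);        // local disk usage grows with the output file
        }

        @Override
        public void close() throws IOException {
          backupStream.close();
          try {
            // Only now, at close time, is the whole file uploaded in one shot.
            s3.putObject(new PutObjectRequest(bucket, key, backupFile));
          } finally {
            backupFile.delete();
          }
        }
      }
      {code}

      The key point is that the temporary file has to hold the entire output file before a single byte reaches S3.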

      The problem is that the local disk might not have enough space to buffer all of these writes, which causes the INSERT to fail. (When writing a 50GB file, we don't want to require the node to have 50GB of free local disk space.)

      Problem with the HdfsTextTableWriter:

      • It buffers everything into one file, no matter how large.

      HdfsParquetTableWriter splits the write so that it creates multiple files with a default size of 256MB. This is not as bad as HdfsTextTableWriter, because each file is closed (and therefore uploaded) once it reaches 256MB (or whatever the Parquet file size is set to), so local buffering is bounded by roughly one file's worth of data per open writer.
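
      For comparison, here is a hypothetical sketch of the "roll to a new file at a size threshold" behavior that makes HdfsParquetTableWriter less problematic. It is not Impala code (the real table writers are C++); the class names and the FileFactory interface are invented for illustration.

      {code:java}
      // Hypothetical illustration of bounding local buffering by rolling files.
      // Closing each ~256MB file triggers its upload, so the local temp file
      // never grows much past the threshold. Not Impala's actual (C++) writer.
      import java.io.IOException;
      import java.io.OutputStream;

      class RollingWriter {
        // Mirrors the default Parquet file size mentioned above.
        private static final long MAX_FILE_BYTES = 256L * 1024 * 1024;

        interface FileFactory {
          OutputStream newFile() throws IOException;  // opens output file 0, 1, 2, ...
        }

        private final FileFactory factory;
        private OutputStream current;
        private long bytesInCurrentFile = 0;

        RollingWriter(FileFactory factory) throws IOException {
          this.factory = factory;
          this.current = factory.newFile();
        }

        void append(byte[] data) throws IOException {
          current.write(data);
          bytesInCurrentFile += data.length;
          if (bytesInCurrentFile >= MAX_FILE_BYTES) {
            current.close();                // close => this chunk gets uploaded to S3
            current = factory.newFile();    // start the next file
            bytesInCurrentFile = 0;
          }
        }

        void close() throws IOException {
          current.close();                  // uploads the final, possibly smaller, file
        }
      }
      {code}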

      Solutions:
      1) Patch libHDFS to modify S3AOutputStream so that we can stream writes to S3 instead of uploading everything at once during Close() (a rough sketch of this approach follows after this list).

      2) Pursue a longer-term, more permanent fix (such as not using libHDFS for S3 and using the AWS SDK directly).
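
      As a rough, hypothetical sketch of option 1): local buffering can be bounded by streaming parts to S3 with the multipart upload API as data is produced, rather than spooling the whole file and uploading it during Close(). The class below uses the AWS SDK for Java's multipart calls; the class name, part size, and the (omitted) abort/error handling are assumptions, not existing Hadoop or Impala code.

      {code:java}
      // Hypothetical streaming writer: uploads fixed-size parts as they fill,
      // so only PART_SIZE bytes are ever buffered locally (here, in memory).
      import java.io.ByteArrayInputStream;
      import java.util.ArrayList;
      import java.util.List;

      import com.amazonaws.services.s3.AmazonS3;
      import com.amazonaws.services.s3.model.CompleteMultipartUploadRequest;
      import com.amazonaws.services.s3.model.InitiateMultipartUploadRequest;
      import com.amazonaws.services.s3.model.PartETag;
      import com.amazonaws.services.s3.model.UploadPartRequest;
      import com.amazonaws.services.s3.model.UploadPartResult;

      class StreamingS3Writer {
        // S3 requires every part except the last to be at least 5MB.
        private static final int PART_SIZE = 8 * 1024 * 1024;

        private final AmazonS3 s3;
        private final String bucket;
        private final String key;
        private final String uploadId;
        private final List<PartETag> partETags = new ArrayList<>();
        private final byte[] buffer = new byte[PART_SIZE];
        private int buffered = 0;
        private int partNumber = 1;

        StreamingS3Writer(AmazonS3 s3, String bucket, String key) {
          this.s3 = s3;
          this.bucket = bucket;
          this.key = key;
          this.uploadId = s3.initiateMultipartUpload(
              new InitiateMultipartUploadRequest(bucket, key)).getUploadId();
        }

        void write(byte[] data, int off, int len) {
          while (len > 0) {
            int n = Math.min(len, PART_SIZE - buffered);
            System.arraycopy(data, off, buffer, buffered, n);
            buffered += n;
            off += n;
            len -= n;
            if (buffered == PART_SIZE) {
              flushPart();                  // upload as we go instead of at close()
            }
          }
        }

        void close() {
          if (buffered > 0) {
            flushPart();                    // the final part may be under 5MB
          }
          s3.completeMultipartUpload(
              new CompleteMultipartUploadRequest(bucket, key, uploadId, partETags));
        }

        private void flushPart() {
          UploadPartResult result = s3.uploadPart(new UploadPartRequest()
              .withBucketName(bucket)
              .withKey(key)
              .withUploadId(uploadId)
              .withPartNumber(partNumber++)
              .withInputStream(new ByteArrayInputStream(buffer, 0, buffered))
              .withPartSize(buffered));
          partETags.add(result.getPartETag());
          buffered = 0;
        }
      }
      {code}

      With this kind of approach, the local footprint is bounded by the part size rather than the total file size; the trade-offs are S3's 5MB minimum part size and the need to abort incomplete multipart uploads when a query fails.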

    People

      Assignee: Sahil Takiar (stakiar)
      Reporter: Sailesh Mukil (sailesh)