IMPALA-3452

S3: Disable Impala staging for INSERTs via flag for speedup


Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: Impala 2.6.0
    • Fix Version/s: Impala 2.6.0
    • Component/s: Backend

    Description

      INSERTs on S3 are slow because the data is buffered twice before the final write.

      Writes to S3 currently work as follows:
       - Impala calls hdfsWrite() on a file in the staging directory (_impala_insert_staging).
       - hdfsWrite() writes to local disk (i.e. the HDFS code does its own local staging).    <- First stage of buffering
       - When Impala calls hdfsClose(), the local file is closed and then sent to S3 all at once.    <- Second stage of buffering
       - The Impala coordinator calls hdfsRename() to move the file from _impala_insert_staging to the final location.
       - However, S3 does not support rename(), so the files are copied.    <- Third and final write
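
The steps above can be sketched as a toy model that only counts how often the same data is written. This is an illustration, not Impala code: the function name, the bucket and table path, the file name, and the exact layout of the staging directory are all assumptions.

```python
# Toy model of the staged S3 INSERT path; it moves no data, it only lists the writes.
STAGING_DIR = "_impala_insert_staging"


def staged_insert(filename, final_dir):
    """Return the (step, destination) pairs for one file written via staging.

    Hypothetical helper: the path layout is an assumption made for illustration.
    """
    staged_path = f"{final_dir}/{STAGING_DIR}/{filename}"
    final_path = f"{final_dir}/{filename}"
    return [
        ("hdfsWrite: buffer on local disk", "local:" + filename),
        ("hdfsClose: upload whole file to S3", staged_path),
        ("hdfsRename: no S3 rename, so copy", final_path),
    ]


writes = staged_insert("part-0.parq", "s3a://bucket/tbl")
for step, dest in writes:
    print(f"{step:40s} -> {dest}")
print(len(writes), "writes of the same data")
```

In this model every inserted file is written three times, and only the last write lands in the final location.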
      

      We can introduce a flag that removes the Impala staging phase, since local buffering already happens during INSERT queries, and instead have the table sinks write directly to the final location.
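
A minimal sketch of what such a flag could change, assuming a boolean named skip_staging. That name, both helper functions, and the path layout are hypothetical and are not taken from the Impala source.

```python
STAGING_DIR = "_impala_insert_staging"


def sink_output_path(final_dir, filename, skip_staging):
    """Pick the path a table sink writes to (hypothetical helper).

    skip_staging stands in for the proposed flag; the real name may differ.
    """
    if skip_staging:
        # Write straight to the final location; no later move is needed.
        return f"{final_dir}/{filename}"
    # Default behaviour: write into the staging directory first.
    return f"{final_dir}/{STAGING_DIR}/{filename}"


def coordinator_must_move(skip_staging):
    """The coordinator only has to move files that were staged."""
    return not skip_staging


print(sink_output_path("s3a://bucket/tbl", "part-0.parq", skip_staging=True))
print(sink_output_path("s3a://bucket/tbl", "part-0.parq", skip_staging=False))
```

With the flag set, the hdfsRename() step and the S3 copy it implies drop out of the write path in this model.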

      P.S.: We cannot currently do this for INSERT OVERWRITEs, because we delete the existing files in the final location before moving the staged files there. These deletes are done by the coordinator only after the table sinks have finished their writes, so if the sinks wrote directly to the final location, the old and the new files would be in the same path, and we have no straightforward way of distinguishing between them to know which files to delete.
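
The INSERT OVERWRITE problem can be illustrated with a small sketch. It assumes the coordinator decides what to delete purely from a listing of the final location. The function and file names are hypothetical.

```python
def coordinator_overwrite_deletes(final_dir_listing):
    """Files a coordinator would delete for INSERT OVERWRITE (hypothetical model).

    It sees only file names in the final location and cannot tell old from new,
    so it deletes everything in the listing.
    """
    return set(final_dir_listing)


old_files = {"old-0.parq", "old-1.parq"}
new_files = {"new-0.parq"}

# With staging: the new files are still in the staging directory at delete time.
with_staging = coordinator_overwrite_deletes(old_files)

# Without staging: the sinks have already written the new files to the final location.
without_staging = coordinator_overwrite_deletes(old_files | new_files)

print(sorted(with_staging))
print(sorted(without_staging))
```

In the second case the freshly written file is in the delete set too, which is the ambiguity described above.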

          People

            Assignee: Sailesh Mukil
            Reporter: Sailesh Mukil