IMPALA / IMPALA-4835 HDFS scans should operate with a constrained number of I/O buffers / IMPALA-5307

Consider always copying-out Disk I/O buffers instead of attaching to RowBatches


Details

    Description

      IMPALA-4835 would be greatly simplified if we didn't have to attach disk I/O buffers to RowBatches and handle the resulting complexity.

      Disk I/O buffers currently need to be attached to RowBatches if the row batches directly reference var-len data in the buffer. This can occur only when all of the following hold:

      • The column being read contains strings
      • The string data is not dictionary-encoded in Parquet (since we copy out the dictionary data in Parquet)
      • The string data is not compressed with a general-purpose compression algorithm (gzip, Snappy, etc.)

      This includes the following cases: plain-encoded strings in uncompressed Parquet; any strings in uncompressed text, RCFile, Avro, or sequence file.
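
The conditions above can be sketched as a small predicate. This is only an illustration of the decision logic described in this issue, not Impala code: `ColumnScan`, its fields, and `may_reference_io_buffer` are hypothetical names, and the format and codec strings are assumptions.

```python
from dataclasses import dataclass


# Hypothetical description of one column being scanned; not an Impala type.
@dataclass(frozen=True)
class ColumnScan:
    file_format: str                  # e.g. "parquet", "text", "rcfile", "avro", "sequence"
    is_string: bool                   # column holds var-len string data
    dictionary_encoded: bool = False  # only meaningful for Parquet
    compression: str = "none"         # "none", "gzip", "snappy", ...


def may_reference_io_buffer(scan: ColumnScan) -> bool:
    """Return True if row batches could point directly into the disk I/O buffer."""
    if not scan.is_string:
        return False  # no var-len data to reference
    if scan.file_format == "parquet" and scan.dictionary_encoded:
        return False  # dictionary data is copied out
    if scan.compression != "none":
        return False  # general-purpose compression rules out direct references
    return True


print(may_reference_io_buffer(ColumnScan("parquet", True)))                      # True
print(may_reference_io_buffer(ColumnScan("text", True, compression="snappy")))  # False
```

Under these assumptions, only the uncompressed, non-dictionary string cases return True, which matches the list of formats above.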

      In those cases, avoiding the copy could provide some performance benefit. However, it's unclear that any of those file formats are, or should be, used in performance-critical use cases, because the storage density of uncompressed strings is almost always terrible.

      We should evaluate the performance impact of the additional copies, but I suspect that it is not severe and does not impact any important use cases.
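
To make the trade-off concrete, here is a toy sketch contrasting the two approaches. It is not Impala code: `attach_strings`, `copy_out_strings`, and the `(start, length)` span layout are assumptions made for illustration. Attached values are views into the I/O buffer, so the buffer must stay alive and untouched for as long as the batch does. Copied values own their bytes, so the buffer can be recycled right away at the cost of one extra copy per value.

```python
def attach_strings(io_buffer: bytearray, spans):
    """Zero-copy: each value is a view into io_buffer, which must outlive the batch."""
    view = memoryview(io_buffer)
    return [view[start:start + length] for start, length in spans]


def copy_out_strings(io_buffer: bytearray, spans):
    """Copy-out: each value owns its bytes, so io_buffer can be reused immediately."""
    return [bytes(io_buffer[start:start + length]) for start, length in spans]


io_buffer = bytearray(b"alphabeta")
spans = [(0, 5), (5, 4)]  # (start, length) of each string in the buffer

attached = attach_strings(io_buffer, spans)
copied = copy_out_strings(io_buffer, spans)

# Simulate the I/O layer recycling the buffer for the next read.
for i in range(len(io_buffer)):
    io_buffer[i] = ord("X")

print([bytes(v) for v in attached])  # [b'XXXXX', b'XXXX']: views see the recycled contents
print(copied)                        # [b'alpha', b'beta']: copies are unaffected
```

Timing the two functions on representative data would be one way to start measuring the cost of the additional copies.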


          People

            tarmstrong Tim Armstrong