Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-9316

Consider coalescing S3 scans

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • Backend
    • None
    • ghx-label-8

    Description

      We should consider coalescing S3 reads. IIUC the current DiskIoMgr code for S3A does not do anything special for scheduling S3 scan ranges. It simply round-robin assigns scans to IO threads.

      I think there might be a smarter algorithm we could employ when scheduling S3 reads. A few things to consider:

      • With the migration to hdfsPreadFully, each S3 scan range should correspond to a single HTTP GET request (assuming the 8 MB limit is not hit, see below)
      • read_size limits the size of a read to 8 MB (I believe if a scan range exceeds this limit, the reads are just done on the same IO thread, but sequentially - they are broken up into multiple HTTP GET requests)
      • S3A has a readahead option that defaults to 64 KB, however, it only applies in certain situations
        • If fs.s3a.experimental.input.fadvise=random (which is the recommended value when reading Parquet / ORC data), the readahead applies if (1) it won't cause the read to go past the end of the file, and (2) the request read length is under 64 KB (it reads up to Math.max(requested-read-length, 64 KB)) (so the readahead most likely applies for small reads)

      Coalescing reads would allow Impala to combine multiple, small HTTP GET requests into fewer, larger HTTP GET requests. There may be some data that needs to be skipped over, but the cost of reading that extra data might outweigh the cost of issuing multiple HTTP requests. Since each HTTP request requires a round-trip to S3, issuing a lot of GET requests can be costly, especially if each only reads a small amount of data.

      Some implementation factors to consider:

      • There should probably be a limit on the maximum size of a read request (is 8 MB the right value for S3?)
      • Since S3A uses a default of 64 KB for their readahead, we can probably use a similar value
      • Should the number of disk IO threads be considered when coalescing reads? e.g. by default there are 16 IO threads, if there are 16 small scan ranges, does it make more sense to coalesce them into a single large scan range, or would we get better throughput by issuing all 16 in parallel

      Attachments

        Activity

          People

            drorke David Rorke
            stakiar Sahil Takiar
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

            Dates

              Created:
              Updated: