[IMPALA-6112] Improve the thread pool size detection logic while loading partitioned table block metadata - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Open
Priority: Major
Resolution: Unresolved
Affects Version/s: Impala 2.11.0
Fix Version/s: None
Component/s: Catalog
Labels:
- metadata
- performance

Epic Color:
ghx-label-4

Description

~~IMPALA-5429~~ added a thread pool based block metadata loading logic to increase the block metadata loading throughput.

  /**
   * Returns the thread pool size to load the file metadata of this table.
   * 'numPaths' is the number of paths for which the file metadata should be loaded.
   *
   * We use different thread pool sizes for HDFS and non-HDFS tables since the latter
   * supports much higher throughput of RPC calls for listStatus/listFiles. For
   * simplicity, the filesystem type is determined based on the table's root path and
   * not for each partition individually. Based on our experiments, S3 showed a linear
   * speed up (up to ~100x) with increasing number of loading threads where as the HDFS
   * throughput was limited to ~5x in un-secure clusters and up to ~3.7x in secure
   * clusters. We narrowed it down to scalability bottlenecks in HDFS RPC implementation
   * (HADOOP-14558) on both the server and the client side.
   */
  private int getLoadingThreadPoolSize(int numPaths) throws CatalogException {
    Preconditions.checkState(numPaths > 0);
    FileSystem tableFs;
    try {
      tableFs  = (new Path(getLocation())).getFileSystem(CONF);
    } catch (IOException e) {
      throw new CatalogException("Invalid table path for table: " + getFullName(), e);
    }
    int threadPoolSize = FileSystemUtil.supportsStorageIds(tableFs) ?
        MAX_HDFS_PARTITIONS_PARALLEL_LOAD : MAX_NON_HDFS_PARTITIONS_PARALLEL_LOAD;
    // Thread pool size need not exceed the number of paths to be loaded.
    return Math.min(numPaths, threadPoolSize);
  }

As the method comment says, we choose the thread pool size based on table base directory FS type. Given we support multiple filesystem types in the same partitioned table, this may not always be optimal. For example if the base table file system is on HDFS, but 90% of the partitions are on S3, the above method returns a smaller thread pool size which is sub-optimal. This jira is to fix that behavior.

Attachments

Issue Links

is related to

IMPALA-5429 Use a thread pool to load block metadata in parallel

Resolved

Activity

People

Assignee:: Unassigned

Reporter:: Bharath Vissapragada

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 26/Oct/17 22:34

Updated:: 07/Feb/23 11:09