Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-6112

Improve the thread pool size detection logic while loading partitioned table block metadata

Attach filesAttach ScreenshotAdd voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • Impala 2.11.0
    • None
    • Catalog
    • ghx-label-4

    Description

      IMPALA-5429 added a thread pool based block metadata loading logic to increase the block metadata loading throughput.

        /**
         * Returns the thread pool size to load the file metadata of this table.
         * 'numPaths' is the number of paths for which the file metadata should be loaded.
         *
         * We use different thread pool sizes for HDFS and non-HDFS tables since the latter
         * supports much higher throughput of RPC calls for listStatus/listFiles. For
         * simplicity, the filesystem type is determined based on the table's root path and
         * not for each partition individually. Based on our experiments, S3 showed a linear
         * speed up (up to ~100x) with increasing number of loading threads where as the HDFS
         * throughput was limited to ~5x in un-secure clusters and up to ~3.7x in secure
         * clusters. We narrowed it down to scalability bottlenecks in HDFS RPC implementation
         * (HADOOP-14558) on both the server and the client side.
         */
        private int getLoadingThreadPoolSize(int numPaths) throws CatalogException {
          Preconditions.checkState(numPaths > 0);
          FileSystem tableFs;
          try {
            tableFs  = (new Path(getLocation())).getFileSystem(CONF);
          } catch (IOException e) {
            throw new CatalogException("Invalid table path for table: " + getFullName(), e);
          }
          int threadPoolSize = FileSystemUtil.supportsStorageIds(tableFs) ?
              MAX_HDFS_PARTITIONS_PARALLEL_LOAD : MAX_NON_HDFS_PARTITIONS_PARALLEL_LOAD;
          // Thread pool size need not exceed the number of paths to be loaded.
          return Math.min(numPaths, threadPoolSize);
        }
      

      As the method comment says, we choose the thread pool size based on table base directory FS type. Given we support multiple filesystem types in the same partitioned table, this may not always be optimal. For example if the base table file system is on HDFS, but 90% of the partitions are on S3, the above method returns a smaller thread pool size which is sub-optimal. This jira is to fix that behavior.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            bharathv Bharath Vissapragada

            Dates

              Created:
              Updated:

              Slack

                Issue deployment