Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-4789

Slow metadata loading with many partitions that have inconsistent HDFS path qualification

    Details

      Description

      The fix for IMPALA-4172/IMPALA-3653 introduced a performance regression for loading tables that have many partitions with:
      1. inconsistent HDFS path qualification or
      2. a custom location (not under the table root dir)

      For the first issue consider a table whose root path is at 'hdfs://localhost:8020/warehouse/tbl/'.
      A partition with an unqualified location '/warehouse/tbl/p=1' will not be recognized as being a descendant of the table root dir by FileSystemUtil.isDescendentPath() because of how Path.equals() behaves, even if 'hdfs://localhost:8020' is the default filesystem.

      Such partitions are incorrectly recognized as having a custom location and are treated specially. The treatment of such partitions is very inefficient, as show in the following code snippets:

      HdfsTable.loadAllPartitions():

      ...
              if (!dirsToLoad.contains(partDir) &&
                  !FileSystemUtil.isDescendantPath(partDir, tblLocation)) { <--- this condition will fail
                // This partition has a custom filesystem location. Load its file/block
                // metadata separately by adding it to the list of dirs to load.
                dirsToLoad.add(partDir);
              }
      ...
      

      HdfsTable.loadMetadataAndDiskIds() calls HdfsTable.loadBlockMetadata() once for every location:

        private void loadMetadataAndDiskIds(List<Path> locations,
            HashMap<Path, List<HdfsPartition>> partsByPath) {
          LOG.info(String.format("Loading file and block metadata for %s partitions: %s",
              partsByPath.size(), getFullName()));
          for (Path location: locations) { loadBlockMetadata(location, partsByPath); }
          LOG.info(String.format("Loaded file and block metadata for %s partitions: %s",
              partsByPath.size(), getFullName()));
        }
      

      HdfsTable.loadBlockMetadata():

      ...
            // Clear the state of partitions under dirPath since they are now updated based
            // on the current snapshot of files in the directory.
            for (Map.Entry<Path, List<HdfsPartition>> entry: partsByPath.entrySet()) { <--- partsByPath has an entry for every partition in the table
              Path partDir = entry.getKey();
              if (!FileSystemUtil.isDescendantPath(partDir, dirPath)) continue;
              for (HdfsPartition partition: entry.getValue()) {
                partition.setFileDescriptors(new ArrayList<FileDescriptor>());
              }
            }
      ...
      

      As a result, it means that we will call isDescendentPath() roughly #numLocations * #totalPartitions times which can add up fast for tables with many partitions.

      There are two issues to fix here:
      1. The bug in recognizing partitions under the root table dir (for inconsistent qualification of table/partition locations)
      2. The expensive loop for partitions with custom locations (even if legitimately custom)

        Issue Links

          Activity

          Hide
          jbapple Jim Apple added a comment -
          Show
          jbapple Jim Apple added a comment - Patch available: https://gerrit.cloudera.org/#/c/5743/
          Hide
          alex.behm Alexander Behm added a comment -

          commit 7b8ffd35534c11ae3caa048229effc97613cd34f
          Author: Alex Behm <alex.behm@cloudera.com>
          Date: Thu Jan 19 00:22:47 2017 -0800

          IMPALA-4789: Fix slow metadata loading due to inconsistent paths.

          The fix for IMPALA-4172/IMPALA-3653 introduced a performance
          regression for loading tables that have many partitions with:
          1. Inconsistent HDFS path qualification or
          2. A custom location (not under the table root dir)

          For the first issue consider a table whose root path is at
          'hdfs://localhost:8020/warehouse/tbl/'.
          A partition with an unqualified location '/warehouse/tbl/p=1'
          will not be recognized as being a descendant of the table root
          dir by FileSystemUtil.isDescendentPath() because of how
          Path.equals() behaves, even if 'hdfs://localhost:8020' is the
          default filesystem. Such partitions are incorrectly recognized
          as having a custom location and are loaded separately.

          There were two performance issues:
          1. The code for loading the files/blocks of partitions with
          truly custom locations was inefficient with an O(N^2)
          loop for determining the target partitions.
          2. Each partition that is incorrectly identified as having
          a custom path (e.g. due to inconsistent qualification),
          is going to have its files/blocks loaded twice. Once
          when the table root path is processed, and once when the
          'custom' partition is processed.

          This patch fixes the detection of partitions with custom
          locations, and improves the speed of loading partitions
          with custom locations.

          Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55
          Reviewed-on: http://gerrit.cloudera.org:8080/5743
          Reviewed-by: Alex Behm <alex.behm@cloudera.com>
          Tested-by: Impala Public Jenkins

          Show
          alex.behm Alexander Behm added a comment - commit 7b8ffd35534c11ae3caa048229effc97613cd34f Author: Alex Behm <alex.behm@cloudera.com> Date: Thu Jan 19 00:22:47 2017 -0800 IMPALA-4789 : Fix slow metadata loading due to inconsistent paths. The fix for IMPALA-4172 / IMPALA-3653 introduced a performance regression for loading tables that have many partitions with: 1. Inconsistent HDFS path qualification or 2. A custom location (not under the table root dir) For the first issue consider a table whose root path is at 'hdfs://localhost:8020/warehouse/tbl/'. A partition with an unqualified location '/warehouse/tbl/p=1' will not be recognized as being a descendant of the table root dir by FileSystemUtil.isDescendentPath() because of how Path.equals() behaves, even if 'hdfs://localhost:8020' is the default filesystem. Such partitions are incorrectly recognized as having a custom location and are loaded separately. There were two performance issues: 1. The code for loading the files/blocks of partitions with truly custom locations was inefficient with an O(N^2) loop for determining the target partitions. 2. Each partition that is incorrectly identified as having a custom path (e.g. due to inconsistent qualification), is going to have its files/blocks loaded twice. Once when the table root path is processed, and once when the 'custom' partition is processed. This patch fixes the detection of partitions with custom locations, and improves the speed of loading partitions with custom locations. Change-Id: I8c881b7cb155032b82fba0e29350ca31de388d55 Reviewed-on: http://gerrit.cloudera.org:8080/5743 Reviewed-by: Alex Behm <alex.behm@cloudera.com> Tested-by: Impala Public Jenkins
          Hide
          mmokhtar Mostafa Mokhtar added a comment -

          IMPALA-5042 makes loading of custom and regular partitions on par.

          Show
          mmokhtar Mostafa Mokhtar added a comment - IMPALA-5042 makes loading of custom and regular partitions on par.

            People

            • Assignee:
              alex.behm Alexander Behm
              Reporter:
              alex.behm Alexander Behm
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development