  1. Hive
  2. HIVE-21040

msck does unnecessary file listing at last level of directory tree

    Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4.0, 4.0.0, 3.2.0
    • Component/s: None
    • Labels:
      None

      Description

      Here is the code snippet that msck runs to list directories:

            final Path currentPath = pd.p;
            final int currentDepth = pd.depth;
            FileStatus[] fileStatuses = fs.listStatus(currentPath, FileUtils.HIDDEN_FILES_PATH_FILTER);
            // found no files under a sub-directory under table base path; it is possible that the table
            // is empty and hence there are no partition sub-directories created under base path
            if (fileStatuses.length == 0 && currentDepth > 0 && currentDepth < partColNames.size()) {
              // since maxDepth is not yet reached, we are missing partition
              // columns in currentPath
              logOrThrowExceptionWithMsg(
                  "MSCK is missing partition columns under " + currentPath.toString());
            } else {
              // found files under currentPath add them to the queue if it is a directory
              for (FileStatus fileStatus : fileStatuses) {
                if (!fileStatus.isDirectory() && currentDepth < partColNames.size()) {
                  // found a file at depth which is less than number of partition keys
                  logOrThrowExceptionWithMsg(
                      "MSCK finds a file rather than a directory when it searches for "
                          + fileStatus.getPath().toString());
                } else if (fileStatus.isDirectory() && currentDepth < partColNames.size()) {
                  // found a sub-directory at a depth less than number of partition keys
                  // validate if the partition directory name matches with the corresponding
                  // partition colName at currentDepth
                  Path nextPath = fileStatus.getPath();
                  String[] parts = nextPath.getName().split("=");
                  if (parts.length != 2) {
                    logOrThrowExceptionWithMsg("Invalid partition name " + nextPath);
                  } else if (!parts[0].equalsIgnoreCase(partColNames.get(currentDepth))) {
                    logOrThrowExceptionWithMsg(
                        "Unexpected partition key " + parts[0] + " found at " + nextPath);
                  } else {
                    // add sub-directory to the work queue if maxDepth is not yet reached
                    pendingPaths.add(new PathDepthInfo(nextPath, currentDepth + 1));
                  }
                }
              }
              if (currentDepth == partColNames.size()) {
                return currentPath;
              }
            }
      

      Note that when currentDepth has reached the maximum depth (partColNames.size()), the code still lists the files under currentPath, even though the result of that listing is never used and the method simply returns currentPath. We can avoid this unnecessary call by checking currentDepth first and bailing out early.

      This can improve the performance of the msck command significantly, especially when there are a lot of files in each partition on remote filesystems like S3 or ADLS.
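
The early bail-out described above could look roughly like the following. This is a minimal, self-contained sketch and not the attached patch: `FakeFs`, `PathDepth`, and `findPartitions` are hypothetical stand-ins for the Hadoop `FileSystem`, `PathDepthInfo`, and the msck directory walk, and the error reporting done by `logOrThrowExceptionWithMsg` is left out. The in-memory file system counts listing calls so the saving is visible.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical, not Hive classes.
final class MsckEarlyExitSketch {

    /** In-memory stand-in for a file system: directory path -> child names. */
    static final class FakeFs {
        private final Map<String, List<String>> children;
        int listCalls = 0; // number of directory listings issued

        FakeFs(Map<String, List<String>> children) {
            this.children = children;
        }

        List<String> list(String dir) {
            listCalls++;
            return children.getOrDefault(dir, List.of());
        }

        boolean isDirectory(String path) {
            return children.containsKey(path);
        }
    }

    /** Hypothetical counterpart of PathDepthInfo from the snippet above. */
    static final class PathDepth {
        final String path;
        final int depth;

        PathDepth(String path, int depth) {
            this.path = path;
            this.depth = depth;
        }
    }

    /** Walks the tree breadth-first and returns the partition directories. */
    static List<String> findPartitions(FakeFs fs, String basePath, List<String> partColNames) {
        List<String> partitions = new ArrayList<>();
        Deque<PathDepth> pendingPaths = new ArrayDeque<>();
        pendingPaths.add(new PathDepth(basePath, 0));

        while (!pendingPaths.isEmpty()) {
            PathDepth pd = pendingPaths.poll();

            // Early exit: at max depth this is a partition directory, so
            // there is no need to list the data files underneath it.
            if (pd.depth == partColNames.size()) {
                partitions.add(pd.path);
                continue;
            }

            // Only directories above the last level are listed.
            for (String child : fs.list(pd.path)) {
                String childPath = pd.path + "/" + child;
                String[] parts = child.split("=");
                boolean keyMatches = parts.length == 2
                        && parts[0].equalsIgnoreCase(partColNames.get(pd.depth));
                if (fs.isDirectory(childPath) && keyMatches) {
                    pendingPaths.add(new PathDepth(childPath, pd.depth + 1));
                }
                // The real code reports files and malformed names here; omitted.
            }
        }
        return partitions;
    }
}
```

For a table partitioned by two columns with one first-level directory and two leaf partitions, this sketch issues two listings (the base path and the first-level directory) where listing every visited directory would issue four. The two listings that are skipped are the ones that would enumerate the data files.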

        Attachments

        1. HIVE-21040.01.patch
          3 kB
          Vihang Karajgaonkar
        2. HIVE-21040.02.patch
          11 kB
          Vihang Karajgaonkar
        3. HIVE-21040.03.patch
          11 kB
          Vihang Karajgaonkar
        4. HIVE-21040.04.patch
          11 kB
          Vihang Karajgaonkar
        5. HIVE-21040.05.branch-3.patch
          10 kB
          Vihang Karajgaonkar
        6. HIVE-21040.06.branch-2.patch
          10 kB
          Vihang Karajgaonkar

            People

            • Assignee:
              vihangk1 Vihang Karajgaonkar
              Reporter:
              vihangk1 Vihang Karajgaonkar
            • Votes:
              1
              Watchers:
              3
