IMPALA-5431

Calling FileSystem.exists() twice in a row for the same partition adds unnecessary latency to metadata loading

Details

    Description

      FileSystem.exists() is called in loadPartitionFileMetadata() and then again in refreshFileMetadata() for the same partition, which seems redundant.

      Each call can mean a separate round trip to the file system, so with a large number of partitions the extra check can become a bottleneck.

      private void loadPartitionFileMetadata(StorageDescriptor storageDescriptor,
          HdfsPartition partition) throws Exception {
        Preconditions.checkNotNull(storageDescriptor);
        Preconditions.checkNotNull(partition);
        Path partDirPath = new Path(storageDescriptor.getLocation());
        FileSystem fs = partDirPath.getFileSystem(CONF);
        // First exists() call for this partition.
        if (!fs.exists(partDirPath)) return;
        refreshFileMetadata(partition);
      }

      private void refreshFileMetadata(HdfsPartition partition) throws CatalogException {
        Path partDir = partition.getLocationPath();
        Preconditions.checkNotNull(partDir);
        try {
          FileSystem fs = partDir.getFileSystem(CONF);
          // Second exists() call for the same partition.
          if (!fs.exists(partDir)) {
            partition.setFileDescriptors(new ArrayList<FileDescriptor>());
            return;
          }
          if (!FileSystemUtil.supportsStorageIds(fs)) {
            synthesizeBlockMetadata(fs, partition);
            return;
          }
          // Index the partition file descriptors by their file names for O(1) look ups.
          ImmutableMap<String, FileDescriptor> fileDescsByName = Maps.uniqueIndex(
              partition.getFileDescriptors(), new Function<FileDescriptor, String>() {
                public String apply(FileDescriptor desc) {
                  return desc.getFileName();
                }
              });
      

      Java profiles (.jfr) taken before and after removing the redundant call are attached. The number of socket reads goes down from 1,639 to 1,046 (about 36% fewer). For a table with 80 partitions and 250K files, this gave a 15-20% speedup.
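
      The effect of dropping one of the two checks can be illustrated with a minimal, self-contained sketch. It does not use the Hadoop or Impala classes: CountingFs, Partition, loadBaseline, loadSingleCheck and countExistsCalls are hypothetical stand-ins introduced only for illustration, and removing the check in the caller is just one possible way to make the change, not necessarily what the actual patch does. The sketch only counts exists() calls; it says nothing about socket reads or wall-clock time.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RedundantExistsSketch {
  /** Hypothetical stand-in for a remote file system; counts exists() calls. */
  static class CountingFs {
    private final Set<String> dirs = new HashSet<>();
    int existsCalls = 0;

    CountingFs(List<String> existingDirs) { dirs.addAll(existingDirs); }

    boolean exists(String path) {
      existsCalls++;  // each call models one round trip to the file system
      return dirs.contains(path);
    }
  }

  /** Hypothetical stand-in for HdfsPartition. */
  static class Partition {
    final String location;
    List<String> fileDescriptors = null;
    boolean refreshed = false;

    Partition(String location) { this.location = location; }
  }

  // Baseline shape from the report: check here, then delegate to a method
  // that performs the same check again.
  static void loadBaseline(CountingFs fs, Partition p) {
    if (!fs.exists(p.location)) return;
    refresh(fs, p);
  }

  // Assumed variant: rely on the single check inside refresh().
  static void loadSingleCheck(CountingFs fs, Partition p) {
    refresh(fs, p);
  }

  static void refresh(CountingFs fs, Partition p) {
    if (!fs.exists(p.location)) {
      p.fileDescriptors = new ArrayList<>();
      return;
    }
    p.refreshed = true;  // the real code would load file and block metadata here
  }

  // Loads numPartitions existing partitions and returns the exists() call count.
  static int countExistsCalls(int numPartitions, boolean singleCheck) {
    List<String> dirs = new ArrayList<>();
    for (int i = 0; i < numPartitions; i++) dirs.add("/tbl/p=" + i);
    CountingFs fs = new CountingFs(dirs);
    for (String dir : dirs) {
      Partition p = new Partition(dir);
      if (singleCheck) loadSingleCheck(fs, p); else loadBaseline(fs, p);
    }
    return fs.existsCalls;
  }

  public static void main(String[] args) {
    System.out.println("baseline:     " + countExistsCalls(80, false));  // 160
    System.out.println("single check: " + countExistsCalls(80, true));   // 80
  }
}
```

      One behavioral difference to keep in mind with this variant: for a missing directory the baseline returns without touching the partition, while the single-check path resets its file descriptors to an empty list, so callers that depend on the old behavior would need to be reviewed.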

      Attachments

        1. Baseline.jfr
          181 kB
          Mostafa Mokhtar
        2. After removing redundant fs.exists().jfr
          214 kB
          Mostafa Mokhtar

            People

              bharathv Bharath Vissapragada
              mmokhtar Mostafa Mokhtar