Details
- Bug
- Status: Resolved
- Major
- Resolution: Fixed
- Impala 2.9.0
Description
FileSystem.exists() is called in loadPartitionFileMetadata() and then again in refreshFileMetadata(), which seems redundant.
When a table has a large number of partitions, this extra call can become a bottleneck.
private void loadPartitionFileMetadata(StorageDescriptor storageDescriptor,
    HdfsPartition partition) throws Exception {
  Preconditions.checkNotNull(storageDescriptor);
  Preconditions.checkNotNull(partition);
  Path partDirPath = new Path(storageDescriptor.getLocation());
  FileSystem fs = partDirPath.getFileSystem(CONF);
  if (!fs.exists(partDirPath)) return;
  refreshFileMetadata(partition);
}

private void refreshFileMetadata(HdfsPartition partition) throws CatalogException {
  Path partDir = partition.getLocationPath();
  Preconditions.checkNotNull(partDir);
  try {
    FileSystem fs = partDir.getFileSystem(CONF);
    if (!fs.exists(partDir)) {
      partition.setFileDescriptors(new ArrayList<FileDescriptor>());
      return;
    }
    if (!FileSystemUtil.supportsStorageIds(fs)) {
      synthesizeBlockMetadata(fs, partition);
      return;
    }
    // Index the partition file descriptors by their file names for O(1) look ups.
    ImmutableMap<String, FileDescriptor> fileDescsByName = Maps.uniqueIndex(
        partition.getFileDescriptors(), new Function<FileDescriptor, String>() {
          public String apply(FileDescriptor desc) {
            return desc.getFileName();
          }
        });
Before-and-after Java profiles are attached. With the redundant call removed, the number of socket reads drops from 1,639 to 1,046. For a table with 80 partitions and 250K files, this gave a 15-20% speedup.
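To make the cost of the duplicated check concrete, here is a minimal, self-contained sketch. It does not use the Hadoop API: CountingFs is a hypothetical stand-in for FileSystem that only counts exists() calls, the method names loadWithCallerCheck and loadWithoutCallerCheck are made up for illustration, and the partition paths are invented. Dropping the caller-side check is one possible way to remove the redundancy and is not necessarily how the committed fix does it.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class ExistsCallSketch {
  /** Hypothetical stand-in for a remote FileSystem; counts exists() round trips. */
  static class CountingFs {
    private final Set<String> dirs;
    int existsCalls = 0;

    CountingFs(List<String> existingDirs) {
      this.dirs = new HashSet<>(existingDirs);
    }

    boolean exists(String path) {
      existsCalls++;
      return dirs.contains(path);
    }
  }

  /** Mirrors refreshFileMetadata(): performs its own existence check. */
  static void refresh(CountingFs fs, String partDir) {
    if (!fs.exists(partDir)) return;  // the quoted code sets an empty descriptor list here
    // File metadata loading would happen here.
  }

  /** Mirrors the quoted loadPartitionFileMetadata(): checks, then delegates. */
  static void loadWithCallerCheck(CountingFs fs, String partDir) {
    if (!fs.exists(partDir)) return;
    refresh(fs, partDir);
  }

  /** Assumed alternative: rely only on the check inside refresh(). */
  static void loadWithoutCallerCheck(CountingFs fs, String partDir) {
    refresh(fs, partDir);
  }

  /** Loads every partition once and returns the number of exists() calls made. */
  static int countExistsCalls(boolean callerChecks, List<String> partDirs) {
    CountingFs fs = new CountingFs(partDirs);  // assume every directory exists
    for (String dir : partDirs) {
      if (callerChecks) {
        loadWithCallerCheck(fs, dir);
      } else {
        loadWithoutCallerCheck(fs, dir);
      }
    }
    return fs.existsCalls;
  }

  public static void main(String[] args) {
    List<String> partDirs = new ArrayList<>();
    for (int i = 0; i < 80; i++) partDirs.add("/warehouse/t/p=" + i);  // made-up paths
    System.out.println("with caller check:    " + countExistsCalls(true, partDirs));   // 160
    System.out.println("without caller check: " + countExistsCalls(false, partDirs));  // 80
  }
}
```

For 80 existing partitions the sketch makes 160 existence checks with the caller-side check and 80 without it. One behavioral difference to keep in mind: in the quoted code, a missing directory makes loadPartitionFileMetadata() return without touching the partition, whereas refreshFileMetadata() resets the partition's file descriptors to an empty list.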
Attachments
Issue Links
- is related to: IMPALA-5429 Use a thread pool to load block metadata in parallel (Resolved)