Details
Type: Task
Status: Resolved
Priority: Blocker
Resolution: Fixed
None
Description
Change #1
public static List<String> getAllPartitionPaths(FileSystem fs, String basePathStr, boolean useFileListingFromMetadata,
                                                boolean verifyListings, boolean assumeDatePartitioning) throws IOException {
  if (assumeDatePartitioning) {
    return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
  } else {
    HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/",
        useFileListingFromMetadata, verifyListings, false, false);
    return tableMetadata.getAllPartitionPaths();
  }
}
The above is the current implementation, where `HoodieTableMetadata.create()` always creates a `HoodieBackedTableMetadata`. Instead, we should create a `FileSystemBackedTableMetadata` directly whenever `useFileListingFromMetadata` is false. This helps address https://github.com/apache/hudi/pull/2398/files#r550709687
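A minimal sketch of the selection logic proposed in Change #1. All types here (`TableMetadataSketch`, `FileSystemBackedSketch`, `MetadataTableBackedSketch`, `TableMetadataFactorySketch`) are hypothetical stand-ins, not the real Hudi classes or their constructors; the partition lists are passed in directly so the example runs without Hadoop.

```java
import java.util.List;

// Hypothetical stand-in for the HoodieTableMetadata interface.
interface TableMetadataSketch {
    List<String> getAllPartitionPaths();
}

// Stand-in for FileSystemBackedTableMetadata: answers from direct file system listings.
class FileSystemBackedSketch implements TableMetadataSketch {
    private final List<String> partitionsOnDisk;

    FileSystemBackedSketch(List<String> partitionsOnDisk) {
        this.partitionsOnDisk = partitionsOnDisk;
    }

    @Override
    public List<String> getAllPartitionPaths() {
        return partitionsOnDisk;
    }
}

// Stand-in for HoodieBackedTableMetadata: answers from the metadata table.
class MetadataTableBackedSketch implements TableMetadataSketch {
    private final List<String> partitionsInMetadataTable;

    MetadataTableBackedSketch(List<String> partitionsInMetadataTable) {
        this.partitionsInMetadataTable = partitionsInMetadataTable;
    }

    @Override
    public List<String> getAllPartitionPaths() {
        return partitionsInMetadataTable;
    }
}

final class TableMetadataFactorySketch {
    private TableMetadataFactorySketch() {
    }

    // Pick the implementation up front instead of always building the
    // metadata-table-backed one and letting it fall back internally.
    static TableMetadataSketch create(boolean useFileListingFromMetadata,
                                      List<String> partitionsOnDisk,
                                      List<String> partitionsInMetadataTable) {
        if (!useFileListingFromMetadata) {
            return new FileSystemBackedSketch(partitionsOnDisk);
        }
        return new MetadataTableBackedSketch(partitionsInMetadataTable);
    }
}
```

With this shape, callers that pass `false` never touch the metadata-table code path at all.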
Change #2
On master, we have the `HoodieEngineContext` abstraction, which allows for parallel execution. We should consider moving it to `hudi-common` (it's doable) and then reworking `FileSystemBackedTableMetadata` so that it can do parallelized listings using the passed-in engine context, either `HoodieSparkEngineContext` or `HoodieJavaEngineContext`. `HoodieBackedTableMetadata#getPartitionsToFilesMapping` has some parallelized code; we should take a pass over it and see if it can be reworked a bit as well. Food for thought: https://github.com/apache/hudi/pull/2398#discussion_r550711216
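A rough sketch of what an engine-driven parallel listing for Change #2 could look like. `EngineContextSketch` and its `map` method are hypothetical and only loosely modeled on the idea of an engine context; `LocalEngineContextSketch` uses Java parallel streams as a stand-in for a Spark or Java engine, and `java.nio.file` replaces the Hadoop `FileSystem` so the example runs locally.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical engine abstraction: apply func to every element, possibly in parallel.
interface EngineContextSketch {
    <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism);
}

// Local stand-in engine. The parallelism hint is ignored here because
// parallel streams run on the common fork-join pool.
class LocalEngineContextSketch implements EngineContextSketch {
    @Override
    public <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
        return data.parallelStream().map(func).collect(Collectors.toList());
    }
}

final class ParallelListingSketch {
    private ParallelListingSketch() {
    }

    // Walks the tree level by level; all directories of one level are
    // listed through the engine in a single map call.
    static List<Path> listAllFiles(EngineContextSketch ctx, Path root) {
        List<Path> files = new ArrayList<>();
        List<Path> pending = new ArrayList<>();
        pending.add(root);
        while (!pending.isEmpty()) {
            List<List<Path>> listed =
                ctx.map(pending, ParallelListingSketch::listChildren, pending.size());
            pending = new ArrayList<>();
            for (List<Path> children : listed) {
                for (Path child : children) {
                    if (Files.isDirectory(child)) {
                        pending.add(child); // list on the next round
                    } else {
                        files.add(child);
                    }
                }
            }
        }
        return files;
    }

    // Lists the direct children of one directory.
    private static List<Path> listChildren(Path dir) {
        try (Stream<Path> stream = Files.list(dir)) {
            return stream.collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The listing code depends only on the `map` contract, so swapping the engine would not change it. That is the property the move to `hudi-common` is meant to enable.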
Change #3
There are places where we call `fs.listStatus()` directly. We should make them go through the `HoodieTable.getMetadata()...` route as well. Essentially, all listing should be concentrated in `FileSystemBackedTableMetadata`.
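To illustrate Change #3, here is a small sketch of a call site that depends on one listing interface instead of calling the file system itself. `ListingSourceSketch`, `InMemoryListingSketch`, and `CallSiteSketch` are hypothetical names, and the in-memory map stands in for real storage; the real Hudi call sites and method signatures will differ.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical single entry point for all listings.
interface ListingSourceSketch {
    List<String> getAllFilesInPartition(String partitionPath);
}

// In-memory stand-in for a file-system-backed implementation.
// It counts calls so that listing traffic is observable in one place.
class InMemoryListingSketch implements ListingSourceSketch {
    private final Map<String, List<String>> filesByPartition = new HashMap<>();
    int listCalls = 0;

    void put(String partitionPath, List<String> fileNames) {
        filesByPartition.put(partitionPath, fileNames);
    }

    @Override
    public List<String> getAllFilesInPartition(String partitionPath) {
        listCalls++;
        return filesByPartition.getOrDefault(partitionPath, Collections.emptyList());
    }
}

final class CallSiteSketch {
    private CallSiteSketch() {
    }

    // The call site only knows the interface, so the backend can be
    // swapped or instrumented without touching this code.
    static long countParquetFiles(ListingSourceSketch listing, String partitionPath) {
        return listing.getAllFilesInPartition(partitionPath).stream()
            .filter(name -> name.endsWith(".parquet"))
            .count();
    }
}
```

Once every caller looks like `countParquetFiles`, optimizations such as parallel listing can be made in the one implementation rather than at each call site.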