Apache Hudi / HUDI-1292 [Umbrella] RFC-15 : File Listing and Query Planning Optimizations / HUDI-1479

Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()

    Details

      Description

      Change #1

      public static List<String> getAllPartitionPaths(FileSystem fs, String basePathStr, boolean useFileListingFromMetadata,
                                                      boolean verifyListings, boolean assumeDatePartitioning) throws IOException {
        if (assumeDatePartitioning) {
          return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
        } else {
          HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", useFileListingFromMetadata,
              verifyListings, false, false);
          return tableMetadata.getAllPartitionPaths();
        }
      }
      

      This is the current implementation, where `HoodieTableMetadata.create()` always creates a `HoodieBackedTableMetadata`. Instead, we should create a `FileSystemBackedTableMetadata` directly when `useFileListingFromMetadata == false`. This helps address https://github.com/apache/hudi/pull/2398/files#r550709687
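
      A minimal sketch of the selection logic proposed in Change #1. Every type here (`TableMetadata`, `FileSystemBacked`, `MetadataTableBacked`) and the `create` signature are hypothetical stand-ins for the Hudi classes named in this issue, not the real API; the lists passed in simulate what each backend would return.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for the Hudi types discussed in this issue; not the real classes.
final class MetadataFactorySketch {

    interface TableMetadata {
        List<String> getAllPartitionPaths();
    }

    // Stand-in for FileSystemBackedTableMetadata: answers from a direct storage listing.
    static final class FileSystemBacked implements TableMetadata {
        private final List<String> partitionsOnStorage;

        FileSystemBacked(List<String> partitionsOnStorage) {
            this.partitionsOnStorage = partitionsOnStorage;
        }

        @Override
        public List<String> getAllPartitionPaths() {
            return new ArrayList<>(partitionsOnStorage);
        }
    }

    // Stand-in for HoodieBackedTableMetadata: answers from the metadata table.
    static final class MetadataTableBacked implements TableMetadata {
        private final List<String> partitionsInMetadataTable;

        MetadataTableBacked(List<String> partitionsInMetadataTable) {
            this.partitionsInMetadataTable = partitionsInMetadataTable;
        }

        @Override
        public List<String> getAllPartitionPaths() {
            return new ArrayList<>(partitionsInMetadataTable);
        }
    }

    // Only build the metadata-table-backed reader when the flag asks for it.
    static TableMetadata create(List<String> partitionsOnStorage,
                                List<String> partitionsInMetadataTable,
                                boolean useFileListingFromMetadata) {
        return useFileListingFromMetadata
            ? new MetadataTableBacked(partitionsInMetadataTable)
            : new FileSystemBacked(partitionsOnStorage);
    }
}
```

      With this shape, callers that pass `useFileListingFromMetadata == false` never pay for constructing the metadata-table-backed implementation.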

      Change #2

      On master, we have the `HoodieEngineContext` abstraction, which allows for parallel execution. We should consider moving it to `hudi-common` (it's doable) and then reworking `FileSystemBackedTableMetadata` so that it can do parallelized listings using the passed-in engine context, either `HoodieSparkEngineContext` or `HoodieJavaEngineContext`. `HoodieBackedTableMetadata#getPartitionsToFilesMapping` already has some parallelized code; we should take a pass and see whether that can be reworked as well. Food for thought: https://github.com/apache/hudi/pull/2398#discussion_r550711216
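
      A rough sketch of what engine-driven parallel listing could look like for Change #2. The `EngineContext` interface, its `map` signature, `LocalEngineContext`, and `listFilesPerPartition` are all assumptions made for illustration; they are not the actual `HoodieEngineContext` API. The per-partition listing function is passed in so the sketch stays independent of any file system.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical sketch; names and signatures are illustrative, not the Hudi API.
final class ParallelListingSketch {

    // Stand-in for an engine abstraction that can apply a function to many inputs in parallel.
    interface EngineContext {
        <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism);
    }

    // Local stand-in engine: uses a parallel stream and ignores the parallelism hint.
    static final class LocalEngineContext implements EngineContext {
        @Override
        public <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
            // Collecting an ordered parallel stream keeps results in input order.
            return data.parallelStream().map(func).collect(Collectors.toList());
        }
    }

    // Lists every partition through the engine and pairs each partition with its files.
    static Map<String, List<String>> listFilesPerPartition(
            EngineContext context,
            List<String> partitions,
            Function<String, List<String>> listOnePartition) {
        int parallelism = Math.max(1, partitions.size());
        List<List<String>> listings = context.map(partitions, listOnePartition, parallelism);
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (int i = 0; i < partitions.size(); i++) {
            result.put(partitions.get(i), listings.get(i));
        }
        return result;
    }
}
```

      Because the listing code only depends on the engine interface, a Spark-backed or plain-Java engine could be swapped in without touching the listing logic, which is the point of moving the abstraction into a common module.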

       

      Change #3

      There are places where we call `fs.listStatus()` directly. We should make them go through the `HoodieTable.getMetadata()...` route as well. Essentially, all listing should be concentrated in `FileSystemBackedTableMetadata`.
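
      To illustrate the direction of Change #3, here is a small sketch in which a caller lists files only through a metadata interface, and the single file-system-backed implementation is the one place that touches storage. The interface, the class names, and the use of `java.nio.file` in place of Hadoop's `FileSystem` are all assumptions for illustration, not Hudi code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch; java.nio.file stands in for the Hadoop FileSystem API.
final class ListingRouteSketch {

    // Stand-in for the metadata interface that callers should go through.
    interface TableMetadata {
        List<String> getAllFilesInPartition(String partitionPath);
    }

    // The only class that lists storage directly.
    static final class FileSystemBacked implements TableMetadata {
        private final Path basePath;

        FileSystemBacked(Path basePath) {
            this.basePath = basePath;
        }

        @Override
        public List<String> getAllFilesInPartition(String partitionPath) {
            try (Stream<Path> entries = Files.list(basePath.resolve(partitionPath))) {
                return entries
                    .filter(Files::isRegularFile)
                    .map(p -> p.getFileName().toString())
                    .sorted()
                    .collect(Collectors.toList());
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // A caller that previously listed directories itself now depends only on the interface.
    static int countFiles(TableMetadata metadata, List<String> partitions) {
        int total = 0;
        for (String partition : partitions) {
            total += metadata.getAllFilesInPartition(partition).size();
        }
        return total;
    }
}
```

      Keeping callers on the interface means a metadata-table-backed implementation can later replace direct listing without changing any call site.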

        Attachments

        1. image-2021-01-05-10-00-35-187.png
          181 kB
          Vinoth Chandar

              People

              • Assignee:
                uditme Udit Mehrotra
                Reporter:
                vinoth Vinoth Chandar