Apache Hudi / HUDI-1479

Replace FSUtils.getAllPartitionPaths() with HoodieTableMetadata#getAllPartitionPaths()


Details

    Description

      Change #1

      public static List<String> getAllPartitionPaths(FileSystem fs, String basePathStr, boolean useFileListingFromMetadata,
                                                      boolean verifyListings, boolean assumeDatePartitioning) throws IOException {
        if (assumeDatePartitioning) {
          return getAllPartitionFoldersThreeLevelsDown(fs, basePathStr);
        } else {
          HoodieTableMetadata tableMetadata = HoodieTableMetadata.create(fs.getConf(), basePathStr, "/tmp/", useFileListingFromMetadata,
              verifyListings, false, false);
          return tableMetadata.getAllPartitionPaths();
        }
      }
      

      The above is the current implementation, where `HoodieTableMetadata.create()` always creates a `HoodieBackedTableMetadata`. Instead, we should create a `FileSystemBackedTableMetadata` when `useFileListingFromMetadata` is false. This helps address https://github.com/apache/hudi/pull/2398/files#r550709687
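
A minimal sketch of the dispatch proposed in Change #1, under stated assumptions: `TableMetadata`, `MetadataBacked`, `FileSystemBacked` and the two-argument `create` below are hypothetical stand-ins for illustration, not Hudi's actual classes or signatures, and the listing logic itself is omitted.

```java
import java.util.Collections;
import java.util.List;

public class MetadataFactorySketch {

  /** Hypothetical stand-in for HoodieTableMetadata. */
  interface TableMetadata {
    List<String> getAllPartitionPaths();
  }

  /** Stand-in for the implementation that reads listings from the metadata table. */
  static final class MetadataBacked implements TableMetadata {
    final String basePath;

    MetadataBacked(String basePath) {
      this.basePath = basePath;
    }

    @Override
    public List<String> getAllPartitionPaths() {
      return Collections.emptyList(); // lookup omitted in this sketch
    }
  }

  /** Stand-in for the implementation that lists the file system directly. */
  static final class FileSystemBacked implements TableMetadata {
    final String basePath;

    FileSystemBacked(String basePath) {
      this.basePath = basePath;
    }

    @Override
    public List<String> getAllPartitionPaths() {
      return Collections.emptyList(); // listing omitted in this sketch
    }
  }

  /** Picks the implementation from the flag instead of always building the metadata-backed one. */
  static TableMetadata create(String basePath, boolean useFileListingFromMetadata) {
    return useFileListingFromMetadata
        ? new MetadataBacked(basePath)
        : new FileSystemBacked(basePath);
  }

  public static void main(String[] args) {
    System.out.println(create("/tmp/table", false).getClass().getSimpleName());
    System.out.println(create("/tmp/table", true).getClass().getSimpleName());
  }
}
```

With this shape, callers that pass `useFileListingFromMetadata == false` never pay for constructing the metadata-backed reader.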

      Change #2

      On master, we have the `HoodieEngineContext` abstraction, which allows for parallel execution. We should consider moving it to `hudi-common` (it's doable) and then reworking `FileSystemBackedTableMetadata` so that it performs parallelized listings using the passed-in engine context, either `HoodieSparkEngineContext` or `HoodieJavaEngineContext`. `HoodieBackedTableMetadata#getPartitionsToFilesMapping` already has some parallelized code; we should take a pass over it and see whether it can be reworked a bit as well. Food for thought: https://github.com/apache/hudi/pull/2398#discussion_r550711216
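
One way the engine-driven listing in Change #2 could look, as a rough sketch only. `EngineContext` and `LocalEngineContext` are hypothetical, simplified stand-ins for `HoodieEngineContext` and its implementations; the sketch uses `java.nio.file` on a local directory rather than Hadoop's `FileSystem`, and it assumes a partition is identified by a `.hoodie_partition_metadata` marker file.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class ParallelListingSketch {

  /** Hypothetical, simplified stand-in for HoodieEngineContext. */
  interface EngineContext {
    <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism);
  }

  /** Local engine: uses a parallel stream and ignores the parallelism hint. */
  static final class LocalEngineContext implements EngineContext {
    @Override
    public <I, O> List<O> map(List<I> data, Function<I, O> func, int parallelism) {
      return data.parallelStream().map(func).collect(Collectors.toList());
    }
  }

  // Assumed name of the marker file that identifies a partition directory.
  static final String PARTITION_METAFILE = ".hoodie_partition_metadata";

  /** Lists the immediate sub-directories of one directory. */
  static List<Path> listSubDirs(Path dir) {
    try (Stream<Path> children = Files.list(dir)) {
      return children.filter(Files::isDirectory).collect(Collectors.toList());
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  /** Walks the table level by level; each level is listed through the engine. */
  static List<String> getAllPartitionPaths(EngineContext engine, Path basePath, int parallelism) {
    List<String> partitions = new ArrayList<>();
    List<Path> toList = Collections.singletonList(basePath);
    while (!toList.isEmpty()) {
      List<List<Path>> listed = engine.map(toList, ParallelListingSketch::listSubDirs, parallelism);
      List<Path> next = new ArrayList<>();
      for (List<Path> dirs : listed) {
        for (Path dir : dirs) {
          if (Files.exists(dir.resolve(PARTITION_METAFILE))) {
            // Record the partition path relative to the base path, with '/' separators.
            partitions.add(basePath.relativize(dir).toString().replace(File.separatorChar, '/'));
          } else if (!dir.getFileName().toString().startsWith(".")) {
            next.add(dir); // not a partition yet: descend further, skipping hidden folders
          }
        }
      }
      toList = next;
    }
    Collections.sort(partitions);
    return partitions;
  }

  /** Builds a small throwaway table layout under a temp directory. */
  static Path createDemoTable() {
    try {
      Path base = Files.createTempDirectory("listing-sketch");
      Files.createDirectories(base.resolve(".hoodie"));
      for (String partition : new String[] {"2020/01/01", "2020/01/02"}) {
        Path dir = Files.createDirectories(base.resolve(partition));
        Files.createFile(dir.resolve(PARTITION_METAFILE));
      }
      return base;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(getAllPartitionPaths(new LocalEngineContext(), createDemoTable(), 2));
  }
}
```

Because the listing depends only on the `EngineContext` interface, a Spark-backed engine could be swapped in without touching the traversal.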

       

      Change #3

      There are places where we call `fs.listStatus()` directly. We should make them go through the `HoodieTable.getMetadata()...` route as well. Essentially, all listing should be concentrated in `FileSystemBackedTableMetadata`.
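
To illustrate the intent of Change #3, here is a small sketch in which a call site asks a metadata abstraction for files instead of listing the directory itself. The names (`TableMetadata`, `FileSystemBacked`, `countParquetFiles`) are hypothetical, and `java.nio.file` on a local directory stands in for Hadoop's `FileSystem`; this is not Hudi's actual API.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class CentralizedListingSketch {

  /** Hypothetical stand-in for the listing part of HoodieTableMetadata. */
  interface TableMetadata {
    List<Path> getAllFilesInPartition(Path partitionPath);
  }

  /** The only class in this sketch that touches the file system directly. */
  static final class FileSystemBacked implements TableMetadata {
    @Override
    public List<Path> getAllFilesInPartition(Path partitionPath) {
      try (Stream<Path> children = Files.list(partitionPath)) {
        return children.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
  }

  /** A call site: it goes through the metadata abstraction rather than listing on its own. */
  static long countParquetFiles(TableMetadata metadata, Path partitionPath) {
    return metadata.getAllFilesInPartition(partitionPath).stream()
        .filter(p -> p.getFileName().toString().endsWith(".parquet"))
        .count();
  }

  /** Builds a throwaway partition directory with two data files and a marker file. */
  static Path createDemoPartition() {
    try {
      Path dir = Files.createTempDirectory("partition-sketch");
      Files.createFile(dir.resolve("a.parquet"));
      Files.createFile(dir.resolve("b.parquet"));
      Files.createFile(dir.resolve(".hoodie_partition_metadata"));
      return dir;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(countParquetFiles(new FileSystemBacked(), createDemoPartition()));
  }
}
```

Keeping the direct file system access in one class means a different `TableMetadata` implementation can later serve the same call sites unchanged.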

          People

            uditme Udit Mehrotra
            vinoth Vinoth Chandar