  1. Apache Hudi
  2. HUDI-1697

A parallel scan is needed for the file system (FS) view.

Details

    Description

      I am running Hudi with GCS as the storage backend. Updating the file system view for several hundred partitions takes far too long. I think the partitions could be scanned in parallel, which should speed up the process significantly.
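
      To illustrate the idea, here is a minimal sketch of listing partitions concurrently with a fixed thread pool. It is not Hudi code: the class `ParallelPartitionScan`, the method `listAll`, and the `lister` function are hypothetical names, and the lister stands in for whatever per-partition storage call the file system view makes.

      ```java
      import java.util.ArrayList;
      import java.util.LinkedHashMap;
      import java.util.List;
      import java.util.Map;
      import java.util.concurrent.ExecutionException;
      import java.util.concurrent.ExecutorService;
      import java.util.concurrent.Executors;
      import java.util.concurrent.Future;
      import java.util.function.Function;

      class ParallelPartitionScan {

          // Lists every partition concurrently. `lister` is a hypothetical stand-in
          // for a per-partition storage call (for example a directory listing on GCS).
          static Map<String, List<String>> listAll(
                  List<String> partitions,
                  Function<String, List<String>> lister,
                  int parallelism) throws InterruptedException, ExecutionException {
              ExecutorService pool = Executors.newFixedThreadPool(parallelism);
              try {
                  // Submit one task per partition so the storage round trips overlap.
                  List<Future<List<String>>> futures = new ArrayList<>();
                  for (String partition : partitions) {
                      futures.add(pool.submit(() -> lister.apply(partition)));
                  }
                  // Collect the results in the original partition order.
                  Map<String, List<String>> filesByPartition = new LinkedHashMap<>();
                  for (int i = 0; i < partitions.size(); i++) {
                      filesByPartition.put(partitions.get(i), futures.get(i).get());
                  }
                  return filesByPartition;
              } finally {
                  pool.shutdown();
              }
          }

          public static void main(String[] args) throws Exception {
              List<String> partitions = List.of("2020/05/12", "2020/03/25", "2020/10/15");
              // Simulated lister: sleeps to mimic a remote call, then returns a made-up file name.
              Function<String, List<String>> fakeLister = partition -> {
                  try {
                      Thread.sleep(50);
                  } catch (InterruptedException e) {
                      Thread.currentThread().interrupt();
                  }
                  return List.of(partition + "/file-0.parquet");
              };
              Map<String, List<String>> result = listAll(partitions, fakeLister, 3);
              result.forEach((partition, files) -> System.out.println(partition + " -> " + files));
          }
      }
      ```

      With a pool this size the three simulated listings overlap instead of running back to back. A suitable degree of parallelism for a real object store would depend on its request limits.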

      Here is a short excerpt from the logs that shows the slow processing. The full log is much longer, and the whole run takes several minutes to complete.

      ```
      21/03/16 20:02:56 INFO AbstractTableFileSystemView: #files found in partition (2020/05/12) =66, Time taken =45
      21/03/16 20:02:56 INFO HoodieTableFileSystemView: Adding file-groups for partition :2020/05/12, #FileGroups=22
      21/03/16 20:02:56 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=66, NumFileGroups=22, FileGroupsCreationTime=3, StoreTimeTaken=1
      21/03/16 20:02:56 INFO AbstractTableFileSystemView: Time to load partition (2020/05/12) =76
      21/03/16 20:02:56 INFO AbstractTableFileSystemView: Took 1 ms to read 0 instants, 0 replaced file groups
      21/03/16 20:02:56 INFO ClusteringUtils: Found 0 files in pending clustering operations
      21/03/16 20:02:56 INFO AbstractTableFileSystemView: Building file system view for partition (2020/03/25)
      21/03/16 20:02:56 INFO AbstractTableFileSystemView: #files found in partition (2020/03/25) =36, Time taken =36
      21/03/16 20:02:56 INFO HoodieTableFileSystemView: Adding file-groups for partition :2020/03/25, #FileGroups=12
      21/03/16 20:02:56 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=36, NumFileGroups=12, FileGroupsCreationTime=1, StoreTimeTaken=1
      21/03/16 20:02:56 INFO AbstractTableFileSystemView: Time to load partition (2020/03/25) =62
      21/03/16 20:02:56 INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
      21/03/16 20:02:56 INFO ClusteringUtils: Found 0 files in pending clustering operations
      21/03/16 20:02:56 INFO AbstractTableFileSystemView: Building file system view for partition (2020/10/15)
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: #files found in partition (2020/10/15) =201, Time taken =100
      21/03/16 20:02:57 INFO HoodieTableFileSystemView: Adding file-groups for partition :2020/10/15, #FileGroups=128
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=201, NumFileGroups=128, FileGroupsCreationTime=6, StoreTimeTaken=1
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: Time to load partition (2020/10/15) =148
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
      21/03/16 20:02:57 INFO ClusteringUtils: Found 0 files in pending clustering operations
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: Building file system view for partition (2021/01/11)
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: #files found in partition (2021/01/11) =311, Time taken =71
      21/03/16 20:02:57 INFO HoodieTableFileSystemView: Adding file-groups for partition :2021/01/11, #FileGroups=302
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=311, NumFileGroups=302, FileGroupsCreationTime=9, StoreTimeTaken=1
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: Time to load partition (2021/01/11) =110
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
      21/03/16 20:02:57 INFO ClusteringUtils: Found 0 files in pending clustering operations
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: Building file system view for partition (2019/07/08)
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: #files found in partition (2019/07/08) =2, Time taken =40
      21/03/16 20:02:57 INFO HoodieTableFileSystemView: Adding file-groups for partition :2019/07/08, #FileGroups=1
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: addFilesToView: NumFiles=2, NumFileGroups=1, FileGroupsCreationTime=0, StoreTimeTaken=1
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: Time to load partition (2019/07/08) =63
      21/03/16 20:02:57 INFO AbstractTableFileSystemView: Took 0 ms to read 0 instants, 0 replaced file groups
      21/03/16 20:02:57 INFO ClusteringUtils: Found 0 files in pending clustering operations
      ```
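
      As a rough way to quantify the excerpt above, the sketch below sums the `Time to load partition` values from the five matching log lines. The class `PartitionLoadTimes` and the method `totalLoadTime` are hypothetical helpers. The unit is presumably milliseconds, which the log does not state on these lines.

      ```java
      import java.util.List;
      import java.util.regex.Matcher;
      import java.util.regex.Pattern;

      class PartitionLoadTimes {

          // Matches lines such as "Time to load partition (2020/05/12) =76".
          private static final Pattern LOAD_TIME =
                  Pattern.compile("Time to load partition \\((.+?)\\) =(\\d+)");

          // Sums the per-partition load times found in the given log lines.
          static long totalLoadTime(List<String> logLines) {
              long total = 0;
              for (String line : logLines) {
                  Matcher m = LOAD_TIME.matcher(line);
                  if (m.find()) {
                      total += Long.parseLong(m.group(2));
                  }
              }
              return total;
          }

          public static void main(String[] args) {
              // The five "Time to load partition" lines copied from the excerpt.
              List<String> lines = List.of(
                  "21/03/16 20:02:56 INFO AbstractTableFileSystemView: Time to load partition (2020/05/12) =76",
                  "21/03/16 20:02:56 INFO AbstractTableFileSystemView: Time to load partition (2020/03/25) =62",
                  "21/03/16 20:02:57 INFO AbstractTableFileSystemView: Time to load partition (2020/10/15) =148",
                  "21/03/16 20:02:57 INFO AbstractTableFileSystemView: Time to load partition (2021/01/11) =110",
                  "21/03/16 20:02:57 INFO AbstractTableFileSystemView: Time to load partition (2019/07/08) =63");
              long total = totalLoadTime(lines);
              System.out.println("total=" + total + ", partitions=" + lines.size());
          }
      }
      ```

      The five partitions add up to 459, or roughly 92 per partition. If that average held and partitions were loaded one at a time, several hundred partitions would cost tens of seconds in partition loading alone.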

          People

            Assignee: Unassigned
            Reporter: Volodymyr Burenin (vburenin)