  1. Apache Hudi
  2. HUDI-1469

Faster initialization for larger datasets

Details

    • Type: Task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • 0.7.0
    • None

    Description

      For very large tables (200+ partitions and 100K+ files), the current initialization code in HoodieBackedTableMetadataWriter is slow because it lists all partitions and files sequentially.

      The code is also inefficient because it lists each directory twice: first to get the list of partitions and later to get the list of files. Both can be collected in a single listing pass.
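
      The idea can be illustrated with a minimal sketch. It is not the code in HoodieBackedTableMetadataWriter: it uses java.nio.file on a local filesystem instead of the Hadoop FileSystem API, and the class SinglePassListing and its methods are hypothetical names chosen for illustration. It also assumes that any directory directly containing at least one file counts as a partition; a real table would additionally need to skip metadata directories. Each directory is listed exactly once, and all directories at the same depth are listed in parallel.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Hudi.
class SinglePassListing {

    /** Outcome of listing one directory exactly once. */
    static final class DirListing {
        final Path dir;
        final List<Path> files = new ArrayList<>();
        final List<Path> subDirs = new ArrayList<>();

        DirListing(Path dir) {
            this.dir = dir;
        }
    }

    /** Lists a directory once and splits its children into files and subdirectories. */
    static DirListing listOnce(Path dir) {
        DirListing result = new DirListing(dir);
        try (Stream<Path> children = Files.list(dir)) {
            children.forEach(child -> {
                if (Files.isDirectory(child)) {
                    result.subDirs.add(child);
                } else {
                    result.files.add(child);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return result;
    }

    /**
     * Walks the table level by level. Directories at the same depth are listed
     * in parallel, and the single listing of each directory yields both its
     * files and the subdirectories to visit next.
     * Assumption: a directory that directly contains files is a partition.
     */
    static Map<Path, List<Path>> listPartitions(Path basePath) {
        Map<Path, List<Path>> partitionToFiles = new HashMap<>();
        List<Path> pending = new ArrayList<>();
        pending.add(basePath);

        while (!pending.isEmpty()) {
            // Parallel listing of the current level; results are merged sequentially below.
            List<DirListing> listings = pending.parallelStream()
                    .map(SinglePassListing::listOnce)
                    .collect(Collectors.toList());

            List<Path> next = new ArrayList<>();
            for (DirListing listing : listings) {
                if (!listing.files.isEmpty()) {
                    // Key partitions by their path relative to the table base path.
                    partitionToFiles.put(basePath.relativize(listing.dir), listing.files);
                }
                next.addAll(listing.subDirs);
            }
            pending = next;
        }
        return partitionToFiles;
    }
}
```

      Calling SinglePassListing.listPartitions(basePath) returns a map from each relative partition path to the files found in it. The level-by-level walk keeps the merge step single-threaded, so only the directory listings themselves run concurrently.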

            People

              Assignee: Prashant Wason
              Reporter: Prashant Wason
                Time Tracking

                  Estimated: 4h
                  Remaining: 4h
                  Logged: Not Specified