Apache Hudi / HUDI-3069

Improve HoodieMergedLogRecordScanner to avoid putting unnecessary Hoodie records

Details

    Description

      I found that when the compaction plan is generated, the delta log files under each file group are arranged in the natural (ascending) order of instant time. In the majority of cases, the latest data is in the latest delta log file. If we instead sort the log files by instant time from largest to smallest, the newest version of a record is usually read first, which largely avoids rewriting records during compaction and so reduces compaction time.
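The proposed ordering can be sketched as follows. This is a minimal illustration, not Hudi code: `LogFileSketch` and `LogFileOrdering` are hypothetical names, and the sketch assumes instant times are fixed-width timestamp strings, so that lexicographic order matches chronological order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for a delta log file; this is NOT Hudi's own log file class.
final class LogFileSketch {
    final String instantTime; // assumed fixed-width timestamp, e.g. "20211220103000"
    final int version;        // assumed log version within the same instant

    LogFileSketch(String instantTime, int version) {
        this.instantTime = instantTime;
        this.version = version;
    }
}

final class LogFileOrdering {
    // Returns a new list sorted newest first: descending instant time,
    // then descending version for files that share an instant time.
    static List<LogFileSketch> newestFirst(List<LogFileSketch> files) {
        List<LogFileSketch> sorted = new ArrayList<>(files);
        sorted.sort(Comparator
                .comparing((LogFileSketch f) -> f.instantTime)
                .thenComparingInt(f -> f.version)
                .reversed());
        return sorted;
    }
}
```

The input list is left untouched, so a caller could still iterate the files in their natural order where that is required.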

      In addition, when reading a delta log file, we compare the record already held in the ExternalSpillableMap with the incoming delta log record. If the old record is selected, there is no need to write it back into the ExternalSpillableMap. Rewriting it wastes a lot of resources once the data has spilled to disk.
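The skip-the-put idea can be illustrated with a small sketch. All names here (`RecordSketch`, `MergedScannerSketch`, `processNextRecord`, `preCombine`) are hypothetical and only model the behaviour described above. A plain `HashMap` stands in for the spillable map, and the merge rule (higher ordering value wins, ties go to the incoming record) is an assumption made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical record type; not a Hudi class.
final class RecordSketch {
    final String key;
    final long orderingValue;
    final String payload;

    RecordSketch(String key, long orderingValue, String payload) {
        this.key = key;
        this.orderingValue = orderingValue;
        this.payload = payload;
    }

    // Assumed merge rule: the higher ordering value wins; ties go to the incoming record.
    RecordSketch preCombine(RecordSketch incoming) {
        return incoming.orderingValue >= this.orderingValue ? incoming : this;
    }
}

final class MergedScannerSketch {
    // Stand-in for the spillable map; a real put may serialize the record to disk.
    private final Map<String, RecordSketch> records = new HashMap<>();
    int puts = 0; // counts writes into the map, for illustration only

    void processNextRecord(RecordSketch incoming) {
        RecordSketch old = records.get(incoming.key);
        if (old == null) {
            records.put(incoming.key, incoming);
            puts++;
            return;
        }
        RecordSketch winner = old.preCombine(incoming);
        // Only write when the merge result differs from what the map already holds.
        if (winner != old) {
            records.put(incoming.key, winner);
            puts++;
        }
    }

    RecordSketch get(String key) {
        return records.get(key);
    }
}
```

With this rule, feeding three versions of one key newest first results in a single put, while feeding them oldest first results in three, which is why the two changes in this issue complement each other.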

       

          People

            Unassigned
            sucx scx