Apache Hudi / HUDI-6990

Configurable clustering task parallelism


Details

    Description

      When Spark executes a clustering job, it reads the clustering plan, which contains multiple groups, and each group processes many base files or log files. When the parameter `hoodie.clustering.plan.strategy.sort.columns` is configured, those files are read through Spark's parallelize method, and every file read generates one subtask. This is unreasonable: the number of subtasks is tied to the number of files instead of being a configurable parallelism.
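      The effect described above can be sketched without Spark: `parallelize(items, numSlices)` splits the input into `numSlices` partitions, each becoming one subtask. The slicing helper below mimics that behavior; the class name, the helper, and the cap value `maxParallelism` are illustrative assumptions, not actual Hudi or Spark code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: split a clustering group's files into read slices,
// mimicking how Spark's parallelize(files, numSlices) turns each slice
// into one subtask. A configurable cap bounds the subtask count instead
// of producing one subtask per file.
public class ClusteringParallelismSketch {
    // Evenly partition `items` into at most `numSlices` non-empty slices,
    // the same index arithmetic Spark uses for range slicing.
    static <T> List<List<T>> slice(List<T> items, int numSlices) {
        List<List<T>> slices = new ArrayList<>();
        int n = items.size();
        for (int i = 0; i < numSlices; i++) {
            int start = (int) ((long) i * n / numSlices);
            int end = (int) ((long) (i + 1) * n / numSlices);
            if (start < end) {
                slices.add(new ArrayList<>(items.subList(start, end)));
            }
        }
        return slices;
    }

    public static void main(String[] args) {
        List<String> files = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            files.add("file_" + i);
        }

        // Before: one subtask per file, so 100 files -> 100 subtasks.
        int before = slice(files, files.size()).size();

        // After: a configurable cap (value is an assumption for the sketch).
        int maxParallelism = 8;
        int after = slice(files, Math.min(maxParallelism, files.size())).size();

        System.out.println(before + " " + after);
    }
}
```

      With 100 files, the first call yields 100 subtasks and the capped call yields 8, which is the behavioral change the before/after screenshots attached below illustrate.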

      Attachments

        1. after-subtasks.png
          83 kB
          Askwang
        2. before-subtasks.png
          57 kB
          Askwang


            People

              Assignee: Unassigned
              Reporter: Askwang (ksmou)
              Votes: 0
              Watchers: 2
