Apache Hudi / HUDI-7906

Improve parallelism deduction in RDD write

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • None
    • 0.16.0, 1.0.0
    • None

    Description

      As https://github.com/apache/hudi/issues/11274 and https://github.com/apache/hudi/pull/11463 describe, there are two problems:

      1. If the RDD is an input RDD without a shuffle, the partition number can be too large or too small.
      2. Users cannot easily control it.
        1. In some cases, users can change it by setting `spark.default.parallelism`.
        2. In other cases, users cannot change it because the value is hard-coded.
        3. In Spark, the preferred way to control it is through `spark.default.parallelism` or `spark.sql.shuffle.partitions`; any other setting is an advanced option in Hudi.
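      The fallback order described above can be illustrated with a minimal sketch. It is not Hudi's actual implementation: the function `deduce_parallelism` and its parameters are hypothetical names chosen for illustration, and only the two Spark configuration keys come from the issue text. It assumes an explicit (advanced) override wins, then the standard Spark settings, and only then the partition count of the input RDD.

```python
from typing import Mapping, Optional


def deduce_parallelism(
    input_partitions: int,
    conf: Mapping[str, str],
    explicit: Optional[int] = None,
) -> int:
    """Hypothetical policy for picking write parallelism.

    Order of precedence (an assumption for illustration only):
      1. an explicit, advanced override supplied by the user
      2. spark.default.parallelism, then spark.sql.shuffle.partitions
      3. the partition count of the input RDD
    """
    # 1. An explicit positive override always wins.
    if explicit is not None and explicit > 0:
        return explicit

    # 2. Fall back to the standard Spark settings when they are set.
    for key in ("spark.default.parallelism", "spark.sql.shuffle.partitions"):
        value = conf.get(key)
        if value is not None and int(value) > 0:
            return int(value)

    # 3. Otherwise inherit the input RDD's partition count (at least 1).
    return max(1, input_partitions)
```

      With a policy like this, an input RDD with 10000 partitions and `spark.default.parallelism` set to 200 would be written with 200 partitions instead of inheriting the oversized input count.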

          People

            Assignee: KnightChess
            Reporter: KnightChess
            Votes: 0
            Watchers: 2
