
Details

    • Sub-task
    • Status: Closed
    • Major
    • Resolution: Not A Problem

    Description

      It will be useful for Apache Beam if we support the following SqlHint:

      SELECT * FROM t
      GROUP BY t.key_column /* + hot_key(key1=fanout_factor, ...) */

      The hot key strategy applies to aggregations. It supplies a list of hot keys for a column, each with a fanout factor. The fanout factor says how many partitions should be created for that specific key, so that each partition can be aggregated separately before a final aggregate combines the partial results. For example:

      SELECT * FROM t
      GROUP BY t.key_column /* + hot_key("value1"=2) */

      // In key_column, the value "value1" appears so many times that it is hot; please consider splitting it into two partitions and processing them separately.

      This problem is common in big data processing, where a hot key overloads one machine, and that slowest machine either slows down the whole pipeline or triggers retries. A common remedy is to split the data for the hot key across multiple partitions, aggregate each partition, and then run a final combine.
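
      The split, per-partition aggregate, and final combine described above can be sketched in plain Python. This is only an illustration under stated assumptions, not Calcite or Beam code: the function name `two_stage_sum`, the `hot_keys` mapping, the random shard assignment, and the choice of SUM as the aggregate are all hypothetical.

```python
import random
from collections import defaultdict


def two_stage_sum(rows, hot_keys, seed=0):
    """Sum values per key, spreading each hot key over several shards.

    rows: iterable of (key, value) pairs.
    hot_keys: dict mapping a hot key to its fanout factor, e.g. {"value1": 2}.
    Returns (partial, final): the per-(key, shard) subtotals and the per-key totals.
    """
    rng = random.Random(seed)

    # Stage 1: partial aggregate per (key, shard).
    # Keys that are not listed as hot get a fanout of 1, so they always use shard 0.
    partial = defaultdict(int)
    for key, value in rows:
        fanout = hot_keys.get(key, 1)
        shard = rng.randrange(fanout)
        partial[(key, shard)] += value

    # Stage 2: final combine, merging the shards of each key.
    final = defaultdict(int)
    for (key, _shard), subtotal in partial.items():
        final[key] += subtotal

    return dict(partial), dict(final)


# Hypothetical input: "value1" is the hot key and gets a fanout factor of 2.
rows = [("value1", 1)] * 6 + [("other", 5)]
partial, final = two_stage_sum(rows, {"value1": 2})
```

      In a distributed engine, each (key, shard) pair could be aggregated on a different worker, so no single worker has to process every row of the hot key. The approach only applies to aggregates whose partial results can be merged, such as SUM, COUNT, MIN, and MAX.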

      The execution engine usually does not know which keys are hot. SqlHint provides a good way to tell the engine which keys are hot so that it can handle them accordingly.

          People

            amaliujia Rui Wang
            amaliujia Rui Wang
            Votes: 0
            Watchers: 4


              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 0.5h
