Simple scheduler: Reservoir sampling doesn't provide enough randomization

    Details

    • Docs Text:
      This change introduces the replica_preference query option, which needs to be added to the docs.

      Description

      It appears that with random_non_cached_tiebreak enabled, the reservoir sampling doesn't provide enough randomization, which can result in hot spots.
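For reference, size-one reservoir sampling picks one candidate from a sequence in a single pass, replacing the current choice with probability 1/n at the n-th candidate. The sketch below is a generic illustration of that technique and of how per-host assignment counts could be inspected for skew. It is not Impala's scheduler code; `pick_replica`, `assignment_counts`, and the host names are hypothetical.

```python
import random
from collections import Counter


def pick_replica(replicas, rng):
    """Pick one replica uniformly using size-one reservoir sampling.

    Hypothetical helper for illustration only, not Impala's implementation.
    Returns None when `replicas` is empty.
    """
    chosen = None
    for seen, replica in enumerate(replicas, start=1):
        # Keep the n-th candidate with probability 1/n.
        if rng.randrange(seen) == 0:
            chosen = replica
    return chosen


def assignment_counts(num_ranges, replicas, rng):
    """Assign `num_ranges` scan ranges and count how many land on each host."""
    counts = Counter()
    for _ in range(num_ranges):
        counts[pick_replica(replicas, rng)] += 1
    return counts


if __name__ == "__main__":
    hosts = ["host-a", "host-b", "host-c"]  # assumed host names
    print(assignment_counts(3000, hosts, random.Random()))
```

A strongly uneven count for one host in such a tally would indicate the kind of hot spot described above.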

          Activity

          lv Lars Volker added a comment -

          IMPALA-2979: Fix scheduling on remote hosts

          Also fixes: IMPALA-2400, IMPALA-3043

          This change fixes scheduling scan-ranges on remote hosts by adding
          remote backend selection capability to SimpleScheduler. Prior to this
          change the scheduler would try to select a local backend even when
          remote scheduling was requested.

          This change also allows pseudo-randomized remote backend selection to
          prevent convoying, which could happen when different independent
          schedulers had the same internal state, e.g. after a cluster restart. To
          enable the new behavior set the query option SCHEDULE_RANDOM_REPLICA to
          true.

          This change also fixes IMPALA-2400: Unpredictable locality behavior
          for reading Parquet files

          This change also fixes IMPALA-3043: SimpleScheduler does not handle
          hosts with multiple IP addresses correctly

          This change also does some clean-up in scheduler.h and
          simple-scheduler.{h,cc}.

          Change-Id: I044f83806fcde820fcb38047cf6b8e780d803858
          Reviewed-on: http://gerrit.cloudera.org:8080/3771
          Reviewed-by: Lars Volker <lv@cloudera.com>
          Reviewed-by: Sailesh Mukil <sailesh@cloudera.com>
          Tested-by: Internal Jenkins
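The convoying effect mentioned in the commit message can be illustrated with a toy model. The sketch below assumes a simple round-robin scheduler whose only internal state is a start index; that model, the function names, and the host names are hypothetical and not taken from SimpleScheduler. It only shows why independent schedulers with identical state pick the same backends in lockstep, while independently seeded random selection tends to spread the picks.

```python
import random


def round_robin_picks(backends, num_picks, start=0):
    """Deterministic picks from a scheduler whose state is just `start`."""
    return [backends[(start + i) % len(backends)] for i in range(num_picks)]


def random_picks(backends, num_picks, rng):
    """Pseudo-randomized picks driven by the scheduler's own generator."""
    return [rng.choice(backends) for _ in range(num_picks)]


def convoy_fraction(picks_a, picks_b):
    """Fraction of steps at which both schedulers chose the same backend."""
    same = sum(1 for a, b in zip(picks_a, picks_b) if a == b)
    return same / len(picks_a)


if __name__ == "__main__":
    backends = ["host-a", "host-b", "host-c", "host-d"]  # assumed names

    # Two schedulers with identical state (e.g. after a restart) move in lockstep.
    in_step = convoy_fraction(round_robin_picks(backends, 200),
                              round_robin_picks(backends, 200))

    # Independently seeded generators only collide by chance.
    spread = convoy_fraction(random_picks(backends, 200, random.Random(1)),
                             random_picks(backends, 200, random.Random(2)))
    print(in_step, spread)
```

In this model the deterministic schedulers always collide, whereas the randomized ones collide only when their independent draws happen to match.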

          lv Lars Volker added a comment -

          The scheduler needs proper tests to assess the fix for this issue.


            People

            • Assignee:
              lv Lars Volker
              Reporter:
              mmokhtar Mostafa Mokhtar