Hive / HIVE-23365

Put RS deduplication optimization under cost based decision

Details

    Description

      Currently, ReduceSink (RS) deduplication is always applied whenever it is semantically correct. However, it can be beneficial to keep both RS operators in the plan, e.g., when the NDV (number of distinct values) of the second RS is very low. We would therefore like to make this decision cost-based. A simple heuristic should work for most cases without introducing regressions in existing ones: if the NDV of the partition column is less than the estimated parallelism of the second RS, do not apply deduplication.
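
      The heuristic described above could be sketched roughly as follows. This is only an illustration of the decision rule, not the code from the attached patches: the class name `RsDedupHeuristic`, the method `shouldDeduplicate`, and its parameters are hypothetical, and treating a missing NDV estimate as "deduplicate as before" is an assumption made here to preserve the existing behavior.

```java
// Hypothetical sketch of the cost-based guard; none of these names come from Hive itself.
final class RsDedupHeuristic {

    private RsDedupHeuristic() {
        // static helper only
    }

    /**
     * Decides whether two ReduceSink operators should be merged.
     *
     * @param partitionColumnNdv   estimated number of distinct values of the
     *                             partition column; values <= 0 mean "unknown"
     * @param estimatedParallelism estimated number of reducers for the second RS
     * @return true if deduplication should be applied, false if both RS
     *         operators should stay in the plan
     */
    static boolean shouldDeduplicate(long partitionColumnNdv, int estimatedParallelism) {
        // Assumption: without statistics, keep the current behavior
        // (always deduplicate when it is semantically correct).
        if (partitionColumnNdv <= 0) {
            return true;
        }
        // Rule from the description: if the NDV is less than the estimated
        // parallelism, do not deduplicate.
        return partitionColumnNdv >= estimatedParallelism;
    }
}
```

      For instance, with an NDV of 3 and an estimated parallelism of 10 the sketch returns false (keep both RS operators), whereas an NDV of 1000 with the same parallelism returns true.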

      Attachments

        1. HIVE-23365.01.patch
          38 kB
          Stamatis Zampetakis
        2. HIVE-23365.02.patch
          38 kB
          Stamatis Zampetakis
        3. HIVE-23365.03.patch
          38 kB
          Stamatis Zampetakis
        4. HIVE-23365.04.patch
          51 kB
          Stamatis Zampetakis
        5. HIVE-23365.05.patch
          51 kB
          Stamatis Zampetakis

            People

              zabetak Stamatis Zampetakis
              jcamachorodriguez Jesus Camacho Rodriguez

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 1h