Hive / HIVE-21037

Replicate column statistics for Hive tables

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: HiveServer2
    • Labels: None

    Description

      Statistics are important for query optimization, so keeping them up to date on the replica matters for query performance. Statistics are collected by scanning a table in its entirety. When data is replicated, we could therefore either (a) recompute the statistics by scanning the data on the replica, or (b) replicate the statistics along with the data. We prefer the second approach for the following reasons.

      1. Scanning the data on the replica wastes CPU cycles and adds load during replication, which can be significant.
      2. Storage systems such as S3 may not have compute capabilities, so when replicating from on-prem to the cloud we cannot rely on the target to gather statistics.
      3. For ACID tables, statistics should be associated with a snapshot. Statistics collection on the target would therefore have to stay in sync with the write IDs on the source, since the target doesn't generate write IDs of its own.

            People

              Assignee: Ashutosh Bapat (ashutosh.bapat)
              Reporter: Ashutosh Bapat (ashutosh.bapat)
              Votes: 0
              Watchers: 3

              Dates

                Created:
                Updated:

                Time Tracking

                  Estimated: 72h
                  Remaining: 72h
                  Logged: 14h 10m