  1. IMPALA
  2. IMPALA-3563

Evaluate using global as opposed to per-partition statistics


Details

    Description

      Impala and other SQL-on-Hadoop solutions use per-partition statistics, which creates a metadata scalability problem that I reckon outweighs the benefit of having more accurate statistics.

      The proposal for a partitioned table is:

      • "Compute statistics" computes and stores a per-partition HLL, same as before
      • The catalog merges the HLLs of all partitions and persists the resulting global statistics
      • Impalads never request per-partition statistics, only global statistics
      • The only time the catalog needs to read the per-partition HLLs is when regenerating the global statistics after partitions are added or removed

      In other words, during planning a partitioned table looks very similar to a non-partitioned table.
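
The merge step in the proposal can be sketched as follows. This is a minimal illustration, not Impala's actual catalog code: the register count, the SHA-1 based hash, and the function names (`new_sketch`, `add_value`, `merge_sketches`, `estimate_ndv`) are all assumptions made for the example. It relies only on the standard HyperLogLog property that two sketches built with the same hash and register count merge by taking the register-wise maximum.

```python
import hashlib
import math

NUM_REGISTER_BITS = 10                    # assumption: 2**10 = 1024 registers
NUM_REGISTERS = 1 << NUM_REGISTER_BITS
VALUE_BITS = 64 - NUM_REGISTER_BITS       # hash bits left after the register index


def new_sketch():
    """Return an empty HLL sketch (one small integer per register)."""
    return [0] * NUM_REGISTERS


def add_value(sketch, value):
    """Fold one column value into a (per-partition) sketch."""
    # Hypothetical hash choice: first 64 bits of SHA-1 over the value's string form.
    h = int.from_bytes(hashlib.sha1(str(value).encode()).digest()[:8], "big")
    index = h >> VALUE_BITS                       # top bits select the register
    rest = h & ((1 << VALUE_BITS) - 1)            # remaining bits
    rank = VALUE_BITS - rest.bit_length() + 1     # 1-based position of leftmost 1 bit
    sketch[index] = max(sketch[index], rank)


def merge_sketches(sketches):
    """Merge per-partition sketches into one global sketch (register-wise max)."""
    merged = new_sketch()
    for sketch in sketches:
        merged = [max(a, b) for a, b in zip(merged, sketch)]
    return merged


def estimate_ndv(sketch):
    """Estimate the number of distinct values represented by a sketch."""
    m = NUM_REGISTERS
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in sketch)
    zeros = sketch.count(0)
    if raw <= 2.5 * m and zeros:
        return m * math.log(m / zeros)            # small-range (linear counting) correction
    return raw


# Usage: three partitions with overlapping values, merged into one global estimate.
partition_sketches = []
for values in (range(0, 6000), range(4000, 10000), range(9000, 15000)):
    sketch = new_sketch()
    for v in values:
        add_value(sketch, v)
    partition_sketches.append(sketch)

global_sketch = merge_sketches(partition_sketches)
global_ndv = estimate_ndv(global_sketch)          # approximates 15000 distinct values
```

Because the merge is a maximum, adding a partition only requires folding its sketch into the existing global one, but a partition's contribution cannot be subtracted back out. Removing a partition therefore means re-merging the remaining per-partition sketches, which is why the catalog still needs access to them in that case.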


            People

              Assignee: Unassigned
              Reporter: Mostafa Mokhtar (mmokhtar)
