Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-2649 improve incremental stats scalability
  3. IMPALA-3563

Evaluate using global opposed to per partition statistics

    XMLWordPrintableJSON

    Details

      Description

      Impala and other SQL on Hadoop solutions use per partition statistics which creates a metadata scalability problem which I reckon outweighs benefits of having more accurate statistics.

      This is the proposal is for a partitioned table :

      • "Compute statistics" computes and stores per partition HLL same as before
      • Catalog merges the HLL(s) for all partitions and stores/persists global statistics
      • Impalad(s) never request per partition statics only global stats
      • The only time the catalog needs to read the per partition HLL is when regenerating the global stats as part of adding/removing partitions

      In other words during planning the partitioned table looks very similar to a non-partitioned table.

        Attachments

          Issue Links

            Activity

              People

              • Assignee:
                Unassigned
                Reporter:
                mmokhtar Mostafa Mokhtar
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated: